Introduction to data.frame
From a data science perspective, the most important class of objects is the data frame - Chambers (2020)
- Data ordered in rows and columns - just like a spreadsheet
- Technical implementation in R: a data.frame is a list of vectors, and each vector is one column
- vectors are atomic - each value in a column has the same data type
- different columns can have different data types
df = data.frame(plotID = seq(3),
                soil_ph = c(5.5, 5.4, 6.1),
                soil_temperature = c(10, 11, 12),
                forest_type = c("coniferous", "coniferous", "deciduous"))
class(df)
[1] "data.frame"
df
plotID soil_ph soil_temperature forest_type
1 1 5.5 10 coniferous
2 2 5.4 11 coniferous
3 3 6.1 12 deciduous
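The points in the list above (a data.frame is a list of vectors, one vector per column, each with its own data type) can be checked directly in R. The following is a minimal sketch that rebuilds the same df and inspects it with base R functions only:

```r
# Same data.frame as in the example above
df <- data.frame(plotID = seq(3),
                 soil_ph = c(5.5, 5.4, 6.1),
                 soil_temperature = c(10, 11, 12),
                 forest_type = c("coniferous", "coniferous", "deciduous"))

# A data.frame is a list under the hood
is.list(df)

# Each list element is one column, so length() counts the columns
length(df)

# Every column is a vector with its own data type
sapply(df, class)
```

Because the columns are list elements, length(df) returns the number of columns, not the number of rows.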
When you work with actual data, you will rarely build a data.frame from scratch as in the example above. Instead, you usually have a file on your computer, e.g. a .csv file, that you want to read into R. The next lesson covers external data and how to get them into R.
For now, we simply want a larger data.frame to show and test some functions. Here, I use data about the districts of Muenster that will be part of assignments/Ex04_trees.qmd later in the course.
data = read.csv(file = "data/muenster_districts.csv")
# show the first few rows of the df
head(data)
id district district_group area
1 15 Bahnhof Innenstadtring 363089.2
2 16 Albachten Münster-West 12965147.7
3 17 Angelmodde Münster-Südost 5016879.5
4 18 Kreuz Innenstadtring 1014690.9
5 19 Berg Fidel Münster-Hiltrup 4781065.8
6 20 Düesberg Mitte-Süd 2185161.9
# show the last few rows of the df
tail(data)
id district district_group area
40 40 Amelsbüren Münster-Hiltrup 43373127
41 41 Mecklenbeck Münster-West 6244230
42 42 Uppenberg Mitte-Nordost 3400206
43 43 Nienberge Münster-West 27773029
44 44 Sentrup Münster-West 6627422
45 45 Hiltrup-Mitte Münster-Hiltrup 5984811
# get a short summary of the structure
str(data)
'data.frame': 45 obs. of 4 variables:
$ id : int 15 16 17 18 19 20 21 22 23 24 ...
$ district : chr "Bahnhof" "Albachten" "Angelmodde" "Kreuz" ...
$ district_group: chr "Innenstadtring" "Münster-West" "Münster-Südost" "Innenstadtring" ...
$ area : num 363089 12965148 5016880 1014691 4781066 ...
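Besides head(), tail() and str(), a few other base R functions report the size and column names of a data.frame. The sketch below uses districts, a hypothetical stand-in built from the first five rows and three of the columns shown in the head(data) output, so that it runs without the .csv file:

```r
# Hypothetical stand-in for the districts data (values copied from head(data))
districts <- data.frame(
  id = 15:19,
  district = c("Bahnhof", "Albachten", "Angelmodde", "Kreuz", "Berg Fidel"),
  area = c(363089.2, 12965147.7, 5016879.5, 1014690.9, 4781065.8)
)

# number of rows and columns in one call
dim(districts)

# number of rows and number of columns separately
nrow(districts)
ncol(districts)

# column names
names(districts)
```

The same calls work on the full data object once the file has been read.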
data.frame subsetting
You can think of a data.frame as a two-dimensional object with rows and columns. Subsetting a data.frame hence requires two values, one for the row subset and one for the column subset:
# row 1, column 2
data[1, 2]
[1] "Bahnhof"
# the first row, empty means "everything"
data[1, ]
id district district_group area
1 15 Bahnhof Innenstadtring 363089.2
# the first 3 rows, column 3 and 4
data[seq(3), c(3, 4)]
district_group area
1 Innenstadtring 363089.2
2 Münster-West 12965147.7
3 Münster-Südost 5016879.5
If you want to extract a column of a data.frame, you can use the $ operator. The resulting object is a vector and hence has only one dimension. This is important to recognize, because subsetting a column extracted with the $ operator needs only one value (see example below).
district_area <- data$area
summary(district_area)
Min. 1st Qu. Median Mean 3rd Qu. Max.
164740 760916 3157262 6736010 6399041 43373127
# subsetting a column accessed with $ only needs one value
data$area[5]
[1] 4781066
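To make the one-dimensional nature of an extracted column more concrete, the following sketch compares three ways of reaching the same value. It again uses districts, a hypothetical stand-in built from the first five rows of the head(data) output; the double-bracket form [[ ]] is not used elsewhere in this lesson and is shown only as an equivalent of $:

```r
# Hypothetical stand-in for the districts data (values copied from head(data))
districts <- data.frame(
  id = 15:19,
  district = c("Bahnhof", "Albachten", "Angelmodde", "Kreuz", "Berg Fidel"),
  area = c(363089.2, 12965147.7, 5016879.5, 1014690.9, 4781065.8)
)

# $ and [[ ]] both return the column as a plain vector
area_dollar <- districts$area
area_brackets <- districts[["area"]]
identical(area_dollar, area_brackets)

# a plain vector has a length, but no dim attribute
length(area_dollar)
is.null(dim(area_dollar))

# one index on the vector, two indices on the data.frame: same value
districts$area[5]
districts[5, "area"]
```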
# You can also define new columns with the $
data$city <- "Muenster"
head(data)
id district district_group area city
1 15 Bahnhof Innenstadtring 363089.2 Muenster
2 16 Albachten Münster-West 12965147.7 Muenster
3 17 Angelmodde Münster-Südost 5016879.5 Muenster
4 18 Kreuz Innenstadtring 1014690.9 Muenster
5 19 Berg Fidel Münster-Hiltrup 4781065.8 Muenster
6 20 Düesberg Mitte-Süd 2185161.9 Muenster
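In the example above, the single value "Muenster" is recycled to fill every row. A new column can also be computed from an existing one. The sketch below shows both on districts, a hypothetical stand-in built from the first five rows of the head(data) output. The column name area_km2 is made up for illustration, and the conversion assumes that area is given in square metres, which the data itself does not state:

```r
# Hypothetical stand-in for the districts data (values copied from head(data))
districts <- data.frame(
  id = 15:19,
  district = c("Bahnhof", "Albachten", "Angelmodde", "Kreuz", "Berg Fidel"),
  area = c(363089.2, 12965147.7, 5016879.5, 1014690.9, 4781065.8)
)

# a single value is recycled to all rows
districts$city <- "Muenster"

# a new column computed element-wise from an existing one
# (assumption: area is in square metres, so dividing by 1e6 gives square kilometres)
districts$area_km2 <- districts$area / 1e6

districts
```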
Subsetting a data.frame by column names also works:
# read: from data, rows 1 to 5, and the columns named district and district_group
data[seq(5), c("district", "district_group")]
district district_group
1 Bahnhof Innenstadtring
2 Albachten Münster-West
3 Angelmodde Münster-Südost
4 Kreuz Innenstadtring
5 Berg Fidel Münster-Hiltrup
Logical conditions also work for data.frame subsetting:
# read: from data, only the rows where area is larger than 20 Million, and all the columns
data[data$area > 20000000, ]
id district district_group area city
16 30 Wolbeck Münster-Südost 20706535 Muenster
36 36 Handorf Münster-Ost 30696425 Muenster
38 38 Sprakel Münster-Nord 22417558 Muenster
40 40 Amelsbüren Münster-Hiltrup 43373127 Muenster
43 43 Nienberge Münster-West 27773029 Muenster
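A condition such as data$area > 20000000 is itself a logical vector with one TRUE or FALSE per row, and only the TRUE rows are kept. Conditions can also be combined. The following is a minimal sketch on districts, a hypothetical stand-in built from the first five rows of the head(data) output; the thresholds are chosen only to fit this small example:

```r
# Hypothetical stand-in for the districts data (values copied from head(data))
districts <- data.frame(
  id = 15:19,
  district = c("Bahnhof", "Albachten", "Angelmodde", "Kreuz", "Berg Fidel"),
  area = c(363089.2, 12965147.7, 5016879.5, 1014690.9, 4781065.8)
)

# the condition alone: one logical value per row
is_large <- districts$area > 4000000
is_large

# combine two conditions with & (and): area between 4 and 10 million
medium <- districts[districts$area > 4000000 & districts$area < 10000000, ]
medium

# select rows by matching against a set of values with %in%
inner <- districts[districts$district %in% c("Bahnhof", "Kreuz"), ]
inner
```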
# read: from data, only rows where area is smaller than 500000, and only the column with the name district
data[data$area < 500000, "district"]
[1] "Bahnhof" "Mauritz-West" "Aegidii" "Dom" "Martini"
[6] "Hansaplatz" "Buddenturm" "Überwasser"
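Note that the last result is a character vector, not a data.frame: when a subset with [ , ] selects a single column, R drops the second dimension by default. If a one-column data.frame is wanted instead, the drop = FALSE argument keeps the structure. A minimal sketch on districts, a hypothetical stand-in built from the first five rows of the head(data) output, with a threshold chosen to fit this small example:

```r
# Hypothetical stand-in for the districts data (values copied from head(data))
districts <- data.frame(
  id = 15:19,
  district = c("Bahnhof", "Albachten", "Angelmodde", "Kreuz", "Berg Fidel"),
  area = c(363089.2, 12965147.7, 5016879.5, 1014690.9, 4781065.8)
)

# default: selecting a single column returns a vector
small <- districts[districts$area < 1100000, "district"]
class(small)

# drop = FALSE keeps the result as a data.frame with one column
small_df <- districts[districts$area < 1100000, "district", drop = FALSE]
class(small_df)
small_df
```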