Introduction to data.frame

From a data science perspective, the most important class of objects is the data frame - Chambers (2020)

  • Data ordered in rows and columns - just like a spreadsheet
  • Technical implementation in R:
    • data.frame is a list of vectors
    • each vector is one column
    • vectors are atomic - each value in a column has the same data type
    • different columns can have different data types

A data.frame is a named list of vectors of equal length:
df = data.frame(plotID = seq(3),
                soil_ph = c(5.5, 5.4, 6.1),
                soil_temperature = c(10, 11, 12),
                forest_type = c("coniferous", "coniferous", "deciduous"))

class(df)
[1] "data.frame"
df
  plotID soil_ph soil_temperature forest_type
1      1     5.5               10  coniferous
2      2     5.4               11  coniferous
3      3     6.1               12   deciduous
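Because a data.frame is stored as a list, the usual list tools work on it. The following is a minimal sketch that rebuilds the example above and inspects it with base R functions only; the helper name `col_classes` is introduced here purely for illustration.

```r
# Rebuild the small example data.frame from above
df <- data.frame(plotID = seq(3),
                 soil_ph = c(5.5, 5.4, 6.1),
                 soil_temperature = c(10, 11, 12),
                 forest_type = c("coniferous", "coniferous", "deciduous"))

# A data.frame is stored as a list: one list element per column
is.list(df)

# The names of the list are the column names
names(df)

# Each column is a vector with its own data type
col_classes <- sapply(df, class)
col_classes

# length() counts the list elements (columns), nrow() counts the rows
length(df)
nrow(df)
```

Note that `length(df)` returns the number of columns, not the number of rows, which follows directly from the list structure.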

Once you work with real data, you will rarely build a data.frame from scratch as in the example above. Instead, you usually have a file on your computer, e.g. a .csv file, that you want to read into R. The next lesson covers external data and how to get it into R.

For now, we just need a larger data.frame to demonstrate and test some functions. Here, I use data on the districts of Muenster, which will also be part of assignments/Ex04_trees.qmd later in the course.

data = read.csv(file = "data/muenster_districts.csv")


# show the first few rows of the df
head(data)
  id   district  district_group       area
1 15    Bahnhof  Innenstadtring   363089.2
2 16  Albachten    Münster-West 12965147.7
3 17 Angelmodde  Münster-Südost  5016879.5
4 18      Kreuz  Innenstadtring  1014690.9
5 19 Berg Fidel Münster-Hiltrup  4781065.8
6 20   Düesberg       Mitte-Süd  2185161.9
# show the last few rows of the df
tail(data)
   id      district  district_group     area
40 40    Amelsbüren Münster-Hiltrup 43373127
41 41   Mecklenbeck    Münster-West  6244230
42 42     Uppenberg   Mitte-Nordost  3400206
43 43     Nienberge    Münster-West 27773029
44 44       Sentrup    Münster-West  6627422
45 45 Hiltrup-Mitte Münster-Hiltrup  5984811
# get a short summary of the structure
str(data)
'data.frame':   45 obs. of  4 variables:
 $ id            : int  15 16 17 18 19 20 21 22 23 24 ...
 $ district      : chr  "Bahnhof" "Albachten" "Angelmodde" "Kreuz" ...
 $ district_group: chr  "Innenstadtring" "Münster-West" "Münster-Südost" "Innenstadtring" ...
 $ area          : num  363089 12965148 5016880 1014691 4781066 ...
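Besides head(), tail() and str(), a few more base R functions report the size and column names of a data.frame. The sketch below does not read the .csv file; instead it assumes a small stand-in called `districts` (a hypothetical name), built from the six rows printed by head(data) above.

```r
# Stand-in for the districts data: the six rows shown by head(data)
districts <- data.frame(
  id = 15:20,
  district = c("Bahnhof", "Albachten", "Angelmodde", "Kreuz", "Berg Fidel", "Düesberg"),
  district_group = c("Innenstadtring", "Münster-West", "Münster-Südost",
                     "Innenstadtring", "Münster-Hiltrup", "Mitte-Süd"),
  area = c(363089.2, 12965147.7, 5016879.5, 1014690.9, 4781065.8, 2185161.9)
)

# number of rows and columns in one call
dim(districts)

# the same information separately
nrow(districts)
ncol(districts)

# the column names
names(districts)
```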

data.frame subsetting

You can think of a data.frame as a two-dimensional structure. Subsetting therefore takes two indices inside the square brackets, separated by a comma: first the row subset, then the column subset:

# row 1, column 2
data[1,2]
[1] "Bahnhof"
# the first row; an empty column index means "all columns"
data[1,]
  id district district_group     area
1 15  Bahnhof Innenstadtring 363089.2
# the first 3 rows, column 3 and 4
data[seq(3), c(3,4)]
  district_group       area
1 Innenstadtring   363089.2
2   Münster-West 12965147.7
3 Münster-Südost  5016879.5
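Two details of the `[rows, columns]` syntax are easy to miss: selecting a single column returns a plain vector by default, and negative indices exclude rows or columns. The following sketch illustrates both on a hypothetical stand-in `districts`, built from the six rows printed by head(data) above; the object names `col_vec`, `col_df` and `rest` are made up for this example.

```r
# Stand-in for the districts data: the six rows shown by head(data)
districts <- data.frame(
  id = 15:20,
  district = c("Bahnhof", "Albachten", "Angelmodde", "Kreuz", "Berg Fidel", "Düesberg"),
  district_group = c("Innenstadtring", "Münster-West", "Münster-Südost",
                     "Innenstadtring", "Münster-Hiltrup", "Mitte-Süd"),
  area = c(363089.2, 12965147.7, 5016879.5, 1014690.9, 4781065.8, 2185161.9)
)

# all rows, column 2: the result is simplified to a vector
col_vec <- districts[, 2]
is.data.frame(col_vec)

# drop = FALSE keeps the result as a one-column data.frame
col_df <- districts[, 2, drop = FALSE]
is.data.frame(col_df)

# negative indices exclude: everything except row 1 and column 1
rest <- districts[-1, -1]
dim(rest)
```

Whether you want the vector or the one-column data.frame depends on what the next step expects, so it can help to set `drop` explicitly when a column selection might contain only one column.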

To extract a single column of a data.frame, you can use the $ operator. The resulting object is a vector and hence has only one dimension. This is important to recognize, because subsetting a column extracted with $ needs only one index (see the example below).

district_area <- data$area
summary(district_area)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
  164740   760916  3157262  6736010  6399041 43373127 
# subsetting a column accessed with $ only needs one value
data$area[5]
[1] 4781066
# You can also define new columns with the $
data$city <- "Muenster"
head(data)
  id   district  district_group       area     city
1 15    Bahnhof  Innenstadtring   363089.2 Muenster
2 16  Albachten    Münster-West 12965147.7 Muenster
3 17 Angelmodde  Münster-Südost  5016879.5 Muenster
4 18      Kreuz  Innenstadtring  1014690.9 Muenster
5 19 Berg Fidel Münster-Hiltrup  4781065.8 Muenster
6 20   Düesberg       Mitte-Süd  2185161.9 Muenster
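The double-bracket operator `[[ ]]` extracts a column in the same way as `$`, and it also accepts a column name stored in a variable. The sketch below shows this, together with a new column computed from an existing one. It assumes a hypothetical stand-in `districts`, built from the six rows printed by head(data) above; the column `area_share` and the variable `col_name` are invented for illustration.

```r
# Stand-in for the districts data: the six rows shown by head(data)
districts <- data.frame(
  id = 15:20,
  district = c("Bahnhof", "Albachten", "Angelmodde", "Kreuz", "Berg Fidel", "Düesberg"),
  district_group = c("Innenstadtring", "Münster-West", "Münster-Südost",
                     "Innenstadtring", "Münster-Hiltrup", "Mitte-Süd"),
  area = c(363089.2, 12965147.7, 5016879.5, 1014690.9, 4781065.8, 2185161.9)
)

# $ and [[ ]] return the same vector
area_a <- districts$area
area_b <- districts[["area"]]
identical(area_a, area_b)

# [[ ]] also works with a column name stored in a variable
col_name <- "district"
districts[[col_name]][1:2]

# new column computed from an existing one:
# each district's share of the total area of these six rows
districts$area_share <- districts$area / sum(districts$area)
head(districts)
```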

Subsetting a data.frame by column names also works:

# read: from data, rows 1 to 5, and the columns named district and district_group
data[seq(5), c("district", "district_group")]
    district  district_group
1    Bahnhof  Innenstadtring
2  Albachten    Münster-West
3 Angelmodde  Münster-Südost
4      Kreuz  Innenstadtring
5 Berg Fidel Münster-Hiltrup

Comparison and logical operators also work for data.frame subsetting. The condition returns one TRUE or FALSE per row, and only the rows with TRUE are kept:

# read: from data, only the rows where area is larger than 20 million, and all the columns
data[data$area > 20000000, ]
   id   district  district_group     area     city
16 30    Wolbeck  Münster-Südost 20706535 Muenster
36 36    Handorf     Münster-Ost 30696425 Muenster
38 38    Sprakel    Münster-Nord 22417558 Muenster
40 40 Amelsbüren Münster-Hiltrup 43373127 Muenster
43 43  Nienberge    Münster-West 27773029 Muenster
# read: from data, only rows where area is smaller than 500000, and only the column with the name district
data[data$area < 500000, "district"]
[1] "Bahnhof"      "Mauritz-West" "Aegidii"      "Dom"          "Martini"     
[6] "Hansaplatz"   "Buddenturm"   "Überwasser"  
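Conditions can be combined with `&` (and) and `|` (or), and `%in%` tests membership in a set of values. The following sketch shows both, plus the base R function subset() as an alternative way to write the same kind of filter. It assumes a hypothetical stand-in `districts`, built from the six rows printed by head(data) above; the object names `big_inner`, `sel` and `small` are made up for this example.

```r
# Stand-in for the districts data: the six rows shown by head(data)
districts <- data.frame(
  id = 15:20,
  district = c("Bahnhof", "Albachten", "Angelmodde", "Kreuz", "Berg Fidel", "Düesberg"),
  district_group = c("Innenstadtring", "Münster-West", "Münster-Südost",
                     "Innenstadtring", "Münster-Hiltrup", "Mitte-Süd"),
  area = c(363089.2, 12965147.7, 5016879.5, 1014690.9, 4781065.8, 2185161.9)
)

# two conditions combined with &: both must be TRUE for a row to be kept
big_inner <- districts[districts$area > 1e6 &
                         districts$district_group == "Innenstadtring", ]
big_inner

# %in% keeps rows whose value matches any element of the given vector
sel <- districts[districts$district_group %in% c("Münster-West", "Münster-Südost"),
                 "district"]
sel

# subset() evaluates the condition inside the data.frame,
# so the columns can be written without the districts$ prefix
small <- subset(districts, area < 500000, select = district)
small
```

Unlike the bracket version with a single column name, subset() with `select` returns a data.frame even when only one column is selected.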

References

Chambers, John M. 2020. “S, R, and Data Science.” The R Journal 12 (1): 462. https://doi.org/10.32614/RJ-2020-028.