Introduction to data.frame

From a data science perspective, the most important class of objects is the data frame - Chambers (2020)

  • Data ordered in rows and columns - just like a spreadsheet
  • Technical implementation in R:
    • data.frame is a list of vectors
    • each vector is one column
    • columns are typically atomic vectors - each value in a column has the same data type
    • different columns can have different data types
Figure 1: A data.frame is a named list of vectors
# Example data: one row per plot, one column per measured variable
df <- data.frame(plotID = seq_len(3),
                 soil_ph = c(5.5, 5.4, 6.1),
                 soil_temperature = c(10, 11, 12),
                 forest_type = c("coniferous", "coniferous", "deciduous"))

class(df)
[1] "data.frame"
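
The list-of-vectors structure described above can be checked directly in the console. The following is a minimal sketch that rebuilds the example data so it runs on its own; the class reported for forest_type assumes R 4.0 or newer, where strings are no longer converted to factors by default.

```r
# Same example data as above, rebuilt so the snippet runs on its own
df <- data.frame(plotID = seq_len(3),
                 soil_ph = c(5.5, 5.4, 6.1),
                 soil_temperature = c(10, 11, 12),
                 forest_type = c("coniferous", "coniferous", "deciduous"))

is.list(df)            # TRUE: a data.frame is built on top of a list
names(df)              # the column names are the names of the list elements
length(df)             # number of list elements, i.e. the number of columns
sapply(df, class)      # one class per column; different columns can differ
is.atomic(df$soil_ph)  # TRUE: a single column is an atomic vector
```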

You can still access the individual column vectors by using the $ operator.

df$soil_ph
[1] 5.5 5.4 6.1
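
Because a data.frame is a list, the usual list accessors should work as well. The sketch below compares $ with double and single square brackets on the example data; it is meant as an illustration of the difference, not as a recommendation for one style.

```r
# Same example data as above, rebuilt so the snippet runs on its own
df <- data.frame(plotID = seq_len(3),
                 soil_ph = c(5.5, 5.4, 6.1),
                 soil_temperature = c(10, 11, 12),
                 forest_type = c("coniferous", "coniferous", "deciduous"))

df$soil_ph        # the column as a plain numeric vector
df[["soil_ph"]]   # the same vector, with the column name given as a string
df["soil_ph"]     # single brackets keep the data.frame structure (one column)

identical(df$soil_ph, df[["soil_ph"]])  # TRUE
is.data.frame(df["soil_ph"])            # TRUE
```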

data.frame subsetting

One of the most important skills you will need in R is the ability to subset a data.frame to get the information you need for your analysis. You rarely need the whole data.frame at once. Usually, you only need certain columns: to calculate the average soil temperature across all locations, for example, you only need the values of the soil_temperature column (mean(df$soil_temperature)). Let’s have a more detailed look at the following question:
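
As a quick illustration of the average soil temperature mentioned above, here is a minimal sketch on the example data. The variable name avg_temp is only chosen for this example.

```r
# Same example data as above, rebuilt so the snippet runs on its own
df <- data.frame(plotID = seq_len(3),
                 soil_ph = c(5.5, 5.4, 6.1),
                 soil_temperature = c(10, 11, 12),
                 forest_type = c("coniferous", "coniferous", "deciduous"))

# Only the soil_temperature column is needed, so we pull it out with $
avg_temp <- mean(df$soil_temperature)  # avg_temp is a hypothetical name
avg_temp                               # 11
```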

What is the highest soil pH we measured in coniferous forests?

Just by looking at our little example data.frame, we can see that the answer should be 5.5. But how do we get there with R code? We have to go step by step and reduce the data.frame logically. To get the maximum soil pH of coniferous forests, we first have to get all the pH values we gathered in coniferous forests. In general, there are only two options when reducing a data.frame:

  1. subsetting columns, i.e. reducing the data.frame in width
  2. subsetting rows, i.e. reducing the data.frame in length
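
Both options can be written with square brackets in the form df[rows, columns], where an empty position means "keep everything". The sketch below uses the example data; the names cols_only, rows_only and both are hypothetical and only serve this illustration.

```r
# Same example data as above, rebuilt so the snippet runs on its own
df <- data.frame(plotID = seq_len(3),
                 soil_ph = c(5.5, 5.4, 6.1),
                 soil_temperature = c(10, 11, 12),
                 forest_type = c("coniferous", "coniferous", "deciduous"))

# 1. Subset columns: keep all rows, reduce the width
cols_only <- df[, c("soil_ph", "forest_type")]
dim(cols_only)   # 3 rows, 2 columns

# 2. Subset rows: keep all columns, reduce the length
rows_only <- df[c(1, 2), ]
dim(rows_only)   # 2 rows, 4 columns

# Both at once: rows 1 and 2 of the soil_ph column
both <- df[c(1, 2), "soil_ph"]
both             # a plain vector: 5.5 5.4
```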

Think about which columns you need and how to get to the specific rows in those columns. Your mental image could look something like this:

Figure 2: data.frame subsetting by rows and columns

We want to end up with the two cells marked in blue. The column we need is of course soil_ph, which we can access easily with the $ operator. Getting the two specific rows is a bit trickier. Luckily, the data.frame holds more information that we can use: with a logical comparison (the == operator), we can reduce the data.frame to the rows where the forest_type column contains the word “coniferous”.

Figure 3: data.frame subsetting with logical operators
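# All measured pH values
df$soil_ph
[1] 5.5 5.4 6.1
# TRUE for every row that was measured in a coniferous forest
df$forest_type == 'coniferous'
[1]  TRUE  TRUE FALSE
# Keep only the TRUE rows, take their soil_ph column and find the maximum
max(df[df$forest_type == 'coniferous',]$soil_ph)
[1] 5.5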
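
The one-liner above can also be broken into named intermediate steps, which may be easier to read and debug. This is a sketch of one possible way to write it; is_coniferous and coniferous_ph are hypothetical names, and the subset() call at the end is a base R alternative rather than the approach used in this text.

```r
# Same example data as above, rebuilt so the snippet runs on its own
df <- data.frame(plotID = seq_len(3),
                 soil_ph = c(5.5, 5.4, 6.1),
                 soil_temperature = c(10, 11, 12),
                 forest_type = c("coniferous", "coniferous", "deciduous"))

# Step 1: logical vector that marks the coniferous rows
is_coniferous <- df$forest_type == "coniferous"

# Step 2: keep only the pH values where the marker is TRUE
coniferous_ph <- df$soil_ph[is_coniferous]

# Step 3: take the maximum of what is left
max(coniferous_ph)   # 5.5

# Alternative with base R's subset(), which filters rows by a condition
max(subset(df, forest_type == "coniferous")$soil_ph)
```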

References

Chambers, John M. 2020. “S, R, and Data Science.” The R Journal 12 (1): 462. https://doi.org/10.32614/RJ-2020-028.