Introduction to data.frame
From a data science perspective, the most important class of objects is the data frame - Chambers (2020)
- Data ordered in rows and columns - just like a spreadsheet
- Technical implementation in R:
- data.frame is a list of vectors - each vector is one column
- vectors are atomic - each value in a column has the same data type
- different columns can have different data types
df = data.frame(plotID = seq(3),
                soil_ph = c(5.5, 5.4, 6.1),
                soil_temperature = c(10, 11, 12),
                forest_type = c("coniferous", "coniferous", "deciduous"))
class(df)
[1] "data.frame"
df
plotID soil_ph soil_temperature forest_type
1 1 5.5 10 coniferous
2 2 5.4 11 coniferous
3 3 6.1 12 deciduous
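The list-of-vectors structure described above can be checked directly in R. The following is a minimal sketch that rebuilds the small example and inspects it with base R functions; the column types you see depend on your R version (from R 4.0 on, text columns stay character by default).

```r
# Rebuild the small example data.frame from above
df <- data.frame(plotID = seq(3),
                 soil_ph = c(5.5, 5.4, 6.1),
                 soil_temperature = c(10, 11, 12),
                 forest_type = c("coniferous", "coniferous", "deciduous"))

# A data.frame is stored as a list ...
is.list(df)
typeof(df)

# ... with one vector per column, and each column has its own data type
sapply(df, class)

# Every column has the same length, which equals the number of rows
lengths(df)
nrow(df)
```

Because all columns must have the same length, data.frame() stops with an error if you pass vectors whose lengths do not fit together.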
When you work with actual data, you will rarely build a data.frame from scratch as in the example above. Instead, you usually have a file on your computer, e.g. a .csv file, that you want to read into R. The next lesson covers external data and how to get them into R.
For now, we simply want a larger data.frame to show and test some functions. Here, I use the temperature data from the Aasee that was also part of assignments/Ex02_second.qmd.
data = read.csv(file = "data/2021-06_aasee.csv")
# show the first few rows of the df
head(data)
Datum Wassertemperatur pH.Wert Sauerstoffgehalt
1 2021-05-31 23:57 17.98 8.05 10.53
2 2021-06-01 00:09 17.66 8.04 9.64
3 2021-06-01 00:19 18.03 8.12 11.30
4 2021-06-01 00:27 18.08 8.14 11.32
5 2021-06-01 00:39 18.06 8.12 11.06
6 2021-06-01 00:49 18.01 8.10 10.91
# show the last few rows of the df
tail(data)
Datum Wassertemperatur pH.Wert Sauerstoffgehalt
4220 2021-06-30 22:57 23.73 8.78 17.80
4221 2021-06-30 23:09 23.70 8.72 17.66
4222 2021-06-30 23:18 23.68 8.73 17.72
4223 2021-06-30 23:29 23.64 8.81 18.38
4224 2021-06-30 23:39 23.62 8.76 17.93
4225 2021-06-30 23:49 23.63 8.77 17.82
# get a short summary of the structure
str(data)
'data.frame': 4225 obs. of 4 variables:
$ Datum : chr "2021-05-31 23:57" "2021-06-01 00:09" "2021-06-01 00:19" "2021-06-01 00:27" ...
$ Wassertemperatur: num 18 17.7 18 18.1 18.1 ...
$ pH.Wert : num 8.05 8.04 8.12 8.14 8.12 8.1 8.1 8.1 8.1 8.1 ...
$ Sauerstoffgehalt: num 10.53 9.64 11.3 11.32 11.06 ...
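Besides head(), tail() and str(), a few more base R functions give a quick overview of a data.frame. The sketch below uses a hypothetical stand-in called aasee_small, built from the first three rows printed above, so that it runs without the .csv file.

```r
# Hypothetical stand-in: the first three rows of the Aasee data shown above
aasee_small <- data.frame(
  Datum = c("2021-05-31 23:57", "2021-06-01 00:09", "2021-06-01 00:19"),
  Wassertemperatur = c(17.98, 17.66, 18.03),
  pH.Wert = c(8.05, 8.04, 8.12),
  Sauerstoffgehalt = c(10.53, 9.64, 11.30))

# number of rows and columns
dim(aasee_small)
nrow(aasee_small)
ncol(aasee_small)

# column names
names(aasee_small)

# summary statistics for every column
summary(aasee_small)
```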
data.frame subsetting
You can think of data.frames as 2-d vectors. Subsetting a data.frame hence requires two values, one for the row subset and one for the column subset:
# row 1, column 2
data[1,2]
[1] 17.98
# the first row, empty means "everything"
data[1,]
Datum Wassertemperatur pH.Wert Sauerstoffgehalt
1 2021-05-31 23:57 17.98 8.05 10.53
# the first 3 rows, column 3 and 4
data[seq(3), c(3,4)]
pH.Wert Sauerstoffgehalt
1 8.05 10.53
2 8.04 9.64
3 8.12 11.30
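Two details of the [row, column] notation are easy to miss: selecting a single column returns a plain vector unless you ask otherwise, and negative indices exclude rows or columns. Here is a short sketch of both, again with a hypothetical stand-in aasee_small built from the rows printed above.

```r
# Hypothetical stand-in: the first three rows of the Aasee data shown above
aasee_small <- data.frame(
  Datum = c("2021-05-31 23:57", "2021-06-01 00:09", "2021-06-01 00:19"),
  Wassertemperatur = c(17.98, 17.66, 18.03),
  pH.Wert = c(8.05, 8.04, 8.12),
  Sauerstoffgehalt = c(10.53, 9.64, 11.30))

# all rows, column 2: the result is simplified to a vector
aasee_small[, 2]

# drop = FALSE keeps the result a data.frame with one column
aasee_small[, 2, drop = FALSE]

# negative indices exclude: everything except row 1 and columns 3 and 4
aasee_small[-1, -c(3, 4)]
```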
If you want to extract a column of a data.frame, you can use the $ operator. The resulting object is a vector and hence has only one dimension. This is important to recognize, because subsetting a column accessed with $ needs only one value (see example below).
temperature <- data$Wassertemperatur
summary(temperature)
Min. 1st Qu. Median Mean 3rd Qu. Max.
17.18 21.65 23.07 23.15 24.52 28.96
# subsetting a column accessed with $ only needs one value
data$pH.Wert[100]
[1] 8.38
# you can also define new columns with the $ operator
data$location <- "Aasee"
head(data)
Datum Wassertemperatur pH.Wert Sauerstoffgehalt location
1 2021-05-31 23:57 17.98 8.05 10.53 Aasee
2 2021-06-01 00:09 17.66 8.04 9.64 Aasee
3 2021-06-01 00:19 18.03 8.12 11.30 Aasee
4 2021-06-01 00:27 18.08 8.14 11.32 Aasee
5 2021-06-01 00:39 18.06 8.12 11.06 Aasee
6 2021-06-01 00:49 18.01 8.10 10.91 Aasee
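To make the "one dimension" point concrete, the following sketch checks what $ returns and shows two ways of adding a column. It uses a hypothetical stand-in aasee_small built from the rows printed above; the column name temperature_rounded is made up for illustration.

```r
# Hypothetical stand-in: the first three rows of the Aasee data shown above
aasee_small <- data.frame(
  Datum = c("2021-05-31 23:57", "2021-06-01 00:09", "2021-06-01 00:19"),
  Wassertemperatur = c(17.98, 17.66, 18.03),
  pH.Wert = c(8.05, 8.04, 8.12),
  Sauerstoffgehalt = c(10.53, 9.64, 11.30))

# $ returns a plain vector: it has a length, but no rows and columns
temperature <- aasee_small$Wassertemperatur
length(temperature)
dim(temperature)

# [[ ]] with the column name extracts the same vector
identical(temperature, aasee_small[["Wassertemperatur"]])

# a single value is repeated for every row
aasee_small$location <- "Aasee"

# a new column can also be computed from an existing one
aasee_small$temperature_rounded <- round(aasee_small$Wassertemperatur)
aasee_small
```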
Subsetting a data.frame by column names also works:
# read: from data, rows 1 to 5, and columns with the name Datum and pH.Wert
data[seq(5), c("Datum", "pH.Wert")]
Datum pH.Wert
1 2021-05-31 23:57 8.05
2 2021-06-01 00:09 8.04
3 2021-06-01 00:19 8.12
4 2021-06-01 00:27 8.14
5 2021-06-01 00:39 8.12
Of course, logical conditions also work for data.frame subsetting. The comparison returns one TRUE or FALSE per row, and only the rows with TRUE are kept:
# read: from data, only the rows where pH is larger than 8.9, and all the columns
data[data$pH.Wert > 8.9, ]
Datum Wassertemperatur pH.Wert Sauerstoffgehalt location
4056 2021-06-29 19:20 25.37 8.92 21.21 Aasee
4185 2021-06-30 17:07 24.14 8.91 20.58 Aasee
4186 2021-06-30 17:18 24.14 8.91 20.26 Aasee
4188 2021-06-30 17:38 24.17 8.92 20.50 Aasee
4189 2021-06-30 17:49 24.15 8.93 20.56 Aasee
4190 2021-06-30 17:59 24.14 8.92 20.15 Aasee
4191 2021-06-30 18:07 24.13 8.91 19.97 Aasee
4192 2021-06-30 18:19 24.14 8.91 19.71 Aasee
4193 2021-06-30 18:29 24.12 8.91 19.70 Aasee
4194 2021-06-30 18:39 24.12 8.92 19.43 Aasee
# read: from data, only rows where temperature is smaller than 17.5, and only the column with the name Datum
data[data$Wassertemperatur < 17.5, "Datum"]
[1] "2021-06-01 06:59" "2021-06-01 07:09" "2021-06-01 07:18" "2021-06-01 07:29"
[5] "2021-06-01 07:39" "2021-06-01 07:50" "2021-06-01 08:09" "2021-06-01 08:17"
[9] "2021-06-01 08:38" "2021-06-01 08:49" "2021-06-01 08:58" "2021-06-01 11:58"
[13] "2021-06-01 12:19" "2021-06-01 12:59" "2021-06-01 13:04" "2021-06-01 13:13"
[17] "2021-06-01 13:25"
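It can help to look at the logical vector itself before using it as a row subset. The sketch below does that and combines two conditions with &; it uses a hypothetical stand-in aasee_small built from the six rows that head(data) printed above, and the thresholds are chosen only for illustration.

```r
# Hypothetical stand-in: the first six rows of the Aasee data shown above
aasee_small <- data.frame(
  Datum = c("2021-05-31 23:57", "2021-06-01 00:09", "2021-06-01 00:19",
            "2021-06-01 00:27", "2021-06-01 00:39", "2021-06-01 00:49"),
  Wassertemperatur = c(17.98, 17.66, 18.03, 18.08, 18.06, 18.01),
  pH.Wert = c(8.05, 8.04, 8.12, 8.14, 8.12, 8.10),
  Sauerstoffgehalt = c(10.53, 9.64, 11.30, 11.32, 11.06, 10.91))

# the comparison gives one TRUE or FALSE per row
warm <- aasee_small$Wassertemperatur > 18
warm

# which() turns the logical vector into row numbers
which(warm)

# combine conditions with & (and) or | (or)
warm_and_basic <- aasee_small[warm & aasee_small$pH.Wert > 8.1, ]
warm_and_basic

# sum() counts the TRUE values, i.e. the number of matching rows
sum(warm)
```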