library(dplyr)
= sunshine |>
sunshine filter(Jahr > 2015 & Jahr <= 2020) |>
select("Jahr", "Sachsen", "Hessen", "Bayern")
Wide vs. Long data.frames
Most data you collect yourself or can download from data repositories will come in some variation or form of a “Excel” style spreadsheet. What I mean by that is, that a row in the table contains different features of the same observation. Each column on the other hand contains the same feature of many different observations. Of course, what information we store with feature and observation heavily depends on the content and purpose of the data.
Take a look at the following data from the German weather service DWD about sunshine duration across German states. You can get this data here. I will only use a subset of the data:
Wide format
Jahr | Sachsen | Hessen | Bayern |
---|---|---|---|
2016 | 649.2 | 579.4 | 638.5 |
2017 | 682.9 | 608.2 | 720.3 |
2018 | 772.1 | 774.0 | 779.6 |
2019 | 791.2 | 769.3 | 787.0 |
2020 | 653.0 | 640.2 | 695.6 |
Take a conscious look on this data.frame
for a moment.
- What information does one row contain?
- What is stored in one column?
- What is the meaning of the numbers in the
data.frame
? (And maybe more importantly: How do you know?)
This typical format is called wide as we have information (i.e. the sunshine duration) stored in different columns for different states per observation year (the row). While this is a perfectly fine form of data storage (e.g. you can get a good overview of the data), there is a second organisation scheme that can also be very useful, especially in the context of data analysis and visualization with R: the long format.
Long format
The principle of the long format is, that we have one column where we store the actual measurement values (sunshine duration in this example). Along that, we have multiple columns with metadata that describe and “encode” each value. Take a look at the example data from above in the long format:
Jahr | Bundesland | Sonnenscheindauer |
---|---|---|
2016 | Sachsen | 649.2 |
2016 | Hessen | 579.4 |
2016 | Bayern | 638.5 |
2017 | Sachsen | 682.9 |
2017 | Hessen | 608.2 |
2017 | Bayern | 720.3 |
2018 | Sachsen | 772.1 |
2018 | Hessen | 774.0 |
2018 | Bayern | 779.6 |
2019 | Sachsen | 791.2 |
2019 | Hessen | 769.3 |
2019 | Bayern | 787.0 |
2020 | Sachsen | 653.0 |
2020 | Hessen | 640.2 |
2020 | Bayern | 695.6 |
All the measurement of sunshine duration from the different states are now in one column Sonnenscheindauer
. And the two columns Jahr
and Bundesland
describe each value. In a way, the long format is better for the overall understanding than the wide format as it is way clearer what the content of each column actually means.
Conversion between long and wide with tidyr
library(tidyr)
= pivot_longer(sunshine,
sunshine_long cols = c("Sachsen", "Hessen", "Bayern"),
names_to = "Bundesland",
values_to = "Sonnenscheindauer")
Jahr | Bundesland | Sonnenscheindauer |
---|---|---|
2016 | Sachsen | 649.2 |
2016 | Hessen | 579.4 |
2016 | Bayern | 638.5 |
2017 | Sachsen | 682.9 |
2017 | Hessen | 608.2 |
2017 | Bayern | 720.3 |
2018 | Sachsen | 772.1 |
2018 | Hessen | 774.0 |
2018 | Bayern | 779.6 |
2019 | Sachsen | 791.2 |
2019 | Hessen | 769.3 |
2019 | Bayern | 787.0 |
2020 | Sachsen | 653.0 |
2020 | Hessen | 640.2 |
2020 | Bayern | 695.6 |
= pivot_wider(sunshine_long,
sunshine_wide names_from = "Bundesland",
values_from = "Sonnenscheindauer")
Jahr | Sachsen | Hessen | Bayern |
---|---|---|---|
2016 | 649.2 | 579.4 | 638.5 |
2017 | 682.9 | 608.2 | 720.3 |
2018 | 772.1 | 774.0 | 779.6 |
2019 | 791.2 | 769.3 | 787.0 |
2020 | 653.0 | 640.2 | 695.6 |
= pivot_wider(sunshine_long,
sunshine_year names_from = "Jahr",
values_from = "Sonnenscheindauer")
Bundesland | 2016 | 2017 | 2018 | 2019 | 2020 |
---|---|---|---|---|---|
Sachsen | 649.2 | 682.9 | 772.1 | 791.2 | 653.0 |
Hessen | 579.4 | 608.2 | 774.0 | 769.3 | 640.2 |
Bayern | 638.5 | 720.3 | 779.6 | 787.0 | 695.6 |
If you want to go into more details for pivot_wider
and pivot_longer
check out this tidyr reference article.