Wide vs. Long data.frames

Most data you collect yourself or can download from data repositories will come in some variation or form of a “Excel” style spreadsheet. What I mean by that is, that a row in the table contains different features of the same observation. Each column on the other hand contains the same feature of many different observations. Of course, what information we store with feature and observation heavily depends on the content and purpose of the data.

Take a look at the following data from the German weather service DWD about sunshine duration across German states. You can get this data here. I will only use a subset of the data:

library(dplyr)

sunshine = sunshine |>
    filter(Jahr > 2015 & Jahr <= 2020) |>
    select("Jahr", "Sachsen", "Hessen", "Bayern")

Wide format

Jahr	Sachsen	Hessen	Bayern
2016	649.2	579.4	638.5
2017	682.9	608.2	720.3
2018	772.1	774.0	779.6
2019	791.2	769.3	787.0
2020	653.0	640.2	695.6

Take a conscious look on this data.frame for a moment.

What information does one row contain?
What is stored in one column?
What is the meaning of the numbers in the data.frame? (And maybe more importantly: How do you know?)

This typical format is called wide as we have information (i.e. the sunshine duration) stored in different columns for different states per observation year (the row). While this is a perfectly fine form of data storage (e.g. you can get a good overview of the data), there is a second organisation scheme that can also be very useful, especially in the context of data analysis and visualization with R: the long format.

Long format

The principle of the long format is, that we have one column where we store the actual measurement values (sunshine duration in this example). Along that, we have multiple columns with metadata that describe and “encode” each value. Take a look at the example data from above in the long format:

Jahr	Bundesland	Sonnenscheindauer
2016	Sachsen	649.2
2016	Hessen	579.4
2016	Bayern	638.5
2017	Sachsen	682.9
2017	Hessen	608.2
2017	Bayern	720.3
2018	Sachsen	772.1
2018	Hessen	774.0
2018	Bayern	779.6
2019	Sachsen	791.2
2019	Hessen	769.3
2019	Bayern	787.0
2020	Sachsen	653.0
2020	Hessen	640.2
2020	Bayern	695.6

All the measurement of sunshine duration from the different states are now in one column Sonnenscheindauer. And the two columns Jahr and Bundesland describe each value. In a way, the long format is better for the overall understanding than the wide format as it is way clearer what the content of each column actually means.

Conversion between long and wide with `tidyr`

library(tidyr)

sunshine_long = pivot_longer(sunshine,
                             cols = c("Sachsen", "Hessen", "Bayern"),
                             names_to = "Bundesland",
                             values_to = "Sonnenscheindauer")

Jahr	Bundesland	Sonnenscheindauer
2016	Sachsen	649.2
2016	Hessen	579.4
2016	Bayern	638.5
2017	Sachsen	682.9
2017	Hessen	608.2
2017	Bayern	720.3
2018	Sachsen	772.1
2018	Hessen	774.0
2018	Bayern	779.6
2019	Sachsen	791.2
2019	Hessen	769.3
2019	Bayern	787.0
2020	Sachsen	653.0
2020	Hessen	640.2
2020	Bayern	695.6

sunshine_wide = pivot_wider(sunshine_long,
                            names_from = "Bundesland",
                            values_from = "Sonnenscheindauer")

Jahr	Sachsen	Hessen	Bayern
2016	649.2	579.4	638.5
2017	682.9	608.2	720.3
2018	772.1	774.0	779.6
2019	791.2	769.3	787.0
2020	653.0	640.2	695.6

sunshine_year = pivot_wider(sunshine_long,
                            names_from = "Jahr",
                            values_from = "Sonnenscheindauer")

Bundesland	2016	2017	2018	2019	2020
Sachsen	649.2	682.9	772.1	791.2	653.0
Hessen	579.4	608.2	774.0	769.3	640.2
Bayern	638.5	720.3	779.6	787.0	695.6

If you want to go into more details for pivot_wider and pivot_longer check out this tidyr reference article.

Wide vs. Long data.frames

Wide format

Long format

Conversion between long and wide with tidyr

Conversion between long and wide with `tidyr`