Tidy datasets are all alike
but every messy dataset is messy in its own way.
Hadley Wickham (2014b, p. 2)
The notion of tidy data lies at the core of the tidyverse. Importantly, any dataset can be organized in a variety of formats. Although different formats can all contain the same data, they differ in how easy or hard they are to work with (e.g., for specific analyses or plots). Tidy data is formatted in a simple and straightforward way, which makes it immensely practical (e.g., easy to understand, analyze, and transform into other formats). Before defining the notion of tidy data, we first compare and contrast different sets of rectangular data that contain the same information.
After working through this chapter, you will be able to:
- describe and organize the layout of data tables;
- define the notion of tidy data; and
- use tidyr commands to:
  - separate one variable into the values of two variables;
  - unite the values of two variables into one variable;
  - gather values distributed over multiple columns into one variable;
  - spread the values of a variable over multiple columns.
7.1.2 Varieties of tabular data
In R, rectangular data is typically organized in data frames or tibbles (see Section 1.5 and Chapter 5). Importantly, each column is a vector (of a particular type) that contains the values of a variable. Thus, whereas every column must be of a single type, every row (a.k.a. a “case” or “observation”) can contain the values of different variables and types.
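To make the distinction between columns and rows concrete, here is a minimal sketch in base R. The data frame `df` and its columns `name`, `age`, and `student` are made up for illustration and are not part of this chapter's data:

```r
# A small, made-up data frame: each column is a vector of a single type.
df <- data.frame(
  name    = c("Ann", "Ben", "Cal"),   # character vector
  age     = c(21, 34, 28),            # numeric vector
  student = c(TRUE, FALSE, TRUE),     # logical vector
  stringsAsFactors = FALSE
)

# Every column has exactly one type:
col_types <- sapply(df, class)
col_types

# A row (case/observation) combines values of different types:
df[2, ]
```

Inspecting `col_types` shows one type per column, whereas the single row `df[2, ]` mixes a character, a numeric, and a logical value.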
The same set of data (i.e., the values of variables) can be organized in many different ways. For instance, consider the following bar plot, which shows the absolute number of TB cases documented by the World Health Organization in three countries (Afghanistan, Brazil, and China) in two years (1999 and 2000):
How can these values be represented in a table? Although the data in this example is almost trivial, there are many possible ways to arrange the data in tabular form. Perhaps the most straightforward way of representing this data in a table is the following:
knitr::kable(tidyr::table4a, caption = "table4a: TB cases per country and year.")
The figure and table directly correspond to each other. However, when trying to use ggplot2 to re-create the figure from the data in table4a, we realize that this is difficult.
Try recreating the above bar plot using ggplot2 with tidyr::table4a as data. Why is this difficult?
Describe a dataset that would make it easier to create this plot.
Using ggplot2 (and many other R functions) requires that we specify the independent and dependent variables. Here, the independent variables are country (with 3 instances) and year (with 2 instances). However, table4a does not contain a variable for year. Instead, each value of year is represented as its own column (1999 and 2000). Thus, the data contains a variable (year) that is not represented as a column.
An easier dataset to create the plot would look like this:
This data contains 6 observations (cases) and 3 variables (columns): 2 independent variables (country and year) and 1 dependent variable (cases).
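As a hedged sketch of how such a dataset could be obtained and plotted, the following code gathers the two year columns of tidyr::table4a into one variable and then maps the variables to plot aesthetics. The object names `tb_long` and `tb_plot`, the axis labels, and the dodged bar layout are assumptions made for illustration. The resulting plot is one possible version of the bar plot, not necessarily identical to the figure above:

```r
library(tidyr)
library(ggplot2)

# gather() collects the two year columns of table4a into key-value pairs
# (the newer tidyr function pivot_longer() can do the same job):
tb_long <- gather(table4a, key = "year", value = "cases", `1999`, `2000`)
tb_long  # 6 rows (3 countries x 2 years) and 3 columns

# With year as a proper column, it can be mapped to an aesthetic:
tb_plot <- ggplot(tb_long, aes(x = country, y = cases, fill = year)) +
  geom_col(position = "dodge") +
  labs(x = "Country", y = "TB cases", fill = "Year")
tb_plot
```

Note that gathering turns the former column names into values, so `year` is a character variable here, which suits a discrete fill color.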
7.1.3 One table in many formats
Let’s consider a slightly more complicated dataset that adds the population of each country as another variable:
This data (available as tidyr::table1) contains six observations (cases) and four variables (columns): two independent variables (country and year) and two dependent variables (cases and population). (See ?tidyr::table1 for a description and source information.)
The following tables (or tibbles) all provide the same data in different formats:
- The data in table2 still contains our two dependent variables (counts of TB cases and population), but combines them in one variable (count). To signal which variable is described by count, the data contains a new variable type that characterizes the count variable:
- The data in table3 still contains the same information as table1 and table2, but collapses the information previously contained in cases and population into one variable rate. The new rate variable is represented as a ratio, which actually contains two pieces of information (numerator and denominator):
- The data in table4a and table4b split the information into 2 tables: table4a contains the values of TB cases and table4b the counts of the population. However, each sub-table splits the year variable into two separate variables:
- The data in table5 is similar to table3 in containing the rate variable (which really consists of two variables), but additionally splits up the year information into two variables (noting the century and a 2-digit year):
Thus, all these tables contain the same information, but differ in their layout or format. Theoretically, all these tables are equal, but some are more equal — or rather more practical — than others.
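The claim that these tables contain the same information can be checked directly. The following sketch uses the tidyr commands spread() and separate() to bring table2 and table3 into the layout of table1 and then compares the values. The object names `t2_wide` and `t3_sep` are assumptions made for illustration:

```r
library(tidyr)

# table2: spread the key-value pair (type, count) over two columns.
t2_wide <- spread(table2, key = type, value = count)

# table3: separate the ratio "cases/population" into two numeric columns.
t3_sep <- separate(table3, rate, into = c("cases", "population"),
                   sep = "/", convert = TRUE)

# Both results contain the same values as table1:
all(t2_wide$cases == table1$cases)
all(t2_wide$population == table1$population)
all(t3_sep$cases == table1$cases)
all(t3_sep$population == table1$population)
```

The comparison uses == on the values rather than identical() on the tables, because the column types (integer vs. double) may differ between the transformed tables and table1.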
- Recreate the above bar plot using ggplot2 with
7.1.4 Defining tidy data
In the previous section, we compared many tables that contain the same data. The key motivation for tidy data is that some formats are easier to work with than others.
Definition: A tidy dataset conforms to three interrelated rules:
Each variable has its own column.
Each case/observation has its own row.
Each value has its own cell.
See the paper at https://www.jstatsoft.org/article/view/v059i10 (Wickham, 2014b) for the background of tidy data and http://r4ds.had.co.nz/tidy-data.html#fig:tidy-structure for a graphical illustration of these rules.
The three rules defining tidy data are connected, as it is impossible to only satisfy two of the three rules. This leads to a simpler set of two practical instructions for tidying a messy set of data:
A. Turn each dataset into a tibble.
B. Put each variable into a column.
Assuming that you are dealing with a single tibble, the key instruction to remember is B: Put each variable into a column.
To achieve this, you primarily need to define what should be considered as an observation and as a variable.
For instance, measuring the IQ of \(N=100\) students twice (e.g., at the age of 10 and the age of 20) could be represented in a table that contains 100 rows (one for each student) and two separate variables for the IQ values (i.e., one column for each measurement occasion).
However, it would seem tidier to represent the two IQ measurements as one variable (iq) per person and qualify the two measurement occasions by a key variable (age, with two possible values 10 and 20). This table would contain two rows per person (i.e., 200 observations overall).
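A minimal sketch of both layouts, using made-up scores for only 3 students rather than 100. The data values and the names `iq_wide`, `iq_long`, `id`, `iq_10`, and `iq_20` are hypothetical and serve only to illustrate the two formats:

```r
library(tidyr)

# Made-up IQ scores for 3 students (not real data); the column names
# iq_10 and iq_20 are hypothetical labels for the two measurement occasions.
iq_wide <- data.frame(
  id    = 1:3,
  iq_10 = c(98, 105, 112),
  iq_20 = c(101, 103, 118)
)

# Gather the two IQ columns into a key variable (age) and a value variable (iq):
iq_long <- gather(iq_wide, key = "age", value = "iq", iq_10, iq_20)

# Turn the labels "iq_10" and "iq_20" into the numeric ages 10 and 20:
iq_long$age <- as.numeric(sub("iq_", "", iq_long$age))

iq_long  # 2 rows per student, i.e., 6 rows for 3 students
```

Which layout counts as tidy depends on what we define as an observation: a student (wide layout) or a single measurement of a student (long layout).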
- Which of the data tables in the above example (table1 to table5) are tidy? Why or why not?
In our previous examples (in Section 7.1.3), only the data of table1 was tidy, while the data in table2 to table5 all were messy in some way.
7.1.5 Advantages of tidy data
From a theoretical viewpoint, being in a tidy format is not inherently better or worse than any of the other data formats. However, just like not all plots are equally suited to make a particular point, not all data formats are equally suited to be analyzed and transformed. As tools (like functions and packages) require data to be in a particular format, they can only be applied if the data format fits the requirements of the tool. Overall, tidy data has the following advantages:
Consistency: Consistent data structures make it easier to learn the tools that work with them because they have an underlying uniformity.
Vectorization: Placing variables in columns allows R’s vectorised nature to shine. For instance, the basic dplyr verbs (and most built-in R functions) work with vectors of values. That makes transforming tidy data easy and natural.
Matching data and tools: The tidyverse packages — like dplyr, ggplot2, and many others — are designed to work with tidy data.
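To illustrate the vectorization point, here is a small sketch that computes a rate from tidyr::table1. The scaling to cases per 10,000 people and the names `rate` and `t1_rate` are assumptions chosen for illustration:

```r
library(tidyr)

# Because cases and population are columns (vectors), the rate for all six
# observations is computed in one vectorized expression (base R):
rate <- table1$cases / table1$population * 10000  # TB cases per 10,000 people
round(rate, 2)

# The same computation with a dplyr verb, which adds the result as a new column:
t1_rate <- dplyr::mutate(table1, rate = cases / population * 10000)
t1_rate
```

Neither version needs a loop over rows, as each arithmetic operation applies to entire columns at once.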
The key advantage of tidy data is that it is typically easy to work with. This does not mean that we never have to change its format. However, given the tools provided by R and the tidyverse, tidy data can easily be transformed into other formats.
Note some common misconceptions: Although tidy data tends to be a good thing, it is typically not the most compact and not necessarily the most human-readable version of a dataset. Similarly, many graphical or statistical methods require data in shapes that are not tidy (e.g., running linear regression models requires data to be in so-called long format). Again, tidy data is not an end in itself, but often a means for easily transforming data into alternative shapes. Finally, it can be difficult to decide which datasets are considered to be tidy: We usually need to interpret the semantics of the rows and columns (i.e., understand the meanings of observations and variables) to determine the tidiness of a dataset.
7.1.6 Data used
In this chapter, we will first use some variants of a simple example dataset (i.e., table1 to table5 of the tidyr package, and table8 of the ds4psy package).
However, we will also use other datasets from the dplyr and ds4psy packages, as well as revisit some data used in Tibbles (Chapter 5).
7.1.7 Getting ready
This chapter formerly assumed that you have read and worked through Chapter 12: Tidy data of the r4ds textbook (Wickham & Grolemund, 2017). It now can be read by itself, but reading Chapter 12 of r4ds is still recommended.
Please do the following to get started:
Structure your document by inserting headings and empty lines between different parts. Here’s an example of how your initial file could look:
    ---
    title: "Chapter 7: Tidying data"
    author: "Your name"
    date: "2021 June 14"
    output: html_document
    ---

    Add text or code chunks here.

    # Exercises (07: Tidying data)

    ## Exercise 1

    ## Exercise 2

    etc.

    <!-- The end (eof). -->
Create an initial code chunk below the header of your .Rmd file that loads the R packages of the tidyverse (and see Section F.3.3 if you want to get rid of the messages and warnings of this chunk in your HTML output).
Save your file (e.g., as 07_tidy.Rmd in the R folder of your current project) and remember to save and knit it regularly as you keep adding content to it.
Next, we will consider four essential tidyr commands that help to create and transform tidy data.