7.1 Introduction

Tidy datasets are all alike
but every messy dataset is messy in its own way.

Hadley Wickham (2014b, p. 2)

The notion of tidy data lies at the core of the tidyverse. Importantly, any dataset can be organized in a variety of formats. Although different formats can all contain the same data, they still differ in how easy or hard they are to work with. As we have argued in Section 3.1.1, the ease or difficulty of working with a particular dataset depends on the combination of our current goals, our experience and skills, and the tools (e.g., functions and packages) we are using.

The technical notion of tidy data refers to a shape that is formatted in a simple and straightforward fashion, which makes it immensely practical (e.g., easy to understand, analyze, and transform into other formats). Before defining the notion of tidy data, we first compare and contrast different sets of rectangular data that contain the same information.

7.1.1 Objectives

After working through this chapter, you will be able to:

  1. describe and organize the layout of data tables;
  2. define the notion of tidy data; and use tidyr commands to:
  3. separate one variable into the values of two variables;
  4. unite the values of two variables into one variable;
  5. gather values distributed over multiple columns into one variable;
  6. spread the values of a variable over multiple columns.

7.1.2 Varieties of tabular data

In R, rectangular data is by far the most common shape of data and is typically stored as data frames or tibbles (see Section 1.5 and Chapter 5). Importantly, such tables are internally stored as lists of atomic vectors. This means that each column of a table is a vector (of a particular type) that contains the values of a variable. Thus, whereas every column must be of one (single, homogeneous) type, every row (aka a “case” or “observation”) can contain the values of different (heterogeneous) variables and types.
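The column-wise storage described above can be inspected directly. The following is a minimal sketch, assuming that the tidyr package is installed; the object names tb, col_types, and row_1 are only illustrative:

```r
# A tibble is a list whose elements are equally long column vectors.
tb <- tidyr::table1

is.list(tb)                     # a data frame (or tibble) is a list ...
col_types <- sapply(tb, class)  # ... with one vector of a single type per column
col_types

# Each column is homogeneous, but a single row can mix types:
row_1 <- as.list(tb[1, ])       # first observation as a list of 4 values
str(row_1)
```

Here, country is a character vector, whereas the other columns are numeric, so the first row combines values of different types.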

As we have seen in Section 3.1.1, the same data (e.g., the values of a variable) can be organized in many different ways. Transforming the shape of data without changing any values is known as reshaping data, as opposed to reducing or enhancing data. In this chapter, we extend the notion of “reshaping data” to different forms of rectangular tables. As a starting point, consider the following bar plot, which shows the absolute number of TB cases documented by the World Health Organization for three countries (Afghanistan, Brazil, and China) in two years (1999 and 2000):

A seemingly simple question is:

  • How can we represent this data in a table?

Although the amount of data in this example is very small, there are many possible ways to arrange the data in tabular form. Perhaps the most straightforward way of representing this data in a table is the following:

knitr::kable(tidyr::table4a, caption = "table4a: TB cases per country and year.")
Table 7.1: table4a: TB cases per country and year.
country 1999 2000
Afghanistan 745 2666
Brazil 37737 80488
China 212258 213766

The figure and table directly correspond to each other. However, when trying to use ggplot2 for re-creating the figure from the data in table4a, we realize that this is difficult.

Practice

  • Try recreating the above bar plot using ggplot2 with tidyr::table4a as data. Why is this difficult?

  • Describe a dataset that would make it easier to create this plot.

Answer

Using ggplot2 (and many other R functions) requires that we specify the independent and dependent variables. Here, the independent variables are country (with 3 instances) and year (with 2 instances). However, table4a does not contain a variable for year. Instead, each instance of the variable is represented as a column (variable). Thus, the data contains a variable (year) that is not represented as a column.

An easier dataset to create the plot would look like this:

Table 7.2: TB cases per country and year (in tidy format).
country year cases
Afghanistan 1999 745
Afghanistan 2000 2666
Brazil 1999 37737
Brazil 2000 80488
China 1999 212258
China 2000 213766

This data contains 6 observations (cases) and 3 variables (columns): 2 independent variables (country and year) and 1 dependent variable (cases).
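One way of getting from table4a to the layout of Table 7.2 is sketched below. It assumes that the tidyr and ggplot2 packages are installed and uses pivot_longer(), the newer counterpart of the gather() command named in the objectives. The object names tb_tidy and tb_plot are illustrative, and the plot is one possible bar plot rather than a reconstruction of the exact figure shown above:

```r
library(tidyr)
library(ggplot2)

# Move the year columns of table4a into a 'year' variable and
# their values into a 'cases' variable (as in Table 7.2):
tb_tidy <- pivot_longer(table4a,
                        cols = c(`1999`, `2000`),
                        names_to = "year",
                        values_to = "cases")
tb_tidy

# With one column per variable, the aesthetic mappings are straightforward:
tb_plot <- ggplot(tb_tidy, aes(x = country, y = cases, fill = year)) +
  geom_col(position = "dodge") +
  labs(title = "TB cases per country and year", fill = "Year")
tb_plot
```

Note that year is a character variable after reshaping (as it stems from column names), which is convenient for mapping it to a discrete fill color.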

7.1.3 One table in many formats

Let’s consider a slightly more complicated dataset that adds the population of each country as another variable:

Table 7.3: A tidy data table (table1).
country year cases population
Afghanistan 1999 745 19987071
Afghanistan 2000 2666 20595360
Brazil 1999 37737 172006362
Brazil 2000 80488 174504898
China 1999 212258 1272915272
China 2000 213766 1280428583

This data – available as tidyr::table1 – contains six observations (cases) and four variables (columns): Two independent variables (country and year) and two dependent variables (cases and population). (See ?tidyr::table1 for a description and source information.) The following tables (or tibbles) all provide the same data in different formats:

  • The data in table2 still contains our two dependent variables (counts of TB cases and population), but combines them in one variable (count). To signal which variable is described by count, the data contains a new variable type that characterizes the count variable:
Table 7.4: One variable (count) contains two DVs (table2).
country year type count
Afghanistan 1999 cases 745
Afghanistan 1999 population 19987071
Afghanistan 2000 cases 2666
Afghanistan 2000 population 20595360
Brazil 1999 cases 37737
Brazil 1999 population 172006362
Brazil 2000 cases 80488
Brazil 2000 population 174504898
China 1999 cases 212258
China 1999 population 1272915272
China 2000 cases 213766
China 2000 population 1280428583

A data table in this form (with several variables characterizing the cases on \(n\)-categories and one dedicated variable that provides the frequency count of each category combination) is often called a contingency table. In R, such contingency tables can also be represented in multi-dimensional data structures (of type “array”, “table” or “xtabs”).

  • The data in table3 still contains the same information as table1 and table2, but collapses the information previously contained in cases and population into one variable rate. The new rate variable is represented as a ratio, which actually contains two pieces of information (numerator and denominator):
Table 7.5: One variable (rate) contains 2 DVs (table3).
country year rate
Afghanistan 1999 745/19987071
Afghanistan 2000 2666/20595360
Brazil 1999 37737/172006362
Brazil 2000 80488/174504898
China 1999 212258/1272915272
China 2000 213766/1280428583
  • The data in table4a and table4b split the information into 2 tables: table4a contains the values of TB cases and table4b the counts of the population. However, each sub-table splits the year variable into two separate variables:
Table 7.6: 1999 and 2000 contain same DV (table4a).
country 1999 2000
Afghanistan 745 2666
Brazil 37737 80488
China 212258 213766
Table 7.6: 1999 and 2000 contain same DV (table4b).
country 1999 2000
Afghanistan 19987071 20595360
Brazil 172006362 174504898
China 1272915272 1280428583
  • The data in table5 is similar to table3 in containing the rate variable (which really consists of two variables), but additionally splits up the year information into two variables (noting the century and a 2-digit year):
Table 7.7: IV year in 2 columns, variable rate contains 2 DVs (table5).
country century year rate
Afghanistan 19 99 745/19987071
Afghanistan 20 00 2666/20595360
Brazil 19 99 37737/172006362
Brazil 20 00 80488/174504898
China 19 99 212258/1272915272
China 20 00 213766/1280428583

Thus, all these tables contain the same information, but differ in their layout or format. Theoretically, all these tables are equal, but some are more equal — or rather more practical — than others.
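The claim that all tables carry the same information can be checked by converting each of them back into the layout of table1. The following is a sketch under the assumption that the tidyr and dplyr packages are installed. The helper same_values() and the objects t2 to t5 are hypothetical names introduced for this illustration, and the pivot functions used here are the newer counterparts of gather() and spread():

```r
library(tidyr)
library(dplyr)

# table2: spread the two kinds of counts into separate columns
t2 <- pivot_wider(table2, names_from = type, values_from = count)

# table3: split the rate string into its numerator and denominator
t3 <- separate(table3, rate, into = c("cases", "population"),
               sep = "/", convert = TRUE)

# table4a and table4b: gather the year columns, then join both tables
t4a <- pivot_longer(table4a, cols = -country,
                    names_to = "year", values_to = "cases")
t4b <- pivot_longer(table4b, cols = -country,
                    names_to = "year", values_to = "population")
t4  <- left_join(t4a, t4b, by = c("country", "year"))

# table5: paste century and 2-digit year together, then split rate
t5 <- table5 %>%
  unite("year", century, year, sep = "") %>%
  separate(rate, into = c("cases", "population"), sep = "/", convert = TRUE)

# Compare the recovered values with table1 (ignoring column types):
same_values <- function(x) {
  all(as.numeric(x$year) == table1$year,
      x$cases == table1$cases,
      x$population == table1$population)
}
sapply(list(table2 = t2, table3 = t3, table4 = t4, table5 = t5), same_values)
```

Each format requires a different combination of steps, which illustrates that every messy table is messy in its own way.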

Practice

  • Recreate the above bar plot using ggplot2 with tidyr::table1 as data.

7.1.4 Defining tidy data

In the previous section, we compared many tables that contain the same data. The key motivation for tidy data is that some formats are easier to work with than others.

Definition: A tidy dataset conforms to three interrelated rules:

  1. Each variable has its own column.

  2. Each case/observation has its own row.

  3. Each value has its own cell.

See the paper at https://www.jstatsoft.org/article/view/v059i10 (Wickham, 2014b) for the background of tidy data and http://r4ds.had.co.nz/tidy-data.html#fig:tidy-structure for a graphical illustration of these rules.

The three rules defining tidy data are connected, as it is impossible to only satisfy two of the three rules. This leads to a simpler set of two practical instructions for tidying a messy set of data:

A. Turn each dataset into a tibble.

B. Put each variable into a column.

Assuming that you are dealing with a single tibble, the key instruction to remember is B: Put each variable into a column. To achieve this, we primarily need to define what should be considered as an observation (row) and what functions as a variable (column). For instance, measuring the IQ of \(N = 100\) students twice (e.g., at the age of 10 and the age of 20) could be represented in a table that contains 100 rows (one for each student) and two separate variables for the IQ values (e.g., two columns for iq_10 and iq_20). However, it would seem “tidier” to represent the two IQ measurements as one variable (iq) per person and qualify the two measurement occasions by a key variable (age, with two possible values 10 and 20). This table would contain two rows per person (i.e., 200 observations overall).
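The two layouts of the IQ example can be sketched as follows. The data is simulated and purely hypothetical (as are the object names iq_wide and iq_long and the id column); the sketch assumes that the tidyr package is installed:

```r
library(tidyr)

# Hypothetical data: simulated IQ scores of 100 students at two ages
set.seed(1)
iq_wide <- data.frame(id    = 1:100,
                      iq_10 = round(rnorm(100, mean = 100, sd = 15)),
                      iq_20 = round(rnorm(100, mean = 100, sd = 15)))
dim(iq_wide)   # 100 rows, 3 columns

# One row per measurement: 'age' qualifies the occasion, 'iq' holds the value
iq_long <- pivot_longer(iq_wide,
                        cols = c(iq_10, iq_20),
                        names_to = "age",
                        names_prefix = "iq_",
                        values_to = "iq")
iq_long$age <- as.integer(iq_long$age)
dim(iq_long)   # 200 rows, 3 columns
```

Both tables contain the same 200 IQ values, but only the second one represents the measurement occasion as a variable of its own.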

Although the notion of tidy data is important, it also remains — like many intuitive concepts — somewhat vague. The reason for this is that the term “variable” must be understood in a functional sense: A variable is some measure or description that we want to use as a variable in an analysis. For instance, depending on the particular task at hand, a time-related “variable” could be a particular date, or the month, year, or century that corresponds to a date. Thus, what can be considered “tidy” partly lies in the eyes of the beholder and depends on what we want to do with data, rather than on some inherent property of the data itself.

More generally, the difference between messy and tidy data depends on

  • (a) our goals or intended use of the data (e.g., which task do we want to address?) and

  • (b) the tools with which we typically carry out our tasks (e.g., which functions are we familiar with?).

Given the tools provided by R and prominent R packages (e.g., dplyr or ggplot2), it makes sense to first identify the variables of our analysis and then reshape the data so that each variable of interest is in its own column. Although this format is informationally equivalent to many alternative formats, it provides practical benefits for further transforming the data. For instance, we can easily use the variables to filter, select, group, or pivot the data to reshape or reduce it to answer our questions.
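As an illustration of these practical benefits, here is a brief sketch that applies some dplyr verbs to the tidy table1. It assumes that the dplyr and tidyr packages are installed; the object names and the derived variables rate and total_cases are illustrative choices:

```r
library(dplyr)

tb <- tidyr::table1

# Filter rows and select columns by variable name:
tb_2000 <- tb %>%
  filter(year == 2000) %>%
  select(country, cases)

# Add a derived variable (TB cases per 10,000 people):
tb_rate <- tb %>%
  mutate(rate = cases / population * 10000)

# Group and summarise (total cases per country over both years):
tb_sum <- tb %>%
  group_by(country) %>%
  summarise(total_cases = sum(cases))
tb_sum
```

Each step refers to variables by their column names, which is only possible because every variable of interest has its own column.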

Practice

  1. Which of the data tables in the above example (table1 to table5) are tidy? Why or why not?

  2. Is a tidy table always the most compact table (in terms of its number of cells)? (If not, provide a counterexample.)

Solution

  • ad 1.: In our previous examples (in Section 7.1.3), only the data of table1 was tidy, while the data in table2 to table5 all were messy in some way.

  • ad 2.: Tidy data can be less compact than untidy alternatives. As an example, consider and contrast the following tables:

Table 7.8: The table4a of the tidyr package is compact, but not tidy.
country 1999 2000
Afghanistan 745 2666
Brazil 37737 80488
China 212258 213766
Table 7.8: A tidy version of table4a is less compact, but easier to handle.
country year cases
Afghanistan 1999 745
Afghanistan 2000 2666
Brazil 1999 37737
Brazil 2000 80488
China 1999 212258
China 2000 213766

The table4a table (from the tidyr package) contains 9 cells, but is not tidy. By contrast, a tidy version of it (that represents the year variable in its own column) contains 18 cells. Thus, a tidy version of a table can be larger than an untidy version of the same data. Its key benefit is not its small size, but the ease with which it can be transformed in the context of the tools provided by R.
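The cell counts can be verified with a few lines of R. This is a minimal sketch, assuming that the tidyr package is installed (the object names are illustrative):

```r
library(tidyr)

tb_wide <- table4a
tb_long <- pivot_longer(table4a, cols = -country,
                        names_to = "year", values_to = "cases")

# Number of cells = rows x columns (not counting the header):
n_wide <- prod(dim(tb_wide))   # 3 x 3 = 9
n_long <- prod(dim(tb_long))   # 6 x 3 = 18
c(wide = n_wide, long = n_long)
```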

7.1.5 Advantages of tidy data

From a theoretical viewpoint, being in a tidy format is not inherently better or worse than any of the other data formats. However, just like not all plots are equally suited to make a particular point, not all data formats are equally suited to be analyzed and transformed. As tools (like functions and packages) require data to be in a particular format, they can only be applied if the data format fits the requirements of the tool. Overall, tidy data has the following advantages:

  1. Consistency: Consistent data structures make it easier to learn the tools that work with it because they have an underlying uniformity.

  2. Vectorization: Placing variables in columns allows R’s vectorised nature to shine. For instance, the basic dplyr verbs (and most built-in R functions) work with vectors of values. That makes transforming tidy data easy and natural.

  3. Matching data and tools: The tidyverse packages — like dplyr, ggplot2, and many others — are designed to work with tidy data.

The key advantage of tidy data is that it is typically easy to work with. This does not mean that we never have to change its format. However, given the tools provided by R and the tidyverse, tidy data can easily be transformed into other formats.
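The vectorization advantage listed above can be illustrated by computing the TB rate from a tidy and from an untidy table. The following sketch assumes that the tidyr package is installed and uses only base R for the computation; the object names are illustrative:

```r
tb1 <- tidyr::table1
tb3 <- tidyr::table3

# Tidy: cases and population are numeric columns, so one
# vectorized expression computes all six rates at once.
rate_1 <- tb1$cases / tb1$population

# Untidy: rate is a character column such as "745/19987071",
# so it must be taken apart before any arithmetic is possible.
class(tb3$rate)
parts  <- strsplit(tb3$rate, split = "/", fixed = TRUE)
rate_3 <- sapply(parts, function(p) as.numeric(p[1]) / as.numeric(p[2]))

all.equal(rate_1, rate_3)
```

Both approaches arrive at the same values, but the untidy format requires an additional parsing step before the data can be used.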

Note some common misconceptions: Although tidy data tends to be a good thing, it is typically not the most compact and not necessarily the most human-readable version of a dataset. Similarly, many graphical or statistical methods require data in shapes that are not tidy (e.g., running linear regression models requires data to be in so-called long format). Again, tidy data is not an end in itself, but often a means for easily transforming data into alternative shapes. Finally, it can be difficult to decide which datasets are considered to be tidy: We usually need to interpret the semantics of the rows and columns (i.e., understand the meanings of observations and variables) to determine the tidiness of a dataset.
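As a sketch of using tidy data as a starting point for other shapes, the following code turns the tidy table1 back into the more compact (and more human-readable) layout of table4a. It assumes that the tidyr package is installed and uses pivot_wider(), the newer counterpart of spread(); the object name tb_wide is illustrative:

```r
library(tidyr)

# Spread the values of 'cases' over one column per year,
# keeping 'country' as the only identifying column:
tb_wide <- pivot_wider(table1,
                       id_cols = country,
                       names_from = year,
                       values_from = cases)
tb_wide
```

The result has one row per country and one column per year, which matches the layout of table4a.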

7.1.6 Data used

In this chapter, we will first use some variants of a simple example dataset (i.e., table1 to table5 of the tidyr package, and table6 to table8 of the ds4psy package). However, we will also use other datasets from the dplyr and ds4psy packages, as well as revisit some data used in Tibbles (Chapter 5).

7.1.7 Getting ready

This chapter formerly assumed that you have read and worked through Chapter 12: Tidy data of the r4ds textbook (Wickham & Grolemund, 2017). It now can be read by itself, but reading Chapter 12 of r4ds is still recommended.

Please do the following to get started:

  • Create an R Markdown (.Rmd) document (for instructions, see Appendix F and the templates linked in Section F.2).

  • Structure your document by inserting headings and empty lines between different parts. Here’s an example of how your initial file could look:

---
title: "Chapter 7: Tidying data"
author: "Your name"
date: "2024 October 20"
output: html_document
---

Add text or code chunks here.

# Exercises (07: Tidying data)
  
## Exercise 1
  
## Exercise 2
  
etc. 
  
<!-- The end (eof). -->
  • Create an initial code chunk below the header of your .Rmd file that loads the R packages of the tidyverse (and see Section F.3.3 if you want to get rid of the messages and warnings of this chunk in your HTML output).

  • Save your file (e.g., as 07_tidy.Rmd in the R folder of your current project) and remember to save and knit it regularly as you keep adding content to it.

Next, we will consider four essential tidyr commands that help to create and transform tidy data.

References

Wickham, H. (2014b). Tidy data. Journal of Statistical Software, 59(10), 1–23. https://doi.org/10.18637/jss.v059.i10
Wickham, H., & Grolemund, G. (2017). R for data science: Import, tidy, transform, visualize, and model data. Retrieved from http://r4ds.had.co.nz