7.1 Introduction

Tidy datasets are all alike
but every messy dataset is messy in its own way.

  Hadley Wickham (2014b, p. 2)

The notion of tidy data lies at the core of the tidyverse. Importantly, any dataset can be organized in a variety of formats. Although different formats can all contain the same data, they differ in how easy or hard they are to work with (e.g., for specific analyses or plots). Tidy data is formatted in a simple and straightforward way, which makes it immensely practical (e.g., easy to understand, analyze, and transform into other formats). Before defining the notion of tidy data, we first compare and contrast different sets of rectangular data that contain the same information.

7.1.1 Objectives

After working through this chapter, you will be able to:

  1. describe and organize the layout of data tables;
  2. define the notion of tidy data; and use tidyr commands to:
  3. separate one variable into the values of two variables;
  4. unite the values of two variables into one variable;
  5. gather values distributed over multiple columns into one variable;
  6. spread the values of a variable over multiple columns.

7.1.2 Varieties of tabular data

In R, rectangular data is typically organized in data frames or tibbles (see Section 1.5 and Chapter 5). Importantly, each column is a vector (of a particular type) that contains the values of a variable. Thus, whereas every column must be of a single type, every row (a.k.a. a “case” or “observation”) can contain the values of different variables and types.

The same set of data (i.e., the values of variables) can be organized in many different ways. For instance, consider the following bar plot, which shows the absolute number of TB cases documented by the World Health Organization in 3 countries (Afghanistan, Brazil, and China) in 2 years (1999 and 2000):

How can these values be represented in a table? Although the data in this example is almost trivial, there are many possible ways to arrange the data in tabular form. Perhaps the most straightforward way of representing this data in a table is the following:

Table 7.1: table4a: TB cases per country and year.
country 1999 2000
Afghanistan 745 2666
Brazil 37737 80488
China 212258 213766

The figure and table directly correspond to each other. However, when trying to use ggplot2 for re-creating the figure from the data in table4a, we realize that this is difficult.

Practice

  • Try recreating the above bar plot using ggplot2 with tidyr::table4a as data. Why is this difficult?

  • Describe a dataset that would make it easier to create this plot.

Answer

Using ggplot2 (and many other R functions) requires that we specify the independent and dependent variables. Here, the independent variables are country (with 3 values) and year (with 2 values). However, table4a does not contain a variable for year. Instead, each value of year (1999 and 2000) appears as the name of a separate column, and the case counts are spread over these 2 columns. Thus, the data contains a variable (year) that is not represented as a column.

A dataset that makes it easier to create the plot would look like this:

Table 7.2: TB cases per country and year (in tidy format).
country year cases
Afghanistan 1999 745
Afghanistan 2000 2666
Brazil 1999 37737
Brazil 2000 80488
China 1999 212258
China 2000 213766

This data contains 6 observations (cases) and 3 variables (columns): 2 independent variables (country and year) and 1 dependent variable (cases).
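As a minimal sketch of how such a table supports plotting, the following code enters the values of Table 7.2 by hand and maps each variable to one aesthetic of a bar plot. The object names tb_tidy and tb_plot are chosen here for illustration, and the colors and labels of the original figure are not reproduced:

```r
library(tibble)
library(ggplot2)

# Table 7.2, entered by hand (values copied from the table above):
tb_tidy <- tribble(
  ~country,      ~year, ~cases,
  "Afghanistan",  1999,    745,
  "Afghanistan",  2000,   2666,
  "Brazil",       1999,  37737,
  "Brazil",       2000,  80488,
  "China",        1999, 212258,
  "China",        2000, 213766
)

# Both independent variables are columns, so each can be mapped to an aesthetic:
tb_plot <- ggplot(tb_tidy, aes(x = country, y = cases, fill = factor(year))) +
  geom_col(position = "dodge") +  # bars of the 2 years side by side
  labs(x = "Country", y = "TB cases", fill = "Year")

tb_plot
```

Because year is a column here, it can be passed to fill directly. With table4a, there is no such column to refer to.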

7.1.3 One table in many formats

Let’s consider a slightly more complicated dataset that adds the population of each country as another variable:

Table 7.3: table1: Tidy data.
country year cases population
Afghanistan 1999 745 19987071
Afghanistan 2000 2666 20595360
Brazil 1999 37737 172006362
Brazil 2000 80488 174504898
China 1999 212258 1272915272
China 2000 213766 1280428583

This data – available as tidyr::table1 – contains 6 observations (cases) and 4 variables (columns): 2 independent variables (country and year) and 2 dependent variables (cases and population). (See ?tidyr::table1 for a description and source information.) The following tables (or tibbles) all provide the same data in different formats:

  • The data in table2 still contains our 2 dependent variables (counts of TB cases and population), but combines them in 1 variable (count). To signal which variable is described by count, the data contains a new variable type that characterizes the count variable:
Table 7.4: table2: count contains 2 DVs.
country year type count
Afghanistan 1999 cases 745
Afghanistan 1999 population 19987071
Afghanistan 2000 cases 2666
Afghanistan 2000 population 20595360
Brazil 1999 cases 37737
Brazil 1999 population 172006362
Brazil 2000 cases 80488
Brazil 2000 population 174504898
China 1999 cases 212258
China 1999 population 1272915272
China 2000 cases 213766
China 2000 population 1280428583
  • The data in table3 still contains the same information as table1 and table2, but collapses the information previously contained in cases and population into 1 variable rate. The new rate variable is represented as a ratio, which actually contains 2 pieces of information (numerator and denominator):
Table 7.5: table3: rate contains 2 DVs.
country year rate
Afghanistan 1999 745/19987071
Afghanistan 2000 2666/20595360
Brazil 1999 37737/172006362
Brazil 2000 80488/174504898
China 1999 212258/1272915272
China 2000 213766/1280428583
  • The data in table4a and table4b split the information into 2 tables: table4a contains the values of TB cases and table4b the counts of the population. However, each sub-table splits the year variable into 2 separate variables:
Table 7.6: table4a: 1999 and 2000 contain same DV (cases).
country 1999 2000
Afghanistan 745 2666
Brazil 37737 80488
China 212258 213766
Table 7.6: table4b: 1999 and 2000 contain same DV (population).
country 1999 2000
Afghanistan 19987071 20595360
Brazil 172006362 174504898
China 1272915272 1280428583
  • The data in table5 is similar to table3 in containing the rate variable (which really consists of 2 variables), but additionally splits up the year information into 2 variables (noting the century and a 2-digit year):
Table 7.7: table5: IV year in 2 columns, DV rate contains 2 DVs.
country century year rate
Afghanistan 19 99 745/19987071
Afghanistan 20 00 2666/20595360
Brazil 19 99 37737/172006362
Brazil 20 00 80488/174504898
China 19 99 212258/1272915272
China 20 00 213766/1280428583

Thus, all these tables contain the same information, but differ in their layout or format. Theoretically, all these tables are equal, but some are more equal — or rather more practical — than others.
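One way to check these claims is to compare the shapes of the tables and to look for the same values in different places. The following sketch assumes that the tidyr package is installed. The names tbs, cases_1, cases_2, and cases_4a are chosen here for illustration:

```r
library(tidyr)

# Collect the example tables of the tidyr package in a named list:
tbs <- list(table1 = table1, table2 = table2, table3 = table3,
            table4a = table4a, table4b = table4b, table5 = table5)

# Number of rows and columns of each table:
sapply(tbs, dim)

# The TB case counts of table1 reappear in table2 and in table4a:
cases_1  <- table1$cases
cases_2  <- table2$count[table2$type == "cases"]
cases_4a <- c(rbind(table4a$`1999`, table4a$`2000`))  # interleave the 2 years per country

all(cases_1 == cases_2)
all(cases_1 == cases_4a)
```

The tables differ in their numbers of rows and columns, yet the same 6 case counts can be recovered from each of the layouts checked here.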

Practice

  • Recreate the above bar plot using ggplot2 with tidyr::table1 as data.

7.1.4 Defining tidy data

In the previous section, we compared many tables that contain the same data. The key motivation for tidy data is that some formats are easier to work with than others.

Definition: A tidy dataset conforms to 3 interrelated rules:

  1. Each variable has its own column.

  2. Each case/observation has its own row.

  3. Each value has its own cell.

See the paper at https://www.jstatsoft.org/article/view/v059i10 (Wickham, 2014b) for the background of tidy data and http://r4ds.had.co.nz/tidy-data.html#fig:tidy-structure for a graphical illustration of these rules.

The 3 rules defining tidy data are connected, as it is impossible to only satisfy 2 of the 3 rules. This leads to a simpler set of 2 practical instructions for tidying a messy set of data:

A. Turn each dataset into a tibble.

B. Put each variable into a column.

Assuming that you are dealing with a single tibble, the key instruction to remember is B: Put each variable into a column. To achieve this, you primarily need to define what should be considered as an observation and as a variable. For instance, measuring the IQ of \(N=100\) students twice (e.g., at the age of 10 and the age of 20) could be represented in a table that contains 100 rows (1 for each student) and 2 separate variables for the IQ values (e.g., 2 columns for iq_10 and iq_20). However, it would seem tidier to represent the 2 IQ measurements as 1 variable (iq) per person and qualify the 2 measurement occasions by a key variable (age, with 2 possible values 10 and 20). This table would contain 2 rows per person (i.e., 200 observations overall).
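The IQ example can be sketched in a few lines. The data below is invented for illustration (3 students rather than 100), and the names iq_wide, iq_long, and id are assumptions. The sketch uses tidyr::gather(), which newer versions of tidyr mark as superseded by pivot_longer() but still provide:

```r
library(tidyr)

# Hypothetical data (invented values): IQ scores of 3 students at 2 ages.
iq_wide <- data.frame(
  id    = 1:3,
  iq_10 = c(100, 95, 110),
  iq_20 = c(105, 98, 108)
)

# Gather the 2 IQ columns into a key variable (age) and a value variable (iq):
iq_long <- gather(iq_wide, key = "age", value = "iq", iq_10, iq_20)

# Turn the former column names ("iq_10", "iq_20") into numeric ages (10, 20):
iq_long$age <- as.integer(sub("iq_", "", iq_long$age))

iq_long  # 2 rows per student: 3 x 2 = 6 observations
```

Whether iq_wide or iq_long counts as tidy depends on what we regard as an observation: a student (1 row per person) or a measurement (1 row per person and age).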

Practice

  • Which of the data tables in the above example (table1 to table5) are tidy? Why or why not?

In our previous examples (in Section 7.1.3), only the data of table1 was tidy, while the data in table2 to table5 all were messy in some way or other.
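To illustrate why table3 is messy, note that each cell of its rate variable holds 2 values, which violates rule 3. As a minimal sketch (assuming the tidyr package is installed; the name table3_tidy is chosen here for illustration), splitting rate at the slash recovers the columns of table1:

```r
library(tidyr)

# Split the character variable rate into 2 numeric variables:
table3_tidy <- separate(table3, col = rate,
                        into = c("cases", "population"),
                        sep = "/", convert = TRUE)

table3_tidy

# The separated columns contain the same values as table1:
all(table3_tidy$cases == table1$cases)
all(table3_tidy$population == table1$population)
```

Setting convert = TRUE asks separate() to convert the resulting character strings into numbers.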

7.1.5 Advantages of tidy data

From a theoretical viewpoint, being in a tidy format is not inherently better or worse than any of the other data formats. However, just like not all plots are equally suited to make a particular point, not all data formats are equally suited to be analyzed and transformed. As tools (like functions and packages) require data to be in a particular format, they can only be applied if the data format fits the requirements of the tool. Overall, tidy data has the following advantages:

  1. Consistency: Consistent data structures make it easier to learn the tools that work with them because they have an underlying uniformity.

  2. Vectorization: Placing variables in columns allows R’s vectorised nature to shine. For instance, the basic dplyr verbs (and most built-in R functions) work with vectors of values. That makes transforming tidy data easy and natural.

  3. Matching data and tools: The tidyverse packages — like dplyr, ggplot2, and many others — are designed to work with tidy data.
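The vectorization advantage (point 2) can be illustrated with table1. As a sketch that assumes the dplyr and tidyr packages are installed, the following code computes the number of TB cases per 10,000 people for all rows at once. The name table1_rate and the scaling factor of 10,000 are choices made here for illustration:

```r
library(dplyr)
library(tidyr)  # provides table1

# cases and population are column vectors, so the division
# is carried out element by element for all 6 rows:
table1_rate <- mutate(table1, rate = cases / population * 10000)

table1_rate
```

The same computation would be more cumbersome with table2 or table3, where cases and population are not available as separate columns.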

The key advantage of tidy data is that it is typically easy to work with. This does not mean that we never have to change its format. However, given the tools provided by R and the tidyverse, tidy data can easily be transformed into other formats.

Note some common misconceptions: Although tidy data tends to be a good thing, it is typically not the most compact and not necessarily the most human-readable version of a dataset. Similarly, many graphical or statistical methods require data in shapes that are not tidy (e.g., some analyses of repeated measurements expect one column per measurement occasion, i.e., a so-called wide format). Again, tidy data is not an end in itself, but often a means for easily transforming data into alternative shapes. Finally, it can be difficult to decide which datasets are considered to be tidy: We usually need to interpret the semantics of the rows and columns (i.e., understand the meanings of observations and variables) to determine the tidiness of a dataset.

7.1.6 Data used

In this chapter, we will first use some variants of a simple example dataset (i.e., table1 to table5 of the tidyr package, and table6 to table8 of the ds4psy package). However, we will also use other datasets from the dplyr and ds4psy packages, as well as revisit some data used in Tibbles (Chapter 5).

7.1.7 Getting ready

This chapter formerly assumed that you have read and worked through Chapter 12: Tidy data of the r4ds textbook (Wickham & Grolemund, 2017). It now can be read by itself, but reading Chapter 12 of r4ds is still recommended.

Please do the following to get started:

  • Create an R Markdown (.Rmd) document (for instructions, see Appendix F and the templates linked in Section F.2).

  • Structure your document by inserting headings and empty lines between different parts. Here’s an example how your initial file could look:

  • Create an initial code chunk below the header of your .Rmd file that loads the R packages of the tidyverse (and see Section F.3.3 if you want to get rid of the messages and warnings of this chunk in your HTML output).

  • Save your file (e.g., as 07_tidy.Rmd in the R folder of your current project) and remember to save and knit it regularly as you keep adding content to it.

Next, we will consider 4 essential tidyr commands that help to create and transform tidy data.

References

Wickham, H. (2014b). Tidy data. Journal of Statistical Software, 59(10), 1–23. https://doi.org/10.18637/jss.v059.i10

Wickham, H., & Grolemund, G. (2017). R for data science: Import, tidy, transform, visualize, and model data. Retrieved from http://r4ds.had.co.nz