Chapter 7 Tidying data
Doing data science or statistics may convey the false impression that data naturally comes in rows and columns. Yet when looking at animals, plants, stars, or the interactions between all these entities, we see that nature rarely arranges its elements of interest in neat rectangles. Hence, the rectangular datasets that we are so familiar with are rectangular by design: they are arranged to be easy to read, to analyze, and to transform. But even rectangular datasets can be confusing and messy, and typically require some pre-processing before we can make sense of them.
The notion of tidy data is a fundamental concept of the tidyverse, and the tidyr package (Wickham & Henry, 2020) is a core component of the corresponding set of R packages (Wickham, 2019c). Tidy data is desirable, yet often misunderstood. Essentially, it is a particular shape of data that is easy to transform into other shapes with the tools of the tidyverse. However, it is not necessarily easy for humans to read, and it is often not the shape that a particular statistical analysis requires.
Tidying a dataset typically reshapes data without reducing it (see Section 3.1.1 for the distinction). In Chapter 3 on Transforming data, we encountered some important tools for reshaping data from the dplyr package (Wickham, François, Henry, & Müller, 2021). The tidyr package (Wickham & Henry, 2020) provides additional commands for cleaning up messy data and for turning tidy data into alternative shapes. Although the package is maturing, it is still changing and developing. For instance, its version 1.0.0 appeared only in September 2019 and introduced new functionality that most popular textbooks and tutorials do not cover yet. (We will provide some pointers in Section 7.2 below.)
Before defining the notion of tidy data, we need to realize that the same set of data can be formatted in many different ways. Different formats can be informationally equivalent, yet make the data easier or harder to work with. Thus, formatting data in a way that is simple and straightforward to work with is not just a matter of preference, but immensely useful.
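To make the idea of informationally equivalent formats concrete, the following minimal sketch stores the same measurements in two shapes and converts between them. The data frame and its column names (name, t1, t2, time, score) are made up for illustration and do not stem from any dataset used in this book. The sketch assumes that tidyr version 1.0.0 or later is installed, as it relies on the pivot_longer() and pivot_wider() functions.

```r
library(tidyr)

# Hypothetical data in a "wide" shape:
# one row per person, one column per measurement occasion.
wide <- data.frame(
  name = c("Ann", "Bob", "Cat"),
  t1   = c(10, 20, 30),
  t2   = c(12, 18, 33)
)

# The same data in a "long" shape:
# one row per combination of person and measurement occasion.
long <- pivot_longer(wide,
                     cols = c(t1, t2),
                     names_to = "time",
                     values_to = "score")

# Converting the long shape back into the wide shape:
wide_again <- pivot_wider(long,
                          names_from = time,
                          values_from = score)

# Compare the round-trip result with the original data frame
# (pivot_wider() returns a tibble, so we convert it first):
all.equal(as.data.frame(wide_again), wide)
```

Both shapes contain the same nine values (three names and six scores), but arrange them in 3 rows and 3 columns versus 6 rows and 3 columns. If the round trip preserves all information, the final comparison should evaluate to TRUE. Which of the two shapes is more convenient depends on what we want to do with the data next.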