Chapter 7 Tidying data
Doing data science or statistics may convey the false impression that data naturally comes in rows and columns. Yet when looking at animals, plants, stars, or the interactions between all these entities, we see that nature rarely arranges its elements of interest in neat rectangles. Hence, the rectangular datasets that we are so familiar with are rectangular by design: they are arranged that way to make them easy to read, to analyze, and to transform. But even rectangular datasets can be confusing and messy, and typically require some pre-processing before we can make sense of them.
The notion of tidy data is a fundamental concept of the tidyverse, and the tidyr package (Wickham & Henry, 2020) is a core component of the corresponding set of R packages (Wickham, 2019b). Tidy data is desirable, yet often misunderstood. Essentially, it is a particular shape of data that is easy to transform into other shapes, given the tools of the tidyverse. However, it is not necessarily easy for humans to read, and it is often not the shape that a particular statistical analysis requires.
We have encountered some important tools for re-shaping data in our chapter on transforming data and the dplyr package (Wickham et al., 2020b). The tidyr package (Wickham & Henry, 2020) provides additional commands for turning messy data into tidy data and for getting tidy data into alternative shapes. Although the package is maturing, it is still changing and developing. For instance, its version 1.0.0 appeared only in September 2019 and introduced new functionality that most popular textbooks and tutorials do not yet cover. (We will provide some pointers in Section 7.2 below.)
Before defining the notion of tidy data, we need to realize that the same set of data can be formatted in many different ways. Different formats can be informationally equivalent, yet make the data easier or harder to work with. Thus, choosing a format that is simple and straightforward to work with is not just a matter of preference, but immensely useful.
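To make the idea of informationally equivalent formats concrete, here is a minimal sketch in R. The toy table, its column names, and its values are hypothetical and serve only as illustration. The sketch assumes a tidyr version of 1.0.0 or later, in which the functions pivot_longer() and pivot_wider() are available.

```r
library(tidyr)

# Hypothetical toy data in a "wide" format:
# one row per person, one column per year.
wide <- tibble::tibble(
  name   = c("Ann", "Bob"),
  `2019` = c(10, 20),
  `2020` = c(12, 18)
)

# The same information in a "long" format:
# one row per combination of person and year.
long <- pivot_longer(wide,
                     cols = c(`2019`, `2020`),
                     names_to = "year",
                     values_to = "score")

# Converting the long table back into a wide one.
wide_again <- pivot_wider(long,
                          names_from = year,
                          values_from = score)

# Check whether the round trip preserved the original table.
all.equal(as.data.frame(wide), as.data.frame(wide_again))
```

Both tables contain the same four scores, but they suit different tasks: the wide table is compact and easy to scan by eye, whereas the long table stores the year as an explicit variable, which many tidyverse tools expect.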
Wickham, H. (2019b). tidyverse: Easily install and load the 'tidyverse'. Retrieved from https://CRAN.R-project.org/package=tidyverse
Wickham, H., & Henry, L. (2020). tidyr: Easily tidy data with 'spread()' and 'gather()' functions. Retrieved from https://CRAN.R-project.org/package=tidyr
Wickham, H., François, R., Henry, L., & Müller, K. (2020b). dplyr: A grammar of data manipulation. Retrieved from https://CRAN.R-project.org/package=dplyr