7.3 Summary
This chapter revolved around the fact that any non-trivial dataset can be formatted in a variety of ways and introduced the concept of tidy data (defined in Section 7.1.4), which is a fundamental notion of the tidyverse.
The tidyr package (Wickham & Girlich, 2024) contains essential commands that allow us to separate (or unite) the values of variables and to gather (or spread) values by changing a table from wide to long format (and vice versa).
After working through this chapter, you are now able to:
- describe and organize the layout of data tables;
- define the notion of tidy data; and
- use tidyr commands to:
  - separate one variable into the values of two variables;
  - unite the values of two variables into one variable;
  - gather values distributed over multiple columns into one variable;
  - spread the values of a variable over multiple columns.
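As a brief reminder, the four commands can be sketched on a small example. The data frame `wide` and its column names are hypothetical and serve only as an illustration; they are not taken from the chapter's datasets.

```r
library(tidyr)

# Hypothetical wide table: one row per person, one column per year
wide <- data.frame(
  name   = c("Ann Lee", "Bob Ray"),
  `2019` = c(10, 20),
  `2020` = c(12, 18),
  check.names = FALSE  # keep the numeric column names as they are
)

# separate: split the values of one variable into two variables
sep <- separate(wide, col = name, into = c("first", "last"), sep = " ")

# unite: combine the values of two variables into one variable
uni <- unite(sep, col = "name", first, last, sep = " ")

# gather: collect values distributed over several columns into one variable
long <- gather(uni, key = "year", value = "score", `2019`, `2020`)

# spread: distribute the values of a variable over multiple columns
wide_again <- spread(long, key = year, value = score)
```

Here, `long` contains one row per combination of person and year, whereas `wide_again` returns to the original layout with one column per year.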
A limitation of the tidyr commands covered so far is that they deal with only one dependent variable at a time. However, combining several commands in a pipe can overcome this constraint (see Section 7.2.6 for examples).
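One way of handling two dependent variables is to gather everything, separate the former column names, and spread the result again. The following is a minimal sketch of this idea, assuming a hypothetical table `wide` whose column names follow the pattern `measure_time`; it is not necessarily the same solution as the one shown in Section 7.2.6.

```r
library(tidyr)
library(magrittr)  # provides the pipe operator %>%

# Hypothetical wide table: two dependent variables (height, weight),
# each measured at two time points (t1, t2)
wide <- data.frame(
  id        = c(1, 2),
  height_t1 = c(150, 160),
  height_t2 = c(152, 163),
  weight_t1 = c(50, 60),
  weight_t2 = c(51, 62)
)

tidy <- wide %>%
  # 1. gather all measurement columns into key-value pairs
  gather(key = "key", value = "value", -id) %>%
  # 2. separate the former column names into measure and time
  separate(col = key, into = c("measure", "time"), sep = "_") %>%
  # 3. spread the measures, yielding one column per dependent variable
  spread(key = measure, value = value)
```

The resulting `tidy` table contains one row per combination of `id` and `time`, with `height` and `weight` as separate columns.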
The Posit cheatsheet on reshaping data with the tidyr package provides an overview of the tidyr commands you are now familiar with and lets you discover some additional ones.
Overall, the topic of data wrangling is still under active development.
The tidyr package discussed in this chapter replaced earlier packages, specifically reshape (2005–2010) and reshape2 (2010–2014), and is itself still changing.
Thus, the gather() and spread() commands of tidyr are first steps, rather than the ultimate solution to data wrangling.
In 2019, the tidyr package was complemented by the pivot_longer() and pivot_wider() functions, as well as some unnest_ functions for taming deeply nested lists (like XML data files). See the vignettes on Pivoting and Rectangling for details on these developments.
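To give a first impression of the newer functions, the following sketch reshapes a small table with pivot_longer() and pivot_wider(). The data frame `wide` and its column names are hypothetical and used only for illustration.

```r
library(tidyr)

# Hypothetical wide table: one row per person, one column per year
wide <- data.frame(
  name   = c("Ann", "Bob"),
  `2019` = c(10, 20),
  `2020` = c(12, 18),
  check.names = FALSE  # keep the numeric column names as they are
)

# Wide to long: column names become values of year, cell values become score
long <- pivot_longer(wide, cols = c(`2019`, `2020`),
                     names_to = "year", values_to = "score")

# Long to wide: values of year become column names again
wide_again <- pivot_wider(long, names_from = year, values_from = score)
```

In this simple case, pivot_longer() plays the role of gather() and pivot_wider() the role of spread(), but with argument names that state more explicitly where names and values come from or go to.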
But rather than waiting for further updates, we should trust that our general insights will still be valuable in the future and test our skills in tidying data by completing the following exercises.