Transforming data can imply two types of goals and tasks:
the selection, extraction, and aggregation of data (discover and show new aspects of/insights in data);
the process of re-shaping the form of data (present the same data in different shapes).
Both types of tasks operate on tables. Working with tables (as inputs to functions) typically yields new tables (as outputs of functions), so that transformations are naturally written as chains of function calls. This is where the pipe operator from the magrittr package (Bache & Wickham, 2014) comes into play. When re-shaping data, a particular form of data table can be described as tidy data.
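To illustrate how tables flow through functions, the following minimal sketch contrasts a nested call with an equivalent piped call. The data frame `df` and its values are hypothetical (invented for illustration only), and the example assumes that the magrittr package is installed:

```r
library(magrittr)  # provides the pipe operator %>%

# A small hypothetical table (values invented for illustration only):
df <- data.frame(
  group = c("a", "a", "b", "b"),
  value = c(1, 2, 3, 4)
)

# Nested call: must be read from the inside out.
nested <- head(subset(df, value > 1), 2)

# Piped call: the same steps, read from left to right.
# Each step receives a table as input and passes a table on as output.
piped <- df %>%
  subset(value > 1) %>%
  head(2)

# Both versions should yield the same table:
identical(nested, piped)
```

The piped version expresses the same computation, but lists the steps in the order in which they are carried out.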
5.1.1 Reflection: Same or different data?
Suppose we were interested in the number of tuberculosis (TB) cases documented by the World Health Organization for three countries (e.g., Afghanistan, Brazil, and China) in two years (e.g., 1999 and 2000).
Our first insight should be that absolute numbers of cases are good to know, but difficult to interpret by themselves. Given that China is much larger than Afghanistan, we should expect higher numbers for most diseases that can occur in either country. To put the number of cases into perspective, we should therefore also know each country’s population. Our data thus contains 2 variables (TB cases and population) for 3 countries and 2 different time points (i.e., a total of \(2 \cdot 3 \cdot 2 = 12\) numeric data points).
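One way of putting case counts into perspective is to relate them to population size. The following sketch assumes that the tidyr package is installed and uses its `table1` data. The column name `rate` and the scaling factor of 10,000 are arbitrary choices made for this illustration:

```r
# tidyr::table1 contains the variables country, year, cases, and population:
t1 <- tidyr::table1

# TB cases per 10,000 people (the scaling factor is an arbitrary choice):
t1$rate <- t1$cases / t1$population * 10000

# Inspect the relative numbers of cases:
t1[, c("country", "year", "rate")]

# Number of numeric data points: 2 variables x 3 countries x 2 years
n_points <- 2 * length(unique(t1$country)) * length(unique(t1$year))
n_points
```

Unlike the absolute numbers of cases, the resulting rates can be compared between countries of very different sizes.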
Suppose we had these 12 numbers: How would we organize them into a table? A second insight is that there are many different ways in which we could present the same data. Compare and contrast the following data tables:
t1 <- tidyr::table1
dim(t1)
#> [1] 6 4
knitr::kable(t1, caption = "The data of `tidyr::table1`.")

t2 <- tidyr::table2
dim(t2)
#> [1] 12 4
knitr::kable(t2, caption = "The data of `tidyr::table2`.")

t3 <- ds4psy::table7
dim(t3)
#> [1] 6 1
knitr::kable(t3, caption = "The data of `ds4psy::table7`.")

t4 <- ds4psy::table8
dim(t4)
#> [1] 3 5
knitr::kable(t4, caption = "The data of `ds4psy::table8`.")
Before reading on, let’s compare the tables and answer the following questions:
In which sense are these tables different? (Note their dimensions, variables, etc.)
In which sense do all these tables represent the same data?
Is any of the tables better or worse than the others?
The data contains values associated with four variables (country, year, cases, and population), but each table organizes the values in a different layout. Describing these layouts involves noting their varying numbers of rows and columns (i.e., vectors or tables of different shapes) and perhaps using terms like implicit or explicit. Importantly, the tables are all informationally equivalent, in the sense that they can be transformed into each other without gaining or losing information.
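The claim that the tables can be transformed into each other can be checked for two of them. The following sketch assumes tidyr (version 1.0.0 or later, which provides `pivot_wider()` and `pivot_longer()`) and converts between `table1` and `table2`. The object names `t2_wide` and `t1_long` are chosen for this illustration:

```r
library(tidyr)

# table2 stores cases and population in a key-value form (12 rows).
# pivot_wider() spreads the values of count into one column per type:
t2_wide <- pivot_wider(table2, names_from = type, values_from = count)
dim(t2_wide)

# Compare the result with table1 (ignoring the tibble class):
all.equal(as.data.frame(t2_wide), as.data.frame(table1))

# The reverse direction: pivot_longer() gathers the columns
# cases and population of table1 into type-count pairs:
t1_long <- pivot_longer(table1,
                        cols = c(cases, population),
                        names_to = "type",
                        values_to = "count")
dim(t1_long)
```

As both directions only rearrange existing values, no information is gained or lost by these transformations.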
Which actions or operations would we need to perform to transform any table into one of the others? By the end of this chapter, we will be able to do all this by using the tools provided by dplyr and tidyr. In addition, we will have acquired some new terminology for describing tables (using labels like “longer” or “wider,” or “messy” or “tidy”) and a pipe operator that allows creating chains of commands.