Data transformations and the corresponding operators can be classified into two general types:
Reducing operations are data transformations that typically modify both the shape and content of data. Examples of data reductions include sampling values from a population, describing a set of values by some measure (e.g., its mean or median), or computing some summary measure out of a set of values (e.g., to learn about a test’s sensitivity or positive predictive value).
Reshaping operations modify the shape or structure of data without changing its contents. When re-shaping data, some shapes are more suitable for further analysis than others. A particular form of tabular data can be described as tidy data. Typical examples for reshaping data include transforming the values of a vector into an array or matrix, or re-arranging the rows or columns of a table.
Thus, whereas reshaping operations can be reversed, reducing data usually goes beyond mere “trans-formation” by being uni-directional: We typically cannot reconstruct the original data from reduced data.
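The difference in reversibility can be illustrated with a minimal base R sketch. The objects `m`, `m_t`, `range_1`, and `range_2` are names chosen here for illustration only:

```r
m <- matrix(1:6, nrow = 2)   # a matrix with 2 rows and 3 columns

# Reshaping: transposing changes the shape, but not the values...
m_t <- t(m)
dim(m_t)

# ...and can be undone by transposing again:
identical(t(m_t), m)

# Reducing: two different vectors yield the same summary, so the
# summary alone cannot tell us which vector we started with:
range_1 <- range(c(1, 3, 6))
range_2 <- range(c(1, 2, 5, 6))
identical(range_1, range_2)
```

Both calls to `identical()` should return `TRUE`: the transposition is fully recoverable, whereas the range no longer distinguishes between the two original vectors.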
Practice: Reshape or reduce?
Assuming the following vectors
v <- 1:9
a <- sample(c("A", "B", "C"), 10, replace = TRUE)
do the following operations reduce or reshape these data structures? Why?
Hint: Evaluate the expressions in order to see their results.
matrix(v, nrow = 3)
mean(v)
data.frame(nr = v, freq = v) %>% group_by(nr) %>% count()
rev(a)
table(a)
a == "A"
5.1.1 Reflection: Same or different data?
Suppose we were interested in the number of tuberculosis (TB) cases documented by the World Health Organization for three countries (e.g., Afghanistan, Brazil, and China) and two years (e.g., 1999 and 2000).
Our first insight should be that absolute numbers of cases are good to know, but difficult to interpret by themselves. Given that China is much larger than Afghanistan, we should expect higher numbers for most diseases that can occur in either country. To put the number of cases into perspective, we should therefore also know each country’s population. Thus, our data contains 2 variables (TB cases and population) for 3 countries and 2 different time points (i.e., a total of \(2 \cdot 3 \cdot 2 = 12\) numeric data points).
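As a quick check of this count, we could enumerate all combinations of variable, country, and year in base R. The data frame `design` and its column names are hypothetical and serve only to illustrate the arithmetic:

```r
# One row for each combination of 2 variables, 3 countries, and 2 years:
design <- expand.grid(
  variable = c("cases", "population"),
  country  = c("Afghanistan", "Brazil", "China"),
  year     = c(1999, 2000)
)

nrow(design)   # number of numeric data points to be stored
```

Each of the 12 rows of `design` corresponds to one number that our data tables need to contain.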
Suppose we had these 12 numbers — but how would we organize them into a table? A second insight is that there are many different ways in which we could present the same data. Compare and contrast the following data tables:
t1 <- tidyr::table1
dim(t1)
#> [1] 6 4
knitr::kable(t1, caption = "The data of `tidyr::table1`.")

t2 <- tidyr::table2
dim(t2)
#> [1] 12 4
knitr::kable(t2, caption = "The data of `tidyr::table2`.")

t3 <- ds4psy::table7
dim(t3)
#> [1] 6 1
knitr::kable(t3, caption = "The data of `ds4psy::table7`.")

t4 <- ds4psy::table8
dim(t4)
#> [1] 3 5
knitr::kable(t4, caption = "The data of `ds4psy::table8`.")
Before reading on, let’s compare the tables and answer the following questions:
In which sense are these tables different? (Note their dimensions, variables, etc.)
In which sense do all these tables represent the same data?
Is any of the tables better or worse than the others?
The data contains values associated with four variables (country, year, cases, and population), but each table organizes the values in a different layout. Describing these layouts involves noting their different numbers of rows and columns (i.e., tables of different shapes) and perhaps using terms like implicit or explicit. Importantly, although the tables are not identical in form, they are informationally equivalent in the sense that they can be transformed into each other without gaining or losing information.
Which actions or operations would we need to perform to transform any table into one of the others? By the end of this chapter, we can do all this by using the tools provided by dplyr and tidyr. In addition, we will have acquired some new terminology for describing tables (using labels like “longer” or “wider,” or “messy” or “tidy”) and a pipe operator that allows creating chains of commands.
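As a brief preview, one such transformation could look as follows. This is only a sketch, assuming a version of tidyr that provides `pivot_wider()` (1.0.0 or later); the name `t2_wide` is chosen here for illustration:

```r
library(tidyr)

t2 <- tidyr::table2   # longer layout: columns country, year, type, count

# Spread the values of `count` into one column per value of `type`:
t2_wide <- pivot_wider(t2, names_from = type, values_from = count)

dim(t2_wide)
names(t2_wide)

# Compare the resulting values with those of table1:
all(t2_wide$cases == table1$cases)
all(t2_wide$population == table1$population)
```

If both comparisons return `TRUE`, the wider version of `table2` contains the same values as `table1`, which is one way of checking that the two layouts carry the same information.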
5.1.2 Key concepts
Key concepts of Sections 5.2 and 5.3 include the pipe operator (from the magrittr package), and functions to filter rows vs. select columns, mutate variables, and group and summarize data (from the dplyr package).
While both types of data transformations can be demonstrated with vectors, we typically operate on entire data tables.
Working with a table (as inputs to functions) typically yields new tables (as outputs of functions).
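The following sketch illustrates this table-in, table-out pattern with three dplyr functions, applied one step at a time. The intermediate names (`t1_2000`, `t1_small`, `t1_rate`) and the new variable `rate` are assumptions made for this example:

```r
library(dplyr)

t1 <- tidyr::table1   # columns: country, year, cases, population

# filter() keeps some rows, select() keeps some columns:
t1_2000  <- filter(t1, year == 2000)
t1_small <- select(t1_2000, country, cases, population)

# mutate() adds a new variable (here: cases per 10,000 people):
t1_rate <- mutate(t1_small, rate = cases / population * 10000)

# The result of each step is again a table:
is.data.frame(t1_rate)
dim(t1_rate)
```

Note that every step requires storing or nesting an intermediate table before the next function can use it.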
This is where the pipe operator %>% from the magrittr package (Bache & Wickham, 2014) comes into play:
It passes (or “pipes”) the result of one operation (i.e., the output of a function) as an input to another operation (i.e., as the first argument of another function).
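A minimal sketch of this idea, first with a numeric vector and then with a table. The object names (`x`, `nested`, `piped`, `by_country`, `total_cases`) are chosen here for illustration:

```r
library(magrittr)   # provides the pipe operator %>%

x <- c(1, 4, 9)

# Nested function calls are read from the inside out:
nested <- sum(sqrt(x))

# The same computation as a pipe is read from left to right:
piped <- x %>% sqrt() %>% sum()

identical(nested, piped)

# With tables, each step receives the table created by the previous step:
by_country <- tidyr::table1 %>%
  dplyr::group_by(country) %>%
  dplyr::summarise(total_cases = sum(cases))

by_country
```

Both versions of the vector computation yield the same result. The pipe merely changes how the sequence of operations is written and read.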