6.1 Introduction

Data transformations and the corresponding operators can be classified into two general types:

1. Reducing operations are data transformations that typically modify both the shape and content of data. Examples of data reductions include sampling values from a population, describing a set of values by some measure (e.g., its mean or median), or computing some summary measure out of a set of values (e.g., to learn about a test’s sensitivity or positive predictive value).

2. Reshaping operations modify the shape or structure of data without changing its contents. Some shapes are more suitable for further analysis than others, and one particularly useful form of tabular data is known as tidy data. Typical examples of reshaping data include transforming the values of a vector into an array or matrix, or rearranging the rows or columns of a table.

Whereas reshaping operations can be reversed, reducing data usually goes beyond mere “trans-formation” by being uni-directional: We typically cannot reconstruct the original data from reduced data.
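The difference in reversibility can be illustrated with a minimal base R sketch. The vectors x and y below are made-up examples (not part of the practice exercise) and serve only to show that a reshaped object still contains all original values, whereas a reduced one does not:

```r
x <- c(2, 4, 6, 8)

# Reshaping: same values, new shape (a 2 x 2 matrix, filled column by column).
m <- matrix(x, nrow = 2)
x_back <- as.vector(m)   # flatten the matrix back into a vector
identical(x_back, x)     # checks whether the original vector is recovered

# Reducing: four values are summarized by a single number.
s <- sum(x)
y <- c(5, 5, 5, 5)       # a different vector with the same sum
sum(y) == s              # the sum alone cannot tell x and y apart
```

As two different vectors can yield the same summary value, the summary does not suffice to reconstruct its input.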

Practice: Reshape or reduce?

Assuming the following vectors v and a:

v <- 1:9
a <- sample(c("A", "B", "C"), 10, replace = TRUE)

do the following operations reduce or reshape these data structures? Why?

Hint: Evaluate the expressions in order to see their results.

matrix(v, nrow = 3)
mean(v)
library(dplyr)  # provides %>%, group_by(), and count()
data.frame(nr = v, freq = v) %>%
  group_by(nr) %>%
  count()

rev(a)
table(a)
a == "A"

6.1.1 Reflection: Same or different data?

Suppose we were interested in the number of tuberculosis (TB) cases documented by the World Health Organization for three countries (e.g., Afghanistan, Brazil, and China) and two years (e.g., 1999 and 2000).

Our first insight should be that absolute numbers of cases are good to know, but difficult to interpret by themselves. Given that China has a much larger population than Afghanistan, we should expect higher numbers for most diseases that can occur in either country. To put the number of cases into perspective, we therefore also need to know each country’s population. Our data thus contains 2 variables (TB cases and population) for 3 countries and 2 different time points (i.e., a total of $2 \cdot 3 \cdot 2 = 12$ numeric data points).
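Putting cases into perspective amounts to relating them to the population. The following sketch assumes that the tidyr package is installed and uses its table1 data (with the columns cases and population). The variable name rate_per_10k and the scaling factor of 10000 are choices made for illustration only:

```r
tb <- tidyr::table1  # TB cases and population by country and year

# Express cases relative to population (as cases per 10000 people):
tb$rate_per_10k <- tb$cases / tb$population * 10000

tb[, c("country", "year", "rate_per_10k")]
```

Comparing such rates is more informative than comparing raw counts of cases across countries of very different sizes.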

Suppose we had these 12 numbers. How would we organize them into a table? A second insight is that there are many different ways in which we could present the same data. Compare and contrast the following data tables:

t1 <- tidyr::table1
dim(t1)
#> [1] 6 4
knitr::kable(t1, caption = "The data of tidyr::table1.")

Table 6.1: The data of tidyr::table1.

country      year  cases   population
Afghanistan  1999  745     19987071
Afghanistan  2000  2666    20595360
Brazil       1999  37737   172006362
Brazil       2000  80488   174504898
China        1999  212258  1272915272
China        2000  213766  1280428583

t2 <- tidyr::table2
dim(t2)
#> [1] 12  4
knitr::kable(t2, caption = "The data of tidyr::table2.")

Table 6.2: The data of tidyr::table2.

country      year  type        count
Afghanistan  1999  cases       745
Afghanistan  1999  population  19987071
Afghanistan  2000  cases       2666
Afghanistan  2000  population  20595360
Brazil       1999  cases       37737
Brazil       1999  population  172006362
Brazil       2000  cases       80488
Brazil       2000  population  174504898
China        1999  cases       212258
China        1999  population  1272915272
China        2000  cases       213766
China        2000  population  1280428583

t3 <- ds4psy::table7
dim(t3)
#> [1] 6 1
knitr::kable(t3, caption = "The data of ds4psy::table7.")

Table 6.3: The data of ds4psy::table7.

where_when_what
:99$745\19987071
:00$2666\20595360
:99$37737\172006362
:00$80488\174504898
:99$212258\1272915272
:00$213766\1280428583

t4 <- ds4psy::table8
dim(t4)
#> [1] 3 5
knitr::kable(t4, caption = "The data of ds4psy::table8.")

Table 6.4: The data of ds4psy::table8.

country      cases_1999  cases_2000  popu_1999   popu_2000
Afghanistan  745         2666        19987071    20595360
Brazil       37737       80488       172006362   174504898
China        212258      213766      1272915272  1280428583

Before reading on, let’s compare the tables and answer the following questions:

• In which sense are these tables different? (Note their dimensions, variables, etc.)

• In which sense do all these tables represent the same data?

• Is any of the tables better or worse than the others?

The data contains values of four variables (country, year, cases, and population), but each table organizes these values in a different layout. Describing these layouts involves noting their different numbers of rows and columns (i.e., tables of different shapes) and perhaps whether a variable is represented explicitly (e.g., as a column of its own) or implicitly (e.g., as part of column names or combined values). Importantly, the tables are informationally equivalent, in the sense that they can be transformed into each other without gaining or losing information.

Which actions or operations would we need to perform to transform any table into one of the others? By the end of this chapter, we will be able to perform such transformations by using the tools provided by dplyr and tidyr. In addition, we will have acquired some new terminology for describing tables (using labels like “longer” or “wider,” or “messy” or “tidy”) and a pipe operator that allows creating chains of commands.
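As a preview of what such a transformation can look like, the following sketch converts the longer layout of table2 into the wider layout of table1 and back again. It assumes a version of tidyr that provides pivot_wider() and pivot_longer() (i.e., tidyr 1.0.0 or later). The object names long, wide, and back are chosen for illustration:

```r
library(tidyr)

long <- tidyr::table2  # one row per country, year, and type of count

# Make the table wider: the values of type become column names.
wide <- pivot_wider(long, names_from = type, values_from = count)
dim(wide)

# Make the table longer again: collect both columns into type and count.
back <- pivot_longer(wide, cols = c(cases, population),
                     names_to = "type", values_to = "count")
dim(back)
```

That the round trip returns a table with the original dimensions and values illustrates the sense in which both layouts are informationally equivalent.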

6.1.2 Key concepts

Key concepts of Sections 6.2 and 6.3 include the pipe operator (from the magrittr package), and functions to filter vs. select, mutate variables, and group and summarize data (from the dplyr package).

While both types of data transformations can be demonstrated with vectors, we typically operate on entire data tables. Working with tables (as inputs to functions) typically yields new tables (as outputs of functions). This is where the pipe operator %>% from the magrittr package comes into play: It passes (or “pipes”) the result of one operation (i.e., the output of a function) as an input to another operation (i.e., as the first argument of another function).
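The following sketch contrasts a nested function call with an equivalent chain of piped commands. It assumes that the dplyr and tidyr packages are installed (loading dplyr also makes the %>% operator available). The data of tidyr::table1 and the variable name rate are used for illustration only:

```r
library(dplyr)  # provides filter(), mutate(), select(), and %>%

tb <- tidyr::table1

# Nested calls are evaluated from the inside out:
nested <- select(mutate(filter(tb, year == 2000),
                        rate = cases / population),
                 country, rate)

# The pipe passes each result on as the first argument of the next function:
piped <- tb %>%
  filter(year == 2000) %>%
  mutate(rate = cases / population) %>%
  select(country, rate)

all.equal(nested, piped)  # compares the two results
```

Both versions perform the same steps, but the piped version lists them in the order in which they are carried out.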

References

Bache, S. M., & Wickham, H. (2014). magrittr: A forward-pipe operator for R. https://CRAN.R-project.org/package=magrittr