5.1 Introduction

Transforming data can involve two types of goals and tasks:

  1. the selection, extraction, and aggregation of data (discover and show new aspects of/insights in data);

  2. the process of re-shaping the form of data (present the same data in different shapes).

Both types of tasks operate on tables. Functions that take a table as input typically yield a new table as output, which can in turn serve as input to the next function. This is where the pipe operator from the magrittr package (Bache & Wickham, 2014) comes into play, as it allows passing the output of one function directly on to the next. When re-shaping data, one particular form of data table is described as tidy data.
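To make the role of the pipe concrete, the following minimal sketch contrasts a nested function call with an equivalent pipe chain. It assumes that the magrittr and tidyr packages are installed. The use of `tidyr::table1` as example data and the object names `tb`, `n_nested`, and `n_piped` are our own choices for illustration.

```r
library(magrittr)  # provides the pipe operator %>%

tb <- tidyr::table1  # example table (assumes the tidyr package is installed)

# Nested call: must be read from the inside out.
n_nested <- nrow(subset(tb, year == 1999))

# Pipe chain: each result is passed on as the first argument of the next function.
n_piped <- tb %>%
  subset(year == 1999) %>%
  nrow()

# Both versions count the same rows.
identical(n_nested, n_piped)
```

Both versions perform the same steps, but the pipe chain lists them in the order in which they are carried out.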

5.1.1 Reflection: Same or different data?

Suppose we were interested in the number of tuberculosis (TB) cases documented by the World Health Organization for three countries (e.g., Afghanistan, Brazil, and China) in two years (e.g., 1999 and 2000).

Our first insight should be that absolute numbers of cases are good to know, but difficult to interpret by themselves. Given that China is much larger than Afghanistan, we should expect higher numbers for most diseases that can occur in either country. Thus, we should also know each country’s population to put the number of cases into perspective. Our data therefore contains 2 variables (TB cases and population) for 3 countries and 2 time points (i.e., a total of \(2 \cdot 3 \cdot 2 = 12\) numeric data points).
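The count of data points and the idea of putting cases into perspective can both be checked with a few lines of base R. This is a minimal sketch that assumes the tidyr package is installed and uses its `table1` data. Expressing the rate per 10,000 people is our own choice of scale, and the names `tb`, `n_points`, and `rate` are hypothetical.

```r
tb <- tidyr::table1  # 6 rows: 3 countries x 2 years

# Number of numeric data points: 2 variables x 3 countries x 2 years
n_points <- 2 * length(unique(tb$country)) * length(unique(tb$year))
n_points  # 12

# Cases relative to population (per 10,000 people; the scale is an assumption)
tb$rate <- tb$cases / tb$population * 10000
tb[, c("country", "year", "rate")]
```

Comparing rates rather than absolute counts makes countries of very different sizes comparable.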

Suppose we had these 12 numbers: How would we organize them into a table? A second insight is that there are many different ways in which we could present the same data. Compare and contrast the following data tables:

t1 <- tidyr::table1
dim(t1)
#> [1] 6 4
knitr::kable(t1, caption = "The data of `tidyr::table1`.")
Table 5.1: The data of tidyr::table1.
country year cases population
Afghanistan 1999 745 19987071
Afghanistan 2000 2666 20595360
Brazil 1999 37737 172006362
Brazil 2000 80488 174504898
China 1999 212258 1272915272
China 2000 213766 1280428583
t2 <- tidyr::table2
dim(t2)
#> [1] 12  4
knitr::kable(t2, caption = "The data of `tidyr::table2`.")
Table 5.2: The data of tidyr::table2.
country year type count
Afghanistan 1999 cases 745
Afghanistan 1999 population 19987071
Afghanistan 2000 cases 2666
Afghanistan 2000 population 20595360
Brazil 1999 cases 37737
Brazil 1999 population 172006362
Brazil 2000 cases 80488
Brazil 2000 population 174504898
China 1999 cases 212258
China 1999 population 1272915272
China 2000 cases 213766
China 2000 population 1280428583
t3 <- ds4psy::table7
dim(t3)
#> [1] 6 1
knitr::kable(t3, caption = "The data of `ds4psy::table7`.")
Table 5.3: The data of ds4psy::table7.
t4 <- ds4psy::table8
dim(t4)
#> [1] 3 5
knitr::kable(t4, caption = "The data of `ds4psy::table8`.")
Table 5.4: The data of ds4psy::table8.
country cases_1999 cases_2000 popu_1999 popu_2000
Afghanistan 745 2666 19987071 20595360
Brazil 37737 80488 172006362 174504898
China 212258 213766 1272915272 1280428583

Before reading on, let’s compare the tables and answer the following questions:

  • In which sense are these tables different? (Note their dimensions, variables, etc.)

  • In which sense do all these tables represent the same data?

  • Is any of the tables better or worse than the others?

The data contains values associated with four variables (i.e., country, year, cases, and population), but each table organizes the values in a different layout. Describing these layouts involves noting different numbers of rows and columns (i.e., tables of different shapes) and perhaps using terms like implicit or explicit. Importantly, although the tables differ in shape, they are informationally equivalent in the sense that they can be transformed into each other without gaining or losing information.

Which actions or operations would we need to perform to transform any table into one of the others? By the end of this chapter, we will be able to do all this using the tools provided by dplyr and tidyr. In addition, we will have acquired some new terminology for describing tables (using labels like “longer” or “wider,” or “messy” or “tidy”) and a pipe operator that allows creating chains of commands.
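As a preview of what such a transformation can look like, the following sketch reshapes `tidyr::table1` into a longer layout with the columns of Table 5.2 and back again, and then into a wider layout with one row per country, similar in shape to Table 5.4. It assumes that tidyr (version 1.0.0 or later, which provides `pivot_longer()` and `pivot_wider()`) is installed. The object names are hypothetical, and the wide column names produced here (e.g., `population_1999`) differ from the `popu_` names in Table 5.4.

```r
library(tidyr)

t1 <- tidyr::table1  # 6 x 4: country, year, cases, population

# Longer: stack cases and population into type/count pairs (layout of Table 5.2)
t_long <- pivot_longer(t1, cols = c(cases, population),
                       names_to = "type", values_to = "count")
dim(t_long)  # 12 rows, 4 columns

# Back to the original layout: a round trip without gaining or losing information
t_back <- pivot_wider(t_long, names_from = type, values_from = count)
all.equal(as.data.frame(t_back), as.data.frame(t1))

# Wider: one row per country, one column per combination of variable and year
t_wide <- pivot_wider(t1, names_from = year, values_from = c(cases, population))
dim(t_wide)  # 3 rows, 5 columns
```

The round trip from the original layout to the longer layout and back is one way of checking that two layouts are informationally equivalent.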

5.1.2 Key concepts

Key concepts of Sections 5.2 and 5.3 include the pipe operator (from the magrittr package) and functions to filter rows vs. select columns, mutate variables, and group and summarize data (from the dplyr package).
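The following minimal sketch previews how these concepts fit together. It assumes that the dplyr and tidyr packages are installed and uses `tidyr::table1` as example data. The names `by_country`, `pop_millions`, `total_cases`, and `mean_pop_millions` are introduced purely for illustration.

```r
library(dplyr)  # also makes the pipe operator %>% available

# filter() keeps rows that meet a condition; select() keeps columns by name
tidyr::table1 %>%
  filter(year == 2000) %>%
  select(country, cases)

# mutate() adds a variable; group_by() and summarise() aggregate per group
by_country <- tidyr::table1 %>%
  mutate(pop_millions = population / 1e6) %>%
  group_by(country) %>%
  summarise(total_cases = sum(cases),
            mean_pop_millions = mean(pop_millions))

by_country  # one row per country
```

Each step takes a table and returns a table, which is what makes it possible to chain the steps with the pipe.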