3.1 Introduction

Data transformation (aka. data munging or data wrangling) is an umbrella term for everything that happens between collecting data and interpreting a result. Even when categorizing the process of data analysis into general phases (e.g., data screening, data exploration, and hypothesis testing) some aspect of data transformation will occur in each of these phases. Typical tasks of data transformation include arranging, formatting, selecting, changing, and summarizing data.

The dplyr package shall be our light saber for transforming data.

Figure 3.2: The dplyr package shall be our light saber for transforming data.

From a tool-based perspective, the tasks involved in data transformation require a range of functions that allow selecting, reshaping, and re-assembling meaningful parts of data. A metaphor for a collection of such functions is a toolbox that includes different operations for slicing and dicing data (see our Swiss army knife analogy in Figure 1.2 in Section 1.1.3). As we will see below (see Section 3.1.2), the dplyr package (Wickham, François, et al., 2023) evokes the analogy of pliers. But since we work with starwars data througout this chapter, we see dplyr as a Swiss army knife full of light sabers. Thus, the dplyr package is a powerful toolbox for Rwars yedis.

3.1.1 Reshaping vs. reducing data

The different tasks involved in data transformation have very different effects. Whereas some operations leave a dataset intact, others modify the data in ways that highlight some aspects while sacrificing others. To gain some conceptual clarity, we first distinguish between two general types of data transformations.

1. Reshaping data

Reshaping operations modify the shape or structure of data without changing its contents. Importantly, two different shapes \(A\) and \(B\) of the same data are informationally equivalent in the sense that anything that can be computed from \(A\) can also be computed from \(B\) (and vice versa).36 Typical examples for reshaping data include structuring the values of a vector into an array or matrix or re-arranging the rows or columns of a table. Thus, the following data structures are all informationally equivalent (in the sense that they store the same data points and thus allow the same analyses and inferences), but differ in shape or structure:

# Example: Storing 1, 2, 3, 4 in different shapes: 

# as vectors:
(v1 <- 1:4)
#> [1] 1 2 3 4
(v2 <- rev(v1))
#> [1] 4 3 2 1

# as matrix:
(m1 <- matrix(1:4, nrow = 2))
#>      [,1] [,2]
#> [1,]    1    3
#> [2,]    2    4
(m2 <- matrix(1:4, nrow = 2, byrow = TRUE))
#>      [,1] [,2]
#> [1,]    1    2
#> [2,]    3    4

# as dataframe:
(d1 <- data.frame(x = 1:2, y = 3:4))
#>   x y
#> 1 1 3
#> 2 2 4
(d2 <- data.frame(m = 4:3, n = 2:1))
#>   m n
#> 1 4 2
#> 2 3 1

# as list:
(l1 <- list(a = 1:2, b = 3:4))
#> $a
#> [1] 1 2
#> 
#> $b
#> [1] 3 4

As these examples illustrate, reshaping operations can be performed within a single data structure or between different data structures. Strictly speaking, different data structures involve additional features beyond the mere values and shape of data that enable the access to and interpretation of the data. In R, such features are known as attributes (e.g., the names of rows and columns of a table) and typically shown when evaluating the structure of an R object (e.g., by the str() function).

While it may seem as if reshaping operations “only” change the surface structure, we must not underestimate the effects of shape and structure for our understanding and analysis of data. The shape and structure of data objects are important for at least two distinct, but related reasons:

  • Matching data to tools: Which data structure is used and how the data is arranged within a data structure enables or constrains how we can access and process the data. For instance, most functions (e.g., ggplot()) require that the data is shaped in a particular format and do not work if a different, but informationally equivalent form is supplied.

  • Matching data to our cognitive representations and skills: Just as our tools require particular data structures, we typically view and think about data in particular structures (e.g., cases as rows, variables as columns of a data frame). Despite the existence of many equivalent representations, human reasoners find it much easier to understand, interpret and use data that is arranged and sorted in ways that “fits” to the content (i.e., the semantic categories represented) and to their current goals or tasks.

The precise conditions and constraints of these matches are complex and at the center of many scientific debates (see Todd, Gigerenzer, & the ABC Research Group, 2012, for definitions and examples of ecological rationality). The literature in scientific psychology discusses the shape and structure of cognitive representations under the notions of categories, concepts, or models. The often puzzling phenomenon that informationally equivalent representations can elicit different behavior (e.g., when making decisions or solving problems) is known as representational effects or framing effects and still poorly understood (see Neth, Gradwohl, Streeb, Keim, & Gaissmaier, 2021, for examples and references). Overall, as our both tools and ways of thinking about and working with data typically afford particular data structures, we must not underestimate the importance of reshaping data.

2. Reducing data

Reducing operations are data transformations that typically modify both the shape and content of data. Thus, whereas reshaping operations can be reversed, reducing data usually goes beyond mere “trans-formation” by being uni-directional: We typically cannot reconstruct the original data from reduced data.

Good examples for reducing data are selecting some cases or variables from a larger dataset or aggregating data (e.g., computing the mean of some values). Whereas it is obvious that choosing some part reduces the amount of original data, computing a summary may still represent all values in some way, but reduces the resolution of the data.

Although the distinction between reshaping and reducing data is useful, these two categories are neither exhaustive nor mutually exclusive. For instance, a function that computes and adds a new variable based on existing values may reshape and enhance data. Similarly, most functions that summarize a range of values or variables both reshape and reduce the original data.

Precisely how we reshape and reduce data is usually driven by (a) the goals and purposes of our analysis, and (b) by the methods and tools that we want to apply. Whenever computing some measure, conducting a statistical test, or creating a visualization, we typically prepare data (e.g., format it in the required shape), arrange or select parts, focus on them in particular ways, and transform or aggregate the selected data in accordance with the rules of some methods or tools. For instance, when summarizing the values of a variable in a table, a visualization, or a statistical test, all three forms provide an overview of the data. Which of these alternative representations best answers our question depends on the match between our original goals and the features of each solution.

3.1.2 The function of pliers

The dplyr package (Wickham, François, et al., 2023) is a toolbox for transforming data. As we will see, dplyr contains functions for both reshaping and reducing data. Just as any tool for data transformation, these functions require particular data structures as inputs and yield particular data structures as outputs. In the case of dplyr, both its inputs and outputs are mostly rectangular data structures (i.e., data frames or tibbles) that vary in contents and shape.

Whereas we invoked the toolbox analogy of a Swiss army knife (above and in Section (basics:terminology)), the name of the dplyr package is inspired by “pliers”:

Pliers are tools for grasping things or tweaking them into new shapes. (Image by Evan-Amos, via Wikimedia Commons.)

Figure 3.3: Pliers are tools for grasping things or tweaking them into new shapes. (Image by Evan-Amos, via Wikimedia Commons.)

Pliers are tools for pulling out parts and tugging, tweaking, or twisting things into different shapes (see Figure 3.3). In our current context, the thing to tweak is a rectangular set of data (as an R data frame or tibble) and the dplyr tool allows manipulating this table into different tables that contain parts, additional or fewer variables, or provide summary information.

dplyr = MS Excel + control

For users that are familiar with basic spreadsheet in MS Excel: The dplyr functions allow similar manipulations of tabular data in R. However, spreadsheet users are typically solving many tasks by clicking interface buttons, entering simple formulas, and many copy-and-paste operations. While this can be simple and engaging, it is terribly error prone. The main problem with spreadsheets is that the process, typically consisting of many small interactive steps, remains transient and is lost, as only the resulting data table is stored. If a sequence of 100 steps included a minor error on step 29, we often need to start from scratch. Thus, it is very easy to make mistakes and almost impossible to recover from them if they are not noticed immediately.

By contrast, dplyr provides a series of simple commands for solving tasks like arranging or selecting rows or columns, categorizing variables into groups, and computing simple summary tables. Rather than incrementally constructing a spreadsheet and many implicit cut-and-paste operations, dplyr uses sequences of simple commands that explicate the entire process. In the spirit of reproducible research (Peng, 2011), this documents precisely what is being done and allows making corrections later.

3.1.3 Key commands

The essential dplyr functions which (we will examine in Section 3.2) are:

  1. arrange() sorts cases (rows);
  2. filter() and slice() select cases (rows) by logical conditions or number;
  3. select() selects and reorders variables (columns);
  4. mutate() computes new variables (columns) and adds them to the existing ones;
  5. summarise() collapses multiple values of a variable (rows of a column) to a single one;
  6. group_by() and ungroup() change the unit of aggregation (in combination with mutate() and summarise()).

Not quite as essential but still useful dplyr functions include:

  • rename() renames variables (columns) and keeps all others;
  • transmute() computes new variables (columns) and drops existing ones.

3.1.4 Objectives

After working through this chapter, you should be able to use dplyr to:

  1. arrange cases (rows) based on one or more criteria;
  2. select cases (rows) and variables (columns) from a data table;
  3. change and create new variables;
  4. compute summary statistics over variables and grouped values;
  5. combine multiple commands into pipes to answer questions and create new data tables.

Note that you can already perform some of these tasks. For instance, you know how to select cases or variables from tables by numeric indexing (as introduced in Section 1.5.3). Nevertheless, learning the corresponding dplyr commands still makes sense, as they provide easier and more consistent methods to perform a wider range of tasks.

3.1.5 Data used

In this chapter, we use the starwars data that is included in the dplyr package. To illustrate the essential dplyr commands, we first save a copy of dplyr::starwars as a tibble sw:

sw <- dplyr::starwars

Before proceeding further, we should get some sense of sw by running some quick tests. It is good practice to always check the dimensions of a new table at this point, but standard questions to ask of any new dataset include:

  • What are the dimensions (rows and columns) of our data file?

  • What do the individual cases (rows) represent?

  • Which variables (columns) exist and of which type are they?

  • Are there any missing (NA) values? If so, how many?

When dealing with a tibble (rather than a data frame), the easiest way to obtain a lot of information about our data table is by printing it to the console:

# Print tibble: 
sw
#> # A tibble: 87 × 14
#>    name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
#>    <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
#>  1 Luke Skywa…    172    77 blond   fair    blue       19   male  mascu… Tatooi…
#>  2 C-3PO          167    75 <NA>    gold    yellow    112   none  mascu… Tatooi…
#>  3 R2-D2           96    32 <NA>    white,… red        33   none  mascu… Naboo  
#>  4 Darth Vader    202   136 none    white   yellow     41.9 male  mascu… Tatooi…
#>  5 Leia Organa    150    49 brown   light   brown      19   fema… femin… Aldera…
#>  6 Owen Lars      178   120 brown,… light   blue       52   male  mascu… Tatooi…
#>  7 Beru White…    165    75 brown   light   blue       47   fema… femin… Tatooi…
#>  8 R5-D4           97    32 <NA>    white,… red        NA   none  mascu… Tatooi…
#>  9 Biggs Dark…    183    84 black   light   brown      24   male  mascu… Tatooi…
#> 10 Obi-Wan Ke…    182    77 auburn… fair    blue-g…    57   male  mascu… Stewjon
#> # … with 77 more rows, 4 more variables: species <chr>, films <list>,
#> #   vehicles <list>, starships <list>, and abbreviated variable names
#> #   ¹​hair_color, ²​skin_color, ³​eye_color, ⁴​birth_year, ⁵​homeworld

Inspecting the output of printing sw already answers our first three questions: The tibble sw contains 87 cases (rows) and 13 variables (columns). Each case represents an individual of the Star Wars universe. The 13 variables are of many different types, including character, numeric (integer and double), and 3 variables that are “lists” of character elements.

More specialized commands to would yield the same answers include:

# Standard checks:
dim(sw)     # 87 x 13 variables
#> [1] 87 14
names(sw)   # variable names
#>  [1] "name"       "height"     "mass"       "hair_color" "skin_color"
#>  [6] "eye_color"  "birth_year" "sex"        "gender"     "homeworld" 
#> [11] "species"    "films"      "vehicles"   "starships"
glimpse(sw) # all variables and initial values 
#> Rows: 87
#> Columns: 14
#> $ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or…
#> $ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…
#> $ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…
#> $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N…
#> $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "…
#> $ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",…
#> $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, …
#> $ sex        <chr> "male", "none", "none", "male", "female", "male", "female",…
#> $ gender     <chr> "masculine", "masculine", "masculine", "masculine", "femini…
#> $ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T…
#> $ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma…
#> $ films      <list> <"The Empire Strikes Back", "Revenge of the Sith", "Return…
#> $ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp…
#> $ starships  <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",…

To obtain information about potentially missing values, we can count the number of NA values in our tibble and each variable:

# Missing (NA) values: ------ 
sum(is.na(sw))      # 101 missing values
#> [1] 105
colSums(is.na(sw))  # missing by column (variable) 
#>       name     height       mass hair_color skin_color  eye_color birth_year 
#>          0          6         28          5          0          0         44 
#>        sex     gender  homeworld    species      films   vehicles  starships 
#>          4          4         10          4          0          0          0

Practice

Before we start using dplyr, let’s repeat some concepts acquired in Chapter 1:

  • Explain how the number of missing values is determined by sum(is.na(sw)) in terms of data vs. applying functions to data, indexing, and summing logical data types.

  • How can we determine the number of non-missing values in sw?

  • How can we verify that the number of missing values and the number of non-missing values add up to the number of all values in sw?

# N of non-missing values:
sum(!is.na(sw))
#> [1] 1113
# Verify sums:
# (a) step-by-step: 
n_missing <- sum( is.na(sw)) 
n_present <- sum(!is.na(sw)) 

n_rows <- dim(sw)[1]
n_cols <- dim(sw)[2]
n_all  <- n_rows * n_cols
  
n_missing + n_present == n_all
#> [1] TRUE
# (b) in 1 step:  
sum(is.na(sw)) + sum(!is.na(sw)) == prod(dim(sw))
#> [1] TRUE

3.1.6 Getting ready

This chapter formerly assumed that you have read and worked through Chapter 5: Data transformation of the r4ds textbook (Wickham & Grolemund, 2017). It now can be read by itself, but reading Chapter 5 of r4ds is still recommended. Based on this background, we examine essential commands of the dplyr package in the context of examples and exercises.

Please do the following to get started:

  • Create an R Markdown (.Rmd) document (for instructions, see Appendix F and the templates linked in Section F.2).

  • Structure your document by inserting headings and empty lines between different parts. Here’s an example showing how your initial file could look:

---
title: "Chapter 3: Data transformation"
author: "Your name"
date: "2024 October 20"
output: html_document
---

Add text or code chunks here.

# Exercises (03: Data transformation)
  
## Exercise 1
  
## Exercise 2
  
etc. 
  
<!-- The end (eof). -->
  • Create an initial code chunk below the header of your .Rmd file that loads all required R packages (e.g., those of the tidyverse and the ds4psy package) and see Section F.3.3 if you want to get rid of any messages or warnings of this chunk in your HTML output.

  • Save your file (e.g., as 03_transform.Rmd in the R folder of your current project) and remember saving and knitting it regularly as you keep adding content to it.

  • Use the HTML output version of your file (e.g., 03_transform.html) to share and submit your results, but still show and run your code by setting your R code chunks to echo = TRUE, eval = TRUE (which corresponds to the default settings of code chunks).

With your new .Rmd file in place, you are ready to start wrangling data with the help of dplyr.

References

Neth, H., Gradwohl, N., Streeb, D., Keim, D. A., & Gaissmaier, W. (2021). Perspectives on the 2x2 matrix: Solving semantically distinct problems based on a shared structure of binary contingencies. Frontiers in Psychology, 11, 567817. https://doi.org/10.3389/fpsyg.2020.567817
Peng, R. D. (2011). Reproducible research in computational science. Science, 334(6060), 1226–1227. https://doi.org/10.1126/science.1213847
Todd, P. M., Gigerenzer, G., & the ABC Research Group. (2012). Ecological rationality: Intelligence in the world. New York, NY: Oxford University Press.
Wickham, H., François, R., Henry, L., Müller, K., & Vaughan, D. (2023). dplyr: A grammar of data manipulation. Retrieved from https://CRAN.R-project.org/package=dplyr
Wickham, H., & Grolemund, G. (2017). R for data science: Import, tidy, transform, visualize, and model data. Retrieved from http://r4ds.had.co.nz

  1. In mathematical terms, reshaping data is a bijective function in which each element in \(A\) is mapped to a corresponding element in \(B\), and this mapping can be reversed.↩︎