Data transformation includes re-arranging, selecting, changing, and aggregating data. Assuming that we have data in the form of a tibble or data frame, dplyr provides a range of simple tools to transform this data.
3.1.1 The function of pliers
The name of the dplyr package is inspired by “pliers”:
Pliers are tools for pulling out parts and tugging, tweaking, or twisting things into different shapes (see Figure 3.2). In our current context, the thing to tweak is a rectangular set of data (as an R data frame or tibble) and the dplyr tool allows manipulating this table into other tables that contain parts, additional or fewer variables, or provide summary information.
dplyr = MS Excel + control
For users that are familiar with basic spreadsheet in MS Excel: The dplyr functions allow similar manipulations of tabular data in R. However, spreadsheet users are typically solving many tasks by clicking interface buttons, entering simple formulas, and many copy-and-paste operations. While this can be simple and engaging, it is terribly error prone. The main problem with spreadsheets is that the process, typically consisting of many small interactive steps, remains transient and is lost, as only the resulting data table is stored. If a sequence of 100 steps included a minor error on step 29, we often need to start from scratch. Thus, it is very easy to make mistakes and almost impossible to recover from them if they are not noticed immediately.
By contrast, dplyr provides a series of simple commands for solving tasks like arranging or selecting rows or columns, categorizing variables into groups, and computing simple summary tables. Rather than incrementally constructing a spreadsheet and many implicit cut-and-paste operations, dplyr uses sequences of simple commands that explicate the entire process. In the spirit of reproducible research (Peng, 2011), this documents precisely what is being done and allows making corrections later.
3.1.2 Key commands
The essential dplyr functions which we will examine in Section 3.2 are:
arrange()sorts cases (rows);
slice()select cases (rows) by logical conditions or number;
select()selects and reorders variables (columns);
mutate()computes new variables (columns) and adds them to the existing ones;
summarise()collapses multiple values of a variable (rows of a column) to a single one;
ungroup()change the unit of aggregation (in combination with
Not quite as essential but still useful dplyr functions include:
rename()renames variables (columns) and keeps all others;
transmute()computes new variables (columns) and drops existing ones.
After working through this chapter, you should be able to use dplyr to:
- arrange cases (rows) based on one or more criteria;
- select cases (rows) and variables (columns) from a data table;
- change and create new variables;
- compute summary statistics over variables and grouped values;
- combine multiple commands into pipes to answer questions and create new data tables.
Note that you can already perform some of these tasks. For instance, you know how to select cases or variables from tables by numeric indexing (as introduced in Section 1.5.3). Nevertheless, learning the corresponding dplyr commands still makes sense, as they provide easier and more consistent methods to perform a wider range of tasks.
3.1.4 Data used
In this chapter, we will primarily use the
starwars data that is included in the dplyr package.
To illustrate the essential dplyr commands, we first save a copy of
dplyr::starwars as a tibble
Before proceeding further, we should get some sense of
sw by running some quick tests.
It is good practice to always check the dimensions of a new table at this point, but standard questions to ask of any new dataset include:
What are the dimensions (rows and columns) of our data file?
What do the individual cases (rows) represent?
Which variables (columns) exist and of which type are they?
Are there any missing (
NA) values? If so, how many?
When dealing with a tibble (rather than a data frame), the easiest way to obtain a lot of information about our data table is by printing it to the console:
# Print tibble: sw
#> # A tibble: 87 x 14 #> name height mass hair_color skin_color eye_color birth_year sex gender #> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> #> 1 Luke S… 172 77 blond fair blue 19 male mascu… #> 2 C-3PO 167 75 <NA> gold yellow 112 none mascu… #> 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu… #> 4 Darth … 202 136 none white yellow 41.9 male mascu… #> 5 Leia O… 150 49 brown light brown 19 fema… femin… #> 6 Owen L… 178 120 brown, grey light blue 52 male mascu… #> 7 Beru W… 165 75 brown light blue 47 fema… femin… #> 8 R5-D4 97 32 <NA> white, red red NA none mascu… #> 9 Biggs … 183 84 black light brown 24 male mascu… #> 10 Obi-Wa… 182 77 auburn, wh… fair blue-gray 57 male mascu… #> # … with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>, #> # films <list>, vehicles <list>, starships <list>
Inspecting the output of printing
sw already answers our first three questions:
sw contains 87 cases (rows) and 13 variables (columns).
Each case represents an individual of the Star Wars universe.
The 13 variables are of many different types, including character, numeric (integer and double), and 3 variables that are “lists” of character elements.
More specialized commands to would yield the same answers include:
# Standard checks: dim(sw) # 87 x 13 variables
#>  87 14
names(sw) # variable names
#>  "name" "height" "mass" "hair_color" "skin_color" #>  "eye_color" "birth_year" "sex" "gender" "homeworld" #>  "species" "films" "vehicles" "starships"
glimpse(sw) # all variables and initial values
#> Rows: 87 #> Columns: 14 #> $ name <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or… #> $ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2… #> $ mass <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.… #> $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N… #> $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "… #> $ eye_color <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",… #> $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, … #> $ sex <chr> "male", "none", "none", "male", "female", "male", "female",… #> $ gender <chr> "masculine", "masculine", "masculine", "masculine", "femini… #> $ homeworld <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T… #> $ species <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma… #> $ films <list> <"The Empire Strikes Back", "Revenge of the Sith", "Return… #> $ vehicles <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp… #> $ starships <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",…
To obtain information about potentially missing values, we can count the number of
NA values in our tibble and each variable:
# Missing (NA) values: ------ sum(is.na(sw)) # 101 missing values
#>  105
colSums(is.na(sw)) # missing by column (variable)
#> name height mass hair_color skin_color eye_color birth_year #> 0 6 28 5 0 0 44 #> sex gender homeworld species films vehicles starships #> 4 4 10 4 0 0 0
Before we start using dplyr, let’s repeat some concepts acquired in Chapter 1:
Explain how the number of missing values is determined by
sum(is.na(sw))in terms of data vs. applying functions to data, indexing, and summing logical data types.
How can we determine the number of non-missing values in
How can we verify that the number of missing values and the number of non-missing values add up to the number of all values in
# N of non-missing values: sum(!is.na(sw))
#>  1113
# Verify sums: # (a) step-by-step: <- sum( is.na(sw)) n_missing <- sum(!is.na(sw)) n_present <- dim(sw) n_rows <- dim(sw) n_cols <- n_rows * n_cols n_all + n_present == n_alln_missing
#>  TRUE
# (b) in 1 step: sum(is.na(sw)) + sum(!is.na(sw)) == prod(dim(sw))
#>  TRUE
3.1.5 Getting ready
This chapter formerly assumed that you have read and worked through Chapter 5: Data transformation of the r4ds textbook (Wickham & Grolemund, 2017). It now can be read by itself, but reading Chapter 5 of r4ds is still recommended. Based on this background, we examine essential commands of the dplyr package in the context of examples and exercises.
Please do the following to get started:
Structure your document by inserting headings and empty lines between different parts. Here’s an example how your initial file could look:
--- : "Chapter 3: Data transformation" title: "Your name" author: "2021 June 14" date: html_document output--- Add text or code chunks here. # Exercises (03: Data transformation) ## Exercise 1 ## Exercise 2 etc. <!-- The end (eof). -->
Create an initial code chunk below the header of your
.Rmdfile that loads the R packages of the tidyverse and the ds4psy package (and see Section F.3.3 if you want to get rid of the messages and warnings of this chunk in your HTML output).
Save your file (e.g., as
03_transform.Rmdin the R folder of your current project) and remember saving and knitting it regularly as you keep adding content to it.
Use the HTML output version of your file (e.g.,
03_transform.html) to share and submit your results, but still show and run your code by setting your R code chunks to
echo = TRUE, eval = TRUE(which corresponds to the default settings of code chunks).
With your new
.Rmd file in place, you are ready to start wrangling data with the help of dplyr.