3.1 Introduction

Data transformation includes re-arranging, selecting, changing, and aggregating data. Assuming that we have data in the form of a tibble or data frame, dplyr provides a range of simple tools to transform this data.

3.1.1 Overview

The 6 essential dplyr commands which we will examine in Section 3.2 are:

  1. arrange sorts cases (rows);
  2. filter selects cases (rows) by logical conditions;
  3. select selects and reorders variables (columns);
  4. mutate computes variables (columns) and adds them to existing ones;
  5. summarise collapses multiple values of a variable (rows of a column) to a single one;
  6. group_by changes the unit of aggregation (in combination with mutate and summarise).

Not quite as essential but still useful dplyr commands include:

  • slice selects (ranges of) cases (rows) by number;
  • rename renames variables (columns) and keeps all others;
  • transmute computes new variables (columns) and drops existing ones;
  • sample_n and sample_frac draw random samples of cases (rows).

3.1.2 Objectives

After working through this chapter, you should be able to use dplyr to:

  1. arrange cases (rows) based on one or more criteria;
  2. select cases (rows) and variables (columns) from a data table;
  3. change and create new variables;
  4. compute summary statistics over variables and grouped values;
  5. combine multiple commands into pipes to answer questions and create new data tables.

Note that you can already perform some of these tasks. For instance, you know how to select cases or variables from tables by numeric indexing (as introduced in Section 1.5.3). Nevertheless, learning the corresponding dplyr commands still makes sense, as they provide easier and more consistent methods to perform a wider range of tasks.

3.1.3 Data used

In this chapter, we will primarily use the starwars data that is included in the dplyr package. To illustrate the essential dplyr commands, we first save a copy of dplyr::starwars as a tibble sw:

sw <- dplyr::starwars

Before proceeding further, we should get some sense of sw by running some quick tests. It is good practice to always check the dimensions of a new table at this point, but standard questions to ask of any new dataset include:

  • What are the dimensions (rows and columns) of our data file?

  • What do the individual cases (rows) represent?

  • Which variables (columns) exist and of which type are they?

  • Are there any missing (NA) values? If so, how many?

When dealing with a tibble (rather than a data frame), the easiest way to obtain a lot of information about our data table is by printing it to the console:

# Print tibble: 
sw
#> # A tibble: 87 x 13
#>    name  height  mass hair_color skin_color eye_color birth_year gender
#>    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
#>  1 Luke…    172    77 blond      fair       blue            19   male  
#>  2 C-3PO    167    75 <NA>       gold       yellow         112   <NA>  
#>  3 R2-D2     96    32 <NA>       white, bl… red             33   <NA>  
#>  4 Dart…    202   136 none       white      yellow          41.9 male  
#>  5 Leia…    150    49 brown      light      brown           19   female
#>  6 Owen…    178   120 brown, gr… light      blue            52   male  
#>  7 Beru…    165    75 brown      light      blue            47   female
#>  8 R5-D4     97    32 <NA>       white, red red             NA   <NA>  
#>  9 Bigg…    183    84 black      light      brown           24   male  
#> 10 Obi-…    182    77 auburn, w… fair       blue-gray       57   male  
#> # … with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>

Inspecting the output of printing sw already answers our first three questions: The tibble sw contains 87 cases (rows) and 13 variables (columns). Each case represents an individual of the Star Wars universe. The 13 variables are of many different types, including character, numeric (integer and double), and 3 variables that are “lists” of character elements.

More specialized commands to would yield the same answers include:

# Standard checks:
dim(sw)     # 87 x 13 variables
#> [1] 87 13
names(sw)   # variable names
#>  [1] "name"       "height"     "mass"       "hair_color" "skin_color"
#>  [6] "eye_color"  "birth_year" "gender"     "homeworld"  "species"   
#> [11] "films"      "vehicles"   "starships"
glimpse(sw) # all variables and initial values 
#> Observations: 87
#> Variables: 13
#> $ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia O…
#> $ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, …
#> $ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77…
#> $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", …
#> $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", …
#> $ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue"…
#> $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0,…
#> $ gender     <chr> "male", NA, NA, "male", "female", "male", "female", NA, "m…
#> $ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "…
#> $ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Hum…
#> $ films      <list> [<"Revenge of the Sith", "Return of the Jedi", "The Empir…
#> $ vehicles   <list> [<"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "I…
#> $ starships  <list> [<"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1…

To obtain information about potentially missing values, we can count the number of NA values in our tibble and each variable:

# Missing (NA) values: ------ 
sum(is.na(sw))      # 101 missing values
#> [1] 101
colSums(is.na(sw))  # missing by column (variable) 
#>       name     height       mass hair_color skin_color  eye_color birth_year 
#>          0          6         28          5          0          0         44 
#>     gender  homeworld    species      films   vehicles  starships 
#>          3         10          5          0          0          0

Practice

Before we start using dplyr, let’s repeat some concepts acquired in Chapter 1:

  • Explain how the number of missing values is determined by sum(is.na(sw)) in terms of data vs. applying functions to data, indexing, and summing logical data types.

  • How can we determine the number of non-missing values in sw?

  • How can we verify that the number of missing values and the number of non-missing values add up to the number of all values in sw?

# N of non-missing values:
sum(!is.na(sw))
#> [1] 1030
# Verify sums:
# (a) step-by-step: 
n_missing <- sum( is.na(sw)) 
n_present <- sum(!is.na(sw)) 

n_rows <- dim(sw)[1]
n_cols <- dim(sw)[2]
n_all  <- n_rows * n_cols
  
n_missing + n_present == n_all
#> [1] TRUE
# (b) in 1 step:  
sum(is.na(sw)) + sum(!is.na(sw)) == prod(dim(sw))
#> [1] TRUE

3.1.4 Getting ready

This chapter formerly assumed that you have read and worked through Chapter 5: Data transformation of the r4ds textbook (Wickham & Grolemund, 2017). It now can be read by itself, but reading Chapter 5 of r4ds is still recommended.

Based on this background, we examine essential commands of the dplyr package in the context of examples and exercises. Please do the following to get started:

  • Create an R Markdown (.Rmd) document (for instructions, see Appendix E and the templates linked in Section E.2).

  • Structure your document by inserting headings and empty lines between different parts. Here’s an example how your initial file could look:

---
title: "Chapter 3: Data transformation"
author: "Your name"
date: "2020 February 10"
output: html_document
---

Add text or code chunks here.

# Exercises (03: Data transformation)
  
## Exercise 1
  
## Exercise 2
  
etc. 
  
<!-- The end (eof). -->
  • Create an initial code chunk below the header of your .Rmd file that loads the R packages of the tidyverse (and see Section E.3.3 if you want to get rid of the messages and warnings of this chunk in your HTML output).

  • Save your file (e.g., as 03_transform.Rmd in the R folder of your current project) and remember saving and knitting it regularly as you keep adding content to it.

With your new .Rmd file in place, you are ready to start wrangling data with the help of dplyr.

References

Wickham, H., & Grolemund, G. (2017). R for data science: Import, tidy, transform, visualize, and model data. Retrieved from http://r4ds.had.co.nz