3.2 Essential dplyr commands

The dplyr package (Wickham et al., 2020) provides many useful commands, but the following 6 verbs are essential for transforming data and computing simple summary statistics:

  1. arrange sorts cases (rows);
  2. filter selects cases (rows) by logical conditions;
  3. select selects and reorders variables (columns);
  4. mutate computes variables (columns) and adds them to existing ones;
  5. summarise collapses multiple values of a variable (rows of a column) to a single one;
  6. group_by changes the unit of aggregation (in combination with mutate and summarise).

The following sections illustrate each of these commands in the context of examples. To keep things simple and entertaining, we use the toy dataset dplyr::starwars (assigned to an object sw by sw <- dplyr::starwars) to introduce the commands, but will proceed to more realistic datasets in the exercises (in Section 3.5).
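As a minimal setup sketch (assuming that the dplyr package is installed), the data can be loaded and inspected as follows:

```r
library(dplyr)

# Copy the starwars data into an object with a shorter name
sw <- dplyr::starwars

dim(sw)      # number of rows and columns
glimpse(sw)  # compact overview of all variables
```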

3.2.1 arrange sorts rows

Using arrange sorts cases (rows) by putting specific variables (columns) in specific orders (e.g., ascending or descending). For instance, we may want to arrange cases (rows) by the name of individuals (in alphabetical order). The dplyr function arrange() lets us do this by calling (a) arrange(.data = sw, name).

Before we proceed, 2 simple observations will make our lives much easier:

  1. In R, we can generally omit argument names (as long as the order of arguments makes it clear what is meant). Thus, we can write the same command more easily as (b) arrange(sw, name).
  2. In dplyr and other tidyverse packages, we can rewrite commands by using the so-called pipe (written as %>%) of the magrittr package (Bache & Wickham, 2014), as in (c) sw %>% arrange(name).

Think of the pipe as passing whatever is on its left (here: sw) to the first argument of the function on its right (here: .data). (More details about using the pipe operator are provided below in Section 3.3.)
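The 3 ways of calling arrange can be sketched and compared as follows. The object names by_a, by_b, and by_c are only chosen for illustration:

```r
library(dplyr)
sw <- dplyr::starwars

by_a <- arrange(.data = sw, name)  # (a) with the argument name .data
by_b <- arrange(sw, name)          # (b) argument name omitted
by_c <- sw %>% arrange(name)       # (c) sw is piped into the first argument

# Check that all 3 calls return the same tibble
identical(by_a, by_b)
identical(by_b, by_c)
```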

In other words, the last 3 commands (a), (b), and (c) are identical and yield the same output:

#> # A tibble: 87 x 14
#>    name  height  mass hair_color skin_color eye_color birth_year sex   gender
#>    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#>  1 Ackb…    180    83 none       brown mot… orange          41   male  mascu…
#>  2 Adi …    184    50 none       dark       blue            NA   fema… femin…
#>  3 Anak…    188    84 blond      fair       blue            41.9 male  mascu…
#>  4 Arve…     NA    NA brown      fair       brown           NA   male  mascu…
#>  5 Ayla…    178    55 none       blue       hazel           48   fema… femin…
#>  6 Bail…    191    NA black      tan        brown           67   male  mascu…
#>  7 Barr…    166    50 black      yellow     blue            40   fema… femin…
#>  8 BB8       NA    NA none       none       black           NA   none  mascu…
#>  9 Ben …    163    65 none       grey, gre… orange          NA   male  mascu…
#> 10 Beru…    165    75 brown      light      blue            47   fema… femin…
#> # … with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>

This output shows the tibble sw with its rows arranged alphabetically by the variable name, which is exactly what we wanted. Although this is neat, 2 immediate questions are:

  • How can we arrange rows in different (e.g., descending, rather than ascending) orders?

  • How can we arrange rows by more than 1 variable?

Both of these tasks are solved rather intuitively by adjusting our calls to arrange:
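The exact calls are not shown here. A sketch of 2 calls that are consistent with the outputs below (an assumption inferred from the printed rows) uses desc() for descending orders and separates multiple variables by commas:

```r
library(dplyr)
sw <- dplyr::starwars

# Sort rows in descending order of name
sw_desc <- sw %>% arrange(desc(name))

# Sort by multiple variables: first by eye_color, then by gender,
# and finally by height (in descending order)
sw_multi <- sw %>% arrange(eye_color, gender, desc(height))

sw_desc
sw_multi
```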

#> # A tibble: 87 x 14
#>    name  height  mass hair_color skin_color eye_color birth_year sex   gender
#>    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#>  1 Zam …    168    55 blonde     fair, gre… yellow            NA fema… femin…
#>  2 Yoda      66    17 white      green      brown            896 male  mascu…
#>  3 Yara…    264    NA none       white      yellow            NA male  mascu…
#>  4 Wilh…    180    NA auburn, g… fair       blue              64 male  mascu…
#>  5 Wick…     88    20 brown      brown      brown              8 male  mascu…
#>  6 Wedg…    170    77 brown      fair       hazel             21 male  mascu…
#>  7 Watto    137    NA black      blue, grey yellow            NA male  mascu…
#>  8 Wat …    193    48 none       green, gr… unknown           NA male  mascu…
#>  9 Tion…    206    80 none       grey       black             NA male  mascu…
#> 10 Taun…    213    NA none       grey       black             NA fema… femin…
#> # … with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>
#> # A tibble: 87 x 14
#>    name  height  mass hair_color skin_color eye_color birth_year sex   gender
#>    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#>  1 Taun…    213    NA none       grey       black             NA fema… femin…
#>  2 Shaa…    178    57 none       red, blue… black             NA fema… femin…
#>  3 Lama…    229    88 none       grey       black             NA male  mascu…
#>  4 Tion…    206    80 none       grey       black             NA male  mascu…
#>  5 Kit …    196    87 none       green      black             NA male  mascu…
#>  6 Plo …    188    80 none       orange     black             22 male  mascu…
#>  7 Gree…    173    74 <NA>       green      black             44 male  mascu…
#>  8 Nien…    160    68 none       grey       black             NA male  mascu…
#>  9 Gasg…    122    NA none       white, bl… black             NA male  mascu…
#> 10 BB8       NA    NA none       none       black             NA none  mascu…
#> # … with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>

See ?dplyr::arrange for more help and additional examples.

Details

Note some details on using arrange in the above examples:

  • All basic dplyr commands can be called as verb(.data, ...) or — by using the pipe operator from magrittr — as .data %>% verb(...) (see vignette("magrittr") for details). Importantly, the pipe operator %>% is different from the + operator used in ggplot calls.

  • In contrast to base R commands, multiple variables in tidyverse commands can be written as comma-separated, unquoted variable names, rather than as vectors of quoted variable names (e.g., c("gender", "height")).

  • When specifying multiple variables in arrange, their order (x, y, ...) specifies the order or priority of operations (first by x, then by y, etc.).
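To illustrate the last 2 points, the following sketch compares arrange with a base R counterpart based on order(). The choice of gender and height as sorting variables is only an example:

```r
library(dplyr)
sw <- dplyr::starwars

# dplyr: unquoted, comma-separated variables (first by gender, then by height)
by_dplyr <- arrange(sw, gender, height)

# base R: explicit references to the columns of sw
by_base <- sw[order(sw$gender, sw$height), ]

# Both approaches put the individuals in the same order
identical(by_dplyr$name, by_base$name)
```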

Practice

  • Arrange the sw data in different ways, combining multiple variables and (ascending and descending) orders.

  • Where are the cases containing missing (NA) values in sorted variables placed?

3.2.2 filter selects rows

Using filter selects cases (rows) by logical conditions or a criterion. It keeps all rows for which the criterion is TRUE and drops all rows for which the criterion is FALSE or NA.
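A minimal sketch of this rule, using a hypothetical toy table (toy is not part of sw) with 1 missing value:

```r
library(dplyr)

# Hypothetical toy data with a missing value in x
toy <- tibble(id = 1:4, x = c(5, 12, NA, 20))

# The test is evaluated for every row
toy$x > 10   # FALSE TRUE NA TRUE

# filter keeps only the rows for which the test is TRUE
kept <- filter(toy, x > 10)
kept$id      # 2 4
```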

For instance, 2 identical ways to extract all humans from sw are filter(sw, species == "Human") and sw %>% filter(species == "Human"). Both result in:

#> # A tibble: 35 x 14
#>    name  height  mass hair_color skin_color eye_color birth_year sex   gender
#>    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#>  1 Luke…    172    77 blond      fair       blue            19   male  mascu…
#>  2 Dart…    202   136 none       white      yellow          41.9 male  mascu…
#>  3 Leia…    150    49 brown      light      brown           19   fema… femin…
#>  4 Owen…    178   120 brown, gr… light      blue            52   male  mascu…
#>  5 Beru…    165    75 brown      light      blue            47   fema… femin…
#>  6 Bigg…    183    84 black      light      brown           24   male  mascu…
#>  7 Obi-…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
#>  8 Anak…    188    84 blond      fair       blue            41.9 male  mascu…
#>  9 Wilh…    180    NA auburn, g… fair       blue            64   male  mascu…
#> 10 Han …    180    80 brown      fair       brown           29   male  mascu…
#> # … with 25 more rows, and 5 more variables: homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>

To filter by some criterion (here: a test that determines whether species == "Human" is TRUE or FALSE), we needed to know both the variable by which we wanted to filter (here: species) and its value of interest (here: "Human"). Note that the output of applying filter to sw is a new tibble, but this tibble only contains 35 cases (i.e., the humans from sw).

Filtering by more than one condition can be very effective, but requires some knowledge about logical operators:
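The calls behind the outputs below are not shown. The following sketch is an assumption inferred from the printed rows. It combines conditions by commas or & (logical AND), by the helper between(), and by | (logical OR):

```r
library(dplyr)
sw <- dplyr::starwars

# Tall and light individuals: a comma means the same as &
tall_light_1 <- filter(sw, height > 180, mass <= 75)
tall_light_2 <- filter(sw, height > 180 & mass <= 75)

# A range of heights: 2 conditions or the helper between()
mid_1 <- filter(sw, height >= 150, height <= 165)
mid_2 <- filter(sw, between(height, 150, 165))

# Logical OR: at least 1 of the 2 conditions must be TRUE
wook_or_green <- filter(sw, species == "Wookiee" | skin_color == "green")

identical(tall_light_1, tall_light_2)
identical(mid_1, mid_2)
```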

#> # A tibble: 3 x 14
#>   name  height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Jar …    196    66 none       orange     orange            52 male  mascu…
#> 2 Adi …    184    50 none       dark       blue              NA fema… femin…
#> 3 Wat …    193    48 none       green, gr… unknown           NA male  mascu…
#> # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>
#> # A tibble: 3 x 14
#>   name  height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Jar …    196    66 none       orange     orange            52 male  mascu…
#> 2 Adi …    184    50 none       dark       blue              NA fema… femin…
#> 3 Wat …    193    48 none       green, gr… unknown           NA male  mascu…
#> # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>
#> # A tibble: 9 x 14
#>   name  height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Leia…    150    49 brown      light      brown             19 fema… femin…
#> 2 Beru…    165    75 brown      light      blue              47 fema… femin…
#> 3 Mon …    150    NA auburn     fair       blue              48 fema… femin…
#> 4 Nien…    160    68 none       grey       black             NA male  mascu…
#> 5 Shmi…    163    NA black      fair       brown             72 fema… femin…
#> 6 Ben …    163    65 none       grey, gre… orange            NA male  mascu…
#> 7 Cordé    157    NA brown      light      brown             NA fema… femin…
#> 8 Dormé    165    NA brown      light      brown             NA fema… femin…
#> 9 Padm…    165    45 brown      light      brown             46 fema… femin…
#> # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>
#> # A tibble: 9 x 14
#>   name  height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Leia…    150    49 brown      light      brown             19 fema… femin…
#> 2 Beru…    165    75 brown      light      blue              47 fema… femin…
#> 3 Mon …    150    NA auburn     fair       blue              48 fema… femin…
#> 4 Nien…    160    68 none       grey       black             NA male  mascu…
#> 5 Shmi…    163    NA black      fair       brown             72 fema… femin…
#> 6 Ben …    163    65 none       grey, gre… orange            NA male  mascu…
#> 7 Cordé    157    NA brown      light      brown             NA fema… femin…
#> 8 Dormé    165    NA brown      light      brown             NA fema… femin…
#> 9 Padm…    165    45 brown      light      brown             46 fema… femin…
#> # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>
#> # A tibble: 8 x 14
#>   name  height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Chew…    228   112 brown      unknown    blue             200 male  mascu…
#> 2 Gree…    173    74 <NA>       green      black             44 male  mascu…
#> 3 Yoda      66    17 white      green      brown            896 male  mascu…
#> 4 Bossk    190   113 none       green      red               53 male  mascu…
#> 5 Rugo…    206    NA none       green      orange            NA male  mascu…
#> 6 Kit …    196    87 none       green      black             NA male  mascu…
#> 7 Pogg…    183    80 none       green      yellow            NA male  mascu…
#> 8 Tarf…    234   136 brown      brown      blue              NA male  mascu…
#> # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

A common criterion for filtering is that we want to (a) only obtain cases with missing values on some variable(s), or (b) only keep cases without missing values on some variable(s):
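Both cases rely on is.na() and its negation. The variables used in this sketch (gender, mass, and birth_year) are an assumption that is consistent with the outputs below:

```r
library(dplyr)
sw <- dplyr::starwars

# (a) Only cases with missing values on gender
gender_na <- filter(sw, is.na(gender))

# (b) Only cases without missing values on mass and birth_year
mass_by_known <- filter(sw, !is.na(mass), !is.na(birth_year))

gender_na
mass_by_known
```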

#> # A tibble: 4 x 14
#>   name  height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Ric …    183    NA brown      fair       blue              NA <NA>  <NA>  
#> 2 Quar…    183    NA black      dark       brown             62 <NA>  <NA>  
#> 3 Sly …    178    48 none       pale       white             NA <NA>  <NA>  
#> 4 Capt…     NA    NA unknown    unknown    unknown           NA <NA>  <NA>  
#> # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>
#> # A tibble: 36 x 14
#>    name  height  mass hair_color skin_color eye_color birth_year sex   gender
#>    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#>  1 Luke…    172    77 blond      fair       blue            19   male  mascu…
#>  2 C-3PO    167    75 <NA>       gold       yellow         112   none  mascu…
#>  3 R2-D2     96    32 <NA>       white, bl… red             33   none  mascu…
#>  4 Dart…    202   136 none       white      yellow          41.9 male  mascu…
#>  5 Leia…    150    49 brown      light      brown           19   fema… femin…
#>  6 Owen…    178   120 brown, gr… light      blue            52   male  mascu…
#>  7 Beru…    165    75 brown      light      blue            47   fema… femin…
#>  8 Bigg…    183    84 black      light      brown           24   male  mascu…
#>  9 Obi-…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
#> 10 Anak…    188    84 blond      fair       blue            41.9 male  mascu…
#> # … with 26 more rows, and 5 more variables: homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>

As filter selects cases, its result should typically be a table with the same number of columns as the original one, but fewer rows. See ?dplyr::filter for more help and additional examples.

Details

Note some details on using filter:

  • Separating multiple conditions or tests (x, y, ...) by commas means the same as combining them with the logical AND (&), as each test results in a vector of logical values.

  • As seen with arrange, variable names are unquoted.

  • Unlike in base R, rows for which the condition evaluates to NA are dropped.

  • Additional filter functions include near() for testing numerical (near-)identity.
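The last 2 points can be sketched with a hypothetical toy table (toy is not part of sw):

```r
library(dplyr)

# Numerical (near-)identity
sqrt(2)^2 == 2      # FALSE, due to floating point imprecision
near(sqrt(2)^2, 2)  # TRUE, equal within a small tolerance

# Hypothetical toy data with a missing value in x
toy <- data.frame(id = 1:4, x = c(5, 12, NA, 20))

base_rows  <- toy[toy$x > 10, ]    # base R: the NA test yields a row of NAs
dplyr_rows <- filter(toy, x > 10)  # dplyr: the NA test drops the row

nrow(base_rows)   # 3
nrow(dplyr_rows)  # 2
```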

Practice

  1. Verify for an example that filtering by 2 criteria yields the same result as filtering twice (once for each criterion).
#> [1] TRUE
#> [1] TRUE

Can you explain why we added arrange(name) to the end of each filter pipe?
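One way of setting up such a check is sketched below. The 2 criteria (humans taller than 180 cm) are only chosen for illustration:

```r
library(dplyr)
sw <- dplyr::starwars

# Filtering by 2 criteria at once
both <- sw %>%
  filter(species == "Human", height > 180) %>%
  arrange(name)

# Filtering twice (once for each criterion)
twice <- sw %>%
  filter(species == "Human") %>%
  filter(height > 180) %>%
  arrange(name)

# Compare the individuals in both results
identical(both$name, twice$name)
```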

  2. Use filter on the sw data to select either a diverse or a narrow subset of individuals. For instance,
  • which individual with blond hair and blue eyes has an unknown mass?
  • of which species are individuals that are over 2m tall and have brown hair?
  • which individuals from Tatooine are not male (but may be NA)?
  • which individuals are neither male nor female OR heavier than 130kg?

slice selects rows by number

If we want to select specific rows of a data table and already know their row number, we can use the slice command of dplyr:
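A sketch of some slice calls (the row numbers are arbitrary examples):

```r
library(dplyr)
sw <- dplyr::starwars

slice(sw, 3)                         # only the 3rd row
rows_135 <- slice(sw, c(1, 3, 5))    # rows 1, 3, and 5
rows_10_15 <- sw %>% slice(10:15)    # rows 10 to 15

rows_135
rows_10_15
```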

Strictly speaking, we would not need slice, as we could always create a column that contains the number of the corresponding row and then filter for the numeric values of this column:
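A sketch of this alternative, assuming that the helper column is called row_nr and that we want rows 1, 3, and 5:

```r
library(dplyr)
sw <- dplyr::starwars

# Add a column that contains the number of each row
sw_nr <- sw %>% mutate(row_nr = row_number())

# Filter for specific values of row_nr
by_filter <- sw_nr %>% filter(row_nr == 1 | row_nr == 3 | row_nr == 5)

# The same rows by slice
by_slice <- slice(sw, c(1, 3, 5))

identical(by_filter$name, by_slice$name)
```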

However, the last example shows that the tests for specific numeric values of row_nr can get cumbersome. Hence, slice is a welcome addition to our dplyr vocabulary.

Practice

Predict the outcome of slice(sw, 1:nrow(sw)) and then evaluate the expression and your prediction.

3.2.3 select selects columns

Using select selects variables (columns) by their names or numbers. As it works analogously to filter (but selects columns, rather than rows), we can immediately select not just one, but multiple variables. Actually, there are many ways of achieving the same result:
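The following sketch selects the same 4 variables by name, by a vector of names, and by column position. Note that positions depend on the order of columns, which can differ between versions of a dataset. The last 2 outputs below (which show homeworld and sex) may stem from number-based calls that were written for a different column order, but this is only a guess. The sketch therefore looks up the positions by match() and passes them to all_of(), rather than hard-coding numbers:

```r
library(dplyr)
sw <- dplyr::starwars

# By unquoted variable names
sel_1 <- select(sw, name, species, birth_year, gender)
sel_2 <- sw %>% select(name, species, birth_year, gender)

# By a vector of variable names
sel_3 <- sw %>% select(c(name, species, birth_year, gender))

# By column positions (looked up, as they depend on the column order)
pos <- match(c("name", "species", "birth_year", "gender"), names(sw))
sel_4 <- select(sw, all_of(pos))

identical(sel_1, sel_2)
identical(names(sel_1), names(sel_4))
```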

#> # A tibble: 87 x 4
#>    name               species birth_year gender   
#>    <chr>              <chr>        <dbl> <chr>    
#>  1 Luke Skywalker     Human         19   masculine
#>  2 C-3PO              Droid        112   masculine
#>  3 R2-D2              Droid         33   masculine
#>  4 Darth Vader        Human         41.9 masculine
#>  5 Leia Organa        Human         19   feminine 
#>  6 Owen Lars          Human         52   masculine
#>  7 Beru Whitesun lars Human         47   feminine 
#>  8 R5-D4              Droid         NA   masculine
#>  9 Biggs Darklighter  Human         24   masculine
#> 10 Obi-Wan Kenobi     Human         57   masculine
#> # … with 77 more rows
#> # A tibble: 87 x 4
#>    name               species birth_year gender   
#>    <chr>              <chr>        <dbl> <chr>    
#>  1 Luke Skywalker     Human         19   masculine
#>  2 C-3PO              Droid        112   masculine
#>  3 R2-D2              Droid         33   masculine
#>  4 Darth Vader        Human         41.9 masculine
#>  5 Leia Organa        Human         19   feminine 
#>  6 Owen Lars          Human         52   masculine
#>  7 Beru Whitesun lars Human         47   feminine 
#>  8 R5-D4              Droid         NA   masculine
#>  9 Biggs Darklighter  Human         24   masculine
#> 10 Obi-Wan Kenobi     Human         57   masculine
#> # … with 77 more rows
#> # A tibble: 87 x 4
#>    name               species birth_year gender   
#>    <chr>              <chr>        <dbl> <chr>    
#>  1 Luke Skywalker     Human         19   masculine
#>  2 C-3PO              Droid        112   masculine
#>  3 R2-D2              Droid         33   masculine
#>  4 Darth Vader        Human         41.9 masculine
#>  5 Leia Organa        Human         19   feminine 
#>  6 Owen Lars          Human         52   masculine
#>  7 Beru Whitesun lars Human         47   feminine 
#>  8 R5-D4              Droid         NA   masculine
#>  9 Biggs Darklighter  Human         24   masculine
#> 10 Obi-Wan Kenobi     Human         57   masculine
#> # … with 77 more rows
#> # A tibble: 87 x 4
#>    name               homeworld birth_year sex   
#>    <chr>              <chr>          <dbl> <chr> 
#>  1 Luke Skywalker     Tatooine        19   male  
#>  2 C-3PO              Tatooine       112   none  
#>  3 R2-D2              Naboo           33   none  
#>  4 Darth Vader        Tatooine        41.9 male  
#>  5 Leia Organa        Alderaan        19   female
#>  6 Owen Lars          Tatooine        52   male  
#>  7 Beru Whitesun lars Tatooine        47   female
#>  8 R5-D4              Tatooine        NA   none  
#>  9 Biggs Darklighter  Tatooine        24   male  
#> 10 Obi-Wan Kenobi     Stewjon         57   male  
#> # … with 77 more rows
#> # A tibble: 87 x 4
#>    name               homeworld birth_year sex   
#>    <chr>              <chr>          <dbl> <chr> 
#>  1 Luke Skywalker     Tatooine        19   male  
#>  2 C-3PO              Tatooine       112   none  
#>  3 R2-D2              Naboo           33   none  
#>  4 Darth Vader        Tatooine        41.9 male  
#>  5 Leia Organa        Alderaan        19   female
#>  6 Owen Lars          Tatooine        52   male  
#>  7 Beru Whitesun lars Tatooine        47   female
#>  8 R5-D4              Tatooine        NA   none  
#>  9 Biggs Darklighter  Tatooine        24   male  
#> 10 Obi-Wan Kenobi     Stewjon         57   male  
#> # … with 77 more rows

The : operator allows selecting ranges of variables:
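A call that is consistent with the output below (the 2 ranges are an assumption inferred from the printed columns) is:

```r
library(dplyr)
sw <- dplyr::starwars

# Select 2 ranges of adjacent variables
ranges <- sw %>% select(name:mass, gender:species)

names(ranges)
```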

#> # A tibble: 87 x 6
#>    name               height  mass gender    homeworld species
#>    <chr>               <int> <dbl> <chr>     <chr>     <chr>  
#>  1 Luke Skywalker        172    77 masculine Tatooine  Human  
#>  2 C-3PO                 167    75 masculine Tatooine  Droid  
#>  3 R2-D2                  96    32 masculine Naboo     Droid  
#>  4 Darth Vader           202   136 masculine Tatooine  Human  
#>  5 Leia Organa           150    49 feminine  Alderaan  Human  
#>  6 Owen Lars             178   120 masculine Tatooine  Human  
#>  7 Beru Whitesun lars    165    75 feminine  Tatooine  Human  
#>  8 R5-D4                  97    32 masculine Tatooine  Droid  
#>  9 Biggs Darklighter     183    84 masculine Tatooine  Human  
#> 10 Obi-Wan Kenobi        182    77 masculine Stewjon   Human  
#> # … with 77 more rows

Selecting can also be used to re-arrange variables. In this case, the function everything() is useful, as it refers to every variable not already specified:
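A sketch that moves 3 variables to the front and keeps all others (the choice of variables is inferred from the output below):

```r
library(dplyr)
sw <- dplyr::starwars

# Put species, name, and gender first, followed by all other variables
reordered <- select(sw, species, name, gender, everything())

names(reordered)
```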

#> # A tibble: 87 x 14
#>    species name  gender height  mass hair_color skin_color eye_color birth_year
#>    <chr>   <chr> <chr>   <int> <dbl> <chr>      <chr>      <chr>          <dbl>
#>  1 Human   Luke… mascu…    172    77 blond      fair       blue            19  
#>  2 Droid   C-3PO mascu…    167    75 <NA>       gold       yellow         112  
#>  3 Droid   R2-D2 mascu…     96    32 <NA>       white, bl… red             33  
#>  4 Human   Dart… mascu…    202   136 none       white      yellow          41.9
#>  5 Human   Leia… femin…    150    49 brown      light      brown           19  
#>  6 Human   Owen… mascu…    178   120 brown, gr… light      blue            52  
#>  7 Human   Beru… femin…    165    75 brown      light      blue            47  
#>  8 Droid   R5-D4 mascu…     97    32 <NA>       white, red red             NA  
#>  9 Human   Bigg… mascu…    183    84 black      light      brown           24  
#> 10 Human   Obi-… mascu…    182    77 auburn, w… fair       blue-gray       57  
#> # … with 77 more rows, and 5 more variables: sex <chr>, homeworld <chr>,
#> #   films <list>, vehicles <list>, starships <list>

A number of additional helper functions allow more sophisticated selections by testing variable names:
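The following helper calls are consistent with the 4 outputs below, but the specific patterns are an assumption inferred from the printed column names:

```r
library(dplyr)
sw <- dplyr::starwars

s_start  <- select(sw, starts_with("s"))  # names starting with "s"
s_end    <- select(sw, ends_with("s"))    # names ending on "s"
with_us  <- select(sw, contains("_"))     # names containing "_"
match_or <- select(sw, matches("or"))     # names matching the regular expression "or"

names(s_start)
names(match_or)
```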

#> # A tibble: 87 x 4
#>    skin_color  sex    species starships
#>    <chr>       <chr>  <chr>   <list>   
#>  1 fair        male   Human   <chr [2]>
#>  2 gold        none   Droid   <chr [0]>
#>  3 white, blue none   Droid   <chr [0]>
#>  4 white       male   Human   <chr [1]>
#>  5 light       female Human   <chr [0]>
#>  6 light       male   Human   <chr [0]>
#>  7 light       female Human   <chr [0]>
#>  8 white, red  none   Droid   <chr [0]>
#>  9 light       male   Human   <chr [1]>
#> 10 fair        male   Human   <chr [5]>
#> # … with 77 more rows
#> # A tibble: 87 x 5
#>     mass species films     vehicles  starships
#>    <dbl> <chr>   <list>    <list>    <list>   
#>  1    77 Human   <chr [5]> <chr [2]> <chr [2]>
#>  2    75 Droid   <chr [6]> <chr [0]> <chr [0]>
#>  3    32 Droid   <chr [7]> <chr [0]> <chr [0]>
#>  4   136 Human   <chr [4]> <chr [0]> <chr [1]>
#>  5    49 Human   <chr [5]> <chr [1]> <chr [0]>
#>  6   120 Human   <chr [3]> <chr [0]> <chr [0]>
#>  7    75 Human   <chr [3]> <chr [0]> <chr [0]>
#>  8    32 Droid   <chr [1]> <chr [0]> <chr [0]>
#>  9    84 Human   <chr [1]> <chr [0]> <chr [1]>
#> 10    77 Human   <chr [6]> <chr [1]> <chr [5]>
#> # … with 77 more rows
#> # A tibble: 87 x 4
#>    hair_color    skin_color  eye_color birth_year
#>    <chr>         <chr>       <chr>          <dbl>
#>  1 blond         fair        blue            19  
#>  2 <NA>          gold        yellow         112  
#>  3 <NA>          white, blue red             33  
#>  4 none          white       yellow          41.9
#>  5 brown         light       brown           19  
#>  6 brown, grey   light       blue            52  
#>  7 brown         light       blue            47  
#>  8 <NA>          white, red  red             NA  
#>  9 black         light       brown           24  
#> 10 auburn, white fair        blue-gray       57  
#> # … with 77 more rows
#> # A tibble: 87 x 4
#>    hair_color    skin_color  eye_color homeworld
#>    <chr>         <chr>       <chr>     <chr>    
#>  1 blond         fair        blue      Tatooine 
#>  2 <NA>          gold        yellow    Tatooine 
#>  3 <NA>          white, blue red       Naboo    
#>  4 none          white       yellow    Tatooine 
#>  5 brown         light       brown     Alderaan 
#>  6 brown, grey   light       blue      Tatooine 
#>  7 brown         light       blue      Tatooine 
#>  8 <NA>          white, red  red       Tatooine 
#>  9 black         light       brown     Tatooine 
#> 10 auburn, white fair        blue-gray Stewjon  
#> # … with 77 more rows

As select selects variables, its result should typically be a table with the same number of cases as the original one, but fewer columns. See ?dplyr::select for more help and additional examples, as well as ?dplyr::select_if for conditional variants.

A dplyr function closely related to select is rename, which does exactly what it says:
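A sketch of a rename call that matches the new variable names in the output below (note the order new_name = old_name):

```r
library(dplyr)
sw <- dplyr::starwars

# Rename 2 variables and keep all others
renamed <- rename(sw, creature = name, from_planet = homeworld)

names(renamed)
```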

#> # A tibble: 87 x 14
#>    creature height  mass hair_color skin_color eye_color birth_year sex   gender
#>    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#>  1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
#>  2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
#>  3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
#>  4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
#>  5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
#>  6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
#>  7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
#>  8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
#>  9 Biggs D…    183    84 black      light      brown           24   male  mascu…
#> 10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
#> # … with 77 more rows, and 5 more variables: from_planet <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>

Details

Note some details on using select:

  • select works both by specifying variable (column) names and by specifying column numbers.

  • Again, variable names are unquoted.

  • The sequence of variable names (separated by commas) specifies the order of columns in the resulting tibble.

  • Selecting and adding everything() allows re-ordering variables.

  • Various helper functions (e.g., starts_with, ends_with, contains, matches, num_range) refer to (parts of) variable names.

  • rename renames specified variables (without quotes) and keeps all other variables.

Practice

  1. What is the result of sw %>% select(height)? More specifically, how does it differ from the vector sw$height?
#> [1] FALSE
#> [1] TRUE
  2. Use select on the dplyr::starwars data (sw) to select and re-order specific subsets of variables (e.g., all variables starting with “h”, all even columns, all character variables, etc.).

3.2.4 mutate computes new variables

Using mutate computes new variables (columns) from scratch or existing ones:
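The calls are not shown here. The following sketch is consistent with the outputs below, but the name sws for the reduced table and the conversion factor (1 cm is about 0.0328084 feet) are assumptions:

```r
library(dplyr)
sw <- dplyr::starwars

# Keep it simple by first selecting a subset of variables
sws <- select(sw, name:mass, birth_year:species)

# Add a variable from scratch (id) and one computed from height (in cm)
sws_new <- sws %>%
  mutate(id = 1:nrow(sws),
         height_feet = 0.0328084 * height)

sws_new
```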

#> # A tibble: 87 x 8
#>    name               height  mass birth_year sex    gender    homeworld species
#>    <chr>               <int> <dbl>      <dbl> <chr>  <chr>     <chr>     <chr>  
#>  1 Luke Skywalker        172    77       19   male   masculine Tatooine  Human  
#>  2 C-3PO                 167    75      112   none   masculine Tatooine  Droid  
#>  3 R2-D2                  96    32       33   none   masculine Naboo     Droid  
#>  4 Darth Vader           202   136       41.9 male   masculine Tatooine  Human  
#>  5 Leia Organa           150    49       19   female feminine  Alderaan  Human  
#>  6 Owen Lars             178   120       52   male   masculine Tatooine  Human  
#>  7 Beru Whitesun lars    165    75       47   female feminine  Tatooine  Human  
#>  8 R5-D4                  97    32       NA   none   masculine Tatooine  Droid  
#>  9 Biggs Darklighter     183    84       24   male   masculine Tatooine  Human  
#> 10 Obi-Wan Kenobi        182    77       57   male   masculine Stewjon   Human  
#> # … with 77 more rows
#> # A tibble: 87 x 10
#>    name  height  mass birth_year sex   gender homeworld species    id
#>    <chr>  <int> <dbl>      <dbl> <chr> <chr>  <chr>     <chr>   <int>
#>  1 Luke…    172    77       19   male  mascu… Tatooine  Human       1
#>  2 C-3PO    167    75      112   none  mascu… Tatooine  Droid       2
#>  3 R2-D2     96    32       33   none  mascu… Naboo     Droid       3
#>  4 Dart…    202   136       41.9 male  mascu… Tatooine  Human       4
#>  5 Leia…    150    49       19   fema… femin… Alderaan  Human       5
#>  6 Owen…    178   120       52   male  mascu… Tatooine  Human       6
#>  7 Beru…    165    75       47   fema… femin… Tatooine  Human       7
#>  8 R5-D4     97    32       NA   none  mascu… Tatooine  Droid       8
#>  9 Bigg…    183    84       24   male  mascu… Tatooine  Human       9
#> 10 Obi-…    182    77       57   male  mascu… Stewjon   Human      10
#> # … with 77 more rows, and 1 more variable: height_feet <dbl>
#> # A tibble: 87 x 10
#>    name  height  mass birth_year sex   gender homeworld species    id
#>    <chr>  <int> <dbl>      <dbl> <chr> <chr>  <chr>     <chr>   <int>
#>  1 Luke…    172    77       19   male  mascu… Tatooine  Human       1
#>  2 C-3PO    167    75      112   none  mascu… Tatooine  Droid       2
#>  3 R2-D2     96    32       33   none  mascu… Naboo     Droid       3
#>  4 Dart…    202   136       41.9 male  mascu… Tatooine  Human       4
#>  5 Leia…    150    49       19   fema… femin… Alderaan  Human       5
#>  6 Owen…    178   120       52   male  mascu… Tatooine  Human       6
#>  7 Beru…    165    75       47   fema… femin… Tatooine  Human       7
#>  8 R5-D4     97    32       NA   none  mascu… Tatooine  Droid       8
#>  9 Bigg…    183    84       24   male  mascu… Tatooine  Human       9
#> 10 Obi-…    182    77       57   male  mascu… Stewjon   Human      10
#> # … with 77 more rows, and 1 more variable: height_feet <dbl>

A closely related dplyr verb is transmute, which only keeps computed variables and drops all other ones:
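A sketch of a corresponding transmute call, again assuming a reduced table sws and a conversion factor of 0.0328084 feet per cm:

```r
library(dplyr)
sw <- dplyr::starwars
sws <- select(sw, name:mass, birth_year:species)

# Compute the same 2 variables, but drop all existing ones
only_new <- sws %>%
  transmute(id = 1:nrow(sws),
            height_feet = 0.0328084 * height)

names(only_new)
```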

#> # A tibble: 87 x 2
#>       id height_feet
#>    <int>       <dbl>
#>  1     1        5.64
#>  2     2        5.48
#>  3     3        3.15
#>  4     4        6.63
#>  5     5        4.92
#>  6     6        5.84
#>  7     7        5.41
#>  8     8        3.18
#>  9     9        6.00
#> 10    10        5.97
#> # … with 77 more rows

Although mutate and transmute compute the same variables, mutate is of an incremental nature (by adding new variables to the existing table), whereas transmute drastically changes the table (by only keeping new variables). For most purposes, adding new columns to the existing table is perfectly fine.

Interestingly, a variable computed by mutate can immediately be used for computing another variable within the same call:
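A sketch of such a call computes the body mass index (BMI, as mass in kg divided by the squared height in m) and then classifies it. The cut-off values of 18.5 and 30 are assumptions for illustration and cannot be inferred from the output below:

```r
library(dplyr)
sw <- dplyr::starwars
sws <- select(sw, name:mass, birth_year:species)

bmi <- sws %>%
  mutate(BMI = mass / (height / 100)^2,     # height is provided in cm
         BMI_low  = BMI < 18.5,             # assumed cut-off
         BMI_high = BMI > 30,               # assumed cut-off
         BMI_norm = !BMI_low & !BMI_high)   # uses variables computed above

bmi
```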

#> # A tibble: 87 x 12
#>    name  height  mass birth_year sex   gender homeworld species   BMI BMI_low
#>    <chr>  <int> <dbl>      <dbl> <chr> <chr>  <chr>     <chr>   <dbl> <lgl>  
#>  1 Luke…    172    77       19   male  mascu… Tatooine  Human    26.0 FALSE  
#>  2 C-3PO    167    75      112   none  mascu… Tatooine  Droid    26.9 FALSE  
#>  3 R2-D2     96    32       33   none  mascu… Naboo     Droid    34.7 FALSE  
#>  4 Dart…    202   136       41.9 male  mascu… Tatooine  Human    33.3 FALSE  
#>  5 Leia…    150    49       19   fema… femin… Alderaan  Human    21.8 FALSE  
#>  6 Owen…    178   120       52   male  mascu… Tatooine  Human    37.9 FALSE  
#>  7 Beru…    165    75       47   fema… femin… Tatooine  Human    27.5 FALSE  
#>  8 R5-D4     97    32       NA   none  mascu… Tatooine  Droid    34.0 FALSE  
#>  9 Bigg…    183    84       24   male  mascu… Tatooine  Human    25.1 FALSE  
#> 10 Obi-…    182    77       57   male  mascu… Stewjon   Human    23.2 FALSE  
#> # … with 77 more rows, and 2 more variables: BMI_high <lgl>, BMI_norm <lgl>
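
A sketch of a mutate command that produces a table of this form. The BMI formula (mass in kg divided by squared height in m) agrees with the values shown, but the cut-off values of 18.5 and 30 and the definition of BMI_norm are assumptions made for illustration.

```r
library(dplyr)

sw <- dplyr::starwars

sw_bmi <- sw %>%
  select(name:mass, birth_year:species) %>%   # keep 8 basic variables
  mutate(BMI = mass / (height / 100)^2,       # height in cm, converted to m
         BMI_low  = BMI < 18.5,               # uses BMI from the line above (assumed cut-off)
         BMI_high = BMI > 30,                 # assumed cut-off
         BMI_norm = !BMI_low & !BMI_high)     # uses the 2 variables just computed

sw_bmi
```

Within one mutate command, the variables are computed from top to bottom, so each line can refer to any variable defined above it.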

As mutate typically changes variables (by computing new ones), it seems appropriately named. However, note that mutate typically does not change the identity of the cases (rows) of a data table. See ?dplyr::mutate for more help and additional examples.

Details

Note some details on mutate and transmute:

  • mutate computes new variables (columns) and adds them to existing ones, while transmute drops existing ones.

  • Each mutate command specifies a new variable name (without quotes), followed by = and a rule for computing the new variable from existing ones.

  • Again, variable names are unquoted.

  • Multiple mutate steps are separated by commas, each of which creates a new variable.

  • See http://r4ds.had.co.nz/transform.html#mutate-funs for useful functions for creating new variables.

Practice

Compute a new variable mass_pound from mass (in kg) and the age of each individual in sw relative to Yoda’s age. (Note that the variable birth_year is provided in years BBY, i.e., Before Battle of Yavin.)

3.2.5 summarise computes summaries

summarise computes a function for a specified variable and collapses the values of the specified variable (i.e., the rows of a specified column) to a single value:

#> # A tibble: 1 x 1
#>   mn_mass
#>     <dbl>
#> 1    97.3
#> # A tibble: 1 x 1
#>   mn_mass
#>     <dbl>
#> 1    97.3
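
The 2 identical outputs above correspond to 2 equivalent ways of writing the same command, without and with the pipe. A minimal sketch, assuming that the mean is computed with na.rm = TRUE (as mass contains missing values):

```r
library(dplyr)

sw <- dplyr::starwars

# Without the pipe:
mn_1 <- summarise(sw, mn_mass = mean(mass, na.rm = TRUE))

# With the pipe:
mn_2 <- sw %>%
  summarise(mn_mass = mean(mass, na.rm = TRUE))

mn_1
mn_2
```

Without na.rm = TRUE, the mean of a variable with missing values would be NA.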

In most cases, we want to compute not just one summary statistic of a variable (e.g., mass), but several ones:

#> # A tibble: 1 x 6
#>   n_mass mn_mass md_mass sd_mass max_mass big_mass
#>    <int>   <dbl>   <dbl>   <dbl>    <dbl> <lgl>   
#> 1     59    97.3      79    169.     1358 TRUE
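
A sketch of a summarise command that returns several statistics of mass at once. The variable names follow the output above, but the definitions of n_mass (the number of non-missing values) and big_mass (whether any mass exceeds 1000) are assumptions.

```r
library(dplyr)

sw <- dplyr::starwars

mass_stats <- sw %>%
  summarise(n_mass   = sum(!is.na(mass)),            # count of non-missing values (assumed)
            mn_mass  = mean(mass, na.rm = TRUE),
            md_mass  = median(mass, na.rm = TRUE),
            sd_mass  = sd(mass, na.rm = TRUE),
            max_mass = max(mass, na.rm = TRUE),
            big_mass = any(mass > 1000, na.rm = TRUE))  # assumed threshold

mass_stats
```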

Similarly, we often want to obtain summary information about more than one variable. For instance, we may want to know basic statistics about the height and mass variables in our sw data, and count some characteristics of character variables:

#> # A tibble: 1 x 9
#>   n_height mn_height sd_height n_mass mn_mass sd_mass n_names n_species n_worlds
#>      <int>     <dbl>     <dbl>  <int>   <dbl>   <dbl>   <int>     <int>    <int>
#> 1       81      174.      34.8     59    97.3    169.      87        38       49
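
One way to obtain such a table is sketched below. It assumes that the n_ variables for height and mass count non-missing values, whereas those for the character variables count distinct values with n_distinct (which counts NA as a value of its own).

```r
library(dplyr)

sw <- dplyr::starwars

sw_stats <- sw %>%
  summarise(# numeric variables: count, mean, and standard deviation
            n_height  = sum(!is.na(height)),
            mn_height = mean(height, na.rm = TRUE),
            sd_height = sd(height, na.rm = TRUE),
            n_mass    = sum(!is.na(mass)),
            mn_mass   = mean(mass, na.rm = TRUE),
            sd_mass   = sd(mass, na.rm = TRUE),
            # character variables: number of distinct values
            n_names   = n_distinct(name),
            n_species = n_distinct(species),
            n_worlds  = n_distinct(homeworld))

sw_stats
```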

While summarise provides many different summary statistics by itself, it is even more useful in combination with group_by (discussed next). See ?dplyr::summarise for more help and additional examples.

Details

Note some details on summarise:

  • summarise collapses multiple values into one value and returns a new tibble with one column per computed summary (and a single row, unless the data is grouped).

  • Each summarise step specifies a new variable name (without quotes), followed by =, and a function for computing the new variable from existing ones.

  • Multiple summarise steps are separated by commas.

  • Again, variable names are unquoted.

  • See https://dplyr.tidyverse.org/reference/summarise.html for examples and useful functions in combination with summarise.

Practice

  1. Someone speculates that — on average — humans have longer names than droids, but droids are heavier than humans. Can you compute some summaries (e.g., by combining filter with summarise commands) to check this?
    Hint: The length of a character string s can be computed with nchar(s).
#> # A tibble: 1 x 6
#>   n_humans mn_name_len sd_name_len n_mass mn_mass sd_mass
#>      <int>       <dbl>       <dbl>  <int>   <dbl>   <dbl>
#> 1       35        11.3        4.11     22    82.8    19.4
#> # A tibble: 1 x 6
#>   n_droids mn_name_len sd_name_len n_mass mn_mass sd_mass
#>      <int>       <dbl>       <dbl>  <int>   <dbl>   <dbl>
#> 1        6        4.83       0.983      4    69.8    51.0
  • It looks like the average name length is more than twice as high for humans (but note that there are only 6 droids in the dataset).

  • The hypothesis about droids being heavier on average is wrong, as the mean differences point in the opposite direction (but both distributions contain missing values and show large variations).

  2. Apply all summary functions mentioned in ?dplyr::summarise to the sw dataset.

3.2.6 group_by aggregates variables

Using group_by does not change the data, but changes the unit of aggregation for other commands, which is particularly useful in combination with mutate and summarise.

When used by itself, group_by returns the same tibble in a grouped form. For instance, the following commands will group sw by species:

#> # A tibble: 87 x 8
#> # Groups:   species [38]
#>    name               height  mass birth_year sex    gender    homeworld species
#>    <chr>               <int> <dbl>      <dbl> <chr>  <chr>     <chr>     <chr>  
#>  1 Luke Skywalker        172    77       19   male   masculine Tatooine  Human  
#>  2 C-3PO                 167    75      112   none   masculine Tatooine  Droid  
#>  3 R2-D2                  96    32       33   none   masculine Naboo     Droid  
#>  4 Darth Vader           202   136       41.9 male   masculine Tatooine  Human  
#>  5 Leia Organa           150    49       19   female feminine  Alderaan  Human  
#>  6 Owen Lars             178   120       52   male   masculine Tatooine  Human  
#>  7 Beru Whitesun lars    165    75       47   female feminine  Tatooine  Human  
#>  8 R5-D4                  97    32       NA   none   masculine Tatooine  Droid  
#>  9 Biggs Darklighter     183    84       24   male   masculine Tatooine  Human  
#> 10 Obi-Wan Kenobi        182    77       57   male   masculine Stewjon   Human  
#> # … with 77 more rows
#> # A tibble: 87 x 8
#> # Groups:   species [38]
#>    name               height  mass birth_year sex    gender    homeworld species
#>    <chr>               <int> <dbl>      <dbl> <chr>  <chr>     <chr>     <chr>  
#>  1 Luke Skywalker        172    77       19   male   masculine Tatooine  Human  
#>  2 C-3PO                 167    75      112   none   masculine Tatooine  Droid  
#>  3 R2-D2                  96    32       33   none   masculine Naboo     Droid  
#>  4 Darth Vader           202   136       41.9 male   masculine Tatooine  Human  
#>  5 Leia Organa           150    49       19   female feminine  Alderaan  Human  
#>  6 Owen Lars             178   120       52   male   masculine Tatooine  Human  
#>  7 Beru Whitesun lars    165    75       47   female feminine  Tatooine  Human  
#>  8 R5-D4                  97    32       NA   none   masculine Tatooine  Droid  
#>  9 Biggs Darklighter     183    84       24   male   masculine Tatooine  Human  
#> 10 Obi-Wan Kenobi        182    77       57   male   masculine Stewjon   Human  
#> # … with 77 more rows
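
The 2 identical outputs correspond to calling group_by without and with the pipe. A minimal sketch, assuming that sw holds the full starwars data (the output above shows a version of sw that contains only 8 of its columns):

```r
library(dplyr)

sw <- dplyr::starwars

# Without the pipe:
sw_grouped_1 <- group_by(sw, species)

# With the pipe:
sw_grouped_2 <- sw %>% group_by(species)

sw_grouped_2            # same rows as sw, but with a "Groups:" header
n_groups(sw_grouped_2)  # number of groups
```

The data itself is unchanged: only the grouping information shown in the header of the printed tibble is new.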

This seems rather mundane, but becomes very powerful when combining the group_by statement with a subsequent mutate or summarise command.

3.2.6.1 Grouped mutates

When combining group_by with a subsequent mutate, the scope of the variables computed by mutate is the group defined by group_by. For instance, the following pipe counts the number of individuals of each species and computes their mean height within each species:

#> # A tibble: 87 x 10
#> # Groups:   species [38]
#>    name  height  mass birth_year sex   gender homeworld species n_individuals
#>    <chr>  <int> <dbl>      <dbl> <chr> <chr>  <chr>     <chr>           <int>
#>  1 Luke…    172    77       19   male  mascu… Tatooine  Human              35
#>  2 C-3PO    167    75      112   none  mascu… Tatooine  Droid               6
#>  3 R2-D2     96    32       33   none  mascu… Naboo     Droid               6
#>  4 Dart…    202   136       41.9 male  mascu… Tatooine  Human              35
#>  5 Leia…    150    49       19   fema… femin… Alderaan  Human              35
#>  6 Owen…    178   120       52   male  mascu… Tatooine  Human              35
#>  7 Beru…    165    75       47   fema… femin… Tatooine  Human              35
#>  8 R5-D4     97    32       NA   none  mascu… Tatooine  Droid               6
#>  9 Bigg…    183    84       24   male  mascu… Tatooine  Human              35
#> 10 Obi-…    182    77       57   male  mascu… Stewjon   Human              35
#> # … with 77 more rows, and 1 more variable: mn_height <dbl>
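
A sketch of such a pipe, using the variable names from the output above (and assuming that the mean ignores missing values):

```r
library(dplyr)

sw <- dplyr::starwars

sw_by_species <- sw %>%
  group_by(species) %>%
  mutate(n_individuals = n(),                        # group size
         mn_height = mean(height, na.rm = TRUE))     # mean height within each species

sw_by_species
```

Here, n() counts the rows of the current group, so every human receives the same value of n_individuals, every droid receives another, and so on.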

As before, the new variables (here: n_individuals and mn_height) are added to the tibble, but now their values are computed relative to the group_by variable (here: species) as the unit of aggregation. Interestingly, this implies that there exists no such thing as “the mean” of a variable, as any mean is always relative to some unit of aggregation. By changing the unit of aggregation, we can compute many different means for the same variable. For instance, we can compute the mean height of individuals overall, by species, by gender, etc.:

#> # A tibble: 87 x 6
#> # Groups:   name [87]
#>    name               height mn_height_1 mn_height_2 mn_height_3 mn_height_4
#>    <chr>               <int>       <dbl>       <dbl>       <dbl>       <dbl>
#>  1 Luke Skywalker        172        174.        177.        177.         172
#>  2 C-3PO                 167        174.        131.        177.         167
#>  3 R2-D2                  96        174.        131.        177.          96
#>  4 Darth Vader           202        174.        177.        177.         202
#>  5 Leia Organa           150        174.        177.        165.         150
#>  6 Owen Lars             178        174.        177.        177.         178
#>  7 Beru Whitesun lars    165        174.        177.        165.         165
#>  8 R5-D4                  97        174.        131.        177.          97
#>  9 Biggs Darklighter     183        174.        177.        177.         183
#> 10 Obi-Wan Kenobi        182        174.        177.        177.         182
#> # … with 77 more rows
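
A sketch of a pipe that computes these 4 means by regrouping the data before each mutate. The grouping variables (none, species, gender, and name) are inferred from the values and the header of the output above.

```r
library(dplyr)

sw <- dplyr::starwars

sw_means <- sw %>%
  mutate(mn_height_1 = mean(height, na.rm = TRUE)) %>%   # 1. overall (no grouping)
  group_by(species) %>%
  mutate(mn_height_2 = mean(height, na.rm = TRUE)) %>%   # 2. by species
  group_by(gender) %>%
  mutate(mn_height_3 = mean(height, na.rm = TRUE)) %>%   # 3. by gender
  group_by(name) %>%
  mutate(mn_height_4 = mean(height, na.rm = TRUE)) %>%   # 4. by name (i.e., individual)
  select(name, height, mn_height_1:mn_height_4)

sw_means
```

Each call to group_by replaces the previous grouping. As every name occurs only once, the last mean is identical to the height of each individual.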

3.2.6.2 Grouped summaries

Our summarise commands above yielded some summary of one or several variables as 1 line of output. When combining group_by with a subsequent summarise, we obtain the corresponding summary for each group:

#> # A tibble: 38 x 4
#>    species  n_individuals mn_height mn_mass
#>    <chr>            <int>     <dbl>   <dbl>
#>  1 Quermian             1      264      NaN
#>  2 Wookiee              2      231      124
#>  3 Kaminoan             2      221       88
#>  4 Kaleesh              1      216      159
#>  5 Gungan               3      209.      74
#>  6 Pau'an               1      206       80
#>  7 Besalisk             1      198      102
#>  8 Cerean               1      198       82
#>  9 Chagrian             1      196      NaN
#> 10 Nautolan             1      196       87
#> # … with 28 more rows
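
A sketch of a pipe that yields this summary table (assuming that both means ignore missing values):

```r
library(dplyr)

sw <- dplyr::starwars

species_stats <- sw %>%
  group_by(species) %>%
  summarise(n_individuals = n(),
            mn_height = mean(height, na.rm = TRUE),
            mn_mass = mean(mass, na.rm = TRUE)) %>%
  arrange(desc(mn_height))   # tallest species first

species_stats
```

A mean of NaN results when all values of a group are missing, so that nothing remains to be averaged after removing them.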

Note that the group_by followed by summarise returns a new tibble, with 38 rows (= groups of species) and

  • 1 column of the group variable (here species) and
  • 3 columns of the 3 newly summarised variables.

Here, we also arranged this output tibble by descending means of height.

3.2.6.3 Grouping by multiple variables

Using group_by with multiple variables yields a tibble containing the combination of all variable levels. For instance, to find out how many combinations of hair_color and eye_color exist, we could group sw by both variables:

#> # A tibble: 87 x 14
#> # Groups:   hair_color, eye_color [35]
#>    name  height  mass hair_color skin_color eye_color birth_year sex   gender
#>    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#>  1 Luke…    172    77 blond      fair       blue            19   male  mascu…
#>  2 C-3PO    167    75 <NA>       gold       yellow         112   none  mascu…
#>  3 R2-D2     96    32 <NA>       white, bl… red             33   none  mascu…
#>  4 Dart…    202   136 none       white      yellow          41.9 male  mascu…
#>  5 Leia…    150    49 brown      light      brown           19   fema… femin…
#>  6 Owen…    178   120 brown, gr… light      blue            52   male  mascu…
#>  7 Beru…    165    75 brown      light      blue            47   fema… femin…
#>  8 R5-D4     97    32 <NA>       white, red red             NA   none  mascu…
#>  9 Bigg…    183    84 black      light      brown           24   male  mascu…
#> 10 Obi-…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
#> # … with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>

A common application of using group_by with multiple variables is to count the number of cases (here: individuals) in each sub-group:

#> # A tibble: 35 x 3
#> # Groups:   hair_color, eye_color [35]
#>    hair_color eye_color     n
#>    <chr>      <chr>     <int>
#>  1 black      brown         9
#>  2 brown      brown         9
#>  3 none       black         9
#>  4 brown      blue          7
#>  5 none       orange        7
#>  6 none       yellow        6
#>  7 blond      blue          3
#>  8 none       blue          3
#>  9 none       red           3
#> 10 black      blue          2
#> # … with 25 more rows
#> # A tibble: 35 x 3
#> # Groups:   hair_color [13]
#>    hair_color eye_color     n
#>    <chr>      <chr>     <int>
#>  1 black      brown         9
#>  2 brown      brown         9
#>  3 none       black         9
#>  4 brown      blue          7
#>  5 none       orange        7
#>  6 none       yellow        6
#>  7 blond      blue          3
#>  8 none       blue          3
#>  9 none       red           3
#> 10 black      blue          2
#> # … with 25 more rows
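
The 2 outputs above differ only in their grouping header, which suggests 2 different ways of counting. The following sketch assumes that the first uses count and the second uses summarise with n():

```r
library(dplyr)

sw <- dplyr::starwars

# (a) count() keeps both grouping variables:
counts_1 <- sw %>%
  group_by(hair_color, eye_color) %>%
  count() %>%
  arrange(desc(n))

# (b) summarise() drops the last grouping variable (here: eye_color):
counts_2 <- sw %>%
  group_by(hair_color, eye_color) %>%
  summarise(n = n()) %>%
  arrange(desc(n))

counts_1
counts_2
```

Both versions return the same counts. They differ only in the grouping of the resulting tibble, which matters when further grouped commands follow in the same pipe.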

See ?dplyr::group_by for more help and additional examples.

Details

Note some details on group_by:

  • group_by changes the unit of aggregation for other commands (especially mutate and summarise).

  • Again, variable names are unquoted.

  • When using group_by with multiple variables, they are separated by commas.

  • Using group_by with mutate results in a tibble that has the same number of cases (rows) as the original tibble. By contrast, using group_by with summarise results in a new tibble with all combinations of variable levels as its cases (rows).

Practice

  1. In the last practice section above, we used 2 combinations of filter and summarise to check the hypotheses that — on average — humans have longer names than droids, but droids are heavier than humans. Now that we learned about group_by, try to perform this check in 1 pipe.
#> # A tibble: 2 x 7
#>   species n_cases mn_name_len sd_name_len n_mass mn_mass sd_mass
#>   <chr>     <int>       <dbl>       <dbl>  <int>   <dbl>   <dbl>
#> 1 Droid         6        4.83       0.983      4    69.8    51.0
#> 2 Human        35       11.3        4.11      22    82.8    19.4
  2. Yoda says: “Taller creatures heavier are than smaller ones.”
    Test his hypothesis for the sw dataset in 1 pipe by
  • selecting only the relevant variables name, height, and mass,
  • computing variables for the median height and a logical variable is_tall that is TRUE if and only if an individual is taller than the median height,
  • grouping the data by is_tall;
  • counting the cases and computing the mean mass for tall vs. non-tall individuals.
#> # A tibble: 3 x 4
#>   is_tall     n mn_mass sd_mass
#>   <lgl>   <int>   <dbl>   <dbl>
#> 1 FALSE      43   103.    234. 
#> 2 TRUE       38    91.1    25.7
#> 3 NA          6   NaN      NA
  • Yoda seems wrong: The 38 tall creatures (with a height above the median) have an average mass value of 91.1 kg, whereas the 43 smaller creatures (with a height below or equal to the median) have an average mass value of 103 kg.
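
One possible solution that follows the 4 steps listed above is sketched here. The name md_height is a hypothetical choice for the helper variable, and sd_mass is included only because it appears in the output above.

```r
library(dplyr)

sw <- dplyr::starwars

tall_stats <- sw %>%
  select(name, height, mass) %>%                       # 1. relevant variables
  mutate(md_height = median(height, na.rm = TRUE),     # 2. median height (hypothetical name)
         is_tall = height > md_height) %>%             #    NA if height is missing
  group_by(is_tall) %>%                                # 3. group by is_tall
  summarise(n = n(),                                   # 4. count cases and summarise mass
            mn_mass = mean(mass, na.rm = TRUE),
            sd_mass = sd(mass, na.rm = TRUE))

tall_stats
```

The third row of the result (with is_tall being NA) collects the individuals whose height is unknown.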

You just smoked a median-split analysis in a pipe — congratulations! However, before getting too excited, we should try to understand why our results came out in this way. To explain Yoda’s mistake, let’s look at a scatterplot that plots mass as a function of height (and colors the points by the value of our is_tall variable):

Figure 3.2: Scatterplot of mass by height in the full sw dataset.

Note that we used coord_cartesian to restrict the range of y values shown to ylim = c(0, 1700).

The scatterplot shows that the sw data contains a blatant outlier: Jabba Desilijic Tiure, the crime lord a.k.a. ‘Jabba the Hutt’, has a mass of 1358 kg despite a height of only 175 cm, which is below the median. Only considering creatures with a mass up to 170 kg suggests that Yoda’s hypothesis holds when this outlier is excluded:

Figure 3.3: Scatterplot of mass by height without Jabba the Hutt.

This is yet another instance of the lesson taught by Anscombe’s quartet (in Section 2.1): We should never interpret the results of some statistical calculation without properly inspecting the underlying data.

References

Bache, S. M., & Wickham, H. (2014). magrittr: A forward-pipe operator for R. Retrieved from https://CRAN.R-project.org/package=magrittr

Wickham, H., François, R., Henry, L., & Müller, K. (2020). dplyr: A grammar of data manipulation. Retrieved from https://CRAN.R-project.org/package=dplyr