6.3 Transforming tables with dplyr

The dplyr package (Wickham et al., 2021) is a core component of the tidyverse. Like ggplot2, dplyr is widely used by people who otherwise do not reside within the tidyverse. But as dplyr is a package that is both immensely useful and embodies many of the tidyverse principles in paradigmatic form, we can think of it as the primary citizen of the tidyverse.

dplyr provides a set of commands — best thought of as verbs — that allow slicing and dicing rectangular datasets and computing many summary statistics. While each individual command is simple, they can be combined into a powerful language of data manipulation. In combination with other functions, using dplyr quickly provides us with quantitative overviews of datasets that amount to what psychologists often call descriptive statistics.

The following sections merely provide a summary of essential dplyr functions. More extensive resources for this section include:

6.3.1 The function of pliers

The name of the dplyr package is inspired by “pliers”:

Pliers are tools for pulling out things or twisting their shapes. (Image by Evan-Amos, via Wikimedia Commons.)

Figure 6.2: Pliers are tools for pulling out things or twisting their shapes. (Image by Evan-Amos, via Wikimedia Commons.)

Pliers are tools for pulling out parts and tugging, tweaking, or twisting things into different shapes (see Figure 6.2). In our current context, the thing to tweak is a rectangular set of data (as an R data frame or tibble) and the dplyr tool allows manipulating this table into other tables that contain parts, additional or fewer variables, or provide summary information.

dplyr = MS Excel + control

For users that are familiar with basic spreadsheet in MS Excel: The dplyr functions allow similar manipulations of tabular data in R. However, spreadsheet users are typically solving many tasks by clicking interface buttons, entering simple formulas, and many copy-and-paste operations. While this can be simple and engaging, it is terribly error prone. The main problem with spreadsheets is that the process, typically consisting of many small interactive steps, remains transient and is lost, as only the resulting data table is stored. If a sequence of 100 steps included a minor error on step 29, we often need to start from scratch. Thus, it is very easy to make mistakes and almost impossible to recover from them if they are not noticed immediately.

By contrast, dplyr provides a series of simple commands for solving tasks like arranging or selecting rows or columns, categorizing variables into groups, and computing simple summary tables. Rather than incrementally constructing a spreadsheet and many implicit cut-and-paste operations, dplyr uses sequences of simple commands that explicate the entire process. In the spirit of reproducible research (see Section 1.3), this documents precisely what is being done and allows making corrections later.

6.3.2 Essential dplyr functions

The following sections will briefly illustrate essential dplyr functions and their corresponding tasks:

  1. arrange() sorts cases (rows);
  2. filter() and slice() select cases (rows) by logical conditions;
  3. select() selects and reorders variables (columns);
  4. mutate() and transmute() compute new variables (columns) out of existing ones;
  5. summarise() collapses multiple values of a variable (rows of a column) to a single one;
  6. group_by() changes the unit of aggregation (in combination with mutate() and summarise()).

Learning dplyr essentially consists in memorizing these terms like the verbs of a new language. Studying and typing a few examples of each command makes it pretty easy to combine them into powerful pipes that allow slicing, dicing, and summarizing large data tables.

Examples

See Section 3.2: Essential dplyr commands for examples using the starwars data from the dplyr package:

sw <- dplyr::starwars

To provide some different examples here, we use the storms data from the dplyr package:

st <- dplyr::storms
dim(st)
#> [1] 10010    13

The data contains 10010 cases (rows) and 13 variables (columns). See ?dplyr::storms for a description of the data and its variables.

Using arrange() to sort cases (rows)

The arrange() function keeps the same data, but arranges its cases (rows) by the variable (column) mentioned:

# Re-arrange data:
st %>%
  arrange(month)
#> # A tibble: 10,010 x 13
#>    name   year month   day  hour   lat  long status category  wind pressure
#>    <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>  <ord>    <int>    <int>
#>  1 Zeta   2006     1     1     0  25.6 -38.3 tropi… 0           50      997
#>  2 Zeta   2006     1     1     6  25.4 -38.4 tropi… 0           50      997
#>  3 Zeta   2006     1     1    12  25.2 -38.5 tropi… 0           50      997
#>  4 Zeta   2006     1     1    18  25   -38.6 tropi… 0           55      994
#>  5 Zeta   2006     1     2     0  24.6 -38.9 tropi… 0           55      994
#>  6 Zeta   2006     1     2     6  24.3 -39.7 tropi… 0           50      997
#>  7 Zeta   2006     1     2    12  23.8 -40.4 tropi… 0           45     1000
#>  8 Zeta   2006     1     2    18  23.6 -40.8 tropi… 0           50      997
#>  9 Zeta   2006     1     3     0  23.4 -41   tropi… 0           55      994
#> 10 Zeta   2006     1     3     6  23.3 -41.3 tropi… 0           55      994
#> # … with 10,000 more rows, and 2 more variables: ts_diameter <dbl>,
#> #   hu_diameter <dbl>

Arranging rows by multiple variables is also possible:

# Re-arrange data (by multiple variables):
st %>%
  arrange(lat, long)
#> # A tibble: 10,010 x 13
#>    name   year month   day  hour   lat  long status category  wind pressure
#>    <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>  <ord>    <int>    <int>
#>  1 Isid…  1990     9     4     0   7.2 -23.4 tropi… -1          25     1010
#>  2 Isid…  1990     9     4     6   7.4 -25.1 tropi… -1          25     1010
#>  3 Pablo  1995    10     4    18   8.3 -31.4 tropi… -1          30     1009
#>  4 Pablo  1995    10     5     0   8.4 -32.8 tropi… -1          30     1009
#>  5 Isid…  1990     9     4    12   8.4 -26.7 tropi… -1          25     1009
#>  6 Isid…  1996     9    24    12   8.6 -23.3 tropi… -1          25     1008
#>  7 Arth…  1990     7    22     6   8.8 -41.9 tropi… -1          25     1010
#>  8 Arth…  1990     7    22    12   8.9 -43.2 tropi… -1          25     1010
#>  9 Joan   1988    10    10    18   8.9 -42.2 tropi… -1          25     1010
#> 10 Ivan   2004     9     4    12   8.9 -38.2 tropi… 0           50      997
#> # … with 10,000 more rows, and 2 more variables: ts_diameter <dbl>,
#> #   hu_diameter <dbl>

The arrange() function sorts text variables (of type “character”) in alphabetical and numeric variables (of type “integer” or “double”) in ascending order. Use desc() to sort a variable in the opposite order:

# Re-arrange data (in descending order):
st %>%
  arrange(desc(name), desc(lat))
#> # A tibble: 10,010 x 13
#>    name   year month   day  hour   lat  long status category  wind pressure
#>    <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>  <ord>    <int>    <int>
#>  1 Zeta   2005    12    31     6  25.7 -37.6 tropi… 0           50      997
#>  2 Zeta   2005    12    31    12  25.7 -37.9 tropi… 0           50      997
#>  3 Zeta   2005    12    31    18  25.7 -38.1 tropi… 0           45     1000
#>  4 Zeta   2005    12    31     0  25.6 -37.3 tropi… 0           45     1000
#>  5 Zeta   2006     1     1     0  25.6 -38.3 tropi… 0           50      997
#>  6 Zeta   2006     1     1     6  25.4 -38.4 tropi… 0           50      997
#>  7 Zeta   2005    12    30    18  25.2 -37   tropi… 0           45     1000
#>  8 Zeta   2006     1     1    12  25.2 -38.5 tropi… 0           50      997
#>  9 Zeta   2006     1     1    18  25   -38.6 tropi… 0           55      994
#> 10 Zeta   2005    12    30    12  24.7 -36.6 tropi… 0           45     1002
#> # … with 10,000 more rows, and 2 more variables: ts_diameter <dbl>,
#> #   hu_diameter <dbl>

Note that the variable names specified in arrange() — or in other dplyr functions — are not enclosed in quotation marks. This may seem a bit strange at first, but becomes totally intuitive after typing a few commands.

Using filter() or slice() to select cases (rows)

Many questions concern only a subset of the cases (rows) of our data. In these instances, a typical first step consists in filtering rows for particular values on one or more variables. For instance, the following command reduces the 10010 rows of the st data quite drastically by only including rows in which the wind speed exceeds 150 knots:

# Select cases (rows) based on a condition:
st %>% 
  filter(wind > 150)
#> # A tibble: 7 x 13
#>   name   year month   day  hour   lat  long status category  wind pressure
#>   <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>  <ord>    <int>    <int>
#> 1 Gilb…  1988     9    14     0  19.7 -83.8 hurri… 5          160      888
#> 2 Gilb…  1988     9    14     6  19.9 -85.3 hurri… 5          155      889
#> 3 Mitch  1998    10    26    18  16.9 -83.1 hurri… 5          155      905
#> 4 Mitch  1998    10    27     0  17.2 -83.8 hurri… 5          155      910
#> 5 Rita   2005     9    22     3  24.7 -87.3 hurri… 5          155      895
#> 6 Rita   2005     9    22     6  24.8 -87.6 hurri… 5          155      897
#> 7 Wilma  2005    10    19    12  17.3 -82.8 hurri… 5          160      882
#> # … with 2 more variables: ts_diameter <dbl>, hu_diameter <dbl>

As before, filter() can use multiple variables and include numeric and character variables:

# Select cases (rows) based on several conditions:
st %>% 
  filter(year > 2014, month == 9, status == "hurricane")
#> # A tibble: 4 x 13
#>   name   year month   day  hour   lat  long status category  wind pressure
#>   <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>  <ord>    <int>    <int>
#> 1 Fred   2015     9     1     0  17.4 -24.9 hurri… 1           65      991
#> 2 Joaq…  2015     9    30     6  25.4 -71.8 hurri… 1           65      978
#> 3 Joaq…  2015     9    30    12  24.9 -72.2 hurri… 1           70      971
#> 4 Joaq…  2015     9    30    18  24.4 -72.5 hurri… 1           80      961
#> # … with 2 more variables: ts_diameter <dbl>, hu_diameter <dbl>

A variant of filter() is slice(), which is used to select particular rows, which are either described by some number or some combination of a property and a number:

# Select cases (rows): 
st %>% slice_head(n = 3)  # select the first 3 cases
#> # A tibble: 3 x 13
#>   name   year month   day  hour   lat  long status category  wind pressure
#>   <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>  <ord>    <int>    <int>
#> 1 Amy    1975     6    27     0  27.5   -79 tropi… -1          25     1013
#> 2 Amy    1975     6    27     6  28.5   -79 tropi… -1          25     1013
#> 3 Amy    1975     6    27    12  29.5   -79 tropi… -1          25     1013
#> # … with 2 more variables: ts_diameter <dbl>, hu_diameter <dbl>
st %>% slice_tail(n = 3)  # select the last 3 cases
#> # A tibble: 3 x 13
#>   name   year month   day  hour   lat  long status category  wind pressure
#>   <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>  <ord>    <int>    <int>
#> 1 Kate   2015    11    11     6  35.2 -67.6 hurri… 1           70      985
#> 2 Kate   2015    11    11    12  36.2 -62.5 hurri… 1           75      980
#> 3 Kate   2015    11    11    18  37.6 -58.2 hurri… 1           65      980
#> # … with 2 more variables: ts_diameter <dbl>, hu_diameter <dbl>
st %>% slice_max(pressure, n = 3)  # select cases with 3 maximal values 
#> # A tibble: 3 x 13
#>   name   year month   day  hour   lat  long status category  wind pressure
#>   <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>  <ord>    <int>    <int>
#> 1 AL07…  2003     7    26    12  32.3 -82   tropi… -1          20     1022
#> 2 AL07…  2003     7    26    18  32.8 -82.6 tropi… -1          15     1022
#> 3 AL07…  2003     7    27     0  33   -83   tropi… -1          15     1022
#> # … with 2 more variables: ts_diameter <dbl>, hu_diameter <dbl>
st %>% slice_min(pressure, n = 3)  # select cases with 3 minimial values
#> # A tibble: 3 x 13
#>   name   year month   day  hour   lat  long status category  wind pressure
#>   <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>  <ord>    <int>    <int>
#> 1 Wilma  2005    10    19    12  17.3 -82.8 hurri… 5          160      882
#> 2 Gilb…  1988     9    14     0  19.7 -83.8 hurri… 5          160      888
#> 3 Gilb…  1988     9    14     6  19.9 -85.3 hurri… 5          155      889
#> # … with 2 more variables: ts_diameter <dbl>, hu_diameter <dbl>
st %>% slice_sample(n = 3)  # select 3 random cases
#> # A tibble: 3 x 13
#>   name   year month   day  hour   lat  long status category  wind pressure
#>   <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>  <ord>    <int>    <int>
#> 1 Joaq…  2015    10     2    16  23.6 -74.8 hurri… 3          110      940
#> 2 Dean   2001     8    27    18  40.6 -59.4 tropi… 0           60      994
#> 3 Chan…  1995     7    17     0  27.2 -69.4 tropi… 0           60      991
#> # … with 2 more variables: ts_diameter <dbl>, hu_diameter <dbl>

Note that filter() and slice() reduced the number of cases (rows), but left the number of variables (columns) intact. The complement is select(), which has the opposite effects.

Using select() to select variables (columns)

The select() function provides an easy way of selecting and re-arranging the variables (columns) of tables:

# Select some variables (columns): 
st %>%
  select(name, pressure, wind)
#> # A tibble: 10,010 x 3
#>    name  pressure  wind
#>    <chr>    <int> <int>
#>  1 Amy       1013    25
#>  2 Amy       1013    25
#>  3 Amy       1013    25
#>  4 Amy       1013    25
#>  5 Amy       1012    25
#>  6 Amy       1012    25
#>  7 Amy       1011    25
#>  8 Amy       1006    30
#>  9 Amy       1004    35
#> 10 Amy       1002    40
#> # … with 10,000 more rows

Important assets of select() are its additional features:

  • va:vx selects a range of variables (e.g., from va to vx);
  • !vy allows negative selections (e.g., selecting all variables except vy);
  • & or | selects the intersection or union of two sets of variables;
  • starts_with("abc") and ends_with("xyz") selects all variables whose names start or end with some characters (e.g., “abc” or “xyz”);
  • everything() selects all variables not selected yet (e.g., to re-order all variables).

Here are some typical examples for corresponding selections (adding slice_sample(n = 3) for showing only three random rows of the resulting table):

# Select a range of variables: 
st %>% select(name, year:day, lat:long) %>% slice_sample(n = 3)  
#> # A tibble: 3 x 6
#>   name   year month   day   lat  long
#>   <chr> <dbl> <dbl> <int> <dbl> <dbl>
#> 1 Gert   2005     7    25  21.8 -97.6
#> 2 Debby  1982     9    14  22.4 -71.8
#> 3 Lisa   2010     9    21  16.8 -31.9

# Select an intersection of negated variables:
st %>% select(!status & !ends_with("diameter")) %>% slice_sample(n = 3)  
#> # A tibble: 3 x 10
#>   name     year month   day  hour   lat  long category  wind pressure
#>   <chr>   <dbl> <dbl> <int> <dbl> <dbl> <dbl> <ord>    <int>    <int>
#> 1 Marilyn  1995     9    16     0  17.9 -64.7 2           90      962
#> 2 Harvey   1981     9    15     0  28.4 -62.6 4          115      946
#> 3 Hermine  1980     9    25     0  17.7 -95.5 0           45     1000

# Re-order the columns of a table (selecting everything):
st %>% select(year, name, lat:long, everything()) %>% slice_sample(n = 3)  
#> # A tibble: 3 x 13
#>    year name    lat  long month   day  hour status category  wind pressure
#>   <dbl> <chr> <dbl> <dbl> <dbl> <int> <dbl> <chr>  <ord>    <int>    <int>
#> 1  1996 Lili   23   -78.2    10    19     0 hurri… 2           85      975
#> 2  2012 Sandy  14   -77.6    10    23    18 tropi… 0           45      993
#> 3  2003 Isab…  14.4 -37.3     9     7    12 hurri… 1           65      987
#> # … with 2 more variables: ts_diameter <dbl>, hu_diameter <dbl>

When using a pipe for quickly answering some descriptive question, it is common practice to first apply some combination of filter() and slice() (for removing non-needed rows), and select() (for removing non-needed columns).

Using mutate() for computing new variables

A frequent task in data analysis consists in computing some new variable out of exisiting ones. Metaphorically, this can be viewed as “mutating” some table’s current information into slightly different form. The mutate() function of dplyr first names a new variable (e.g., var_new =) and then uses existing R expressions (e.g., arithmetic operators, functions, etc.) for computing the values of the new variable.

As an example, let’s combine some variables into a new date variable:

# Compute and add a new variable: 
st %>% 
  mutate(date = paste(year, month, day, sep = "-")) %>%
  select(name, date, year:day) %>%  # re-order and remove variables
  slice_sample(n = 5)  # show 5 random cases (rows)
#> # A tibble: 5 x 5
#>   name     date        year month   day
#>   <chr>    <chr>      <dbl> <dbl> <int>
#> 1 Danielle 1998-8-26   1998     8    26
#> 2 Fabian   1997-10-5   1997    10     5
#> 3 Klaus    1984-11-11  1984    11    11
#> 4 Hugo     1989-9-15   1989     9    15
#> 5 Nadine   2012-9-12   2012     9    12

In this example, we used the R function paste() to combine the three variables year, month, and day, into a single character variable. Note that the assignment to a new variable was signaled by the operator =, rather than R’s typical assignment operator <-. As new variables are added to the right of a table (as its last column), we added the select() function to re-order variables (and remove the ones not needed here). When actually working with dates and times later, we can use dedicated R functions for parsing date and time variables into variables that represent dates or date-times (see Chapter 10: Dates and times).

As with the other dplyr verbs, we can compute several new variables in one mutate() command by separating them by commas. If we wanted to get rid of the old variables, we could immediately remove them from the data by using transmute():

# Compute several new variables (replacing the old ones): 
st %>% 
  transmute(name = paste0(name, " (", status, ")"), 
            date = paste(year, month, day, sep = "-"),
            loc = paste0("(", lat, "; ", long, ")")) %>%
  slice_sample(n = 5)  # show 5 random cases (rows)
#> # A tibble: 5 x 3
#>   name                       date       loc          
#>   <chr>                      <chr>      <chr>        
#> 1 Ivan (tropical storm)      2004-9-23  (28.9; -92.2)
#> 2 Fred (tropical depression) 2015-9-5   (23.3; -41.5)
#> 3 Ingrid (tropical storm)    2007-9-13  (13.7; -46.7)
#> 4 Marco (hurricane)          1996-11-20 (13.8; -78.5)
#> 5 Ana (tropical depression)  1979-6-23  (14; -61.3)

However, it is generally a bad idea to throw away source data. Especially since any act of computing new variables can be error-prone, we do not recommend using transmute(). Instead, a careful data analyst prefers mutate() for creating new variables and immediately checks whether the newly created variables are correct. And keeping all original variables is rarely a problem, as we always can remove unwanted variables (by selecting only the ones needed) later.

As both our examples so far have used mutate() or transmute() to create new character variables, here’s an example on numerical data. If we wanted to add a variable that rounds the values of pressure to the nearest multiple of 10, we could first divide these values by 10, round the result to the nearest integer (using the R function round(x, 0)), before multiplying by 10 again:

# Compute and add a new numeric variable: 
st %>% 
  mutate(press_10 = round(pressure/10, 0) * 10) %>%
  select(name, pressure, press_10) %>% 
  slice_sample(n = 5)  # show 5 random cases (rows)
#> # A tibble: 5 x 3
#>   name    pressure press_10
#>   <chr>      <int>    <dbl>
#> 1 Lisa         999     1000
#> 2 Alberto      991      990
#> 3 Felix        987      990
#> 4 Floyd        967      970
#> 5 Chris        982      980

The immense power of mutate() lies in its use of functions for computing new variables out of the existing ones. As any R function can be used, the possibilities for creating new variables are limitless. However, note that the computations of each mutate() command are typically constrained to each individual case (row). This changes with the following two dplyr commands.

Using summarise() for aggregating over values of variables

Whereas mutate() computes new variables out of exiting ones for each case (i.e., by row), summarise() (and summarize()) computes summaries for individual variables (i.e., by column). Each summary is assigned to a new variable (with the same var_new = ... syntax as in mutate()). The type of summary is indicated by applying a function to one or more variables. Useful functions for potential summaries include:

  • Count: n(), n_distinct()

  • Logical: any(), all()

  • Center: mean(), median()

  • Spread: sd(), IQR(), mad()

  • Range: min(), max(), quantile()

  • Position: first(), last(), nth()

st %>%
  summarise(nr = n(),
            nr_names = n_distinct(name),
            mn_wind = mean(wind),
            max_wind = max(wind))
#> # A tibble: 1 x 4
#>      nr nr_names mn_wind max_wind
#>   <int>    <int>   <dbl>    <int>
#> 1 10010      198    53.5      160

Summaries of columns are nice to have, but nothing for which we needed a new function for. Instead, we could have simply computed the same summaries directly for the vectors of the st data:

nrow(st)
#> [1] 10010
n_distinct(st$name)
#> [1] 198
mean(st$wind)
#> [1] 53.495
max(st$wind)
#> [1] 160

Thus, the true value of the summarise() function lies in the fact that it aggregates not only over all values of a variable (i.e., entire columns), but also over the levels of grouped variables, or all combinations of grouped variables. To use this feature, we need to precede a summarise() function by a group_by() function.

Using group_by() for changing the aggregation unit

The group_by() function does very little by itself, but becomes immensely powerful in combination with other dplyr commands. To see how this works, let’s select only four variables from our st data and examine the effects of group_by():

st %>% 
  select(name, year, wind, pressure) %>%
  group_by(name)
#> # A tibble: 10,010 x 4
#> # Groups:   name [198]
#>    name   year  wind pressure
#>    <chr> <dbl> <int>    <int>
#>  1 Amy    1975    25     1013
#>  2 Amy    1975    25     1013
#>  3 Amy    1975    25     1013
#>  4 Amy    1975    25     1013
#>  5 Amy    1975    25     1012
#>  6 Amy    1975    25     1012
#>  7 Amy    1975    25     1011
#>  8 Amy    1975    30     1006
#>  9 Amy    1975    35     1004
#> 10 Amy    1975    40     1002
#> # … with 10,000 more rows

The resulting tibble contains all 10010 cases (rows) of st and the four selected variables. So what did the group_by(name) command do? Inspecting the output more closely shows the message: “Groups: name [198].” This suggests that something has changed, even though we do not see any effects. Interestingly, the number of groups mentioned (i.e., 198) matches the number of distinct storm names that we obtained above by evaluating n_distinct(st$name). Let’s try a second command:

st %>% 
  select(name, year, wind, pressure) %>%
  group_by(name, year)
#> # A tibble: 10,010 x 4
#> # Groups:   name, year [426]
#>    name   year  wind pressure
#>    <chr> <dbl> <int>    <int>
#>  1 Amy    1975    25     1013
#>  2 Amy    1975    25     1013
#>  3 Amy    1975    25     1013
#>  4 Amy    1975    25     1013
#>  5 Amy    1975    25     1012
#>  6 Amy    1975    25     1012
#>  7 Amy    1975    25     1011
#>  8 Amy    1975    30     1006
#>  9 Amy    1975    35     1004
#> 10 Amy    1975    40     1002
#> # … with 10,000 more rows

The resulting tibble seems unaltered, but the message below the tibble dimensions now reads: “Groups: name, year [426].” As there are 198 different values of name and the range of year values varies from 1975 to 2015, the number of 426 groups is not immediately obvious. As we will see momentarily, it results from the fact that some, but not all instances of name occur in more than one year.

The easiest way to identify the specific groups in both cases is to follow the last two statements by the count() function. This returns the groups (as the rows of a tibble) together with a variable n that counts the number of observations in each group:

# Group st by name and count (n of) observations per group:
st %>% 
  select(name, year, wind, pressure) %>%
  group_by(name) %>%
  count() %>%
  filter(name == "Felix") # show only one group
#> # A tibble: 1 x 2
#> # Groups:   name [1]
#>   name      n
#>   <chr> <int>
#> 1 Felix   178

# Group st by name, year and count (n of) observations per group:
st %>% 
  select(name, year, wind, pressure) %>%
  group_by(name, year) %>%
  count() %>%
  filter(name == "Felix") # show only one group
#> # A tibble: 4 x 3
#> # Groups:   name, year [4]
#>   name   year     n
#>   <chr> <dbl> <int>
#> 1 Felix  1989    57
#> 2 Felix  1995    59
#> 3 Felix  2001    40
#> 4 Felix  2007    22

Rather than returning the entire tibble, we added filter(name == "Felix") at the end to only return lines with this particular name value. When group_by(name), there is only one such group (containing 178 observations). By contrast, when group_by(name, year), there are four such groups (with a sum of 178 observations). Thus, a storm with the name Felix was observed in four distinct years.

ungroup() removes group()

The ungroup() function removes existing grouping factors. This is occasionally necessary for applying additional dplyr commands. For instance, if we first wanted to group storms by name and later draw a random sample of size n = 10 , we would need to add an intermediate ungroup() before applying slice_sample(n = 10) to the tibble of groups:

st %>%
  group_by(name) %>%
  ungroup() %>%
  slice_sample(n = 5)
#> # A tibble: 5 x 13
#>   name   year month   day  hour   lat  long status category  wind pressure
#>   <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>  <ord>    <int>    <int>
#> 1 AL14…  2002    10    16     6  20.5 -81.6 tropi… -1          30     1003
#> 2 Mari…  1995     9    13     6  11.9 -54.3 tropi… 0           45      999
#> 3 Cris…  2008     7    20    12  34.1 -76.7 tropi… 0           45     1006
#> 4 Emily  1987     9    22    18  16.7 -69.1 hurri… 3          110      958
#> 5 Bob    1991     8    20     6  45.6 -67.6 tropi… 0           50      987
#> # … with 2 more variables: ts_diameter <dbl>, hu_diameter <dbl>

Why do we want to group() data tables? The benefits of grouping become obvious by combining a group_by() function with a subsequent mutate() or summarise() function. In both cases, the unit of aggregation is changed from all cases to those within each group. Here are examples for both combinations.

Grouped mutates

To illustrate the effects of a sequence of group_by() and mutate(), let’s compute the mean wind speed twice:

  • the 1st use of mutate() computes the mean wind speed mn_wind_1 over all cases. The number of cases over which the mean is aggregated can be counted by the n() function (assigned to a variable mn_n_1).

  • the 2nd use of mutate() also computes the mean wind speed mn_wind_2 and uses the same n() function (assigned to a variable mn_n_2). The key difference is not in the content of the mutate() function, but the fact that the 2nd mutate() is located after the group_by(name) expression. This changed the unit of aggregation to this particular group.

st %>%
  select(name, year, wind) %>%
  mutate(mn_wind_1 = mean(wind),  # mean 1
         mn_n_1 = n()) %>%        
  group_by(name) %>%              # group by name
  mutate(mn_wind_2 = mean(wind),  # mean 2
         mn_n_2 = n()) %>%
  ungroup() %>%                   # ungroup
  slice_sample(n = 10)
#> # A tibble: 10 x 7
#>    name       year  wind mn_wind_1 mn_n_1 mn_wind_2 mn_n_2
#>    <chr>     <dbl> <int>     <dbl>  <int>     <dbl>  <int>
#>  1 Luis       1995   120      53.5  10010      89.2     60
#>  2 Georges    1998    70      53.5  10010      67.2     92
#>  3 Bonnie     1986    30      53.5  10010      55.2    185
#>  4 Hermine    2004    25      53.5  10010      37.4     70
#>  5 Frederic   1979   115      53.5  10010      52.6     66
#>  6 Lisa       2010    40      53.5  10010      46.7     99
#>  7 Leslie     2012    55      53.5  10010      53.9     59
#>  8 Gabrielle  2001    40      53.5  10010      57.2    130
#>  9 Emily      2005    30      53.5  10010      60.8    207
#> 10 Olga       2001    35      53.5  10010      48.6     47

To better inspect the resulting tibble, we followed the 2nd mutate() function with ungroup(). This removes any existing grouping operations and allows manipulating the table with the new variables. In this case, we used slice_sample(n = 10) to draw 10 random rows (out of the 10010 rows from st). These 10 rows show the difference in results of both mutate() commands. The first variable mn_wind_1 shows an identical value for all rows (as it computed the means over all 10010 rows). The second variable mn_wind_2 shows a different value for each name, as it was computed over those rows that shared the same name.

Beware that the results of grouped mutate() commands are easily misinterpreted: For instance, the values of mn_wind_2 could be misinterpreted as the mean wind speed of a particular storm. However, mn_wind_2 was only computed over all observations with the same name value. As some names occur repeatedly (in different years), the values do not necessarily refer to the same storm. Aggregating by storm would require a unique identifier for each storm (e.g., some combination of its name and date).

Grouped summaries

Perhaps even more frequent than following a group() function by mutate() is following it by summarise(). In this case, we aggregate the specified summaries over each group, rather than all data values.

To illustrate the effects of grouped summaries, compare the following three pipes of dplyr commands: All three contain the same summarise() part, which computes three new variables that report the number of cases in each summary n_cases, the mean wind speed mn_wind, and the maximum wind speed max_wind. The difference between the three versions lies in the group_by() statements prior to the summarise() command:

  • the 1st pipe computes the summary for all of st (without grouping);

  • the 2nd pipe computes the summary for each year of st;

  • the 3rd pipe computes the summary for each year and month of st:

st %>%
  summarise(n_cases = n(),
            mn_wind = mean(wind), 
            max_wind = max(wind))
#> # A tibble: 1 x 3
#>   n_cases mn_wind max_wind
#>     <int>   <dbl>    <int>
#> 1   10010    53.5      160

st %>%
  group_by(year) %>%
  summarise(n_cases = n(),
            mn_wind = mean(wind), 
            max_wind = max(wind))
#> # A tibble: 41 x 4
#>     year n_cases mn_wind max_wind
#>    <dbl>   <int>   <dbl>    <int>
#>  1  1975      86    50.9      100
#>  2  1976      52    59.9      105
#>  3  1977      53    54.0      150
#>  4  1978      54    40.5       80
#>  5  1979     301    48.7      150
#>  6  1980     161    53.7       90
#>  7  1981     164    56.6      115
#>  8  1982     105    49.5      115
#>  9  1983      79    47.0      100
#> 10  1984     236    51.4      115
#> # … with 31 more rows

st %>%
  group_by(year, month) %>%
  summarise(n_cases = n(),
            mn_wind = mean(wind), 
            max_wind = max(wind))
#> # A tibble: 198 x 5
#> # Groups:   year [41]
#>     year month n_cases mn_wind max_wind
#>    <dbl> <dbl>   <int>   <dbl>    <int>
#>  1  1975     6      16    37.5       60
#>  2  1975     7      14    56.8       60
#>  3  1975     8      40    45        100
#>  4  1975     9      16    73.8       95
#>  5  1976     8      18    68.1      105
#>  6  1976     9      18    56.4       90
#>  7  1976    10      16    54.7       90
#>  8  1977     8      10    53         80
#>  9  1977     9      34    55        150
#> 10  1977    10       9    51.1       70
#> # … with 188 more rows

As we can use all kinds of R functions in the mutate() and summarise() parts, preceding them by group_by() allows computing all kinds of new variables and descriptive summary statistics. In the following, we illustrate how we can answer quite interesting questions by appropriate dplyr pipes.

6.3.3 Answering questions (with dplyr and ggplot2)

To illustrate the typical workflow, let’s ask some simple questions and then provide a descriptive answer to it by a combination of dplyr and ggplot2 commands:

Question:

  • What were the 10 storms with the highest wind speeds?

Answer:

We can translate this question into the following one: What was the maximal wind speed that was recorded for each (named) storm?

st_top10_wind <- st %>%
  group_by(name) %>%
  summarise(max_wind = max(wind)) %>%
  arrange(desc(max_wind)) %>%
  slice(1:10)

st_top10_wind
#> # A tibble: 10 x 2
#>    name    max_wind
#>    <chr>      <int>
#>  1 Gilbert      160
#>  2 Wilma        160
#>  3 Mitch        155
#>  4 Rita         155
#>  5 Andrew       150
#>  6 Anita        150
#>  7 David        150
#>  8 Dean         150
#>  9 Felix        150
#> 10 Katrina      150

Question:

  • What was the maximal wind speed of each storm?

Answer:

We arrange the rows by (descending) wind speed, then group it by the name of storms, and select only the top row of each group:

st_max_wind <- st %>% 
  arrange(desc(wind)) %>%
  group_by(name) %>%
  slice_head(n = 1) 

dim(st_max_wind)
#> [1] 198  13

The resulting table st_max_wind contains 198 rows. Did the st data contain the same number of unique storm names? Let’s check:

length(unique(st$name))
#> [1] 198

Note a key difference between the two tables st_top10_wind and st_max_wind: st_top10_wind is only a small summary table that answers our question from above, whereas st_max_wind is a much larger subset of the original data (and includes the same variables as st).

In fact, our summary information of st_top10_wind should be contained within st_max_wind. Let’s check this: Inspecting wind_top10 shows four storms with wind speeds of at least 150 knots. Do we obtain the same storms when filtering the table st_max_wind for these values? The following expressions verify this by re-computing the top-10 storm names from the data in st_max_wind:

top10_2 <- st_max_wind %>%
  filter(wind >= 150) %>%
  select(name, wind) %>%
  arrange(desc(wind))
top10_2
#> # A tibble: 10 x 2
#> # Groups:   name [10]
#>    name     wind
#>    <chr>   <int>
#>  1 Gilbert   160
#>  2 Wilma     160
#>  3 Mitch     155
#>  4 Rita      155
#>  5 Andrew    150
#>  6 Anita     150
#>  7 David     150
#>  8 Dean      150
#>  9 Felix     150
#> 10 Katrina   150

all.equal(st_top10_wind$name, top10_2$name)
#> [1] TRUE

Question:

  • What was the average wind speed (in knots) and pressure (in milibars) by strom category?

Answer:

Using a dplyr pipe to compute a grouped summary table t_w:

t_w <- st %>% 
  group_by(category) %>%
  summarise(n = n(),
            mn_wind = mean(wind))

knitr::kable(t_w, caption = "Mean wind speed of storms (from **dplyr**).")
Table 6.5: Mean wind speed of storms (from dplyr).
category n mn_wind
-1 2545 27.26916
0 4373 45.80037
1 1685 70.91098
2 628 89.43471
3 363 104.64187
4 348 121.55172
5 68 145.07353

Using t_w to plot results with ggplot2:

ggplot(t_w, aes(x = category, y = mn_wind)) +
  geom_point(aes(size = n), col = "firebrick") +
  labs(tag = "A", title = "Wind speed by storm category", 
       x = "Storm category", y = "Wind speed (mean)") + 
  ylim(0, 150) + 
  theme_ds4psy()

Practice

Analog question:

  • What was the average air pressure (in milibars) by strom category?

Answer:

Using a dplyr pipe to compute a summary table t_p:

Table 6.6: Mean air pressure of storms (from dplyr).
category n mn_press
-1 2545 1007.6259
0 4373 999.3291
1 1685 981.5181
2 628 967.4729
3 363 953.5289
4 348 939.5345
5 68 916.4265

Using t_p to plot results with ggplot2:

Exercise 1

Use the data from dplyr::storms to show that there are specific storm seasons throughout the year.

  • In which months were how many storms of each category recorded?

Using a combination of dplyr and tidyr functions to compute the following summary table:

#> # A tibble: 45 x 3
#> # Groups:   month [10]
#>    month category     n
#>    <dbl> <ord>    <int>
#>  1     1 -1           2
#>  2     1 0           21
#>  3     4 0           13
#>  4     5 -1          22
#>  5     5 0           31
#>  6     6 -1         144
#>  7     6 0          148
#>  8     6 1           18
#>  9     7 -1         306
#> 10     7 0          392
#> # … with 35 more rows
Table 6.7: Mean wind speed and pressure (from dplyr).
month -1 0 1 2 3 4 5
1 2 21 0 0 0 0 0
4 0 13 0 0 0 0 0
5 22 31 0 0 0 0 0
6 144 148 18 0 0 0 0
7 306 392 49 9 6 11 1
8 643 1044 410 135 79 69 20
9 935 1619 751 368 209 201 37
10 352 756 312 96 57 54 10
11 120 278 112 20 12 13 0
12 21 71 33 0 0 0 0

Using t_2 to plot results with ggplot2:

The following plot uses the t_2 data, but is actually quite misleading:

ggplot(t_2, aes(x = factor(month))) +
  geom_bar(aes(fill = category)) +
  labs(title = "Storm counts per month", 
       x = "Month", y = "Count of observations") + 
  theme_ds4psy()

  • What’s the problem with it? How can it be fixed?

Solution 1

Correction: Two possible solutions

  1. Use geom_bar() with y = n and stat = "identity"
ggplot(t_2, aes(x = factor(month))) +
  geom_bar(aes(y = n, fill = category), stat = "identity") +
  labs(title = "Storm counts per month", 
       x = "Month", y = "Count of observations") + 
  theme_ds4psy()

  1. Use raw data of st and geom_bar() with default settings (i.e., stat = "count"):
ggplot(st, aes(x = factor(month))) +
  geom_bar(aes(fill = category)) +
  labs(title = "Storm counts per month", 
       x = "Month", y = "Count of observations") + 
  theme_ds4psy()

Exercise 2

  1. Identify all storms in st that were observed in more than one year.

  2. There are two ways in which a storm name can appear in more than one year:

  • A single storm may occur in multiple years (e.g., from December to January)
  • A storm name is used repeatedly (i.e., to name different storms in different years)

Check how often each of these cases occurs.

Solution 2

ad 1.:

st %>% 
  select(name, year) %>%
  group_by(name, year) %>%
  count() %>%
  # select(name, year) %>%
  group_by(name) %>%
  count() %>%
  filter(n > 1) %>%
  head() 
#> # A tibble: 6 x 2
#> # Groups:   name [6]
#>   name        n
#>   <chr>   <int>
#> 1 Alberto     6
#> 2 Alex        3
#> 3 Allison     3
#> 4 Ana         7
#> 5 Andrew      2
#> 6 Arthur      6
  # tail()

Check results for some storms:

  • Does “Alberto” really occur in 6 different years?
st %>% 
  filter(name == "Alberto") %>%
  group_by(name, year) %>%
  count()
#> # A tibble: 6 x 3
#> # Groups:   name, year [6]
#>   name     year     n
#>   <chr>   <dbl> <int>
#> 1 Alberto  1982    17
#> 2 Alberto  1988    11
#> 3 Alberto  1994    32
#> 4 Alberto  2000    79
#> 5 Alberto  2006    18
#> 6 Alberto  2012    13
  • Does “Ana” really occur in 7 different years?
st %>% 
  filter(name == "Ana") %>%
  group_by(name, year) %>%
  count()
#> # A tibble: 7 x 3
#> # Groups:   name, year [7]
#>   name   year     n
#>   <chr> <dbl> <int>
#> 1 Ana    1979    19
#> 2 Ana    1985    14
#> 3 Ana    1991    12
#> 4 Ana    1997    15
#> 5 Ana    2003    13
#> 6 Ana    2009    15
#> 7 Ana    2015    12
  • Does “Zeta” really occur in 2 different years?
st %>% 
  filter(name == "Zeta") %>%
  group_by(name, year) %>%
  count()
#> # A tibble: 2 x 3
#> # Groups:   name, year [2]
#>   name   year     n
#>   <chr> <dbl> <int>
#> 1 Zeta   2005     8
#> 2 Zeta   2006    23

Scatterplot: Visualize when and how often each name occurs…

all_names <- unique(st$name)

st_ym <- st %>%
  select(name, year, month, status) %>%
  mutate(y_m = paste0(year, "-", month),
         rnk = seq_along(name)  # ???: How to get rank of name?
         )
st_ym

# Position of a variable's value within a group:
st_ym <- st %>%
  select(name, year, month, status) %>%
  group_by(name) %>%
  mutate(y_m = paste0(year, "-", month),
         i_in_group = 1:n(),
         n_in_group = n(),
         group_id = cur_group_id(),
         name_nr = which(as.character(name) == all_names),
         initial = toupper(substr(name, 1, 1)),
         init_nr = match(initial, LETTERS))
st_ym 

# Visualize name_nr by time (color marks initial letter):
ggplot(st_ym, aes(x = y_m, y = name_nr, col = initial)) + 
  geom_point(size = 1) + 
  scale_color_manual(values = usecol(c("blue3", "gold", "firebrick3"), 22)) + 
  theme_classic()

# Visualize name_nr by time (color marks storm status):
ggplot(st_ym, aes(x = y_m, y = name_nr, col = status)) + 
  geom_point(size = 1) + 
  scale_color_manual(values = usecol(c("blue3", "gold", "firebrick3"), 3)) + 
  theme_classic()

# Visualize initial by time (color marks name):
ggplot(filter(st_ym, year < 1982), aes(x = y_m, y = init_nr, col = name)) + 
  geom_point(size = 1/2, alpha = 1/2) + 
  # scale_color_manual(values = usecol(c("blue3", "gold", "firebrick3"), 3)) + 
  theme_classic()

References

Neth, H. (2021a). Data science for psychologists. Social Psychology; Decision Sciences, University of Konstanz. https://bookdown.org/hneth/ds4psy/
Wickham, H., François, R., Henry, L., & Müller, K. (2021). dplyr: A grammar of data manipulation. https://CRAN.R-project.org/package=dplyr
Wickham, H., & Grolemund, G. (2017). R for data science: Import, tidy, transform, visualize, and model data. O’Reilly Media, Inc. http://r4ds.had.co.nz