Chapter 5 Handout 3: More dplyr verbs

In Handout 3, you learned about tapply() and how to apply functions to subsets of data. Here, we will do the same with dplyr, and learn a few more common dplyr verbs along the way.

5.1 Review: Pipes and filter()

As always, load the tidyverse package and read in the dataset.

library(tidyverse) # loads the tidyverse into your R environment
data <- mtcars # this is a dataset about cars
head(data)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

As a brief review of our first dplyr verb: filter() keeps only the rows of a dataset that meet the conditions specified in the command. Pipes (%>%) carry the result of one command forward as the input to the next.

# exercise: describe the subset of this data we've created
data_subset_a_lot <- data %>%
  filter(mpg > 20) %>%
  filter(mpg < 30) %>%
  filter(cyl == 6)

# now there are only 3 entries left!
data_subset_a_lot
##                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
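
As a side note, chained filter() calls can also be written as a single call, because conditions separated by commas are combined with "and". The sketch below assumes only the built-in mtcars data; the name data_subset_one_call is made up for illustration.

```r
library(dplyr) # provides filter() and %>% (also loaded by library(tidyverse))

data <- mtcars

# three chained filter() calls, as in the review above
data_subset_a_lot <- data %>%
  filter(mpg > 20) %>%
  filter(mpg < 30) %>%
  filter(cyl == 6)

# one filter() call; a row is kept only if every condition is TRUE
data_subset_one_call <- data %>%
  filter(mpg > 20, mpg < 30, cyl == 6)

# compare the two results
all.equal(data_subset_a_lot, data_subset_one_call)
```

Which form to use is mostly a matter of taste; separate calls can be easier to comment out one at a time while exploring.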

5.2 mutate()

mutate() is probably the most versatile dplyr verb. It can either modify existing columns or create new ones. The way it does this is simple: you give it a column name and an expression for that column’s values. If the column already exists, it is overwritten; if it doesn’t, a new column is created.

For example, suppose the mpg values in data_subset_a_lot were mistakenly recorded 5 too low. I could correct the column as follows:

data_subset_a_lot <- data_subset_a_lot %>%
  mutate(mpg = mpg + 5) # since mpg already exists, no new column created
data_subset_a_lot # notice all mpg's have 5 added
##                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      26.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  26.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Hornet 4 Drive 26.4   6  258 110 3.08 3.215 19.44  1  0    3    1

Then, I decide that I want the old data back (by subtracting 5 again), but in a different column, mpg_old. I would create a new column as follows:

data_subset_a_lot <- data_subset_a_lot %>%
  mutate(mpg_old = mpg - 5) # creates new column mpg_old
data_subset_a_lot # notice new column mpg_old
##                 mpg cyl disp  hp drat    wt  qsec vs am gear carb mpg_old
## Mazda RX4      26.0   6  160 110 3.90 2.620 16.46  0  1    4    4    21.0
## Mazda RX4 Wag  26.0   6  160 110 3.90 2.875 17.02  0  1    4    4    21.0
## Hornet 4 Drive 26.4   6  258 110 3.08 3.215 19.44  1  0    3    1    21.4

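Finally, notice we can do the exact same thing in one go using pipes. (Run this on the original, unmodified data_subset_a_lot; running it after the two steps above would add 5 a second time.)

# starting from the unmodified subset, this produces the same exact results!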
data_subset_a_lot <- data_subset_a_lot %>%
  mutate(mpg = mpg + 5) %>% # since mpg already exists, no new column created
  mutate(mpg_old = mpg - 5) # creates new column mpg_old
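
The two mutate() steps can also be collapsed into a single call. This is a minimal sketch, assuming we start from the original three-row subset: within one mutate(), the expressions are evaluated in order, so mpg_old sees the already-corrected mpg. The names data_subset and data_fixed are hypothetical.

```r
library(dplyr)

# rebuild the original (unmodified) three-row subset
data_subset <- mtcars %>%
  filter(mpg > 20, mpg < 30, cyl == 6)

data_fixed <- data_subset %>%
  mutate(mpg = mpg + 5,     # overwrites the existing mpg column
         mpg_old = mpg - 5) # new column; uses the updated mpg from the line above

data_fixed[, c("mpg", "mpg_old")]
```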

5.3 group_by(), summarize(), ungroup(), and Subsets of Data

For a bit of context on the dataset we are using about cars, the cars can have 4, 6, or 8 cylinders (cyl). These are our “categories”. So, what if I wanted the mean mpg for each category of number of cylinders?

As we saw in Handout 3, we can calculate this using the tapply command:

means <- tapply(data$mpg, INDEX = data$cyl, FUN = mean)
means
##        4        6        8 
## 26.66364 19.74286 15.10000

The tidyverse way to do this, however, is with the group_by() and summarize() verbs. They work in two steps, so we’ll take them one at a time.

First, we establish the groups using group_by(). group_by() takes the names of the columns you intend to, well, group by. On its own it doesn’t change any values; it only marks the groups so that later verbs work on each group separately.

means <- data %>%
  group_by(cyl) # establishes cyl as our groupings
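
To check that group_by() leaves the values alone, one can inspect the grouped object. This sketch uses dplyr's group_vars() and n_groups() helpers to look at the grouping; the name grouped is made up for illustration.

```r
library(dplyr)

data <- mtcars
grouped <- data %>%
  group_by(cyl)

group_vars(grouped) # "cyl": the column that defines the groups
n_groups(grouped)   # 3: one group each for 4, 6, and 8 cylinders
dim(grouped)        # same number of rows and columns as data
```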

Next, we use summarize() to specify functions to be applied to each established group. It’s important to note that this creates a new data frame, so we need to name every new summary statistic we produce as a column:

means <- data %>%
  group_by(cyl) %>%
  summarize(data_mean = mean(mpg)) # calculate mean of mpg, and name as data_mean
means
## # A tibble: 3 × 2
##     cyl data_mean
##   <dbl>     <dbl>
## 1     4      26.7
## 2     6      19.7
## 3     8      15.1

Let’s step back and look at what we just did. We asked summarize() to compute the mean of mpg within each of the groups we specified (mean(mpg)), and then named the result as a column in our new summary data frame (data_mean = mean(mpg)).

Finally, a note. Once you are done with your groupings, you should always ungroup() your data! Leftover groups silently change how later verbs behave (for example, mutate() will compute within each group rather than over the whole dataset), so ungrouping as soon as you are done prevents a lot of bugs. If you ever encounter a bug with a dataset you grouped at some point, first check whether you ungrouped it. Ungrouping is simple: just pipe your data into ungroup() (no arguments needed):

# proper style!
means <- data %>%
  group_by(cyl) %>%
  summarize(data_mean = mean(mpg)) %>% # calculate mean of mpg, and name as data_mean
  ungroup() # ungroup() your data when you are done!
means
## # A tibble: 3 × 2
##     cyl data_mean
##   <dbl>     <dbl>
## 1     4      26.7
## 2     6      19.7
## 3     8      15.1
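
Here is one concrete case where ungroup() makes a difference, sketched under the assumption of dplyr 1.0 or later (needed for the .groups argument). With two grouping columns, summarize() removes only the last grouping level, so the result is still grouped by the first. The am column (transmission type) and the names by_cyl_am and by_cyl_am_flat are chosen only for illustration.

```r
library(dplyr)

by_cyl_am <- mtcars %>%
  group_by(cyl, am) %>%
  summarize(data_mean = mean(mpg),
            .groups = "drop_last") # the default behaviour here, written out explicitly

group_vars(by_cyl_am)      # "cyl": still grouped after summarize()

by_cyl_am_flat <- by_cyl_am %>%
  ungroup()

group_vars(by_cyl_am_flat) # character(0): no groups left
```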

You might ask, why is this method preferred over tapply? After all, tapply got to the solution a lot quicker. Well, the group_by() and summarize() combo is extremely useful for producing many summary statistics at once. tapply only allows us to apply one function at a time, whereas summarize() allows us to produce a lot more summary data at once.

For example, what if, for each group (cyl), I wanted not only the mean, but also the number of observations, median, maximum, and minimum? This is straightforward using summarize():

data_summary <- data %>%
  group_by(cyl) %>% # establish groups
  summarize(data_mean = mean(mpg), # calculate mean of mpg, and name as data_mean
            number_of_obs = n(), # number of observations
            data_median = median(mpg), # median of mpg
            data_max = max(mpg), # max of mpg
            data_min = min(mpg)) %>% # min of mpg
  ungroup() # always ungroup when you are done!
data_summary
## # A tibble: 3 × 6
##     cyl data_mean number_of_obs data_median data_max data_min
##   <dbl>     <dbl>         <int>       <dbl>    <dbl>    <dbl>
## 1     4      26.7            11        26       33.9     21.4
## 2     6      19.7             7        19.7     21.4     17.8
## 3     8      15.1            14        15.2     19.2     10.4
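
For comparison, here is a sketch of what the same table takes with tapply(): one call per statistic, with the pieces assembled by hand. It uses base R only; the name tapply_summary is hypothetical.

```r
data <- mtcars

# one tapply() call per statistic; each returns values ordered by cyl (4, 6, 8)
tapply_summary <- data.frame(
  cyl           = sort(unique(data$cyl)),
  data_mean     = as.numeric(tapply(data$mpg, INDEX = data$cyl, FUN = mean)),
  number_of_obs = as.integer(tapply(data$mpg, INDEX = data$cyl, FUN = length)),
  data_median   = as.numeric(tapply(data$mpg, INDEX = data$cyl, FUN = median)),
  data_max      = as.numeric(tapply(data$mpg, INDEX = data$cyl, FUN = max)),
  data_min      = as.numeric(tapply(data$mpg, INDEX = data$cyl, FUN = min))
)
tapply_summary
```

The numbers match the summarize() result; the difference is that every new statistic costs another tapply() call and another hand-built column.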