Chapter 5 Handout 3: More dplyr verbs
In Handout 3, you learned about tapply() and how to apply functions to subsets of data. Here, we will do the same, as well as learning a few more common dplyr verbs.
5.1 Review: Pipes and filter()
As always, read in the dataset and use the tidyverse package.
library(tidyverse) # loads the tidyverse into your R environment
data <- mtcars # this is a dataset about cars
head(data)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
As a brief review of our first dplyr verb: filter() subsets a dataset based on the conditions specified in the command. Pipes (%>%) carry the object forward between commands.
# exercise: describe the subset of this data we've created
data_subset_a_lot <- data %>%
  filter(mpg > 20) %>%
  filter(mpg < 30) %>%
  filter(cyl == 6)
# now there are only 3 entries left!
data_subset_a_lot
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
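The chained filter() calls above can also be written as a single call. The following is a minimal sketch, assuming only that dplyr is installed; the names chained and combined are made up for illustration. Conditions separated by commas inside one filter() are combined with AND, so both versions should keep the same rows.

```r
library(dplyr)

data <- mtcars

# three chained filter() calls, as in the review above
chained <- data %>%
  filter(mpg > 20) %>%
  filter(mpg < 30) %>%
  filter(cyl == 6)

# one filter() call; comma-separated conditions must all be true
combined <- data %>%
  filter(mpg > 20, mpg < 30, cyl == 6)

# compare the two results
isTRUE(all.equal(chained, combined))
```

Which form to use is mostly a matter of taste: one call is shorter, while chained calls make it easy to comment out a single condition.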
5.2 mutate()
mutate() is probably the most versatile dplyr verb. It can be used to either modify existing columns or create new columns. The way it does this is simple: you give a column name and an expression for its values. If the column already exists, it is overwritten; if it doesn’t, a new column is created with those values.
For example, what if I mistakenly discounted car mpgs by 5 in the dataset data_subset_a_lot? I could modify the column as follows:
data_subset_a_lot <- data_subset_a_lot %>%
  mutate(mpg = mpg + 5) # since mpg already exists, no new column created
# notice all mpg's have 5 added
data_subset_a_lot
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 26.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 26.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Hornet 4 Drive 26.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Then, I decide that I want the old data back (by subtracting 5 again), but in a different column, mpg_old. I would create a new column as follows:
data_subset_a_lot <- data_subset_a_lot %>%
  mutate(mpg_old = mpg - 5) # creates new column mpg_old
# notice new column mpg_old
data_subset_a_lot
## mpg cyl disp hp drat wt qsec vs am gear carb mpg_old
## Mazda RX4 26.0 6 160 110 3.90 2.620 16.46 0 1 4 4 21.0
## Mazda RX4 Wag 26.0 6 160 110 3.90 2.875 17.02 0 1 4 4 21.0
## Hornet 4 Drive 26.4 6 258 110 3.08 3.215 19.44 1 0 3 1 21.4
Finally, notice we can do the exact same thing in one go using pipes:
# run on the original (unmodified) data_subset_a_lot, this produces the same exact results!
data_subset_a_lot <- data_subset_a_lot %>%
  mutate(mpg = mpg + 5) %>% # since mpg already exists, no new column created
  mutate(mpg_old = mpg - 5) # creates new column mpg_old
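The two mutate() steps can also be folded into one call. This is a small sketch rather than the handout's own method: subset_6cyl and one_call are hypothetical names, and the subset is rebuilt from mtcars so the example starts from unmodified mpg values. Within a single mutate(), expressions are evaluated in order, so mpg_old sees the mpg that was already increased by 5.

```r
library(dplyr)

# rebuild the three-row subset from scratch (hypothetical name)
subset_6cyl <- mtcars %>%
  filter(mpg > 20, mpg < 30, cyl == 6)

one_call <- subset_6cyl %>%
  mutate(mpg = mpg + 5,      # overwrites the existing mpg column
         mpg_old = mpg - 5)  # uses the updated mpg, so this recovers the original

# show only the two columns of interest
one_call[, c("mpg", "mpg_old")]
```

Because of this ordering, swapping the two expressions would change the result: mpg_old would then be computed from the original mpg.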
5.3 group_by(), summarize(), ungroup(), and Subsets of Data
For a bit of context on the dataset we are using about cars, the cars can have 4, 6, or 8 cylinders (cyl). These are our “categories”. So, what if I wanted the mean mpg for each category of number of cylinders?
As we saw in Handout 3, we can calculate this using the tapply command:
means <- tapply(data$mpg, INDEX = data$cyl, FUN = mean)
means
## 4 6 8
## 26.66364 19.74286 15.10000
The tidyverse way to do this uses the group_by() and summarize() verbs. This happens in steps, so we’ll take it one step at a time.
First, we establish the groups using group_by(). group_by() only takes the names of the columns you intend to, well, group by. On its own it doesn’t compute anything; it only marks the groups for the verbs that follow.
means <- data %>%
  group_by(cyl) # establishes cyl as our groupings
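To see what group_by() actually records, the grouped object can be inspected directly. The sketch below assumes only that dplyr is installed and uses a made-up name, grouped; group_vars(), n_groups(), and is_grouped_df() are dplyr helpers for looking at the grouping information.

```r
library(dplyr)

grouped <- mtcars %>%
  group_by(cyl)

group_vars(grouped)     # names of the grouping columns
n_groups(grouped)       # how many distinct groups there are
is_grouped_df(grouped)  # whether grouping information is attached

# the values themselves are untouched by group_by()
isTRUE(all.equal(grouped$mpg, mtcars$mpg))
```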
Next, we use summarize() to specify functions to be applied to each established group. It’s important to note that this creates a new data frame, so we need to name every new summary statistic we produce as a column:
means <- data %>%
  group_by(cyl) %>%
  summarize(data_mean = mean(mpg)) # calculate mean of mpg, and name as data_mean
means
## # A tibble: 3 × 2
## cyl data_mean
## <dbl> <dbl>
## 1 4 26.7
## 2 6 19.7
## 3 8 15.1
Let’s take a step back and look at what we just did. We asked summarize() to produce the mean of mpg based on the groupings we specified (mean(mpg)), and then named that as a column in our new summary data frame (data_mean = mean(mpg)).
Finally, a note. Once you are done with your groupings, you should always ungroup() your data! There are many small reasons why, but overall, you will get far fewer bugs if you ungroup whenever you are done with your groups. If you ever encounter a bug with a dataset you grouped at some point, first check whether you ungrouped it. Ungrouping is simple: just pass your data into the ungroup() command (no extra arguments needed):
# proper style!
means <- data %>%
  group_by(cyl) %>%
  summarize(data_mean = mean(mpg)) %>% # calculate mean of mpg, and name as data_mean
  ungroup() # ungroup() your data when you are done!
means
## # A tibble: 3 × 2
## cyl data_mean
## <dbl> <dbl>
## 1 4 26.7
## 2 6 19.7
## 3 8 15.1
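One way to see why leftover groups cause surprises is to group, use mutate() (which keeps the grouping attached), and then summarize later. This is an illustrative sketch with hypothetical names (still_grouped, by_group, overall), not an example from the original handout.

```r
library(dplyr)

# grouped mutate(): mean(mpg) is computed within each cyl group,
# and the result is still grouped by cyl afterwards
still_grouped <- mtcars %>%
  group_by(cyl) %>%
  mutate(group_mean = mean(mpg))

# the grouping is still attached, so this counts rows per cyl
by_group <- still_grouped %>%
  summarize(n = n())

# after ungroup(), the same call counts rows for the whole dataset
overall <- still_grouped %>%
  ungroup() %>%
  summarize(n = n())

by_group
overall
```

If the intent was a single overall count, forgetting ungroup() silently gives one row per group instead, which is the kind of bug the advice above is meant to prevent.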
You might ask, why is this method preferred over tapply? After all, tapply got to the solution a lot more quickly. Well, the group_by() and summarize() combo is extremely useful for producing many summary statistics at once. tapply only allows us to apply one function at a time, whereas summarize() allows us to produce a lot more summary data at once.
For example, what if, for each group (cyl), I wanted not only the mean, but also the number of observations, median, maximum, and minimum? This is trivial using summarize():
data_summary <- data %>%
  group_by(cyl) %>% # establish groups
  summarize(data_mean = mean(mpg), # calculate mean of mpg, and name as data_mean
            number_of_obs = n(), # number of observations
            data_median = median(mpg), # median of mpg
            data_max = max(mpg), # max of mpg
            data_min = min(mpg)) %>% # min of mpg
  ungroup() # always ungroup when you are done!
data_summary
## # A tibble: 3 × 6
## cyl data_mean number_of_obs data_median data_max data_min
## <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 4 26.7 11 26 33.9 21.4
## 2 6 19.7 7 19.7 21.4 17.8
## 3 8 15.1 14 15.2 19.2 10.4
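For comparison, here is roughly what part of the same table would take with tapply, as a sketch in base R only. The name base_summary and the choice to assemble the pieces with data.frame() are assumptions made for illustration; the point is that each statistic needs its own tapply call before the results can be combined.

```r
# one tapply() call per statistic
means   <- tapply(mtcars$mpg, INDEX = mtcars$cyl, FUN = mean)
counts  <- tapply(mtcars$mpg, INDEX = mtcars$cyl, FUN = length)
medians <- tapply(mtcars$mpg, INDEX = mtcars$cyl, FUN = median)

# assemble the separate results by hand (hypothetical name)
base_summary <- data.frame(
  cyl           = as.numeric(names(means)),
  data_mean     = as.vector(means),
  number_of_obs = as.vector(counts),
  data_median   = as.vector(medians)
)

base_summary
```

With summarize(), the naming and assembling happen in the same call that computes the statistics, which is why it scales better as more summaries are added.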