Chapter 5 Handout 3: More dplyr verbs
In Handout 3, you learned about tapply() and how to apply functions to subsets of data. Here, we will do the same using dplyr, and learn a few more common dplyr verbs along the way.
5.1 Review: Pipes and filter()
As always, read in the dataset and use the tidyverse package.
library(tidyverse) # loads the tidyverse into your R environment
data <- mtcars # this is a dataset about cars
head(data)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
Just as a brief review of our first dplyr verb, filter() subsets a dataset based on the conditions specified in the command. Pipes carry the object forward between commands.
# exercise: describe the subset of this data we've created
data_subset_a_lot <- data %>%
  filter(mpg > 20) %>%
  filter(mpg < 30) %>%
  filter(cyl == 6)
# now there are only 3 entries left!
data_subset_a_lot
##                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
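As a small aside on the review above, filter() also accepts several conditions in a single call, separated by commas, and keeps only the rows where all of them are true. The sketch below compares the two styles; the object names chained and combined are made up for this illustration.

```r
library(dplyr)

data <- mtcars

# Three chained filter() calls, as in the review above
chained <- data %>%
  filter(mpg > 20) %>%
  filter(mpg < 30) %>%
  filter(cyl == 6)

# One filter() call: comma-separated conditions must all be TRUE
combined <- data %>%
  filter(mpg > 20, mpg < 30, cyl == 6)

# Check that both approaches keep the same rows
all.equal(chained, combined)
nrow(combined)
```

Which style to use is mostly a matter of taste: chaining shows each step separately, while a single call is more compact.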
5.2 mutate()
mutate() is probably the most versatile dplyr verb. It can be used to either modify existing columns or create new columns. The way it does this is simple: you specify in the command a column and what you’d like done to it. If the column doesn’t already exist, it will create a new column with those properties.
For example, what if I mistakenly discounted car mpgs by 5 in the dataset data_subset_a_lot? I could modify the column as follows:
data_subset_a_lot <- data_subset_a_lot %>%
  mutate(mpg = mpg + 5) # since mpg already exists, no new column created
data_subset_a_lot # notice all mpg's have 5 added
##                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      26.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  26.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Hornet 4 Drive 26.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Then, I decide that I want the old data back (by subtracting 5 again), but in a different column, mpg_old. I would create a new column as follows:
data_subset_a_lot <- data_subset_a_lot %>%
  mutate(mpg_old = mpg - 5) # creates new column mpg_old
data_subset_a_lot # notice new column mpg_old
##                 mpg cyl disp  hp drat    wt  qsec vs am gear carb mpg_old
## Mazda RX4      26.0   6  160 110 3.90 2.620 16.46  0  1    4    4    21.0
## Mazda RX4 Wag  26.0   6  160 110 3.90 2.875 17.02  0  1    4    4    21.0
## Hornet 4 Drive 26.4   6  258 110 3.08 3.215 19.44  1  0    3    1    21.4
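As a side sketch of the same idea, the expression inside mutate() can combine several existing columns, not just adjust one. The example below works on the full dataset rather than on data_subset_a_lot, so it does not interfere with the running example. The names hp_per_wt and data_with_ratio are hypothetical, chosen only for this illustration.

```r
library(dplyr)

data <- mtcars

# hp_per_wt is a hypothetical column name used only in this sketch
data_with_ratio <- data %>%
  mutate(hp_per_wt = hp / wt) # hp_per_wt does not exist yet, so it is added

# Look at the two inputs next to the new column
head(data_with_ratio[, c("hp", "wt", "hp_per_wt")], 3)
```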
Finally, notice that we can do both steps in one go using pipes. Run this on the freshly filtered data_subset_a_lot; if the two mutate() calls above have already been applied, mpg would be increased by 5 a second time:
# this produces the same exact results!
data_subset_a_lot <- data_subset_a_lot %>%
  mutate(mpg = mpg + 5) %>% # since mpg already exists, no new column created
  mutate(mpg_old = mpg - 5) # creates new column mpg_old
5.3 group_by(), summarize(), ungroup(), and Subsets of Data
For a bit of context on the dataset we are using about cars, the cars can have 4, 6, or 8 cylinders (cyl). These are our “categories”. So, what if I wanted the mean mpg for each category of number of cylinders?
As we saw in Handout 3, we can calculate this using the tapply command:
means <- tapply(data$mpg, INDEX = data$cyl, FUN = mean)
means
##        4        6        8
## 26.66364 19.74286 15.10000
But, the tidyverse way to do this is the group_by() and summarize() verbs. This occurs step-by-step, so we’ll take it by steps as well.
First, we establish the groups using group_by(). group_by() only takes the names of the columns you intend to, well, group by. On its own it does not change any values in the data; it only records the grouping so that later verbs know to work group by group.
means <- data %>%
  group_by(cyl) # establishes cyl as our groupings
Next, we use summarize() to specify functions to be applied to each established group. It’s important to note that this creates a new data frame, so we need to name every new summary statistic we produce as a column:
means <- data %>%
  group_by(cyl) %>%
  summarize(data_mean = mean(mpg)) # calculate mean of mpg, and name as data_mean
means
## # A tibble: 3 × 2
##     cyl data_mean
##   <dbl>     <dbl>
## 1     4      26.7
## 2     6      19.7
## 3     8      15.1
Let’s take a step back on what we just did. We asked summarize() to produce the mean of mpg based on the groupings we specified (mean(mpg)), and then named that as a column in our new summary dataframe (data_mean = mean(mpg)).
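To convince yourself that the two approaches agree, here is a quick check, written as a minimal sketch. The names by_tapply and by_dplyr are made up for this comparison.

```r
library(dplyr)

data <- mtcars

# Base R: a named array with one mean per cyl value
by_tapply <- tapply(data$mpg, INDEX = data$cyl, FUN = mean)

# dplyr: a tibble with one row per cyl value
by_dplyr <- data %>%
  group_by(cyl) %>%
  summarize(data_mean = mean(mpg))

# Drop the names from the tapply() result and compare the numbers
all.equal(as.numeric(by_tapply), by_dplyr$data_mean)
```

The numbers are the same; the tibble only looks different because it rounds the values for display.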
Finally, a note. Once you are done with your groupings, you should always ungroup() your data! There are several small reasons why, but overall, you will get far fewer bugs if you ungroup whenever you are done. (In this particular example, summarize() already drops the single grouping variable cyl, so ungroup() is only a safeguard; it matters more after a grouped mutate() or when you group by several variables.) If you ever encounter a bug with a dataset you grouped at some point, first check whether you ungrouped it. Ungrouping is simple: just pipe your data into ungroup() (no arguments needed):
# proper style!
means <- data %>%
  group_by(cyl) %>%
  summarize(data_mean = mean(mpg)) %>% # calculate mean of mpg, and name as data_mean
  ungroup() # ungroup() your data when you are done!
means
## # A tibble: 3 × 2
##     cyl data_mean
##   <dbl>     <dbl>
## 1     4      26.7
## 2     6      19.7
## 3     8      15.1
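To make the ungroup() advice more concrete, here is a sketch of the kind of bug it prevents. It assumes you grouped the data for one step and then forgot about the grouping. The names still_grouped, surprise, and as_expected are hypothetical.

```r
library(dplyr)

data <- mtcars

# A grouped mutate(): mean(mpg) is computed within each cyl group,
# and the result is still grouped by cyl afterwards
still_grouped <- data %>%
  group_by(cyl) %>%
  mutate(group_mean = mean(mpg))

group_vars(still_grouped) # lists the columns the data is grouped by

# Forgetting to ungroup: this "overall" mean is silently computed per group
surprise <- still_grouped %>%
  mutate(overall_mean = mean(mpg))

# After ungroup(), mean(mpg) uses every row of the dataset
as_expected <- still_grouped %>%
  ungroup() %>%
  mutate(overall_mean = mean(mpg))

n_distinct(surprise$overall_mean)    # one value per cyl group
n_distinct(as_expected$overall_mean) # a single value for the whole dataset
```

No error or warning appears in the grouped version, which is what makes this kind of bug easy to miss.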
You might ask, why is this method preferred over tapply? After all, tapply got to the solution a lot quicker. Well, the group_by() and summarize() combo is extremely useful for producing many summary statistics at once. tapply only allows us to apply one function at a time, whereas summarize() allows us to produce a lot more summary data at once.
For example, what if, for each group (cyl), I wanted not only the mean, but also the number of observations, median, maximum, and minimum? This is trivial using summarize():
data_summary <- data %>%
  group_by(cyl) %>% # establish groups
  summarize(data_mean = mean(mpg), # calculate mean of mpg, and name as data_mean
            number_of_obs = n(), # number of observations
            data_median = median(mpg), # median of mpg
            data_max = max(mpg), # max of mpg
            data_min = min(mpg)) %>% # min of mpg
  ungroup() # always ungroup when you are done!
data_summary
## # A tibble: 3 × 6
##     cyl data_mean number_of_obs data_median data_max data_min
##   <dbl>     <dbl>         <int>       <dbl>    <dbl>    <dbl>
## 1     4      26.7            11        26       33.9     21.4
## 2     6      19.7             7        19.7     21.4     17.8
## 3     8      15.1            14        15.2     19.2     10.4
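The same pattern extends to more than one grouping column. As a minimal sketch, assuming we also want to split by transmission type (the am column in mtcars), we can pass both columns to group_by(). With dplyr's default settings, summarize() then removes only the last grouping variable, so the result is still grouped by cyl until we call ungroup(). The names two_way and two_way_clean are made up for this example.

```r
library(dplyr)

data <- mtcars

# Group by two columns: one row per (cyl, am) combination
two_way <- data %>%
  group_by(cyl, am) %>%
  summarize(data_mean = mean(mpg), # mean of mpg per combination
            number_of_obs = n())   # number of cars per combination

group_vars(two_way) # summarize() dropped am, but cyl is still a grouping

# ungroup() removes the remaining grouping
two_way_clean <- two_way %>%
  ungroup()

group_vars(two_way_clean)
```

dplyr prints a message about the remaining grouping when summarize() runs here. That message is informational, not an error.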