10.4 dplyr

The dplyr package is an R package that lets you do all kinds of analyses quickly and easily. It is especially useful for creating tables of summary statistics across specific groups of data. In this section, we’ll take a very brief look at how you can use dplyr to do grouped aggregation. Just to be clear: you can use dplyr to do everything the aggregate() function does and much more! However, because this overview is so brief, I strongly recommend you look at the dplyr help pages and vignettes for additional descriptions and examples.

To use the dplyr package, you first need to install it with install.packages() and load it:

install.packages("dplyr")     # Install dplyr (only necessary once)
library("dplyr")              # Load dplyr

Programming with dplyr looks quite different from programming in standard R. dplyr works by combining objects (dataframes and columns in dataframes), functions (mean, median, etc.), and verbs (special commands in dplyr, such as filter, group_by and summarise). In between these commands is a new operator called the pipe, which looks like this: %>%. The pipe takes the result on its left and passes it to the function on its right as that function’s first argument, so you can keep applying functions or verbs to the object you are working on. You can think of the pipe as meaning ‘and then…’
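To make the ‘and then…’ reading concrete, here is a minimal sketch comparing a nested call with the same calculation written with the pipe. The vector x and the object names nested and piped are made up for this illustration:

```r
library(dplyr)  # Loading dplyr makes the %>% pipe available

x <- c(1, 4, 9)  # Hypothetical example vector

# Nested call: you have to read it from the inside out
nested <- sum(sqrt(x))

# Piped call: take x, and then take square roots, and then sum
piped <- x %>% sqrt() %>% sum()

nested  # 6
piped   # 6
```

Both versions give the same answer. The pipe does not change what is calculated, only the order in which you write (and read) the steps.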

To aggregate data with dplyr, your code will look something like the following template. In this example, assume that the dataframe you want to summarize is called my.df, the column you want to filter on is called iv3, the independent variables you want to group the data by are called iv1 and iv2, and the columns you want to aggregate are called col.a, col.b and col.c.

# Template for using dplyr
my.df %>%                  # Specify original dataframe
  filter(iv3 > 30) %>%     # Filter condition
  group_by(iv1, iv2) %>%   # Grouping variable(s)
  summarise(
    a = mean(col.a),       # calculate mean of column col.a in my.df
    b = sd(col.b),         # calculate sd of column col.b in my.df
    c = max(col.c))        # calculate max on column col.c in my.df, ...

When you use dplyr, you write code that sounds like: “The original dataframe is XXX, now filter the dataframe to only include rows that satisfy the conditions YYY, now group the data at each level of the variable(s) ZZZ, now summarize the data and calculate summary functions WWW…”
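If you want to see the template run from start to finish, here is a small sketch. The values in my.df are invented purely for illustration; only the column names come from the template:

```r
library(dplyr)

# Hypothetical data, invented only so the template has something to run on
my.df <- data.frame(
  iv1   = c("a", "a", "a", "b", "b", "b"),
  iv2   = c("x", "x", "y", "x", "x", "y"),
  iv3   = c(40, 50, 20, 35, 45, 60),
  col.a = c(1, 3, 5, 2, 4, 6),
  col.b = c(10, 20, 30, 40, 60, 80),
  col.c = c(7, 9, 8, 3, 5, 4)
)

result <- my.df %>%          # Start with my.df, and then...
  filter(iv3 > 30) %>%       # keep rows where iv3 > 30, and then...
  group_by(iv1, iv2) %>%     # group by iv1 and iv2, and then...
  summarise(
    a = mean(col.a),         # mean of col.a in each group
    b = sd(col.b),           # sd of col.b in each group
    c = max(col.c))          # max of col.c in each group

result
```

With these made-up values, the filter drops the one row where iv3 is 20, leaving three groups. The last group contains a single row, so its standard deviation b is NA: sd() needs at least two values.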

Let’s start with an example by creating a dataframe of aggregated data from the pirates dataset. I’ll filter the data to only include pirates who wear a headband. I’ll group the data according to the columns sex and college. I’ll then create several columns of different summary statistics for each group. To create this aggregated dataframe, I will use the verbs filter, group_by and summarise. I will assign the result to a new dataframe called pirates.agg:

pirates.agg <- pirates %>%                   # Start with the pirates dataframe
               filter(headband == "yes") %>% # Only pirates that wear hb
               group_by(sex, college) %>%    # Group by these variables
               summarise( 
                        age.mean = mean(age),      # Define first summary...
                        tat.med = median(tattoos), # you get the idea...
                        n = n()                    # How many are in each group?
               ) # End

# Print the result
pirates.agg
## # A tibble: 6 x 5
## # Groups:   sex [?]
##      sex college age.mean tat.med     n
##    <chr>   <chr>    <dbl>   <dbl> <int>
## 1 female    CCCC       26      10   206
## 2 female   JSSFP       34      10   203
## 3   male    CCCC       23      10   358
## 4   male   JSSFP       32      10    85
## 5  other    CCCC       25      10    24
## 6  other   JSSFP       32      12    11

As you can see from the output, our final object pirates.agg is the aggregated dataframe we want: it contains every summary column we asked for, for each combination of sex and college. One key new function here is n(). This function is specific to dplyr and returns the number of rows (cases) in each group when used in a summary command.
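To check what n() is counting, here is a small sketch on a made-up dataframe (crew is hypothetical and is not part of the pirates data). It compares n() with the counts you would get from table() in standard R:

```r
library(dplyr)

# Hypothetical mini crew, invented for illustration
crew <- data.frame(
  sex     = c("female", "female", "male", "male", "male"),
  tattoos = c(8, 12, 10, 9, 11)
)

crew.agg <- crew %>%
  group_by(sex) %>%
  summarise(
    n = n(),                    # Number of rows in each group
    tat.mean = mean(tattoos))   # Mean tattoos in each group

crew.agg

# The same counts using standard R
table(crew$sex)
```

Here n is 2 for female and 3 for male, which matches table(). Note that n() takes no arguments: it simply counts the rows in whatever group is currently being summarised.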

Let’s do a more complex example where we combine multiple verbs into one chunk of code. We’ll aggregate data from the movies dataframe.

movies %>% # From the movies dataframe...
    filter(genre != "Horror" & time > 50) %>% # Select only these rows
    group_by(rating, sequel) %>% # Group by rating and sequel
    summarise(
      frequency = n(), # How many movies in each group?
      budget.mean = mean(budget, na.rm = TRUE),  # Mean budget (ignoring NAs)?
      revenue.mean = mean(revenue.all), # Mean revenue?
      billion.p = mean(revenue.all > 1000)) # Proportion of movies with revenue > 1000?
## # A tibble: 14 x 6
## # Groups:   rating [?]
##       rating sequel frequency budget.mean revenue.mean billion.p
##        <chr>  <int>     <int>       <dbl>        <dbl>     <dbl>
##  1         G      0        59       41.23          234    0.0000
##  2         G      1        12       92.92          357    0.0833
##  3     NC-17      0         2        3.75           18    0.0000
##  4 Not Rated      0        84        1.74           56    0.0000
##  5 Not Rated      1        12        0.67           66    0.0000
##  6        PG      0       312       51.78          191    0.0096
##  7        PG      1        62       77.21          372    0.0161
##  8     PG-13      0       645       52.09          168    0.0062
##  9     PG-13      1       120      124.16          524    0.1167
## 10         R      0       623       31.38          109    0.0000
## 11         R      1        42       58.25          226    0.0000
## 12      <NA>      0        86        1.65           34    0.0000
## 13      <NA>      1        15        5.51           48    0.0000
## 14      <NA>     NA        11        0.00           34    0.0000

As you can see, our result is a dataframe with 14 rows and 6 columns. The data are summarized from the movies dataframe, only include movies where the genre is not Horror and the movie length is longer than 50 minutes, are grouped by rating and sequel, and show several summary statistics.
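Two details in that summary are easy to miss: taking the mean() of a logical test gives you a proportion (which is why billion.p lies between 0 and 1), and the na.rm argument tells mean() to skip missing values. Here is a minimal sketch using made-up vectors; revenue and budget below are hypothetical and are not taken from the movies data:

```r
revenue <- c(200, 1500, 800, 1200)   # Hypothetical revenues
budget  <- c(50, NA, 100, 150)       # Hypothetical budgets, one is missing

revenue > 1000          # FALSE  TRUE FALSE  TRUE
mean(revenue > 1000)    # 0.5: half of the values are greater than 1000

mean(budget)                 # NA, because one value is missing
mean(budget, na.rm = TRUE)   # 100: the mean of 50, 100 and 150
```

This works because R treats TRUE as 1 and FALSE as 0 when you do arithmetic on logical values. If you want a percentage rather than a proportion, multiply the result by 100.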

10.4.1 Additional dplyr help

We’ve only scratched the surface of what you can do with dplyr. In fact, you can perform almost all of your R tasks, from loading, to managing, to saving data, using dplyr together with its companion tidyverse packages. For more tips on using dplyr, check out the dplyr vignette at https://cran.r-project.org/web/packages/dplyr/vignettes/introduction.html. Or open it in RStudio by running the following command:

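# Open the dplyr introduction in R
# (in recent versions of dplyr the vignette is named "dplyr",
#  so try vignette("dplyr") if this one is not found)
vignette("introduction", package = "dplyr")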

There is also a very nice YouTube video covering dplyr at https://goo.gl/UY2AE1. Finally, consider reading R for Data Science by Garrett Grolemund and Hadley Wickham, which teaches R from the ground up using the dplyr framework.