The dplyr package is a relatively new R package that allows you to do all kinds of analyses quickly and easily. It is especially useful for creating tables of summary statistics across specific groups of data. In this section, we’ll take a very brief look at how you can use dplyr to do grouped aggregation. Just to be clear: you can use dplyr to do everything the aggregate() function does and much more! However, because this overview is so brief, I strongly recommend you look at the help menu for dplyr for additional descriptions and examples.
To use the dplyr package, you first need to install it with install.packages() and then load it with library():
install.packages("dplyr")   # Install dplyr (only necessary once)
library("dplyr")            # Load dplyr
Programming with dplyr looks quite different from programming in standard R. dplyr works by combining objects (dataframes and columns in dataframes), functions (mean, median, etc.), and verbs (special commands in dplyr). In between these commands is a new operator called the pipe, which looks like this: %>%. The pipe simply tells R that you want to continue executing some functions or verbs on the object you are working on. You can think of the pipe as meaning ‘and then…’
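To see that the pipe only changes how the code reads, here is a minimal sketch using a made-up numeric vector (the name x and its values are purely illustrative, not part of any dataset in this chapter). The piped version should give the same result as nesting the functions by hand:

```r
library(dplyr)  # dplyr re-exports the %>% pipe from the magrittr package

# Hypothetical example data
x <- c(1, 4, 9)

# Nested version: read from the inside out
nested <- sum(sqrt(x))

# Piped version: "take x, and then take square roots, and then sum"
piped <- x %>% sqrt() %>% sum()

# Both approaches compute the same value
identical(nested, piped)
```

The piped version reads left to right in the order the steps happen, which is why dplyr code is usually written this way.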
To aggregate data with dplyr, your code will look something like the following template. In this example, assume that the dataframe you want to summarize is called my.df, the independent variables you want to group the data by are called iv1 and iv2, the variable you want to filter on is called iv3, and the columns you want to aggregate are called col.a, col.b, and col.c:
# Template for using dplyr
my.df %>%                  # Specify original dataframe
  filter(iv3 > 30) %>%     # Filter condition
  group_by(iv1, iv2) %>%   # Grouping variable(s)
  summarise(
    a = mean(col.a),       # calculate mean of column col.a in my.df
    b = sd(col.b),         # calculate sd of column col.b in my.df
    c = max(col.c))        # calculate max on column col.c in my.df, ...
When you use dplyr, you write code that sounds like: “The original dataframe is XXX, now filter the dataframe to only include rows that satisfy the conditions YYY, now group the data at each level of the variable(s) ZZZ, now summarize the data and calculate summary functions XXX…”
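If you want to try this pattern before applying it to real data, here is a small sketch that runs the same chain of verbs on a made-up dataframe. Every value below is invented for illustration; only the column names follow the template (my.df, iv1, iv2, iv3, col.a, col.b, col.c).

```r
library(dplyr)

# Made-up data: all values are illustrative assumptions
my.df <- data.frame(
  iv1   = c("a", "a", "a", "b", "b", "b"),
  iv2   = c("x", "x", "y", "x", "x", "y"),
  iv3   = c(40, 50, 20, 35, 45, 60),
  col.a = c(1, 3, 5, 2, 4, 6),
  col.b = c(10, 20, 30, 40, 60, 80),
  col.c = c(7, 8, 9, 1, 2, 3)
)

result <- my.df %>%
  filter(iv3 > 30) %>%      # Drops the one row where iv3 is 20
  group_by(iv1, iv2) %>%    # Remaining groups: (a, x), (b, x), (b, y)
  summarise(
    a = mean(col.a),        # Mean of col.a within each group
    b = sd(col.b),          # sd of col.b (NA when a group has one row)
    c = max(col.c)          # Max of col.c within each group
  )

result
```

Note that the group (b, y) contains only one row after filtering, so its standard deviation is NA. That is the normal behaviour of sd(), not an error in the pipeline.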
Let’s start with an example: Let’s create a dataframe of aggregated data from the pirates dataset. I’ll filter the data to only include pirates who wear a headband. I’ll group the data according to the columns sex and college. I’ll then create several columns of different summary statistics across each grouping. To create this aggregated dataframe, I will use the new function group_by and the verb summarise. I will assign the result to a new dataframe called pirates.agg:
pirates.agg <- pirates %>%       # Start with the pirates dataframe
  filter(headband == "yes") %>%  # Only pirates that wear hb
  group_by(sex, college) %>%     # Group by these variables
  summarise(
    age.mean = mean(age),        # Define first summary...
    tat.med = median(tattoos),   # you get the idea...
    n = n()                      # How many are in each group?
  ) # End

# Print the result
pirates.agg
## # A tibble: 6 x 5
## # Groups:   sex [?]
##   sex    college age.mean tat.med     n
##   <chr>  <chr>      <dbl>   <dbl> <int>
## 1 female CCCC          26      10   206
## 2 female JSSFP         34      10   203
## 3 male   CCCC          23      10   358
## 4 male   JSSFP         32      10    85
## 5 other  CCCC          25      10    24
## 6 other  JSSFP         32      12    11
As you can see from the output above, our final object pirates.agg is the aggregated dataframe we wanted: it contains each summary statistic for every combination of sex and college. One key new function here is n(). This function is specific to dplyr and, when used inside a verb such as summarise, returns the number of rows (cases) in each group.
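To make the behaviour of n() concrete, here is a minimal sketch on a tiny made-up dataframe. The name crew and its values are hypothetical; the column names only mimic the pirates data.

```r
library(dplyr)

# Hypothetical mini dataset
crew <- data.frame(
  sex     = c("female", "female", "male", "male", "male"),
  tattoos = c(9, 11, 8, 10, 12)
)

crew.agg <- crew %>%
  group_by(sex) %>%
  summarise(
    tat.mean = mean(tattoos),  # Mean tattoos within each group
    n = n()                    # Number of rows in the current group
  )

crew.agg
```

Here n() takes no arguments: it simply counts the rows of whichever group summarise() is currently working on. If group sizes are all you need, dplyr’s count() function, as in count(crew, sex), is a shorter way to get them.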
Let’s do a more complex example where we combine multiple verbs into one chunk of code. We’ll aggregate data from the movies dataframe.
movies %>%                                    # From the movies dataframe...
  filter(genre != "Horror" & time > 50) %>%   # Select only these rows
  group_by(rating, sequel) %>%                # Group by rating and sequel
  summarise(
    frequency = n(),                          # How many movies in each group?
    budget.mean = mean(budget, na.rm = TRUE), # Mean budget?
    revenue.mean = mean(revenue.all),         # Mean revenue?
    billion.p = mean(revenue.all > 1000))     # Proportion of movies with revenue > 1000?
## # A tibble: 14 x 6
## # Groups:   rating [?]
##    rating    sequel frequency budget.mean revenue.mean billion.p
##    <chr>      <int>     <int>       <dbl>        <dbl>     <dbl>
##  1 G              0        59       41.23          234    0.0000
##  2 G              1        12       92.92          357    0.0833
##  3 NC-17          0         2        3.75           18    0.0000
##  4 Not Rated      0        84        1.74           56    0.0000
##  5 Not Rated      1        12        0.67           66    0.0000
##  6 PG             0       312       51.78          191    0.0096
##  7 PG             1        62       77.21          372    0.0161
##  8 PG-13          0       645       52.09          168    0.0062
##  9 PG-13          1       120      124.16          524    0.1167
## 10 R              0       623       31.38          109    0.0000
## 11 R              1        42       58.25          226    0.0000
## 12 <NA>           0        86        1.65           34    0.0000
## 13 <NA>           1        15        5.51           48    0.0000
## 14 <NA>          NA        11        0.00           34    0.0000
As you can see, our result is a dataframe with 14 rows and 6 columns. The data are summarized from the movies dataframe, include only movies whose genre is not Horror and whose length is longer than 50 minutes, are grouped by rating and sequel, and show several summary statistics for each group.
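Two details in that summarise() call are easy to miss: na.rm = TRUE tells mean() to skip missing values, and taking the mean of a logical test such as revenue.all > 1000 gives the proportion of rows where the test is TRUE. Here is a small sketch with invented numbers (the films dataframe is hypothetical and only borrows column names from movies):

```r
library(dplyr)

# Made-up values, one missing budget
films <- data.frame(
  rating      = c("PG", "PG", "PG", "PG"),
  budget      = c(50, NA, 100, 150),
  revenue.all = c(200, 1500, 800, 1200)
)

films.agg <- films %>%
  group_by(rating) %>%
  summarise(
    budget.naive = mean(budget),                # NA, because one budget is missing
    budget.mean  = mean(budget, na.rm = TRUE),  # Mean of 50, 100, and 150
    billion.p    = mean(revenue.all > 1000)     # Share of rows where the test is TRUE
  )

films.agg
```

Because TRUE counts as 1 and FALSE as 0, the mean of a logical vector is a proportion between 0 and 1; multiply it by 100 if you want a percentage.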
10.4.1 Additional dplyr help
We’ve only scratched the surface of what you can do with dplyr. In fact, you can perform almost all of your data management tasks in the dplyr framework, and together with its companion tidyverse packages it also covers loading and saving data. For more tips on using dplyr, check out the dplyr vignette at https://cran.r-project.org/web/packages/dplyr/vignettes/introduction.html. Or open it in RStudio by running the following command:
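# Open the dplyr introduction in R
vignette("introduction", package = "dplyr")
# If that name is not found (newer dplyr versions call this vignette "dplyr"), try:
# vignette("dplyr", package = "dplyr")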
There is also a very nice YouTube video covering dplyr at https://goo.gl/UY2AE1. Finally, consider reading R for Data Science by Garrett Grolemund and Hadley Wickham, which teaches R from the ground up using the dplyr framework.