YaRrr! The Pirate’s Guide to R

10.3 `aggregate()`: Grouped aggregation

Argument	Description
`formula`	A formula in the form `y ~ x1 + x2 + ...` where y is the dependent variable, and x1, x2… are the independent variables. For example, `salary ~ sex + age` will aggregate a `salary` column at every unique combination of `sex` and `age`
`FUN`	A function that you want to apply to y at every level of the independent variables, e.g. `mean` or `max`.
`data`	The dataframe containing the variables in `formula`
`subset`	A logical condition selecting the rows of `data` to analyze. For example, `subset = sex == "f" & age > 20` would restrict the analysis to females older than 20. You can omit this argument to use all data.

The first aggregation function we’ll cover is aggregate(). It allows you to easily answer questions of the form: “What is the value of the function FUN applied to a dependent variable dv at each level of one (or more) independent variable(s) iv?”

# General structure of aggregate()
aggregate(dv ~ iv,   # The formula: dv is the data, iv is the group
          FUN = fun, # The function you want to apply
          data = df) # The dataframe object containing dv and iv

Let’s give aggregate() a whirl. No…not a whirl…we’ll give it a spin. Definitely a spin. We’ll use aggregate() on the ChickWeight dataset to answer the question “What is the mean weight for each diet?”

If we wanted to answer this question using basic R functions, we’d have to write a separate command for each diet like this:

# The WRONG way to do grouped aggregation. 
#  We should be using aggregate() instead!
mean(ChickWeight$weight[ChickWeight$Diet == 1])
## [1] 103
mean(ChickWeight$weight[ChickWeight$Diet == 2])
## [1] 123
mean(ChickWeight$weight[ChickWeight$Diet == 3])
## [1] 143
mean(ChickWeight$weight[ChickWeight$Diet == 4])
## [1] 135

If you are ever writing code like this, there is almost always a simpler way to do it. Let’s replace this code with a much more elegant solution using aggregate(). For this question, we’ll set the dependent variable y to weight, the independent variable x1 to Diet, and FUN to mean.

# Calculate the mean weight for each value of Diet
aggregate(weight ~ Diet,            # DV is weight, IV is Diet
          FUN = mean,               # Calculate the mean of each group
          data = ChickWeight)       # dataframe is ChickWeight
##   Diet weight
## 1    1    103
## 2    2    123
## 3    3    143
## 4    4    135

As you can see, the aggregate() function has returned a dataframe with a column for the independent variable Diet, and a column for the results of the function mean applied to each level of the independent variable. The result is the same as what we got from manually indexing each level of Diet individually, but of course, this code is much simpler and more elegant!
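Because the result is an ordinary dataframe, you can store it and index it like any other. Here is a small sketch (the object names `diet.means` and `manual.diet3` are my own, not part of the dataset) that pulls out one group's mean and checks it against the manual calculation:

```r
# Store the aggregated result in a new object
diet.means <- aggregate(weight ~ Diet,
                        data = ChickWeight,
                        FUN = mean)

# It is a regular dataframe with one row per diet
is.data.frame(diet.means)
nrow(diet.means)

# Pull out the mean weight for Diet 3 by indexing
diet.means$weight[diet.means$Diet == 3]

# Compare against the manual calculation for the same diet
manual.diet3 <- mean(ChickWeight$weight[ChickWeight$Diet == 3])
all.equal(diet.means$weight[diet.means$Diet == 3], manual.diet3)
```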

You can also include a subset argument within an aggregate() function to apply the function to subsets of the original data. For example, if I wanted to calculate the mean chick weights for each diet, but only when the chicks are less than 10 days old (Time in ChickWeight is recorded in days since birth), I would do the following:

# Calculate the mean weight for each value of Diet,
#  but only when chicks are less than 10 days old

aggregate(weight ~ Diet,            # DV is weight, IV is Diet
          FUN = mean,               # Calculate the mean of each group
          subset = Time < 10,       # Only when chicks are less than 10 days old
          data = ChickWeight)       # dataframe is ChickWeight
##   Diet weight
## 1    1     58
## 2    2     63
## 3    3     66
## 4    4     69
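To see what the subset argument is doing, it can help to compare it with filtering the rows yourself before aggregating. The sketch below assumes nothing beyond the built-in ChickWeight data; the object names (`with.subset`, `young.chicks`, `pre.filtered`) are mine:

```r
# Version 1: let aggregate() do the filtering via subset
with.subset <- aggregate(weight ~ Diet,
                         data = ChickWeight,
                         FUN = mean,
                         subset = Time < 10)

# Version 2: filter the rows first, then aggregate the smaller dataframe
young.chicks <- ChickWeight[ChickWeight$Time < 10, ]
pre.filtered <- aggregate(weight ~ Diet,
                          data = young.chicks,
                          FUN = mean)

# Both approaches should give the same group means
all.equal(with.subset$weight, pre.filtered$weight)
```

Note that the subset condition can use a column (here Time) that does not appear in the formula itself.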

You can also include multiple independent variables in the formula argument to aggregate(). For example, let’s use aggregate() to now get the mean weight of the chicks for all combinations of both Diet and Time, but now only for days 0, 2, and 4:

# Calculate the mean weight for each value of Diet and Time,
#  but only when chicks are 0, 2 or 4 days old

aggregate(weight ~ Diet + Time,            # DV is weight, IVs are Diet and Time
          FUN = mean,                      # Calculate the mean of each group
          subset = Time %in% c(0, 2, 4),   # Only when chicks are 0, 2, or 4 days old
          data = ChickWeight)              # dataframe is ChickWeight
##    Diet Time weight
## 1     1    0     41
## 2     2    0     41
## 3     3    0     41
## 4     4    0     41
## 5     1    2     47
## 6     2    2     49
## 7     3    2     50
## 8     4    2     52
## 9     1    4     56
## 10    2    4     60
## 11    3    4     62
## 12    4    4     64
