6.3 group_by() and ungroup()
group_by() takes an existing dataset and groups it by one or more variables, so that later operations are performed separately within each group. ungroup() removes that grouping again.
Example: Grouping by age and sex (male/female) might be useful in a dataset if we care about how females of a certain age scored compared to males of a certain age (or comparing ages within males or within females).
Let’s create a sample dataset to reflect this example (to avoid entry errors, copy and paste this into your script):
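## Loading dplyr, which provides tibble(), %>%, and the grouping functions used below
library(dplyr)
## Creating identification number to represent 50 individual people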
ID <- c(1:50)
## Creating sex variable (25 males/25 females)
Sex <- rep(c("male", "female"), 25) # rep stands for replicate
## Creating age variable (20-39 year olds)
Age <- c(26, 25, 39, 37, 31, 34, 34, 30, 26, 33,
39, 28, 26, 29, 33, 22, 35, 23, 26, 36,
21, 20, 31, 21, 35, 39, 36, 22, 22, 25,
27, 30, 26, 34, 38, 39, 30, 29, 26, 25,
26, 36, 23, 21, 21, 39, 26, 26, 27, 21)
## Creating a dependent variable called Score
Score <- c(0.010, 0.418, 0.014, 0.090, 0.061, 0.328, 0.656, 0.002, 0.639, 0.173,
0.076, 0.152, 0.467, 0.186, 0.520, 0.493, 0.388, 0.501, 0.800, 0.482,
0.384, 0.046, 0.920, 0.865, 0.625, 0.035, 0.501, 0.851, 0.285, 0.752,
0.686, 0.339, 0.710, 0.665, 0.214, 0.560, 0.287, 0.665, 0.630, 0.567,
0.812, 0.637, 0.772, 0.905, 0.405, 0.363, 0.773, 0.410, 0.535, 0.449)
## Creating a unified dataset that puts together all variables
data <- tibble(ID, Sex, Age, Score)
6.3.1 summarize() and group_by()
Let’s say that I want to calculate and compare the average Score (and other measures) for males and females separately:
data %>%
  group_by(Sex) %>%
  summarize(m = mean(Score), # calculates the mean
            s = sd(Score), # calculates the standard deviation
            n = n()) %>% # counts the observations in each group
  ungroup()
## # A tibble: 2 x 4
## Sex m s n
## <chr> <dbl> <dbl> <int>
## 1 female 0.437 0.268 25
## 2 male 0.487 0.268 25
In the above code, we have grouped by Sex, meaning that calculations performed on our data will account for males and females separately. Following code execution, the console displays the mean Score, the standard deviation (sd), and the total number of participants (n()) for females and for males (group_by(Sex)). That is, the average Score for females is 0.437 and the average Score for males is 0.487.
Let’s group by Sex and Age next (the order in which the variables appear within group_by() doesn’t change which groups are formed):
data %>%
  group_by(Sex, Age) %>% # grouped by Sex and Age
  summarize(m = mean(Score),
            s = sd(Score),
            n = n()) %>%
  ungroup()
## # A tibble: 27 x 5
## Sex Age m s n
## <chr> <dbl> <dbl> <dbl> <int>
## 1 female 20 0.046 NaN 1
## 2 female 21 0.740 0.253 3
## 3 female 22 0.672 0.253 2
## 4 female 23 0.501 NaN 1
## 5 female 25 0.579 0.167 3
## 6 female 26 0.41 NaN 1
## 7 female 28 0.152 NaN 1
## 8 female 29 0.426 0.339 2
## 9 female 30 0.170 0.238 2
## 10 female 33 0.173 NaN 1
## # ... with 17 more rows
There are now considerably more rows (27 rows) in this output. When performing calculations, R now considers each combination of Age and Sex. For example, the average 25-year-old female had a score of 0.579.
We also see that there are some missing standard deviation values (NaN). This is because calculating standard deviation requires more than one participant/observation.
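To find out which groups are affected, one option is to flag the groups that contain a single observation. The sketch below uses a small made-up tibble (toy is hypothetical and not part of the dataset above). Depending on your dplyr version, the undefined standard deviation may print as NA rather than NaN; is.na() is TRUE for both.

```r
library(dplyr)

# Hypothetical toy data: the female group has only one observation
toy <- tibble(Sex = c("male", "male", "female"),
              Score = c(0.2, 0.6, 0.5))

res <- toy %>%
  group_by(Sex) %>%
  summarize(m = mean(Score),
            s = sd(Score), # undefined when a group has one observation
            n = n()) %>%
  ungroup() %>%
  mutate(single_obs = n < 2) # TRUE where s could not be computed

res
```

Rows where single_obs is TRUE are the ones to treat with care before comparing standard deviations across groups.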
6.3.2 mutate() and group_by()
We could also utilize mutate() after group_by() to add a new column based on the group.
data %>%
  group_by(Sex) %>%
  mutate(m = mean(Score)) %>% # calculates mean score by Sex
  ungroup()
## # A tibble: 50 x 5
## ID Sex Age Score m
## <int> <chr> <dbl> <dbl> <dbl>
## 1 1 male 26 0.01 0.487
## 2 2 female 25 0.418 0.437
## 3 3 male 39 0.014 0.487
## 4 4 female 37 0.09 0.437
## 5 5 male 31 0.061 0.487
## 6 6 female 34 0.328 0.437
## 7 7 male 34 0.656 0.487
## 8 8 female 30 0.002 0.437
## 9 9 male 26 0.639 0.487
## 10 10 female 33 0.173 0.437
## # ... with 40 more rows
Instead of collapsing all rows to a summary value, mutate() adds a new column (m) containing the average male and female score. The averaged scores in column m correspond to the value in the Sex column (0.487 for males and 0.437 for females).
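Because mutate() keeps every row, the group mean can be used right away in a row-level calculation. As a minimal sketch on a made-up tibble (toy and the column name dev are hypothetical), here is how far each score sits from the mean of its own group:

```r
library(dplyr)

# Hypothetical toy data: two males and two females
toy <- tibble(Sex = c("male", "female", "male", "female"),
              Score = c(0.2, 0.4, 0.6, 0.8))

centered <- toy %>%
  group_by(Sex) %>%
  mutate(m = mean(Score), # group mean, repeated on every row of the group
         dev = Score - m) %>% # distance of each row from its group mean
  ungroup()

centered
```

The result still has four rows, one per participant, which would not be possible with summarize().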
6.3.3 Ungrouping
Notice that ungroup() is always used after the group_by() command once the calculations have been performed. If you forget to ungroup() data, later data management steps can silently produce results you did not intend. Always ungroup() when you’ve finished with your calculations.
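For summarize() specifically, recent dplyr versions (1.0.0 and later, which is an assumption about your installation) can drop the grouping for you through the .groups argument. The sketch below, on a made-up tibble (toy is hypothetical), compares the default behaviour with .groups = "drop", which has the same effect as a trailing ungroup().

```r
library(dplyr)

# Hypothetical toy data with two grouping variables
toy <- tibble(Sex = c("male", "male", "female", "female"),
              Age = c(20, 21, 20, 21),
              Score = c(0.2, 0.4, 0.6, 0.8))

# Default: summarize() removes only the last grouping variable (Age)
kept <- toy %>%
  group_by(Sex, Age) %>%
  summarize(m = mean(Score))

# .groups = "drop": the result carries no grouping at all
dropped <- toy %>%
  group_by(Sex, Age) %>%
  summarize(m = mean(Score), .groups = "drop")

group_vars(kept) # "Sex"
group_vars(dropped) # character(0)
```

Writing ungroup() explicitly, as in the rest of this section, works in every dplyr version and with mutate() as well.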
Let’s see an example of when ungrouping matters:
## Example 1
data %>%
  group_by(Sex) %>%
  mutate(m = mean(Age)) %>% # calculates the average age of males and females
  mutate(x = mean(Score)) %>% # calculates the average score of males and females
  ungroup() # closing ungroup()
Compare this with code that includes ungroup() nested between the two mutate() functions:
## Example 2
data %>%
  group_by(Sex) %>%
  mutate(m = mean(Age)) %>% # calculates the average age of males and females
  ungroup() %>% # nested ungroup()
  mutate(x = mean(Score)) # calculates the average score of all participants
In the first example, m, which calculates mean Age, is either 29.2 if the participant is male or 28.96 if the participant is female. x, which calculates mean Score, is 0.487 for males and 0.437 for females. For both calculations, the data is grouped by Sex.
In the second example, m still calculates the average Age for males separate from females as in the first example. However, x equals a Score of 0.462 for every row/observation. This is because group_by(Sex) is removed via ungroup() after the first mutate() function. Here, x calculates the mean Score for all participants together.
Neither method is right or wrong; it depends on what you’re trying to achieve. When deciding where to place the ungroup() function, ask yourself: Does it make sense to calculate different values for this Variable? If so, the group_by(Variable) function should be written before the calculation function (mutate/summarize).
If you use group_by(), you must have a matching ungroup() somewhere. Even if you do not plan on performing additional calculations, it’s a good habit to keep. Making sure that you ungroup() is especially important when creating objects:
## Creating/Saving the object named "data1"
data1 <-
  data %>%
  group_by(Sex) %>%
  mutate(m = mean(Age))
## Using the data1 object after it's been saved above (WITHOUT an ungroup)
data1 %>%
  mutate(x = mean(Score))
Anytime you use data1, a saved object, it will automatically have group_by(Sex) as part of its definition, and further calculations will account for this grouping variable.
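If you are unsure whether a saved object still carries a grouping, you can ask it. The sketch below uses dplyr’s group_vars() on a made-up tibble (toy and grouped are hypothetical names) to show that the grouping travels with the object until it is removed.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(Sex = c("male", "female", "male", "female"),
              Score = c(0.2, 0.4, 0.6, 0.8))

# Saved WITHOUT an ungroup(), like data1 above
grouped <- toy %>%
  group_by(Sex) %>%
  mutate(m = mean(Score))

group_vars(grouped) # "Sex": the grouping is stored in the object
group_vars(ungroup(grouped)) # character(0): no grouping left
```

Printing a grouped tibble also shows a "# Groups:" line in its header, which is a quick visual check.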
Forgetting to ungroup() can get even more complex!
## Creating/Saving the object named "data1"
data1 <-
  data %>%
  group_by(Sex) %>%
  mutate(m = mean(Age))
## Using the data1 object after it's been saved above (WITHOUT an ungroup)
data1 %>%
  group_by(Age) %>%
  mutate(x = mean(Score)) %>%
  ungroup()
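Now, you might expect the second code chunk to group by Sex and Age, because data1 was saved with group_by(Sex) still attached. By default, however, a new group_by() replaces any existing grouping, so the x variable calculates the mean Score for each Age only; the earlier grouping is kept only if you ask for it (with .add = TRUE in recent dplyr versions). Whichever result you wanted, the outcome depends on a grouping hidden inside the object, so it’s best to specify the grouping on each line. As your scripts increase in length, it can become difficult to recall the specifics about object definitions, especially if they involve a special grouping variable. Keep your objects as simple as possible!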
Here is the proper method in which to save and use the data1 object:
data1 <-
  data %>%
  group_by(Sex) %>%
  mutate(m = mean(Age)) %>%
  ungroup() # Ungroup at the end of a definition!!!
data1 %>%
  group_by(Sex, Age) %>% # group the relevant variables here
  mutate(x = mean(Score)) %>%
  ungroup()
In conclusion, ALWAYS UNGROUP AFTER GROUPING