6.3 group_by() and ungroup()

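group_by() takes existing data and groups it by one or more variables so that later operations are performed separately within each group. ungroup() removes that grouping. Many operations, such as summarize() and mutate(), respect groups.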

Example: Grouping by age and sex (male/female) might be useful in a dataset if we care about how females of a certain age scored compared to males of a certain age (or comparing ages within males or within females).

Let’s create a sample dataset to reflect this example (to avoid entry errors, copy and paste this into your script):

## Creating identification number to represent 50 individual people
ID <- c(1:50)

## Creating sex variable (25 males/25 females)
Sex <- rep(c("male", "female"), 25) # rep stands for replicate

## Creating age variable (20-39 year olds)
Age <- c(26, 25, 39, 37, 31, 34, 34, 30, 26, 33, 
         39, 28, 26, 29, 33, 22, 35, 23, 26, 36, 
         21, 20, 31, 21, 35, 39, 36, 22, 22, 25, 
         27, 30, 26, 34, 38, 39, 30, 29, 26, 25, 
         26, 36, 23, 21, 21, 39, 26, 26, 27, 21) 

## Creating a dependent variable called Score
Score <- c(0.010, 0.418, 0.014, 0.090, 0.061, 0.328, 0.656, 0.002, 0.639, 0.173, 
           0.076, 0.152, 0.467, 0.186, 0.520, 0.493, 0.388, 0.501, 0.800, 0.482, 
           0.384, 0.046, 0.920, 0.865, 0.625, 0.035, 0.501, 0.851, 0.285, 0.752, 
           0.686, 0.339, 0.710, 0.665, 0.214, 0.560, 0.287, 0.665, 0.630, 0.567, 
           0.812, 0.637, 0.772, 0.905, 0.405, 0.363, 0.773, 0.410, 0.535, 0.449)

## Creating a unified dataset that puts together all variables
data <- tibble(ID, Sex, Age, Score)

6.3.1 summarize() and group_by()

Let’s say that I want to calculate and compare the average Score (and other measures) for males and females separately:

data %>% 
  group_by(Sex) %>% 
  summarize(m = mean(Score), # calculates the mean
            s = sd(Score),   # calculates the standard deviation
            n = n()) %>%     # calculates the total number of observations
  ungroup()
## # A tibble: 2 x 4
##   Sex        m     s     n
##   <chr>  <dbl> <dbl> <int>
## 1 female 0.437 0.268    25
## 2 male   0.487 0.268    25

In the above code, we have grouped by Sex, meaning that calculations performed on our data will account for males and females separately. Following code execution, the console displays the mean Score, the standard deviation (sd), and the total number of participants (n()) for females and for males (group_by(Sex)). That is, the average Score for females is 0.437 and the average Score for males is 0.487.

Let’s group by Sex and Age next (the order in which the variables appear within group_by() doesn’t affect the calculated values):

data %>% 
  group_by(Sex, Age) %>%     # grouped by Sex and Age
  summarize(m = mean(Score),
            s = sd(Score),   
            n = n()) %>% 
  ungroup()
## # A tibble: 27 x 5
##    Sex      Age     m       s     n
##    <chr>  <dbl> <dbl>   <dbl> <int>
##  1 female    20 0.046 NaN         1
##  2 female    21 0.740   0.253     3
##  3 female    22 0.672   0.253     2
##  4 female    23 0.501 NaN         1
##  5 female    25 0.579   0.167     3
##  6 female    26 0.41  NaN         1
##  7 female    28 0.152 NaN         1
##  8 female    29 0.426   0.339     2
##  9 female    30 0.170   0.238     2
## 10 female    33 0.173 NaN         1
## # ... with 17 more rows

There are now considerably more rows (27 rows) in this output. When performing calculations, R now considers each combination of Age and Sex. For example, the average 25-year-old female had a score of 0.579.

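We also see that there are some missing standard deviation values (NaN; depending on your dplyr version, these may print as NA instead). This is because calculating a standard deviation requires more than one participant/observation, and each of these groups contains only one (n = 1).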
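The missing standard deviations can be reproduced on a much smaller scale. The sketch below uses a made-up three-row tibble (`toy`, `Group`, and `sd_missing` are hypothetical names, not part of the chapter's dataset) and flags any group whose standard deviation could not be computed. It assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical mini dataset: group "b" contains a single observation
toy <- tibble(
  Group = c("a", "a", "b"),
  Score = c(0.2, 0.4, 0.9)
)

toy_summary <- toy %>%
  group_by(Group) %>%
  summarize(m = mean(Score),
            s = sd(Score),   # undefined when a group has only one row
            n = n()) %>%
  ungroup()

# Flag groups whose standard deviation is missing (is.na() catches NA and NaN)
toy_summary %>%
  mutate(sd_missing = is.na(s))
```

Checking n alongside s makes it easy to see that the missing values come from single-observation groups rather than from a problem with the data.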

6.3.2 mutate() and group_by()

We could also utilize mutate() after group_by() to add a new column based on the group.

data %>% 
  group_by(Sex) %>% 
  mutate(m = mean(Score)) %>% # calculates mean score by Sex
  ungroup()
## # A tibble: 50 x 5
##       ID Sex      Age Score     m
##    <int> <chr>  <dbl> <dbl> <dbl>
##  1     1 male      26 0.01  0.487
##  2     2 female    25 0.418 0.437
##  3     3 male      39 0.014 0.487
##  4     4 female    37 0.09  0.437
##  5     5 male      31 0.061 0.487
##  6     6 female    34 0.328 0.437
##  7     7 male      34 0.656 0.487
##  8     8 female    30 0.002 0.437
##  9     9 male      26 0.639 0.487
## 10    10 female    33 0.173 0.437
## # ... with 40 more rows

Instead of collapsing all rows to a summary value, mutate() adds a new column (m) containing the average male and female score. The averaged scores in column m correspond to the value in the Sex column (0.487 for males and 0.437 for females).
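To make the contrast with summarize() concrete, here is a minimal sketch on a made-up four-row tibble (`toy`, `toy_summarized`, and `toy_mutated` are hypothetical names used only for illustration). It assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical mini dataset with two males and two females
toy <- tibble(
  Sex   = c("male", "female", "male", "female"),
  Score = c(0.1, 0.3, 0.5, 0.7)
)

# summarize() collapses each group to a single row
toy_summarized <- toy %>%
  group_by(Sex) %>%
  summarize(m = mean(Score)) %>%
  ungroup()

# mutate() keeps every row and repeats the group mean within each group
toy_mutated <- toy %>%
  group_by(Sex) %>%
  mutate(m = mean(Score)) %>%
  ungroup()

nrow(toy_summarized)  # one row per group
nrow(toy_mutated)     # same number of rows as toy
```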

6.3.3 Ungrouping

Notice that ungroup() is always used after group_by() once the calculations are done. If you forget to ungroup() data, later operations will silently keep running within groups, which can produce results you didn’t intend. Always ungroup() when you’ve finished with your calculations.

Let’s see an example of when ungrouping matters:

## Example 1

data %>% 
  group_by(Sex) %>% 
  mutate(m = mean(Age)) %>%   # calculates the average age of males and females
  mutate(x = mean(Score)) %>% # calculates the average score of males and females
  ungroup()                   # closing ungroup() 

Compare this with code that includes ungroup() nested between the two mutate() functions:

## Example 2

data %>% 
  group_by(Sex) %>% 
  mutate(m = mean(Age)) %>%  # calculates the average age of males and females
  ungroup() %>%              # nested ungroup()
  mutate(x = mean(Score))    # calculates the average score across all participants

In the first example, m, which calculates mean Age, is either 29.2 if the participant is male or 28.96 if the participant is female. x, which calculates mean Score, is 0.487 for males and 0.437 for females. For both calculations, the data is grouped by Sex.

In the second example, m still calculates the average Age for males separate from females as in the first example. However, x equals a Score of 0.462 for every row/observation. This is because group_by(Sex) is removed via ungroup() after the first mutate() function. Here, x calculates the mean Score for all participants together.

Neither method is right or wrong; it depends on what you’re trying to achieve. When deciding where to place the ungroup() function, ask yourself: Does it make sense to calculate a separate value for each level of this Variable? If so, the group_by(Variable) function should be written before the calculation function (mutate/summarize), and ungroup() should come after it.
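As one hypothetical illustration of when each placement is useful, the sketch below centers a score two ways: relative to each Sex's own mean (grouped) and relative to the mean of everyone (ungrouped). The tibble and the column names `within_sex` and `overall` are made up for this example, and dplyr is assumed to be installed.

```r
library(dplyr)

# Hypothetical mini dataset
toy <- tibble(
  Sex   = c("male", "female", "male", "female"),
  Score = c(0.1, 0.3, 0.5, 0.7)
)

toy_centered <- toy %>%
  group_by(Sex) %>%
  mutate(within_sex = Score - mean(Score)) %>%  # grouped: uses each Sex's own mean
  ungroup() %>%
  mutate(overall = Score - mean(Score))         # ungrouped: uses the mean of all rows

toy_centered
```

The same expression, Score - mean(Score), gives different answers on either side of ungroup(), which is why its placement should follow from the question being asked.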

If you use group_by(), you must have a matching ungroup() somewhere. Even if you do not plan on performing additional calculations, it’s a good habit to keep. Making sure that you ungroup() is especially important when creating objects!!

## Creating/Saving the object named "data1"
data1 <- 
  data %>% 
  group_by(Sex) %>% 
  mutate(m = mean(Age))

## Using the data1 object after it's been saved above (WITHOUT an ungroup)
data1 %>% 
  mutate(x = mean(Score))

Anytime you use data1, a saved object, it will automatically have group_by(Sex) as part of its definition and further calculations will account for these grouping variables.
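If you are unsure whether a saved object still carries a grouping, dplyr's is_grouped_df() and group_vars() can be used to check. The sketch below uses a made-up tibble (`toy`, `toy_grouped`, and `toy_clean` are hypothetical names) and assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical mini dataset
toy <- tibble(
  Sex = c("male", "female", "male", "female"),
  Age = c(20, 30, 40, 50)
)

# Saved WITHOUT ungroup(): the grouping travels with the object
toy_grouped <- toy %>%
  group_by(Sex) %>%
  mutate(m = mean(Age))

# Saved WITH ungroup(): a plain tibble
toy_clean <- toy_grouped %>%
  ungroup()

is_grouped_df(toy_grouped)  # is the object still grouped?
group_vars(toy_grouped)     # which variables is it grouped by?
is_grouped_df(toy_clean)
group_vars(toy_clean)
```

Printing a grouped tibble also shows a "Groups:" line in its header, which is a quick visual check.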

Forgetting to ungroup() can get even more complex!

## Creating/Saving the object named "data1"
data1 <- 
  data %>% 
  group_by(Sex) %>% 
  mutate(m = mean(Age))

## Using the data1 object after it's been saved above (WITHOUT an ungroup)
data1 %>% 
  group_by(Age) %>% 
  mutate(x = mean(Score)) %>% 
  ungroup()

Now it is easy to lose track of what the second code chunk is grouped by. By default, group_by() replaces any existing grouping rather than adding to it (it adds only when called with .add = TRUE), so the Sex grouping stored in data1 is silently dropped and x is the mean Score for each Age, not for each combination of Sex and Age. If you want both variables, name both in group_by(). As your scripts increase in length, it can become difficult to recall the specifics about object definitions, especially if they involve a hidden grouping variable. Keep your objects as simple as possible!
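A quick way to see what a regrouped object is actually grouped by is group_vars(). The sketch below uses a made-up tibble (`toy`, `replaced`, and `added` are hypothetical names) and assumes dplyr 1.0.0 or later, where the relevant argument is spelled `.add`.

```r
library(dplyr)

# Hypothetical mini dataset, already grouped by Sex
toy <- tibble(
  Sex   = c("male", "female", "male", "female"),
  Age   = c(20, 20, 30, 30),
  Score = c(0.1, 0.3, 0.5, 0.7)
)
toy_grouped <- toy %>% group_by(Sex)

# Default behaviour: the new grouping replaces the old one
replaced <- toy_grouped %>% group_by(Age)

# With .add = TRUE: the new variable is added to the existing grouping
added <- toy_grouped %>% group_by(Age, .add = TRUE)

group_vars(replaced)
group_vars(added)
```

Because the result depends on an argument that is easy to overlook, ungrouping saved objects and naming every grouping variable explicitly keeps the intent visible in the code.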

Here is the proper way to save and use the data1 object:

data1 <- 
  data %>% 
  group_by(Sex) %>% 
  mutate(m = mean(Age)) %>% 
  ungroup() # Ungroup at the end of a definition!!!

data1 %>% 
  group_by(Sex, Age) %>%  # group the relevant variables here
  mutate(x = mean(Score)) %>% 
  ungroup()

In conclusion, ALWAYS UNGROUP AFTER GROUPING