3.6 Convert numeric to categorical by binning

3.6.1 Specific ranges

Sometimes you have a numeric variable that takes on values over a range (e.g., BMI, age, etc.) and you would like to create a categorical variable with levels corresponding to specific ranges. For example, let’s convert the variable Age (years) into AgeGroup with the following levels:

Level Label
1 40 to < 50y
2 50 to < 60y
3 60 to < 70y
4 70 to < 80y
5 80y+
# Check the number of missing values
sum(is.na(mydat$Age))
## [1] 0
# Find the min and max
range(mydat$Age)
## [1] 42 90

Use cut() to set the bin boundaries. The combination of include.lowest = T and right = F results in bins of the form [40, 50). A square bracket [ or ] indicates inclusion while parentheses indicate exclusion. So [40, 50) corresponds to the range “40 to less than 50”.

mydat$AgeGroup <- cut(mydat$Age,
                      breaks = c(40, 50, 60, 70, 80, 90),
                      include.lowest = T,
                      right = F)

cut() automatically creates a factor with labels that show the ranges.

table(mydat$AgeGroup, useNA = "ifany")
## 
## [40,50) [50,60) [60,70) [70,80) [80,90] 
##      64     203     183      40      40

In this case, there were 0 missing (NA) values both before and after. If you end up with more missing values than you started with, then you need to expand the range of the lowest and/or highest bins.

In tidyverse:

# Check the number of missing values
mydat_tibble %>%
  # Create a logical vector for missing
  mutate(nmiss_age = is.na(Age)) %>%
  # Count the missing values
  summarise(nmiss  = sum(nmiss_age))

# Find the min and max
mydat_tibble %>%
  summarise(min = min(Age),
            max = max(Age))

# Use cut() to set the bin boundaries
# include.lowest = T and right = F creates bins of the form [40, 50) (40 to < 50)
mydat_tibble <- mydat_tibble %>% 
  mutate(AgeGroup = cut(Age,
                        breaks = c(40, 50, 60, 70, 80, 90),
                        include.lowest = T,
                        right = F))

# cut() automatically creates a factor with labels that show the ranges
# A square bracket [ or ] indicates inclusion. So [40, 50) represents 40 to < 50
mydat_tibble %>% 
  count(AgeGroup)

# Verify that the bins contain values in the specified ranges
mydat_tibble %>% 
  group_by(AgeGroup) %>% 
  summarise(min = min(Age),
            max = max(Age))

3.6.2 Equal length bins

To create a factor variable with equal length bins, use the tidyverse function cut_interval() to specify the desired length of each bin, after which R will automatically figure out the break points. Alternatively, specify the desired number of bins and R will automatically create that correct number of equal length bins. If the result does not have exactly the breaks you would like, use cut() instead to specify custom ranges (see the previous section). See also ?cut_width which has additional options.

# Specify length, R figures out how many bins are needed
# Use right = F to have intervals closed on the left
mydat_tibble <- mydat_tibble %>% 
  mutate(AgeGroup = cut_interval(Age,
                                 length = 5,
                                 right = F))

Since length = 5, each bin has a width of 5 years.

# Examine the result
mydat_tibble %>% 
  count(AgeGroup)
## # A tibble: 10 × 2
##    AgeGroup     n
##    <fct>    <int>
##  1 [40,45)     16
##  2 [45,50)     48
##  3 [50,55)     97
##  4 [55,60)    106
##  5 [60,65)    108
##  6 [65,70)     75
##  7 [70,75)      9
##  8 [75,80)     31
##  9 [80,85)     22
## 10 [85,90]     18

You could instead specify the number of bins, for example 4:

# Specify # of bins, R figures out the length
mydat_tibble <- mydat_tibble %>% 
  mutate(AgeGroup = cut_interval(Age,
                                 n = 4,
                                 right = F))

R figured out that each bin should then have a width of 12 years.

# Examine the result
mydat_tibble %>% 
  count(AgeGroup)
## # A tibble: 4 × 2
##   AgeGroup     n
##   <fct>    <int>
## 1 [42,54)    132
## 2 [54,66)    260
## 3 [66,78)     90
## 4 [78,90]     48
mydat_tibble %>% 
  group_by(AgeGroup) %>% 
  summarise(min = min(Age),
            max = max(Age))
## # A tibble: 4 × 3
##   AgeGroup   min   max
##   <fct>    <int> <int>
## 1 [42,54)     42    53
## 2 [54,66)     54    65
## 3 [66,78)     66    77
## 4 [78,90]     78    90

3.6.3 Equal size bins

You can instead cut a variable in such a way that the bins have about the same number of observations. When there are multiple observations with the same value (ties), it may be impossible to get exactly equal numbers in the bins, but R will try to get as close as possible.

In base R, use quantile(). If you want four bins, for example, specify the quartiles (the 25th, 50th, and 75th percentiles), along with the minimum and the maximum.

# Use quantile within cut
# Use na.rm = T inside quantile() if there are missing values
mydat$AgeGroup <- cut(mydat$Age,
                      breaks = quantile(mydat$Age,
                                        probs = seq(0, 1, 0.25),
                                        na.rm = T),
                      include.lowest = T,
                      right = F)

# Check results
table(mydat$AgeGroup, useNA = "ifany")
## 
## [42,54) [54,59) [59,66) [66,90] 
##     132     113     147     138
tapply(mydat$Age, mydat$AgeGroup, range)
## $`[42,54)`
## [1] 42 53
## 
## $`[54,59)`
## [1] 54 58
## 
## $`[59,66)`
## [1] 59 65
## 
## $`[66,90]`
## [1] 66 90

In tidyverse, use cut_number():

# Intervals with approximately equal numbers of observations
mydat_tibble <- mydat_tibble %>% 
  mutate(AgeGroup = cut_number(Age,
                               n = 4,
                               right = F))

# Check results
mydat_tibble %>% 
  count(AgeGroup)

mydat_tibble %>% 
  group_by(AgeGroup) %>% 
  summarise(min = min(Age),
            max = max(Age))