3.6 Convert numeric to categorical by binning
3.6.1 Specific ranges
Sometimes you have a numeric variable that takes on values over a range (e.g., BMI or age) and you would like to create a categorical variable with levels corresponding to specific ranges. For example, let’s convert the variable Age (years) into AgeGroup with the following levels:
Level | Label |
---|---|
1 | 40 to < 50y |
2 | 50 to < 60y |
3 | 60 to < 70y |
4 | 70 to < 80y |
5 | 80y+ |
First, check the number of missing values in Age and its minimum and maximum:
## [1] 0
## [1] 42 90
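Use cut() to set the bin boundaries. Setting right = F produces bins that are closed on the left and open on the right, such as [40, 50), and include.lowest = T additionally closes the last bin on the right so that the maximum value is included, as in [80, 90]. A square bracket [ or ] indicates inclusion while a parenthesis indicates exclusion, so [40, 50) corresponds to the range “40 to less than 50”.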
cut() automatically creates a factor with labels that show the ranges.
##
## [40,50) [50,60) [60,70) [70,80) [80,90]
## 64 203 183 40 40
In this case, there were 0 missing (NA) values both before and after binning. Values that fall outside the bin boundaries become NA, so if you end up with more missing values than you started with, you need to expand the range of the lowest and/or highest bins.
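The base R steps described above can be sketched as follows. This is a minimal illustration that uses a small made-up vector `age` (an assumption for this example, not the book's data) in place of the Age variable. It also shows how breaks that are too narrow create new missing values.

```r
# Hypothetical ages for illustration; substitute your own variable
age <- c(42, 47, 50, 58, 63, 69, 70, 79, 80, 90)

# Number of missing values and observed range before binning
sum(is.na(age))
range(age)

# right = FALSE gives bins closed on the left, e.g. [40,50);
# include.lowest = TRUE also closes the last bin on the right: [80,90]
# (TRUE and FALSE are the full forms of T and F)
age_group <- cut(age,
                 breaks = c(40, 50, 60, 70, 80, 90),
                 include.lowest = TRUE,
                 right = FALSE)

# Frequency table, showing NA only if any are present
table(age_group, useNA = "ifany")

# Breaks that stop at 80 do not cover the value 90, so it becomes NA
too_narrow <- cut(age,
                  breaks = c(40, 50, 60, 70, 80),
                  include.lowest = TRUE,
                  right = FALSE)
sum(is.na(too_narrow))
```

Comparing the count of missing values before and after binning, as in the last line, is a quick way to detect breaks that do not span the data.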
In tidyverse:
# Check the number of missing values
mydat_tibble %>%
  # Create a logical vector for missing
  mutate(nmiss_age = is.na(Age)) %>%
  # Count the missing values
  summarise(nmiss = sum(nmiss_age))
# Find the min and max
mydat_tibble %>%
  summarise(min = min(Age),
            max = max(Age))
# Use cut() to set the bin boundaries
# include.lowest = T and right = F creates bins of the form [40, 50) (40 to < 50)
mydat_tibble <- mydat_tibble %>%
  mutate(AgeGroup = cut(Age,
                        breaks = c(40, 50, 60, 70, 80, 90),
                        include.lowest = T,
                        right = F))
# cut() automatically creates a factor with labels that show the ranges
# A square bracket [ or ] indicates inclusion. So [40, 50) represents 40 to < 50
mydat_tibble %>%
  count(AgeGroup)
# Verify that the bins contain values in the specified ranges
mydat_tibble %>%
  group_by(AgeGroup) %>%
  summarise(min = min(Age),
            max = max(Age))
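The automatic labels show the ranges in interval notation. If you prefer descriptive labels like those in the table at the start of this section, one option is the labels argument of cut(). The following is a minimal sketch using a made-up vector `age` and a label vector `age_labels` (both are assumptions for illustration).

```r
# Hypothetical ages for illustration; substitute your own variable
age <- c(42, 47, 50, 58, 63, 69, 70, 79, 80, 90)

# One label per bin, in the same order as the bins
age_labels <- c("40 to < 50y", "50 to < 60y", "60 to < 70y",
                "70 to < 80y", "80y+")

# labels replaces the default interval labels such as [40,50)
age_group <- cut(age,
                 breaks = c(40, 50, 60, 70, 80, 90),
                 labels = age_labels,
                 include.lowest = TRUE,
                 right = FALSE)

levels(age_group)
table(age_group, useNA = "ifany")
```

With custom labels the bin boundaries are no longer visible in the output, so it is worth verifying the minimum and maximum within each level as shown above. If ages above 90 were possible, an upper break of Inf would keep them in the last bin.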
3.6.2 Equal length bins
To create a factor variable with equal length bins, use cut_interval() (from ggplot2, which is loaded with tidyverse). Specify the desired length of each bin and R will automatically figure out the break points. Alternatively, specify the desired number of bins and R will automatically create that number of equal length bins. If the result does not have exactly the breaks you would like, use cut() instead to specify custom ranges (see the previous section). See also ?cut_width, which has additional options.
# Specify length, R figures out how many bins are needed
# Use right = F to have intervals closed on the left
mydat_tibble <- mydat_tibble %>%
  mutate(AgeGroup = cut_interval(Age,
                                 length = 5,
                                 right = F))
Since length = 5, each bin has a width of 5 years.
## # A tibble: 10 × 2
## AgeGroup n
## <fct> <int>
## 1 [40,45) 16
## 2 [45,50) 48
## 3 [50,55) 97
## 4 [55,60) 106
## 5 [60,65) 108
## 6 [65,70) 75
## 7 [70,75) 9
## 8 [75,80) 31
## 9 [80,85) 22
## 10 [85,90] 18
You could instead specify the number of bins, for example 4:
# Specify # of bins, R figures out the length
mydat_tibble <- mydat_tibble %>%
  mutate(AgeGroup = cut_interval(Age,
                                 n = 4,
                                 right = F))
R figured out that each bin should then have a width of 12 years (the range of Age, 90 - 42 = 48, divided by 4).
## # A tibble: 4 × 2
## AgeGroup n
## <fct> <int>
## 1 [42,54) 132
## 2 [54,66) 260
## 3 [66,78) 90
## 4 [78,90] 48
## # A tibble: 4 × 3
## AgeGroup min max
## <fct> <int> <int>
## 1 [42,54) 42 53
## 2 [54,66) 54 65
## 3 [66,78) 66 77
## 4 [78,90] 78 90
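To see where equally spaced breaks like these come from, the arithmetic can be reproduced in base R with seq() and cut(). This is only a sketch of the idea, using a made-up vector `age` and a hypothetical `n_bins` variable. It is not necessarily how cut_interval() is implemented internally.

```r
# Hypothetical ages for illustration, spanning 42 to 90
age <- c(42, 45, 53, 54, 60, 65, 66, 77, 78, 90)

# Number of equal length bins wanted
n_bins <- 4

# n_bins bins need n_bins + 1 equally spaced break points
breaks <- seq(min(age), max(age), length.out = n_bins + 1)
breaks

# Width of each bin: (max - min) / n_bins
diff(breaks)

# Bin using the computed breaks, closed on the left
age_group <- cut(age,
                 breaks = breaks,
                 include.lowest = TRUE,
                 right = FALSE)
table(age_group, useNA = "ifany")
```

Because the breaks start at the observed minimum and end at the observed maximum, no values fall outside the bins.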
3.6.3 Equal size bins
You can instead cut a variable in such a way that the bins have about the same number of observations. When there are multiple observations with the same value (ties), it may be impossible to get exactly equal numbers in the bins, but R will try to get as close as possible.
In base R, use quantile(). If you want four bins, for example, specify the quartiles (the 25th, 50th, and 75th percentiles), along with the minimum and the maximum.
# Use quantile within cut
# Use na.rm = T inside quantile() if there are missing values
mydat$AgeGroup <- cut(mydat$Age,
                      breaks = quantile(mydat$Age,
                                        probs = seq(0, 1, 0.25),
                                        na.rm = T),
                      include.lowest = T,
                      right = F)
# Check results
table(mydat$AgeGroup, useNA = "ifany")
##
## [42,54) [54,59) [59,66) [66,90]
## 132 113 147 138
## $`[42,54)`
## [1] 42 53
##
## $`[54,59)`
## [1] 54 58
##
## $`[59,66)`
## [1] 59 65
##
## $`[66,90]`
## [1] 66 90
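The quantile-based approach, together with a check of the range of values within each bin, can be sketched on a small example. The vector `age` below is made up for illustration, and tapply() with range() is one possible way to produce a per-bin check like the one above (an assumption, since other functions would also work).

```r
# Hypothetical ages for illustration: 16 distinct values
age <- c(42, 45, 50, 53, 54, 56, 58, 59,
         61, 63, 66, 70, 75, 80, 85, 90)

# Minimum, quartiles, and maximum serve as the break points
q <- quantile(age, probs = seq(0, 1, 0.25), na.rm = TRUE)
q

# Bin at the quantiles, closed on the left
age_group <- cut(age,
                 breaks = q,
                 include.lowest = TRUE,
                 right = FALSE)

# With no ties, each bin holds the same number of observations
table(age_group, useNA = "ifany")

# Minimum and maximum of age within each bin
tapply(age, age_group, range)
```

Note that cut() stops with an error if the break points are not unique, which can happen when heavy ties make two quantiles equal.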
In tidyverse, use cut_number():