5.4 By another variable
Researchers often report descriptive statistics not only for the entire sample but also by levels of another variable, typically an exposure or outcome. The code in the previous section computes descriptive statistics over the entire sample. The following sections demonstrate a few options for computing a statistic at each level of another variable.
5.4.1 tapply
The tapply() function applies a function to a variable at each level of another variable or list of variables. For example:
# Mean of cholesterol by gender
tapply(mydat$choles, mydat$gender, mean, na.rm = TRUE)
##   Male Female
##  177.2  191.8
# Mean of cholesterol by gender and race
tapply(mydat$choles, list(mydat$gender, mydat$race), mean, na.rm = TRUE)
##        Mexican American Other Hispanic Non-Hispanic White Non-Hispanic Black Other
## Male              173.4          213.0              174.8              168.7 177.9
## Female            189.7          192.2              196.8              185.3 180.9
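The role of na.rm in the calls above can be seen in a small, self-contained sketch. The data frame toy below is hypothetical (made-up values standing in for mydat, which is not reproduced here); only the column names gender and choles are borrowed from the examples above.

```r
# Hypothetical toy data standing in for mydat (values are made up)
toy <- data.frame(
  gender = factor(c("Male", "Male", "Female", "Female", "Female"),
                  levels = c("Male", "Female")),
  choles = c(170, 180, NA, 190, 200)
)

# Without na.rm = TRUE, any group containing a missing value returns NA
tapply(toy$choles, toy$gender, mean)

# Arguments placed after the function name are passed on to that function
means <- tapply(toy$choles, toy$gender, mean, na.rm = TRUE)
means

# The result is a named array, so a single level can be extracted by name
means["Female"]
```

In this toy data the Female group contains one missing value, so its mean is NA in the first call and 195 in the second. The Male mean is 175 in both.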
5.4.2 group_by() + summarize()
The tidyverse version of tapply() is the combination of group_by() and summarize().
# Mean of cholesterol by gender
mydat %>%
  group_by(gender) %>%
  summarize(mean = mean(choles, na.rm = TRUE))
## # A tibble: 2 × 2
##   gender  mean
##   <fct>  <dbl>
## 1 Male    177.
## 2 Female  192.
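One practical difference from tapply() is that summarize() can return several statistics per group in one call. The following is a minimal sketch, assuming the dplyr package is installed. The data frame toy and the column names n_obs and sd are hypothetical choices for illustration, not part of mydat.

```r
library(dplyr)

# Hypothetical toy data standing in for mydat (values are made up)
toy <- data.frame(
  gender = factor(c("Male", "Male", "Female", "Female", "Female"),
                  levels = c("Male", "Female")),
  choles = c(170, 180, NA, 190, 200)
)

# Several summaries per group in a single summarize() call
by_gender <- toy %>%
  group_by(gender) %>%
  summarize(
    n_obs = sum(!is.na(choles)),        # number of non-missing values
    mean  = mean(choles, na.rm = TRUE), # mean, ignoring missing values
    sd    = sd(choles, na.rm = TRUE)    # standard deviation, ignoring missing values
  )
by_gender
```

Each named argument to summarize() becomes one column in the result, with one row per level of gender.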
# Mean of cholesterol by gender and race
mydat %>%
  group_by(gender, race) %>%
  summarize(mean = mean(choles, na.rm = TRUE))
## # A tibble: 10 × 3
## # Groups:   gender [2]
##    gender race                mean
##    <fct>  <fct>              <dbl>
##  1 Male   Mexican American    173.
##  2 Male   Other Hispanic      213
##  3 Male   Non-Hispanic White  175.
##  4 Male   Non-Hispanic Black  169.
##  5 Male   Other               178.
##  6 Female Mexican American    190.
##  7 Female Other Hispanic      192.
##  8 Female Non-Hispanic White  197.
##  9 Female Non-Hispanic Black  185.
## 10 Female Other               181.
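The "# Groups: gender [2]" line in the output above indicates that, after grouping by two variables, the summarized result is still grouped by the first one. The following is a minimal sketch of how the .groups argument of summarize() changes this, assuming dplyr version 1.0.0 or later. The data frame toy and its race levels "A" and "B" are hypothetical placeholders.

```r
library(dplyr)

# Hypothetical toy data standing in for mydat (values are made up)
toy <- data.frame(
  gender = c("Male", "Male", "Female", "Female"),
  race   = c("A", "B", "A", "B"),
  choles = c(170, 180, 190, 200)
)

# Default behavior: only the last grouping variable (race) is dropped,
# so the result remains grouped by gender
grouped <- toy %>%
  group_by(gender, race) %>%
  summarize(mean = mean(choles, na.rm = TRUE))
group_vars(grouped)

# .groups = "drop" returns an ungrouped result
ungrouped <- toy %>%
  group_by(gender, race) %>%
  summarize(mean = mean(choles, na.rm = TRUE), .groups = "drop")
group_vars(ungrouped)
```

Dropping the grouping is often convenient when the summary table will be passed to later steps that should operate on all rows at once rather than within each level of gender.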