5.4 By another variable

Researchers often report descriptive statistics not only for the entire sample but also separately by levels of another variable, typically an exposure or outcome. The code in the previous section computes descriptive statistics over the entire sample; this section demonstrates a few options for computing a statistic at each level of another variable.

5.4.1 tapply()

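The tapply() function applies a function to a variable at each level of another variable, or at each combination of levels of a list of variables. Additional arguments, such as na.rm = TRUE, are passed on to the applied function. For example: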

# Mean of cholesterol by gender
tapply(mydat$choles, mydat$gender, mean, na.rm = TRUE)

##   Male Female 
##  177.2  191.8

# Mean of cholesterol by gender and race
tapply(mydat$choles, list(mydat$gender, mydat$race), mean, na.rm = TRUE)

##        Mexican American Other Hispanic Non-Hispanic White Non-Hispanic Black Other
## Male              173.4          213.0              174.8              168.7 177.9
## Female            189.7          192.2              196.8              185.3 180.9
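The following is a minimal, self-contained sketch of the same idea, assuming a small made-up data frame (toy is hypothetical and is not the mydat dataset used in this chapter). It illustrates that tapply() returns an array named by group, so a single group can be pulled out by name, and that any function, including an anonymous one, can be applied within groups.

```r
# Hypothetical toy data for illustration only (not the chapter's mydat)
toy <- data.frame(
  choles = c(180, 200, NA, 170, 190, 210),
  gender = factor(c("Male", "Female", "Male", "Male", "Female", "Female"),
                  levels = c("Male", "Female"))
)

# Mean by group; na.rm = TRUE is passed through to mean()
means <- tapply(toy$choles, toy$gender, mean, na.rm = TRUE)

# The result is named by group, so one group can be extracted by name
female_mean <- means["Female"]

# Number of non-missing values per group, using an anonymous function
n_obs <- tapply(toy$choles, toy$gender, function(x) sum(!is.na(x)))

print(means)
print(n_obs)
```

Reporting the non-missing count next to each mean is one way to make clear how many observations each group statistic is based on.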

5.4.2 group_by() + summarize()

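The tidyverse counterpart of tapply() is the combination of group_by() and summarize() from the dplyr package. Whereas tapply() returns an array with one dimension per grouping variable, this approach returns a tibble with one row per group.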

# Mean of cholesterol by gender
mydat %>% 
  group_by(gender) %>%
  summarize(mean = mean(choles, na.rm = TRUE))

## # A tibble: 2 × 2
##   gender  mean
##   <fct>  <dbl>
## 1 Male    177.
## 2 Female  192.

# Mean of cholesterol by gender and race
mydat %>% 
  group_by(gender, race) %>%
  summarize(mean = mean(choles, na.rm = TRUE))

## # A tibble: 10 × 3
## # Groups:   gender [2]
##    gender race                mean
##    <fct>  <fct>              <dbl>
##  1 Male   Mexican American    173.
##  2 Male   Other Hispanic      213 
##  3 Male   Non-Hispanic White  175.
##  4 Male   Non-Hispanic Black  169.
##  5 Male   Other               178.
##  6 Female Mexican American    190.
##  7 Female Other Hispanic      192.
##  8 Female Non-Hispanic White  197.
##  9 Female Non-Hispanic Black  185.
## 10 Female Other               181.
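One practical advantage of summarize() is that several statistics can be computed in a single call. The sketch below assumes a small made-up data frame (toy is hypothetical and is not the chapter's mydat) and that dplyr version 1.0.0 or later is installed, since it uses the .groups argument to return an ungrouped tibble.

```r
library(dplyr)

# Hypothetical toy data for illustration only (not the chapter's mydat)
toy <- data.frame(
  choles = c(180, 200, NA, 170, 190, 210),
  gender = factor(c("Male", "Female", "Male", "Male", "Female", "Female"),
                  levels = c("Male", "Female"))
)

# Several statistics per group in one call
by_gender <- toy %>%
  group_by(gender) %>%
  summarize(
    n    = sum(!is.na(choles)),        # non-missing observations
    mean = mean(choles, na.rm = TRUE),
    sd   = sd(choles, na.rm = TRUE),
    .groups = "drop"                   # return an ungrouped tibble
  )

print(by_gender)
```

Setting .groups = "drop" is a design choice: it keeps later steps in a pipeline from silently operating within the remaining groups.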