Chapter 20 More than two samples from normal distributions
Motivating scenarios: We are interested to investigate the difference between the means of more than two samples.
Learning goals: By the end of this chapter you should be able to
- Explain the multiple testing problem.
- Explain how ANOVA attempts to solve the multiple testing problem.
- Visually identify and mathematically interpret various “sums of squares”.
- Calculate an F statistic and relate this to the sampling distribution.
- Conduct appropriate “post-hoc” tests.
20.1 Setup
We often have more than two groups. For example - we might want to know, not just if a fertilizer is better than a control, but which of a number of fertilizers have different impacts on plant growth.
Or for the extended example in this chapter, if ambient temperature can influence body temperature in rodents, like the round-tailed ground squirrel, Spermophilus tereticaudus. In this experiment, Wooden and Walsberg (2004) put squirrels in environments that where 10°C (cold), 35°C (warm), and 45°C (hot), and measured their body temperature. The data are plotted in
#first lets load the data
<- read_csv("https://raw.githubusercontent.com/ybrandvain/biostat/master/data/sqrl.csv")
sqrl # now lets order data from coldest to warmest for a pretty plot
<- mutate(sqrl, ambient_temp = fct_relevel(ambient_temp, "cold", "warm", "hot"))
sqrl
ggplot(sqrl, aes(x = ambient_temp, y = body_temp, color = ambient_temp))+
geom_jitter(height = 0, width = .2, size = 2, alpha = .35, show.legend = FALSE)+
stat_summary(fun.data = "mean_cl_normal", geom = "errorbar", width = .2, color = "black")+
scale_color_manual(values = c("blue", "orange", "red"))
include_graphics("https://upload.wikimedia.org/wikipedia/commons/thumb/1/19/Spermophilus_tereticaudus_01.jpg/256px-Spermophilus_tereticaudus_01.jpg")
So, we want to test the null hypothesis that ambient temperature and body temperature are unrelated in this experimental study of the round-tailed ground squirrel.
20.1.1 Why not all paiwise t-tests?
We could try conducting three two-sample t-tests, as below. But this is wrong! as now our probability of at least one false negative is \(1-(1-0.05)^3 = .143\), and our p-values are not properly callibrated.
# cold vs warm
%>%
sqrl ::filter(ambient_temp != "hot")%>%
dplyrt.test(body_temp ~ ambient_temp , var.equal = TRUE, data = .) %>%
tidy()
## # A tibble: 1 × 10
## estimate estimate1 estimate2 statistic p.value
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 -6.25 31.0 37.2 -36.0 1.49e-31
## # … with 5 more variables: parameter <dbl>,
## # conf.low <dbl>, conf.high <dbl>, method <chr>,
## # alternative <chr>
# cold vs hot
%>%
sqrl ::filter(ambient_temp != "warm")%>%
dplyrt.test(body_temp ~ ambient_temp , var.equal = TRUE, data = .) %>%
tidy()
## # A tibble: 1 × 10
## estimate estimate1 estimate2 statistic p.value
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 -10 31.0 41.0 -65.9 8.77e-41
## # … with 5 more variables: parameter <dbl>,
## # conf.low <dbl>, conf.high <dbl>, method <chr>,
## # alternative <chr>
# cold vs warm
%>%
sqrl ::filter(ambient_temp != "cold")%>%
dplyrt.test(body_temp ~ ambient_temp , var.equal = TRUE, data = .) %>%
tidy()
## # A tibble: 1 × 10
## estimate estimate1 estimate2 statistic p.value
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 -3.75 37.2 41.0 -23.3 1.60e-24
## # … with 5 more variables: parameter <dbl>,
## # conf.low <dbl>, conf.high <dbl>, method <chr>,
## # alternative <chr>
20.1.2 More on multiple testing
This classic xkcd comic provides a great cartoon of the multiple testing problem, as well as the problem of science hype and communication. If we repeatedly test true null hypotheses we’re bound to reject some by chance.
This is a potential problem when looking at differences between groups, as the number of pairwise test rapidly expand with the number of groups. How rapidly, you ask? Let’s revisit the binomial equation, where \(k\) is 2 (because we make pairs of two) and we have \(n\) groups.
\[\begin{equation} \begin{split} \binom{n}{2} &= \frac{n!}{(n-k)! \times 2!} \\ &= \frac{n\times (n-1) \times (n-2)!}{(n-2)\times 2!} \\ &= \frac{n\times (n-1) }{2} \end{split} \tag{20.1} \end{equation}\]
So, with \(n\) groups we have \(n(n-1)/2\) pairwise comparisons, and a probability of rejecting at least one true null of \(1-(1-0.05)^{n(n-1)/2}\). So all pairwise tests with fifteen groups is 105 tests, and we’re basically assured at least one will be a false positive.
20.2 The ANOVA’s solution to the multiple testing problem
Instead of testing each combination of groups, the ANOVA poses a different null hypothesis. The ANOVA tests the null hypothesis that all samples come from the same statistical population.
ANOVA hypotheses
- \(H_0\): All samples come from the same (statistical) population. Practically, this says that all groups have the same mean.
- \(H_A\): Not all samples come from the same (statistical) population. Practically this says that not all groups have the same mean.
The ANOVA makes this null by some algebraic rearrangement of the sampling distribution. That is, the ANOVA tests the null hypothesis that all groups come from the same (statistical) population based on the expected variance of estimates from the sampling distribution!!
20.2.1 Mathemagic of the ANOVA
Recall that the standard error of the mean (\(\sigma_{\overline{Y}}\)) of a sample of size \(n\) from a normal sampling distribution is the standard deviation divided by the square root of the sample size: \(\sigma_{\overline{Y}} = \frac{\sigma}{\sqrt{n}}\).
With some algebra (squaring both sides and multiplying both side by n), we see that the sample size times the variance among groups \(n \times \sigma_{\overline{Y}}^2\) equals the variance within groups \(\sigma^2\) (Equation (20.2)).
\[\begin{equation} \begin{split} \sigma_{\overline{Y}} = \sigma / \sqrt{n} \\ \sigma_{\overline{Y}}^2 = \sigma^2 / n \\ n \times \sigma_{\overline{Y}}^2 = \sigma^2 \end{split} \tag{20.2} \end{equation}\]
So, if all groups come from the same population (i.e. under the null hypothesis), \(n \times \sigma_{\overline{Y}}^2\) will equal the variance within groups \(\sigma^2\).
If all groups do not come from the same population (i.e. under the alternative hypothesis), \(n \times \sigma_{\overline{Y}}^2\) will exceed the variance within groups \(\sigma^2\).
20.2.2 Parameters and estimates in an ANOVA
So we can turn these ideas into parameters to estimate.
Source | Parameter | Estimate | Notation |
---|---|---|---|
Among Groups | \(n \times \sigma_Y^2\) | Mean squares model | \(MS_{model}\) |
Within Groups | \(\sigma^2_Y\) | Mean squares error | \(MS_{error}\) |
Total | \(n \times \sigma_Y^2 + \sigma^2_Y\) | Mean squares total | \(MS_{total}\) |
So with this new math and algebra we can restate our null and alternative hypotheses for an ANOVA
\(H_0\): Mean squares model is not greater than mean squares error (i.e. among group variance equals zero).
\(H_A\): Mean squares model is greater than mean squares error (i.e. among group variance is not zero).
20.3 Concepts and calculations for an ANOVA
20.3.1 ANOVA partitions variance
Figure 20.2, really helps me understand what we are doing in ANOVA – we think about all variability in our data (a) as the sum of the variability among (b), and within groups. To do so, we first square the length of each line in Figure 20.2, and add them up. These are called the sums of squares.
We calculate them as follows
<- sqrl%>%
ss_calcs mutate(grand_mean = mean( body_temp)) %>%
group_by(ambient_temp) %>%
mutate(group_mean = mean( body_temp))%>%
ungroup()%>%
mutate(squared_total = (body_temp - grand_mean)^2,
squared_model = (group_mean - grand_mean)^2,
squared_error = (body_temp - group_mean)^2)
<- ss_calcs %>%
sums_of_squares summarise(n_groups = n_distinct(ambient_temp) ,
n = n(),
ss_total = sum(squared_total),
ss_model = sum(squared_model),
ss_error = sum(squared_error)) ; kable(sums_of_squares, digits = 2)
n_groups | n | ss_total | ss_model | ss_error |
---|---|---|---|---|
3 | 61 | 1037 | 1022 | 15.54 |
Or in math
- Sums of squares total: \(SS_{total} = \sum{(Y_i-\overline{\overline{Y})}}\), where \(Y_i\) is the value of the \(i^{th}\) individual’s response variable, and \(\overline{\overline{Y})}\). This is familiar as the numerator of the variance.
- Sums of squares model (aka sums of squares groups): \(SS_{model} = \sum{(\widehat{Y_i}-\overline{\overline{Y})}}\), where \(\widehat{Y_i}\) is the predicted value of the \(i^{th}\) individual’s response variable.
- Sums of squares error (aka sums of squares residual): \(SS_{model} = \sum{e_i^2}\), where \(e_i^2\) is the \(i^{th}\) individual’s residual deviation from its predicted value.
20.3.2 \(r^2 = \frac{SS_{model}}{SS_{total}}\) as the “proportion variance explained”
We now have a great summary of the effect size, as the proportion of variance explained by our model \(r^2 = \frac{SS_{model}}{SS_{total}}\). For the squirrel example \(r^2 = \frac{1021.7}{1037.2} =0.985\). Or in R
%>%
sums_of_squares mutate(r2 = ss_model / ss_total) %>% kable(digits = 4)
n_groups | n | ss_total | ss_model | ss_error | r2 |
---|---|---|---|---|---|
3 | 61 | 1037 | 1022 | 15.54 | 0.985 |
This is remarkably high – in this experiment ambient temperature explains 98.5% of the variation in squirrel body temperature.
⚠️WARNING⚠️ \(r^2\) is a nice summary of the effect size in a given study, but cannot be extrapolated to beyond that experiment. Importantly \(r^2\) depends on the relative size of each sample, and the factors they experience.
- It’s WRONG to say Ambient temperature explains 98.5% of variance in round-tailed ground squirrel body temperature.
- It’s RIGHT to say In a study of 61 round-tailed ground squirrels, (20 at 10°C (cold), 21 at 35°C (warm), and 20 at 45°C (hot)), Wooden and Walsberg (2004) found that ambient temperature explained 98.5% of variance in round-tailed ground squirrel body temperature.
20.3.3 F as ANOVA’s test statistic
So \(r^2\) is a (somewhat fragile) measure of effect size. But how do we know how often the null model would produce such an extreme result? We do so by comparing the test statistic, \(F\) – the mean squares model (\(MS_{model}\)) divided by the mean squares error (\(MS_{error}\)) – to its sampling distribution under the null.
\[F = \frac{MS_{model}}{MS_{error}}\]
We introduced \(MS_{model}\) as an estimate of the parameter \(n \times \sigma_Y^2\), and \(MS_{error}\) as an estimate of the parameter \(\sigma^2_Y\), in section 20.2.2 above. But how do we calculate them? Well, we take calculate \(SS_{model}\) and \(SS_{error}\) and divide by their degrees of freedom.
20.3.3.1 Calulating \(MS_{groups}\)
The degrees of freedom for the groups is the the number of groups minus one.
\[\begin{equation} \begin{split} df_{model} &= n_{groups} - 1\\ MS_{model} &= \frac{SS_{model}}{df_{model}} \end{split} \end{equation}\]
Recall that \(SS_{model}=\sum(\widehat{Y_i} - \overline{\overline{Y}})^2\).
For fun we can alternatively calculate it as \(SS_{model}=\sum(n_g \times (\widehat{Y_g} - \overline{\overline{Y}})^2)\). Where the subscript, \(_g\) refers to the \(g^{th}\) group, and the subscript, \(_i\) refers to the \(i^{th}\) individual.
20.3.3.2 Calulating \(MS_{error}\)
The degrees of freedom for the error is the total sample size minus the number of groups.
\[\begin{equation} \begin{split} df_{error} &= n -n_{groups}\\ MS_{error} &= \frac{SS_{error}}{df_{error}} \end{split} \end{equation}\]
Recall that \(SS_{error}=\sum(Y_i - \overline{\overline{Y}})^2\).
For fun we can alternatively calculate it as \(SS_{error}=\frac{\sum(s_g^2\times (n_g-1))}{n -n_{groups}}\)
20.3.4 Calculating F
Now finding \(F\) is just a matter of dividing \(MS_{model}\) by \(MS_error\).
<- sums_of_squares %>%
F_summary mutate(df_model = n_groups - 1,
df_error = n - n_groups,
ms_model = ss_model / df_model,
ms_error = ss_error / df_error,
F_val = ms_model / ms_error) %>%
::select(-n, -n_groups)
dplyr
%>% kable(digits = 4) F_summary
ss_total | ss_model | ss_error | df_model | df_error | ms_model | ms_error | F_val |
---|---|---|---|---|---|---|---|
1037 | 1022 | 15.54 | 2 | 58 | 510.8 | 0.268 | 1906 |
20.3.5 Testing the null hypothesis that \(F=1\)
We find a p-value by looking at the proportion of F’s sampling distribution that is as or more extreme than our calculated F value.
F take to different degrees of freedom, \(df_{model}\) and \(df_{error}\) in that order.
Recal that all ways to be extreme are on the right tail of the F distribution – so the alternative model is “F is greater than one”, and we need not multiply our p-value by two.
In R: pf(q = F_val, df1 = df_model, df2 = dr_error, lower.tail = FALSE)
pf(q = 1906, df1 = 2, df2 = 58, lower.tail = FALSE)
. The pf()
function works like the pnorm()
, pt()
and pchisq()
functions, as they all look for the probability that a random sample from the focal distribution will be as large as what we are asking about (if lower.tail = FALSE
)
%>%
F_summary mutate(p_value = pf(q = F_val, df1 = df_model, df2 = df_error, lower.tail = FALSE)) %>% mutate(p_value = paste( round(p_value * 10^53, digits = 3),"x10^-53", sep="" ))%>% kable(digits = 4)
ss_total | ss_model | ss_error | df_model | df_error | ms_model | ms_error | F_val | p_value |
---|---|---|---|---|---|---|---|---|
1037 | 1022 | 15.54 | 2 | 58 | 510.8 | 0.268 | 1906 | 1.24x10^-53 |
So we resoundingly reject the null model and conclude that ambient temperature is associated with elevated body temperature in the round-tailed ground squirrel. Because this was under experimental conditions, I think we can say that increasing ambient temperature directly or indirectly increases round-tailed ground squirrel body temperature.
20.4 An ANOVA in R and understanding its output
We use the aov
function to conduct an anova in R. This builds us an ANOVA table. We then examine signicance with the summary function.
<- aov( body_temp ~ ambient_temp, data = sqrl)
sqrl_anova sqrl_anova
## Call:
## aov(formula = body_temp ~ ambient_temp, data = sqrl)
##
## Terms:
## ambient_temp Residuals
## Sum of Squares 1021.7 15.5
## Deg. of Freedom 2 58
##
## Residual standard error: 0.5176
## Estimated effects may be unbalanced
summary(sqrl_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## ambient_temp 2 1022 511 1906 <2e-16 ***
## Residuals 58 16 0
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We now also can also interpret the R-squared value, as …
%>%
sums_of_squaresmutate(r2 = ss_model / ss_total)
## # A tibble: 1 × 6
## n_groups n ss_total ss_model ss_error r2
## <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 3 61 1037. 1022. 15.5 0.985
20.5 Assumptions of an ANOVA
Like the two sample t-test thw ANOVA assumes
- Independence
- Random sampling (no bias)
- Equal variance among groups (homoscedasticity) and
- Normality within each group – or more specifically normality of each group’s sampling distribution.
20.5.1 The ANOVA asumes equal variance among groups
This makes sense a we used a mean squares error in the two sanova – the very derivation if the ANOVA, assumed that all groups have equal variance.
- Homoscedasticity means that the variance of residuals is independent of the predicted value. In the case of our squirrels we only make three predictions – one per ambient temperature group – so this essentially means that we assume equal variance in each group.
%>%
sqrl group_by(ambient_temp) %>%
summarise(variance = var(body_temp)) %>% kable(digits = 4)
ambient_temp | variance |
---|---|
cold | 0.2762 |
warm | 0.3393 |
hot | 0.1846 |
Because all these variances are within a factor of five, we say that we’re not severely violating the equal variance assumption. Like the two-ample t-test, the ANOVA is robust to minor violations of equal variance within groups.
20.5.2 The ANOVA assumes normally disributed residuals
Note this means we assume the differences from group means are normally distributed, not that the raw data are.
As we saw for a one and two sample t-test in Chapters 18 and 19, respectively, the ANOVA (like all linear models) is quite robust to this assumption, as we are actually assuming a normal sampling distribution within groups, rather than a normal distribution wthin groups.
Let do this our self by making a qqplot and by plotting a histogram of the residuals.
<- sqrl %>%
qq_sqrl ggplot(aes(sample = body_temp)) +
geom_qq() +
geom_qq_line() +
facet_wrap(~ ambient_temp, ncol = 1, scales = "free_y")
<- sqrl %>%
hist_sqrl ggplot(aes(x = body_temp))+
geom_histogram(bins = 30, color = "white") +
facet_wrap(~ ambient_temp, ncol = 1, scales = "free_y")
+ hist_sqrl qq_sqrl
The data are obviously not normal (they look like maybe a uniform), as the small values are too big (above the qq line) and the big values too small (below the qq line) . But the sample size is pretty big, so I would feel comfortable assuming normality here.
20.6 Alternatives to an ANOVA
If you think data do not meet assumptions you can
- Consider this as a caveat when discussing your results, and think through how it could impact your inference.
- Find a more appropriate test.
- Write down your own likelihood function and conduct a likelihood ratio test.
- Permute (but beware)
20.6.1 Permuting as an alternative to the ANOVA based p-value
If we violate the assumption of normal residuals we can permute our data and calculate F, and compare the observed F to its permuted distribution.
The example below is silly because
- The central limit theorem should make our assumptions ok and
- It’s hard to imagine a slight deviation from normality resulting in a p-value of \(10^{-53}\)
But, I’m providing it to show you some fun tricks
<- 1000
n_reps
<- glance(sqrl_lm) %>%
obs_f pull(statistic)
<- sqrl %>%
perm_anova rep_sample_n(size = nrow(sqrl), replace = FALSE, reps = n_reps) %>% # copy the data
mutate(ambient_temp = sample(ambient_temp, replace = FALSE)) %>% # shuffle labels
nest(data = c(ambient_temp, body_temp)) %>% # prepping data for many linear models
mutate(fit = map(.x = data, ~ lm(.x$body_temp ~ .x$ambient_temp )), # running many linear models -- one for each permutatatin
results = map(fit, glance)) %>% # turn results into a tibble with glance()
unnest(results) %>% # make results easier to read
ungroup() %>%
::select(statistic) %>% # grab the F value from permuted data
dplyrmutate(as_or_more = statistic >= obs_f) %>%
summarise(p_val = mean(as_or_more))
20.7 Post-hoc tests and significance groups
So if we find that not all groups have the same mean, how do we know which means differ? The answer is in a “post-hoc test”. We do a post-hoc test if we reject the null.
I first focus on “unplanned comparisons”, in which we look at all pairs of groups, with no preconception of which may differ.
20.7.1 Unplanned comparisons
Post-hoc test gets over the multiple comparisons problem by conditioning on there being at least one difference and numerous pairwise tests, and altering its sampling distribution accordingly,
The Tukey-Kramer method, with test stat \(q = \frac{Y_i-Y_j}{SE}\), where \(SE = \sqrt{MS_{error}(\frac{1}{n_i}+ \frac{1}{n_j})}\) is one common approach to conduct unplanned post-hoc tests.
To do this in R, we
- Pipe the output of
aov
to the TukeyHSD() function
aov( body_temp ~ ambient_temp, data = sqrl) %>%
TukeyHSD() %>%
tidy() %>% mutate(adj.p.value = paste(round(adj.p.value * 10^11, digits = 3),"x10^-11",sep = ""))%>% kable(digits = 4)
term | contrast | null.value | estimate | conf.low | conf.high | adj.p.value |
---|---|---|---|---|---|---|
ambient_temp | warm-cold | 0 | 6.254 | 5.865 | 6.643 | 1.49x10^-11 |
ambient_temp | hot-cold | 0 | 10.000 | 9.606 | 10.394 | 1.49x10^-11 |
ambient_temp | hot-warm | 0 | 3.746 | 3.357 | 4.135 | 1.49x10^-11 |
We find that all pairwise comparisons differ. So we assign cold to significance group A, warm to significance group B and and Hot to group C.
We usually note significance groups on a plot
ggplot(sqrl, aes(x = ambient_temp, y = body_temp, color = ambient_temp))+
geom_jitter(height = 0, width = .2, size = 2, alpha = .35, show.legend = FALSE)+
stat_summary(fun.data = "mean_cl_normal", geom = "errorbar", width = .2, color = "black")+
scale_color_manual(values = c("blue", "orange", "red"))+
geom_text(data = tibble(ambient_temp = c("cold","warm","hot"),
body_temp = c(31,37,41),
sig_group = c("a","b","c")),
aes(label = sig_group), nudge_x = .35, fontface = "bold", show.legend = FALSE)
Note significance groupings can be more challenging. For example,
If there was a treatment that for a “very cold” ambient temperature that did not significantly differ from “cold” it would also be in significance group a.
If there was a treatment that for a “balmy” ambient temperature (between warm and hot) that did not significantly differ significantly from warm or hot, it would be in significance groups b and c.