Chapter 20 More than two samples from normal distributions

This text (roughly) follows Chapter 15 of Whitlock and Schluter (2020). The reading below is required, Whitlock and Schluter (2020) is not.

Motivating scenarios: We are interested to investigate the difference between the means of more than two samples.

Learning goals: By the end of this chapter you should be able to

  • Explain the multiple testing problem.
  • Explain how ANOVA attempts to solve the multiple testing problem.
  • Visually identify and mathematically interpret various “sums of squares”.
  • Calculate an F statistic and relate this to the sampling distribution.
  • Conduct appropriate “post-hoc” tests.

20.1 Setup

We often have more than two groups. For example - we might want to know, not just if a fertilizer is better than a control, but which of a number of fertilizers have different impacts on plant growth.

Or for the extended example in this chapter, if ambient temperature can influence body temperature in rodents, like the round-tailed ground squirrel, Spermophilus tereticaudus. In this experiment, Wooden and Walsberg (2004) put squirrels in environments that where 10°C (cold), 35°C (warm), and 45°C (hot), and measured their body temperature. The data are plotted in

#first lets load the data
sqrl <- read_csv("https://raw.githubusercontent.com/ybrandvain/biostat/master/data/sqrl.csv")
# now lets order data from coldest to warmest for a pretty plot
sqrl <- mutate(sqrl, ambient_temp = fct_relevel(ambient_temp, "cold", "warm", "hot"))

ggplot(sqrl, aes(x = ambient_temp, y = body_temp, color  = ambient_temp))+
  geom_jitter(height = 0, width = .2, size = 2, alpha = .35, show.legend = FALSE)+
  stat_summary(fun.data = "mean_cl_normal", geom = "errorbar", width = .2, color = "black")+
  scale_color_manual(values = c("blue", "orange", "red"))
  
  include_graphics("https://upload.wikimedia.org/wikipedia/commons/thumb/1/19/Spermophilus_tereticaudus_01.jpg/256px-Spermophilus_tereticaudus_01.jpg")

So, we want to test the null hypothesis that ambient temperature and body temperature are unrelated in this experimental study of the round-tailed ground squirrel.

20.1.1 Why not all paiwise t-tests?

We could try conducting three two-sample t-tests, as below. But this is wrong! as now our probability of at least one false negative is \(1-(1-0.05)^3 = .143\), and our p-values are not properly callibrated.

# cold vs warm
sqrl %>%
  dplyr::filter(ambient_temp != "hot")%>%
  t.test(body_temp ~ ambient_temp , var.equal  = TRUE, data = .) %>%
  tidy()
## # A tibble: 1 × 10
##   estimate estimate1 estimate2 statistic  p.value
##      <dbl>     <dbl>     <dbl>     <dbl>    <dbl>
## 1    -6.25      31.0      37.2     -36.0 1.49e-31
## # … with 5 more variables: parameter <dbl>,
## #   conf.low <dbl>, conf.high <dbl>, method <chr>,
## #   alternative <chr>
# cold vs hot
sqrl %>%
  dplyr::filter(ambient_temp != "warm")%>%
  t.test(body_temp ~ ambient_temp , var.equal  = TRUE, data = .) %>%
  tidy()
## # A tibble: 1 × 10
##   estimate estimate1 estimate2 statistic  p.value
##      <dbl>     <dbl>     <dbl>     <dbl>    <dbl>
## 1      -10      31.0      41.0     -65.9 8.77e-41
## # … with 5 more variables: parameter <dbl>,
## #   conf.low <dbl>, conf.high <dbl>, method <chr>,
## #   alternative <chr>
# cold vs warm
sqrl %>%
  dplyr::filter(ambient_temp != "cold")%>%
  t.test(body_temp ~ ambient_temp , var.equal  = TRUE, data = .) %>%
  tidy()
## # A tibble: 1 × 10
##   estimate estimate1 estimate2 statistic  p.value
##      <dbl>     <dbl>     <dbl>     <dbl>    <dbl>
## 1    -3.75      37.2      41.0     -23.3 1.60e-24
## # … with 5 more variables: parameter <dbl>,
## #   conf.low <dbl>, conf.high <dbl>, method <chr>,
## #   alternative <chr>

20.1.2 More on multiple testing

The multiple testing problem according to [xkcd](https://xkcd.com/882/). rollover test: '*So, uh, we did the green study again and got no link. It was probably a-- RESEARCH CONFLICTED ON GREEN JELLY BEAN/ACNE LINK; MORE STUDY RECOMMENDED!*'

FIGURE 20.1: The multiple testing problem according to xkcd. rollover test: ‘So, uh, we did the green study again and got no link. It was probably a– RESEARCH CONFLICTED ON GREEN JELLY BEAN/ACNE LINK; MORE STUDY RECOMMENDED!

This classic xkcd comic provides a great cartoon of the multiple testing problem, as well as the problem of science hype and communication. If we repeatedly test true null hypotheses we’re bound to reject some by chance.

This is a potential problem when looking at differences between groups, as the number of pairwise test rapidly expand with the number of groups. How rapidly, you ask? Let’s revisit the binomial equation, where \(k\) is 2 (because we make pairs of two) and we have \(n\) groups.

\[\begin{equation} \begin{split} \binom{n}{2} &= \frac{n!}{(n-k)! \times 2!} \\ &= \frac{n\times (n-1) \times (n-2)!}{(n-2)\times 2!} \\ &= \frac{n\times (n-1) }{2} \end{split} \tag{20.1} \end{equation}\]

So, with \(n\) groups we have \(n(n-1)/2\) pairwise comparisons, and a probability of rejecting at least one true null of \(1-(1-0.05)^{n(n-1)/2}\). So all pairwise tests with fifteen groups is 105 tests, and we’re basically assured at least one will be a false positive.

20.2 The ANOVA’s solution to the multiple testing problem

Instead of testing each combination of groups, the ANOVA poses a different null hypothesis. The ANOVA tests the null hypothesis that all samples come from the same statistical population.

ANOVA hypotheses

  • \(H_0\): All samples come from the same (statistical) population. Practically, this says that all groups have the same mean.
  • \(H_A\): Not all samples come from the same (statistical) population. Practically this says that not all groups have the same mean.

The ANOVA makes this null by some algebraic rearrangement of the sampling distribution. That is, the ANOVA tests the null hypothesis that all groups come from the same (statistical) population based on the expected variance of estimates from the sampling distribution!!

20.2.1 Mathemagic of the ANOVA

Recall that the standard error of the mean (\(\sigma_{\overline{Y}}\)) of a sample of size \(n\) from a normal sampling distribution is the standard deviation divided by the square root of the sample size: \(\sigma_{\overline{Y}} = \frac{\sigma}{\sqrt{n}}\).

With some algebra (squaring both sides and multiplying both side by n), we see that the sample size times the variance among groups \(n \times \sigma_{\overline{Y}}^2\) equals the variance within groups \(\sigma^2\) (Equation (20.2)).

\[\begin{equation} \begin{split} \sigma_{\overline{Y}} = \sigma / \sqrt{n} \\ \sigma_{\overline{Y}}^2 = \sigma^2 / n \\ n \times \sigma_{\overline{Y}}^2 = \sigma^2 \end{split} \tag{20.2} \end{equation}\]

  • So, if all groups come from the same population (i.e. under the null hypothesis), \(n \times \sigma_{\overline{Y}}^2\) will equal the variance within groups \(\sigma^2\).

  • If all groups do not come from the same population (i.e. under the alternative hypothesis), \(n \times \sigma_{\overline{Y}}^2\) will exceed the variance within groups \(\sigma^2\).

20.2.2 Parameters and estimates in an ANOVA

So we can turn these ideas into parameters to estimate.

Source Parameter Estimate Notation
Among Groups \(n \times \sigma_Y^2\) Mean squares model \(MS_{model}\)
Within Groups \(\sigma^2_Y\) Mean squares error \(MS_{error}\)
Total \(n \times \sigma_Y^2 + \sigma^2_Y\) Mean squares total \(MS_{total}\)

So with this new math and algebra we can restate our null and alternative hypotheses for an ANOVA

  • \(H_0\): Mean squares model is not greater than mean squares error (i.e. among group variance equals zero).

  • \(H_A\): Mean squares model is greater than mean squares error (i.e. among group variance is not zero).

NOTE: I call \(n \times \sigma_Y^2\) mean squares model. A more common name for this is mean squares group. I call this mean squares model to emphasize that this is the same calcualtion we will do for all linear models as we see in the next chapter.

20.3 Concepts and calculations for an ANOVA

20.3.1 ANOVA partitions variance

Partitioning deviations in an ANOVA. **A** Shows the difference between each observation, $Y_i$, and the grand mean, $\overline{\overline{Y}}$. This is the basis for calculating $MS_{total}$.  **B** Shows the difference between each predicted value $\widehat{Y_i}$ and the grand mean, $\overline{\overline{Y}}$. This is the basis for calculating $MS_{model}$. **C** Shows the difference between each observation, $Y_i$, and its predicted value  $\widehat{Y_i}$. This is the basis for calculating $MS_{error}$.

FIGURE 20.2: Partitioning deviations in an ANOVA. A Shows the difference between each observation, \(Y_i\), and the grand mean, \(\overline{\overline{Y}}\). This is the basis for calculating \(MS_{total}\). B Shows the difference between each predicted value \(\widehat{Y_i}\) and the grand mean, \(\overline{\overline{Y}}\). This is the basis for calculating \(MS_{model}\). C Shows the difference between each observation, \(Y_i\), and its predicted value \(\widehat{Y_i}\). This is the basis for calculating \(MS_{error}\).

Figure 20.2, really helps me understand what we are doing in ANOVA – we think about all variability in our data (a) as the sum of the variability among (b), and within groups. To do so, we first square the length of each line in Figure 20.2, and add them up. These are called the sums of squares.

We calculate them as follows

ss_calcs <- sqrl%>%
    mutate(grand_mean = mean( body_temp)) %>%
    group_by(ambient_temp) %>%
    mutate(group_mean = mean( body_temp))%>%
    ungroup()%>%
    mutate(squared_total = (body_temp  - grand_mean)^2,
           squared_model = (group_mean - grand_mean)^2,
           squared_error = (body_temp  - group_mean)^2)
sums_of_squares <- ss_calcs %>%
  summarise(n_groups = n_distinct(ambient_temp) ,
            n        = n(),
            ss_total = sum(squared_total),
            ss_model = sum(squared_model),
            ss_error = sum(squared_error))                                                                         ;                                                                                       kable(sums_of_squares, digits = 2)
n_groups n ss_total ss_model ss_error
3 61 1037 1022 15.54

Or in math

  • Sums of squares total: \(SS_{total} = \sum{(Y_i-\overline{\overline{Y})}}\), where \(Y_i\) is the value of the \(i^{th}\) individual’s response variable, and \(\overline{\overline{Y})}\). This is familiar as the numerator of the variance.
  • Sums of squares model (aka sums of squares groups): \(SS_{model} = \sum{(\widehat{Y_i}-\overline{\overline{Y})}}\), where \(\widehat{Y_i}\) is the predicted value of the \(i^{th}\) individual’s response variable.
  • Sums of squares error (aka sums of squares residual): \(SS_{model} = \sum{e_i^2}\), where \(e_i^2\) is the \(i^{th}\) individual’s residual deviation from its predicted value.

20.3.2 \(r^2 = \frac{SS_{model}}{SS_{total}}\) as the “proportion variance explained”

We now have a great summary of the effect size, as the proportion of variance explained by our model \(r^2 = \frac{SS_{model}}{SS_{total}}\). For the squirrel example \(r^2 = \frac{1021.7}{1037.2} =0.985\). Or in R

sums_of_squares %>%
  mutate(r2 = ss_model / ss_total)                                                                                                   %>% kable(digits = 4)
n_groups n ss_total ss_model ss_error r2
3 61 1037 1022 15.54 0.985

This is remarkably high – in this experiment ambient temperature explains 98.5% of the variation in squirrel body temperature.

⚠️WARNING⚠️ \(r^2\) is a nice summary of the effect size in a given study, but cannot be extrapolated to beyond that experiment. Importantly \(r^2\) depends on the relative size of each sample, and the factors they experience.

  • It’s WRONG to say Ambient temperature explains 98.5% of variance in round-tailed ground squirrel body temperature.
  • It’s RIGHT to say In a study of 61 round-tailed ground squirrels, (20 at 10°C (cold), 21 at 35°C (warm), and 20 at 45°C (hot)), Wooden and Walsberg (2004) found that ambient temperature explained 98.5% of variance in round-tailed ground squirrel body temperature.

20.3.3 F as ANOVA’s test statistic

So \(r^2\) is a (somewhat fragile) measure of effect size. But how do we know how often the null model would produce such an extreme result? We do so by comparing the test statistic, \(F\) – the mean squares model (\(MS_{model}\)) divided by the mean squares error (\(MS_{error}\)) – to its sampling distribution under the null.

\[F = \frac{MS_{model}}{MS_{error}}\]

We introduced \(MS_{model}\) as an estimate of the parameter \(n \times \sigma_Y^2\), and \(MS_{error}\) as an estimate of the parameter \(\sigma^2_Y\), in section 20.2.2 above. But how do we calculate them? Well, we take calculate \(SS_{model}\) and \(SS_{error}\) and divide by their degrees of freedom.

20.3.3.1 Calulating \(MS_{groups}\)

The degrees of freedom for the groups is the the number of groups minus one.

\[\begin{equation} \begin{split} df_{model} &= n_{groups} - 1\\ MS_{model} &= \frac{SS_{model}}{df_{model}} \end{split} \end{equation}\]

Recall that \(SS_{model}=\sum(\widehat{Y_i} - \overline{\overline{Y}})^2\).

For fun we can alternatively calculate it as \(SS_{model}=\sum(n_g \times (\widehat{Y_g} - \overline{\overline{Y}})^2)\). Where the subscript, \(_g\) refers to the \(g^{th}\) group, and the subscript, \(_i\) refers to the \(i^{th}\) individual.

20.3.3.2 Calulating \(MS_{error}\)

The degrees of freedom for the error is the total sample size minus the number of groups.

\[\begin{equation} \begin{split} df_{error} &= n -n_{groups}\\ MS_{error} &= \frac{SS_{error}}{df_{error}} \end{split} \end{equation}\]

Recall that \(SS_{error}=\sum(Y_i - \overline{\overline{Y}})^2\).

For fun we can alternatively calculate it as \(SS_{error}=\frac{\sum(s_g^2\times (n_g-1))}{n -n_{groups}}\)

NOTE: If we only have two groups and ANOVA and two sample t-test give the same answers. One insight into this is to see that MS_error equals the pooled variance if we have two groups.

20.3.4 Calculating F

Now finding \(F\) is just a matter of dividing \(MS_{model}\) by \(MS_error\).

F_summary <- sums_of_squares  %>%
  mutate(df_model = n_groups - 1,
         df_error = n - n_groups,
         ms_model = ss_model / df_model,
         ms_error = ss_error / df_error,
         F_val    = ms_model / ms_error) %>%
  dplyr::select(-n, -n_groups)

F_summary                                                                                                                          %>% kable(digits = 4)
ss_total ss_model ss_error df_model df_error ms_model ms_error F_val
1037 1022 15.54 2 58 510.8 0.268 1906

20.3.5 Testing the null hypothesis that \(F=1\)

We find a p-value by looking at the proportion of F’s sampling distribution that is as or more extreme than our calculated F value.

F take to different degrees of freedom, \(df_{model}\) and \(df_{error}\) in that order.

Recal that all ways to be extreme are on the right tail of the F distribution – so the alternative model is “F is greater than one”, and we need not multiply our p-value by two.

In R: pf(q = F_val, df1 = df_model, df2 = dr_error, lower.tail = FALSE)
pf(q = 1906, df1 = 2, df2 = 58, lower.tail = FALSE). The pf() function works like the pnorm(), pt() and pchisq() functions, as they all look for the probability that a random sample from the focal distribution will be as large as what we are asking about (if lower.tail = FALSE)

F_summary  %>%
  mutate(p_value = pf(q = F_val, df1 = df_model, df2 = df_error, lower.tail = FALSE))                                                                                                            %>% mutate(p_value = paste( round(p_value * 10^53, digits = 3),"x10^-53", sep="" ))%>% kable(digits = 4)
ss_total ss_model ss_error df_model df_error ms_model ms_error F_val p_value
1037 1022 15.54 2 58 510.8 0.268 1906 1.24x10^-53

So we resoundingly reject the null model and conclude that ambient temperature is associated with elevated body temperature in the round-tailed ground squirrel. Because this was under experimental conditions, I think we can say that increasing ambient temperature directly or indirectly increases round-tailed ground squirrel body temperature.

20.3.6 Exploring F’s sampling distribution and critical values.

Play around with the values for the degrees of freedom and see how the F distribution changes.

20.4 An ANOVA in R and understanding its output

We use the aov function to conduct an anova in R. This builds us an ANOVA table. We then examine signicance with the summary function.

sqrl_anova <- aov( body_temp ~ ambient_temp, data = sqrl)
sqrl_anova 
## Call:
##    aov(formula = body_temp ~ ambient_temp, data = sqrl)
## 
## Terms:
##                 ambient_temp Residuals
## Sum of Squares        1021.7      15.5
## Deg. of Freedom            2        58
## 
## Residual standard error: 0.5176
## Estimated effects may be unbalanced
summary(sqrl_anova)
##              Df Sum Sq Mean Sq F value Pr(>F)    
## ambient_temp  2   1022     511    1906 <2e-16 ***
## Residuals    58     16       0                   
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We now also can also interpret the R-squared value, as …

sums_of_squares%>% 
  mutate(r2 = ss_model / ss_total)
## # A tibble: 1 × 6
##   n_groups     n ss_total ss_model ss_error    r2
##      <int> <int>    <dbl>    <dbl>    <dbl> <dbl>
## 1        3    61    1037.    1022.     15.5 0.985

20.5 Assumptions of an ANOVA

Like the two sample t-test thw ANOVA assumes

  • Independence
  • Random sampling (no bias)
  • Equal variance among groups (homoscedasticity) and
  • Normality within each group – or more specifically normality of each group’s sampling distribution.

20.5.1 The ANOVA asumes equal variance among groups

This makes sense a we used a mean squares error in the two sanova – the very derivation if the ANOVA, assumed that all groups have equal variance.

  • Homoscedasticity means that the variance of residuals is independent of the predicted value. In the case of our squirrels we only make three predictions – one per ambient temperature group – so this essentially means that we assume equal variance in each group.
sqrl %>%
  group_by(ambient_temp) %>%
  summarise(variance = var(body_temp))                                                                                                                                                          %>% kable(digits = 4)
ambient_temp variance
cold 0.2762
warm 0.3393
hot 0.1846

Because all these variances are within a factor of five, we say that we’re not severely violating the equal variance assumption. Like the two-ample t-test, the ANOVA is robust to minor violations of equal variance within groups.

20.5.2 The ANOVA assumes normally disributed residuals

Note this means we assume the differences from group means are normally distributed, not that the raw data are.

As we saw for a one and two sample t-test in Chapters 18 and 19, respectively, the ANOVA (like all linear models) is quite robust to this assumption, as we are actually assuming a normal sampling distribution within groups, rather than a normal distribution wthin groups.

Let do this our self by making a qqplot and by plotting a histogram of the residuals.

qq_sqrl <- sqrl %>%
  ggplot(aes(sample = body_temp)) +
  geom_qq()                       +
  geom_qq_line()                  +
  facet_wrap(~ ambient_temp, ncol = 1, scales = "free_y")


hist_sqrl <- sqrl %>%
  ggplot(aes(x = body_temp))+
  geom_histogram(bins = 30, color = "white")  +
  facet_wrap(~ ambient_temp, ncol = 1, scales = "free_y")

qq_sqrl + hist_sqrl

The data are obviously not normal (they look like maybe a uniform), as the small values are too big (above the qq line) and the big values too small (below the qq line) . But the sample size is pretty big, so I would feel comfortable assuming normality here.

20.6 Alternatives to an ANOVA

If you think data do not meet assumptions you can

  • Consider this as a caveat when discussing your results, and think through how it could impact your inference.
  • Find a more appropriate test.
  • Write down your own likelihood function and conduct a likelihood ratio test.
  • Permute (but beware)

20.6.1 Permuting as an alternative to the ANOVA based p-value

If we violate the assumption of normal residuals we can permute our data and calculate F, and compare the observed F to its permuted distribution.

The example below is silly because

  1. The central limit theorem should make our assumptions ok and
  2. It’s hard to imagine a slight deviation from normality resulting in a p-value of \(10^{-53}\)

But, I’m providing it to show you some fun tricks

n_reps <- 1000

obs_f  <- glance(sqrl_lm) %>%
  pull(statistic)

perm_anova <- sqrl %>%
  rep_sample_n(size = nrow(sqrl), replace = FALSE, reps = n_reps)     %>% # copy the data
  mutate(ambient_temp = sample(ambient_temp, replace = FALSE))        %>% # shuffle labels
  nest(data = c(ambient_temp, body_temp))                             %>% # prepping data for many linear models
  mutate(fit = map(.x = data, ~ lm(.x$body_temp ~ .x$ambient_temp )),     # running many linear models -- one for each permutatatin
         results = map(fit, glance))                                  %>% # turn results into a tibble with glance()
  unnest(results)                                                     %>% #  make results easier to read
  ungroup()                                                           %>%
  dplyr::select(statistic)                                            %>% # grab the F value from permuted data           
  mutate(as_or_more = statistic >= obs_f)                             %>%
  summarise(p_val = mean(as_or_more))
NOTE: Permuting DOES NOT solve the issue of unequal variance in groups. Why is that? it is because prmuting will huffle data between groups, resulting in an equal variance, so it doesnt solve this problem.

20.7 Post-hoc tests and significance groups

So if we find that not all groups have the same mean, how do we know which means differ? The answer is in a “post-hoc test”. We do a post-hoc test if we reject the null.

I first focus on “unplanned comparisons”, in which we look at all pairs of groups, with no preconception of which may differ.

20.7.1 Unplanned comparisons

Post-hoc test gets over the multiple comparisons problem by conditioning on there being at least one difference and numerous pairwise tests, and altering its sampling distribution accordingly,

The Tukey-Kramer method, with test stat \(q = \frac{Y_i-Y_j}{SE}\), where \(SE = \sqrt{MS_{error}(\frac{1}{n_i}+ \frac{1}{n_j})}\) is one common approach to conduct unplanned post-hoc tests.

To do this in R, we

  • Pipe the output of aov to the TukeyHSD() function
aov( body_temp ~ ambient_temp, data = sqrl) %>%
  TukeyHSD()                                %>%
  tidy()                                                                                                                                           %>% mutate(adj.p.value = paste(round(adj.p.value * 10^11, digits = 3),"x10^-11",sep = ""))%>% kable(digits = 4)
term contrast null.value estimate conf.low conf.high adj.p.value
ambient_temp warm-cold 0 6.254 5.865 6.643 1.49x10^-11
ambient_temp hot-cold 0 10.000 9.606 10.394 1.49x10^-11
ambient_temp hot-warm 0 3.746 3.357 4.135 1.49x10^-11

We find that all pairwise comparisons differ. So we assign cold to significance group A, warm to significance group B and and Hot to group C.

We usually note significance groups on a plot

ggplot(sqrl, aes(x = ambient_temp, y = body_temp, color  = ambient_temp))+
  geom_jitter(height = 0, width = .2, size = 2, alpha = .35, show.legend = FALSE)+
  stat_summary(fun.data = "mean_cl_normal", geom = "errorbar", width = .2, color = "black")+
  scale_color_manual(values = c("blue", "orange", "red"))+
  geom_text(data = tibble(ambient_temp = c("cold","warm","hot"), 
                          body_temp     = c(31,37,41),
                          sig_group    = c("a","b","c")),
            aes(label = sig_group), nudge_x = .35, fontface = "bold", show.legend = FALSE)

Note significance groupings can be more challenging. For example,

  • If there was a treatment that for a “very cold” ambient temperature that did not significantly differ from “cold” it would also be in significance group a.

  • If there was a treatment that for a “balmy” ambient temperature (between warm and hot) that did not significantly differ significantly from warm or hot, it would be in significance groups b and c. 

20.8 Quiz

References

Whitlock, Michael C, and Dolph Schluter. 2020. The Analysis of Biological Data. Third Edition.
Wooden, K. Mark, and Glenn E. Walsberg. 2004. “Body Temperature and Locomotor Capacity in a Heterothermic Rodent.” Journal of Experimental Biology 207 (1): 41–46. https://doi.org/10.1242/jeb.00717.