Chapter 19 Two samples from normal distributions

This text (roughly) follows Chapter 12 of our textbook. The reading below is required; Whitlock and Schluter (2020) is not.

Motivating scenarios: We are interested in investigating the difference between the means of two samples.

Learning goals: By the end of this chapter you should be able to

  • Know when and how to use a two-sample t-test.
  • Recognize the difference between a paired and a two-sample t-test.
  • Work through the math underlying a two-sample t-test.
  • State the assumptions of a two-sample t-test.
  • Know alternatives to a two-sample t-test when data break assumptions.

19.0.0.1 Example: Lizard survival

Could long spikes protect horned lizards from being killed by the loggerhead shrike? The loggerhead shrike is a small predatory bird that skewers its victims on thorns or barbed wire. Young, Brodie, and Brodie (2004) compared horn lengths of 30 horned lizards that had been killed by shrikes to those of 154 live horned lizards. The data are plotted below.

lizards <- read_csv("https://whitlockschluter3e.zoology.ubc.ca/Data/chapter12/chap12e3HornedLizards.csv")  %>% na.omit()

Living lizard summary stats: From our sample of 154 living lizards, the mean horn length was 24.28 mm, with a sample variance of 6.92.

Killed lizard summary stats: From our sample of 30 killed lizards, the mean horn length was 21.99 mm, with a sample variance of 7.34.

Summarizing the difference between killed and living lizards: So, surviving lizards have horns that are, on average, 2.29 mm longer than those of killed lizards.
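That difference is just the gap between the two sample means:

\[\overline{x}_\text{living} - \overline{x}_\text{killed} = 24.28 - 21.99 = 2.29\]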

19.1 Calculations for a two-sample t-test

So, we can calculate group means and differences between these means, as well as variances within each group. But how do we turn these into estimates of uncertainty in the between-group difference (i.e. the standard error and 95% confidence interval), and how can we test the null hypothesis that the groups represent samples from the same population? To start to answer these questions we need to describe the within-group (aka pooled) variance in horn length.

19.1.1 The pooled variance

How do we think about this variance? I like to think of it as “how far, on average, we expect the squared difference between one random individual from each group to deviate from the average difference between groups.” A quantity that nearly captures this is the pooled variance, \(s_p^2\), which weights the variance in each group by that group’s degrees of freedom, \(df_1\) and \(df_2\), and divides by the total degrees of freedom, \(df_t = df_1+df_2\).

\[\begin{equation} \begin{split} s_p^2 &= \frac{df_1\times s_1^2+df_2\times s_2^2}{df_1+df_2} \end{split} \tag{19.1} \end{equation}\]

Calculating the pooled variance in the lizard example

Recall that we had 154 living lizards, with a sample variance in horn length of 6.92 and 30 killed lizards, with a sample variance in horn length of 7.34. So in this case, the pooled variance is
\[\begin{equation} \begin{split} s_p^2 &= \frac{df_1\times s_1^2+df_2\times s_2^2}{df_1+df_2}\\ &= \frac{df_\text{living}\times s_\text{living}^2+df_\text{killed}\times s_\text{killed}^2}{df_\text{living}+df_\text{killed}}\\ &= \frac{(154-1)\times 6.92 + (30-1)\times 7.34}{154 + 30 -2} \\ &= 6.99 \end{split} \tag{19.2} \end{equation}\]

We can do this math in R

# First we get the summaries  
lizard_summaries <- lizards %>%
  group_by(Survival) %>%
  summarise(n         = n(),                       # sample sizes for each group
            df        = n - 1,                     # degrees of freedom for each group
            mean_horn = mean(squamosalHornLength), # means for each group
            var_horn  = var(squamosalHornLength))  # variances for each group


lizard_summaries # let's check this out!
## # A tibble: 2 × 5
##   Survival     n    df mean_horn var_horn
##   <chr>    <int> <dbl>     <dbl>    <dbl>
## 1 killed      30    29      22.0     7.34
## 2 living     154   153      24.3     6.92
# Then we calculate the pooled variance
lizard_summaries %>% 
  summarise(pooled_variance = sum(df * var_horn) / sum(df) ) # calculate the pooled variance
## # A tibble: 1 × 1
##   pooled_variance
##             <dbl>
## 1            6.99

Effect size of the difference between killed and living lizards: Above we found that surviving lizards have horns that are 2.29 mm longer, on average, than those of killed lizards. But recall that our estimate of effect size, Cohen’s d, standardizes this difference by the standard deviation to give it more context. The math below shows that the average surviving lizard has a horn about 0.87 standard deviations longer than the average killed lizard. That’s pretty substantial.

\[\text{Cohen's d} = \frac{\text{distance from null}}{s_p} = \frac{24.28 - 21.99}{\sqrt{6.99}} = \frac{2.29}{2.64} = 0.87\]
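We can mirror this calculation in R, reusing the lizard_summaries tibble from above (a small sketch, not part of the original analysis):

lizard_summaries %>%
  summarise(est_diff        = abs(diff(mean_horn)),             # 2.29
            pooled_variance = sum(df * var_horn) / sum(df),     # 6.99
            cohens_d        = est_diff / sqrt(pooled_variance)) # approx. 0.87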

19.1.2 The standard error

So Cohen’s d showed a very large difference in horn lengths. But how does this difference compare to what we expect by sampling error?

We could, of course, get a sense of this sampling error by resampling from each group with replacement, calculating the mean difference for each “bootstrapped” replicate to approximate a sampling distribution, and using this to calculate a standard error and confidence interval.

lizard_boot <- replicate(n = 10000, simplify = FALSE,
  lizards %>%
  group_by( Survival)                      %>%
  slice_sample(prop = 1, replace = TRUE)   %>%
  dplyr::summarise(mean_horn = mean(squamosalHornLength)) %>%
  dplyr::summarise(mean_horn_diff = diff(mean_horn))) %>%
  bind_rows()

lizard_boot %>%
  summarise(se_mean_diff = sd(mean_horn_diff),
            lower_CI_horn_diff = quantile(mean_horn_diff, prob = 0.025),
            upper_CI_horn_diff = quantile(mean_horn_diff, prob = 0.975))

Alternatively, we can use math tricks to estimate the standard error. Specifically, the standard error for the difference in means is \(SE_{\overline{x_1}-\overline{x_2}} = \sqrt{s_p^2 \Big(\frac{1}{n_1} + \frac{1}{n_2}\Big)}\), where \(s_p^2\) is the pooled variance from Equation (19.1). In this case, \(SE_{\overline{x_1}-\overline{x_2}} = \sqrt{6.99 \Big(\frac{1}{154} + \frac{1}{30}\Big)} = 0.53\).

As in Chapter 18, we can use Equation (18.1) – \((1-\alpha)\%\text{ CI } = \overline{x} \pm t_{\alpha/2} \times SE_x\) – to find the upper and lower confidence limits.
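Plugging in our estimates (with a critical value of \(t_{\alpha/2, df = 182} \approx 1.97\), calculated below with qt()):

\[95\%\text{ CI} = 2.29 \pm 1.97 \times 0.528 = (1.25, 3.33)\]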

Again, we can do these calculations in R. We see below that these mathematical answers are really close to the bootstrap-based estimates of uncertainty.

alpha <- 0.05
lizard_summaries %>% 
  summarise(est_diff        = abs(diff(mean_horn)),
            pooled_variance = sum(df * var_horn) / sum(df),
            se              = sqrt(sum(pooled_variance/n)),
            crit_t_95       = qt(p = alpha/2, df = sum(df), lower.tail = FALSE),
            lower_95CI       = est_diff  - se * crit_t_95,
            upper_95CI       = est_diff  + se * crit_t_95) 
## # A tibble: 1 × 6
##   est_diff pooled_variance    se crit_t_95 lower_95CI
##      <dbl>           <dbl> <dbl>     <dbl>      <dbl>
## 1     2.29            6.99 0.528      1.97       1.25
## # … with 1 more variable: upper_95CI <dbl>

19.1.3 Quantifying surprise

How surprised would we be to see such an extreme difference if we expected no difference at all? The p-value quantifies this surprise.

We can, of course, estimate this by permutation.

# To do so, we first find the absolute value of the difference in group means

abs_obs_diff <- lizard_summaries %>% 
  dplyr::summarise(diff(mean_horn)) %>% 
  pull() %>% 
  abs()

lizard_perm <- replicate(50000, simplify = FALSE, 
  lizards %>%
  mutate(perm_survival = sample(Survival, replace = FALSE))%>% # shuffle
  group_by(perm_survival) %>%
  dplyr::summarise(mean_horn = mean(squamosalHornLength)) %>%  # calculate permuted group means
  dplyr::summarise(mean_horn_diff = diff(mean_horn))) %>%      # calculate difference in permuted group means
  bind_rows()

lizard_perm %>%
    mutate(as_or_more_extreme = abs(mean_horn_diff) >= abs_obs_diff) %>%
    dplyr::summarise(p_value =  mean(as_or_more_extreme))

We find a very small p-value and reject the null hypothesis.

Alternatively, we could

  • Calculate a t-value (as we did in Chapter 18).
  • Compare the observed t to t’s sampling distribution with 182 degrees of freedom.
  • Find the area that exceeds our observed \(t\) on either tail of the distribution.

So, we first find our t-value as \(\frac{est - \mu_0}{se} = \frac{2.294-0}{0.5274} = 4.35\). We then compare it to its null sampling distribution. Figure 19.1 makes it clear that the p-value is very small – we need to zoom in on the null sampling distribution substantially to see anything as or more extreme.


FIGURE 19.1: The null sampling distribution for t with 182 degrees of freedom. From a it is clear that our t is exceptional. Zooming in (b) we see very little of the sampling distribution is as or more extreme than our test stat of 4.35 (blue).

The exact p-value is 2 * pt(4.35, df = 182, lower.tail = FALSE) = \(2.3 \times 10^{-5}\). We can calculate this in R:

lizard_summaries %>% 
  summarise(est_diff        = abs(diff(mean_horn)),
            pooled_variance = sum(df * var_horn) / sum(df),
            se              = sqrt(sum(pooled_variance/n)),
            t               = est_diff / se,
            p               = 2 * pt(t, df = sum(df), lower.tail = FALSE))
## # A tibble: 1 × 5
##   est_diff pooled_variance    se     t         p
##      <dbl>           <dbl> <dbl> <dbl>     <dbl>
## 1     2.29            6.99 0.528  4.35 0.0000227

19.2 A two-sample t-test in R

There are a few ways to conduct a two-sample t-test in R:

  • By hand (as above).
  • With the t.test() function.
  • With the lm() function (as we will see in the future).

19.2.1 With the t.test() function

  • First, we could use the t.test() function.
# Either of these two approaches works. Note that data = . tells R to use the data that is piped in
# lizards %>%
#    t.test(squamosalHornLength ~ Survival, data = ., var.equal = TRUE)

t.test(squamosalHornLength ~ Survival, data = lizards, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  squamosalHornLength by Survival
## t = -4.3, df = 182, p-value = 2e-05
## alternative hypothesis: true difference in means between group killed and group living is not equal to 0
## 95 percent confidence interval:
##  -3.335 -1.254
## sample estimates:
## mean in group killed mean in group living 
##                21.99                24.28

Piping this output to tidy() makes it more compact:

library(broom)    
t.test(squamosalHornLength ~ Survival, data = lizards, var.equal = TRUE) %>%
  tidy()
## # A tibble: 1 × 10
##   estimate estimate1 estimate2 statistic   p.value
##      <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
## 1    -2.29      22.0      24.3     -4.35 0.0000227
## # … with 5 more variables: parameter <dbl>,
## #   conf.low <dbl>, conf.high <dbl>, method <chr>,
## #   alternative <chr>

Reassuringly, R’s output matches our calculations (but it swaps groups 1 and 2, so watch out for that).

19.3 Assumptions

Before we worry about assumptions, let’s take a minute to think about assumptions in statistics. What is an assumption? What happens when we break one?

An assumption is basically a statement we make for mathematical convenience. We do this to make stats tractable (otherwise we’d either be doing different math for each data set or bootstrapping and permuting only).

  • If data meet assumptions, we can be confident that our statistics will work as intended.
  • If data break assumptions, the stats could still work. Or they could not. There are no guarantees, and no refunds.

But often breaking assumptions does not break a test. In fact, many statistical tests are remarkably robust to violations of their assumptions (but some are not).

How can we know if breaking an assumption will break our stats or if it’s safe?

  • The easiest way is to ask someone who knows.
  • The best way is to simulate.

We will simulate soon to show you what I mean, but for now, trust me, I know.

19.3.1 Assumptions of the two-sample t-test

A two-sample t-test assumes:

  • Equal variance in each group,
  • Independence,
  • Normality (of residual values, or more specifically, of their sampling distribution), and
  • Data are collected without bias.

We can ignore the linearity assumption for a two-sample t-test.
While we can never ignore bias or non-independence, we usually need more details of the study design to get at them, so let’s focus on homoscedasticity and normality.

19.3.2 The two-sample t-test assumes equal variance among groups

This makes sense, as we used a “pooled variance” in the two-sample t-test – that is, in the math we assume that both groups have the same variance, equal to the pooled variance.

  • Homoscedasticity means that the variance of residuals is independent of the predicted value. In this case we only make two predictions – one per group – so this essentially means that we assume equal variance in each group. Let’s evaluate this assumption by comparing the variance in each group (we already calculated these in lizard_summaries).
lizard_summaries
## # A tibble: 2 × 5
##   Survival     n    df mean_horn var_horn
##   <chr>    <int> <dbl>     <dbl>    <dbl>
## 1 killed      30    29      22.0     7.34
## 2 living     154   153      24.3     6.92

How similar must variances be to make a two-sample t-test ok? We’ll explore this soon, when we’ll see that the t-test is robust to violations of this assumption. In fact, as a rule of thumb, if variance in one group is less than five times the variance in the other group, we’re usually ok.
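For the lizards this check is a one-liner (a quick sketch using lizard_summaries):

lizard_summaries %>%
  summarise(var_ratio = max(var_horn) / min(var_horn)) # 7.34 / 6.92 is about 1.06, well under five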

19.3.3 The two-sample t-test assumes normally distributed residuals

Note this means we assume the differences from group means are normally distributed, not the raw data.
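One way to look at these residuals directly is to subtract each group’s mean and plot the result (a sketch; the binwidth is an arbitrary choice):

lizards %>%
  group_by(Survival) %>%
  mutate(resid_horn = squamosalHornLength - mean(squamosalHornLength)) %>% # residuals within each group
  ggplot(aes(x = resid_horn)) +
  geom_histogram(binwidth = 1) +
  facet_wrap(~Survival)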

As we saw for a one sample t-test in Chapter 18, the two sample t-test is quite robust to this assumption, as we are actually assuming a normal sampling distribution within each group, rather than a normal distribution of values regardless of group.

Let’s assess this by making a qq plot for each group:

lizards %>%
    ggplot(aes(sample = squamosalHornLength))+
    geom_qq()+
    geom_qq_line()+
    facet_wrap(~Survival)+
    labs(title = "qq plots of lizard data by survival group")

The data are obviously not normal, as the small values are too small (below the qq line). But the sample size is pretty big, so I would feel comfortable assuming normality here.

19.4 Alternatives to a two-sample t-test

So, what should we do if we break assumptions?

  1. We could ignore the violation, especially if it is minor and we believe the test is robust.
  2. We could transform the data, as described in Chapter 17.
  3. Or we could find a better test.

Let’s start at the bottom of the list – what are alternatives to the two-sample t-test?

19.4.1 Permutation / bootstrapping

We saw above that we can bootstrap (i.e. resample with replacement) to estimate uncertainty in the estimate of the mean difference, and we can permute (i.e. shuffle labels) to generate a null sampling distribution for hypothesis testing.

These approaches make nearly no assumptions and almost always work, but they generally require modest sample sizes – at least fifteen or so in each group – such that resampling with replacement generates enough combinations to reasonably approximate the sampling distribution of the population.

19.4.2 Welch’s t-test when variances differ by group

If the variances differ by group, we could do a Welch’s t-test, which does not assume equal variance. Welch’s test uses different equations to calculate the t value and degrees of freedom (Section 19.6.1), which you need never know.

Do know

  1. That you can use Welch’s t-test when variance differs by group.
  2. You can easily conduct a Welch’s t-test in R with the t.test() function by setting var.equal to FALSE. In fact, this is the default behavior of this function in R.
lizards %>%
  t.test(squamosalHornLength ~ Survival, data = ., var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  squamosalHornLength by Survival
## t = -4.3, df = 40, p-value = 1e-04
## alternative hypothesis: true difference in means between group killed and group living is not equal to 0
## 95 percent confidence interval:
##  -3.382 -1.207
## sample estimates:
## mean in group killed mean in group living 
##                21.99                24.28

All interpretations of results from Welch’s t-test are the same as interpretations of results from a two-sample t-test:

  • The t value is still the distance, in standard errors, between the observed difference in sample means and the null value (0).
  • The p-value is still the proportion of draws from the null that will be as or more extreme than our test statistic.

The only difference is that the sampling distribution now uses the variance of each group separately, rather than combining them into their weighted average, the “pooled variance”.

19.4.3 Mann-Whitney U and Wilcoxon rank sum tests

If residuals are not normal, we would usually do a permutation test nowadays. Historically, however, people used rank-based, “non-parametric” tests when data weren’t normal. Here we

  • Convert all data from their actual values to their “ranks” (e.g. first, second, third, etc.) in the data set.
  • Turn the distribution of ranks into a test statistic.
  • Find a sampling distribution for ranks under a null model.
  • Calculate a p-value, and reject or fail to reject the null hypothesis.

Here’s an example in R

lizards %>%
  wilcox.test(squamosalHornLength ~ Survival, data = .)
## 
##  Wilcoxon rank sum test with continuity
##  correction
## 
## data:  squamosalHornLength by Survival
## W = 1182, p-value = 2e-05
## alternative hypothesis: true location shift is not equal to 0

These tests are falling out of fashion: they don’t provide any estimate of uncertainty, they make their own assumptions about the distribution of the data, and permutation and bootstrapping can now be done on our home computers. So I keep this here just to expose you to it. Know it’s a thing you might hear about sometime, and know that we’re not worrying about it here.

19.5 Quiz

19.6 Optional

19.6.1 Boring math

You don’t need to know this, but the formula for Welch’s t-test is

\[t = \frac{\overline{X}_1-\overline{X}_2}{\sqrt{\dfrac{s_1^2}{N_1} + \dfrac{s_2^2}{N_2}}}\]

With degrees of freedom equal to

\[df \approx \frac{\left(\dfrac{s_1^2}{N_1} + \dfrac{s_2^2}{N_2}\right)^2}{\dfrac{s_1^4}{N_1^2 \times df_1} + \dfrac{s_2^4}{N_2^2 \times df_2}}\]
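As a sanity check, we can plug the lizard summaries into these formulas by hand (a small sketch; it should recover the t and df that t.test() reported above):

lizard_summaries %>%
  summarise(se_welch = sqrt(sum(var_horn / n)),                   # Welch standard error
            t_welch  = abs(diff(mean_horn)) / se_welch,           # about 4.3
            df_welch = se_welch^4 / sum(var_horn^2 / (n^2 * df))) # about 40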

19.7 Other things to do when data do not meet assumptions

19.7.1 Fit your own likelihood model

If you can write down a likelihood function (as in Section 18.8.3), you can customize your assumptions to fit your data.
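For example, here is a minimal sketch (mine, not Chapter 18’s code) that fits a separate-means and a shared-mean normal model to the lizard data with optim(), then compares them with a likelihood ratio test:

# negative log-likelihood when each group gets its own mean (shared sd)
nll_two_means <- function(par, x1, x2){
  -sum(dnorm(x1, mean = par[1], sd = par[3], log = TRUE)) -
   sum(dnorm(x2, mean = par[2], sd = par[3], log = TRUE))
}
# negative log-likelihood when both groups share one mean
nll_one_mean <- function(par, x1, x2){
  nll_two_means(c(par[1], par[1], par[2]), x1, x2)
}

living <- lizards %>% filter(Survival == "living") %>% pull(squamosalHornLength)
killed <- lizards %>% filter(Survival == "killed") %>% pull(squamosalHornLength)

fit_two <- optim(par = c(mean(living), mean(killed), sd(c(living, killed))),
                 fn = nll_two_means, x1 = living, x2 = killed)
fit_one <- optim(par = c(mean(c(living, killed)), sd(c(living, killed))),
                 fn = nll_one_mean, x1 = living, x2 = killed)

# likelihood ratio test: the two-mean model has one extra free parameter
D <- 2 * (fit_one$value - fit_two$value)
pchisq(D, df = 1, lower.tail = FALSE)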

19.8 Simulations to evaluate test performance

We often want to know what to expect from a statistical test. For example, we might want to

  • Evaluate how robust a test is to violation of its assumptions, or
  • Evaluate the power a test has to reject a false null

We can do both of these things by simulation!!! Here I focus on the robustness of the two-sample t-test to violations of the assumption of equal variance.

First we set up the parameters – I am going to do 10,000 replicates per variance pair, with the variance in group 1 always equal to one, and the variance in group 2 equal to 1, 2, 5, 10, 25, 50, or 100.

n_experiments_per_pair  <- 10000 
vars          <- c(1,1,1,2,1,5,1,10,1,25,1,50,1,100)
var_grp_2     <- vars[1:length(vars) %% 2 == 0]   # a trick to grab the even-indexed values

I then make data for samples of size ten and fifty. The code here is a bit awkward, but basically for each experiment I

  • Assign n individuals to group 1 or 2,
  • Match it up so group 1 has a variance of one, and group 2 has the variance it should (1, 2, 5, 10, 25, 50, or 100),
  • Simulate data for each individual from a normal distribution with mean equal to zero and standard deviation equal to the square root of the specified variance.
# sample size of 10
n <- 10
sim_n10 <- tibble(this_var        = rep(rep(vars, each = n), times = n_experiments_per_pair),
                  this_group      = rep(rep(rep(1:2, each = n), times = length(var_grp_2)), times = n_experiments_per_pair),
                  this_experiment = rep(1:(n_experiments_per_pair * length(var_grp_2)), each = 2 * n),
                  var_g2          = rep(rep(var_grp_2, each = 2 * n), times = n_experiments_per_pair),
                  n               = n) %>%
  mutate(sim_val = rnorm(mean = 0, sd = sqrt(this_var), n = n()))

# sample size of 50
n <- 50
sim_n50 <- tibble(this_var        = rep(rep(vars, each = n), times = n_experiments_per_pair),
                  this_group      = rep(rep(rep(1:2, each = n), times = length(var_grp_2)), times = n_experiments_per_pair),
                  this_experiment = rep(1:(n_experiments_per_pair * length(var_grp_2)), each = 2 * n),
                  var_g2          = rep(rep(var_grp_2, each = 2 * n), times = n_experiments_per_pair),
                  n               = n) %>%
  mutate(sim_val = rnorm(mean = 0, sd = sqrt(this_var), n = n()))

We then find the p-value for all of these data, simulated under the null but breaking the equal-variance assumption.

sim_tests <- bind_rows(sim_n10, sim_n50) %>%
  group_by(this_experiment, n) %>%
  nest(data = c(this_group, this_var, sim_val)) %>%
  mutate(fit     = map(.x = data, ~ t.test(.x$sim_val ~ .x$this_group, var.equal = TRUE)),
         results = map(fit, tidy)) %>%
  unnest(results) %>%
  dplyr::select(this_experiment, var_g2, n, p.value)

Finally, we assume a traditional \(\alpha = 0.05\) and look for the proportion of times we reject the false null.

sim_summary <- sim_tests          %>%
  mutate(reject = p.value <=0.05) %>%
  group_by(n, var_g2)             %>%
  summarise(actual_alpha = mean(reject)) %>%
  mutate(`Sample size` = factor(n))
## `summarise()` has grouped output by 'n'. You can
## override using the `.groups` argument.
ggplot(sim_summary, aes(x = var_g2, y = actual_alpha, color = `Sample size`))+
  geom_line()+
  geom_point()+
  scale_x_continuous(trans = "log2", breaks = c(1,2,5,10,25,50,100)) +
  geom_hline(yintercept = 0.05)+
  labs(x = "population variance in group 2\n(population variance in group 1 is always 1)")

The plot shows that violating the assumption of equal variance seems to matter most when sample size is small and violations are extreme.

References

Whitlock, Michael C., and Dolph Schluter. 2020. The Analysis of Biological Data. Third edition.
Young, Kevin V., Edmund D. Brodie, and Edmund D. Brodie. 2004. “How the Horned Lizard Got Its Horns.” Science 304 (5667): 65. https://doi.org/10.1126/science.1094790.