Chapter 19 Two samples from normal distributions
Motivating scenarios: We are interested to investigate the difference between the means of two samples.
Learning goals: By the end of this chapter you should be able to
- Know when and how to use a two-sample t-test.
- Recognize the difference between a paired and two-sample t-test.
- Recognize the difference between a paired and two-sample t-test.
- Work through the math underlying a two-sample t-test.
- State the assumptions of a two-sample t-test.
- Know alternatives to a two sample t-test when data break assumptions
19.0.0.1 Example: Lizard survival
Could long spikes protect horned lizards from being killed by the loggerhead shrike? The loggerhead shrike is a small predatory bird that skewers its victims on thorns or barbed wire. Young, Brodie, and Brodie (2004) compared horns from lizards that had been killed by shrikes to 154 live horned lizards. The data are plotted below.
<- read_csv("https://whitlockschluter3e.zoology.ubc.ca/Data/chapter12/chap12e3HornedLizards.csv") %>% na.omit() lizards
Living Lizard summary stats: From our sample of 154 living lizards, the mean horn length was 24.28, with a sample variance of 6.92.
Killed Lizard summary stats: From our sample of 30 killed lizards, the mean horn length was 22, with a sample variance of 7.34.
Summarizing the difference in Killed and Living lizards: So, surviving lizards have horns that are 2.29 longer than killed lizards.
19.1 Calculations for a two sample t-test
So, we can calculate group means and differences between these mans, as well as variances within each group. But how do we turn this into a estimates of uncertainty in the between group difference (i.e. the standard error and 95% confidence intervals), and how can we test the null hypothesis that groups represent samples from the same population? To start to answer these questions we need to describe the within group (aka pooled) variance in horn length.
19.1.1 The pooled variance
How do we think about this variance? I like to think of it as “how far on average do we expect the square of the difference between one random individual from each group to deviate from the average difference between groups”. A quantity that nearly captures this is the pooled variance, \(s_p^2\), which is the variance in each group weighted by the degrees of freedom in each group, \(df_1\) and \(df_2\) and divided by the total degrees of freedom, \(df_t = df_1+df_2\).
\[\begin{equation} \begin{split} s_p^2 &= \frac{df_1\times s_1^2+df_2\times s_2^2}{df_1+df_2} \end{split} \tag{19.1} \end{equation}\]
Calculating the pooled variance in the lizard example
Recall that we had 154 living lizards, with a sample variance in horn length of 6.92 and 30 killed lizards, with a sample variance in horn length of 7.34. So in this case, the pooled variance is
\[\begin{equation}
\begin{split}
s_p^2 &= \frac{df_1\times s_1^2+df_2\times s_2^2}{df_1+df_2}\\
&= \frac{df_\text{living}\times s_\text{living}^2+df_\text{killed}\times s_\text{killed}^2}{df_\text{living}+df_\text{killed}}\\
&= \frac{(154-1)\times 6.92 + (30-1)\times 7.34}{154 + 30 -2} \\
&= 6.99
\end{split}
\tag{19.2}
\end{equation}\]
We can do this math in R
# First we get the summaries
<- lizards %>%
lizard_summaries group_by(Survival) %>%
summarise(n = n(), # sample sizes for each group
df = n - 1, # degrees of fredom for each group
mean_horn = mean(squamosalHornLength), # means for each group
var_horn = var(squamosalHornLength)) # variances for each group
# lets check this out! lizard_summaries
## # A tibble: 2 × 5
## Survival n df mean_horn var_horn
## <chr> <int> <dbl> <dbl> <dbl>
## 1 killed 30 29 22.0 7.34
## 2 living 154 153 24.3 6.92
# Then we calculate the pooled variance
%>%
lizard_summaries summarise(pooled_variance = sum(df * var_horn) / sum(df) ) # calculate the pooled variance
## # A tibble: 1 × 1
## pooled_variance
## <dbl>
## 1 6.99
Effect size of the difference in Killed and Living lizards: Above we found that surviving lizards have horns that are 2.29 longer than killed lizards. But recall that our estimate of effect size, Cohen’s D, standardizes this difference by the standard deviation to give it more context. The math below shows that the average surviving lizard has a horn about 0.87 standard deviations larger than the average killed lizard. That’s pretty substantial.
\[\text{Cohen's d } = \frac{\text{distance from null}}{s} = \frac{24.28 - 21.99}{\sqrt{ 6.99}} = \frac{2.29}{\sqrt{ 2.64}} = 0.87\]
19.1.2 The standard error
So Cohen’s d showed a very large difference in horn lengths. But how does this difference compare to what we expect by sampling error?
We could, of course, get a sense of this sampling error by resampling from each group with replacement, and calculating the mean difference for each “bootstrapped” replicate to approximate a sampling distribution, and use this to calculate a standard error and confidence interval
<- replicate(n = 10000, simplify = FALSE,
lizard_boot %>%
lizards group_by( Survival) %>%
slice_sample(prop = 1, replace = TRUE) %>%
::summarise(mean_horn = mean(squamosalHornLength)) %>%
dplyr::summarise(mean_horn_diff = diff(mean_horn))) %>%
dplyrbind_rows()
%>%
lizard_boot summarise(se_mean_diff = sd(mean_horn_diff),
lower_CI_horn_diff = quantile(mean_horn_diff, prob = 0.025),
upper_CI_horn_diff = quantile(mean_horn_diff, prob = 0.975))
Alternatively, we can use math tricks to estimate the standard error. Specifically, the standard error for the difference in means, \(SE_{\overline{x_1}-\overline{x_2}}\) equals \(SE_{\overline{x_1}-\overline{x_2}} = \sqrt{s_p^2 \Big(\frac{1}{n_1} + \frac{1}{n_2}\Big)}\), where \(s_p^2\) is the pooled variance from Equation (19.1). In this case, \(SE_{\overline{x_1}-\overline{x_2}} = \sqrt{6.99 \Big(\frac{1}{154} + \frac{1}{30}\Big)} 0.58\).
As in Chapter 18, we can use equation (18.1) – \((1-\alpha)\%\text{ CI } = \overline{x} \pm t_{\alpha/2} \times SE_x\) – to find the upper and lower confidence intervals.
Again, we can do these calculations in R. We see below that these mathematical answers are really close to the bootstrapped based estimates of uncertainty.
<- 0.05
alpha %>%
lizard_summaries summarise(est_diff = abs(diff(mean_horn)),
pooled_variance = sum(df * var_horn) / sum(df),
se = sqrt(sum(pooled_variance/n)),
crit_t_95 = qt(p = alpha/2, df = sum(df), lower.tail = FALSE),
lower_95CI = est_diff - se * crit_t_95,
upper_95CI = est_diff + se * crit_t_95)
## # A tibble: 1 × 6
## est_diff pooled_variance se crit_t_95 lower_95CI
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2.29 6.99 0.528 1.97 1.25
## # … with 1 more variable: upper_95CI <dbl>
19.1.3 Quantifying surprise
How surprised would we be to see such an extreme difference if we expected no difference at all? The p-value quantifies this surprise.
We can, of course estimate this by permutation
# To do so, we first find the absolute value of the difference in group means
<- lizard_summaries %>%
abs_obs_diff ::summarise(diff(mean_horn)) %>%
dplyrpull() %>%
abs()
<- replicate(50000, simplify = FALSE,
lizard_perm %>%
lizards mutate(perm_survival = sample(Survival, replace = FALSE))%>% # shuffle
group_by(perm_survival) %>%
::summarise(mean_horn = mean(squamosalHornLength)) %>% # calculate permuted group means
dplyr::summarise(mean_horn_diff = diff(mean_horn))) %>% # calculate difference in permuted group means
dplyrbind_rows()
%>%
lizard_perm mutate(as_or_more_extreme = abs(mean_horn_diff) >= abs_obs_diff) %>%
::summarise(p_value = mean(as_or_more_extreme)) dplyr
# To do so, we first find the absolute value of the difference in group means
<- lizard_summaries %>%
abs_obs_diff ::summarise(diff(mean_horn)) %>%
dplyrpull() %>%
abs()
<- read_csv("data/lizard_perm.csv")
lizard_perm %>%
lizard_perm mutate(as_or_more_extreme = abs(mean_horn_diff) >= abs_obs_diff) %>%
::summarise(p_value = mean(as_or_more_extreme)) dplyr
We find a very small p-value and reject the null hypothesis.
Alternatively, we could
- Calculate a t-value (as we did in Chapter 18).
- Compare the observed t to t’s sampling distribution with 182 degrees of freedom
- Find the area that exceeds our observed \(t\) on either tail of the distribution.
So, we first find our t-value as \(\frac{est - \mu_0}{se} = \frac{2.294-0}{0.5274} = 4.35\). We then compare it to its null sampling distribution. Figure 19.1 makes it clear that the p-value is very small – we need to zoom in on the null sampling distribution substantially to see anything as or more extreme
The exact p-value is 2 x pt(4.35, df = 182, lower.tail = FALSE)
= 2.3^{-5}. We can calcualte this in R
%>%
lizard_summaries summarise(est_diff = abs(diff(mean_horn)),
pooled_variance = sum(df * var_horn) / sum(df),
se = sqrt(sum(pooled_variance/n)),
t = est_diff / se,
p = 2 * pt(t, df = sum(df), lower.tail = FALSE))
## # A tibble: 1 × 5
## est_diff pooled_variance se t p
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2.29 6.99 0.528 4.35 0.0000227
19.2 A two sample t-test in R
There are a few ways to conduct a two sample t-test in R –
- By hand (as above).
- With the
t.test()
function. - With the
lm()
function (as we will see in the future).
19.2.1 With the t.test() function
- First, we could use the
t.test()
function.
# Either of these two approaches would work. Note that data = ., tells R we need the data that is piped in
# lizards %>%
# t.test(squamosalHornLength ~ Survival, data = ., var.equal = TRUE)
t.test(squamosalHornLength ~ Survival, data = lizards, var.equal = TRUE)
##
## Two Sample t-test
##
## data: squamosalHornLength by Survival
## t = -4.3, df = 182, p-value = 2e-05
## alternative hypothesis: true difference in means between group killed and group living is not equal to 0
## 95 percent confidence interval:
## -3.335 -1.254
## sample estimates:
## mean in group killed mean in group living
## 21.99 24.28
Piping this outpot to tidy makes it more compact
library(broom)
t.test(squamosalHornLength ~ Survival, data = lizards, var.equal = TRUE) %>%
tidy()
## # A tibble: 1 × 10
## estimate estimate1 estimate2 statistic p.value
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 -2.29 22.0 24.3 -4.35 0.0000227
## # … with 5 more variables: parameter <dbl>,
## # conf.low <dbl>, conf.high <dbl>, method <chr>,
## # alternative <chr>
Reassuringly, R’s output matches our calculations (but it swaps group 1 and 2, so watch out for that)
19.3 Assumptions
Before we worry about assumptions, let’s take a minute to think about assumptions in statistics. What is an assumption? What happens when we break one?
An assumption is basically a statement we make for mathematical convenience. We do this to make stats tractable (otherwise we’d either be doing different math for each data set or bootstrapping and permuting only).
- If data meet assumptions, we can be confident that our statistics will work as intended.
- If data break assumptions, the stats could still work. Or they could not. There are no guarantees, and no refunds.
But often breaking assumptions does not break a test. In fact, many statistical tests are remarkably robust to violations of their assumptions (but some are not).
How can we know if breaking an assumption will break our stats or if it’s safe?
- The easiest way is to ask someone who knows.
- The best way is to simulate.
We will simulate soon to show you what I mean, but for now, trust me, I know.
19.3.1 Assumptions of the two-sample t-test
A two sample t-test assumes: Eaul variance in each group, Independence, Normality (of residual values, or more specifically, of their sampling distribution), and that Data are collected without bias.
We can ignore the linearity assumption for a two sample t-test.
While we can never ignore bias or non-independence, we usually need more details of study design to get at them, so let’s focus on homoscedasticity and normality.
19.3.2 The two sample t-test assumes equal variance among groups
This makes sense as we used a “pooled variance” in the two sample t-test – that is, in the math we assume that both groups have equal variance equal to the pooled variance.
- Homoscedasticity means that the variance of residuals is independent of the predicted value. In this case we only make two predictions – one per group – so this essentially means that we assume equal variance in each group. Let’s evaluate this prediction by making a histogram of the residuals – remembering that the comparing the variance in each group (we already calculated this in
lizard_summaries
)
lizard_summaries
## # A tibble: 2 × 5
## Survival n df mean_horn var_horn
## <chr> <int> <dbl> <dbl> <dbl>
## 1 killed 30 29 22.0 7.34
## 2 living 154 153 24.3 6.92
How similar must variances be to make a two-sample t-test ok? We’ll explore this soon, when we’ll see that the t-test is robust to violations of this assumption. In fact, as a rule of thumb, if variance in one group is less than five times the variance in the other group, we’re usually ok.
19.3.3 The two-sample t-test assumes normally disributed residuals
Note this means we assume the differences from group means are normally distributed, not that the raw data are.
As we saw for a one sample t-test in Chapter 18, the two sample t-test is quite robust to this assumption, as we are actually assuming a normal sampling distribution within each group, rather than a normal distribution of values regardless of group.
Let’s do this by making a qqplot
%>%
lizards ggplot(aes(sample = squamosalHornLength))+
geom_qq()+
geom_qq_line()+
facet_wrap(~Survival)+
labs(title = "qq plots of lizard data by survival group")
The data are obviously not normal, as the small values are too small (below the qq line). But the sample size is pretty big, so I would feel comfortable assuming normality here.
19.4 Alternatives to a two sample t-test
So, what to do if we break assumptions.
- We could ignore the violation, especially if it is minor and we believe the test is robust.
- We could transform the data, as described in Chapter 17.
- Or we could find a better test.
Let’s start at the bottom of the list – what are alternatives to the two-sample t-test?
19.4.1 Permutation / bootstrapping
We saw above that we can bootstrap (i.e. resample with replacement) to estimate uncertainty in the estimate of the mean difference, and we can permute (i.e. shuffle labels) to generate a null sampling distribution for hypothesis testing.
These approaches make nearly no assumptions and most always work, but generally require modest sample sizes of at least fifteen in each group or so, such that resampling from them with replacement generates enough combinations to reasonably approximate the sampling distribution from a population.
19.4.2 Welch’s t-test when variances differ by group
If the variances differ by group, we could do a Welch’s t-test, which does not assume equal variance. Here we use a different equation to calculate the t value and degrees of freedom 19.6.1, which you need never know.
Do know
- That you can use Welch’s t-test when variance differs by group.
- You can easily conduct a Welch’s t-test in R with the
t.test()
function by settingvar.equal
equal toFALSE
. In fact, this is the default behavior of tis function in R.
%>%
lizards t.test(squamosalHornLength ~ Survival, data = ., var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: squamosalHornLength by Survival
## t = -4.3, df = 40, p-value = 1e-04
## alternative hypothesis: true difference in means between group killed and group living is not equal to 0
## 95 percent confidence interval:
## -3.382 -1.207
## sample estimates:
## mean in group killed mean in group living
## 21.99 24.28
All interpretations of results from the Welch’s t-test are the same as a interpretations of stats from a two sample t-test
- The t value is still the distance, in standard errors, between the observed difference in sample means and the null (0) away from one.
- The p-value is still the proportion of draws from the null that will be as or more extreme than our test statistic.
The only difference is that now the sampling distribution uses the variance from each group, rather than combining the variance into its average, “pooled variance”.
19.4.3 Mann-Whitney U and Wilcoxon rank sum tests
If residuals are not normal, we would usually do a permutation test nowadays. Historically, however, people used “rank based” “non parametric” tests when data weren’t normal. Here we
- Convert all data from their actual values to their “rank” (e.g. first, second, third, etc..) in the data set.
- Turn the distribution of ranks into a test statistic.
- Find a sampling distribution for ranks under a null model.
- Calculate a p-value, and reject or fail to reject the null hypothesis.
Here’s an example in R
%>%
lizards wilcox.test(squamosalHornLength ~ Survival, data = .)
##
## Wilcoxon rank sum test with continuity
## correction
##
## data: squamosalHornLength by Survival
## W = 1182, p-value = 2e-05
## alternative hypothesis: true location shift is not equal to 0
These tests are falling out of fashion because they don’t provide any estimate of uncertainty, they make their own assumptions about the distribution of data, and because permutation and bootstrapping can be done on our home computers. So I keep this here just to expose you to it. Know it’s a thing you might hear about some time, and know that we’re not worrying about it here.
19.6 Optional
19.6.1 Boring math
You don’t need to know this, but the formula for Welch’s t-test is
\[t\quad =\quad {\;\overline {X}_{1}-\overline {X}_{2}\; \over {\sqrt {\;{s_{1}^{2} \over N_{1}}\;+\;{s_{2}^{2} \over N_{2}}\quad }}}\,\]
With degrees of freedom equal to
\[{\displaystyle df \quad \approx \quad {{\left(\;{s_{1}^{2} \over N_{1}}\;+\;{s_{2}^{2} \over N_{2}}\;\right)^{2}} \over {\quad {s_{1}^{4} \over N_{1}^{2}\times df _{1}}\;+\;{s_{2}^{4} \over N_{2}^{2} \times df _{2}}\quad }}}\]
19.7 Other thigns to do when data do not meet assumptions
19.7.1 Fit your own likelihood model
If you can write down a likelihood function (like in Chapter 18.8.3) you can customize your assumptions to fit your data.
19.8 Simulations to evaluate test performance
We often want to know what to expect from a statistical test. For example, we might want to
- Evaluate how robust a test is to violation of its assumptions, or
- Evaluate the power a test has to reject a false null
We can do both of these things by simulation!!! Here I focus on robustness of the two sample t-test to violations of the assumption of equal variance.
First we set up the parameters – I am going to do 10000 reps with variance in group 1 always equal to one, and variance in group 2 equal to 1,2,5,
<- 10000
n_experiments_per_pair <- c(1,1,1,2,1,5,1,10,1,25,1,50,1,100)
vars <- vars[1:length(vars) %%2==0] # a trick to grab the even valus var_grp_2
I then make data for samples of size ten and fifty. The code here is a bit awakard but basically for each experiment I
- Assign n individuals in group 1 or 2,
- Match it up so group 1 has a variance of one, and group 2 has the variance it should (1, 2,5,10, 25,50, 100),
- I then simulate data for each individual from a normal distribution with mean equals zero, and standard deviation equal to the square root of the specified variance.
# sample size of 10
<- 10
n <- tibble(this_var = rep(rep(vars , each = n ), times = n_experiments_per_pair),
sim_n10 this_group = rep(rep(rep(1:2, each =n ), times = length(var_grp_2)),times = n_experiments_per_pair),
this_experiment = rep(1: (n_experiments_per_pair * length(var_grp_2)) , each = 2 * n),
var_g2 = rep(rep(var_grp_2 , each = 2* n ), times = n_experiments_per_pair),
n = n) %>%
mutate(sim_val = rnorm(mean = 0, sd = sqrt(this_var), n = n()))
# sample size of 50
<- 50
n <- tibble(this_var = rep(rep(vars , each = n ), times = n_experiments_per_pair),
sim_n50 this_group = rep(rep(rep(1:2, each =n ), times = length(var_grp_2)),times = n_experiments_per_pair),
this_experiment = rep(1: (n_experiments_per_pair * length(var_grp_2)) , each = 2 * n),
var_g2 = rep(rep(var_grp_2 , each = 2* n ), times = n_experiments_per_pair),
n = n) %>%
mutate(sim_val = rnorm(mean = 0, sd = sqrt(this_var), n = n()))
We then find the p-value for all of these data simulated under a null, but breaking test assumptions
<- bind_rows(sim_n10, sim_n50) %>%
sim_tests group_by(this_experiment,n) %>%
nest(data = c(this_group, this_var,sim_val)) %>%
mutate(fit = map(.x = data, ~ t.test(.x$sim_val ~ .x$this_group, var.equal = TRUE)),
results = map(fit, tidy)) %>%
unnest(results) %>%
::select(this_experiment, var_g2, n, p.value) dplyr
Finally, we assume a traditional \(\alpha = 0.05\) and look for the proportion of times we reject the false null.
<- sim_tests %>%
sim_summary mutate(reject = p.value <=0.05) %>%
group_by(n, var_g2) %>%
summarise(actual_alpha = mean(reject)) %>%
mutate(`Sample size` = factor(n))
## `summarise()` has grouped output by 'n'. You can
## override using the `.groups` argument.
ggplot(sim_summary, aes(x = var_g2, y = actual_alpha, color = `Sample size`))+
geom_line()+
geom_point()+
scale_x_continuous(trans = "log2", breaks = c(1,2,5,10,25,50,100)) +
geom_hline(yintercept = 0.05)+
labs(x = "population variance in group 2\n(population variance in group 1 is always 1)")
The plot shows that violating the assumption of equal variance seems to matter most when sample size is small and violations are extreme.