Chapter 24 Hypothesis Tests About Two Means
24.1 Paired Samples t-test
Here, we will cover how to test a hypothesis involving paired samples (also called dependent samples or matched pairs). This is when we take two measurements on the same individual and wish to see if some sort of significant change has occurred. Common scenarios include pre-test vs post-test, before vs after, etc. Notice we are not comparing two independent groups (such as comparing a treatment group with a control group); this is a two-sample t-test which we’ll discuss later.

The paired samples t-test really is just a special kind of one-sample t-test, where we will comput a difference score D=Y−X and perform the t-test (or compute a confidence interval) on the differences D. This is the most basic kind of repeated measures analysis; more advanced methods that won’t be considred in this class include repeated measures ANOVA and multilevel models (also called linear mixed models).
Usually the null value will be zero (although it doesn’t have to be) and it is quite common to use one-sided/directional alternative hypotheses. For example, if I gave a class a pre-test at the beginning of the semester to see what you knew about statistics and a post-test at the end of the semester, I would hope that the mean of the post-test scores are significantly higher than the pre-test scores. If a group of overweight people were going on a diet to try and lose weight, we would expect their weight after the diet to be significantly less than before the diet.
24.2 Example of Paired Samples
An experiment is being conducted to see if one’s reaction time is affected by drinking alcohol. A random sample of n=10 people are tested, with their reaction time measured both before and after consuming alcohol. Suppose we expect that the reaction time will be significantly slower after drinking alcohol.
Before doing the paired-samples t-test, I’m going to compute some basic statistics, and review computing the sample variance and standard deviation in a step-by-step process (as you might need to do this on the final exam).
Note that ˉD=0.09. This is the mean difference in means; the average subject had a reaction time that was 0.09 seconds slower after drinking alcohol. Notice subjects 5 and 7 have negative differencea, as they actually had a faster reaction time after drinking, the opposite of what was expected. This would be like if you did worse on the post-test or if a dieter actually gains weight on the diet.
Subject | X (Before) | Y (After) | D=Y−X | D−ˉD | (D−ˉD)2 |
---|---|---|---|---|---|
1 | 0.40 | 0.56 | 0.16 | 0.07 | 0.0049 |
2 | 0.50 | 0.50 | 0.00 | -0.09 | 0.0081 |
3 | 0.65 | 0.78 | 0.13 | 0.04 | 0.0016 |
4 | 0.45 | 0.67 | 0.22 | 0.13 | 0.0169 |
5 | 0.60 | 0.50 | -0.10 | -0.19 | 0.0361 |
6 | 0.44 | 0.54 | 0.10 | 0.01 | 0.0001 |
7 | 0.60 | 0.55 | -0.05 | -0.14 | 0.0196 |
8 | 0.50 | 0.70 | 0.20 | 0.11 | 0.0121 |
9 | 0.65 | 0.82 | 0.17 | 0.08 | 0.0064 |
10 | 0.45 | 0.52 | 0.07 | -0.02 | 0.0004 |
Notice that ˉD=∑Dn=0.9010=0.09
s2D=∑(D−ˉD)2n−1=0.106210−1=0.0118
sD=√0.0118=0.1086
I will also double-check to make sure that the distribution of D is “nearly normal” and doesn’t contain any outliers with an informal outlier check.
Min=−0.10,Q1=0,M=0.115,Q3=0.17,Max=0.22
IQR=0.17−0=0.17,Step=1.5×IQR=0.255
Q1−Step=0−0.255=−0.255,Q3+Step=0.17+0.255=0.425 There are no outliers. I will proceed with the t-test. What are called nonparametric tests are often used if there are outliers present that can’t be removed from the data; examples include the Wilcoxon Signed Rank Test and the Mann-Whitney Test (we will not study the formulas for these tests).
Step One: Write Hypotheses
HO:μD=0 HA:μD>0
Step Two: Choose Level of Significance
Let α=0.05
Step Three: Choose Test & Check Assumptions I am satsified that we have a sample that is random, paired, with the difference satisfying the “nearly normal” condition, so I will use the paired samples t-test.
Step Four: Calculate the Test Statistic
t=ˉD−μ0sD/√n t=0.09−00.1086/√10
t=0.090.0343
t=2.620
df=n−1=9
Step Five: Make a Statistical Decision (via the p-value)
The p-value is P(t>2.62). With df=9, notice our test statistic is between the critical values t∗=2.262 and t∗=2.821, which correspond to one-tail probabilities of 0.025 and 0.01, respectively. This means that 0.01<p−value<0.025
Below is a picture showing the p-value, which can be computed with software to be p=0.0139, which as promised is more than α=0.01 and less than α=0.025.
Since we are using α=0.05 and p<α, we REJECT the null hypothesis.
Step Six: Conclusion
The mean difference in reaction time decrease significantly after the subject drank alcohol.
Below is what statistical output from the software package R would look like for this problem. I also included output for the Wilcoxon Signed Rank test (even though we will not cover the formula for this test). The software includes a one-sided confidence interval that I’ll ignore.
##
## Paired t-test
##
## data: Before and After
## t = -2.62, df = 9, p-value = 0.01391
## alternative hypothesis: true mean difference is less than 0
## 95 percent confidence interval:
## -Inf -0.0270305
## sample estimates:
## mean difference
## -0.09
##
## Wilcoxon signed rank test with continuity correction
##
## data: Before and After
## V = 4, p-value = 0.01648
## alternative hypothesis: true location shift is less than 0
24.3 Independent Samples t-test
Here, we will cover how to test a hypothesis involving independent samples (also called the two-sample t-test). This is when we take measurements on subjects from two groups that are independnet of each other, to see if some sort of significant change has occurred.
Difference of Two Independent Means
Often we want to test the difference of means between two independent groups. For example, we want to compare the mean age of subjects assigned to the treatment group (call it μ1) and the mean age of subjects assigned to the control group (call it μ2). If we assume equal variances, the test statistic is:
t=¯x1−¯x2√s2p(1n1+1n2)
where s2p, the pooled variance is:
s2p=(n1−1)s21+(n2−1)s22n1+n2−2
and the degrees of freedom are df=n1+n2−2
We have a sample of n1=16 from the treatment group and n2=14 people from the control group. The summary statistics are: ¯x1=55.7 years, s1=3.2, ¯x2=53.2, s2=2.8
We are testing for a significant difference (i.e. no direction), as we are interested in differences in either direction (the treatment group might be significantly older or younger).
Step One: Write Hypotheses
H0:μ1=μ2 Ha:μ1≠μ2
Step Two: Choose Level of Significance
Let α=0.05.
Step Three: Choose Test/Check Assumptions
I will choose the two-sample t-test. Assumptions include that the samples are independent, random, and normally distributed with equal variances. For this example, we will assume these are met without any further checks. The Wilcoxon Rank Sum (or Mann-Whitney) test is a nonparametric alternative if this nearly normal condition is violated.
Step Four: Compute the Test Statistic
s2p=(16−1)(3.22)+(14−1)(2.82)16+14−2=255.5228=9.1257 t=¯x1−¯x2√s2p(1n1+1n2)=55.7−53.2√9.1257(116+114)=2.5√1.2222 t=2.261 df=n1+n2−2=16+14−2=28
Step 5: Find p-value, Make Decision
In our problem, we have the two-sided alternative hypothesis Ha:μ1≠μ2, a test statistic t=2.261, and df=28.
We refer to the row for df=28. Notice our test statistic t=2.261 falls between the critical values t∗=2.048 and t∗=2.467.
Because we have a two-sided test, our p-value is .02<p-value<.05.
Reject H0 at α=.05. We would have failed to reject H0 if we had used α=0.01.

Step Six: Conclusion
Since we rejected the null hypothesis, a proper conclusion would be:
The mean age of people in the treatment group is than the mean height of people in the control group.
Notice that the effect is said to be significantly different and NOT significantly greater. This is because we chose a two-sided alternative hypothesis rather than a one-sided hypothesis. This decision should be made a priori (before collecting data) and NOT changed just to get a ‘desired’ result.
24.4 Unequal Variances
The assumption of equal variances is sometimes not met, and an alternative version of the t-test exists when we assume unequal variances. This test is often called Welch’s t′-test.
t′=¯x1−¯x2√s21n1+s22n2
Notice we do not need to compute s2p, pooled variance. However, the degrees of freedom are no longer equal to n1+n2−2. Instead, df is either computed by an ugly formula (details hidden), approximated with an easier formula, or approximated even further by treating it as a z-test (for large samples).
Let’s look at another example, this time assuming unequal variances.
Example: Suppose I am comparing final exam scores for two different sections of the same course, with one class having taken their final exams on Monday morning and the other on Friday afternoon. We wish to see if there is a significant difference in scores.
Step One: Write Hypotheses
H0:μF=μM HA:μF≠μM
Step Two: Choose Level of Significance
Let α=0.05.
Step Three: Choose Test/Check Assumptions
I will choose the two-sample t-test. Assumptions include that the samples are independent, random, and normally distributed with unequal variances. Notice there is an outlier in the Friday class, but it is a real score and I will not change or drop the score.
## ________________________________
## 1 | 2: represents 12, leaf unit: 1
## Monday$exam Friday$exam
## ________________________________
## | 2 |9
## | 3 |
## 9| 4 |57
## 3| 5 |
## 3| 6 |23578
## 875111| 7 |24
## 987766311| 8 |027
## 7| 9 |36
## 0| 10 |
## ________________________________
## n: 20 15
## ________________________________
## class min Q1 median Q3 max mean sd n missing
## 1 F 29 62 68 82 96 68.66667 18.35626 15 0
## 2 M 49 71 81 87 100 78.65000 13.05565 20 0
Step 4: Compute the test statistic
t′=¯x1−¯x2√s21n1+s22n2 t′=68.66667−78.65√18.35626215+13.05565220 t′=−1.793 Instead of using the complicated formula for degrees of freedom, I will use an approximation that will underestimate the df and thus provide a statistically conservative test that keeps α below the stated level of 0.05.
df≈min(n1−1,n2−1)≈min(14,19)≈14
Step 5: Find p-value, Make Decision
In our problem, we have the two-sided alternative hypothesis Ha:μ1≠μ2, a test statistic t`=-1.793, and df \approx 14.
We refer to the row for df=14. Notice the absolute value of our test statistic |t'|=1.793 falls between the critical values t^*=1.761 and t^*=2.145.

Because we have a two-sided test, our p-value is .05<\text{p-value}<.10.
Fail to reject H_0 at \alpha=.05. (What would it have been if I had made a one-sided hypothesis that scores were lower on Friday?)
Step 6: Make Conclusion The means of the final exam scores are not statistically different between Monday and Friday.
Here’s what the results look like when I used software. Notice the exact df and p-value is computed by software so that an approximation for df and a table are not needed. This result is more accurate. The software also computes a 95% confidence interval for the difference in the two means.
If you decided to NOT use a t-test based on the outlier, I’ve also included results from the Mann-Whitney test (called the Wilcoxon Rank Sum Test by my software), a non-parametric alternative (we will not cover its formula).
##
## Welch Two Sample t-test
##
## data: exam by class
## t = -1.7935, df = 24.084, p-value = 0.08547
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
## -21.469922 1.503255
## sample estimates:
## mean in group F mean in group M
## 68.66667 78.65000
##
## Wilcoxon rank sum test with continuity correction
##
## data: exam by class
## W = 96.5, p-value = 0.07706
## alternative hypothesis: true location shift is not equal to 0