Chapter 19 Hypothesis Testing
19.1 Introduction to hypothesis testing
In estimation, we ask what the value of some particular parameter of interest is in the population. For example, what is the average annual income of residents in the UK?
Often there are times in statistics when we are not interested in the specific value of the parameter, but rather are interested in asserting some statement regarding the parameter of interest. Some examples:
- We want to claim that the average annual income of UK residents is more than or equal to £35,000.
- We want to assess whether the average annual income of men in academia in the UK is the same as that of women in similar ranks.
- We want to determine whether the number of cars crossing a certain intersection follows a Poisson distribution or whether it is more likely to come from a geometric distribution.
To perform a statistical hypothesis test, one needs to specify two disjoint hypotheses in terms of the parameters of the distribution that are of interest. They are
- H0: Null Hypothesis,
- H1: Alternative Hypothesis.
Traditionally, we choose H0 to be the claim that we would like to assert.
Returning to our examples:
- We want to claim that the average annual income of UK residents is more than or equal to £35,000. We test
  H0: μ ≥ 35,000 vs. H1: μ < 35,000.
- We want to assess whether the average annual income of men in academia in the UK is the same as that of women at similar ranks. We test
  H0: μmen = μwomen vs. H1: μmen ≠ μwomen.
- We want to determine whether the number of cars crossing a certain intersection follows a Poisson distribution or whether it is more likely to come from a geometric distribution. We test
  H0: X ∼ Po(2) vs. H1: X ∼ Geom(0.5).
Hypotheses where the distribution is completely specified are called simple hypotheses. For example, H0 and H1 in the car example are both simple hypotheses.
Hypotheses where the distribution is not completely specified are called composite hypotheses. For example, H0 and H1 in the average annual income example and both hypotheses in the gender wage example are composite hypotheses (in the gender wage example the common value of the mean under H0 is not specified).
Note that in the average annual income and gender wage examples, the null and alternative hypotheses cover all possibilities, whereas for the car example there are many other choices of distributions which could be hypothesized.
The conclusion of a hypothesis test
We will reject H0 if there is sufficient evidence in our sample that the null hypothesis cannot be true, thereby concluding that the alternative hypothesis is true.
We will not reject H0 if there is insufficient evidence in the sample to refute our claim.
The remainder of this chapter is structured as follows. In Section 19.2 we define Type I and Type II errors, the two ways of making the wrong decision in a hypothesis test. In Section 19.3 we show how to construct hypothesis tests, starting with tests for the mean of a normal distribution with known variance. This is extended to the case where the variance is unknown and to the case where we have two samples to compare. We introduce p-values, which measure how likely (or unlikely) the observed data are if the null hypothesis is true. We then consider hypothesis testing in a wide range of scenarios.
19.2 Type I and Type II errors
Type I error
A Type I error occurs when one chooses to incorrectly reject a true null hypothesis.
A Type I error is also commonly referred to as a false positive.
Type II error
A Type II error occurs when one fails to reject a false null hypothesis.
A Type II error is also commonly referred to as a false negative.
Type I error and Type II error are summarised in the following decision table.
| | One accepts the Null | One rejects the Null |
|---|---|---|
| Null hypothesis is true | Correct Conclusion | Type I Error |
| Null hypothesis is false | Type II Error | Correct Conclusion |
Significance level
The significance level or size of the test is
α = P(Type I error) = P(reject H0 | H0 is true).
Typical choices for α are 0.01, 0.05 and 0.10.
Probability of Type II error
The probability of a Type II error is
β = P(Type II error) = P(do not reject H0 | H0 is false).
Consider the following properties of α and β:
- It can be shown that there is an inverse relationship between α and β: as α increases, β decreases, and vice versa. Therefore, for a fixed sample size, one can only choose to control one of the two types of error. In hypothesis testing we choose to control the Type I error, and we set up our hypotheses initially so that the more serious error is the Type I error, whose probability we can then bound.
- The values of both α and β depend on the values of the underlying parameters. Consequently, we can control α by first choosing H0 to include an equality for the parameter, and then showing that the Type I error is largest at this point of equality. Therefore we may as well define the size of the test at this point. To illustrate, in the average annual income example above,
α = P(reject H0 | μ = 35,000) ≥ P(reject H0 | μ) for any μ > 35,000.
Therefore H0: μ ≥ 35,000 is often just written as H0: μ = 35,000.
- Because H0 describes an equality, H1 is therefore a composite hypothesis. Therefore β=P(Type II error) is a function of the parameter within the alternative parameter space.
Power of a Test
The power of the test is
1 − β = P(reject H0 | H0 is false).
The power of a test can be thought of as the probability of making the correct decision to reject a false null hypothesis.
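The relationship between the size α and the power of a test can be illustrated by simulation. The sketch below is in Python rather than the chapter's R, and the test it simulates (H0: μ = 0 vs. H1: μ < 0 with σ = 1 and n = 25) is an illustrative choice, not an example from the text:

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(1)

# Illustrative one-sided z-test: H0: mu = 0 vs. H1: mu < 0, sigma known.
mu0, sigma, n, alpha = 0.0, 1.0, 25, 0.05
# Cut-off value c: reject H0 when the sample mean falls below c.
c = mu0 - NormalDist().inv_cdf(1 - alpha) * sigma / sqrt(n)

def reject_rate(true_mu, reps=10000):
    """Proportion of simulated samples in which H0 is rejected."""
    hits = 0
    for _ in range(reps):
        xbar = sum(random.gauss(true_mu, sigma) for _ in range(n)) / n
        hits += xbar < c
    return hits / reps

alpha_hat = reject_rate(mu0)    # close to the size alpha = 0.05
power_hat = reject_rate(-0.5)   # close to the theoretical power, about 0.80
print(alpha_hat, power_hat)
```

Estimating the rejection rate under H0 recovers α, while estimating it under a specific alternative recovers the power against that alternative.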
19.3 Tests for normal means, σ known
In this section we study a number of standard hypothesis tests that one might perform on a random sample.
We assume throughout this section that x1,x2,…,xn are i.i.d. samples from X with E[X]=μ, where μ is unknown and var(X)=σ2 is known.
Test 1: H0:μ=μ0 vs. H1:μ<μ0; σ2 known.
Watch Video 28 for the construction Hypothesis Test 1.
Video 28: Hypothesis Test 1
A summary of the construction of Hypothesis Test 1 is given below.
Data assumptions. We assume either
- X1,X2,…,Xn are a random sample from a normal distribution with known variance σ2;
- The sample size n is sufficiently large so that we can assume ˉX is approximately normally distributed by the Central Limit Theorem, and that either the variance is known or that the sample variance s2≈σ2.
Step 1: Choose a test statistic based upon the random sample for the parameter we want to base our claim on. For example, we are interested in μ so we want to choose a good estimator of μ as our test statistic. That is, ˆμ=ˉX.
Step 2: Specify a decision rule. The smaller ˉX is, the more the evidence points towards the alternative hypothesis μ<μ0. Therefore our decision rule is to reject H0 if ˉX<c, where c is called the cut-off value for the test.
Step 3: Based upon the sampling distribution of the test statistic and the specified significance level of the test, solve for the specific value of the cut-off value c. To find c, note that under H0 we have ˉX ∼ N(μ0, σ2/n), so we require
P(ˉX < c | μ = μ0) = P(Z < (c − μ0)/(σ/√n)) = α,
where Z ∼ N(0,1). Since P(Z < −zα) = α, where zα can be found using qnorm(1-alpha)
(P(Z < zα) = 1 − α) or statistical tables, then
−zα = (c − μ0)/(σ/√n)
and c = μ0 − zα ⋅ σ/√n.
So, the decision rule is to reject H0 if ˉX < μ0 − zα ⋅ σ/√n or, equivalently, Z = (ˉX − μ0)/(σ/√n) < −zα.
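The construction of the cut-off value can be checked numerically. This is a sketch using Python's standard library rather than the chapter's R; `NormalDist().inv_cdf(1 - alpha)` plays the role of `qnorm(1-alpha)`, and the values μ0 = 6, σ = 0.2, n = 20 are illustrative:

```python
from math import sqrt
from statistics import NormalDist

def cutoff_lower_tail(mu0, sigma, n, alpha):
    """Cut-off c for rejecting H0: mu = mu0 in favour of H1: mu < mu0.

    z_alpha is the upper-alpha point of N(0,1); in R this is qnorm(1-alpha).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # e.g. 1.6449 for alpha = 0.05
    return mu0 - z_alpha * sigma / sqrt(n)

# Illustrative values: mu0 = 6, sigma = 0.2, n = 20, alpha = 0.05.
c = cutoff_lower_tail(6.0, 0.2, 20, 0.05)
print(round(c, 4))
```

Any sample mean below the returned cut-off would lead to rejection of H0 at the chosen significance level.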
Test 2: H0:μ=μ0 vs. H1:μ>μ0; σ2 known.
This is similar to the previous test, except the decision rule is to reject H0 if ˉX > μ0 + zα ⋅ σ/√n or, equivalently, Z = (ˉX − μ0)/(σ/√n) > zα.
Note that both these tests are called one-sided tests, since the rejection region falls on only one side of the outcome space.
Test 3: H0:μ=μ0 vs. H1:μ≠μ0; σ2 known.
The test statistic ˉX does not change but the decision rule does. The decision rule is to reject H0 if ˉX is sufficiently far (above or below) from μ0. Specifically, reject H0 if ˉX < μ0 − zα/2 ⋅ σ/√n or ˉX > μ0 + zα/2 ⋅ σ/√n. Equivalent to both of these is |Z| = |(ˉX − μ0)/(σ/√n)| > zα/2.
This is called a two-sided test because the decision rule partitions the outcome space into two disjoint intervals.
Coffee machine.
Suppose that a coffee machine is designed to dispense 6 ounces of coffee per cup with a standard deviation σ = 0.2, where we assume the amount of coffee dispensed is normally distributed. A random sample of n = 20 cups gives ˉx = 5.94. Test whether the machine is correctly filling the cups.
We test H0:μ=6.0 vs. H1:μ≠6.0 at significance level α=0.05.
Using a two-sided test with known variance, the decision rule is to reject H0 if |Z| = |(ˉx − 6.0)/(0.2/√20)| > z0.05/2 = z0.025 = 1.96. Now
|Z| = |(5.94 − 6.0)/(0.2/√20)| = 0.06/0.0447 = 1.342 < 1.96.
Therefore, we conclude that there is not enough statistical evidence to reject H0 at α = 0.05.
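The calculation above can be reproduced numerically. This is a sketch in Python's standard library rather than the chapter's R:

```python
from math import sqrt
from statistics import NormalDist

# Example 19.3.1 (coffee machine): two-sided z-test with sigma known.
xbar, mu0, sigma, n, alpha = 5.94, 6.0, 0.2, 20, 0.05

z = (xbar - mu0) / (sigma / sqrt(n))          # about -1.3416
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96
reject = abs(z) > z_crit
print(round(z, 4), reject)
```

Since |z| = 1.342 < 1.96, `reject` is `False`, matching the conclusion in the text.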
19.4 p values
When our sample information determines a particular conclusion to our hypothesis test, we only report that we either reject or do not reject H0 at a particular significance level α. Hence when we report our conclusion the reader doesn’t know how sensitive our decision is to the choice of α.
To illustrate, in Example 19.3.1 (Coffee Machine) we would have reached the same conclusion that there is not enough statistical evidence to reject H0 at α=0.05 if |Z|=1.95 rather than |Z|=1.34. Whereas, if the significance level was α=0.10, we would have rejected H0 if |Z|=1.95>z0.10/2=1.6449, but we would not reject H0 if |Z|=1.34<z0.10/2=1.6449.
Note that the choice of α should be made before the test is performed; otherwise, we run the risk of inducing experimenter bias!
p-value
The p-value of a test is the probability of obtaining a test statistic at least as extreme as the observed data, given H0 is true.
So the p-value is the probability, computed assuming H0 is true, of observing a test statistic at least as extreme as the one obtained from the data. That is, it is the smallest value of α at which we would reject H0.
If we report the p-value as well as the conclusion of the test, then the reader can decide how sensitive our result is to the choice of α.
Coffee machine (continued).
Compute the p value for the test in Example 19.3.1.
In Example 19.3.1 (Coffee machine), we were given ˉx = 5.94, n = 20 and σ = 0.2. Our decision rule was to reject H0 if |Z| = |(ˉx − 6.0)/(0.2/√20)| > z0.025.
To compute the p-value for the test, assume H0 is true, that is, μ = 6.0. We want to find
p = P(|Z| ≥ 1.342) = 2 P(Z ≥ 1.342) = 2 (1 − Φ(1.34)) ≈ 2 × 0.0901 = 0.1802.
Consider the following remarks on Example 19.4.2.
- The multiplication factor of 2 has arisen since we are computing the p value for a two-sided test, so there is an equal-sized rejection region at both tails of the distribution. For a one-tailed test we only need to compute the probability of rejecting in one direction.
- The p value implies that if we had chosen an α of at least 0.1802 then we would have been able to reject H0.
- In applied statistics, the p value is interpreted as the sample providing:
- strong evidence against H0, if p ≤ 0.01;
- evidence against H0, if 0.01 < p ≤ 0.05;
- slight evidence against H0, if 0.05 < p ≤ 0.10;
- no evidence against H0, if p > 0.10.
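The p-value calculation of Example 19.4.2 can be reproduced as follows; Python's `NormalDist().cdf` is used in place of the chapter's R tables:

```python
from math import sqrt
from statistics import NormalDist

# Two-sided p-value for the coffee-machine test (Example 19.4.2).
xbar, mu0, sigma, n = 5.94, 6.0, 0.2, 20
z = (xbar - mu0) / (sigma / sqrt(n))

# Two-sided test: double the single-tail probability.
p = 2 * (1 - NormalDist().cdf(abs(z)))
print(round(p, 4))  # about 0.18; tables with z = 1.34 give 0.1802
```

By the interpretation scale above, p ≈ 0.18 > 0.10 means the sample provides no evidence against H0.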
19.5 Tests for normal means, σ unknown
Assume X1,X2,…,Xn is a random sample from a normal distribution with unknown variance σ2.
Test 4: H0:μ=μ0 vs. H1:μ<μ0; σ2 unknown.
The test statistic is T = (ˉX − μ0)/(s/√n), where s2 is the sample variance. Under H0, T follows a t distribution with n − 1 degrees of freedom.
Hence, the decision rule is to reject H0 if T < −tn−1,α, where tn−1,α can be found using the qt
function in R with tn−1,α = qt(1-alpha,n-1)
or using statistical tables similar to those of the normal tables in Section 5.7.
Test 5: H0:μ=μ0 vs. H1:μ>μ0; σ2 unknown.
This is similar to Test 4, except the decision rule is to reject H0 if T = (ˉX − μ0)/(s/√n) > tn−1,α.
Test 6: H0:μ=μ0 vs. H1:μ≠μ0; σ2 unknown.
This is the two-sided version: the decision rule is to reject H0 if |T| = |(ˉX − μ0)/(s/√n)| > tn−1,α/2.
Coffee machine (continued).
Suppose that σ is unknown in Example 19.3.1, though we still assume the amount of coffee dispensed is normally distributed. A random sample of n=20 cups gives mean ˉx=5.94 and sample standard deviation s=0.1501.
Test whether the machine is correctly filling the cups.
We test H0:μ=6.0 vs. H1:μ≠6.0 at significance level α=0.05.
The decision rule is to reject H0 if |T| = |(ˉx − 6.0)/(0.1501/√20)| > t20−1,0.05/2 = t19,0.025 = 2.093.
Now
|T| = |(5.94 − 6.0)/(0.1501/√20)| = 0.06/0.0336 = 1.788 < 2.093,
so we do not reject H0 at α = 0.05.
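The t statistic can be checked numerically. Python's standard library has no t-distribution quantile function, so in this sketch the critical value t19,0.025 = 2.093 is taken from the text's tables (in R it would be qt(0.975,19)):

```python
from math import sqrt

# Example 19.5.1: t-test with sigma unknown.
xbar, mu0, s, n = 5.94, 6.0, 0.1501, 20
t = (xbar - mu0) / (s / sqrt(n))

t_crit = 2.093  # t_{19,0.025} from the text's tables
print(round(abs(t), 4), abs(t) > t_crit)
```

Since |T| ≈ 1.788 < 2.093, H0 is not rejected.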
19.6 Confidence intervals and two-sided tests
Consider the two-sided t-test of size α. We reject H0 if |T| = |(ˉX − μ0)/(s/√n)| > tn−1,α/2. This implies we do not reject H0 if
ˉX − tn−1,α/2 ⋅ s/√n ≤ μ0 ≤ ˉX + tn−1,α/2 ⋅ s/√n.
Note that (ˉX − tn−1,α/2 ⋅ s/√n, ˉX + tn−1,α/2 ⋅ s/√n) is a 100(1−α)% confidence interval for μ. Consequently, if μ0, the value of μ under H0, falls within the 100(1−α)% confidence interval for μ, then we will not reject H0 at significance level α.
In general, therefore, there is a correspondence between the “acceptance region” of a statistical test of size α and the related 100(1−α)% confidence interval. Therefore, we will not reject H0:θ=θ0 vs. H1:θ≠θ0 at level α if and only if θ0 lies within the 100(1−α)% confidence interval for θ.
Coffee machine (continued).
For the coffee machine in Example 19.5.1 (Coffee machine - continued) we wanted to test H0:μ=6.0 vs. H1:μ≠6.0 at significance level α=0.05. We were given a random sample of n=20 cups with ˉx=5.94 and s2=0.1501².
Construct a 95% confidence interval for μ.
The limits of a 95% confidence interval for μ are
ˉx ± t19,0.025 ⋅ s/√n = 5.94 ± 2.093 × 0.1501/√20 = 5.94 ± 0.0702,
so the 95% confidence interval for μ is (5.8698, 6.0102).
If we use the confidence interval to perform our test, we see that
μ0 = 6.0 lies inside the interval (5.8698, 6.0102),
so we will not reject H0 at α=0.05.
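A quick numerical check of the interval, again taking t19,0.025 = 2.093 from the text's tables:

```python
from math import sqrt

# 95% CI for mu in the coffee-machine example.
xbar, s, n, t_crit = 5.94, 0.1501, 20, 2.093  # t_crit from the tables
half_width = t_crit * s / sqrt(n)
lo, hi = xbar - half_width, xbar + half_width
print(round(lo, 4), round(hi, 4), lo < 6.0 < hi)  # (5.8698, 6.0102), True
```

Because μ0 = 6.0 lies inside the interval, the two-sided test does not reject H0 at the 5% level, illustrating the duality described in this section.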
19.7 Distribution of the variance
Thus far we have considered hypothesis testing for the mean but we can also perform hypothesis tests for the variance of a normal distribution. However, first we need to consider the distribution of the sample variance.
Suppose that Z1,Z2,…,Zn∼N(0,1) are independent. Then we have shown in Section 14.2 that
Z1² + Z2² + ⋯ + Zn² ∼ χ2n.
This can be extended to show that
(n−1)s²/σ² ∼ χ2n−1.
Note that the degrees of freedom of the χ2 distribution is n−1: the number of observations n minus 1 for the estimation of μ by ˉX.
It follows that we can test hypotheses about σ² using (n−1)s²/σ0² as the test statistic, which under H0: σ²=σ0² follows a χ2n−1 distribution.
19.8 Other types of tests
Test 7: H0:σ21=σ22 vs. H1:σ21≠σ22.
Let X1,X2,…,Xm∼N(μ1,σ21) and Y1,Y2,…,Yn∼N(μ2,σ22) be two independent random samples from normal populations.
The test statistic is F = s1²/s2², where s1² and s2² are the two sample variances. Under H0, F ∼ Fm−1,n−1, the F distribution with m−1 and n−1 degrees of freedom. The decision rule is to reject H0 if F < Fm−1,n−1,1−α/2 or F > Fm−1,n−1,α/2, where these critical values can be found in R using qf(alpha/2,m-1,n-1)
and qf(1-alpha/2,m-1,n-1)
, respectively. Alternatively, Statistical Tables can be used. For the latter you may need to use the identity
Fa,b,1−α = 1/Fb,a,α
to obtain the required values from the table.
Test 8: H0:μ1=μ2 vs. H1:μ1≠μ2; σ2 unknown.
Assume X1,X2,…,Xm∼N(μ1,σ2) and Y1,Y2,…,Yn∼N(μ2,σ2) are two independent random samples with unknown but equal variance σ2.
Note that
- (ˉX−ˉY) ∼ N(μ1−μ2, σ²(1/m+1/n)), which implies
((ˉX−ˉY)−(μ1−μ2))/√(σ²(1/m+1/n)) ∼ N(0,1);
- (m+n−2)s2p/σ² ∼ χ2m+n−2;
- s2p is independent of ˉX−ˉY;
where s2p = ((m−1)s2X+(n−1)s2Y)/(m+n−2) is the pooled sample variance. It follows that, under H0,
T = (ˉX−ˉY)/(sp√(1/m+1/n)) ∼ tm+n−2,
so the decision rule is to reject H0 if |T| > tm+n−2,α/2.
Blood bank.
Suppose that one wants to test whether the time it takes to get from a blood bank to a hospital via two different routes is the same on average. Independent random samples are selected from each of the different routes and we obtain the following information:
| Route | Sample size | Sample mean | Sample variance |
|---|---|---|---|
| X | m=10 | ˉx=34 | s2X=17.111 |
| Y | n=12 | ˉy=30 | s2Y=9.454 |

Figure 19.1: Routes from blood bank to hospital.
Test H0:μX=μY vs. H1:μX≠μY at significance level α=0.05, where μX and μY denote the mean travel times on routes X and Y, respectively.
Attempt Example 19.8.1: Blood bank and then watch Video 29 for the solutions.
Video 29: Blood bank
Solution to Example 19.8.1: Blood bank
First we test H0:σ2X=σ2Y vs. H1:σ2X≠σ2Y at α=0.05. Compute
- F = s2X/s2Y = 17.111/9.454 = 1.81;
- F9,11,0.975 = 1/F11,9,0.025 = 1/3.915 = 0.256;
- F9,11,0.025 = 3.588.
Hence F9,11,0.975 < F < F9,11,0.025, so we do not reject H0 at α=0.05. Therefore we can assume the variances of the two populations are the same.
Now we test H0:μX=μY vs. H1:μX≠μY at significance level α=0.05.
The decision rule is to reject H0 if
|T| = |ˉx−ˉy|/(sp√(1/m+1/n)) > tm+n−2,α/2 = t20,0.025 = 2.086.
The pooled sample variance is s2p = (9 × 17.111 + 11 × 9.454)/20 = 12.8997, so sp = 3.5916. Hence
|T| = |34 − 30|/(3.5916 × √(1/10 + 1/12)) = 2.601 > 2.086,
so we reject H0 at α=0.05 and conclude that the mean travel times on the two routes differ.
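Both stages of the blood bank solution can be verified numerically. In this sketch the critical values (0.256, 3.588 for the F-test and t20,0.025 = 2.086 for the t-test) are taken from the tables, since Python's standard library has no F or t quantile functions:

```python
from math import sqrt

# Example 19.8.1 (blood bank): summary statistics from the text.
m, xbar, s2x = 10, 34.0, 17.111
n, ybar, s2y = 12, 30.0, 9.454

# Step 1: F-test for equal variances.
F = s2x / s2y                # about 1.81
assert 0.256 < F < 3.588     # F_{9,11,0.975} < F < F_{9,11,0.025}: do not reject

# Step 2: pooled two-sample t-test, assuming equal variances.
s2p = ((m - 1) * s2x + (n - 1) * s2y) / (m + n - 2)   # pooled variance
t = (xbar - ybar) / (sqrt(s2p) * sqrt(1 / m + 1 / n))
print(round(s2p, 4), round(t, 3), abs(t) > 2.086)
```

The F statistic falls between the two critical values, so the equal-variance assumption stands; the t statistic of about 2.601 exceeds 2.086, so the mean travel times differ at the 5% level.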
Test 9: H0:μ1=μ2 vs. H1:μ1≠μ2; non-independent samples.
Suppose that we have two groups of observations X1,X2,…,Xn and Y1,Y2,…,Yn where there is an obvious pairing between the observations. For example consider before and after studies or comparing different measuring devices. This means the samples are no longer independent.
An equivalent hypothesis test to the one stated is H0:μd=μ1−μ2=0 vs. H1:μd=μ1−μ2≠0. With this in mind define Di=Xi−Yi for i=1,…,n, and assume D1,D2,…,Dn∼N(μd,σ2d) and are i.i.d.
The decision rule is to reject H0 if
|T| = |ˉD|/(sD/√n) > tn−1,α/2.
Drug Trial.
In a medical study of patients given a drug and a placebo, sixteen patients were paired up, with the members of each pair having a similar age and being the same sex. One of each pair received the drug and the other received the placebo. The response score for each patient was found.
| Pair Number | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| Given Drug | 0.16 | 0.97 | 1.57 | 0.55 | 0.62 | 1.12 | 0.68 | 1.69 |
| Given Placebo | 0.11 | 0.13 | 0.77 | 1.19 | 0.46 | 0.41 | 0.40 | 1.28 |
Are the responses for the drug and placebo significantly different?
This is a “matched-pair” problem, since we expect a relation between the values of each pair. The difference within each pair is
| Pair Number | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| Di = drugi − placeboi | 0.05 | 0.84 | 0.80 | −0.64 | 0.16 | 0.71 | 0.28 | 0.41 |
We consider the Di’s to be a random sample from N(μD,σ2D). We can calculate that ˉD=0.326, s2D=0.24 so sD=0.49.
To test H0:μD=0 vs H1:μD≠0, the decision rule is to reject H0 if
|T| = |ˉD|/(sD/√8) > t7,α/2.
Here |T| = 0.326/(0.49/√8) = 1.882.
Now t7,0.05=1.895, so we would not reject H0 at the 10% level (just).
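A numerical check of the matched-pair calculation, with the critical value t7,0.05 = 1.895 taken from the text's tables:

```python
from math import sqrt
from statistics import mean, stdev

# Drug trial: differences drug - placebo within each pair.
d = [0.05, 0.84, 0.80, -0.64, 0.16, 0.71, 0.28, 0.41]
dbar, sd, n = mean(d), stdev(d), len(d)   # 0.326, 0.49, 8

t = dbar / (sd / sqrt(n))
t_crit = 1.895  # t_{7,0.05} from the tables (10% two-sided test)
print(round(t, 3), abs(t) > t_crit)
```

The statistic of about 1.882 falls just below 1.895, matching the "not rejected (just)" conclusion at the 10% level.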
19.9 Sample size calculation
We have noted that for a given sample x1,x2,…,xn, if we decrease the Type I error α then we increase the Type II error β, and vice versa.
To control both the Type I and Type II errors, that is, to ensure that α and β are both sufficiently small, we need to choose an appropriate sample size n.
Sample size calculations are appropriate when we have two simple hypotheses to compare. For example, we have a random variable X with unknown mean μ=E[X] and known variance σ2=Var(X). We compare the hypotheses:
- H0:μ=μ0,
- H1:μ=μ1.
Without loss of generality we will assume that μ0<μ1.
Suppose that x1,x2,…,xn represent i.i.d. samples from X. Then, by the Central Limit Theorem, ˉX is approximately N(μ, σ²/n) distributed, and the size-α test rejects H0 if
ˉX > μ0 + zα ⋅ σ/√n.
Note that as n increases, the cut-off for rejecting H0 decreases towards μ0.
We now consider the choice of n to ensure that the Type II error is at most β, or equivalently, that the power of the test is at least 1−β.
The power of the test is
1 − β = P(reject H0 | μ = μ1) = P(ˉX > μ0 + zα ⋅ σ/√n | μ = μ1).
Lemma 19.9.1 (Sample size calculation) gives the smallest sample size n to bound the Type I and Type II errors by α and β in the case where the variance σ² is known.
Sample size calculation.
Suppose that X is a random variable with unknown mean μ and known variance σ2.
The required sample size, n, to ensure significance level α and power 1−β for comparing hypotheses:
- H0:μ=μ0
- H1:μ=μ1
is: n = ((σ/(μ1−μ0)) ⋅ (zα − z1−β))².
The details of the proof of Lemma 19.9.1 (Sample size calculation) are provided but can be omitted.
Proof of Sample Size calculations.
Note:
- We need larger n as σ increases. (More variability in the observations.)
- We need larger n as μ1−μ0 gets closer to 0. (Harder to detect a small difference in mean.)
- We have that α,β<0.5, so zα>0 and z1−β<0. Hence, zα−z1−β becomes larger as α and β decrease. (Smaller errors require larger n.)
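Lemma 19.9.1 can be turned into a short function. This is a Python sketch (the chapter's R `qnorm(1-alpha)` corresponds to `NormalDist().inv_cdf(1 - alpha)`); the numerical example at the end is illustrative, not from the text:

```python
from math import ceil
from statistics import NormalDist

def sample_size(mu0, mu1, sigma, alpha, beta):
    """Smallest n giving size alpha and power 1 - beta (Lemma 19.9.1)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)   # upper-alpha point, positive
    z_1mbeta = NormalDist().inv_cdf(beta)       # z_{1-beta}, negative
    n = (sigma / (mu1 - mu0) * (z_alpha - z_1mbeta)) ** 2
    return ceil(n)  # round up to the next whole observation

# Illustrative: detect a shift from mu0 = 0 to mu1 = 0.5 when sigma = 1,
# with alpha = 0.05 and power 1 - beta = 0.8.
print(sample_size(0.0, 0.5, 1.0, 0.05, 0.2))  # 25
```

Doubling σ or halving μ1 − μ0 quadruples the required n, reflecting the remarks above.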
The following R Shiny App lets you explore the effect of μ1−μ0, σ and α on the sample size n or power 1−β.
R Shiny app: Sample size calculation app
Task: Session 10
Attempt the R Markdown file for Session 10:
Session 10: Confidence intervals and hypothesis testing
Student Exercises
Attempt the exercises below.
Note that throughout the exercises, for a random variable X and 0<β<1, cβ satisfies P(X>cβ)=β.
Eleven bags of sugar, each nominally containing 1 kg, were randomly selected from a large batch. The weights of sugar were:
You may assume these values are from a normal distribution.
- Calculate a 95% confidence interval for the mean weight for the batch.
- Test the hypothesis H0:μ=1 vs H1:μ≠1. Give your answer in terms of a p-value.
Note that P(t10 > 4.1437) = 0.001 and P(t10 > 4.5869) = 0.0005.
Solution to Exercise 19.1.
The sample mean is ˉx = 1.04 and the sample variance is s² = 0.00092. Hence, the sample standard deviation is s = √0.00092 = 0.03033.
- The 95% confidence interval for the mean is given by ˉx±tn−1,0.025s/√n. Now t10,0.025=2.2281. Hence the confidence interval is
1.04±2.2281(0.03033√11)=1.04±0.0204=(1.0196,1.0604) - The population variance is unknown so we apply a t test with test statistic
t = (ˉx − μ0)/(s/√n) = (1.04 − 1)/(0.03033/√11) = 4.3741 and n−1 = 10 degrees of freedom. The p-value is P(|t10| > 4.3741) = 2 P(t10 > 4.3741). From the critical values given, P(t10 > 4.1437) = 0.001 and P(t10 > 4.5869) = 0.0005, so 0.0005 < P(t10 > 4.3741) < 0.001. Hence 0.001 < p < 0.002. Therefore, there is strong evidence that μ ≠ 1.
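The solution can be checked numerically from the quoted summary statistics, with t10,0.025 = 2.2281 taken from the tables:

```python
from math import sqrt

# Exercise 19.1: summary statistics from the solution.
xbar, s, n, mu0 = 1.04, 0.03033, 11, 1.0
t_crit = 2.2281  # t_{10,0.025} from the tables

half = t_crit * s / sqrt(n)
ci = (xbar - half, xbar + half)            # 95% CI for mu
t = (xbar - mu0) / (s / sqrt(n))           # t statistic for H0: mu = 1
print(tuple(round(v, 4) for v in ci), round(t, 4))
```

The interval matches (1.0196, 1.0604) and the statistic matches 4.3741 from the printed solution.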
Random samples of 13 and 11 chicks, respectively, were given from birth a protein supplement, either oil meal or meat meal. The weights of the chicks when six weeks old are recorded and the following sample statistics obtained:
- Carry out an F-test to examine whether or not the groups have significantly different variances or not.
- Calculate a 95% confidence interval for the difference between weights of 6-week-old chicks on the two diet supplements.
- Do you consider that the supplements have a significantly different effect? Justify your answer.
Note that F12,10,0.025 = 3.6209 and F10,12,0.025 = 3.3736.
Solution to Exercise 19.2.
We regard the data as being from two independent normal distributions with unknown variances.
- F-test: H0:σ21=σ22 vs. H1:σ21≠σ22.
We reject H0 if
F = s1²/s2² > Fn1−1,n2−1,α/2 or F = s1²/s2² < Fn1−1,n2−1,1−α/2 = 1/Fn2−1,n1−1,α/2. Now, F12,10,0.025 = 3.6209 and F12,10,0.975 = 1/F10,12,0.025 = 1/3.3736 = 0.2964. From the data F = s1²/s2² = 0.7158, so we do not reject H0. There is no evidence against equal population variances.
- Assume σ21=σ22=σ2 (unknown). The pooled estimate of the common variance σ2 is
s2p = ((n1−1)s1² + (n2−1)s2²)/(n1+n2−2) = 3453.75, so sp = 58.77. The 95% confidence limits for μ1−μ2 are
ˉx1 − ˉx2 ± t22,0.025 ⋅ sp√(1/n1 + 1/n2) = (247.9 − 275.5) ± (2.0739 × 0.4097 × 58.77) = −27.6 ± 49.9355.
So the interval is (−77.5355, 22.3355).
- Since the confidence interval in (b) includes zero (if μ1−μ2 = 0 then μ1 = μ2), we conclude that the diet supplements do not have a significantly different effect (at the 5% level).
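A numerical check of part (b) from the quoted summary statistics, with t22,0.025 = 2.0739 taken from the tables:

```python
from math import sqrt

# Exercise 19.2(b): values from the printed solution.
n1, n2 = 13, 11
xbar1, xbar2 = 247.9, 275.5
s2p = 3453.75                 # pooled variance from the solution
t_crit = 2.0739               # t_{22,0.025} from the tables

sp = sqrt(s2p)                # about 58.77
half = t_crit * sp * sqrt(1 / n1 + 1 / n2)
ci = (xbar1 - xbar2 - half, xbar1 - xbar2 + half)
print(round(sp, 2), tuple(round(v, 2) for v in ci))
```

The interval agrees with (−77.5355, 22.3355) up to the rounding used in the solution, and it contains zero, so the supplements are not significantly different at the 5% level.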
A random sample of 12 car drivers took part in an experiment to find out if alcohol increases the average reaction time. Each driver’s reaction time was measured in a laboratory before and after drinking a specified amount of alcoholic beverage. The reaction times were as follows:
Let μB and μA be the population mean reaction time, before and after drinking alcohol.
- Test H0:μB=μA vs. H1:μB≠μA assuming the two samples are independent.
- Test H0:μB=μA vs. H1:μB≠μA assuming the two samples contain `matched pairs’.
- Which of the tests in (a) and (b) is more appropriate for these data, and why?
Note that
and the critical values for t22 are given above in Exercise 19.2.
Solution to Exercise 19.3.
- The summary statistics of the reaction times before alcohol are ˉx=0.7275 and s2x=0.0103. Similarly the summary statistics after alcohol are ˉy=0.775 and s2y=0.0088. Assuming both samples are from normal distributions with the same variance, the pooled variance estimator is
s2p = ((n−1)s2x + (n−1)s2y)/(2(n−1)) = 0.0096. The null hypothesis is rejected at α=0.05 if
t = |(ˉx − ˉy)/(sp√(1/n + 1/n))| > t22,0.025 = 2.0739. From the data,
t = |(0.7275 − 0.775)/(0.098√(2/12))| = 1.1873. Hence, the null hypothesis is not rejected. There is no significant difference between the reaction times.
- The difference in reaction time for each driver is
after−before = (0.05, −0.02, 0.10, 0.07, 0.05, 0.15, 0.04, 0.00, 0.09, 0.06, 0.00, −0.02). The sample mean and standard deviation of the differences are ˉd = 0.0475 and sd = 0.0517. Assuming the differences are samples from a normal distribution, the null hypothesis is rejected at α=0.05 if
t = |ˉd/(sd/√12)| > t11,0.025 = 2.201. From the data,
t = |0.0475/(0.0517/√12)| = 3.1827. Hence, the null hypothesis is rejected. There is a significant difference between the reaction times.
- The matched pair test in (b) is more appropriate. By recording each driver’s reaction time before and after, and looking at the difference for each driver we are removing the driver effect. The driver effect says that some people are naturally slow both before and after alcohol, others are naturally quick. By working with the difference we have removed this factor.
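The matched-pair test in part (b) can be verified from the differences listed in the solution, with t11,0.025 = 2.201 taken from the tables:

```python
from math import sqrt
from statistics import mean, stdev

# Exercise 19.3(b): after - before differences from the solution.
d = [0.05, -0.02, 0.10, 0.07, 0.05, 0.15, 0.04, 0.00,
     0.09, 0.06, 0.00, -0.02]
n = len(d)                                  # 12 drivers
dbar, sd = mean(d), stdev(d)                # about 0.0475 and 0.0517

t = dbar / (sd / sqrt(n))
t_crit = 2.201  # t_{11,0.025} from the tables
print(round(t, 3), abs(t) > t_crit)
```

The statistic of about 3.18 exceeds 2.201, confirming the rejection of H0 in the paired analysis even though the unpaired test in part (a) did not reject.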