33 Tests for the mean difference (paired data)
You have learnt to ask a RQ, design a study, classify and summarise the data, construct confidence intervals, and perform some hypothesis tests. In this chapter, you will learn to:
- identify situations where conducting a test for a proportion is appropriate.
- conduct hypothesis tests for the mean difference with paired data.
- determine whether the conditions for using these methods apply in a given situation.
33.1 Introduction: soil nitrogen
Lambie, Mudge, and Stevenson (2021) compared the percentage nitrogen (%N) in soils from intensively-grazed irrigated and non-irrigated pastures. The researchers paired similar irrigated and non-irrigated sites (p. 338):
The irrigated and non-irrigated pairs within each site were within \(100\)of each other and were on the same soil, landform and usually the same farm with the same farm management on both treatments.
Pairing (Sect. 26.1) is a form of blocking (Sect. 7.2) and is used to manage confounding. The comparison is within individuals. One RQ in the study was:
For intensively grazed pastures sites, is there a mean reduction in percentage soil nitrogen (%N) when sites are irrigated, compared to non-irrigated?
The data are shown in the table below. The parameter is \(\mu_d\), the population mean reduction in %N from non-irrigated to irrigated sites; that is, non-irrigated %N minus irrigated %N.
Explaining how the differences are computed is important. The differences here are the %N in non-irrigated sites minus %N in irrigated sites.
The differences could be computed as the %N in irrigated sites minus %N in non-irrigated sites. Either is fine, as long as you remain consistent throughout. The meaning of any conclusions will be the same.
Irrigated | Not irrigated | Reduction |
---|---|---|
\(\phantom{0}0.35\) | \(\phantom{0}0.38\) | \(\phantom{-}0.03\) |
\(\phantom{0}0.42\) | \(\phantom{0}0.43\) | \(\phantom{-}0.01\) |
\(\phantom{0}0.27\) | \(\phantom{0}0.23\) | \(-0.04\) |
\(\phantom{0}0.18\) | \(\phantom{0}0.24\) | \(\phantom{-}0.06\) |
\(\phantom{0}0.56\) | \(\phantom{0}0.58\) | \(\phantom{-}0.02\) |
\(\phantom{0}0.34\) | \(\phantom{0}0.26\) | \(-0.08\) |
\(\phantom{0}0.26\) | \(\phantom{0}0.25\) | \(-0.01\) |
\(\phantom{0}0.58\) | \(\phantom{0}0.44\) | \(-0.14\) |
\(\phantom{0}0.50\) | \(\phantom{0}0.49\) | \(-0.01\) |
\(\phantom{0}0.47\) | \(\phantom{0}0.55\) | \(\phantom{-}0.08\) |
\(\phantom{0}0.55\) | \(\phantom{0}0.55\) | \(\phantom{-}0.00\) |
\(\phantom{0}0.41\) | \(\phantom{0}0.45\) | \(\phantom{-}0.04\) |
\(\phantom{0}0.51\) | \(\phantom{0}0.54\) | \(\phantom{-}0.03\) |
\(\phantom{0}0.47\) | \(\phantom{0}0.56\) | \(\phantom{-}0.09\) |
\(\phantom{0}0.27\) | \(\phantom{0}0.33\) | \(\phantom{-}0.06\) |
\(\phantom{0}0.29\) | \(\phantom{0}0.31\) | \(\phantom{-}0.02\) |
\(\phantom{0}0.40\) | \(\phantom{0}0.43\) | \(\phantom{-}0.03\) |
\(\phantom{0}0.26\) | \(\phantom{0}0.26\) | \(\phantom{-}0.00\) |
\(\phantom{0}0.52\) | \(\phantom{0}0.53\) | \(\phantom{-}0.01\) |
\(\phantom{0}0.30\) | \(\phantom{0}0.41\) | \(\phantom{-}0.11\) |
\(\phantom{0}0.20\) | \(\phantom{0}0.32\) | \(\phantom{-}0.12\) |
\(\phantom{0}0.30\) | \(\phantom{0}0.30\) | \(\phantom{-}0.00\) |
\(\phantom{0}0.24\) | \(\phantom{0}0.26\) | \(\phantom{-}0.02\) |
\(\phantom{0}0.49\) | \(\phantom{0}0.67\) | \(\phantom{-}0.18\) |
\(\phantom{0}0.27\) | \(\phantom{0}0.29\) | \(\phantom{-}0.02\) |
\(\phantom{0}0.44\) | \(\phantom{0}0.47\) | \(\phantom{-}0.03\) |
\(\phantom{0}0.27\) | \(\phantom{0}0.28\) | \(\phantom{-}0.01\) |
\(\phantom{0}0.40\) | \(\phantom{0}0.50\) | \(\phantom{-}0.10\) |
The data are summarised graphically (Fig. 33.1) and numerically (Table 33.2), using information provided by software (Fig. 33.2).
Mean | Standard deviation | Standard error | Sample size | |
---|---|---|---|---|
Irrigated sites | \(0.376\) | \(0.1186\) | \(0.0224\) | \(28\) |
Non-irrigated sites | \(0.404\) | \(0.1266\) | \(0.0239\) | \(28\) |
Change | \(0.028\) | \(0.0618\) | \(0.0117\) | \(28\) |
33.2 Statistical hypotheses and notation
The RQ asks if the mean %N reduces in the population when sites are irrigated. If the differences are defined as the %N for non-irrigated sites minus irrigated sites, then positive values refer to non-irrigated sites having larger %N than irrigated sites, on average. Then, the RQ is asking if the mean difference is zero, or if it is greater than zero. The parameter is the population mean difference. The notation used (recapping Sect. 26.4) is:
- \(\mu_d\): the mean difference in the population (in %N).
- \(\bar{d}\): the mean difference in the sample (in %N).
- \(s_d\): the sample standard deviation of the differences (in %N).
- \(n\): the number of differences.
The null hypothesis is that 'there is no mean change in %N, in the population' (Sect. 32.2):
- \(H_0\): \(\mu_d = 0\).
This hypothesis, which we initially assume to be true, postulates that the mean reduction may not be zero in the sample, due to sampling variation.
Since the RQ asks specifically if mean %N decreases, the alternative hypothesis is one-tailed (Sect. 32.2). According to how the differences have been defined, the alternative hypothesis is:
- \(H_1\): \(\mu_d > 0\) (i.e., one-tailed).
This hypothesis says that the mean change in the population is greater than zero, because of the wording of the RQ, and because of how the differences were defined. (If the differences were defined in the opposite way---as '%N in irrigated sites minus non-irrigated sites'--- then the alternative hypothesis would be \(\mu_d < 0\), which has the same meaning.)
33.3 Tests for the mean difference: \(t\)-test
The sample mean %N reduction varies depending on which one of the countless possible samples is randomly obtained, even if the mean reduction in the population is zero. That is, the value of \(\bar{d}\) varies across all possible samples even if \(\mu_d = 0\). The value of \(\bar{d}\) has a sampling distribution.
Definition 33.1 (Sampling distribution of a sample mean difference) The sampling distribution of the sample mean difference is (when certain conditions are met; Sect. 33.5) described by
- an approximate normal distribution,
- centred around the sampling mean, whose value is \(\mu_d\) (from \(H_0\)),
- with a standard deviation (called the standard error of \(\bar{d}\)) of \[\begin{equation} \text{s.e.}(\bar{d}) = \frac{s_d}{\sqrt{n}}, \tag{33.1} \end{equation}\] where \(n\) is the number of differences, and \(s_d\) is the standard deviation of the sample differences.
The value of the standard error of the differences here is \[ \text{s.e.}(\bar{d}) = \frac{s_d}{\sqrt{n}} = \frac{0.0618}{\sqrt{28}} = 0.0117, \] as shown in Fig. 33.2 (top table). This sampling distribution of \(\bar{d}\) is shown in Fig. 33.3.
The sample mean difference can be located on the sampling distribution (Fig. 33.3) by computing the \(t\)-score: \[ t = \frac{\bar{d} - \mu_{d}}{\text{s.e.}(\bar{d})} = \frac{0.0282 - 0}{0.0117} = 2.41, \] following the ideas in Eq. (31.2). Software displays the same \(t\)-score (Fig. 33.2, top table).
A \(P\)-value determines if the sample data are consistent with the assumption (Table 32.1). Since \(t = 2.41\), and since \(t\)-scores are like \(z\)-scores, the one-tailed \(P\)-value is smaller than \(0.025\) (based on the \(68\)--\(95\)--\(99.7\) rule). Software (Fig. 33.2) reports that the two-tailed \(P\)-value is \(0.02279\). Hence, the one-tailed \(P\)-value is \(0.02279/2 = 0.0114\) (indeed smaller than \(0.025\)).
The software clarifies how the differences have been computed.
At the left of the output (Fig. 33.2), the order implies the differences are found as NonirrigatedN
minus IrrigatedN
, the same as our definition.
33.4 Writing conclusions
The one-tailed \(P\)-value is \(0.0114\), suggesting moderate evidence (Table 32.1) to support \(H_1\). (That is, there is compelling evidence to support the alternative hypothesis.) A conclusion requires an answer to the RQ, a summary of the evidence leading to that conclusion, and some summary statistics:
Moderate evidence exists in the sample (paired \(t = 2.41\); one-tailed \(P = 0.0114\)) of a mean reduction in percentage soil N from non-irrigated to irrigated sites (mean reduction: \(0.0282\)%; \(n = 28\); \(95\)% CI from \(0.0042\)% to \(0.0522\)%).
The CI is found using the process described in Chap. 26.1. The wording implies the direction of the differences.
Saying 'there is evidence of a difference' is insufficient. You must state which measurement is, on average, higher (that is, what the differences mean).
33.5 Statistical validity conditions
As with any hypothesis test, these results apply under certain conditions. For a hypothesis test for the mean of paired data, these conditions are the same as for the CI for the mean difference for paired data (Sect. 26.5), and similar to those for one sample mean.
Statistical validity can be assessed using these criteria:
- When \(n \ge 25\), the test is statistically valid. (If the distribution of the differences is highly skewed, the sample size may need to be larger.)
- When \(n < 25\), the test is statistically valid only if the data come from a population of differences with a normal distribution.
The sample size of \(25\) is a rough figure; some books give other values (such as \(30\)).
This condition ensures that the distribution of the sample mean differences has an approximate normal distribution (so that, for example, the \(68\)--\(95\)--\(99.7\) rule can be used). Provided \(n \ge 25\), this will be approximately true even if the distribution of the differences in the population does not have a normal distribution.
The units of analysis are also assumed to be independent (e.g., from a simple random sample).
If the statistical validity conditions are not met, other similar options include using a sign test of the differences or a Wilcoxon signed-rank test of the differences (Conover 2003), or using resampling methods (Efron and Hastie 2021). For more than two within-individuals comparisons, repeated-measures analysis of variance (ANOVA) can be used. For paired qualitative data, McNemar's test may be suitable.
Example 33.1 (Statistical validity) For the %N data, the sample size is \(n = 28\), so the test is statistically valid.
33.6 Example: invasive plants
[Dataset: Flowering
]
(This study was seen in Sect. 26.6.)
Skypilot is a alpine wildflower native to the Colorado Rocky Mountains (USA).
In recent years, a willow shrub (Salix) has been encroaching on skypilot territory and, because willow often flowers early, Kettenbach et al. (2017) studied whether the willow may 'negatively affect pollination regimes of resident alpine wildflower species' (p. 6 965).
Data for both species was collected at \(n = 25\) different sites, so the data are paired by site (Sect. 26.1). The data are shown in Sect. 26.6. The parameter is \(\mu_d\), the population mean difference in day of first flowering for skypilot, less the day of first flowering for willow. A positive value for the difference means that the skypilot values are larger, and hence that willow flowered first. The RQ is:
In the Colorado Rocky Mountains, is there a mean difference between first-flowering day for the native skypilot and encroaching willow?
The hypotheses are \[ \text{$H_0$: $\mu_d = 0$}\quad\text{and}\quad\text{$H_1$: $\mu_d\ne 0$}, \] where the alternative hypothesis is two-tailed, and \(\mu\) is the mean difference between first-flowering day for the native skypilot and encroaching willow.
Explaining how the differences are computed is important. The differences here are skypilot minus willow first-flowering days.
However, the differences could be computed as willow minus skypilot first-flowering days. Either is fine, as long as you remain consistent. The meaning of any conclusions will be the same.
The data are summarised graphically in Fig. 14.4. The numerical summary (Table 33.3) and software output (Fig. 33.4) are repeated here.
Mean | Standard deviation | Standard error | Sample size | |
---|---|---|---|---|
Willow (encroaching) | \(189.40\) | \(12.200\) | \(2.440\) | \(25\) |
Skypilot (native) | \(190.76\) | \(13.062\) | \(2.612\) | \(25\) |
Differences | \(\phantom{0}\phantom{0}1.36\) | \(\phantom{0}4.698\) | \(0.940\) | \(25\) |
The standard error of the mean difference is \(\text{s.e.}(\bar{d}) = 0.9396\) (Fig. 14.3 or Table 14.4). The value of the test statistic (i.e., the \(t\)-score) is \[\begin{align*} t = \frac{\bar{d} - \mu_d}{\text{s.e.}(\bar{d})} = \frac{1.36 - 0}{0.9396} = 1.45, \end{align*}\] as in the output. This is a relatively small value of \(t\), so a large \(P\)-value is expected using the \(68\)--\(95\)--\(99.7\) rule. Indeed, the output shows that \(P = 0.161\): there is no evidence of a mean difference in first-flowering day (i.e., the sample mean difference could reasonably be explained by sampling variation if \(\mu_d = 0\)). We write:
No evidence exists (\(t = 1.45\); two-tailed \(P = 0.161\)) that the day of first-flowering is different for the encroaching willow and the native skypilot (mean difference: \(1.36\) days earlier for willow; \(95\)% CI from \(-0.57\) to \(3.30\); \(n = 25\)).
The test is statistically valid since \(n = 25\).
We do not say whether the evidence supports the null hypothesis. We assume the null hypothesis is true, so we state how strong the evidence is to support the alternative hypothesis. The current sample presents no evidence to contradict the assumption (but future evidence may emerge).
Be clear in your conclusion about how the differences are computed. Make sure to interpret the CI consistent with how the differences were defined.
33.7 Example: chamomile tea
(This study was seen in Sects. 26.7 and 27.8.) Rafraf, Zemestani, and Asghari-Jafarabadi (2015) studied patients with Type 2 diabetes mellitus (T2DM). They randomly allocated \(32\) patients into a control group (who drank hot water), and another \(32\) patients to receive chamomile tea. Summary data are shown in Table 26.3.
In Sect. 26.7, a CI was formed for the mean reduction in total glucose (TG) for the hot water and the tea groups separately. However, we can also ask about whether the mean differences in TG are non-zero due to sampling variation or not. The following descriptive RQs can be asked:
- For patients with T2DM, is there a mean change in TG after eight weeks drinking chamomile tea?
- For patients with T2DM, is there a mean change in TG after eight weeks drinking hot water?
The hypotheses are (where \(\mu\) represent the mean change in TG (in mg.dl\(-1\)) after eight weeks):
- \(H_0\): \(\mu_d = 0\)\(H_1\): \(\mu_d \ne 0\).
- \(H_0\): \(\mu_d = 0\)\(H_1\): \(\mu_d \ne 0\).
For the two groups, the test statistics are (using the standard errors in Sect. 26.7): \[ t_T = \frac{38.62 - 0}{5.37} = 7.19\qquad\text{and}\qquad t_W = \frac{-7.12 - 0}{6.48} = -1.10, \] where the subscripts \(T\) and \(W\) refer to the tea and hot-water groups respectively. The \(t\)-score for the tea-drinking group is huge, so the two-tailed \(P\)-value will be very small using the \(68\)--\(95\)--\(99.7\) rule, and certainly smaller than \(0.001\). This means that there is evidence that chamomile tea had an impact on the mean change in TG.
In contrast, the \(t\)-score for the water-drinking group is small, so the two-tailed \(P\)-value will be large using the \(68\)--\(95\)--\(99.7\) rule, and certainly larger than \(0.10\). This means there is no evidence that placebo treatment (hot water) had any impact on mean change in TG (as one might expect for a placebo).
We can write:
There is very strong evidence (\(t = 7.19\); two-tailed \(P < 0.001\)) of a mean change in TG for the chamomile-drinking groups (mean reduction: \(38.62\).dl\(-1\); \(n = 32\); approx. \(95\)% CI: \(27.88\) to \(49.36\).dl\(-1\)), but no evidence (\(t = -1.10\); two-tailed \(P > 0.10\)) of a mean change in the hot-water drinking group (mean reduction: \(-7.12\).dl\(-1\); \(n = 32\); approx. \(95\)% CI: \(-20.08\) to \(5.84\).dl\(-1\)).
The sample sizes are larger than \(25\), so the results are statistically valid.
The two groups appear to show different mean reductions in TG, depending on which group the subject is in. This may suggest that chamomile tea reduces TG compared to the control, but perhaps the difference in TG between the hot water and tea groups is simply due to sampling variation. This is considered in Sect. 34.6.
33.8 Chapter summary
To test a hypothesis about a population mean difference \(\mu_d\):
- Write the null hypothesis (\(H_0\)) and the alternative hypothesis (\(H_1\)).
- Initially assume the value of \(\mu_d\) in the null hypothesis to be true.
- Then, describe the sampling distribution, which describes what to expect from the sample mean difference based on this assumption: under certain statistical validity conditions, the sample mean difference varies with:
- an approximate normal distribution,
- with sampling mean whose value is the value of \(\mu_d\) (from \(H_0\)), and
- having a standard deviation of \(\displaystyle \text{s.e.}(\bar{d}) =\frac{s_d}{\sqrt{n}}\).
- Compute the value of the test statistic: \[ t = \frac{ \bar{d} - \mu}{\text{s.e.}(\bar{d})}, \] where \(\mu_d\) is the hypothesised value given in the null hypothesis.
- The \(t\)-value is like a \(z\)-score, and so an approximate \(P\)-value can be estimated using the \(68\)--\(95\)--\(99.7\) rule, or found using software.
- Make a decision, and write a conclusion.
- Check the statistical validity conditions.
The following short video may help explain some of these concepts:
33.9 Quick review questions
Bacho et al. (2019) compared joint pain in stroke patients before and after a supervised exercise treatment. The same participants (\(n = 34\)) were assessed before and after treatment.
The mean improvement in joint pain after \(13\) weeks was \(1.27\) (with a standard error of \(0.57\)) measured using a standardised tool.
- True or false: Only 'before and after' studies can be paired.
- True or false: The null hypothesis is about the population mean difference.
- What is the value of the test statistic (to two decimal places)?
- What is the approximate value of the two-tailed \(P\)-value?
- True or false: The 'test statistic' is a \(t\)-score for this test.
33.10 Exercises
Answers to odd-numbered exercises are available in App. E.
Exercise 33.1 A group of primary school children were asked to complete a certain task on both a personal computer (PC) and using a tablet computer.
If the differences were defined as the time to complete the task on the PC, minus the time to complete the same task on a tablet (one difference for each child), what do the difference mean?
Exercise 33.2 Suppose water quality was recorded \(500\) m upstream and \(500\)downstream of \(28\) different copper mines.
If the differences were defined as the pH downstream minus the water pH upstream for each river, what do the differences mean?
Exercise 33.3 [Dataset: Fruit
]
(These data were also seen in Exercise 26.3.)
Mukherjee, Deb, and Devy (2019) studied the effect of rainfall on growing Chayote squash (Sechium edule).
They compared the size of the fruit in a year with normal rainfall (2015) compared to a dry year (2014) on \(24\) farms:
For Chayote squash grown in Bangalore, what is the mean difference in fruit weight between a normal and dry year?
Ten fruits were gathered from each farm in both years, and the average (mean) weight of the fruit recorded for the farm. Since the same farms are used in both years, the data are paired (Exercise 26.3). Data is missing for Farm 20 in the dry year (2014), so there are \(n = 23\) differences.
- Construct a suitable graph to display the differences.
- What is the parameter? Carefully describe what it means.
- Write down the hypotheses.
- Sketch the sampling distribution.
- Compute the \(t\)-score.
- Determine the \(P\)-value.
- Write a conclusion.
- Is the test statistically valid?
Exercise 33.4 (This study was also seen in Exercise 26.4.) In a study of hypertension (Hand et al. 1996; MacGregor et al. 1979), patients were given a drug (Captopril) and their systolic blood pressure measured (in mm Hg) immediately before and two hours after being given the drug (data with Exercise 26.4).
The aim is to see if there is evidence of a reduction in blood pressure after taking Captopril. Use the data (Table 14.6) and the software output (Fig. 33.5) to answer these questions.
- Explain why it is probably more sensible to compute differences as the Before minus the After measurements. What do the differences mean when computed this way?
- What is the parameter? Carefully describe what it means.
- Construct a suitable graph to display the differences.
- Write down the hypotheses.
- Sketch the sampling distribution.
- Write down the \(t\)-score.
- Write down the \(P\)-value.
- Write a conclusion.
- Is the test statistically valid?
Exercise 33.5 (These data were also seen in Exercise 26.5.) People often struggle to eat the recommended intake of vegetables. Fritts et al. (2018) explored ways to increase vegetable intake in teens. Teens rated the taste of raw broccoli, and raw broccoli served with a specially-made dip.
Each teen (\(n = 100\)) had a pair of measurements: the taste rating of the broccoli with and without dip. Taste was assessed using a '\(100\)visual analog scale', where a higher score means a better taste. In summary:
- For raw broccoli, the mean taste rating was \(56.0\) (with a standard deviation of \(26.6\));
- For raw broccoli served with dip, the mean taste rating was \(61.2\) (with a standard deviation of \(28.7\)).
Because the data are paired, the differences are the best way to describe the data. The mean difference in the ratings was \(5.2\), with standard error of \(3.06\).
Perform a hypothesis test to see if the use of dip increases the mean taste rating.
Exercise 33.6 (This study was also seen in Exercise 26.6.) Allen et al. (2018) examined the effect of exercise on smoking. Men and women were assessed on their 'intention to smoke', both before and after exercise for each subject (using two quantitative questionnaires). Smokers ('smoking at least five cigarettes per day') aged \(18\) to \(40\) were enrolled for the study. For the \(23\) women in the study, the mean intention to smoke after exercise reduced by \(0.66\) (with a standard error of \(0.37\)).
Perform a hypothesis test to determine if there is evidence of a population mean reduction in intention-to-smoke for women after exercising.
Exercise 33.7 In a study (Cressie, Sheffield, and Whitford 1984) conducted at the Adelaide Children's Hospital (p. 107; emphasis added):
... a group of beta thalassemia patients [...] were treated by a continuous infusion of desferrioxamine, in order to reduce their ferritin content...
Using the data shown below, conduct a hypothesis test to determine if there is evidence that the treatment reduces the ferritin content, as intended.
September | March | Reduction |
---|---|---|
\(6630\) | \(5100\) | \(\phantom{-}1530\) |
\(4590\) | \(3510\) | \(\phantom{-}1080\) |
\(3510\) | \(6600\) | \(-3090\) |
\(6375\) | \(8000\) | \(-1625\) |
\(2500\) | \(2800\) | \(-300\) |
\(1400\) | \(2860\) | \(-1460\) |
\(4580\) | \(3640\) | \(\phantom{-}\phantom{0}940\) |
\(6885\) | \(9030\) | \(-2145\) |
\(4200\) | \(4420\) | \(-220\) |
\(5600\) | \(7910\) | \(-2310\) |
Exercise 33.8 (This study was also seen in Exercise 26.8.) The concentration of beta-endorphins in the blood is a sign of stress. Hoaglin, Mosteller, and Tukey (2011) measured the beta-endorphin concentration for \(19\) patients about to undergo surgery (Hand et al. 1996). Each patient had their beta-endorphin concentrations measured \(12\)--\(14\) hours before surgery, and also \(10\) minutes before surgery.
A numerical summary (Table 33.5) can be produced from jamovi output. Use the output to test the RQ.
Means | Std.deviation | Std.Error | Sample.size | |
---|---|---|---|---|
12--14 hours before surgery | 8.352632 | 4.396763 | 1.008687 | 19 |
10 minutes before surgery | 16.052632 | 12.508769 | 2.869708 | 19 |
Increase | 7.700000 | 13.519163 | 3.101509 | 19 |