34 Tests for the mean difference (paired data)

So far, you have learnt to ask a RQ, design a study, classify and summarise the data, form confidence intervals, and conduct hypothesis tests. In this chapter, you will learn to:

  • conduct hypothesis tests for the mean difference with paired data
  • determine whether the conditions for using these methods apply in a given situation.

34.1 Introduction: soil nitrogen

Lambie, Mudge, and Stevenson (2021) compared the percentage nitrogen (%N) in soils from irrigated and non-irrigated intensively-grazed pastures. The researchers paired irrigated and non-irrigated sites (p. 338):

The irrigated and non-irrigated pairs within each site were within \(100\) m of each other and were on the same soil, landform and usually the same farm with the same farm management on both treatments.

Pairing (Sect. 27.1) is a form of blocking (Sect. 7.2) and is used to manage confounding. One RQ in the study was:

For intensively grazed pasture sites, is there a mean reduction in percentage soil nitrogen (%N) when sites are irrigated, compared to non-irrigated?

The data are shown in the table below. The parameter is \(\mu_d\), which we define as the population mean reduction in %N when sites are irrigated, compared to non-irrigated.

Explaining how the differences are computed is important.

The differences here are the %N in non-irrigated sites minus %N in irrigated sites. However, the differences could be computed as the %N in irrigated sites minus %N in non-irrigated sites. Either is fine, as long as you remain consistent throughout. The meaning of any conclusions will be the same.

TABLE 27.1: The percentage total nitrogen in irrigated and non-irrigated soils in \(n = 28\) sites
Irrigated Not irrigated Reduction
\(\phantom{0}0.35\) \(\phantom{0}0.38\) \(\phantom{-}0.03\)
\(\phantom{0}0.42\) \(\phantom{0}0.43\) \(\phantom{-}0.01\)
\(\phantom{0}0.27\) \(\phantom{0}0.23\) \(-0.04\)
\(\phantom{0}0.18\) \(\phantom{0}0.24\) \(\phantom{-}0.06\)
\(\phantom{0}0.56\) \(\phantom{0}0.58\) \(\phantom{-}0.02\)
\(\phantom{0}0.34\) \(\phantom{0}0.26\) \(-0.08\)
\(\phantom{0}0.26\) \(\phantom{0}0.25\) \(-0.01\)
\(\phantom{0}0.58\) \(\phantom{0}0.44\) \(-0.14\)
\(\phantom{0}0.50\) \(\phantom{0}0.49\) \(-0.01\)
\(\phantom{0}0.47\) \(\phantom{0}0.55\) \(\phantom{-}0.08\)
\(\phantom{0}0.55\) \(\phantom{0}0.55\) \(\phantom{-}0.00\)
\(\phantom{0}0.41\) \(\phantom{0}0.45\) \(\phantom{-}0.04\)
\(\phantom{0}0.51\) \(\phantom{0}0.54\) \(\phantom{-}0.03\)
\(\phantom{0}0.47\) \(\phantom{0}0.56\) \(\phantom{-}0.09\)
\(\phantom{0}0.27\) \(\phantom{0}0.33\) \(\phantom{-}0.06\)
\(\phantom{0}0.29\) \(\phantom{0}0.31\) \(\phantom{-}0.02\)
\(\phantom{0}0.40\) \(\phantom{0}0.43\) \(\phantom{-}0.03\)
\(\phantom{0}0.26\) \(\phantom{0}0.26\) \(\phantom{-}0.00\)
\(\phantom{0}0.52\) \(\phantom{0}0.53\) \(\phantom{-}0.01\)
\(\phantom{0}0.30\) \(\phantom{0}0.41\) \(\phantom{-}0.11\)
\(\phantom{0}0.20\) \(\phantom{0}0.32\) \(\phantom{-}0.12\)
\(\phantom{0}0.30\) \(\phantom{0}0.30\) \(\phantom{-}0.00\)
\(\phantom{0}0.24\) \(\phantom{0}0.26\) \(\phantom{-}0.02\)
\(\phantom{0}0.49\) \(\phantom{0}0.67\) \(\phantom{-}0.18\)
\(\phantom{0}0.27\) \(\phantom{0}0.29\) \(\phantom{-}0.02\)
\(\phantom{0}0.44\) \(\phantom{0}0.47\) \(\phantom{-}0.03\)
\(\phantom{0}0.27\) \(\phantom{0}0.28\) \(\phantom{-}0.01\)
\(\phantom{0}0.40\) \(\phantom{0}0.50\) \(\phantom{-}0.10\)

Since the raw data are available, the data should be summarised graphically (Fig. 34.1) and numerically (Table 34.1), using information provided by software (Fig. 34.2).

TABLE 34.1: The soil nitrogen in irrigated and non-irrigated soils.
Mean Std. dev. Std. error Sample size
Irrigated sites \(0.3757\) \(0.1186\) \(0.0224\) \(28\)
Non-irrigated sites \(0.4039\) \(0.1266\) \(0.0239\) \(28\)
Change \(0.0282\) \(0.0618\) \(0.0117\) \(28\)
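
The numerical summary can also be reproduced directly from the raw data. The following is a minimal sketch in Python (not the jamovi workflow used in this chapter); the variable names and the helper `summarise()` are illustrative assumptions, and the data are transcribed from the table of \(n = 28\) paired sites above.

```python
from math import sqrt
from statistics import mean, stdev

# %N values transcribed from the table of n = 28 paired sites.
irrigated = [0.35, 0.42, 0.27, 0.18, 0.56, 0.34, 0.26, 0.58, 0.50, 0.47,
             0.55, 0.41, 0.51, 0.47, 0.27, 0.29, 0.40, 0.26, 0.52, 0.30,
             0.20, 0.30, 0.24, 0.49, 0.27, 0.44, 0.27, 0.40]
non_irrigated = [0.38, 0.43, 0.23, 0.24, 0.58, 0.26, 0.25, 0.44, 0.49, 0.55,
                 0.55, 0.45, 0.54, 0.56, 0.33, 0.31, 0.43, 0.26, 0.53, 0.41,
                 0.32, 0.30, 0.26, 0.67, 0.29, 0.47, 0.28, 0.50]

# Differences are non-irrigated minus irrigated, so a positive value
# is a reduction in %N when the site is irrigated.
reductions = [non - irr for irr, non in zip(irrigated, non_irrigated)]


def summarise(values):
    """Return (mean, standard deviation, standard error, sample size)."""
    n = len(values)
    s = stdev(values)  # sample standard deviation (divisor n - 1)
    return mean(values), s, s / sqrt(n), n


for label, values in [("Irrigated sites", irrigated),
                      ("Non-irrigated sites", non_irrigated),
                      ("Change", reductions)]:
    m, s, se, n = summarise(values)
    print(f"{label:<20} {m:.4f} {s:.4f} {se:.4f} {n}")
```

Only the 'Change' row is used in the hypothesis test: with paired data, the analysis is based on the differences, not on the two separate groups.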

FIGURE 34.1: The reduction in percentage N when sites are irrigated, compared to non-irrigated. Left: A histogram, Right: a case-profile plot (solid lines, solid dots for lower percentage N in irrigated sites).


FIGURE 34.2: jamovi output for the nitrogen data

34.2 Statistical hypotheses and notation

The RQ asks if the mean %N reduces in the population when sites are irrigated. If the difference is defined as the %N in non-irrigated sites minus irrigated sites, the RQ is asking if the mean difference is zero, or if it is greater than zero. The parameter is the population mean difference. The notation used (recapping Sect. 27.5) is:

  • \(\mu_d\): The mean difference in the population (in %).
  • \(\bar{d}\): The mean difference in the sample (in %).
  • \(s_d\): The sample standard deviation of the differences (in %).
  • \(n\): The number of differences.

The hypotheses, therefore, can be written in terms of the parameter \(\mu_d\). The null hypothesis is 'there is no mean change in %N, in the population' (Sect. 33.2):

  • \(H_0\): \(\mu_d = 0\).

This hypothesis, the initial assumption, postulates that the population mean reduction is zero, and that any non-zero mean reduction in the sample is due to sampling variation.

Since the RQ asks specifically if mean %N decreases, the alternative hypothesis is one-tailed (Sect. 33.2). According to how the differences have been defined, the alternative hypothesis is:

  • \(H_1\): \(\mu_d > 0\) (i.e., one-tailed).

This hypothesis says that the mean change in the population is greater than zero, because of the wording of the RQ, and because of how we defined the differences. (If the differences had been defined in the opposite way---as '%N in irrigated sites minus non-irrigated sites'--- then the alternative hypothesis would be \(\mu_d < 0\), which has the same meaning.)

34.3 Describing the sampling distribution

The sample mean %N reduction will vary depending on which one of the many possible samples is randomly obtained, even if the mean reduction in the population is zero. That is, the value of \(\bar{d}\) will vary across all possible samples even if \(\mu_d = 0\). The sample mean changes \(\bar{d}\) have a sampling distribution.

Definition 34.1 (Sampling distribution of a sample mean difference) The sampling distribution of the sample mean difference is (when certain conditions are met; Sect. 34.7) described by

  • an approximate normal distribution,
  • centred around the sampling mean difference, whose value is \(\mu_d\) (from \(H_0\)),
  • with a standard deviation (called the standard error of \(\bar{d}\)) of
    \[\begin{equation} \text{s.e.}(\bar{d}) = \frac{s_d}{\sqrt{n}}, \tag{34.1} \end{equation}\] where \(n\) is the number of differences, and \(s_d\) is the standard deviation of the differences. In general, the approximation gets better as the sample size gets larger.

The value of the standard error of the differences here is
\[ \text{s.e.}(\bar{d}) = s_d/\sqrt{n} = 0.0618/\sqrt{28} = 0.0117. \] This describes what can be expected from the possible values of \(\bar{d}\) (Fig. 34.3), just through sampling variation if \(\mu_d = 0\).


FIGURE 34.3: The sampling distribution is a normal distribution; it describes how the sample mean reduction in percentage N varies in samples of size \(n = 28\) when the population mean reduction is \(0\).
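
The idea of a sampling distribution can be illustrated by simulation. The sketch below is an illustration only, under stated assumptions: the population of differences is assumed to be normal with mean \(0\) (as in \(H_0\)) and with standard deviation equal to the sample value \(s_d = 0.0618\). Neither assumption comes from the study itself. Many samples of size \(n = 28\) are drawn, and the sample mean differences are then summarised.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(1)  # fixed seed so the sketch is reproducible

MU_D = 0.0         # population mean difference assumed in H0
SIGMA_D = 0.0618   # assumed population SD (borrowed from the sample s_d)
N = 28             # number of differences in each simulated sample
NUM_SAMPLES = 10_000

# Draw many samples and record the sample mean difference for each one.
sample_means = []
for _ in range(NUM_SAMPLES):
    sample = [random.gauss(MU_D, SIGMA_D) for _ in range(N)]
    sample_means.append(sum(sample) / N)

print(f"Mean of the sample means: {mean(sample_means):.4f}")
print(f"SD of the sample means:   {stdev(sample_means):.4f}")
print(f"s_d / sqrt(n):            {SIGMA_D / sqrt(N):.4f}")
```

The standard deviation of the simulated sample means should be close to the standard error from Eq. (34.1), which is what the standard error measures: how much \(\bar{d}\) varies from sample to sample.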

34.4 Computing the value of the test statistic

The sample mean difference can be located on the sampling distribution (Fig. 34.3) by computing the \(t\)-score:
\[ t = \frac{\bar{d} - \mu_{d}}{\text{s.e.}(\bar{d})} = \frac{0.0282 - 0}{0.0117} = 2.41, \] following the ideas in Eq. (32.2). Software also displays the \(t\)-score (Fig. 34.2). The \(t\)-score locates the observed sample statistic on the sampling distribution (Fig. 34.3).
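
The same arithmetic can be sketched in Python using the summary statistics from Table 34.1. The function name `paired_t_score()` is a hypothetical helper for illustration, not part of any statistics package.

```python
from math import sqrt


def paired_t_score(d_bar, s_d, n, mu_0=0.0):
    """Return (standard error, t-score) for a sample mean difference."""
    se = s_d / sqrt(n)
    return se, (d_bar - mu_0) / se


# Differences defined as non-irrigated minus irrigated.
se, t = paired_t_score(d_bar=0.0282, s_d=0.0618, n=28)
print(f"s.e. = {se:.4f}, t = {t:.2f}")

# Defining the differences the other way round only changes the sign of t.
_, t_reversed = paired_t_score(d_bar=-0.0282, s_d=0.0618, n=28)
print(f"t with differences reversed = {t_reversed:.2f}")
```

The second call shows why the direction of the differences must be stated: the size of the \(t\)-score is unchanged, but its sign (and hence the direction of a one-tailed alternative hypothesis) is reversed.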

34.5 Determining \(P\)-values

A \(P\)-value determines if the sample data are consistent with the assumption (Table 33.1). Since \(t = 2.41\), and since \(t\)-scores are like \(z\)-scores, the one-tailed \(P\)-value is small, based on the \(68\)--\(95\)--\(99.7\) rule. Software (Fig. 34.2) reports that the two-tailed \(P\)-value is \(0.02279\). Hence, the one-tailed \(P\)-value is \(0.02279/2 = 0.0114\).
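
For readers without access to jamovi, the \(P\)-value can be approximated numerically. The sketch below is one possible approach, not the method used by the software: it integrates the density of the \(t\)-distribution with \(n - 1 = 27\) degrees of freedom using Simpson's rule. The function names are illustrative, and the rounded value \(t = 2.41\) is used, so the result agrees with the software output only to about three decimal places.

```python
from math import gamma, pi, sqrt


def t_density(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    const = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return const * (1 + x * x / df) ** (-(df + 1) / 2)


def one_tailed_p(t, df, steps=2000):
    """Approximate P(T > t) using Simpson's rule on [0, |t|] (steps is even)."""
    size = abs(t)
    h = size / steps
    total = t_density(0.0, df) + t_density(size, df)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * t_density(i * h, df)
    area_zero_to_t = total * h / 3
    upper = 0.5 - area_zero_to_t  # area beyond |t| in one tail
    return upper if t >= 0 else 1 - upper


t_score = 2.41   # rounded t-score from the text
df = 28 - 1      # degrees of freedom: number of differences minus one
p_one = one_tailed_p(t_score, df)
p_two = 2 * min(p_one, 1 - p_one)
print(f"One-tailed P: {p_one:.4f}")
print(f"Two-tailed P: {p_two:.4f}")
```

The one-tailed \(P\)-value is the area in the upper tail only, because \(H_1\) is \(\mu_d > 0\); the two-tailed \(P\)-value counts both tails and so is twice as large.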

The jamovi software clarifies how the differences have been computed: At the left of the output (Fig. 34.2), the order implies the differences are found as NonirrigatedN minus IrrigatedN, the same as our definition.

34.6 Writing conclusions

The one-tailed \(P\)-value is \(0.0114\), suggesting moderate evidence (Table 33.1) to support \(H_1\). A conclusion requires an answer to the RQ, a summary of the evidence leading to that conclusion, and some summary statistics, including a CI (indicating the precision of the statistic; Chap. 27.1):

Moderate evidence exists in the sample (paired \(t = 2.41\); one-tailed \(P = 0.0114\)) of a mean reduction in percentage soil N from non-irrigated to irrigated sites (mean reduction: \(0.0282\)%; \(n = 28\); \(95\)% CI from \(0.0042\)% to \(0.0522\)%).
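
The CI quoted in this conclusion can be checked from the summary statistics. As a sketch, assume the interval uses the exact \(t\)-multiplier for \(n - 1 = 27\) degrees of freedom, which is about \(2.052\) from \(t\)-tables (the text does not state the multiplier, so this is an assumption):
\[ \bar{d} \pm (2.052\times \text{s.e.}(\bar{d})) = 0.0282 \pm (2.052 \times 0.0117) = 0.0282 \pm 0.0240, \] or from \(0.0042\)% to \(0.0522\)%. Using the approximate multiplier of \(2\) from the \(68\)--\(95\)--\(99.7\) rule instead gives \(0.0282 \pm 0.0234\), or from \(0.0048\)% to \(0.0516\)%, which is very similar.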

The wording implies the direction of the differences.

Saying 'there is evidence of a difference' is insufficient. You must state which measurement is, on average, higher (that is, what the differences mean).

34.7 Statistical validity conditions

As with any hypothesis test, these results apply under certain conditions. For a hypothesis test for the mean of paired data, these conditions are the same as for the CI for the mean difference for paired data (Sect. 27.9), and similar to those for one sample mean.

The test above is statistically valid if one of these conditions is true:

  1. The number of differences \(n\) is at least \(25\); or
  2. The number of differences \(n\) is less than \(25\), and the population of differences has an approximate normal distribution.

The sample size of \(25\) is a rough figure here, and some books give other values (such as \(30\)). This condition ensures that the distribution of the sample mean differences has an approximate normal distribution (so that, for example, the \(68\)--\(95\)--\(99.7\) rule can be used). Provided \(n\) is at least \(25\), this will be approximately true even if the distribution of the differences in the population does not have a normal distribution. That is, when \(n\) is at least \(25\), the sample mean differences generally have an approximate normal distribution, even if the differences themselves don't have a normal distribution.

Example 34.1 (Statistical validity) For the %N data, the sample size is \(n = 28\), so the test is statistically valid.
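
The two conditions can be written as a simple check. This is only a sketch of the rule of thumb given above: the function name and arguments are illustrative, and whether the population of differences is approximately normal is a judgement that must be supplied by the user (for example, after examining a histogram of the differences).

```python
def paired_test_is_valid(n, differences_approx_normal=False, min_n=25):
    """Apply the rule of thumb: n at least min_n, or approximately
    normally distributed differences when n is smaller than min_n."""
    return n >= min_n or differences_approx_normal


print(paired_test_is_valid(28))   # the %N data: n = 28
print(paired_test_is_valid(12))   # small sample, normality not established
print(paired_test_is_valid(12, differences_approx_normal=True))
```

The threshold `min_n` is left as an argument because, as noted above, \(25\) is a rough figure and other books use other values.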

34.8 Example: invasive plants

(This study was seen in Sect. 27.10.) Skypilot (Polemonium viscosum) is a native alpine wildflower growing in the Colorado Rocky Mountains (USA). In recent years, a willow shrub (Salix) has been encroaching on skypilot territory and, because willow often flowers early, researchers (Kettenbach et al. 2017) are concerned that the willow may 'negatively affect pollination regimes of resident alpine wildflower species' (p. 6965).

Data for both species were collected at \(25\) different sites, so the data are paired (Sect. 27.1) and shown in Sect. 27.10. The parameter is \(\mu_d\), which we define as the population mean difference in day of first flowering for skypilot, less the day of first flowering for willow. Hence, a positive value for the difference means that the skypilot values are larger, and hence that willow flowered first. The RQ is:

In the Colorado Rocky Mountains, is there a mean difference between first-flowering day for the native skypilot and encroaching willow?

The hypotheses are \[ \text{$H_0$: $\mu_d = 0$}\quad\text{and}\quad\text{$H_1$: $\mu_d\ne 0$} \] where the alternative hypothesis is two-tailed.

Explaining how the differences are computed is important.

The differences here are skypilot first-flowering days minus willow first-flowering days. However, the differences could be computed as willow first-flowering days minus skypilot first-flowering days. Either is fine, as long as you remain consistent throughout. The meaning of any conclusions will be the same.

Since the raw data are available, the data were summarised graphically in Fig. 15.4. The numerical summary (Table 34.2) and software output (Fig. 34.4) are repeated here.


FIGURE 34.4: jamovi output for the flowering-day data

TABLE 34.2: The day of first flowering for encroaching willow and native skypilot
Mean Std. dev. Std. error Sample size
Willow (encroaching) \(189.40\) \(12.200\) \(2.440\) \(25\)
Skypilot (native) \(190.76\) \(13.062\) \(2.612\) \(25\)
Differences \(\phantom{0}\phantom{0}1.36\) \(\phantom{0}4.698\) \(0.940\) \(25\)

The standard error of the mean difference is \(\text{s.e.}(\bar{d}) = 0.9396\), from Fig. 34.4 or Table 34.2. The value of the test statistic (i.e., the \(t\)-score) is
\[\begin{align*} t = \frac{\bar{d} - \mu_d}{\text{s.e.}(\bar{d})} = \frac{1.36 - 0}{0.9396} = 1.45, \end{align*}\] as in the output. This is a small value of \(t\), so a large \(P\)-value is expected using the \(68\)--\(95\)--\(99.7\) rule. Indeed, the output shows that \(P = 0.161\), so there is no evidence of a mean difference in first-flowering day: the sample difference could be explained by sampling variation. We write:

No evidence exists (\(t = 1.45\); two-tailed \(P = 0.161\)) that the day of first-flowering is different for the encroaching willow and the native skypilot (mean difference: \(1.36\) days earlier for willow; s.e.: \(0.940\); \(n = 25\)).
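
The reasoning from the \(t\)-score to a rough \(P\)-value can be sketched in Python. The helper `rough_two_tailed_p()` is a hypothetical function that simply encodes the \(68\)--\(95\)--\(99.7\) rule, treating the \(t\)-score like a \(z\)-score; it brackets the \(P\)-value rather than computing it exactly.

```python
def rough_two_tailed_p(t):
    """Bracket the two-tailed P-value using the 68-95-99.7 rule."""
    size = abs(t)
    if size < 1:
        return "larger than about 0.32"
    if size < 2:
        return "between about 0.05 and 0.32"
    if size < 3:
        return "between about 0.003 and 0.05"
    return "smaller than about 0.003"


# Flowering-day data: differences are skypilot minus willow (in days).
d_bar = 1.36   # sample mean difference
se = 0.9396    # standard error of the mean difference
t = (d_bar - 0) / se

print(f"t = {t:.2f}")
print(f"Two-tailed P is {rough_two_tailed_p(t)}")
```

The bracket is consistent with the value \(P = 0.161\) reported by the software.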

The test should be statistically valid since \(n = 25\).

Be clear in your conclusion about how the differences are computed. Make sure to interpret the results consistently with how the differences were defined.

34.9 Example: chamomile tea

(This study was seen in Sects. 27.11 and 28.10.) A study of patients with Type 2 diabetes mellitus (T2DM) randomly allocated \(32\) patients into a control group (who drank hot water), and \(32\) to receive chamomile tea (Rafraf, Zemestani, and Asghari-Jafarabadi (2015), p. 164). Summary data are given in Table 27.4.

In Sect. 27.11, a CI was formed for the mean reduction in total glucose (TG) for each group separately. However, we can also ask whether the sample mean reductions in TG differ from zero only because of sampling variation. The following descriptive RQs can be asked:

  • For patients with T2DM, is there a mean change in TG after eight weeks drinking chamomile tea?
  • For patients with T2DM, is there a mean change in TG after eight weeks drinking hot water?

The hypotheses are:

\[\begin{align*} \text{Tea group}: \qquad & \text{$H_0$: } \mu_d = 0\quad \text{vs}\quad\text{$H_1$: } \mu_d \ne 0;\\ \text{Hot water group:}\qquad & \text{$H_0$: } \mu_d = 0\quad \text{vs}\quad\text{$H_1$: } \mu_d \ne 0. \end{align*}\]

For the two groups, using the standard errors found in Sect. 27.11, the test statistics are
\[ t_T = \frac{38.62 - 0}{5.37} = 7.19\qquad\text{and}\qquad t_W = \frac{-7.12 - 0}{6.48} = -1.10, \] where the subscripts \(T\) and \(W\) refer to the tea and hot-water groups respectively. The \(t\)-score for the tea-drinking group is huge, so the two-tailed \(P\)-value will be very small using the \(68\)--\(95\)--\(99.7\) rule, and certainly smaller than \(0.001\). This means that there is evidence that chamomile tea had an impact on the mean change in TG.

In contrast, the \(t\)-score for the water-drinking group is small, so the two-tailed \(P\)-value will be large using the \(68\)--\(95\)--\(99.7\) rule, and certainly larger than \(0.10\). This means that there is no evidence that placebo treatment (hot water) had any impact on mean change in TG, as one might expect.

We can write:

There is very strong evidence (\(t = 7.19\); two-tailed \(P < 0.001\)) of a mean change in TG for the chamomile-drinking group (mean reduction: \(38.62\) mg.dl\(^{-1}\); \(n = 32\); approx. \(95\)% CI: \(27.88\) to \(49.36\) mg.dl\(^{-1}\)), but no evidence (\(t = -1.10\); two-tailed \(P > 0.10\)) of a mean change in the hot-water drinking group (mean reduction: \(-7.12\) mg.dl\(^{-1}\); \(n = 32\); approx. \(95\)% CI: \(-20.08\) to \(5.84\) mg.dl\(^{-1}\)).

Each of these intervals has a \(95\)% chance of straddling the population mean reduction in TG for the corresponding group. The sample sizes are larger than \(25\), so the results are statistically valid.
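
The \(t\)-scores and approximate CIs for the two groups can be reproduced from the sample mean reductions and standard errors. This is a minimal sketch: the function name is illustrative, and the CIs use the approximate multiplier of \(2\) from the \(68\)--\(95\)--\(99.7\) rule, not an exact \(t\)-multiplier.

```python
def t_and_approx_ci(mean_diff, se, mu_0=0.0, multiplier=2.0):
    """Return (t-score, lower CI limit, upper CI limit)."""
    t = (mean_diff - mu_0) / se
    return t, mean_diff - multiplier * se, mean_diff + multiplier * se


# (sample mean reduction in TG, standard error), both in mg/dl.
groups = {
    "Chamomile tea": (38.62, 5.37),
    "Hot water": (-7.12, 6.48),
}

results = {name: t_and_approx_ci(mean_diff, se)
           for name, (mean_diff, se) in groups.items()}

for name, (t, lower, upper) in results.items():
    print(f"{name:<14} t = {t:6.2f}; approx. 95% CI: {lower:.2f} to {upper:.2f}")
```

The interval for the tea group lies well above zero, while the interval for the hot-water group includes zero, which agrees with the two conclusions from the hypothesis tests.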

The two groups appear to show different mean reductions in TG, depending on which group the subject is in. This may suggest that chamomile tea reduces TG compared to the control... but perhaps the difference is simply due to sampling variation. This is studied in Sect. 35.9.

34.10 Chapter summary

Consider testing a hypothesis about a population mean difference \(\mu_d\), based on the value of the sample mean difference \(\bar{d}\). Under certain statistical validity conditions, the sample mean difference varies with an approximate normal distribution centred around \(\mu_{\bar{d}}\) (whose value is the hypothesised value of \(\mu_d\)), and with a standard deviation of
\[ \text{s.e.}(\bar{d}) = \frac{s_d}{\sqrt{n}}. \] This distribution describes what values of the sample mean difference could be expected if the value of \(\mu_d\) in the null hypothesis was true. The test statistic is
\[ t = \frac{ \bar{d} - \mu_{\bar{d}}}{\text{s.e.}(\bar{d})}, \] where \(\mu_{\bar{d}}\) is the mean of all the possible sample mean differences; its value is the hypothesised value in the null hypothesis. The \(t\)-score describes what value of \(\bar{d}\) was observed in the sample, relative to what was expected. The \(t\)-score is like a \(z\)-score, so an approximate \(P\)-value can be estimated using the \(68\)--\(95\)--\(99.7\) rule, or found using software. The \(P\)-value helps determine if the sample evidence is consistent with the assumption, or contradicts the assumption.


34.11 Quick review questions

A study (Bacho et al. 2019) compared joint pain in stroke patients before and after a supervised exercise treatment. The same participants (\(n = 34\)) were assessed before and after treatment.

The mean improvement in joint pain after \(13\) weeks was \(1.27\) (with a standard error of \(0.57\)) using a standardised tool.

  1. True or false? Only 'before and after' studies can be paired.
  2. True or false? The null hypothesis is about the population mean difference.
  3. What is the value of the test statistic (to two decimal places)?
  4. What is the two-tailed \(P\)-value?
  5. True or false? The 'test statistic' is a \(t\)-score.

34.12 Exercises

Selected answers are available in App. E.

Exercise 34.1 [Dataset: Fruit] (These data were also seen in Exercise 27.3.) The effect of rainfall on growing Chayote squash (Sechium edule) was studied (Mukherjee, Deb, and Devy 2019), comparing the size of the fruit in a year with normal rainfall (2015) compared to fruit in a dry year (2014) on \(24\) farms:

For Chayote squash grown in Bangalore, what is the mean difference in fruit weight between a normal and dry year?

Ten fruits were gathered from each farm in both years, and the average (mean) weight recorded for the farm. Since the same farms are used in both years, the data are paired (see above). Data is missing for Farm 20 in the dry year (2014), so there are \(n = 23\) differences.

  1. Write down the hypotheses.
  2. Compute the \(t\)-score.
  3. Determine the \(P\)-value.
  4. Write a conclusion.

Exercise 34.2 (This study was also seen in Exercise 27.4.) In a study of hypertension (Hand et al. 1996; MacGregor et al. 1979), patients were given a drug (Captopril) and their systolic blood pressure measured (in mm Hg) immediately before and two hours after being given the drug (data with Exercise 27.4).

The aim is to see if there is evidence of a reduction in blood pressure after taking Captopril. Using these data and the software output (Fig. 34.5):

  1. Explain why it is probably more sensible to compute differences as the Before minus the After measurements. What do the differences mean when computed this way?
  2. Compute the differences.
  3. Construct a suitable graph for the differences.
  4. Write down the hypotheses.
  5. Write down the \(t\)-score.
  6. Write down the \(P\)-value.
  7. Write a conclusion.

FIGURE 34.5: jamovi (top) and SPSS (bottom) output for the Captopril data

Exercise 34.3 (These data were also seen in Exercise 27.5.) People often struggle to eat the recommended intake of vegetables. In one study exploring ways to increase vegetable intake in teens (Fritts et al. 2018), teens rated the taste of raw broccoli, and raw broccoli served with a specially-made dip.

Each teen (\(n = 100\)) had a pair of measurements: the taste rating of the broccoli with and without dip. Taste was assessed using a '\(100\) mm visual analog scale', where a higher score means a better taste. In summary:

  • For raw broccoli, the mean taste rating was \(56.0\) (with a standard deviation of \(26.6\));
  • For raw broccoli served with dip, the mean taste rating was \(61.2\) (with a standard deviation of \(28.7\)).

Because the data are paired, the differences are the best way to describe the data. The mean difference in the ratings was \(5.2\), with standard error of \(3.06\).

Perform a hypothesis test to see if the use of dip increases the mean taste rating.

Exercise 34.4 (This study was also seen in Exercise 27.6.) A study (Allen et al. 2018) examined the effect of exercise on smoking. Men and women were assessed on a range of measures, including the 'intention to smoke'.

'Intention to smoke', and other measures, were assessed both before and after exercise for each subject, using two quantitative questionnaires. Smokers (defined as people smoking at least five cigarettes per day) aged \(18\) to \(40\) were enrolled for the study. For the \(23\) women in the study, the mean intention to smoke after exercise reduced by \(0.66\) (with a standard error of \(0.37\)).

Perform a hypothesis test to determine if there is evidence of a population mean reduction in intention-to-smoke for women after exercising.

Exercise 34.5 In a study (Cressie, Sheffield, and Whitford 1984) conducted at the Adelaide Children's Hospital (p. 107; emphasis added):

...a group of beta thalassemia patients [...] were treated by a continuous infusion of desferrioxamine, in order to reduce their ferritin content...

Using the data shown below, conduct a hypothesis test to determine if there is evidence that the treatment reduces the ferritin content, as intended.

TABLE 34.3: The ferritin content (in \(\mu\)g/L) for 20 thalassemia patients at the Adelaide Children's Hospital (first ten observations)
September March Reduction
\(6630\) \(5100\) \(\phantom{-}1530\)
\(4590\) \(3510\) \(\phantom{-}1080\)
\(3510\) \(6600\) \(-3090\)
\(6375\) \(8000\) \(-1625\)
\(2500\) \(2800\) \(-300\)
\(1400\) \(2860\) \(-1460\)
\(4580\) \(3640\) \(\phantom{-}\phantom{0}940\)
\(6885\) \(9030\) \(-2145\)
\(4200\) \(4420\) \(-220\)
\(5600\) \(7910\) \(-2310\)

Exercise 34.6 (This study was also seen in Exercise 27.8.) The concentration of beta-endorphins in the blood is a sign of stress. One study (Hand et al. (1996), Dataset 232; Hoaglin, Mosteller, and Tukey (2011)) measured the beta-endorphin concentration for \(19\) patients about to undergo surgery. The RQ was: "For patients approaching surgery, is there a mean increase in beta-endorphin concentrations?"

Each patient had their beta-endorphin concentrations measured \(12\)--\(14\) hours before surgery, and also \(10\) minutes before surgery. A numerical summary (Table 34.4) can be produced from jamovi output. Use the summary to test the RQ.

TABLE 34.4: The surgery-stress data
Mean Std. dev. Std. error Sample size
12--14 hours before surgery 8.352632 4.396763 1.008687 19
10 minutes before surgery 16.052632 12.508769 2.869708 19
Increase 7.700000 13.519163 3.101509 19