31 Tests for the mean difference (paired data)

So far, you have learnt to ask a RQ, design a study, describe and summarise the data, understand the decision-making process and to work with probabilities. You have been introduced to the construction of confidence intervals, and have studied some hypothesis tests.

In this chapter, you will learn about hypothesis tests for the mean difference (i.e., for paired data, or within individuals comparisons). You will learn to:

  • conduct hypothesis tests for the mean difference with paired data
  • determine whether the conditions for using these methods apply in a given situation.

31.1 Introduction: insulation

The Electricity Council in Bristol wanted to determine if a certain type of wall-cavity insulation reduced energy consumption in winter, on average (The Open University 1983). Their RQ was:

Is there a mean saving in energy consumption due to adding insulation?

The data collected are shown in the table below. These data were used in Sect. 23, where a CI was constructed for the mean energy saving.

For these data, finding the difference in energy consumption for each house seems sensible. The data are paired (Sect. 23.1): the same unit of analysis is measured twice on the same variable (before and after), and the mean change is of interest. That is, the comparison is within individuals, and not between different groups of individuals.

Pairing the values for each house makes sense; hence finding the difference in energy consumption at each house makes sense. The parameter is \(\mu_d\), the population mean saving in energy consumption.

Making clear how the differences are computed is important

Here, the differences could be computed as the Before minus After (the energy consumption saving), or the After minus Before (the energy consumption increase). Either is fine, as long as you are consistent throughout. The meaning of any conclusions will be the same.

In this case, discussing energy savings seems most natural, so we compute the differences as Before minus After.

31.2 Hypotheses and notation: mean differences

The RQ asks if the mean energy saving in the population is zero or not. The parameter is the population mean difference. To make things clear, notation is needed (recapping Sect. 23.3):

  • \(\mu_d\): The population mean difference.
  • \(\bar{d}\): The sample mean difference.
  • \(s_d\): The sample standard deviation of the differences.
  • \(n\): The number of differences.
  • \(\text{s.e.}(\bar{d})\): The standard error of the mean differences, where \(\displaystyle \text{s.e.}(\bar{d}) = \frac{s_d}{\sqrt{n}}\).
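The notation above can be made concrete with a short Python sketch. The paired values below are hypothetical (they are not the insulation data) and the variable names are illustrative only; the point is how \(\bar{d}\), \(s_d\), \(n\) and \(\text{s.e.}(\bar{d})\) are obtained from the differences, and what happens when the order of subtraction is reversed.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical paired measurements (NOT the insulation data)
before = [12.1, 11.4, 13.0, 10.8, 12.6]
after = [11.5, 11.6, 12.1, 10.2, 12.0]

# Differences defined as Before minus After (the "saving")
d = [b - a for b, a in zip(before, after)]

n = len(d)           # number of differences
d_bar = mean(d)      # sample mean difference
s_d = stdev(d)       # sample standard deviation of the differences (n - 1 divisor)
se = s_d / sqrt(n)   # standard error of the mean difference

print(f"n = {n}, d-bar = {d_bar:.3f}, s_d = {s_d:.3f}, s.e. = {se:.3f}")

# Reversing the order (After minus Before) changes the sign of the mean
# difference, but leaves the standard deviation unchanged
d_rev = [a - b for b, a in zip(before, after)]
print(f"reversed: d-bar = {mean(d_rev):.3f}, s_d = {stdev(d_rev):.3f}")
```

Either direction gives the same standard error, which is why the conclusions do not depend on the direction chosen, provided the direction is stated.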

The hypotheses, therefore, can be written in terms of the parameter \(\mu_d\). The null hypothesis is 'there is no change in the energy consumption, in the population':

  • \(H_0\): \(\mu_d = 0\).

As noted in Sect. 30.2, the null hypothesis states that there is 'no difference, no change, no relationship', as measured by a parameter value. This hypothesis, the initial assumption, postulates that any non-zero mean difference in the sample is due to sampling variation.

Since the RQ asks specifically if the insulation saves energy, the alternative hypothesis will be a one-tailed hypothesis:

  • \(H_1\): \(\mu_d > 0\) (one-tailed).

This hypothesis says that the mean energy saving in the population is greater than zero. The alternative hypothesis is one-tailed because of the wording of the RQ. Recall that the differences are defined as energy savings.

31.3 Sampling distribution: mean differences

Initially, assume that \(\mu_d = 0\). However, the sample mean energy saving will vary depending on which one of the many possible samples is randomly obtained, even if the mean saving in the population is zero: the sample mean energy saving has sampling variation and hence a standard error. The sampling distribution of \(\bar{d}\) describes what values of the statistic might be expected in the sample if \(\mu_d = 0\).

Answering the RQ requires data. The data should be summarised numerically (using your calculator or software, such as jamovi (Fig. 31.1) or SPSS (Fig. 31.2)), and graphically (Fig. 23.1).

FIGURE 31.1: jamovi output for the insulation data

FIGURE 31.2: SPSS output for the insulation data

The sample mean difference is \(\bar{d} = 0.5400\), but this value of \(\bar{d}\) will vary from sample to sample even if \(\mu_d = 0\). The amount of variation in \(\bar{d}\) is quantified using the standard error. More precisely, the possible values of the sample mean difference can, under certain conditions, be described using

  • an approximate normal distribution; with
  • a mean of \(\mu_d = 0\) (taken from \(H_0\)); and
  • a standard deviation (called the standard error) of \(\text{s.e.}(\bar{d}) = s_d/\sqrt{n} = 0.3212\), where \(s_d\) is the standard deviation of the differences.

This describes what can be expected from the possible values of \(\bar{d}\) (Fig. 31.3), just through sampling variation (chance) if \(\mu_d = 0\).
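As a rough illustration of this sampling distribution, the sketch below applies the 68--95--99.7 rule to the values quoted in the text (\(\bar{d} = 0.54\) and \(\text{s.e.}(\bar{d}) = 0.3212\)). The variable names are illustrative, and the intervals are only as good as the normal approximation.

```python
d_bar = 0.54   # observed sample mean saving (MWh), from the text
se = 0.3212    # standard error of the mean difference, from the text
mu_0 = 0       # value of mu_d assumed in the null hypothesis

# Intervals in which about 68%, 95% and 99.7% of sample mean differences
# are expected to fall if the null hypothesis is true
ranges = []
for k, pct in [(1, "68%"), (2, "95%"), (3, "99.7%")]:
    lo, hi = mu_0 - k * se, mu_0 + k * se
    inside = lo <= d_bar <= hi
    ranges.append((k, lo, hi, inside))
    print(f"about {pct} of sample mean differences lie between "
          f"{lo:.4f} and {hi:.4f}; observed value inside: {inside}")
```

The observed mean saving falls outside the one-standard-error interval but inside the two-standard-error interval.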

FIGURE 31.3: The sampling distribution of sample means, if the energy saving in the population really was zero

31.4 The test statistic: mean differences

The sample mean difference can be located on the sampling distribution (Fig. 31.3) by computing the \(t\)-score:

\[ t = \frac{0.54 - 0}{0.3212} = 1.681, \] following the ideas in Equation (29.1). Software computes the \(t\)-score too (jamovi: Fig. 31.1; SPSS: Fig. 31.2). The \(t\)-score locates the observed sample statistic on the sampling distribution.

31.5 \(P\)-values: mean differences

A \(P\)-value determines if the sample data are consistent with the assumption (Table 30.1). Since \(t = 1.681\), and since \(t\)-scores are like \(z\)-scores, the one-tailed \(P\)-value is between 2.5% and 16% based on the 68--95--99.7 rule. This is a wide, and inconclusive, interval for the \(P\)-value.

Software gives a more precise \(P\)-value (jamovi: Fig. 31.1; SPSS: Fig. 31.2): the two-tailed \(P\)-value is \(0.127\), so the one-tailed \(P\)-value is \(0.127/2 = 0.0635\).
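For readers without jamovi or SPSS, the \(P\)-value can be approximated directly. The sketch below is one possible approach and is not how the software computes it: it integrates the density of the \(t\)-distribution numerically (Simpson's rule), assuming \(n - 1 = 9\) degrees of freedom for the \(n = 10\) houses. The function names are illustrative.

```python
from math import exp, lgamma, log, pi

def t_density(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_c = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_c - (df + 1) / 2 * log(1 + x * x / df))

def upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, using Simpson's rule on the interval [0, t]."""
    h = t / steps                      # steps must be even for Simpson's rule
    total = t_density(0, df) + t_density(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    return 0.5 - total * h / 3         # half the area lies above zero

t_score = (0.54 - 0) / 0.3212          # values from the text
df = 10 - 1                            # assumption: n - 1 degrees of freedom

one_tailed = upper_tail(t_score, df)
two_tailed = 2 * one_tailed
print(f"t = {t_score:.3f}; one-tailed P = {one_tailed:.4f}; "
      f"two-tailed P = {two_tailed:.3f}")
```

The result should agree closely with the software output quoted above; small discrepancies can arise because \(\bar{d}\) and the standard error are rounded.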

The software clarifies how the differences have been computed:

  • jamovi: At the left of the output (Fig. 31.1), the order implies the differences are found as Before minus After.

  • SPSS: At the left of the output (Fig. 31.2), the difference is described as Before - After.

31.6 Conclusions: mean differences

The one-tailed \(P\)-value is \(0.0635\), suggesting only slight evidence (see Table 30.1) supporting \(H_1\). To write a conclusion, an answer to the RQ is needed, plus evidence leading to that conclusion; and some summary statistics, including a CI (indicating the precision of the statistic):

Slight evidence exists in the sample (paired \(t = 1.68\); one-tailed \(P = 0.0635\)) of a mean energy saving in the population (mean saving: 0.54 MWh; \(n = 10\); 95% CI from \(-0.19\) to \(1.27\) MWh) after adding the insulation.

The wording implies the direction of the differences (by talking of 'savings').

Statistical validity should be checked, which was done in Sect. 23.7, but the validity conditions are given again in the next section, for completeness.

Example 31.1 (COVID lockdown) A study of \(n = 213\) Spanish health students (Romero-Blanco et al. 2020) measured (among other things) the number of minutes of vigorous physical activity (PA) performed by students during and before the COVID-19 lockdown (from March to April 2020 in Spain). The numerical summary of the data is shown in Exercise 23.6, so is not repeated here. We define the differences as the number of minutes of vigorous PA during the COVID lockdown, minus the number of minutes of vigorous PA before the COVID lockdown. A difference is computed for each participant, so the data are paired.

Using this definition, a positive difference means the during value is higher; hence, the differences correspond to the increase in PA during (compared to before) the COVID lockdown. Similarly, a negative value means that the before value is higher. The RQ is

For Spanish health students, is there a mean change in the amount of vigorous PA during and before the COVID lockdown?

In this situation, the parameter of interest is the population mean difference \(\mu_d\), the mean increase in vigorous PA during (compared to before) the lockdown. The hypotheses are:

\[ \text{$H_0$}: \mu_d = 0\quad\text{and}\quad \text{$H_1$}: \mu_d \ne 0 \quad\text{(i.e., two-tailed)} \] The mean difference is \(\bar{d} = 2.68\) minutes, with a standard deviation of \(s_d = 51.30\) minutes. However, we know that the sample mean difference could vary from sample to sample, so has a standard error: \(\text{s.e.}(\bar{d}) = s_d\div \sqrt{n} = {51.30}\div{\sqrt{213}} = 3.515018\). The test statistic is

\[ t = \frac{\bar{d} - \mu_d}{\text{s.e.}(\bar{d})} = \frac{2.68 - 0}{3.515018} = 0.76. \]

This is a very small value, so (using the 68--95--99.7 rule) the \(P\)-value will be very large: a sample mean difference of 2.68 minutes could easily have happened by chance even if the population mean difference was zero.

We write:

No evidence (paired \(t = 0.76\), \(P > 0.10\)) exists in the sample of a mean change in vigorous PA during (compared to before) lockdown in the population (sample mean 2.68 minutes greater during lockdown; standard deviation: 51.30 minutes; approximate 95% CI: \(-4.35\) to \(9.71\) minutes).
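The arithmetic in this example can be reproduced from the summary statistics alone. The sketch below uses the values from the test statistic and conclusion above, with a multiplier of 2 for the approximate 95% CI (from the 68--95--99.7 rule); the variable names are illustrative.

```python
from math import sqrt

n = 213        # number of students (differences)
d_bar = 2.68   # sample mean difference (minutes): during minus before
s_d = 51.30    # standard deviation of the differences (minutes)
mu_0 = 0       # hypothesised population mean difference

se = s_d / sqrt(n)                      # standard error of the mean difference
t_score = (d_bar - mu_0) / se           # test statistic
ci = (d_bar - 2 * se, d_bar + 2 * se)   # approximate 95% CI (multiplier of 2)

print(f"s.e. = {se:.6f}")
print(f"t = {t_score:.2f}")
print(f"approximate 95% CI: {ci[0]:.2f} to {ci[1]:.2f} minutes")
```

Because the \(t\)-score is smaller than 1, the sample mean difference is within one standard error of zero, which is why the two-tailed \(P\)-value is large.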

31.7 Statistical validity conditions: Mean differences

As with any inferential procedure, these results apply under certain conditions. For a hypothesis test for the mean of paired data, these conditions are the same as for the CI for the mean difference for paired data (Sect. 23.7), and similar to those for one sample mean.

The test above is statistically valid if one of these conditions is true:

  1. The sample size of differences is at least 25; or
  2. The sample size of differences is smaller than 25, and the population of differences has an approximate normal distribution.

The sample size of 25 is a rough figure here, and some books give other values (such as 30). This condition ensures that the sampling distribution of the sample mean difference is approximately normal, so the 68--95--99.7 rule can be used.

Provided the sample size is larger than about 25, this will be approximately true even if the distribution of the individuals in the population does not have a normal distribution. That is, when \(n > 25\) the sample means generally have an approximate normal distribution, even if the data themselves don't have a normal distribution.
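The two conditions can be expressed as a simple check. The function below is a hypothetical helper (not part of jamovi, SPSS or any library), using the rough threshold of 25 given above; it cannot decide whether the population of differences is approximately normal, which still requires judgement.

```python
def validity_note(n_differences, threshold=25):
    """Return a short note on the statistical validity of a paired t-test.

    The threshold of 25 is the rough figure used in this chapter.
    """
    if n_differences >= threshold:
        return "valid (sample size of differences is at least the threshold)"
    return "valid only if the population of differences is approximately normal"

# Sample sizes taken from the examples in this chapter
for label, n in [("insulation", 10), ("COVID lockdown", 213)]:
    print(f"{label} (n = {n}): {validity_note(n)}")
```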

Example 31.2 (Statistical validity) For the insulation data used above, the sample size is small, so the test will be statistically valid if the differences in the population follow a normal distribution. We don't know if they do, though the sample data (Fig. 23.1) don't identify any obvious doubts. So the test is possibly statistically valid, but we can't be sure.

Example 31.3 (COVID lockdown) In Example 31.1 concerning COVID lockdowns, the sample size was 213 Spanish health students. Since the sample size is much larger than 25, the test is statistically valid.

31.8 Example: endangered species

A study of endangered species (Harnish and Nataraajan 2020) examined

...whether perceived physical attractiveness of a species impacted participants' attitudes toward supporting and protecting the species...

--- Harnish and Nataraajan (2020), p. 1703

To do so, 210 undergraduate students were surveyed about 14 animals on various aspects of supporting and protecting them. Part of the data are summarised below, for two animals when asked about 'support to protect the animal from illicit trade'. Larger values mean greater support for protecting the animal from illicit trade.

| Species | Mean score | Standard deviation |
|---|---|---|
| Bay Checkerspot Butterfly | 3.10 | 1.06 |
| Valley Elderberry Longhorn Beetle | 2.33 | 1.13 |
| Difference | 0.77 | 1.07 |

(Notice that the standard deviation of the difference is not the difference between the two given values of the standard deviation.)
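One way to see why is the identity \(s_d^2 = s_1^2 + s_2^2 - 2 r s_1 s_2\), where \(r\) is the sample correlation between the paired scores. The study's value of \(r\) is not given here, so the sketch below works backwards to the value of \(r\) implied by the three (rounded) standard deviations in the table; treat the result as approximate.

```python
from math import sqrt

s_butterfly = 1.06   # standard deviation of the butterfly scores
s_beetle = 1.13      # standard deviation of the beetle scores
s_d = 1.07           # standard deviation of the differences

# Naive (incorrect) guess: the difference between the standard deviations
naive = abs(s_butterfly - s_beetle)

# Rearranging s_d^2 = s_1^2 + s_2^2 - 2 r s_1 s_2 for the implied correlation r
r_implied = (s_butterfly**2 + s_beetle**2 - s_d**2) / (2 * s_butterfly * s_beetle)

# Check: rebuilding s_d from the implied correlation recovers the table value
s_d_check = sqrt(s_butterfly**2 + s_beetle**2
                 - 2 * r_implied * s_butterfly * s_beetle)

print(f"naive difference of standard deviations: {naive:.2f}")
print(f"implied correlation between paired scores: {r_implied:.3f}")
print(f"s_d rebuilt from the identity: {s_d_check:.2f}")
```

The standard deviation of the differences depends on how strongly each student's two scores are related, which cannot be recovered from the two separate standard deviations.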

The difference is defined as each student's score for the butterfly (deemed more attractive) minus their score for the beetle (deemed less attractive). A positive value therefore means more support (on average) for the butterfly. The RQ is whether there is a mean difference between support for each animal, so the parameter is \(\mu_d\), the population mean difference.

The researchers wished to test if

...animals perceived as more physically attractive [i.e., the butterfly] compared to those which are perceived as less physically attractive [i.e., the beetle] will receive relatively more support to prevent the species from illicit trade

--- Harnish and Nataraajan (2020), p. 1704

Given how the differences are defined, the hypotheses are:

\[ H_0: \mu_d = 0\quad\text{and}\quad H_1: \mu_d > 0 \quad\text{(i.e., one-tailed, based on the researchers' purpose)} \] The mean difference is \(\bar{d} = 0.77\) and \(s_d = 1.07\). The value of \(\bar{d}\) will vary from sample to sample, so has a standard error:

\[ \text{s.e.}(\bar{d}) = \frac{s_d}{\sqrt{n}} = \frac{1.07}{\sqrt{210}} = 0.073837. \] The value of the test statistic is

\[ t = \frac{\bar{d} - \mu_d}{\text{s.e.}(\bar{d})} = \frac{0.77 - 0}{0.073837} = 10.43, \] which is a very large value. Hence, the \(P\)-value will be very small, certainly less than \(0.05\). Since the sample size is much larger than 25, the test will be statistically valid.

We write:

There is very strong evidence (\(t = 10.43\); one-tailed \(P<0.001\)) that the mean support for protecting the Bay Checkerspot Butterfly from illicit trade is greater than the mean support for protecting the Valley Elderberry Longhorn Beetle from illicit trade (mean difference: 0.77; standard deviation: 1.07; 95% CI for the difference: 0.62 to 0.92).
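As a check on the interval quoted in this conclusion, an approximate 95% CI can be reconstructed using a multiplier of 2 (from the 68--95--99.7 rule). Software uses a slightly different multiplier, so this is a sketch rather than necessarily the calculation behind the reported values:

\[ \bar{d} \pm \big(2\times \text{s.e.}(\bar{d})\big) = 0.77 \pm (2\times 0.073837) = 0.77 \pm 0.147674, \] or from about \(0.62\) to \(0.92\), which agrees with the CI given in the conclusion.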

31.9 Example: blood pressure

A US study (Willems et al. 1997; Schorling et al. 1997) was conducted to determine how CHD risk factors were assessed among parts of the population. Subjects were required to report to the clinic on multiple occasions.

One RQ of interest is:

Is there a mean difference in blood pressure measurements between the first and second visits?

The parameter is \(\mu_d\), the population mean reduction in blood pressure. Each person has a pair of diastolic blood pressure (DBP) measurements: one each from their first and second visits. The data (summarised below) are from 141 people. (These data were also seen in Sect. 23.10.) The differences could be computed in one of two ways:

  • The observation from the first visit, minus the observation from the second visit: the reduction in BP; or
  • The observation from the second visit, minus the observation from the first visit: the increase in BP.

Either way is fine, as long as the order remains consistent, and the direction is made clear. Here, the observations from the first visit minus the observation from the second visit will be used, so that the differences represent the decrease in BP from the first to second measurement.

The appropriate graphical summary is a histogram of differences (Fig. ??); the numerical summary is shown in Table 31.1. Notice that having the information about the differences is essential, as the RQ is about the differences.

TABLE 31.1: The numerical summary for the diabetes data (in mm Hg). The differences are the first visit value minus the second visit value: the decreases in diastolic blood pressure from the first to second visit

| | Mean | Standard deviation | Standard error | Sample size |
|---|---|---|---|---|
| DBP: First visit | 94.48 | 11.473 | 0.966 | 141 |
| DBP: Second visit | 92.52 | 11.555 | 0.973 | 141 |
| Decrease in DBP | 1.95 | 8.026 | 0.676 | 141 |

As always (Sect. 30.2), the null hypothesis is the 'no difference, no change, no relationship' position, proposing that any non-zero mean difference in the sample is due to sampling variation:

  • \(H_0\): \(\mu_d = 0\) (differences: \(\text{first} - \text{second}\)); \(H_1\): \(\mu_d \ne 0\).

The alternative hypothesis is two-tailed because of the wording of the RQ. As usual, assume that \(H_0\) is true, and then the evidence is evaluated to determine if it contradicts this assertion.

The sampling distribution describes how the sample mean difference is expected to vary from sample to sample due to sampling variation, when \(\mu_d = 0\). Under certain circumstances, the sample mean differences are likely to vary with a normal distribution, with a mean of 0 (from \(H_0\)) and a standard deviation of \(\text{s.e.}(\bar{d}) = 0.676\).

The relative value of the observed sample statistic is found by computing a \(t\)-score, using software (jamovi: Fig. 31.4; SPSS: Fig. 31.5), or manually (Eq. (29.1), using the information in Table 31.1):

\[\begin{align*} t &= \frac{\text{sample statistic} - \text{assumed population parameter}} {\text{standard error of the statistic}}\\ &= \frac{1.950 - 0}{0.676} = 2.885. \end{align*}\] Either way, the \(t\)-score is the same.

FIGURE 31.4: jamovi output for the diabetes data

FIGURE 31.5: SPSS output for the diabetes data

A \(P\)-value is then needed to decide if the sample is consistent with the assumption. Using the 68--95--99.7 rule, the approximate two-tailed \(P\)-value is smaller than 0.05 (since the \(t\)-score is larger than 2). Alternatively, the software output (Fig. 31.4; Fig. 31.5) reports the two-tailed \(P\)-value as \(P = 0.005\). We conclude:
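Because \(t\)-scores are like \(z\)-scores when the sample size is large, the 68--95--99.7 reasoning can be sharpened by treating the \(t\)-score as a \(z\)-score. The sketch below does this using Python's standard library; it is a normal approximation, not the calculation used by jamovi or SPSS, so it gives a slightly smaller value than the software.

```python
from statistics import NormalDist

t_score = 1.950 / 0.676   # from Table 31.1: mean decrease / standard error

# Treat the t-score as a z-score (reasonable here since n = 141 is large)
upper_tail = 1 - NormalDist().cdf(abs(t_score))
approx_two_tailed = 2 * upper_tail

print(f"t = {t_score:.3f}")
print(f"approximate two-tailed P (normal approximation) = {approx_two_tailed:.4f}")
```

Both the approximation and the software output lead to the same conclusion: the \(P\)-value is well below 0.05.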

Strong evidence exists in the sample (paired \(t = 2.885\); two-tailed \(P = 0.005\)) of a population mean difference between the first and second DBP readings (mean difference \(1.95\) mm Hg higher for first reading; 95% CI from \(0.61\) to \(3.3\) mm Hg; \(n = 141\)).

Since \(n > 25\), the results are statistically valid.

Just saying 'there is evidence of a difference' is insufficient. You must state which measurement is, on average, higher (that is, what the differences mean).

31.10 Summary

Consider testing a hypothesis about a population mean difference \(\mu_d\), based on the value of the sample mean difference \(\bar{d}\). Under certain statistical validity conditions, the sample mean difference varies with an approximate normal distribution centered around the hypothesised value of \(\mu_d\), with a standard deviation of

\[ \text{s.e.}(\bar{d}) = \frac{s_d}{\sqrt{n}}. \] This distribution describes what values of the sample mean difference could be expected if the value of \(\mu_d\) in the null hypothesis was true. The test statistic is

\[ t = \frac{ \bar{d} - \mu_d}{\text{s.e.}(\bar{d})}, \] where \(\mu_d\) is the hypothesised value in the null hypothesis. The \(t\)-score describes what value of \(\bar{d}\) was observed in the sample, relative to what was expected. The \(t\)-value is like a \(z\)-score, so an approximate \(P\)-value can be estimated using the 68--95--99.7 rule, or is found using software. The \(P\)-value helps determine if the sample evidence is consistent with the assumption, or contradicts the assumption.
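The two formulas in this summary can be collected into one small function. This is a minimal sketch with a hypothetical function name, shown here with the blood pressure summary from Table 31.1 as input.

```python
from math import sqrt

def paired_t_from_summary(d_bar, s_d, n, mu_0=0.0):
    """Return (standard error, t-score) for a test of the mean difference.

    d_bar: sample mean difference
    s_d:   sample standard deviation of the differences
    n:     number of differences
    mu_0:  hypothesised population mean difference (from the null hypothesis)
    """
    se = s_d / sqrt(n)
    t_score = (d_bar - mu_0) / se
    return se, t_score

# Blood pressure example: mean decrease 1.95 mm Hg, s_d = 8.026 mm Hg, n = 141
se, t_score = paired_t_from_summary(d_bar=1.95, s_d=8.026, n=141)
print(f"s.e. = {se:.3f}, t = {t_score:.3f}")
```

The function only needs the summary of the differences, not the two sets of original measurements, which reflects the point that the analysis of paired data is an analysis of the differences.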

31.11 Quick review questions

A study (Bacho et al. 2019) compared joint pain in stroke patients before and after a supervised exercise treatment. Participants (\(n = 34\)) were assessed before and after treatment.

The mean change in joint pain after 13 weeks was 1.27 (with a standard error of 0.57) using a standardised tool.

  1. True or false? Only 'before and after' studies can be paired.

  2. True or false? The null hypothesis is about the population mean difference.

  3. The value of the test statistic (to two decimal places) is

  4. The two-tailed \(P\)-value will be

31.12 Exercises

Selected answers are available in Sect. D.29.

Exercise 31.1 People often struggle to eat the recommended intake of vegetables. In one study exploring ways to increase vegetable intake in teens (Fritts et al. 2018), teens rated the taste of raw broccoli, and raw broccoli served with a specially-made dip. (These data were also seen in Exercise 23.1.)

Each teen (\(n = 101\)) had a pair of measurements: the taste rating of the broccoli with and without dip. Taste was assessed using a '100 mm visual analog scale', where a higher score means a better taste. In summary:

  • For raw broccoli, the mean taste rating was \(56.0\) (with a standard deviation of \(26.6\));
  • For raw broccoli served with dip, the mean taste rating was \(61.2\) (with a standard deviation of \(28.7\)).

Because the data are paired, the differences are the best way to describe the data. The mean difference in the ratings was \(5.2\), with standard error of \(3.06\).

Perform a hypothesis test to see if the use of dip increases the mean taste rating.

Exercise 31.2 In a study of hypertension (Hand et al. 1996; MacGregor et al. 1979), patients were given a drug (Captopril) and their systolic blood pressure measured immediately before and two hours after being given the drug (data shown).

The aim is to see if there is evidence of a reduction in blood pressure after taking Captopril. (This study was also seen in Exercise 23.2.)

Using these data and the software output (jamovi: Fig. 31.6; SPSS: Fig. 31.7):

  1. Explain why it is probably more sensible to compute differences as the Before minus the After measurements. What do the differences mean when computed this way?
  2. Compute the differences.
  3. Construct a suitable graph for the differences.
  4. Write down the hypotheses.
  5. Write down the \(t\)-score.
  6. Write down the \(P\)-value.
  7. Write a conclusion.
TABLE 31.2: The Captopril data: before and after systolic blood pressures (in mm Hg)

| Before | After | Before | After |
|---|---|---|---|
| 210 | 201 | 173 | 147 |
| 169 | 165 | 146 | 136 |
| 187 | 166 | 174 | 151 |
| 160 | 157 | 201 | 168 |
| 167 | 147 | 198 | 179 |
| 176 | 145 | 148 | 129 |
| 185 | 168 | 154 | 131 |
| 206 | 180 | | |

FIGURE 31.6: jamovi output for the Captopril data

FIGURE 31.7: SPSS output for the Captopril data

Exercise 31.3 A study (A. M. Allen et al. 2018) examined the effect of exercise on smoking. Men and women were assessed on a range of measures, including the 'intention to smoke'. (This study was also seen in Exercise 23.3.)

'Intention to smoke', and other measures, were assessed both before and after exercise for each subject, using the 10-item quantitative Questionnaire of Smoking Urges -- Brief scale (Cox, Tiffany, and Christen 2001), and the quantitative Minnesota Nicotine Withdrawal Scale (Shiffman, West, and Gilbert 2004).

Smokers (defined as people smoking at least five cigarettes per day) aged 18 to 40 were enrolled for the study. For the 23 women in the study, the mean intention to smoke after exercise reduced by 0.66 (with a standard error of 0.37).

Perform a hypothesis test to determine if there is evidence of a population mean reduction in intention to smoke for women after exercising.

Exercise 31.4 In a study (Cressie, Sheffield, and Whitford 1984) conducted at the Adelaide Children's Hospital:

...a group of beta thalassemia patients [...] were treated by a continuous infusion of desferrioxamine, in order to reduce their ferritin content...

--- Cressie, Sheffield, and Whitford (1984), p. 107; emphasis added

Using the data (shown below), conduct a hypothesis test to determine if there is evidence that the treatment reduces the ferritin content, as intended.

Exercise 31.5 The concentration of beta-endorphins in the blood is a sign of stress. One study (Hand et al. (1996), Dataset 232; Hoaglin, Mosteller, and Tukey (2011)) measured the beta-endorphin concentration for 19 patients about to undergo surgery. (This study was also seen in Exercise 31.5.) The RQ was: "For patients approaching surgery, is there a mean increase in beta-endorphin concentrations?"

Each patient had their beta-endorphin concentrations measured 12--14 hours before surgery, and also 10 minutes before surgery. A numerical summary can be produced from jamovi output (Fig. 23.7). Use the output to test the RQ.