23 CIs for mean differences (paired data)

So far, you have learnt to ask a RQ, design a study, describe and summarise the data, understand the decision-making process and to work with probabilities. You have also been introduced to confidence intervals for proportions and means. In this chapter, you will learn to construct confidence intervals for mean differences (i.e., for paired data). You will learn to:

  • produce a confidence interval for a mean difference.
  • determine whether the conditions for using the confidence interval apply in a given situation.

23.1 Introduction

What happens to students' weight when students start at university? Among other changes, many students will be responsible for their own meals for the first time. Perhaps these students forgo healthy foods for more convenient, but less healthy, foods. This may mean that, on average, their weight increases.

One approach to studying this is to obtain some students' weights as they begin university, and then obtain some students' weight at some later time. If the same students have their weights measured at both times, we would be comparing within individuals (Sect. 2.3.4).

This would be a descriptive RQ: the Outcome is the mean change in weight, and the response variable is the weight change for each individual student. There is no between-individuals comparison.

Alternatively, the researchers could take a sample of students who are beginning university and measure their weight, and then take a different sample of students at some later time and measure their weight. This would be comparing between individuals (Sect. 2.3.4).

This would be a relational RQ: the Outcome is the mean weight, and the response variable is the weight for each student. The Comparison is between the units of analysis weighed at the start of university, and those weighed at some later time.

Either study is possible, and each has advantages and disadvantages. Here the first (within-individuals) design would seem superior (why?). In the first design, each student gets a pair of weight measurements, which produces paired data, the subject of this chapter. The second (between-individuals) design requires the means of two different groups of students to be compared, the topic of the next chapter.

Definition 23.1 (Paired data) Data are paired when two observations about the same variable are recorded for each unit of analysis. Paired data come from within-individuals comparisons.

Since each unit of analysis has two observations about weight, the change (or the difference) in weight can be computed for each student. Then, questions can be asked about the population mean difference, which is not the same as the difference between two separate population means (the subject of the next chapter). In paired data, finding the difference between the two measurements for each individual unit of analysis makes sense, since each unit of analysis (each student) has two related observations.

Which of these are paired situations?

  1. The blood pressure is recorded for 36 people, before and after taking a drug, to determine how much the average blood pressure reduces.
  2. The mean HDL cholesterol concentration is recorded for 22 males and 19 females, and the mean compared.
  3. The mean protein concentrations were compared in sea turtles before and after being rehabilitated to identify any changes (March et al. 2018).

Situations 1 and 3 are paired situations.

23.2 Mean differences

Prof. David Levitsky (Cornell University) obtained data to answer this (descriptive) RQ (D. A. Levitsky, Halbmaier, and Mrdjenovic (2004), D. Levitsky (n.d.)):

For Cornell University students, what is the mean weight gain in students after 12 weeks at university?

The parameter is \(\mu_d\), the population mean weight gain (in kg). The subscript \(d\) is because we are working with differences between the initial weight and the weight after 12 weeks. For the collected data (shown below) the same variable (weight) is measured twice for each unit of analysis (the student): the initial weight, and weight after 12 weeks.

Finding the gain in weight for each student seems sensible: each student has a Week 1 and Week 12 measurement. Once the differences are computed, the process for computing a CI is the same as in Chap. 22, where these changes (or differences) are treated as the data.

Be clear about how the differences are computed. Differences could be computed as Week 1 minus Week 12 (the weight loss), or Week 12 minus Week 1 (the weight gain).

Either is fine: provided you are consistent throughout, the meaning of any conclusions will be the same. Here, we use weight gain since it is explicitly mentioned in the RQ.

TABLE 23.1: The student weight-change data: The weight of students in Week 1 at university, in Week 12, and the weight gain (all in kg)
Student Week 1 After 12 weeks Weight gain
Student 1 77.0 75.6 -1.4
Student 2 49.5 50.0 0.5
Student 3 60.3 61.2 0.9
Student 4 51.8 53.6 1.8
Student 5 67.5 69.8 2.3
Student 6 46.8 47.7 0.9
Student 7 63.9 66.6 2.7
Student 8 54.0 55.8 1.8
Student 9 64.8 66.6 1.8
Student 10 70.2 69.3 -0.9
Student 11 51.3 51.3 0.0
Student 12 54.5 55.4 0.9
Student 13 54.9 56.7 1.8
Student 14 54.0 51.8 -2.2
Student 15 51.8 53.1 1.3
Student 16 49.5 50.9 1.4
Student 17 63.9 65.7 1.8
Student 18 57.1 57.1 0.0
Student 19 45.9 47.2 1.3
Student 20 56.2 56.2 0.0
Student 21 70.7 71.1 0.4
Student 22 53.6 56.7 3.1
Student 23 50.9 51.3 0.4
Student 24 54.0 57.6 3.6
Student 25 60.8 62.6 1.8
Student 26 66.6 67.5 0.9
Student 27 49.5 50.4 0.9
Student 28 72.0 73.4 1.4
Student 29 99.0 100.8 1.8
Student 30 59.4 59.9 0.5
Student 31 65.2 66.2 1.0
Student 32 63.5 63.5 0.0
Student 33 71.1 72.0 0.9
Student 34 60.8 60.3 -0.5
Student 35 66.6 67.5 0.9
Student 36 73.8 74.2 0.4
Student 37 61.6 62.1 0.5
Student 38 89.1 90.4 1.3
Student 39 54.9 55.8 0.9
Student 40 65.7 65.7 0.0
Student 41 67.5 68.0 0.5
Student 42 84.2 86.4 2.2
Student 43 42.3 43.2 0.9
Student 44 47.2 47.2 0.0
Student 45 57.1 58.5 1.4
Student 46 63.9 64.8 0.9
Student 47 63.0 64.4 1.4
Student 48 48.1 48.1 0.0
Student 49 46.8 47.2 0.4
Student 50 50.0 50.4 0.4
Student 51 72.0 72.9 0.9
Student 52 60.3 60.3 0.0
Student 53 68.0 68.0 0.0
Student 54 57.1 58.5 1.4
Student 55 47.7 48.6 0.9
Student 56 83.2 84.6 1.4
Student 57 56.2 57.6 1.4
Student 58 56.2 56.7 0.5
Student 59 69.8 71.1 1.3
Student 60 53.1 54.0 0.9
Student 61 67.0 67.5 0.5
Student 62 67.0 67.0 0.0
Student 63 54.9 54.5 -0.4
Student 64 69.8 71.1 1.3
Student 65 72.0 72.4 0.4
Student 66 51.8 53.6 1.8
Student 67 75.2 76.5 1.3
Student 68 59.0 59.0 0.0

Some individual weight gains are negative. This does not mean a negative weight, since the values are differences (specifically, weight gains). The differences are computed as Week 12 minus Week 1, so a negative value means that the Week 1 weight is greater than the Week 12 weight value: that is, a weight loss.
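As a minimal sketch of how the differences are formed, the following Python snippet computes the weight gain for the first five students in Table 23.1. The variable names are illustrative only; any software that subtracts one column from another does the same job.

```python
# First five rows of Table 23.1: (Week 1 weight, Week 12 weight), in kg
weights = [
    (77.0, 75.6),
    (49.5, 50.0),
    (60.3, 61.2),
    (51.8, 53.6),
    (67.5, 69.8),
]

# Weight gain = Week 12 minus Week 1.
# Rounding to one decimal place removes floating-point noise.
gains = [round(week12 - week1, 1) for week1, week12 in weights]
print(gains)  # [-1.4, 0.5, 0.9, 1.8, 2.3]

# Reversing the order (Week 1 minus Week 12) gives the weight loss:
# the same numbers with the opposite sign.
losses = [-gain for gain in gains]
```

Student 1 has a negative weight gain (equivalently, a positive weight loss), which matches the first row of the table.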

As always, begin by understanding the data, and produce appropriate graphical and numerical summaries.

What graphs would be suitable for displaying these data?

  • Boxplot

  • A histogram

  • A histogram of the differences (the weight gains) for each student

  • A case-profile plot

23.3 Defining notation

The notation used for paired data reflects that we work with the differences (Table 23.2). Apart from that, the notation is similar to that used in Chap. 22.

TABLE 23.2: The notation used for mean differences (paired data) compared to the notation used for one sample mean
One sample mean Mean of paired data
The observations: Values: \(x\) Differences: \(d\)
Sample mean: \(\bar{x}\) \(\bar{d}\)
Standard deviation: \(s\) \(s_d\)
Standard error of sample mean: \(\displaystyle\text{s.e.}(\bar{x}) = \frac{s}{\sqrt{n}}\) \(\displaystyle\text{s.e.}(\bar{d}) = \frac{s_d}{\sqrt{n}}\)
Sample size: Number of observations: \(n\) Number of differences: \(n\)

23.4 Summarising data

An appropriate graph is a histogram of the differences (Fig. 23.1). Graphing the Week 1 and Week 12 data may also be useful, but a graph of the differences is crucial, as the RQ is about the differences. A case-profile plot (Sect. 12.8.2) is also useful, but is difficult to read here as the sample size is large (recall, a case-profile plot contains a line for each unit of analysis).


FIGURE 23.1: A histogram of weight changes (the vertical grey line represents no weight gain)

Since the data are differences, a numerical summary must summarise the differences. Summarising the Week 1 and Week 12 data is useful too, but summarising the differences is crucial because the RQ is about the differences (see below). For the weight-gain data, the appropriate numerical summary for paired quantitative data summarises the differences using means, standard deviations, and so on, as appropriate.

A mean or a median may be appropriate for describing the data. However, the CI is about the mean of the data, and not about the data itself.

Since the sampling distribution for the sample mean (under certain conditions) has a symmetric normal distribution, the mean is appropriate for describing the sampling distribution.

A numerical summary of the weight gain (from a computer) gives the sample mean of the differences as \(\bar{d} = 0.8618\), and the standard deviation of the differences as \(s_d = 0.9563659\). A formal numerical summary table is shown in Table 23.3.

TABLE 23.3: The mean, median, standard deviation and IQR for the weight-gain data (in kg)
Mean Median Std dev IQR
Week 1 weight 61.24 60.3 10.97 14.02
Week 12 weight 62.10 60.3 11.07 14.10
Weight gain 0.86 0.9 0.96 1.00
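The summary of the differences can be reproduced with a few lines of Python. This is only a sketch using the standard library: the list repeats the Weight gain column of Table 23.1, and the names `gains`, `d_bar` and `s_d` are chosen here for illustration.

```python
from statistics import mean, stdev

# Weight gains (kg) for the 68 students, in the order listed in Table 23.1
gains = [
    -1.4, 0.5, 0.9, 1.8, 2.3, 0.9, 2.7, 1.8, 1.8, -0.9,
    0.0, 0.9, 1.8, -2.2, 1.3, 1.4, 1.8, 0.0, 1.3, 0.0,
    0.4, 3.1, 0.4, 3.6, 1.8, 0.9, 0.9, 1.4, 1.8, 0.5,
    1.0, 0.0, 0.9, -0.5, 0.9, 0.4, 0.5, 1.3, 0.9, 0.0,
    0.5, 2.2, 0.9, 0.0, 1.4, 0.9, 1.4, 0.0, 0.4, 0.4,
    0.9, 0.0, 0.0, 1.4, 0.9, 1.4, 1.4, 0.5, 1.3, 0.9,
    0.5, 0.0, -0.4, 1.3, 0.4, 1.8, 1.3, 0.0,
]

n = len(gains)        # number of differences
d_bar = mean(gains)   # sample mean of the differences
s_d = stdev(gains)    # sample standard deviation (divisor n - 1)

print(n, round(d_bar, 4), round(s_d, 4))  # 68 0.8618 0.9564
```

Note that `stdev` uses the sample (not population) formula, which is the one needed for \(s_d\).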

23.5 Describing sampling distribution

The study concerns the mean weight gain. Every possible sample of \(n = 68\) students comprises different students, and hence produces different Week 1 and Week 12 weights, and hence different weight gains. As a result, the sample mean weight gain will vary from sample to sample, so the sample mean difference has a sampling distribution, and a standard error.

Since the differences are like a single sample of data (Chap. 22), the sample mean difference \(\bar{d}\) has a sampling distribution similar to that of the mean of a single sample \(\bar{x}\) (provided the conditions are met; Sect. 23.7).

Definition 23.2 (Sampling distribution of a sample mean difference) The sampling distribution of a sample mean difference is described by:

  • an approximate normal distribution,
  • centred around the sampling mean whose value is the population mean difference \(\mu_d\),
  • with a standard deviation, called the standard error of the difference, of \(\displaystyle\text{s.e.}(\bar{d}) = \frac{s_d}{\sqrt{n}}\),

when certain conditions are met (Sect. 23.7), where \(n\) is the size of the sample, and \(s_d\) is the standard deviation of the individual differences in the sample.

For the weight-gain data, the variation in the sample mean differences \(\bar{d}\) is described by (Fig. 23.2):

  • an approximate normal distribution;
  • with a sampling mean whose value is \(\mu_{{d}}\);
  • with a standard error of \(\displaystyle\text{s.e.}(\bar{d}) = {0.9563659}\div{\sqrt{68}} = 0.1159764\).

Notice that many decimal places are used in the working here; results will be rounded when reported.
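The standard error calculation above can be checked with a short Python sketch. The variable names are illustrative; the values of \(s_d\) and \(n\) are those used in the working above.

```python
import math

s_d = 0.9563659  # standard deviation of the 68 weight gains (kg)
n = 68           # number of differences

# Standard error of the sample mean difference: s_d divided by sqrt(n)
se = s_d / math.sqrt(n)
print(round(se, 7))  # 0.1159764
```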


FIGURE 23.2: The sampling distribution is a normal distribution; it shows how the sample mean weight gain varies in samples of size \(n = 68\)

TABLE 23.4: The notation used for describing means, and the sampling distribution of the sample means
Quantity Description
Individual differences in the population Vary with mean \(\mu_d\) and standard deviation \(\sigma_d\)
Individual differences in a sample Vary with mean \(\bar{d}\) and standard deviation \(s_d\)
Sample means (\(\bar{d}\)) across all possible samples Vary with an approximate normal distribution (under certain conditions), with sampling mean \(\mu_{{d}}\) and standard deviation \(\text{s.e.}(\bar{d})\)

23.6 Computing confidence intervals

The CI for the mean difference has the same form as for a single mean (Chap. 22), so an approximate 95% confidence interval (CI) for \(\mu_d\) is
\[ \bar{d} \pm (2 \times\text{s.e.}(\bar{d})). \] This is the same as the CI for \(\bar{x}\) if the differences are treated like the data. For the weight-gain data:
\[ 0.8618 \pm (2 \times 0.1159764), \] or \(0.8618\pm 0.2319528\) (so the margin of error is \(0.232\)). Equivalently, the CI is from \(0.8618 - 0.23195 = 0.630\), up to \(0.8618 + 0.23195 = 1.094\). We write:

Based on the sample, an approximate 95% CI for the population mean weight gain between Week 1 and 12 is from \(0.63\)kg to \(1.09\)kg.

The 95% CI is saying that we are reasonably confident that, between Weeks 1 and 12, the population mean weight gain is between \(0.63\)kg and \(1.09\)kg. Alternatively, the plausible values for the mean weight gain in the population are between \(0.63\)kg and \(1.09\)kg.
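The approximate 95% CI for the weight-gain data can be sketched in Python as follows. The function name `approx_95_ci` is hypothetical, introduced only for this illustration; it applies the approximate multiplier of 2 from the 68--95--99.7 rule.

```python
def approx_95_ci(d_bar, se):
    """Approximate 95% CI: the estimate plus or minus 2 standard errors."""
    margin = 2 * se  # margin of error
    return d_bar - margin, d_bar + margin


# Values from the weight-gain data
lower, upper = approx_95_ci(d_bar=0.8618, se=0.1159764)
print(f"{lower:.2f} kg to {upper:.2f} kg")  # 0.63 kg to 1.09 kg
```

Working with many decimal places and rounding only at the end, as here, avoids accumulating rounding error.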

23.7 Statistical validity conditions

As with any inferential procedure, these results apply under certain conditions. The conditions under which the CI is statistically valid for paired data are similar to those for one sample mean, rephrased for differences.

The CI computed above is statistically valid if one of these conditions is true:

  1. The sample size of differences is at least 25; or
  2. The sample size of differences is smaller than 25, and the population of differences has an approximate normal distribution.

The sample size of 25 is a rough figure here, and some books give other (similar) values (such as 30). This condition ensures that the distribution of the sample means has an approximate normal distribution (so that, for example, the 68--95--99.7 rule can be used). Provided the sample size is at least about 25, this will be approximately true even if the distribution of the individuals in the population does not have a normal distribution. That is, when \(n \ge 25\) the sample means generally have an approximate normal distribution, even if the data themselves don't have a normal distribution.

Example 23.1 (Statistical validity) For the weight-gain data, the sample size is \(n = 68\), larger than 25, so the results are statistically valid. We do not require that the differences in the population follow a normal distribution.
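The two validity conditions can be written as a simple check. This is a sketch only: the function name and arguments are hypothetical, and the threshold of 25 is the rough figure used in this chapter, not a strict rule. Whether the population of differences is approximately normal is a judgement the user must supply.

```python
def ci_is_statistically_valid(n, population_approx_normal=False, threshold=25):
    """Return True if either validity condition for the CI is met.

    n: the number of differences in the sample.
    population_approx_normal: a judgement call, only needed for small samples.
    threshold: the rough sample size from the text (some books use 30).
    """
    return n >= threshold or population_approx_normal


print(ci_is_statistically_valid(68))                                 # True
print(ci_is_statistically_valid(15))                                 # False
print(ci_is_statistically_valid(15, population_approx_normal=True))  # True
```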

23.8 Using software

Software (such as jamovi or SPSS) can produce exact 95% CIs, which may be slightly different than the approximate 95% CI (since the 68--95--99.7 rule gives approximate multipliers). The approximate and exact 95% CIs are the same to two decimal places. From the jamovi or SPSS output (Fig. 23.3) we can write:

Based on the sample, an exact 95% CI for the population mean weight gain between Weeks 1 and 12 is from \(0.63\)kg to \(1.09\)kg.


FIGURE 23.3: The weight-gain data: jamovi output (top) and SPSS output (bottom)
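To see where an exact multiplier comes from, the sketch below finds the \(t\)-multiplier numerically using only the Python standard library, assuming the usual \(n - 1\) degrees of freedom for a mean of \(n\) differences. The helper functions are written for this illustration (they are not how jamovi or SPSS work internally); in practice, statistical software supplies the multiplier directly.

```python
import math


def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x, df, steps=1000):
    """P(T <= x) for x >= 0, using Simpson's rule on the interval [0, x]."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_multiplier(df, conf=0.95):
    """Two-sided multiplier for the given confidence level, by bisection."""
    target = 1 - (1 - conf) / 2  # 0.975 for a 95% CI
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Weight-gain data
d_bar, s_d, n = 0.8618, 0.9563659, 68
se = s_d / math.sqrt(n)
t_star = t_multiplier(df=n - 1)  # close to, but slightly less than, 2

lower, upper = d_bar - t_star * se, d_bar + t_star * se
print(f"multiplier: {t_star:.3f}")
print(f"exact 95% CI: {lower:.2f} kg to {upper:.2f} kg")
```

Because the multiplier is so close to 2 for a sample of this size, the exact and approximate intervals agree to two decimal places.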

23.9 Example: endangered species

A study of endangered species (Harnish and Nataraajan 2020, 1703) examined

...whether perceived physical attractiveness of a species impacted participants' attitudes toward supporting and protecting the species...

To do so, 210 undergraduate students were surveyed about 14 animals on various aspects of supporting and protecting them. Part of the data are summarised in Table 23.5, for two animals, when asked about 'support to protect the animal from illicit trade'. Larger values mean greater support for protecting the animal from illicit trade. (Notice that the standard deviation of the difference is not the difference between the two given values of the standard deviation.)

TABLE 23.5: The endangered-species data summary
Mean score Standard deviation
Bay Checkerspot Butterfly 3.10 1.06
Valley Elderberry Longhorn Beetle 2.33 1.13
Difference 0.77 1.07

The difference is defined as each student's score for the butterfly (deemed more attractive) minus their score for the beetle (deemed less attractive). A positive value therefore means more support (on average) for the butterfly.

The researchers (p. 1704) wished to test if

...animals perceived as more physically attractive [i.e., the butterfly] compared to those which are perceived as less physically attractive [i.e., the beetle] will receive relatively more support to prevent the species from illicit trade

The parameter is \(\mu_d\), the population mean difference. The mean difference is \(\bar{d} = 0.77\) and \(s_d = 1.07\). The value of \(\bar{d}\) will vary from sample to sample, so has a standard error:

\[ \text{s.e.}(\bar{d}) = \frac{s_d}{\sqrt{n}} = \frac{1.07}{\sqrt{210}} = 0.073837. \] Using an approximate multiplier of 2, the margin of error is \(2 \times 0.073837 = 0.147674\), so an approximate 95% CI for the difference in support in the population is \(0.77\pm 0.147674\), or from \(0.62\) to \(0.92\). We write:

From the sample, an approximate 95% CI for the mean difference in support for protecting the species from illicit trade is from 0.62 to 0.92, with greater support for the Bay Checkerspot Butterfly than for the Valley Elderberry Longhorn Beetle (mean difference: 0.77; standard deviation: 1.07; \(n = 210\) undergraduate students).
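When only summary statistics are available, as in Table 23.5, the CI can still be computed. The sketch below (with a hypothetical function name) starts from \(\bar{d}\), \(s_d\) and \(n\). It also shows that subtracting the two standard deviations in the table does not give the standard deviation of the differences.

```python
import math


def approx_95_ci_from_summary(d_bar, s_d, n):
    """Return (standard error, lower limit, upper limit), using a multiplier of 2."""
    se = s_d / math.sqrt(n)
    margin = 2 * se
    return se, d_bar - margin, d_bar + margin


# Summary values from Table 23.5
se, lower, upper = approx_95_ci_from_summary(d_bar=0.77, s_d=1.07, n=210)
print(f"s.e. = {se:.6f}")                     # s.e. = 0.073837
print(f"CI: {lower:.2f} to {upper:.2f}")      # CI: 0.62 to 0.92

# Subtracting the two standard deviations is NOT the same as s_d
naive = round(1.06 - 1.13, 2)
print(naive)  # -0.07, nothing like s_d = 1.07
```

The standard deviation of the differences must be computed from the individual differences themselves, which is why the researchers report it separately.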

23.10 Example: blood pressure

A US study (Willems et al. 1997; Schorling et al. 1997) examined how CHD risk factors were assessed among parts of the population with diabetes. Subjects reported to the clinic on multiple occasions. Consider this RQ:

What is the mean difference in diastolic blood pressure from the first to the second visit?

Each person has a pair of diastolic blood pressure (DBP) measurements: One each from their first and second visits. The data (shown below) are from the 141 people for whom both measurements are available (some data are missing). The differences could be computed as:

  • The first visit DBP minus the second visit DBP: the reduction in DBP; or
  • The second visit DBP minus the first visit DBP: the increase in DBP.

Either way is fine, provided the order is used consistently. Here, the first visit DBP minus the second visit DBP will be used, so that the differences represent the reduction in DBP from the first to second visit. The parameter is \(\mu_d\), the population mean reduction in DBP.

Since the dataset is large, the appropriate graphical summary is a histogram of differences (Fig. 23.4). The numerical summary can summarise both the first and second visit observations, but must summarise the differences. Numerical summaries can be computed using software, then reported in a suitable table (Table 23.6).

Histogram of the decrease in DBP between the first and second visits

FIGURE 23.4: Histogram of the decrease in DBP between the first and second visits

TABLE 23.6: The numerical summary for the diabetes data (in mm Hg). The differences are the first visit value minus the second visit value: the decreases in diastolic blood pressure from the first to second visit
Mean Standard deviation Standard error Sample size
DBP: First visit 94.48 11.473 0.966 141
DBP: Second visit 92.52 11.555 0.973 141
Decrease in DBP 1.95 8.026 0.676 141

The standard error of the sample mean difference is

\[ \text{s.e.}(\bar{d})=\frac{s_d}{\sqrt{n}} = \frac{8.02614}{\sqrt{141}} = 0.67592. \] Using an approximate multiplier of 2, the margin of error is \(2 \times 0.67592 = 1.3518\), so an approximate 95% CI for the decrease in DBP is \(1.9504\pm 1.3518\), or from \(0.60\) to \(3.30\) mm Hg, after rounding sensibly. We write:

Based on the sample, an approximate 95% CI for the mean decrease in DBP is from \(0.60\) to \(3.30\) mm Hg.

The exact 95% CI from jamovi or SPSS (Fig. 23.5), using an exact \(t\)-multiplier rather than an approximate multiplier of 2, is similar since the sample size is large. After rounding, write:

Based on the sample, an exact 95% CI for the decrease in DBP is from \(0.61\) to \(3.29\) mm Hg.

The wording ('for the decrease in DBP') implies which reading is the higher reading on average: the first.


FIGURE 23.5: jamovi output (top) and SPSS output (bottom) for the blood pressure data, including the exact 95% CI

Be clear in your conclusion about how the differences are computed.
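To illustrate why the direction matters, this sketch computes the approximate 95% CI for the blood pressure data twice: once for the decrease (first visit minus second visit) and once for the increase (second visit minus first visit). The function name is hypothetical. Reversing the order changes the sign of every difference, so the mean changes sign while the standard deviation (and hence the standard error) is unchanged.

```python
import math


def approx_95_ci(estimate, se):
    """Approximate 95% CI: the estimate plus or minus 2 standard errors."""
    margin = 2 * se
    return estimate - margin, estimate + margin


# Decrease in DBP (first visit minus second visit), in mm Hg
d_bar, s_d, n = 1.9504, 8.02614, 141
se = s_d / math.sqrt(n)

decrease_ci = approx_95_ci(d_bar, se)   # differences as first minus second
increase_ci = approx_95_ci(-d_bar, se)  # same data, as second minus first

print("decrease: {:.2f} to {:.2f}".format(*decrease_ci))  # decrease: 0.60 to 3.30
print("increase: {:.2f} to {:.2f}".format(*increase_ci))  # increase: -3.30 to -0.60
```

Both intervals describe the same finding; a CI of \(0.60\) to \(3.30\) only makes sense to a reader who knows it refers to a decrease.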

The CI is statistically valid as the sample size is larger than 25. (The data do not need to follow a normal distribution.)

Is there a mean difference in DBP in the population?

Be careful: The RQ is about the mean difference in the population... but we only have the mean difference from one of the many possible samples. So it is difficult to be certain.

23.11 Quick review questions

  1. True or false: For paired data, the mean of the differences is treated like the mean of a single variable.
  2. True or false: The appropriate graph for displaying paired data is often a histogram of the differences.
  3. True or false: The population mean difference is denoted by \(\mu_d\).
  4. True or false: The standard error of the sample mean difference is denoted by \(s_d\).

23.12 Exercises

Selected answers are available in Sect. D.22.

Exercise 23.1 People often struggle to eat the recommended intake of vegetables. In one study exploring ways to increase vegetable intake in teens (Fritts et al. 2018), teens rated the taste of raw broccoli, and raw broccoli served with a specially-made dip.

Each teen (\(n = 101\)) had a pair of measurements: the taste rating of the broccoli with and without dip. Taste was assessed using a '100 mm visual analog scale', where a higher score means a better taste. In summary:

  • For raw broccoli, the mean taste rating was \(56.0\) (with a standard deviation of \(26.6\));
  • For raw broccoli served with dip, the mean taste rating was \(61.2\) (with a standard deviation of \(28.7\)).

Because the data are paired, the differences are the best way to describe the data. The mean difference in the ratings was \(5.2\), with \(\text{s.e.}(\bar{d}) = 3.06\). From this information:

  1. Construct a suitable numerical summary table.
  2. Compute the approximate 95% CI for the mean difference in taste ratings.

Exercise 23.2 In a study of hypertension (Hand et al. 1996; MacGregor et al. 1979), 15 patients were given a drug (Captopril) and their systolic blood pressure measured (in mm Hg) immediately before and two hours after being given the drug (Table 23.7).

  1. Explain why it is sensible to compute differences as the Before minus the After measurements. What do the differences mean when computed this way?
  2. Compute the differences.
  3. Compute an approximate 95% CI for the mean difference.
  4. Write down the exact 95% CI using the computer output (Fig. 23.6).
  5. Why are the two CIs different?
TABLE 23.7: The Captopril data: before and after systolic blood pressures (in mm Hg)
Before After Before After
210 201 173 147
169 165 146 136
187 166 174 151
160 157 201 168
167 147 198 179
176 145 148 129
185 168 154 131
206 180

FIGURE 23.6: jamovi (top) and SPSS (bottom) output for the Captopril data

Exercise 23.3 A study (Allen et al. 2018) examined the effect of exercise on smoking. Men and women were assessed on a range of measures, including the 'intention to smoke'.

'Intention to smoke' was assessed both before and after exercise for each subject, using two quantitative questionnaires. Smokers (people smoking at least five cigarettes per day) aged 18 to 40 were enrolled for the study. For the 23 women in the study, the mean intention to smoke after exercise reduced by 0.66 (with a standard error of 0.37).

  1. Find a 95% confidence interval for the population mean reduction in intention to smoke for women after exercising.
  2. Is this CI statistically valid?

Exercise 23.4 Young girls (\(n = 29\)) with anorexia received cognitive behavioural treatment (Hand et al. (1996), Dataset 285), and their weight before and after treatment were recorded. In summary:

  • Before the treatment, the mean weight was 82.69 pounds (\(s = 4.85\) pounds);
  • After the treatment, the mean weight was 85.7 pounds (\(s = 8.35\) pounds).

If the standard deviation of the weight loss was 7.31 pounds, find a 95% CI for the population mean weight loss. Do you think the treatment had any impact on the mean weight of the girls?

Exercise 23.5 The concentration of beta-endorphins in the blood is a sign of stress. One study (Hand et al. (1996), Dataset 232; Hoaglin, Mosteller, and Tukey (2011)) measured the beta-endorphin concentration for 19 patients about to undergo surgery.

Each patient had their beta-endorphin concentrations measured 12--14 hours before surgery, and also 10 minutes before surgery. A numerical summary can be produced from jamovi output (Table 23.8).

TABLE 23.8: The numerical summary for the presurgical stress data
Mean Std deviation Std error Sample size
12--14 hours before surgery 8.35 4.397 1.009 19
10 minutes before surgery 16.05 12.509 2.870 19
Increase 7.70 13.519 3.102 19
  1. Use the jamovi output in Fig. 23.7 to construct an approximate 95% CI for the increase in stress as surgery gets closer.
  2. Use the jamovi output in Fig. 23.7 to write down the exact 95% CI for the increase in stress as surgery gets closer.
  3. Is the CI likely to be statistically valid?

FIGURE 23.7: jamovi output for the surgery-stress data

Exercise 23.6 A study of \(n = 213\) Spanish health students (Romero-Blanco et al. 2020) measured (among other things) the number of minutes of vigorous physical activity (PA) performed by students before and during the COVID-19 lockdown (from March to April 2020 in Spain). Since the before-lockdown and during-lockdown values were both measured on each participant, the data are paired (within individuals). The data are summarised in Table 23.9.

  1. Explain what the differences mean.
  2. Compute the standard error of the differences.
  3. Compute the approximate 95% CI, and interpret what it means.
TABLE 23.9: Summary information for the COVID-lockdown exercise data
Mean (mins) Std dev (mins)
Before 28.47 54.13
During 30.66 30.04
Increase 2.68 51.30