23 CIs for mean differences (paired data)
So far, you have learnt to ask a RQ, design a study, describe and summarise the data, understand the decision-making process and to work with probabilities. You have also been introduced to confidence intervals for proportions and means. In this chapter, you will learn to construct confidence intervals for mean differences (i.e., for paired data). You will learn to:
- produce a confidence interval for a mean difference.
- determine whether the conditions for using the confidence interval apply in a given situation.

23.1 Introduction

What happens to students' weight when students start at university? Among other changes, many students will be responsible for their own meals for the first time. Perhaps these students forgo healthy foods for more convenient, but less healthy, foods. This may mean that, on average, their weight increases.
One approach to studying this is to obtain some students' weights as they begin university, and then obtain some students' weight at some later time. If the same students have their weights measured at both times, we would be comparing within individuals (Sect. 2.3.4).
This would be a descriptive RQ: the Outcome is the mean change in weight, and the response variable is the weight change for each individual student. There is no between-individuals comparison.
Alternatively, the researchers could take a sample of students who are beginning university and measure their weight, and then take a different sample of students at some later time and measure their weight. This would be comparing between individuals (Sect. 2.3.4).
This would be a relational RQ: the Outcome is the mean weight, and the response variable is the weight for each student. The Comparison is between the units of analysis weighed at the start of university, and those weighed at some later time.
Either study is possible, and each has advantages and disadvantages. Here the first (within-individuals) design would seem superior (why?). In the first design, each student gets a pair of weight measurements, and produces paired data, which is the subject of this chapter. The second (between-individuals) design requires the means of two different groups of students to be compared, the topic of the next chapter.
Definition 23.1 (Paired data) Data are paired when two observations about the same variable are recorded for each unit of analysis. Paired data come from within-individuals comparisons.
Since each unit of analysis has two observations about weight, the change (or the difference) in weight can be computed for each student. Then, questions can be asked about the population mean difference, which is not the same as the difference between two separate population means (the subject of the next chapter). In paired data, finding the difference between the two measurements for each individual unit of analysis makes sense, since each unit of analysis (each student) has two related observations.
Which of these are paired situations?
- The blood pressure is recorded for 36 people, before and after taking a drug, to determine how much the average blood pressure reduces.
- The mean HDL cholesterol concentration is recorded for 22 males and 19 females, and the mean compared.
- The mean protein concentrations were compared in sea turtles before and after being rehabilitated to identify any changes (March et al. 2018).
Situations 1 and 3 are paired situations.
23.2 Mean differences
Prof. David Levitsky (Cornell University) obtained data to answer this (descriptive) RQ (D. A. Levitsky, Halbmaier, and Mrdjenovic (2004), D. Levitsky (n.d.)):
For Cornell University students, what is the mean weight gain in students after 12 weeks at university?
The parameter is \(\mu_d\), the population mean weight gain (in kg). The subscript \(d\) is because we are working with differences between the initial weight and the weight after 12 weeks. For the collected data (shown below) the same variable (weight) is measured twice for each unit of analysis (the student): the initial weight, and weight after 12 weeks.
Finding the gain in weight for each student seems sensible: each student has a Week 1 and Week 12 measurement. Once the differences are computed, the process for computing a CI is the same as in Chap. 22, where these changes (or differences) are treated as the data.
Be clear about how the differences are computed. Differences could be computed as Week 1 minus Week 12 (the weight loss), or Week 12 minus Week 1 (the weight gain).
Either is fine: provided you are consistent throughout, the meaning of any conclusions will be the same. Here, we use weight gain since it is explicitly mentioned in the RQ.
Student | Week 1 | After 12 weeks | Weight gain |
---|---|---|---|
Student 1 | 77.0 | 75.6 | -1.4 |
Student 2 | 49.5 | 50.0 | 0.5 |
Student 3 | 60.3 | 61.2 | 0.9 |
Student 4 | 51.8 | 53.6 | 1.8 |
Student 5 | 67.5 | 69.8 | 2.3 |
Student 6 | 46.8 | 47.7 | 0.9 |
Student 7 | 63.9 | 66.6 | 2.7 |
Student 8 | 54.0 | 55.8 | 1.8 |
Student 9 | 64.8 | 66.6 | 1.8 |
Student 10 | 70.2 | 69.3 | -0.9 |
Student 11 | 51.3 | 51.3 | 0.0 |
Student 12 | 54.5 | 55.4 | 0.9 |
Student 13 | 54.9 | 56.7 | 1.8 |
Student 14 | 54.0 | 51.8 | -2.2 |
Student 15 | 51.8 | 53.1 | 1.3 |
Student 16 | 49.5 | 50.9 | 1.4 |
Student 17 | 63.9 | 65.7 | 1.8 |
Student 18 | 57.1 | 57.1 | 0.0 |
Student 19 | 45.9 | 47.2 | 1.3 |
Student 20 | 56.2 | 56.2 | 0.0 |
Student 21 | 70.7 | 71.1 | 0.4 |
Student 22 | 53.6 | 56.7 | 3.1 |
Student 23 | 50.9 | 51.3 | 0.4 |
Student 24 | 54.0 | 57.6 | 3.6 |
Student 25 | 60.8 | 62.6 | 1.8 |
Student 26 | 66.6 | 67.5 | 0.9 |
Student 27 | 49.5 | 50.4 | 0.9 |
Student 28 | 72.0 | 73.4 | 1.4 |
Student 29 | 99.0 | 100.8 | 1.8 |
Student 30 | 59.4 | 59.9 | 0.5 |
Student 31 | 65.2 | 66.2 | 1.0 |
Student 32 | 63.5 | 63.5 | 0.0 |
Student 33 | 71.1 | 72.0 | 0.9 |
Student 34 | 60.8 | 60.3 | -0.5 |
Student 35 | 66.6 | 67.5 | 0.9 |
Student 36 | 73.8 | 74.2 | 0.4 |
Student 37 | 61.6 | 62.1 | 0.5 |
Student 38 | 89.1 | 90.4 | 1.3 |
Student 39 | 54.9 | 55.8 | 0.9 |
Student 40 | 65.7 | 65.7 | 0.0 |
Student 41 | 67.5 | 68.0 | 0.5 |
Student 42 | 84.2 | 86.4 | 2.2 |
Student 43 | 42.3 | 43.2 | 0.9 |
Student 44 | 47.2 | 47.2 | 0.0 |
Student 45 | 57.1 | 58.5 | 1.4 |
Student 46 | 63.9 | 64.8 | 0.9 |
Student 47 | 63.0 | 64.4 | 1.4 |
Student 48 | 48.1 | 48.1 | 0.0 |
Student 49 | 46.8 | 47.2 | 0.4 |
Student 50 | 50.0 | 50.4 | 0.4 |
Student 51 | 72.0 | 72.9 | 0.9 |
Student 52 | 60.3 | 60.3 | 0.0 |
Student 53 | 68.0 | 68.0 | 0.0 |
Student 54 | 57.1 | 58.5 | 1.4 |
Student 55 | 47.7 | 48.6 | 0.9 |
Student 56 | 83.2 | 84.6 | 1.4 |
Student 57 | 56.2 | 57.6 | 1.4 |
Student 58 | 56.2 | 56.7 | 0.5 |
Student 59 | 69.8 | 71.1 | 1.3 |
Student 60 | 53.1 | 54.0 | 0.9 |
Student 61 | 67.0 | 67.5 | 0.5 |
Student 62 | 67.0 | 67.0 | 0.0 |
Student 63 | 54.9 | 54.5 | -0.4 |
Student 64 | 69.8 | 71.1 | 1.3 |
Student 65 | 72.0 | 72.4 | 0.4 |
Student 66 | 51.8 | 53.6 | 1.8 |
Student 67 | 75.2 | 76.5 | 1.3 |
Student 68 | 59.0 | 59.0 | 0.0 |
Some individual weight gains are negative. This does not mean a negative weight, since the values are differences (specifically, weight gains). The differences are computed as Week 12 minus Week 1, so a negative value means that the Week 1 weight is greater than the Week 12 weight value: that is, a weight loss.
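The differences can be computed directly from the paired columns. As a minimal sketch in Python (using only the first five students from the table above; the variable names are illustrative), each weight gain is the Week 12 weight minus the Week 1 weight for the same student:

```python
# Week 1 and Week 12 weights (kg) for the first five students in the table.
week1 = [77.0, 49.5, 60.3, 51.8, 67.5]
week12 = [75.6, 50.0, 61.2, 53.6, 69.8]

# Weight gain = Week 12 minus Week 1, computed within each student.
# Rounding to one decimal place removes floating-point noise.
gains = [round(after - before, 1) for before, after in zip(week1, week12)]

for student, gain in enumerate(gains, start=1):
    if gain < 0:
        meaning = "weight loss"
    elif gain > 0:
        meaning = "weight gain"
    else:
        meaning = "no change"
    print(f"Student {student}: {gain:+.1f} kg ({meaning})")
```

Pairing the two lists with `zip` keeps each subtraction within one student, which is what makes the data paired.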
As always, begin by understanding the data, and produce appropriate graphical and numerical summaries.
What graphs would be suitable for displaying these data?
- Boxplot
- A histogram
- A histogram of the differences (the weight gains) for each student
- A case-profile plot
23.3 Defining notation
The notation used for paired data reflects that we work with the differences (Table 23.2). Apart from that, the notation is similar to that used in Chap. 22.
Quantity | One sample mean | Mean of paired data |
---|---|---|
The observations: | Values: \(x\) | Differences: \(d\) |
Sample mean: | \(\bar{x}\) | \(\bar{d}\) |
Standard deviation: | \(s\) | \(s_d\) |
Standard error of sample mean: | \(\displaystyle\text{s.e.}(\bar{x}) = \frac{s}{\sqrt{n}}\) | \(\displaystyle\text{s.e.}(\bar{d}) = \frac{s_d}{\sqrt{n}}\) |
Sample size: | Number of observations: \(n\) | Number of differences: \(n\) |
23.4 Summarising data
An appropriate graph is a histogram of the differences (Fig. 23.1). Graphing the Week 1 and Week 12 data may be useful too, but a graph of the differences is crucial, as the RQ is about the differences. A case-profile plot (Sect. 12.8.2) is also useful, but is difficult to read here as the sample size is large (recall, a case-profile plot contains a line for each unit of analysis).

FIGURE 23.1: A histogram of weight changes (the vertical grey line represents no weight gain)
Since the data are differences, a numerical summary must summarise the differences. Summarising the Week 1 and Week 12 data is useful too, but summarising the differences is crucial because the RQ is about the differences (see below). For the weight-gain data, the appropriate numerical summary for paired quantitative data summarises the differences using means, standard deviations, and so on, as appropriate.
A mean or a median may be appropriate for describing the data. However, the CI is about the mean of the data, and not about the data itself.
Since the sampling distribution for the sample mean (under certain conditions) has a symmetric normal distribution, the mean is appropriate for describing the sampling distribution.
A numerical summary of the weight gain (from a computer) gives the sample mean of the differences as \(\bar{d} = 0.8618\), and the standard deviation of the differences as \(s_d = 0.9563659\). A formal numerical summary table is shown in Table 23.3.
Variable (kg) | Mean | Median | Std dev | IQR |
---|---|---|---|---|
Week 1 weight | 61.24 | 60.3 | 10.97 | 14.02 |
Week 12 weight | 62.10 | 60.3 | 11.07 | 14.10 |
Weight gain | 0.86 | 0.9 | 0.96 | 1.00 |
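
The summary of the differences can be reproduced from the weight-gain column of the data table. The following is a sketch only, using Python's standard `statistics` module rather than the software used in this chapter; the list simply transcribes the 68 weight gains in the order given.

```python
import statistics

# Weight gains (kg) for the 68 students, in the order given in the data table.
gains = [
    -1.4, 0.5, 0.9, 1.8, 2.3, 0.9, 2.7, 1.8, 1.8, -0.9,
    0.0, 0.9, 1.8, -2.2, 1.3, 1.4, 1.8, 0.0, 1.3, 0.0,
    0.4, 3.1, 0.4, 3.6, 1.8, 0.9, 0.9, 1.4, 1.8, 0.5,
    1.0, 0.0, 0.9, -0.5, 0.9, 0.4, 0.5, 1.3, 0.9, 0.0,
    0.5, 2.2, 0.9, 0.0, 1.4, 0.9, 1.4, 0.0, 0.4, 0.4,
    0.9, 0.0, 0.0, 1.4, 0.9, 1.4, 1.4, 0.5, 1.3, 0.9,
    0.5, 0.0, -0.4, 1.3, 0.4, 1.8, 1.3, 0.0,
]

n = len(gains)                       # number of differences
d_bar = statistics.mean(gains)       # sample mean difference
s_d = statistics.stdev(gains)        # sample standard deviation (divisor n - 1)
median = statistics.median(gains)    # sample median difference

print(f"n = {n}, mean = {d_bar:.2f}, median = {median:.1f}, sd = {s_d:.2f}")
```

The mean, median and standard deviation agree with the 'Weight gain' row of Table 23.3 after rounding.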
23.5 Describing sampling distribution
The study concerns the mean weight gain. Every possible sample of \(n = 68\) students comprises different students, and hence produces different Week 1 and Week 12 weights, and hence different weight gains. As a result, the sample mean weight gain will vary from sample to sample, so the mean differences have a sampling distribution, and a standard error.
Since the differences are like a single sample of data (Chap. 22), the sample mean difference \(\bar{d}\) has a sampling distribution similar to that of the mean of a single sample \(\bar{x}\) (provided the conditions are met; Sect. 23.7).
Definition 23.2 (Sampling distribution of a sample mean difference) The sampling distribution of a sample mean difference is described by:
- an approximate normal distribution,
- centred around the sampling mean whose value is the population mean difference \(\mu_d\),
- with a standard deviation, called the standard error of the difference, of \(\displaystyle\text{s.e.}(\bar{d}) = \frac{s_d}{\sqrt{n}}\),
when certain conditions are met (Sect. 23.7), where \(n\) is the size of the sample, and \(s_d\) is the standard deviation of the individual differences in the sample.
For the weight-gain data, the variation in the sample mean differences \(\bar{d}\) is described by (Fig. 23.2):
- an approximate normal distribution;
- with a sampling mean whose value is \(\mu_{{d}}\);
- with a standard error of \(\displaystyle\text{s.e.}(\bar{d}) = {0.9563659}\div{\sqrt{68}} = 0.1159764\).
Notice that many decimal places are used in the working here; results will be rounded when reported.
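As a small sketch of the arithmetic above (Python is used here purely for illustration; the variable names are not from the original analysis), the standard error is the standard deviation of the differences divided by the square root of the number of differences:

```python
import math

s_d = 0.9563659   # sample standard deviation of the weight gains (kg)
n = 68            # number of differences (one per student)

# Standard error of the sample mean difference.
se = s_d / math.sqrt(n)

print(f"s.e.(d-bar) = {se:.7f} kg")
```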

FIGURE 23.2: The sampling distribution is a normal distribution; it shows how the sample mean weight gain varies in samples of size \(n = 68\)
Quantity | Description |
---|---|
Individual differences in the population | Vary with mean \(\mu_d\) and standard deviation \(\sigma_d\) |
Individual differences in a sample | Vary with mean \(\bar{d}\) and standard deviation \(s_d\) |
Sample mean differences (\(\bar{d}\)) across all possible samples | Vary with an approximate normal distribution (under certain conditions), with sampling mean \(\mu_d\) and standard deviation \(\text{s.e.}(\bar{d})\) |
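
The idea that \(\bar{d}\) varies from sample to sample can be explored with a short simulation. This is a sketch under a loud assumption: the population of weight gains is hypothetical, taken to be normal with mean 0.86 kg and standard deviation 0.96 kg (values chosen only to resemble the sample). Many samples of size 68 are drawn, and the spread of the resulting sample means is compared with \(\sigma_d/\sqrt{n}\).

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# ASSUMPTION: a hypothetical normal population of weight gains (kg).
MU_D, SIGMA_D = 0.86, 0.96
N = 68              # size of each sample
NUM_SAMPLES = 2000  # number of simulated samples

sample_means = []
for _ in range(NUM_SAMPLES):
    sample = [random.gauss(MU_D, SIGMA_D) for _ in range(N)]
    sample_means.append(statistics.fmean(sample))

centre = statistics.fmean(sample_means)   # should be close to MU_D
spread = statistics.stdev(sample_means)   # should be close to the standard error
theory = SIGMA_D / N ** 0.5

print(f"Mean of the sample means: {centre:.3f} (population mean {MU_D})")
print(f"Std dev of the sample means: {spread:.3f} (sigma / sqrt(n) = {theory:.3f})")
```

The simulated sample means cluster around the population mean difference, and their standard deviation is close to the standard error, as the table above describes.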
23.6 Computing confidence intervals
The CI for the mean difference has the same form as for a single mean (Chap. 22), so an approximate 95% confidence interval (CI) for \(\mu_d\) is
\[
\bar{d} \pm (2 \times\text{s.e.}(\bar{d})).
\]
This is the same as the CI for a single mean (based on \(\bar{x}\)) if the differences are treated like the data.
For the weight-gain data:
\[
0.8618 \pm (2 \times 0.1159764),
\]
or \(0.8618\pm 0.2319528\) (so the margin of error is \(0.232\)).
Equivalently, the CI is from \(0.8618 - 0.23195 = 0.630\), up to \(0.8618 + 0.23195 = 1.094\).
We write:
Based on the sample, an approximate 95% CI for the population mean weight gain between Week 1 and 12 is from \(0.63\)kg to \(1.09\)kg.
The 95% CI is saying that we are reasonably confident that, between Weeks 1 and 12, the population mean weight gain is between \(0.63\)kg and \(1.09\)kg. Alternatively, the plausible values for the mean weight gain in the population are between \(0.63\)kg and \(1.09\)kg.
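The calculation of the approximate CI can be sketched in a few lines of Python (an illustration only, using the values reported in this section and the approximate multiplier of 2):

```python
d_bar = 0.8618     # sample mean weight gain (kg)
se = 0.1159764     # standard error of the sample mean difference (kg)
multiplier = 2     # approximate 95% multiplier (68-95-99.7 rule)

margin = multiplier * se                       # margin of error
lower, upper = d_bar - margin, d_bar + margin  # limits of the interval

print(f"Margin of error: {margin:.3f} kg")
print(f"Approximate 95% CI: {lower:.2f} kg to {upper:.2f} kg")
```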
23.7 Statistical validity conditions
As with any inferential procedure, these results apply under certain conditions. The conditions under which the CI is statistically valid for paired data are similar to those for one sample mean, rephrased for differences.
The CI computed above is statistically valid if one of these conditions is true:
- The sample size of differences is at least 25; or
- The sample size of differences is smaller than 25, and the population of differences has an approximate normal distribution.
The sample size of 25 is a rough figure here, and some books give other (similar) values (such as 30). This condition ensures that the distribution of the sample means has an approximate normal distribution (so that, for example, the 68--95--99.7 rule can be used). Provided the sample size is larger than about 25, this will be approximately true even if the distribution of the individuals in the population does not have a normal distribution. That is, when \(n > 25\) the sample means generally have an approximate normal distribution, even if the data themselves don't have a normal distribution.
Example 23.1 (Statistical validity) For the weight-gain data, the sample size is \(n = 68\), larger than 25, so the results are statistically valid. We do not require that the differences in the population follow a normal distribution.
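The two conditions can be written as a simple check. This is a sketch only: the function name and arguments are hypothetical, the threshold of 25 is the rough figure used in this section, and whether a population is approximately normal is a judgement that the reader must supply.

```python
def ci_is_statistically_valid(n_differences, population_approx_normal=False,
                              threshold=25):
    """Return True if either validity condition from this section holds.

    n_differences: number of differences in the sample.
    population_approx_normal: the reader's judgement about whether the
        population of differences is approximately normal (only matters
        when the sample is smaller than the threshold).
    threshold: rough sample-size guide (25 here; some books use 30).
    """
    if n_differences >= threshold:
        return True
    return population_approx_normal


print(ci_is_statistically_valid(68))                                 # weight-gain data
print(ci_is_statistically_valid(15))                                 # small sample, normality unknown
print(ci_is_statistically_valid(15, population_approx_normal=True))  # small sample, normal population
```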
23.8 Using software
Software (such as jamovi or SPSS) can produce exact 95% CIs, which may be slightly different than the approximate 95% CI (since the 68--95--99.7 rule gives approximate multipliers). The approximate and exact 95% CIs are the same to two decimal places. From the jamovi or SPSS output (Fig. 23.3) we can write:
Based on the sample, a 95% CI for the population mean weight gain between Weeks 1 and 12 is from \(0.63\)kg to \(1.09\)kg.


FIGURE 23.3: The weight-gain data: jamovi output (top) and SPSS output (bottom)
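Exact CIs of this kind replace the approximate multiplier of 2 with a multiplier from a \(t\)-distribution with \(n - 1\) degrees of freedom. The sketch below is not how jamovi or SPSS are implemented; it is a self-contained illustration that finds the multiplier numerically (Simpson's rule for the area under the \(t\) density, then bisection), using only Python's standard library. All function names are hypothetical.

```python
import math


def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x, df, steps=2000):
    """P(T <= x) for x >= 0, using Simpson's rule on [0, x] (steps is even)."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_multiplier(df, level=0.95):
    """Find the multiplier t* with P(-t* < T < t*) = level, by bisection."""
    target = 1 - (1 - level) / 2
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


d_bar, se, n = 0.8618, 0.1159764, 68   # weight-gain data
t_star = t_multiplier(n - 1)
lower, upper = d_bar - t_star * se, d_bar + t_star * se

print(f"t-multiplier with {n - 1} df: {t_star:.3f}")
print(f"95% CI: {lower:.2f} kg to {upper:.2f} kg")
```

For this sample size the multiplier is very close to 2, which is why the approximate and exact CIs agree to two decimal places.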
23.9 Example: endangered species
A study of endangered species (Harnish and Nataraajan 2020, 1703) examined
...whether perceived physical attractiveness of a species impacted participants' attitudes toward supporting and protecting the species...
To do so, 210 undergraduate students were surveyed about 14 animals on various aspects of supporting and protecting them. Part of the data are summarised in Table 23.5, for two animals, when asked about 'support to protect the animal from illicit trade'. Larger values mean greater support for protecting the animal from illicit trade. (Notice that the standard deviation of the difference is not the difference between the two given values of the standard deviation.)
Animal | Mean score | Standard deviation |
---|---|---|
Bay Checkerspot Butterfly | 3.10 | 1.06 |
Valley Elderberry Longhorn Beetle | 2.33 | 1.13 |
Difference | 0.77 | 1.07 |
The difference is defined as each student's score for the butterfly (deemed more attractive) minus their score for the beetle (deemed less attractive). A positive value therefore means more support (on average) for the butterfly.
The researchers (p. 1704) wished to test if
...animals perceived as more physically attractive [i.e., the butterfly] compared to those which are perceived as less physically attractive [i.e., the beetle] will receive relatively more support to prevent the species from illicit trade
The parameter is \(\mu_d\), the population mean difference. The mean difference is \(\bar{d} = 0.77\) and \(s_d = 1.07\). The value of \(\bar{d}\) will vary from sample to sample, so has a standard error:
\[ \text{s.e.}(\bar{d}) = \frac{s_d}{\sqrt{n}} = \frac{1.07}{\sqrt{210}} = 0.073837. \] Using an approximate multiplier of 2, the margin of error is \(2 \times 0.073837 = 0.147674\), so an approximate 95% CI for the difference in support in the population is \(0.77\pm 0.147674\), or from \(0.62\) to \(0.92\). We write:
From the sample, an approximate 95% CI for the mean difference in support for preventing illicit trade is from 0.62 to 0.92, with greater support for the Bay Checkerspot Butterfly than for the Valley Elderberry Longhorn Beetle (mean difference: 0.77; standard deviation: 1.07; \(n = 210\) undergraduate students).
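When only summary statistics are available, as here, the same steps can be wrapped in a small helper. This is an illustrative sketch (the function name is hypothetical). It must be given the standard deviation of the differences, which cannot be recovered from the two separate standard deviations.

```python
import math


def approx_ci_from_summary(mean_diff, sd_diff, n, multiplier=2):
    """Approximate CI for a mean difference from summary statistics.

    Returns (lower, upper, standard_error). The default multiplier of 2
    gives an approximate 95% CI.
    """
    se = sd_diff / math.sqrt(n)
    margin = multiplier * se
    return mean_diff - margin, mean_diff + margin, se


# Butterfly minus beetle: mean difference 0.77, sd of differences 1.07, n = 210.
lower, upper, se = approx_ci_from_summary(0.77, 1.07, 210)
print(f"s.e. = {se:.6f}; approximate 95% CI: {lower:.2f} to {upper:.2f}")

# The sd of the differences (1.07) is not the difference between the two sds:
print(f"Difference between the two sds: {1.06 - 1.13:.2f}")
```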
23.10 Example: blood pressure

A US study (Willems et al. 1997; Schorling et al. 1997) examined how CHD risk factors were assessed among parts of the population with diabetes. Subjects reported to the clinic on multiple occasions. Consider this RQ:
What is the mean difference in diastolic blood pressure from the first to the second visit?
Each person has a pair of diastolic blood pressure (DBP) measurements: one each from their first and second visits. The data (shown below) are from the 141 people for whom both measurements are available (some data are missing). The differences could be computed as:
- The first visit DBP minus the second visit DBP: the reduction in DBP; or
- The second visit DBP minus the first visit DBP: the increase in DBP.
Either way is fine, provided the order is used consistently. Here, the first visit DBP minus the second visit DBP will be used, so that the differences represent the reduction in DBP from the first to second visit. The parameter is \(\mu_d\), the population mean reduction in DBP.
Since the dataset is large, the appropriate graphical summary is a histogram of differences (Fig. 23.4). The numerical summary can summarise both the first and second visit observations, but must summarise the differences. Numerical summaries can be computed using software, then reported in a suitable table (Table 23.6).

FIGURE 23.4: Histogram of the decrease in DBP between the first and second visits
Measurement (mm Hg) | Mean | Standard deviation | Standard error | Sample size |
---|---|---|---|---|
DBP: First visit | 94.48 | 11.473 | 0.966 | 141 |
DBP: Second visit | 92.52 | 11.555 | 0.973 | 141 |
Decrease in DBP | 1.95 | 8.026 | 0.676 | 141 |
The standard error of the sample mean difference is
\[ \text{s.e.}(\bar{d})=\frac{s_d}{\sqrt{n}} = \frac{8.02614}{\sqrt{141}} = 0.67592. \] Using an approximate multiplier of 2, the margin of error is \(2 \times 0.67592 = 1.3518\), so an approximate 95% CI for the decrease in DBP is \(1.9504\pm 1.3518\), or from \(0.60\) to \(3.30\) mm Hg, after rounding sensibly. We write:
Based on the sample, an approximate 95% CI for the mean decrease in DBP is from \(0.60\) to \(3.30\) mm Hg.
The exact 95% CI from jamovi or SPSS (Fig. 23.5), using an exact \(t\)-multiplier rather than an approximate multiplier of 2, is similar since the sample size is large. After rounding, write:
Based on the sample, an exact 95% CI for the decrease in DBP is from \(0.61\) to \(3.29\) mm Hg.
The wording ('for the decrease in DBP') implies which reading is the higher reading on average: the first.


FIGURE 23.5: jamovi output (top) and SPSS output (bottom) for the blood pressure data, including the exact 95% CI
Be clear in your conclusion about how the differences are computed.
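The following sketch (Python, for illustration only) shows why the direction matters: reversing the order of subtraction changes the sign of every difference, so the sample mean and both CI limits change sign, while the standard deviation and standard error are unchanged.

```python
import math

mean_reduction = 1.9504   # mean of (first visit - second visit), in mm Hg
s_d = 8.02614             # standard deviation of the differences (mm Hg)
n = 141                   # number of people with both measurements

se = s_d / math.sqrt(n)
margin = 2 * se           # approximate 95% margin of error

# Differences defined as the reduction in DBP (first minus second).
ci_reduction = (mean_reduction - margin, mean_reduction + margin)

# Differences defined as the increase in DBP (second minus first):
# only the sign of the mean changes.
mean_increase = -mean_reduction
ci_increase = (mean_increase - margin, mean_increase + margin)

print(f"Mean reduction: {ci_reduction[0]:.2f} to {ci_reduction[1]:.2f} mm Hg")
print(f"Mean increase:  {ci_increase[0]:.2f} to {ci_increase[1]:.2f} mm Hg")
```

Both intervals describe the same result; the conclusion just needs to state which way the differences were computed.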
The CI is statistically valid as the sample size is larger than 25. (The data do not need to follow a normal distribution.)
Is there a mean difference in DBP in the population?
Be careful: The RQ is about the mean difference in the population... but we only have the mean difference from one of the many possible samples. So it is difficult to be certain.
23.11 Quick review questions
- True or false: For paired data, the mean of the differences is treated like the mean of a single variable.
- True or false: The appropriate graph for displaying paired data is often a histogram of the differences.
- True or false: The population mean difference is denoted by \(\mu_d\).
- True or false: The standard error of the sample mean difference is denoted by \(s_d\).
23.12 Exercises
Selected answers are available in Sect. D.22.
Exercise 23.1 People often struggle to eat the recommended intake of vegetables. In one study exploring ways to increase vegetable intake in teens (Fritts et al. 2018), teens rated the taste of raw broccoli, and raw broccoli served with a specially-made dip.
Each teen (\(n = 101\)) had a pair of measurements: the taste rating of the broccoli with and without dip. Taste was assessed using a '100 mm visual analog scale', where a higher score means a better taste. In summary:
- For raw broccoli, the mean taste rating was \(56.0\) (with a standard deviation of \(26.6\));
- For raw broccoli served with dip, the mean taste rating was \(61.2\) (with a standard deviation of \(28.7\)).
Because the data are paired, the differences are the best way to describe the data. The mean difference in the ratings was \(5.2\), with \(\text{s.e.}(\bar{d}) = 3.06\). From this information:
- Construct a suitable numerical summary table.
- Compute the approximate 95% CI for the mean difference in taste ratings.
Exercise 23.2 In a study of hypertension (Hand et al. 1996; MacGregor et al. 1979), 15 patients were given a drug (Captopril) and their systolic blood pressure measured (in mm Hg) immediately before and two hours after being given the drug (Table 23.7).
- Explain why it is sensible to compute differences as the Before minus the After measurements. What do the differences mean when computed this way?
- Compute the differences.
- Compute an approximate 95% CI for the mean difference.
- Write down the exact 95% CI using the computer output (Fig. 23.6).
- Why are the two CIs different?
Before | After | Before | After |
---|---|---|---|
210 | 201 | 173 | 147 |
169 | 165 | 146 | 136 |
187 | 166 | 174 | 151 |
160 | 157 | 201 | 168 |
167 | 147 | 198 | 179 |
176 | 145 | 148 | 129 |
185 | 168 | 154 | 131 |
206 | 180 |


FIGURE 23.6: jamovi (top) and SPSS (bottom) output for the Captopril data
Exercise 23.3 A study (Allen et al. 2018) examined the effect of exercise on smoking. Men and women were assessed on a range of measures, including the 'intention to smoke'.
'Intention to smoke' was assessed both before and after exercise for each subject, using two quantitative questionnaires. Smokers (people smoking at least five cigarettes per day) aged 18 to 40 were enrolled for the study. For the 23 women in the study, the mean intention to smoke after exercise reduced by 0.66 (with a standard error of 0.37).
- Find a 95% confidence interval for the population mean reduction in intention to smoke for women after exercising.
- Is this CI statistically valid?
Exercise 23.4 Young girls (\(n = 29\)) with anorexia received cognitive behavioural treatment (Hand et al. (1996), Dataset 285), and their weights before and after treatment were recorded. In summary:
- Before the treatment, the mean weight was 82.69 pounds (\(s = 4.85\) pounds);
- After the treatment, the mean weight was 85.7 pounds (\(s = 8.35\) pounds).
If the standard deviation of the weight gain was 7.31 pounds, find a 95% CI for the population mean weight gain. Do you think the treatment had any impact on the mean weight of the girls?
Exercise 23.5 The concentration of beta-endorphins in the blood is a sign of stress. One study (Hand et al. (1996), Dataset 232; Hoaglin, Mosteller, and Tukey (2011)) measured the beta-endorphin concentration for 19 patients about to undergo surgery.
Each patient had their beta-endorphin concentrations measured 12--14 hours before surgery, and also 10 minutes before surgery. A numerical summary can be produced from jamovi output (Table 23.8).
Time of measurement | Mean | Std deviation | Std error | Sample size |
---|---|---|---|---|
12--14 hours before surgery | 8.35 | 4.397 | 1.009 | 19 |
10 minutes before surgery | 16.05 | 12.509 | 2.870 | 19 |
Increase | 7.70 | 13.519 | 3.102 | 19 |

FIGURE 23.7: jamovi output for the surgery-stress data
Exercise 23.6 A study of \(n = 213\) Spanish health students (Romero-Blanco et al. 2020) measured (among other things) the number of minutes of vigorous physical activity (PA) performed by students before and during the COVID-19 lockdown (from March to April 2020 in Spain). Since the before and during lockdown were both measured on each participant, the data are paired (within individuals). The data are summarised in Table 23.9.
- Explain what the differences mean.
- Compute the standard error of the differences.
- Compute the approximate 95% CI, and interpret what it means.
Time | Mean (mins) | Std dev (mins) |
---|---|---|
Before | 28.47 | 54.13 |
During | 30.66 | 30.04 |
Increase | 2.68 | 51.30 |