27 CIs for mean differences (paired data)
So far, you have learnt to ask a RQ, design a study, classify and summarise the data, and form confidence intervals. In this chapter, you will learn to
- identify situations where estimating a mean difference is appropriate.
- construct confidence intervals for a mean difference.
- determine whether the conditions for using the confidence interval apply in a given situation.
27.1 Introduction: students starting university
What happens to students' eating habits when they start university? Many students will be responsible for their own meals for the first time, so some may forgo healthy foods for convenient, but less healthy, foods. Alternatively, some may not be able to afford sufficient or healthy food.
One approach is to measure the weights of a sample of students who are beginning university, and then measure the weights of a different sample of students at some later time. This between-individuals design compares the means of two groups of different students, the topic of the next chapter.
Another approach is to record some students' weights as they begin university, and then record the same students' weights at some later time. The comparison is within individuals (Sect. 2.4); this is a repeated-measures study. Each student has a pair of weight measurements, and the study produces paired data, the topic of this chapter.
D. A. Levitsky, Halbmaier, and Mrdjenovic (2004) used this second approach to answer this question:
For Cornell University students, what is the mean weight change in students after \(12\) weeks at university?
The data collected to answer this RQ are shown below (D. Levitsky, n.d.).
Some weight gains are negative. This does not mean a negative weight. Since the differences are computed as Week \(12\) minus Week \(1\), a negative difference means the Week \(1\) weight is greater than the Week \(12\) weight value (i.e., a weight loss).
27.2 Paired data
The data above are paired. The RQ is a special case of a repeated-measures RQ (Sect. 2.4), where each unit of analysis has only two observations. Computing the differences or changes between the pairs of observations makes sense.
Pairing data, when appropriate, is useful because individuals can vary substantially, and pairing means that extraneous variables (potentially, confounding variables) are held constant for those paired observations. For example, each pair of weights recorded in the data above comes from the same person, so variables such as sex and age remain the same. Pairing is a form of blocking (Sect. 7.2), and is a good design strategy when the individuals in the pair are similar for many extraneous variables.
Definition 27.1 (Paired data) Paired data occurs when the outcome is compared for two different, distinct situations for each individual.
Paired studies appear in many situations:
- Blood pressure is recorded on the same individuals before and after receiving a drug. The change in blood pressure is recorded for each person.
- The number of campers is recorded at many national parks (the 'individuals') on the first weekend in summer, and on the first weekend in winter. The change in camper numbers for each national park between these time points is recorded.
- The body temperature of dogs (the 'individuals') is measured using rectal and ear thermometers. The difference between the two recorded temperatures from the thermometers is recorded.
- Height is measured for each twin in a pair (the twin-pair is the 'individual'). Pairing the heights for each twin is reasonable given the shared genetics (and probably environments also). The difference between the height of the twins can be recorded for each pair.
Many of these examples can be extended beyond two measurements. For instance, blood pressures can be recorded every thirty minutes over four hours, or temperatures can be compared using three different types of thermometers. We only study pairs of measurements, and only for quantitative variables.
27.3 Summarising the data
For the student-eating study, weight is measured for the same students at the start of university and after \(12\) weeks at university. Each student receives two measurements, and the change in weight for each individual is recorded.
Since data are paired, an appropriate graph is a histogram of the differences (Chap. 15); specifically, weight gains. A boxplot comparing students' weights at Week \(1\) and at Week \(12\) (that is, not pairing the data) shows the distribution of weights, and the median weights, are very similar (Fig. 27.1, left panel). Any change in individuals' weights is difficult to see and detect. In addition, the link between the weights of students in Week \(1\) and Week \(12\) has been lost.
The histogram of the weight gains makes the change in individuals' weights easier to see (Fig. 27.1, right panel). The histogram also makes it easy to see that some students lost weight and some students gained weight from Week \(1\) to Week \(12\). Individually graphing the Week \(1\) and Week \(12\) data may also be useful, but a graph of the differences is crucial, as the RQ is about those differences. A case-profile plot (Sect. 15.3.2) is also appropriate, but is difficult to read for these data as the sample size is large (a line is needed for each unit of analysis).
The Week \(1\) and the Week \(12\) weights can be summarised individually (the first two rows of Table 27.1) using the methods of Chap. 25. All statistics are slightly different in Weeks \(1\) and \(12\); in particular, a slight weight gain is seen. However, since the RQ is about weight change in individuals, a numerical summary of the differences is essential.
Variable | Mean | Median | Standard deviation | Standard error |
---|---|---|---|---|
Week 1 weight (in kg) | \(61.24\) | \(60.3\) | \(10.970\) | \(1.330\) |
Week 12 weight (in kg) | \(62.10\) | \(60.3\) | \(11.073\) | \(1.343\) |
Weight gain (in kg) | \(\phantom{0}0.86\) | \(\phantom{0}0.9\) | \(\phantom{0}0.956\) | \(0.116\) |
27.4 Sampling distribution for \(\bar{d}\)
Every possible sample of \(n = 68\) students comprises different students, and hence produces different Week \(1\) weights and Week \(12\) weights. For this reason, the Week \(1\) and Week \(12\) summaries in Table 27.1 include standard errors. Since the Week \(1\) and Week \(12\) weights vary from sample to sample, the weight changes vary from sample to sample too, and also have a sampling distribution.
The differences can be treated like a single sample of data (Chap. 25), and so the sample mean difference has a sampling distribution similar to that of \(\bar{x}\) (provided the conditions are met; Sect. 27.6). In addition, the notation used with paired data is similar to that used for the one-mean case (Table 27.2).
Definition 27.2 (Sampling distribution of a sample mean difference) The sampling distribution of a sample mean difference is (when certain conditions are met; Sect. 27.6) described by:
- an approximate normal distribution,
- centred around the sampling mean whose value is the population mean difference \(\mu_d\),
- with a standard deviation, called the standard error of the difference, of \(\displaystyle\text{s.e.}(\bar{d}) = \frac{s_d}{\sqrt{n}}\),
where \(n\) is the number of differences, and \(s_d\) is the standard deviation of the individual differences in the sample.
A mean or a median may be appropriate for describing the differences. However, the sampling distribution for the sample mean difference (under certain conditions) has a normal distribution. Hence, the mean is appropriate for describing the sampling distribution, even if not for describing the data.
Quantity | One sample mean | Mean difference |
---|---|---|
The observations: | Values: \(x\) | Differences: \(d\) |
Sample mean: | \(\bar{x}\) | \(\bar{d}\) |
Standard deviation: | \(s\) | \(s_d\) |
Standard error of \(\bar{x}\): | \(\displaystyle\text{s.e.}(\bar{x}) = \frac{s}{\sqrt{n}}\) | \(\displaystyle\text{s.e.}(\bar{d}) = \frac{s_d}{\sqrt{n}}\) |
Sample size: | Number of observations: \(n\) | Number of differences: \(n\) |
For the student-eating data, the sample mean differences \(\bar{d}\) are described by (Fig. 27.2):
- an approximate normal distribution,
- with a sampling mean whose value is \(\mu_{{d}}\),
- with a standard error of
\[ \text{s.e.}(\bar{d}) = \frac{0.956366}{\sqrt{68}} = 0.11598. \]
The summary information for the weight gains can be added to the summary table (the third row of Table 27.1), after appropriate rounding. Notice that the standard deviation of the difference is not the difference between the standard deviations for the Week \(1\) and Week \(12\) data. (The same applies for the standard error and the median.) Instead, the standard deviation of the list of differences is found (i.e., the column Weight gain in the data table).
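The point about standard deviations can be checked with a small sketch. The weights below are made-up values for five hypothetical students (they are not the Cornell data), and the variable names are illustrative only.

```python
import math
import statistics

# Hypothetical paired weights (kg) for five students; NOT the study data.
week1 = [60.0, 72.5, 55.0, 81.0, 66.5]
week12 = [61.0, 73.0, 54.5, 82.5, 67.5]

# Differences are Week 12 minus Week 1, so positive values are weight gains.
gains = [after - before for before, after in zip(week1, week12)]

s_d = statistics.stdev(gains)        # standard deviation of the differences
se_d = s_d / math.sqrt(len(gains))   # standard error of the mean difference

# The incorrect shortcut: the difference between the two standard deviations.
sd_shortcut = statistics.stdev(week12) - statistics.stdev(week1)

print(f"mean gain = {statistics.mean(gains):.2f} kg")
print(f"s_d = {s_d:.3f} kg, s.e. = {se_d:.3f} kg")
print(f"difference of the two std. devs = {sd_shortcut:.3f} kg")
```

The last two printed lines give different values: the standard deviation must be computed from the list of differences, not from the two separate columns.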
27.5 Confidence intervals for \(\mu_d\)
The CI for the mean difference has the same form as for a single mean (Chap. 25).
The \(95\)% confidence interval (CI) for \(\mu_d\) is
\[
\bar{d} \pm (\text{multiplier} \times\text{s.e.}(\bar{d})).
\]
As usual, for an approximate \(95\)% confidence interval (CI), the approximate multiplier is \(2\) (from the \(68\)--\(95\)--\(99.7\) rule).
This is the same as the CI for \(\bar{x}\) if the differences are treated as the data.
For the student-eating data:
\[
0.8618 \pm (2 \times 0.1159764),
\]
or \(0.862\pm 0.232\) (so the margin of error is \(0.232\)).
Equivalently, the CI is from \(0.862 - 0.232 = 0.630\), up to \(0.862 + 0.232 = 1.094\).
We write:
The mean weight gain from Week \(1\) to \(12\) is \(0.86\) kg (\(\text{s.e.} = 0.116\); \(n = 68\)), with an approximate \(95\)% CI from \(0.63\) kg to \(1.09\) kg.
The CI means that the reasonable values for the population mean weight gain are between \(0.63\) kg and \(1.09\) kg. Alternatively, we are \(95\)% confident that, between Weeks \(1\) and \(12\), the population mean weight gain is between \(0.63\) kg and \(1.09\) kg. A weight gain of this magnitude probably has no practical importance.
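The arithmetic for the approximate CI can be reproduced from the summary values quoted above. This is a minimal sketch; the function name `approx_ci_mean_difference` is hypothetical, and the multiplier of \(2\) is the approximate value from the \(68\)--\(95\)--\(99.7\) rule.

```python
import math

def approx_ci_mean_difference(mean_diff, sd_diff, n, multiplier=2.0):
    """Return (lower, upper) for an approximate CI of a mean difference."""
    se = sd_diff / math.sqrt(n)   # standard error of the mean difference
    margin = multiplier * se      # margin of error
    return mean_diff - margin, mean_diff + margin

# Summary values for the weight gains, as quoted in the text.
lower, upper = approx_ci_mean_difference(0.8618, 0.956366, 68)
print(f"approx. 95% CI: {lower:.2f} kg to {upper:.2f} kg")
```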
Statistical software produces exact \(95\)% CIs, which may be slightly different than the approximate \(95\)% CI (the \(68\)--\(95\)--\(99.7\) rule gives approximate multipliers). For the student-eating data, the approximate and exact \(95\)% CIs are the same to two decimal places (Fig. 27.3). We write:
The mean weight gain from Week \(1\) to Week \(12\) is \(0.86\) kg (\(\text{s.e.} = 0.116\); \(n= 68\)), with a \(95\)% CI from \(0.63\) kg to \(1.09\) kg.
27.6 Statistical validity conditions
As with any confidence interval, these results apply under certain conditions. The conditions under which the CI is statistically valid for paired data are similar to those for one sample mean, rephrased for differences.
Statistical validity can be assessed using these criteria:
- When \(n > 25\), the CI is statistically valid provided the distribution of differences is not highly skewed.
- When \(n \le 25\), the CI is statistically valid only if the data come from a population of differences with a normal distribution.
The sample size of \(25\) is a rough figure; some books give other values (such as \(30\)). This condition ensures that the sampling distribution of the sample mean difference is approximately normal (so the \(68\)--\(95\)--\(99.7\) rule can be used). Provided the sample size is larger than about \(25\), this will be approximately true even if the differences in the population do not have a normal distribution.
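A rough simulation can illustrate why sample size matters. The sketch below assumes a right-skewed (exponential) population of differences, chosen purely for illustration and unrelated to any data in this chapter. It estimates how often the approximate \(95\)% CI (multiplier \(2\)) contains the true mean difference. The function name `coverage` and the seed are arbitrary choices.

```python
import math
import random
import statistics

def coverage(n, n_samples=2000, seed=1):
    """Fraction of approximate 95% CIs (multiplier 2) containing the true mean."""
    rng = random.Random(seed)
    true_mean = 1.0  # mean of an exponential distribution with rate 1
    hits = 0
    for _ in range(n_samples):
        # Simulated differences from a skewed (exponential) population.
        diffs = [rng.expovariate(1.0) for _ in range(n)]
        d_bar = statistics.mean(diffs)
        se = statistics.stdev(diffs) / math.sqrt(n)
        if d_bar - 2 * se <= true_mean <= d_bar + 2 * se:
            hits += 1
    return hits / n_samples

for n in (5, 25, 68):
    print(f"n = {n:2d}: estimated coverage {coverage(n):.3f}")
```

Under these assumptions, the estimated coverage should move closer to \(95\)% as \(n\) increases. The exact values depend on the seed and on the population chosen.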
If the statistical validity conditions are not met, resampling methods may be used (Efron and Hastie 2021).
Example 27.1 (Statistical validity) For the eating data, the sample size is \(n = 68\), so the results are statistically valid. Neither the differences in the population, nor the weights in Week \(1\) and Week \(12\), need to follow a normal distribution.
27.7 Example: invasive plants
Skypilot is an alpine wildflower native to the Colorado Rocky Mountains (USA). In recent years, a willow shrub has been encroaching on skypilot territory and, because willow often flowers early, Kettenbach et al. (2017) studied whether the willow may 'negatively affect pollination regimes of resident alpine wildflower species' (p. 6,965). One RQ was:
In the Colorado Rocky Mountains, what is the mean difference between first-flowering day for the native skypilot and the encroaching willow?
Data for both species was collected at \(n = 25\) different sites, so the data are paired by site (Sect. 15.4). The unit of analysis is the site, and the unit of observation is the plant. The data are shown in the table below. The 'first-flowering day' is the number of days since the start of the year (e.g., January \(12\) is 'day \(12\)') when flowers were first observed.
The parameter is \(\mu_d\), the population mean difference between the day of first flowering for skypilot, less the day of first flowering for willow. Hence, a positive value for the difference means that the skypilot values are larger, and hence that willow flowered first.
Explaining how the differences are computed is important. The differences here are skypilot minus willow first-flowering days. Positive values mean willow flowered first; negative values mean skypilot flowered first.
The data are summarised graphically (Fig. 15.4) and numerically (Table 15.4), using software output (Fig. 15.3).
The standard error of the mean difference is \(\text{s.e.}(\bar{d}) = 0.940\) (Fig. 15.3; Table 15.4). The approximate \(95\)% CI is \(1.36 \pm (2\times 0.940)\), or from \(-0.52\) to \(3.24\) days. Software output (Fig. 15.3) gives the \(95\)% CI as \(-0.58\) to \(3.30\) days. Remembering that positive differences mean willow flowers earlier, we write (using the exact CI):
The mean difference in the day of first flowering is \(1.36\) days earlier for the willow (\(\text{s.e.} = 0.940\); \(n = 25\)), with a \(95\)% CI from \(0.58\) days earlier for skypilot to \(3.30\) days earlier for willow.
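The sample size (\(n = 25\)) sits right at the rough threshold of Sect. 27.6, so the CI is statistically valid provided the population of differences has an approximately normal distribution.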
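The small gap between the approximate and exact CIs comes from the multiplier. As a sketch, assume the software uses a multiplier from a \(t\)-distribution with \(n - 1 = 24\) degrees of freedom (an assumption; the software's method is not stated in this section). The multiplier is then about \(2.064\) rather than \(2\), and the CI is

\[
   1.36 \pm (2.064 \times 0.940) = 1.36 \pm 1.94,
\]

or from \(-0.58\) to \(3.30\) days, which agrees with the software output quoted above.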
Be clear in your conclusion about how the differences are computed. Make sure to interpret the CI consistent with how the differences are defined.
27.8 Example: chamomile tea
Rafraf, Zemestani, and Asghari-Jafarabadi (2015) studied patients with Type 2 diabetes mellitus (T2DM). They randomly allocated \(32\) patients into a control group (who drank hot water), and another \(32\) patients to receive chamomile tea (p. 164):
The study was blinded so that the allocation of the intervention or control group was concealed from the researchers and statistician [...] The intervention group (\(n = 32\)) consumed one cup of chamomile tea [...] three times a day immediately after meals (breakfast, lunch, and dinner) for \(8\) weeks. The control group (\(n = 32\)) consumed an equivalent volume of warm water during the \(8\)-week period...
The total glucose (TG) was measured for each individual both before the intervention and after eight weeks on the intervention, in both the control and treatment groups. The data are not available, so no graphical summary of the data can be produced; however, the article gives a data summary (motivating Table 27.3). The following RQs can be asked:
- For patients with T2DM, what is the mean reduction in TG after eight weeks drinking chamomile tea?
- For patients with T2DM, what is the mean reduction in TG after eight weeks drinking hot water?
For the tea group, the standard error of the reduction in TG is \(\text{s.e.}(\bar{d}) = 30.37/\sqrt{32} = 5.37\). For the control group, the standard error of the reduction in TG is \(\text{s.e.}(\bar{d}) = 36.66/\sqrt{32} = 6.48\). Thus, the approximate \(95\)% CI for the reduction in TG is:
- Tea-drinking group: \(38.62\pm (2\times 5.37)\), or from \(27.88\) to \(49.36\) mg.dl\(^{-1}\).
- Control group: \(-7.12\pm (2\times 6.48)\), or from \(-20.08\) to \(5.84\) mg.dl\(^{-1}\).
(A negative reduction in TG means an increase in TG.) The chamomile tea appears to reduce TG, but not the hot water. Is the difference between the two treatments due to sampling variation? This question is studied further in Sect. 34.9.
Group | \(n\) | Before: mean | Before: std. dev. | After: mean | After: std. dev. | Reduction: mean | Reduction: std. dev. |
---|---|---|---|---|---|---|---|
Chamomile tea | \(32\) | \(203.00\) | \(54.96\) | \(164.37\) | \(50.70\) | \(38.62\) | \(30.37\) |
Control | \(32\) | \(178.25\) | \(53.06\) | \(185.37\) | \(52.59\) | \(-7.12\) | \(36.66\) |
Difference | | \(\phantom{0}24.75\) | | \(\phantom{0}21.00\) | | \(45.74\) | |
We write:
The mean reduction in TG for those drinking chamomile tea is \(38.62\) mg.dl\(^{-1}\) (approx. \(95\)% CI: \(27.88\) to \(49.36\) mg.dl\(^{-1}\)), and \(-7.12\) mg.dl\(^{-1}\) for those drinking water (approx. \(95\)% CI: \(-20.08\) to \(5.84\) mg.dl\(^{-1}\)).
The intervals have a \(95\)% chance of straddling the population mean reduction in TG. The sample sizes are larger than \(25\), so the results are statistically valid.
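Both intervals can be reproduced from the summary statistics for the reduction in TG. The sketch below uses the means and standard deviations quoted in the text. The dictionary layout and variable names are illustrative only.

```python
import math

# Summary statistics for the reduction in TG (mg/dl), as quoted in the text.
groups = {
    "Chamomile tea": {"mean": 38.62, "sd": 30.37, "n": 32},
    "Control": {"mean": -7.12, "sd": 36.66, "n": 32},
}

intervals = {}
for name, g in groups.items():
    se = g["sd"] / math.sqrt(g["n"])   # standard error of the mean reduction
    margin = 2 * se                    # approximate 95% margin of error
    intervals[name] = (g["mean"] - margin, g["mean"] + margin)
    lower, upper = intervals[name]
    print(f"{name}: s.e. = {se:.2f}; CI from {lower:.2f} to {upper:.2f}")
```

The control-group interval includes zero, while the tea-group interval does not. That matches the observation that the tea, but not the hot water, appears to reduce TG.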
27.9 Chapter summary
To compute a confidence interval (CI) for a mean difference, compute the sample mean difference, \(\bar{d}\), and identify the sample size \(n\).
Then compute the standard error, which quantifies how much the value of \(\bar{d}\) varies across all possible samples:
\[
\text{s.e.}(\bar{d})
=
\frac{ s_d }{\sqrt{n}},
\]
where \(s_d\) is the sample standard deviation of the differences.
The margin of error is (multiplier\(\times\)standard error), where the multiplier is \(2\) for an approximate \(95\)% CI (using the \(68\)--\(95\)--\(99.7\) rule).
Then the CI is:
\[
\bar{d} \pm \left( \text{multiplier}\times\text{standard error} \right).
\]
The statistical validity conditions should also be checked.
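The steps in this summary can be collected into one function that starts from the raw pairs. This is a minimal sketch: the function name `paired_ci`, the returned keys and the six pairs of measurements are all hypothetical. The `large_sample` flag only records whether \(n > 25\); the shape of the distribution of differences still needs to be checked.

```python
import math
import statistics

def paired_ci(first, second, multiplier=2.0):
    """Approximate CI for the mean of (second - first), from raw paired data."""
    if len(first) != len(second):
        raise ValueError("paired data need two lists of equal length")
    diffs = [b - a for a, b in zip(first, second)]
    n = len(diffs)
    d_bar = statistics.mean(diffs)                 # sample mean difference
    se = statistics.stdev(diffs) / math.sqrt(n)    # standard error
    margin = multiplier * se                       # margin of error
    return {
        "n": n,
        "mean_diff": d_bar,
        "se": se,
        "ci": (d_bar - margin, d_bar + margin),
        "large_sample": n > 25,
    }

# Hypothetical before/after measurements for six units (illustration only).
before = [10.0, 12.0, 9.0, 11.0, 13.0, 10.0]
after = [11.0, 12.0, 10.0, 13.0, 13.0, 11.0]
result = paired_ci(before, after)
print(result)
```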
27.10 Quick review questions
Are the following statements true or false?
- For paired data, the mean of the differences is treated like the mean of a single variable when computing a CI.
- An appropriate graph for displaying paired data is a histogram of the differences.
- The population mean difference is denoted \(\mu_d\).
- The standard error of the sample mean difference is denoted \(s_d\).
27.11 Exercises
Answers to odd-numbered exercises are available in App. E.
Exercise 27.1 Which (if any) of these scenarios are paired?
- Heart rate is measured for each individual when sitting and when standing. (Some individuals have their heart rate recorded first while sitting, and some first while standing.) Each person receives two measurements, and the difference in heart rate between sitting and standing is recorded.
- The mean protein concentrations were compared in sea turtles before and after being rehabilitated (March et al. 2018).
Exercise 27.2 Which (if any) of these scenarios are paired?
- Heart rate was recorded for \(36\) people, both before and after exercise, to determine how much the average heart rate increases.
- The mean HDL cholesterol concentration is recorded for a group of males and a group of females, and the means compared.
Exercise 27.3 [Dataset: Fruit]
Mukherjee, Deb, and Devy (2019) studied the effect of rainfall on growing Chayote squash (Sechium edule).
They compared the size of the fruit in a year with normal rainfall (2015) with that in a dry year (2014) on \(24\) farms:
For Chayote squash grown in Bangalore, what is the mean difference in fruit weight between a normal and dry year?
Ten fruits were gathered from each farm in both years, and the average (mean) weight of the fruit recorded for the farm. Since the same farms are used in both years, the data are paired (see above). Data is missing for Farm 20 in the dry year (2014), so there are \(n = 23\) differences.
Farm | Dry | Normal | Change (in g) |
---|---|---|---|
\(\phantom{0}1\) | \(367.75\) | \(371.05\) | \(\phantom{-}\phantom{0}\phantom{0}3.30\) |
\(\phantom{0}2\) | \(238.25\) | \(218.85\) | \(-19.40\) |
\(\phantom{0}3\) | \(271.25\) | \(217.55\) | \(-53.70\) |
\(\phantom{0}4\) | \(286.27\) | \(221.70\) | \(-64.57\) |
\(\phantom{0}5\) | \(259.20\) | \(268.95\) | \(\phantom{-}\phantom{0}\phantom{0}9.75\) |
\(\phantom{0}6\) | \(196.23\) | \(194.85\) | \(\phantom{0}-1.38\) |
\(\phantom{0}7\) | \(283.70\) | \(293.00\) | \(\phantom{-}\phantom{0}\phantom{0}9.30\) |
\(\phantom{0}8\) | \(252.05\) | \(264.15\) | \(\phantom{-}\phantom{0}12.10\) |
\(\phantom{0}9\) | \(253.70\) | \(218.45\) | \(-35.25\) |
\(10\) | \(279.80\) | \(225.40\) | \(-54.40\) |
\(11\) | \(206.05\) | \(225.90\) | \(\phantom{-}\phantom{0}19.85\) |
\(12\) | \(222.00\) | \(222.85\) | \(\phantom{-}\phantom{0}\phantom{0}0.85\) |
\(13\) | \(285.50\) | \(282.25\) | \(\phantom{0}-3.25\) |
\(14\) | \(171.50\) | \(266.00\) | \(\phantom{-}\phantom{0}94.50\) |
\(15\) | \(186.75\) | \(206.20\) | \(\phantom{-}\phantom{0}19.45\) |
\(16\) | \(219.55\) | \(194.60\) | \(-24.95\) |
\(17\) | \(198.15\) | \(346.75\) | \(\phantom{-}148.60\) |
\(18\) | \(248.10\) | \(304.55\) | \(\phantom{-}\phantom{0}56.45\) |
\(19\) | \(231.55\) | \(263.20\) | \(\phantom{-}\phantom{0}31.65\) |
\(20\) | \(223.70\) | ||
\(21\) | \(257.50\) | \(258.75\) | \(\phantom{-}\phantom{0}\phantom{0}1.25\) |
\(22\) | \(230.70\) | \(248.95\) | \(\phantom{-}\phantom{0}18.25\) |
\(23\) | \(260.50\) | \(155.95\) | \(-104.55\) |
\(24\) | \(231.85\) | \(219.30\) | \(-12.55\) |
- What is the unit of analysis? What is the unit of observation?
- Create a numerical summary table for the data (use Fig. 27.4).
- Create a suitable graph to display the data.
- Construct an approximate \(95\)% CI for the mean difference in fruit weight.
- Is this CI statistically valid?
Exercise 27.4 [Dataset: Captopril]
MacGregor et al. (1979) studied hypertension in \(15\) patients.
Patients were given a drug (Captopril) and their systolic blood pressure measured (in mm Hg) immediately before and two hours after being given the drug (Table 15.6; Hand et al. 1996).
- Explain why it is sensible to compute differences as the Before minus the After measurements. What do the differences mean when computed this way?
- Compute an approximate \(95\)% CI for the mean difference.
- Write down the exact \(95\)% CI using the computer output (Fig. 27.5).
- Is this CI statistically valid?
- Why are the two CIs different?
Exercise 27.5 People often struggle to eat the recommended intake of vegetables. Fritts et al. (2018) explored ways to increase vegetable intake in teens. Teens rated the taste of raw broccoli, and raw broccoli served with a specially-made dip.
Each teen (\(n = 100\)) had a pair of measurements: the taste rating of the broccoli with and without dip. Taste was assessed using a '\(100\) mm visual analog scale', where a higher score means a better taste. In summary:
- For raw broccoli, the mean taste rating was \(56.0\) (with \(s = 26.6\));
- For raw broccoli served with dip, the mean taste rating was \(61.2\) (with \(s = 28.7\)).
Because the data are paired, differences are the best way to describe the data. The mean difference in the ratings was \(5.2\), with \(\text{s.e.}(\bar{d}) = 3.06\). From this information:
- Construct a suitable numerical summary table.
- Compute the approximate \(95\)% CI for the mean difference in taste ratings.
- Is this CI statistically valid?
Exercise 27.6 Allen et al. (2018) examined the effect of exercise on smoking. Men and women were assessed on their 'intention to smoke', both before and after exercise for each subject (using two quantitative questionnaires). Smokers ('smoking at least five cigarettes per day') aged \(18\) to \(40\) were enrolled for the study. For the \(23\) women in the study, the mean intention to smoke after exercise reduced by \(0.66\) (with a standard error of \(0.37\)).
- Find an approximate \(95\)% confidence interval for the population mean reduction in intention to smoke for women after exercising.
- Is this CI statistically valid?
Exercise 27.7 [Dataset: Anorexia]
Young girls (\(n = 29\)) with anorexia received cognitive behavioural treatment (Hand et al. 1996), and their weights before and after treatment were recorded.
In summary:
- Before the treatment, the mean weight was \(82.69\) pounds (\(s = 4.845\) pounds);
- After the treatment, the mean weight was \(85.70\) pounds (\(s = 8.352\) pounds).
The mean weight gain per girl was \(3.01\) pounds, with a standard deviation of \(7.31\) pounds. Find an approximate \(95\)% CI for the population mean weight gain. Do you think the treatment had any meaningful impact on the mean weight of the girls, based solely on these data?
Exercise 27.8 [Dataset: Stress]
The concentration of beta-endorphins in the blood is a sign of stress.
Hoaglin, Mosteller, and Tukey (2011) measured the beta-endorphin concentration for \(19\) patients about to undergo surgery (Hand et al. 1996).
Each patient had their beta-endorphin concentrations measured \(12\)--\(14\) hours before surgery, and also \(10\) minutes before surgery.
A numerical summary appears in Table 15.8.
- Use the jamovi output in Fig. 27.6 to construct an approximate \(95\)% CI for the increase in stress as surgery gets closer.
- Use the jamovi output in Fig. 27.6 to write down the exact \(95\)% CI for the increase in stress as surgery gets closer.
- Why is there a difference between the two CIs?
- Is the CI statistically valid?
Exercise 27.9 A study of \(n = 213\) Spanish health students (Romero-Blanco et al. 2020) measured (among other things) the number of minutes of vigorous physical activity (PA) performed by students before and during the COVID-19 lockdown (from March to April 2020 in Spain). Since PA was measured both before and during lockdown on each participant, the data are paired. The data are summarised in Table 15.9.
- Explain what the differences mean.
- Compute the standard error of the differences.
- Compute the approximate \(95\)% CI, and interpret what it means.
Exercise 27.10 Suppose, in the example of Sect. 27.7, the differences were defined as the day of first flowering for willow, less the day of first flowering for skypilot.
Write down, and interpret the meaning of, the confidence interval for the mean difference in first-flowering times.
Exercise 27.11 Suppose, in the example of Sect. 27.8, the differences were defined as increase in total glucose (TG).
Write down, and interpret the meaning of, the confidence interval for the mean increase in TG for the tea-drinking group.