24 CIs for two independent means
So far, you have learnt to ask a RQ, identify different ways of obtaining data, design the study, collect the data, describe the data, summarise the data graphically and numerically, and understand the tools of inference.
In this chapter, you will learn about confidence intervals for the differences between two means. You will learn to:
- produce confidence intervals for two independent means.
- determine whether the conditions for using the confidence intervals apply in a given situation.
24.1 Means of two independent samples
A study^{406} examined the reaction times of students while driving.
In one study, two different groups of students were used: one group used a mobile phone, and a different group did not use a mobile phone. The reaction time for each student was measured in a driving simulator. This is an example of a between-individuals comparison.
The study uses two groups with different treatments: one group using a mobile phone while driving, and a different group not using a mobile phone while driving.
The data are not paired; instead, the means of two separate (or independent) samples are being compared. (The data would be paired if each student was measured twice: once using a phone, and once without using a phone.)
Consider the RQ:
For students, what is the difference between the mean reaction time while driving when using a mobile phone and the mean reaction time while driving when not using a mobile phone?
The data are shown below.
What are P, O, C and I in this study?
P: Students (this is defined more specifically in the original study).
O: Mean reaction time.
C: Between two groups: those using and not using a mobile phone while driving.
I: Yes; the use of a phone (or not) was decided by the researchers.
For this study, what graph would be suitable for displaying the data?
24.2 Graphical summary: Two independent means
To compare a quantitative variable across two groups, a suitable graphical summary may be a boxplot (Fig. 24.1) or (when sample sizes aren't too large) a dot chart.
For the reaction-time data, the boxplot shows that the sample medians are a little different, but the IQRs are about the same; one large outlier is present for the phone-using group.
24.3 Notation: Two independent means
Since two groups are being compared, distinguishing between the statistics for the two groups (say, Group A and Group B) is important. One way is to use subscripts (Table 24.1).
 | Group A | Group B
---|---|---
Population means: | \(\mu_A\) | \(\mu_B\)
Sample means: | \(\bar{x}_A\) | \(\bar{x}_B\)
Standard deviations: | \(s_A\) | \(s_B\)
Standard errors: | \(\displaystyle\text{s.e.}(\bar{x}_A) = \frac{s_A}{\sqrt{n_A}}\) | \(\displaystyle\text{s.e.}(\bar{x}_B) = \frac{s_B}{\sqrt{n_B}}\)
Sample sizes: | \(n_A\) | \(n_B\)
Using this notation, the difference between population means, the parameter of interest, is \(\mu_A-\mu_B\). As usual, the population values are unknown, so this parameter is estimated using the statistic \(\bar{x}_A-\bar{x}_B\).
Notice that Table 24.1 does not include a standard deviation or a sample size for the difference between means; they make no sense in this context.
For example, suppose Group A has 15 individuals and Group B has 45 individuals, and we wish to study the difference \(\bar{x}_A - \bar{x}_B\). What would the sample size be? Certainly not \(15 - 45 = -30\).
On the other hand, the standard error of the difference between the means does make sense: it measures how much the value of \(\bar{x}_A - \bar{x}_B\) varies from sample to sample.
For the reaction-time data, we will use the subscripts \(P\) for the phone-users group, and \(C\) for the control group. The two sample means are then denoted \(\bar{x}_P\) and \(\bar{x}_C\), and the difference between them \(\bar{x}_P - \bar{x}_C\).
24.4 Numerical summary: Two independent means
The numerical summary should summarise both groups, and must also summarise the difference between the means (since the RQ is about this difference). All this information can be found using jamovi (Fig. 24.2) or SPSS (Fig. 24.3), then compiled into a table (Table 24.2).
 | Mean | Sample size | Std dev | Standard error
---|---|---|---|---
Using phone | 585.19 | 32 | 89.65 | 15.847
Not using phone | 533.59 | 32 | 65.36 | 11.554
Difference | 51.59 | | | 19.612
Each time the study is repeated, the means for each group are likely to be different, and so the difference between the means is likely to be different. That is, the difference between the means has sampling variation and hence a standard error, since it varies from sample to sample.
For those using a phone, what is the difference between the standard deviation and the standard error in the context of the reaction-time study?
The standard deviation quantifies how much the individual reaction times vary from person to person.
The standard error quantifies how much the sample mean reaction time varies from sample to sample.
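As a sketch of where the standard errors in Table 24.2 come from: each group's standard error uses the formula \(s/\sqrt{n}\) from Table 24.1. The book does not give the formula for the standard error of the difference, so combining the two standard errors by adding their squares (the usual approach) is an assumption here:

```python
import math

# Summary statistics from Table 24.2 (reaction times, in milliseconds)
s_P, n_P = 89.65, 32   # using phone: std dev and sample size
s_C, n_C = 65.36, 32   # not using phone

# Standard error of each sample mean: s / sqrt(n), as in Table 24.1
se_P = s_P / math.sqrt(n_P)
se_C = s_C / math.sqrt(n_C)

# Combining the two standard errors this way is the usual approach;
# the book itself simply supplies the combined value (19.612)
se_diff = math.sqrt(se_P**2 + se_C**2)

print(round(se_P, 3), round(se_C, 3), round(se_diff, 3))
```

The tiny differences from the tabled values come only from rounding of the summary statistics.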
24.5 Sampling distribution: Two independent means
Since the difference between the population means is unknown (that's why the study was done), the difference is estimated using the sample means. For the reaction-time data, we use the subscripts \(P\) for the phone-users group, and \(C\) for the control group. Then the difference between the two sample means (the statistic) is \(\bar{x}_P - \bar{x}_C\).
The parameter is \(\mu_P - \mu_C\), the difference between the two population means (using a phone, minus not using a phone).
The differences could be computed in the opposite direction (\(\bar{x}_C - \bar{x}_P\)). However, for the reaction-time data, computing differences as the reaction time for phone users, minus the reaction time for non-phone users (controls), probably makes more sense: the differences then refer to how much greater (on average) the reaction times are when students are using phones.
Making clear how the differences are computed is important! Therefore, carefully defining the parameter is important.
The differences could be computed as:
- the reaction time for phone users, minus the reaction time for non-phone users (how much slower the phone users are, on average); or
- the reaction time for non-phone users, minus the reaction time for phone users (how much slower the non-phone users are, on average).
Either is fine, provided you are consistent, and clear about how the differences are computed. The meaning of any conclusions will be the same.
Each sample of students will comprise different students, and will give different reaction times while driving. The means for each group will differ from sample to sample, and the difference between the means will be different for each sample. The difference between the sample means varies from sample to sample, and so has a sampling distribution and standard error.
Definition 24.1 (Sampling distribution for the difference between two sample means) The sampling distribution of the difference between two sample means is described by:
- an approximate normal distribution;
- centred around \(\mu_A - \mu_B\) (the difference between the two population means);
- with a standard deviation of \(\displaystyle\text{s.e.}( \bar{x}_A - \bar{x}_B)\),
when the appropriate conditions are met.
A formula exists for finding the standard error \(\text{s.e.}(\bar{x}_A - \bar{x}_B)\), but it is complicated and we won't provide it; it is not necessary for our purposes anyway, since software can handle the details. Instead, the value of this standard error will be given, or you will be expected to find it on computer output.
For the reaction-time data, the differences between the sample means will have:
- an approximate normal distribution;
- centred around \(\mu_P - \mu_C\) (the difference between the means in the two populations);
- with a standard deviation, called the standard error of the difference, of \(\text{s.e.}(\bar{x}_P - \bar{x}_C) = 19.61\).
We can draw this sampling distribution (Fig. 24.4).
What does a negative difference mean?
Earlier, we defined the difference as \(\mu_P - \mu_C\), the difference between the two population means (using a phone, minus not using a phone).
So a negative value simply means that the mean is greater when not using a phone.
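To see how this sampling distribution arises, we can simulate it. The population means and standard deviations below are hypothetical values, assumed only for illustration (chosen near the sample values in Table 24.2):

```python
import random
import statistics

random.seed(1)

# Hypothetical population values, assumed purely for this illustration
# (in practice, the population values are unknown):
mu_P, sigma_P = 585.0, 90.0   # phone group, in ms
mu_C, sigma_C = 534.0, 65.0   # control group, in ms
n = 32                        # size of each sample

# Repeatedly draw a sample from each population, and record the
# difference between the two sample means
diffs = []
for _ in range(5000):
    mean_P = statistics.mean(random.gauss(mu_P, sigma_P) for _ in range(n))
    mean_C = statistics.mean(random.gauss(mu_C, sigma_C) for _ in range(n))
    diffs.append(mean_P - mean_C)

# The differences cluster around mu_P - mu_C = 51, with a standard
# deviation (the standard error) near the quoted 19.6
print(round(statistics.mean(diffs), 1), round(statistics.stdev(diffs), 1))
```

The histogram of `diffs` would look like the normal curve in Fig. 24.4.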
24.6 Confidence intervals: Two independent means
Being able to describe the sampling distribution implies that we have some idea of how the values of \(\bar{x}_P - \bar{x}_C\) are likely to vary from sample to sample. Then, finding an approximate 95% CI for the difference between the mean reaction times is similar to the process used in Chap. 22.
Approximate 95% CIs all have the same form:
\[ \text{statistic} \pm (2\times\text{s.e.}(\text{statistic})). \] When the statistic is \(\bar{x}_P - \bar{x}_C\), the approximate 95% CI is
\[ (\bar{x}_P - \bar{x}_C) \pm (2 \times \text{s.e.}(\bar{x}_P - \bar{x}_C)). \]
In this case (using more decimal places than in the summary table in Table 24.2), the CI is
\[\begin{eqnarray*} 51.59375 \pm (2 \times 19.61213), \end{eqnarray*}\] or \(51.59375\pm 19.61213\). After rounding appropriately, an approximate 95% CI for the difference is from \(12.37\) to \(90.82\) milliseconds.
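The arithmetic above can be checked in a few lines of Python, using the same unrounded values as in the text:

```python
# Approximate 95% CI: statistic +/- (2 x s.e. of the statistic),
# using the unrounded values from the text
diff = 51.59375      # difference between sample means, in ms
se_diff = 19.61213   # standard error of the difference (supplied)

lo = diff - 2 * se_diff
hi = diff + 2 * se_diff
print(round(lo, 2), round(hi, 2))  # 12.37 90.82
```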
We write:
Based on the sample, an approximate 95% CI for the difference in reaction time while driving, for those using a phone and those not using a phone, is from \(12.37\) to \(90.82\) milliseconds (higher for those using a phone).
The plausible values for the difference between the two population means are between \(12.37\) and \(90.82\) milliseconds.
Stating the CI is insufficient; you must also state the direction in which the differences were calculated, so readers know which group had the higher mean.
Example 24.1 (Gray whales) A study of gray whales (Eschrichtius robustus) measured (among other things) the length of whales at birth.^{407} The data are shown below.
Sex | Mean (in m) | Standard deviation (in m) | Sample size |
---|---|---|---|
Female | 4.66 | 0.379 | 26 |
Male | 4.60 | 0.305 | 30 |
How much longer are female gray whales than males, on average?
Let's define the difference as the mean length of female gray whales minus the mean length of male gray whales. Then we wish to estimate the difference \(\mu_F - \mu_M\), where \(F\) and \(M\) represent female and male gray whales respectively. In this situation, this is the parameter of interest. The best estimate of this difference is \(\bar{x}_F - \bar{x}_M = 4.66 - 4.60 = 0.06\).
We know that this value is likely to vary from sample to sample, and hence it has a standard error.
We cannot easily determine the standard error of this difference from the above information (though it is possible), so we must be given this information: \(\text{s.e.}(\bar{x}_F - \bar{x}_M) = 0.0929\).
Then the approximate 95% CI is from
\[ 0.06 - (2 \times 0.0929) = -0.1258 \] to \[ 0.06 + (2 \times 0.0929) = 0.2458, \] so the CI is from \(-0.13\) m to \(0.25\) m.
Notice that one of these limits is a negative value. This does not mean a negative length for a whale; that would be silly. Remember that this CI is for the difference between the mean lengths, and a negative difference just says the mean length for males is greater than the mean length for females.
So we could say:
Based on the sample, an approximate 95% CI for the difference between the mean lengths of female and male gray whales at birth is from \(0.13\) m longer for male whales to \(0.25\) m longer for female whales.
24.7 Using software: CIs for two independent means
The jamovi output (Fig. 24.5) and the SPSS output (Fig. 24.6) both show two CIs. In this book, we will use the second row of output in both cases (the 'Welch's \(t\)' row in jamovi; the 'Equal variances not assumed' row in SPSS), because it is more general and makes fewer assumptions.
From the SPSS output, the standard error is \(\text{s.e.}(\bar{x}_P - \bar{x}_C) = 19.612\). From the jamovi or SPSS output, the exact 95% CI is from \(12.3\) to \(90.9\).
The approximate and exact CIs are only slightly different: the software uses an exact \(t\)-multiplier (whereas the manual calculation uses an approximate multiplier of 2, based on the 68--95--99.7 rule), and the sample sizes aren't too small.
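For the curious, the exact CI can be reproduced in Python. The sketch below assumes scipy is available, and uses Welch's approximation for the degrees of freedom, which is one standard way the 'Equal variances not assumed' output is computed; in this book, you are only expected to read the exact CI from the software output.

```python
import math
from scipy import stats

# Summary values for the reaction-time data (Table 24.2)
s_P, n_P = 89.65, 32
s_C, n_C = 65.36, 32
diff = 51.59375

# Welch's approximation: standard error and degrees of freedom
var_P, var_C = s_P**2 / n_P, s_C**2 / n_C
se_diff = math.sqrt(var_P + var_C)
df = (var_P + var_C)**2 / (var_P**2 / (n_P - 1) + var_C**2 / (n_C - 1))

# Exact t-multiplier (just over 2 here), then the exact 95% CI
t_star = stats.t.ppf(0.975, df)
lo, hi = diff - t_star * se_diff, diff + t_star * se_diff
print(round(t_star, 3), round(lo, 1), round(hi, 1))
```

The limits agree with the 12.3 to 90.9 quoted from the jamovi and SPSS output.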
24.8 Statistical validity conditions: Two independent means
As usual, these results apply under certain conditions. The CI computed above is statistically valid if one of these conditions is true:
- Both sample sizes are at least 25; or
- Either sample size is smaller than 25, and the populations corresponding to both comparison groups have an approximate normal distribution.
The sample size of 25 is a rough figure here, and some books give other (similar) values (such as 30). We can explore the histograms of the samples to determine if normality of the populations seems reasonable.
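As a compact restatement of the guideline above, here is a small Python helper (the function name `ci_statistically_valid` is ours, not the book's):

```python
def ci_statistically_valid(n_A, n_B, both_pops_approx_normal=False):
    """Rough check of the statistical validity conditions: both sample
    sizes at least 25, or smaller samples drawn from populations with
    approximately normal distributions."""
    return (n_A >= 25 and n_B >= 25) or both_pops_approx_normal

# Reaction-time data (Example 24.2): both samples of size 32
print(ci_statistically_valid(32, 32))   # True

# Face-plant data (Sect. 24.12): samples of 10 and 5, normality unknown
print(ci_statistically_valid(10, 5))    # False
```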
In addition to the statistical validity condition, the CI will be
- internally valid if the study was well designed; and
- externally valid if the sample is a simple random sample and is internally valid.
Example 24.2 (Statistical validity) For the reaction-time data, both samples are larger than \(25\), so the CI will be statistically valid.
Example 24.3 (Statistical validity) In the whales examples of Example 24.1, the two sample sizes are 26 (for females) and 30 (for males).
Since both samples are larger than 25, the CI will be statistically valid.
24.9 Error bar charts
A useful way to display the CIs from two (or more) groups is with an error bar chart, which displays the CIs (or sometimes the standard errors) for each group being compared. (A boxplot displays the data.)
Error bar charts display the expected variation in the sample means from sample to sample, while boxplots display the variation in the individual observations and show the median.
For the reaction time data, the error bar chart (Fig. 24.7) shows the 95% CI for each group (the mean has been added as a dot).
What is different about the information displayed in the error bar chart in (Fig. 24.7) and the boxplot (Fig. 24.1)?
The error bar chart helps us understand how precisely the sample mean estimates the population mean.
The boxplot shows the variation in the individual data values.
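An error bar chart like Fig. 24.7 can be sketched with matplotlib (assuming it is installed), taking the approximate 95% CI for each group as the mean plus or minus twice its standard error:

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen, with no display needed
import matplotlib.pyplot as plt

# Summary values from Table 24.2 (reaction times, in ms)
groups = ["Using phone", "Not using phone"]
means = [585.19, 533.59]
std_errors = [15.847, 11.554]

# Approximate 95% CI half-width for each mean: 2 x s.e.
half_widths = [2 * se for se in std_errors]

fig, ax = plt.subplots()
positions = range(len(groups))
ax.errorbar(positions, means, yerr=half_widths, fmt="o", capsize=5)
ax.set_xticks(positions)
ax.set_xticklabels(groups)
ax.set_xlim(-0.5, len(groups) - 0.5)
ax.set_ylabel("Mean reaction time (ms)")
fig.savefig("error-bar-chart.png")
```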
Example 24.4 (Error bar charts) A study^{408} examined the impact of plastic litter on the shoreline at Talim Bay (Batangas, Philippines) during various seasons, and the impact on the gastropod Nassarius pullus.
The error bar chart (Fig. 24.8) shows that summer seems different---in terms of average value (mean) and the amount of variation---than the other seasons.
Example 24.5 (Error bar charts) A study^{409} examined the foliage biomass of small-leaved lime trees from three sources: coppices, natural, and planted.
Two graphical summaries are shown in Fig. 24.9: a boxplot (showing the variation in individual trees) and an error bar chart (showing the variation in the sample means). Using a better scale for the error-bar plot is helpful (Fig. 24.10).
Example 24.6 (Whales) Using the data about gray whales from Example 24.1, the error bar chart in Fig. 24.11 can be constructed.
The plot seems to suggest little difference between the mean length of female and male gray whales at birth.
24.10 Example: Speed signage
In an attempt to reduce vehicle speeds on freeway exit ramps, a Chinese study tried using additional signage.^{410}
At one site studied (Ningxuan Freeway), speeds were recorded for 38 vehicles before the extra signage was added, and then for 41 vehicles after the extra signage was added.
The researchers are hoping that the addition of extra signage will reduce the mean speed of the vehicles.
The RQ is:
At this freeway exit, is the mean vehicle speed the same before extra signage is added and after extra signage is added?
The data are not paired: different vehicles are measured before and after the extra signage is added.
The data are summarised in Table 24.3.
 | Mean | Std deviation | Std error | Sample size
---|---|---|---|---
Before | 98.02 | 13.19 | 2.140 | 38
After | 92.34 | 13.13 | 2.051 | 41
Speed reduction | 5.68 | | 2.964 |
The parameter is \(\mu_{\text{Before}} - \mu_{\text{After}}\), the reduction in the mean speed.
The standard error must be given; you cannot easily calculate it from the other information, and you are not expected to do so.
A useful graphical summary of the data is a boxplot (Fig. 24.12, left panel); likewise, an error bar chart can be produced by computing the CI for each group (Fig. 24.12, right panel).
Based on the sample, an approximate 95% CI for the difference in mean speeds is
\[ 5.68 \pm (2 \times 2.964), \] or from \(-0.24\) to \(11.6\) km/h, higher before the addition of extra signage. (The negative value refers to a negative reduction; that is, an increase in speed of 0.24 km/h.)
This means that, if many samples of size 38 and 41 were found, and the difference between the mean speeds were found, about 95% of the CIs would contain the population difference (\(\mu_{\text{Before}} - \mu_{\text{After}}\)). Loosely speaking, there is a 95% chance that our CI straddles the difference in the population means (\(\mu_{\text{Before}} - \mu_{\text{After}}\)).
We could write:
Based on the sample, an approximate 95% CI for the reduction in mean speeds after adding extra signage is between -0.24 km/h (i.e., an increase of 0.24 km/h) and 11.6 km/h (two independent samples).
Using the validity conditions, the CI is statistically valid.
Remember: clearly state which mean is larger.
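The repeated-sampling interpretation above can be illustrated by simulation. The population values below are assumptions chosen near the sample values in Table 24.3, purely for illustration:

```python
import math
import random
import statistics

random.seed(2)

# Assumed population values (unknown in practice; for illustration only):
mu_B, sigma_B, n_B = 98.0, 13.2, 38   # before extra signage, km/h
mu_A, sigma_A, n_A = 92.0, 13.1, 41   # after extra signage
true_diff = mu_B - mu_A

covered = 0
reps = 2000
for _ in range(reps):
    before = [random.gauss(mu_B, sigma_B) for _ in range(n_B)]
    after = [random.gauss(mu_A, sigma_A) for _ in range(n_A)]
    diff = statistics.mean(before) - statistics.mean(after)
    se = math.sqrt(statistics.stdev(before)**2 / n_B
                   + statistics.stdev(after)**2 / n_A)
    # Does this sample's approximate 95% CI straddle the true difference?
    if diff - 2 * se <= true_diff <= diff + 2 * se:
        covered += 1

print(covered / reps)  # close to 0.95
```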
24.11 Example: Health Promotion services
A study^{411} compared the access to health promotion (HP) services for people with and without a disability.
Access was measured using the quantitative Barriers to Health Promoting Activities for Disabled Persons (BHADP) scale. Higher scores mean greater barriers to health promotion services.
The RQ is:
What is the difference between the mean BHADP scores, for people with and without a disability?
The parameter is \(\mu_D - \mu_{ND}\), the difference between the two population means (disability, minus non-disability).
In this case, only the summary data are available (Table 24.4); the raw data are not. Nonetheless, a useful graphical summary (an error bar chart) can be produced by computing the CI for each group manually (Fig. 24.13).
The best estimate of the difference between the population means is the difference between sample means: \((\bar{x}_D - \bar{x}_{ND}) = 6.76\). The standard error for estimating this difference is \(\text{s.e.}(\bar{x}_D - \bar{x}_{ND}) = 0.80285\), as given in the table.
 | Sample mean | Std deviation | Sample size | Std error
---|---|---|---|---
Disability | 31.83 | 7.73 | 132 | 0.6728
No disability | 25.07 | 4.8 | 137 | 0.4101
Difference | 6.76 | | | 0.80285
The standard error is given; you cannot easily calculate this from the other information. You are not expected to do so.
Based on the sample, an approximate 95% CI for the difference in population mean BHADP scores between people with and without a disability is
\[ 6.76 \pm (2 \times 0.80285), \] or from \(5.15\) to \(8.37\), higher for those with a disability.
This means that, if many samples of size 132 and 137 were found, and the difference between the mean BHADP scores were found, about 95% of the CIs would contain the population difference (\(\mu_D - \mu_{ND}\)). Loosely speaking, there is a 95% chance that our CI straddles the difference in the population means (\(\mu_D - \mu_{ND}\)).
We could write:
Based on the sample, an approximate 95% CI for the difference in BHADP scores is between \(5.15\) and \(8.37\), higher for those with a disability.
Remember: clearly state which mean is larger.
Using the validity conditions, the CI is statistically valid.
24.12 Example: Face-plant study
A study^{412} compared the lean-forward angle in younger and older women. An elaborate set-up was constructed to measure this angle, using a harness.
Consider the RQ:
Among healthy women, what is the difference between the mean lean-forward angle for younger women and for older women?
The parameter is \(\mu_Y - \mu_O\), the difference between the two population means (younger, minus older).
The data are shown in Table 24.5. An appropriate graph for displaying the data is a boxplot or dotplot (since the sample sizes are small).
The appropriate numerical summary for the means of two independent samples summarises both groups, and (most importantly) the difference (Table 24.6). Summarising the difference is important, as the RQ is about those differences.
The error bar chart is the best plot for comparing the means of the two groups (Fig. 24.14).
Younger women | Younger women | Older women
---|---|---
29 | 34 | 18
32 | 27 | 15
34 | 32 | 23
31 | 28 | 13
33 | 27 | 12
 | Mean | Standard deviation | Standard error | Sample size
---|---|---|---|---
Younger women | 30.7 | 2.75 | 0.87 | 10
Older women | 16.2 | 4.44 | 1.98 | 5
Difference | 14.5 | | 2.17 |
The second row of the jamovi output (Fig. 24.15) and SPSS output (Fig. 24.16) show that the 95% CI is from \(9.10\) to \(19.90\). (We could also compute the approximate 95% CI manually.) After rounding the numbers:
Based on the sample, a 95% CI for the difference between population mean one-step fall-recovery angles for healthy women is between \(9.1\) and \(19.9\) degrees greater for younger women than for older women (two independent samples).
The statement clearly states which group has the higher mean (younger women). This CI tells us that if we found many samples (of sizes 10 and 5) in the same way, and computed the CI for the difference between the mean from each sample, about 95% of the CIs would contain the difference between the means in the population: \(\mu_Y - \mu_{O}\). Loosely speaking: There is a 95% chance that our CI straddles \(\mu_Y - \mu_{O}\).
The CI may not be statistically valid (as the sample sizes are not large), so the CIs may not be accurate.
24.13 Quick review questions
- The appropriate graph for displaying quantitative data for two separate groups is a:
- True or false: The difference in population means could be denoted by \(\mu_A - \mu_B\).
- True or false: The standard error of the difference between the sample means is denoted by \(\text{s.e.}(\bar{x}_A) - \text{s.e.}(\bar{x}_B)\).
24.14 Exercises
Selected answers are available in Sect. D.23.
Exercise 24.1 Earlier, we used the NHANES study data (Sect. 12.9), and considered this RQ:
Among Americans, is the mean direct HDL cholesterol different for current smokers and non-smokers?
Use the SPSS output (Fig. 24.17) to answer these questions.
- Construct an appropriate table showing the numerical summary.
- Determine, and suitably communicate, the 95% CI for the difference between the direct HDL cholesterol values between current smokers and non-smokers.
Exercise 24.2 A study^{413} of the effectiveness of echinacea to treat the common cold compared, among other things, the duration of the cold for participants treated with echinacea or with a placebo. Participants were blinded to the treatment, and allocated to the groups randomly. A summary of the data is given in Table 24.7.
- Compute the standard error for the mean duration of symptoms for each group.
- Compute an approximate 95% CI for the difference between the mean durations for the two groups.
- In which direction is the difference computed? What does it mean when the difference is calculated in this way?
- Compute an approximate 95% CI for the population mean duration of symptoms for those treated with echinacea.
- Are the CIs likely to be statistically valid?
 | Mean | Std deviation | Std error | Sample size
---|---|---|---|---
Placebo | 6.87 | 3.62 | | 176
Echinacea | 6.34 | 3.31 | | 183
Difference | 0.53 | | 0.367 |
Exercise 24.3 Carpal tunnel syndrome (CTS) is pain experienced in the wrists. One study^{414} compared two different treatments: night splinting, or gliding exercises.
Participants were randomly allocated to one of the two groups. Pain intensity (measured using a quantitative visual analog scale; larger values mean greater pain) were recorded after one week of treatment. The data are summarised in Table 24.8.
- Compute the standard error for the mean pain intensity for each group.
- In which direction is the difference computed? What does it mean when the difference is calculated in this way?
- Compute an approximate 95% CI for the difference in the mean pain intensity for the treatments.
- Compute an approximate 95% CI for the population mean pain intensity for those treated with splinting.
- Are the CIs likely to be statistically valid?
 | Mean | Std deviation | Std error | Sample size
---|---|---|---|---
Exercise | 0.8 | 1.4 | | 10
Splinting | 1.1 | 1.1 | | 10
Difference | 0.3 | | 0.563 |
Exercise 24.4 A study^{415} examined the sugar consumption in industrialised (mean: 41.8 kg/person/year) and non-industrialised (mean: 24.6 kg/person/year) countries. Using the jamovi output (Fig. 24.18), write down and interpret the CI.
Exercise 24.5 In an attempt to reduce vehicle speeds on freeway exit ramps, a Chinese study tried using additional signage.^{416}
At one site studied (Ningxuan Freeway), speeds were recorded at various points on the freeway exit for 38 vehicles before the extra signage was added, and then for 41 vehicles after the extra signage was added.
From this data, the deceleration of each vehicle was determined (data below) as the vehicle left the 120 km/h speed zone and approached the 80 km/h speed zone.
Use the data, and the summary in Table 24.9, to address this RQ:
At this freeway exit, what is the difference between the mean vehicle deceleration, comparing the times before the extra signage is added and after extra signage is added?
 | Mean | Std deviation | Sample size | Std error
---|---|---|---|---
Before | 0.0745 | 0.0494 | 38 | 0.00802
After | 0.0765 | 0.0521 | 41 | 0.00814
Change | -0.0020 | | | 0.00181
In this context, the researchers are hoping that the extra signage might cause cars to slow down faster (i.e., they will decelerate more, on average, after adding the extra signage).
- Identify clearly the parameter of interest to understand how much the deceleration increased after adding the extra signage.
- Compute and interpret the CI for this parameter.