35 Tests for means of two independent groups

So far, you have learnt to ask a RQ, design a study, describe and summarise the data, construct confidence intervals, and perform some hypothesis tests. In this chapter, you will learn to:

  • conduct hypothesis tests for comparing two means.
  • determine whether the conditions for using these methods apply in a given situation.

35.1 Introduction: garter snakes

Some Mexican garter snakes (Thamnophis melanogaster) live in habitats with no crayfish, while some live in habitats with crayfish and hence use crayfish as a food source. Manjarrez, Macias Garcia, and Drummond (2017) were interested in whether the snakes in these two regions were different:

For female Mexican garter snakes, is the mean snout--vent length (SVL) different for those in regions with crayfish and without crayfish?

Two different groups of snakes are studied, so the data are not paired. The data are shown below.

Since the data are available (see above), a numerical summary must summarise the difference between the means, as the RQ is about the difference. Each group can be summarised too. The information is found using software (Fig. 35.1), and can be compiled into a table (Table 35.2). An appropriate graphical summary of the data is (for example) a boxplot (Fig. 35.2, left panel). An error bar chart (Fig. 35.2, right panel), which compares the sample means, should also be produced.


FIGURE 35.1: jamovi output for the snakes data

35.2 Statistical hypotheses and notation

Since two groups are being compared, subscripts are used to distinguish between the statistics for the two groups. Calling these groups \(A\) and \(B\) in general, the notation is shown in Table 35.1. Using this notation, the parameter in the RQ is the difference between population means: \(\mu_A - \mu_B\). As usual, the population values are unknown, so this parameter is estimated using the statistic \(\bar{x}_A - \bar{x}_B\).

TABLE 35.1: Notation used to distinguish between the two independent groups
Group A Group B
Population means: \(\mu_A\) \(\mu_B\)
Sample means: \(\bar{x}_A\) \(\bar{x}_B\)
Standard deviations: \(s_A\) \(s_B\)
Standard errors: \(\displaystyle\text{s.e.}(\bar{x}_A) = \frac{s_A}{\sqrt{n_A}}\) \(\displaystyle\text{s.e.}(\bar{x}_B) = \frac{s_B}{\sqrt{n_B}}\)
Sample sizes: \(n_A\) \(n_B\)

We define the difference as the mean SVL for female snakes living in non-crayfish regions (\(N\)), minus the mean SVL for female snakes living in crayfish regions (\(C\)): \(\mu_N - \mu_C\). This is the parameter. By this definition, the difference refers to how much larger (on average) the SVL is for snakes living in non-crayfish regions.

Here the difference is computed as the mean SVL for snakes living in non-crayfish regions, minus the mean SVL for snakes living in crayfish regions. Computing the difference in the reverse direction (i.e., crayfish regions, minus non-crayfish regions) is also correct. You need to be clear about how the difference is computed, and be consistent throughout.

As always, the null hypothesis is the default 'no difference, no change, no relationship' position; any difference between the parameter and statistic is due to sampling variation (Sect. 33.2). Hence, the null hypothesis is 'no difference' between the population mean SVLs of the two groups:

  • \(H_0\): \(\mu_N - \mu_C = 0\) (or \(\mu_N = \mu_C\)).

From the RQ, the alternative hypothesis is two-tailed:

  • \(H_1\): \(\mu_N - \mu_C\ne 0\) (or \(\mu_N \ne \mu_C\)).

This hypothesis proposes that any difference between the sample means is because a difference really exists between the population means.

35.3 Sampling distribution for \(\bar{x}_1 - \bar{x}_2\)

The difference between the sample mean SVLs depends on which one of the many possible samples is randomly obtained, even if the difference between the means in the population is zero. Here, the difference between the sample means is \(8.394\) cm, but this value will vary from sample to sample; that is, there is sampling variation.

Definition 35.1 (Sampling distribution for the difference between two sample means) The sampling distribution of the difference between two sample means \(\bar{x}_A\) and \(\bar{x}_B\) is, when the appropriate conditions are met (Sect. 28.8), described by:

  • an approximate normal distribution,
  • centred around a sampling mean whose value is \({\mu_{A}} - {\mu_{B}} = 0\), the difference between the population means (from \(H_0\)),
  • with a standard deviation of \(\displaystyle\text{s.e.}(\bar{x}_A - \bar{x}_B)\).

The standard error for the difference between the means is found using
\[ \text{s.e.}(\bar{x}_A - \bar{x}_B) = \sqrt{ \text{s.e.}(\bar{x}_A)^2 + \text{s.e.}(\bar{x}_B)^2}, \] though this value will usually be given (e.g., on computer output).

For the SVL data, the sampling variation of \(\bar{x}_N - \bar{x}_C\) can be described as having:

  • an approximate normal distribution,
  • centred around a sampling mean whose value is \({\mu_{N}} - {\mu_{C}} = 0\), the difference between the means (from \(H_0\)),
  • with a standard deviation, called the standard error for the difference between the means, of \[ \text{s.e.}(\bar{x}_N - \bar{x}_C) = \sqrt{1.2160^2 + 2.1117^2 } = 2.4368. \] The calculation uses the standard errors from Table 35.2. The answer agrees with the second row of the jamovi output (Fig. 35.1).
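
The standard-error calculation above can be sketched in a few lines of Python. This is a minimal illustration, not part of the jamovi workflow: the function names `standard_error` and `se_difference` are invented here, and the inputs are the rounded standard deviations and sample sizes from Table 35.2, so the results agree with the table only to within rounding.

```python
import math


def standard_error(sd: float, n: int) -> float:
    """Standard error of a single sample mean: s / sqrt(n)."""
    return sd / math.sqrt(n)


def se_difference(se_a: float, se_b: float) -> float:
    """Standard error of the difference between two independent sample means."""
    return math.sqrt(se_a**2 + se_b**2)


# Rounded summary statistics from Table 35.2 (SVL, in cm)
se_non_crayfish = standard_error(sd=7.79, n=41)
se_crayfish = standard_error(sd=12.49, n=35)
se_diff = se_difference(se_non_crayfish, se_crayfish)

print(f"s.e. (non-crayfish): {se_non_crayfish:.3f}")
print(f"s.e. (crayfish):     {se_crayfish:.3f}")
print(f"s.e. (difference):   {se_diff:.3f}")
```

Note that the standard errors are squared before adding: the standard error of the difference is always larger than either individual standard error, but smaller than their sum.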

jamovi gives results from two similar hypothesis tests. In this book, we will always use the second row of information: the "Welch's \(t\)" row in jamovi.

This row of information does not assume that the population standard deviations are equal. The first row of information does assume the population standard deviations are equal. In most cases, the information in both rows is similar anyway.

TABLE 35.2: Numerical summaries of SVL (in cm) for female snakes in two regions
Mean Standard deviation Sample size Standard error
Crayfish region \(34.17\) \(12.49\) \(35\) \(2.112\)
Non-crayfish region \(42.57\) \(\phantom{0}7.79\) \(41\) \(1.216\)
Difference \(\phantom{0}8.39\) \(2.437\)

FIGURE 35.2: Boxplot (left) and error bar chart (right) of SVL for female snakes in two regions

35.4 Computing the value of the test statistic: \(t\)-scores

The observed difference between sample means, relative to what was expected, is found by computing the test statistic; in this case, a \(t\)-score. The jamovi output (Fig. 35.1) gives the \(t\)-score, but the \(t\)-score can also be computed using the information in Table 35.2:
\[\begin{align*} t &= \frac{\text{sample statistic} - \text{assumed population parameter, from $H_0$}} {\text{standard error for sample statistic}}\\ &= \frac{ (\bar{x}_N - \bar{x}_C) - (\mu_N - \mu_C)} {\text{s.e.}(\bar{x}_N - \bar{x}_C)} = \frac{8.39 - 0}{2.4368} = 3.44, \end{align*}\] as in the software output.
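
As a quick check of the arithmetic, the same \(t\)-score can be computed directly from the three ingredients in the formula. This is only a sketch: the helper name `t_score` is an illustrative choice, and the inputs are the rounded values quoted above.

```python
def t_score(statistic: float, hypothesised: float, std_error: float) -> float:
    """How many standard errors the sample statistic lies from the hypothesised value."""
    return (statistic - hypothesised) / std_error


# Difference between sample means, value from H0, and s.e. of the difference
t = t_score(statistic=8.39, hypothesised=0.0, std_error=2.4368)
print(round(t, 2))  # 3.44
```

Reversing the direction of the difference (crayfish minus non-crayfish) would give the same \(t\)-score with a negative sign.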

35.5 Determining \(P\)-values

A \(P\)-value determines if the sample statistic is consistent with the assumption (i.e., \(H_0\)). Since the \(t\)-score is large, the \(P\)-value will be small using the \(68\)--\(95\)--\(99.7\) rule. This is confirmed by the software (Fig. 35.1): the two-tailed \(P\)-value is \(0.0011\).

A small \(P\)-value suggests the observations are inconsistent with the assumption (Table 33.1), and the difference between the sample means could not be reasonably explained by sampling variation.
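
The tail-area reasoning behind the \(68\)--\(95\)--\(99.7\) rule can be illustrated with Python's standard library. The sketch below treats the \(t\)-score like a \(z\)-score and uses a normal curve; this is an assumption made for illustration only. jamovi's Welch's \(t\) row is based on a \(t\)-distribution, whose tails are heavier, so the software \(P\)-value is somewhat larger than this approximation. The function name `two_tailed_p_normal` is invented here.

```python
from statistics import NormalDist


def two_tailed_p_normal(t: float) -> float:
    """Area in both tails beyond |t| under a standard normal curve."""
    return 2 * (1 - NormalDist().cdf(abs(t)))


p_approx = two_tailed_p_normal(3.445)

# The 68-95-99.7 rule says about 0.3% of values lie more than
# three standard deviations from the mean, so P should be below 0.003.
print(f"Approximate two-tailed P-value: {p_approx:.4f}")
print("Smaller than 0.003?", p_approx < 0.003)
```

Either way, the conclusion is the same: the \(P\)-value is small.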


35.6 Writing conclusions

In conclusion, write:

Strong evidence exists in the sample (two independent samples \(t = 3.445\); two-tailed \(P = 0.0011\)) that the population mean SVL is different for female snakes living in crayfish regions (mean: \(34.17\) cm; \(n = 35\)) and non-crayfish regions (mean: \(42.57\) cm; \(n = 41\); \(95\)% CI for the difference: \(3.51\) to \(13.28\) cm longer for those in non-crayfish regions).

The conclusion contains an answer to the RQ, the evidence leading to this conclusion (\(t = 3.445\); two-tailed \(P = 0.0011\)), and sample summary statistics, including a CI.

35.7 Statistical validity conditions

As usual, these results apply under certain conditions, which are the same as those for forming a CI for the difference between two means. The test above is statistically valid if one of these conditions is true:

  1. Both sample sizes are at least \(25\); or
  2. One or both sample sizes are smaller than \(25\), and the populations corresponding to both comparison groups have an approximate normal distribution.

The sample size of \(25\) is a rough figure; some books give other values (such as \(30\)). This condition ensures that the sampling distribution of the difference between the sample means is approximately normal (so that, for example, the \(68\)--\(95\)--\(99.7\) rule can be used). Histograms of the sample data can be used to judge whether normality of the populations seems reasonable.

Example 35.1 (Statistical validity) For the snake data, both sample sizes exceed \(25\) (\(41\) and \(35\)), so the results are statistically valid. The data in each group do not need to be normally distributed, since both sample sizes are larger than \(25\).
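
The sample-size part of the validity check is simple enough to express as a small helper. This is a sketch only: the function names are invented, the threshold of \(25\) is the rough figure used in this chapter, and the second call uses hypothetical sample sizes (not from any study in this book) to show the other branch. The code cannot judge normality; that still requires looking at the data.

```python
def sample_sizes_large_enough(n_a: int, n_b: int, threshold: int = 25) -> bool:
    """True when both groups meet the rough sample-size condition."""
    return n_a >= threshold and n_b >= threshold


def validity_advice(n_a: int, n_b: int, threshold: int = 25) -> str:
    """Plain-language advice based only on the two sample sizes."""
    if sample_sizes_large_enough(n_a, n_b, threshold):
        return "Both sample sizes are large enough: the test is statistically valid."
    return ("At least one sample size is small: check that both populations "
            "are approximately normal (for example, using histograms).")


print(validity_advice(41, 35))   # snake data (Example 35.1)
print(validity_advice(12, 40))   # hypothetical sample sizes
```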

35.8 Example: speed signage

(This study was seen in Sect. 28.9.) In an attempt to reduce vehicle speeds on freeway exit ramps, a Chinese study added signage (Ma et al. 2019). At one site studied (Ningxuan Freeway), speeds were recorded for \(38\) vehicles before (\(B\)) the extra signage was added, and then for \(41\) vehicles after (\(A\)) the extra signage was added (data in Sect. 28.9). The researchers are hoping the addition of extra signage will reduce the mean speed of the vehicles. The RQ is:

At this freeway exit, does the mean vehicle speed reduce after extra signage is added?

The data are not paired: different vehicles are measured before and after the extra signage is added. The data are summarised in Table 35.3. A graphical summary of the data is a boxplot (Fig. 28.6, left panel); an error bar chart compares the means, with a CI for each group (Fig. 28.6, right panel).

TABLE 35.3: The signage data summary (in km/h)
Mean Std deviation Std error Sample size
Before \(\phantom{0}98.02\) \(13.19\) \(2.140\) \(38\)
After \(\phantom{0}92.34\) \(13.13\) \(2.051\) \(41\)
Speed reduction \(\phantom{0}\phantom{0}5.68\) \(2.964\)

Define \(\mu\) as the mean speed (in km.h\(^{-1}\)) on the exit ramp. Then, the parameter is \(\mu_B - \mu_A\), the reduction in the population mean speed after signage is added. The hypotheses are:

  • \(H_0\): \(\mu_B - \mu_A = 0\): There is no change in the population mean speeds.
  • \(H_1\): \(\mu_B - \mu_A > 0\): The population mean speed has reduced.

The best estimate of the difference in population means is the difference between the sample means: \((\bar{x}_B - \bar{x}_A) = 5.68\). Table 35.3 gives the standard error for estimating this difference as \(\text{s.e.}(\bar{x}_B - \bar{x}_A) = 2.964\). Using the summary information in Table 35.3, the \(t\)-score (using Eq. (32.2)) is
\[ t = \frac{(\bar{x}_B - \bar{x}_A) - (\mu_B - \mu_A)}{\text{s.e.}(\bar{x}_B - \bar{x}_{A})} = \frac{5.68 - 0}{2.964} = 1.92. \] (Recall that \(\mu_B - \mu_A = 0\) from the null hypothesis.)

Remembering that the alternative hypothesis is one-tailed, and that the \(t\)-score lies between \(1\) and \(2\), the \(P\)-value (using the \(68\)--\(95\)--\(99.7\) rule) is larger than \(0.025\) (half of \(0.05\)), but smaller than \(0.16\) (half of \(0.32\)), so making a clear decision is difficult without using software. However, since the \(t\)-score is just less than \(2\), we suspect that the \(P\)-value is likely to be closer to \(0.025\) than \(0.16\).

From software, \(P = 0.02968\) (you cannot be this precise just using the \(68\)--\(95\)--\(99.7\) rule). Using Table 33.1, this \(P\)-value provides moderate evidence of a reduction in mean speeds. We conclude:

Moderate evidence exists in the sample (\(t = 1.92\); one-tailed \(P = 0.030\)) that mean speeds have reduced after the addition of extra signage (Before: mean speed: \(98.02\) km.h\(^{-1}\); \(n = 38\); standard deviation: \(13.19\) km.h\(^{-1}\); After: mean speed: \(92.34\) km.h\(^{-1}\); \(n = 41\); standard deviation: \(13.13\) km.h\(^{-1}\); \(95\)% CI for the difference: \(-0.24\) to \(11.6\) km.h\(^{-1}\)).

Whether the mean speed reduction of \(5.68\) km.h\(^{-1}\) has practical importance is a different issue. Since both sample sizes are larger than \(25\), the test is statistically valid.

Remember: which mean is larger must be clear!
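
For a one-tailed test such as this one, only the area in one tail counts. The sketch below, which assumes a normal approximation (so it will not exactly match software output based on a \(t\)-distribution), computes the upper-tail area at \(t = 1\), at the observed \(t\)-score, and at \(t = 2\), to show where the observed one-tailed \(P\)-value sits. The name `upper_tail_area` is an illustrative choice.

```python
from statistics import NormalDist


def upper_tail_area(z: float) -> float:
    """Area to the right of z under a standard normal curve."""
    return 1 - NormalDist().cdf(z)


# Observed t-score for the signage data (Table 35.3)
t_observed = (5.68 - 0) / 2.964

for label, value in [("t = 1", 1.0), ("observed t", t_observed), ("t = 2", 2.0)]:
    print(f"{label:>10}: upper-tail area = {upper_tail_area(value):.3f}")
```

Because \(H_1\) is \(\mu_B - \mu_A > 0\), the upper tail is the relevant one. If the difference had been defined as after minus before, the lower tail would be used instead.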

35.9 Example: chamomile tea

(This study was seen in Sects. 27.10, 28.10 and 34.9.) A study of patients with Type 2 diabetes mellitus (T2DM) randomly allocated \(32\) patients into a control group (who drank hot water), and \(32\) to receive chamomile tea (Rafraf, Zemestani, and Asghari-Jafarabadi (2015), p. 164).

The total glucose (TG) was measured for each individual, in both groups, both before the intervention and after eight weeks on the intervention. Summary data are given in Table 27.3. Evidence suggests that the chamomile tea group shows a mean reduction in TG (Sect. 34.9), while the hot-water group shows no evidence of a reduction. However, the differences between the chamomile-tea and the hot-water groups may be due to the samples selected (i.e., sampling variation), so comparing the changes in the two groups is helpful.

The following relational RQ can be asked:

For patients with T2DM, is the mean reduction in TG greater for the chamomile tea group compared to the hot water group?

Notice the RQ is one-tailed; the aim of the study is to determine if the chamomile-tea drinking group performs better (i.e., greater change in TG) than the control group.

This RQ is comparing two separate groups, specifically comparing the differences between the two groups. This study contains both within-individuals comparisons (see Sect. 34.9) and a between-individuals comparison (this section); see Fig. 35.3.


FIGURE 35.3: The chamomile-tea study has two within-individuals comparisons, and a between-individuals comparison

The corresponding hypotheses are:
\[ \text{$H_0$: $\mu_T - \mu_W = 0$ \qquad and\qquad $H_1$: $\mu_T - \mu_W > 0$} \] where \(\mu\) refers to the mean reduction in TG, \(T\) refers to the tea-drinking group, and \(W\) to the hot-water drinking group.

The parameter \(\mu_T - \mu_W\) is estimated by the statistic \(\bar{x}_T - \bar{x}_W = 45.74\) mg.dl\(^{-1}\). The standard error for the statistic was found (in Sect. 28.10) as \(\text{s.e.}(\bar{x}_T - \bar{x}_W) = 8.42\). Hence, the test statistic is:
\[ t = \frac{(\bar{x}_T - \bar{x}_W) - (\mu_T - \mu_W)}{\text{s.e.}(\bar{x}_T - \bar{x}_W)} = \frac{45.74 - 0}{8.42} = 5.43, \] which is very large, so the \(P\)-value will be very small (using the \(68\)--\(95\)--\(99.7\) rule), and certainly smaller than \(0.001\).

We write:

There is very strong evidence (\(t = 5.43\); one-tailed \(P < 0.001\)) that the mean reduction in TG for the chamomile-tea drinking group is greater than the mean reduction in TG for the hot-water drinking group (difference between means: \(45.74\) mg.dl\(^{-1}\); approx. \(95\)% CI: \(28.64\) to \(62.84\) mg.dl\(^{-1}\)).

Again, the sample sizes are larger than \(25\), so the results are statistically valid.

35.10 Chapter summary

To test a hypothesis about a difference between two population means \(\mu_1 - \mu_2\), based on the value of the difference between two sample means \(\bar{x}_1 - \bar{x}_2\), assume the value of \(\mu_1 - \mu_2\) in the null hypothesis to be true (usually zero). Then, the difference between the sample means varies from sample to sample and, under certain statistical validity conditions, varies with an approximate normal distribution centred around the hypothesised value of \(\mu_1 - \mu_2\), with a standard deviation of \(\text{s.e.}(\bar{x}_1 - \bar{x}_2)\). This distribution describes what values of the difference between the sample means could be expected if the value of \(\mu_1 - \mu_2\) in the null hypothesis was true. The test statistic is
\[ t = \frac{ (\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\text{s.e.}(\bar{x}_1 - \bar{x}_2)}, \] where \(\mu_1 - \mu_2\) is the hypothesised value given in the null hypothesis (usually zero). This describes the observations. The \(t\)-value is like a \(z\)-score, and so an approximate \(P\)-value is estimated using the \(68\)--\(95\)--\(99.7\) rule or normal-distribution tables, which is how we weigh the evidence to determine if it is consistent with the assumption.
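
The whole procedure summarised above can be gathered into one short Python sketch. Everything here is illustrative: `GroupSummary`, `two_sample_test` and the `alternative` labels are names invented for this example, and the \(P\)-value uses a normal approximation (treating \(t\) like a \(z\)-score), so it will differ slightly from the Welch's \(t\) output produced by software. The example uses the signage summary from Table 35.3.

```python
import math
from dataclasses import dataclass
from statistics import NormalDist


@dataclass
class GroupSummary:
    mean: float
    sd: float
    n: int

    @property
    def std_error(self) -> float:
        """Standard error of this group's sample mean."""
        return self.sd / math.sqrt(self.n)


def two_sample_test(g1: GroupSummary, g2: GroupSummary,
                    hypothesised_diff: float = 0.0,
                    alternative: str = "two-sided") -> dict:
    """t-score and normal-approximation P-value for (mean of g1) - (mean of g2)."""
    diff = g1.mean - g2.mean
    se = math.sqrt(g1.std_error**2 + g2.std_error**2)
    t = (diff - hypothesised_diff) / se
    z = NormalDist()
    if alternative == "greater":        # H1: difference > hypothesised value
        p = 1 - z.cdf(t)
    elif alternative == "less":         # H1: difference < hypothesised value
        p = z.cdf(t)
    elif alternative == "two-sided":    # H1: difference != hypothesised value
        p = 2 * (1 - z.cdf(abs(t)))
    else:
        raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")
    return {"difference": diff, "std_error": se, "t": t, "approx_p": p}


# Signage data (Table 35.3): difference defined as before minus after
before = GroupSummary(mean=98.02, sd=13.19, n=38)
after = GroupSummary(mean=92.34, sd=13.13, n=41)
result = two_sample_test(before, after, alternative="greater")

for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

The order of the two groups and the choice of `alternative` must match how the difference is defined in the hypotheses; swapping the groups changes the sign of the \(t\)-score and the relevant tail.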


35.11 Quick review questions

A study (Y.-M. Lee et al. 2016) compared using a vegan (\(n = 46\)) and a conventional (\(n = 47\)) diet for \(12\) weeks, for a group of Koreans with Type II diabetes. A summary of the data for iron levels is shown in Table 35.4.

TABLE 35.4: Comparing the iron levels (mg) for subjects using a vegan or conventional diet for 12 weeks
Mean Standard deviation \(n\)
Vegan diet \(13.9\) \(2.3\) \(46\)
Conventional diet \(15.0\) \(2.7\) \(47\)
Difference \(\phantom{0}1.1\)
  1. The sample size is missing from the 'Difference' row. What should the sample size in this row be?

  2. What is the standard deviation for the difference?

  3. What is the standard error for the difference?

  4. The two-tailed \(P\)-value for the comparison is given as \(P = 0.046\). What does this mean?

35.12 Exercises

Answers to odd-numbered exercises are available in App. E.

Exercise 35.1 (This study was also seen in Exercise 28.1.) A study of gray whales (Eschrichtius robustus) measured (among other things) the length of whales at birth (Agbayani, Fortune, and Trites 2020). Are female gray whales longer than males at birth, on average, in the population? Summary information is shown in Table 35.5.

TABLE 35.5: Numerical summary of length of whales at birth (in m)
Mean Std deviation Sample size
Female \(4.66\) \(0.38\) \(26\)
Male \(4.60\) \(0.30\) \(30\)
  1. Define the difference.
  2. Write the hypotheses to answer the RQ.
  3. Compute the \(t\)-score, and approximate the \(P\)-value using the \(68\)--\(95\)--\(99.7\) rule.
  4. Write a conclusion.
  5. Is the test statistically valid?

Exercise 35.2 [Dataset: NHANES] (This study was also seen in Exercise 28.2.) Earlier, the NHANES study (Exercise 16.7) was used to answer this RQ:

Among Americans, is the mean direct HDL cholesterol different for current smokers and non-smokers?

Use the jamovi output in Fig. 35.4 to perform a hypothesis test to answer the RQ.


FIGURE 35.4: jamovi output for the NHANES data

Exercise 35.3 (This study was also seen in Exercise 28.3.) A study (Barrett et al. 2010) of the effectiveness of echinacea to treat the common cold compared, among other things, the duration of the cold for participants treated with echinacea or a placebo. Participants were blinded to the treatment, and allocated to the groups randomly. A summary of the data is given in Table 28.4.

  1. Define the difference.
  2. Write the hypotheses to answer the RQ.
  3. Compute the \(t\)-score, and approximate the \(P\)-value using the \(68\)--\(95\)--\(99.7\) rule.
  4. Write a conclusion.
  5. Is the test statistically valid?

Exercise 35.4 (This study was also seen in Exercise 28.4.) Carpal tunnel syndrome (CTS) is pain experienced in the wrists. One study (Schmid et al. 2012) compared two different treatments: night splinting, or gliding exercises.

Participants were randomly allocated to one of the two groups. Pain intensity (measured using a quantitative visual analog scale; larger values mean greater pain) was recorded after one week of treatment. The data are summarised in Table 28.5.

  1. Define the difference.
  2. Write the hypotheses to answer the RQ.
  3. Compute the \(t\)-score, and approximate the \(P\)-value using the \(68\)--\(95\)--\(99.7\) rule.
  4. Write a conclusion.
  5. Is the test statistically valid?

Exercise 35.5 [Dataset: Dental] (These data were also seen in Exercise 28.5.) A study (Woodward and Walker 1994) examined the sugar consumption in industrialised (mean: \(41.8\) kg/person/year) and non-industrialised (mean: \(24.6\) kg/person/year) countries. The jamovi output is shown in Fig. 35.5.

  1. Write the hypotheses.
  2. Write down and interpret the CI.
  3. Write a conclusion for the hypothesis test.

FIGURE 35.5: jamovi output for the sugar-consumption data

Exercise 35.6 [Dataset: Deceleration] (These data were also seen in Exercise 28.6.) In an attempt to reduce vehicle speeds on freeway exit ramps, a Chinese study tried using additional signage (Ma et al. 2019). At one site studied (Ningxuan Freeway), speeds were recorded at various points on the freeway exit for vehicles before the extra signage was added, and then for different vehicles after the extra signage was added.

From these data, the deceleration of each vehicle was determined (data with Exercise 28.6) as the vehicle left the 120 km/h speed zone and approached the 80 km/h speed zone. Use the data, and the summary in Table 35.6, to test the RQ:

At this freeway exit, is the mean vehicle deceleration the same before extra signage is added and after extra signage is added?

TABLE 35.6: The signage data summary (in m/s\(^2\))
Mean Std deviation Std error Sample size
Before \(\phantom{-}0.0745\) \(0.0494\) \(0.00802\) \(38\)
After \(\phantom{-}0.0765\) \(0.0521\) \(0.00814\) \(41\)
Change \(-0.0020\) \(0.00181\)

Exercise 35.7 [Dataset: ForwardFall] (This study was seen in Exercise 28.7.) A study (Wojcik et al. 1999) compared the lean-forward angle in younger and older women. An elaborate set-up was constructed to measure this lean-forward angle, using harnesses. Consider this RQ:

Among healthy women, is the mean lean-forward angle greater for younger women compared to older women?

Use the software output (Fig. 35.6) to answer these questions:

  1. What is the parameter? Carefully describe what it means.
  2. Is the test one- or two-tailed?
  3. Write the statistical hypothesis.
  4. Use the jamovi output to conduct the hypothesis test.
  5. Write a conclusion.

FIGURE 35.6: jamovi output for the face-plant data

Exercise 35.8 (This study was seen in Exercise 28.8.) A study (Becker, Stuifbergen, and Sands 1991) compared the access to health promotion (HP) services for people with and without a disability in the southwestern USA. 'Access' was measured using the quantitative Barriers to Health Promoting Activities for Disabled Persons (BHADP) scale. Higher scores mean greater barriers to health promotion services. The RQ is:

What is the difference between the mean BHADP scores, for people with and without a disability, in southwestern USA?

  1. Write down the hypotheses.
  2. Compute the \(t\)-score.
  3. Determine the \(P\)-value.
  4. Write a conclusion.

Exercise 35.9 [Dataset: BodyTemp] (This study was seen in Exercise 28.9.) Consider again the body temperature data from Sect. 32.1. The researchers also recorded the gender of the patients, as they also wanted to compare the mean internal body temperatures for males and females.

Use the jamovi output in Fig. 35.7 to perform this test and make a conclusion. Also comment on the practical significance of your results.


FIGURE 35.7: jamovi output for the body-temperature data

Exercise 35.10 (This study was seen in Exercise 28.10.) A study of male paramedics in Western Australia compared conventional paramedics with special operations paramedics (D. Chapman et al. 2007). Some information comparing their physical profiles is shown in Table 35.7.

  1. Consider comparing the mean grip strength for the two groups of paramedics.
    1. Carefully write down the hypotheses.
    2. Compute the \(t\)-score for testing if a difference exists between the two types of paramedics.
    3. Approximate the \(P\)-value using the \(68\)--\(95\)--\(99.7\) rule.
    4. Discuss the conditions required for statistical validity in this context.
    5. Make a conclusion.
  2. Consider comparing the mean number of push-ups completed in one minute. The standard error for the difference between the means is \(4.0689\).
    1. Carefully write down the hypotheses.
    2. Compute the \(t\)-score for testing if a difference exists between the two types of paramedics.
    3. Approximate the \(P\)-value using the \(68\)--\(95\)--\(99.7\) rule.
    4. Discuss the conditions required for statistical validity in this context.
    5. Make a conclusion.
TABLE 35.7: The physical profile of conventional (\(n=18\)) and special operation (\(n = 11\)) paramedics in Western Australia
Conventional Special Operations
Grip strength (in kg)
Mean \(51\) \(56\)
Std deviation \(\phantom{0}8\) \(\phantom{0}9\)
Std error
Push-ups (per minute)
Mean \(36\) \(47\)
Std deviation \(10\) \(11\)
Std error