31 Tests for one mean
You have learnt to ask a RQ, design a study, classify and summarise the data, construct confidence intervals, and perform some hypothesis tests. In this chapter, you will learn to:
- identify situations where conducting a test for a mean is appropriate.
- conduct hypothesis tests for one sample mean, using a \(t\)-test.
- determine whether the conditions for using these methods apply in a given situation.
31.1 Introduction: body temperatures
The average internal body temperature is commonly believed to be \(37.0^\circ\text{C}\) (\(98.6^\circ\)F). This is based on data over 150 years old (Wunderlich 1868). Mackowiak, Wasserman, and Levine (1992) re-examined this claim, to determine if these values are still appropriate (given changes in how temperature is measured, as much as actual body-temperature changes). That is, a decision is sought about the value of the population mean body temperature. This value will never be known: we would need to measure the internal body temperature of every person alive... and even those not yet born.
Define the parameter as \(\mu\), the population mean internal body temperature (in \({}^\circ\text{C}\)). A sample of people can be taken to determine whether or not there is evidence that the population mean internal body temperature is \(37.0^\circ\text{C}\).
To make this decision, the decision-making process (Sect. 19.3) is used. Assume that \(\mu = 37.0\) (there is no evidence that this accepted standard is wrong), and then examine the evidence to determine whether it supports this assumption. The RQ is:
Is the population mean internal body temperature \(37.0^\circ\text{C}\)?
31.2 Statistical hypotheses and notation
The decision-making process begins by assuming that the null hypothesis is true: that \(\mu = 37.0\). Because every sample is different, the sample mean \(\bar{x}\) will vary, and probably won't be exactly \(37.0\) even if the population mean \(\mu\) is \(37.0\). Two broad reasons could explain why:
- The population mean body temperature is \(37.0^\circ\text{C}\). However, \(\bar{x}\) isn't exactly \(37.0\) due to sampling variation.
- The population mean body temperature is not \(37.0^\circ\text{C}\). The sample mean body temperature reflects this.
These hypotheses can be written more formally as:
- The null hypothesis (\(H_0\)): \(\mu = 37.0^\circ\text{C}\); and
- The alternative hypothesis (\(H_1\)): \(\mu \ne 37.0^\circ\text{C}\).
The RQ asks if \(\mu\) is \(37.0\) or some other value (either smaller or larger than \(37.0\)). Two possibilities are considered, so the alternative hypothesis is two-tailed.
31.3 Sampling distribution for \(\bar{x}\)
To answer this RQ, data were collected by Shoemaker (1996) (Fig. 31.1). A graphical summary (Fig. 31.2) and a numerical summary, using jamovi (Fig. 31.3), show that:
- the sample mean is \(\bar{x} = 36.8052^\circ\)C,
- the sample standard deviation is \(s = 0.4073^\circ\)C, and
- the sample size is \(n = 130\).
The sample mean \(\bar{x}\) is less than the assumed value of \(\mu = 37\)... but why? Can the difference reasonably be explained by sampling variation? The approximate \(95\)% CI for \(\mu\) is from \(36.73\) to \(36.88\). This CI is narrow, implying \(\mu\) has been estimated with precision, so detecting even small deviations of \(\mu\) from \(37.0^\circ\) should be possible.
The sampling distribution of \(\bar{x}\) was given in Sect. 24.2.
Definition 31.1 (Sampling distribution of a sample mean) The sampling distribution of the sample mean is (when certain conditions are met; Sect. 31.9) described by
- an approximate normal distribution,
- centred around the sampling mean, whose value is \(\mu\) (from \(H_0\)),
- with a standard deviation (called the standard error of \(\bar{x}\)) of
\[\begin{equation} \text{s.e.}(\bar{x}) = \frac{s}{\sqrt{n}}, \tag{31.1} \end{equation}\] where \(n\) is the size of the sample, and \(s\) is the standard deviation of the data.
Hence, if \(\mu\) really was \(37.0\), the possible values of the sample means across all possible samples can be described using:
- an approximate normal distribution,
- with a sampling mean whose value is \(\mu = 37.0\) (from \(H_0\)),
- with a standard deviation of \(\text{s.e.}(\bar{x}) = s/\sqrt{n} = 0.4073/\sqrt{130} = 0.0357\).
A picture of this sampling distribution (Fig. 31.4) shows how the sample mean varies when \(n = 130\), for all possible samples, simply due to sampling variation when \(\mu = 37.0\). This enables questions to be asked about the likely values of \(\bar{x}\) that would be found in the sample, when the population mean is \(\mu = 37.0\). For example, the value of \(\bar{x}\) will be larger than \(37.0357^\circ\)C, if \(\mu\) really is \(37.0\), about \(16\)% of the time (using the \(68\)--\(95\)--\(99.7\) rule).
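As a rough check of these figures, the following Python sketch recomputes the standard error from the sample summary, and the proportion of sample means expected to be more than one standard error above \(37.0\). It uses only the standard library. The variable names are illustrative, and the sampling distribution is treated as exactly normal, which is only an approximation.

```python
from math import sqrt
from statistics import NormalDist

s = 0.4073     # sample standard deviation (deg C), from the numerical summary
n = 130        # sample size
mu_0 = 37.0    # population mean assumed in the null hypothesis

# Standard error of the sample mean, as in Eq. (31.1)
se = s / sqrt(n)

# Sampling distribution of the sample mean if H0 is true
# (assumption: treated here as exactly normal)
sampling_dist = NormalDist(mu=mu_0, sigma=se)

# Proportion of samples whose mean is more than one s.e. above mu_0
p_above = 1 - sampling_dist.cdf(mu_0 + se)

print(f"s.e. = {se:.4f}")
print(f"P(sample mean > {mu_0 + se:.4f}) = {p_above:.3f}")
```

The computed proportion is close to the \(16\)% given by the \(68\)--\(95\)--\(99.7\) rule, which rounds the area within one standard deviation of the mean to \(68\)%.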
31.4 Computing the value of the test statistic: \(t\)-tests
The sampling distribution describes how the sample mean varies; that is, what to expect from the sample means, assuming \(\mu = 37.0\). The observed value of the sample mean is \(\bar{x} = 36.8052^\circ\)C. How likely is it that a value at least this small could occur in our sample by chance (i.e., by sampling variation)?
The value of the observed sample mean can be located on the sampling distribution (Fig. 31.5). The value \(\bar{x} = 36.8052^\circ\text{C}\) is extremely small: a sample mean this low is very unlikely from a sample of \(n = 130\) when \(\mu = 37.0\). How many standard deviations is \(\bar{x}\) away from \(\mu = 37.0\)?
Relatively speaking, the distance that the observed sample mean (of \(\bar{x} = 36.8052\)) is from the mean of the sampling distribution (Fig. 31.5) is found by computing how many standard deviations the value of \(\bar{x}\) is from the mean of the distribution: \[ \frac{36.8052 - 37.0}{0.035724} = -5.453. \] This is like a \(z\)-score, but is actually called a \(t\)-score. Both \(t\) and \(z\) scores measure the number of standard deviations that a value is from the mean. Here we have a \(t\)-score, though, because the population standard deviation \(\sigma\) is unknown, and the sample standard deviation is used to compute \(\text{s.e.}(\bar{x})\).
Like \(z\)-scores, \(t\)-scores measure the number of standard deviations that a value is from the mean; in a sampling distribution, that standard deviation is called the standard error.
The calculation is therefore: \[ t = \frac{36.8052 - 37.0}{0.035724} = -5.453; \] the observed sample mean is more than five standard deviations below the population mean, which is highly unusual based on the \(68\)--\(95\)--\(99.7\) rule (Fig. 31.5).
In general, a \(t\)-score in hypothesis testing is
\[\begin{equation}
t
=
\frac{\text{sample statistic} - \text{mean of the sampling distribution}}
{\text{standard error of the sampling distribution}}
=
\frac{\bar{x} - \mu}{\text{s.e.}(\bar{x})}.
\tag{31.2}
\end{equation}\]
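Equation (31.2) can be sketched in a few lines of Python. The function name `t_score` and its argument names are hypothetical, chosen only for this illustration; the inputs are the summary statistics for the body-temperature data.

```python
from math import sqrt

def t_score(xbar, mu_0, s, n):
    """Return (standard error, t-score) for a test of one mean.

    xbar: sample mean; mu_0: value of mu in the null hypothesis;
    s: sample standard deviation; n: sample size.
    """
    se = s / sqrt(n)               # Eq. (31.1)
    return se, (xbar - mu_0) / se  # Eq. (31.2)

# Body-temperature data
se, t = t_score(xbar=36.8052, mu_0=37.0, s=0.4073, n=130)
print(f"s.e. = {se:.4f}, t = {t:.2f}")
```

Because the function needs only \(\bar{x}\), \(s\) and \(n\), it can be used when the original data are unavailable but the summary statistics are reported.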
31.5 Determining \(P\)-values
As seen in Sect. 30.5, a \(P\)-value quantifies how unusual the observed sample statistic is, after assuming the null hypothesis is true. Since \(t\)-scores and \(z\)-scores are very similar, the \(P\)-value can be approximated using the \(68\)--\(95\)--\(99.7\) rule and a diagram (Sect. 31.5.1), or using tables (Appendix B.1.). Commonly, software is used to compute the \(P\)-value (Sect. 31.5.2). \(P\)-values are very similar when \(t\)-scores and \(z\)-scores have the same value, except for small sample sizes.
31.5.1 Approximate \(P\)-values
Since \(t\)-scores are similar to \(z\)-scores, the ideas in Sect. 30.5 can be used to approximate a \(P\)-value for a \(t\)-score. In addition, tables of \(z\)-scores (Appendix B.1.) can be used to approximate the \(P\)-values for \(t\)-scores also (Sect. 30.5.2).
Both methods produce approximate \(P\)-values only, since the approximations are based on using \(z\)-scores rather than \(t\)-scores. Usually, software is used to determine \(P\)-values for \(t\)-scores.
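The approximation described above can be illustrated with a short Python sketch that treats the \(t\)-score as if it were a \(z\)-score. It uses `NormalDist` from Python's standard `statistics` module; the function name `approx_two_tailed_p` is hypothetical. As with the tables, the result is only approximate, especially for small samples.

```python
from statistics import NormalDist

def approx_two_tailed_p(t):
    """Approximate two-tailed P-value, treating the t-score as a z-score."""
    # Area in both tails beyond |t| under the standard normal distribution
    return 2 * NormalDist().cdf(-abs(t))

# Body-temperature data: t = -5.453
p = approx_two_tailed_p(-5.453)
print(p < 0.001)

# A score two standard deviations from the mean gives a P-value near 0.05,
# in line with the 68-95-99.7 rule
print(round(approx_two_tailed_p(2.0), 3))
```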
31.5.2 Exact \(P\)-values using software
Software computes the \(t\)-score and a precise \(P\)-value (Fig. 31.6).
The output (in jamovi, under the heading p) shows that the \(P\)-value is indeed very small: less than \(0.001\) (written as \(P < 0.001\)).
Some software reports a \(P\)-value of 0.000, which really means (and we should write) \(P < 0.001\): that is, the \(P\)-value is smaller than \(0.001\).
This \(P\)-value means that, if \(\mu = 37.0\), a sample mean as low as \(36.8052\) would be very unusual to observe (from a sample size of \(n = 130\)). And yet... we did. Using the decision-making process, this implies that the initial assumption (the null hypothesis) is contradicted by the data: we observed something extremely unlikely if \(\mu = 37.0\). That is, the evidence suggests that the population mean body temperature is not \(37.0^\circ\text{C}\).
For one-tailed tests, the \(P\)-value is half the value of the two-tailed \(P\)-value (provided the sample statistic lies in the direction given by the alternative hypothesis).
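To give a sense of what software does, the sketch below computes a \(P\)-value from a \(t\)-distribution using only the Python standard library. It rests on two assumptions that this chapter does not state: that the relevant \(t\)-distribution has \(n - 1\) degrees of freedom (the standard result for a test of one mean), and that the tail area can be found adequately by numerical integration (Simpson's rule over a finite interval). It is not a description of how jamovi computes its values, and all function names are hypothetical.

```python
from math import exp, lgamma, log, pi

def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_const = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_const - (df + 1) / 2 * log(1 + x * x / df))

def t_upper_tail(t, df, width=60.0, steps=20000):
    """Approximate P(T > t) using Simpson's rule on [t, t + width].

    Assumption: the area beyond t + width is negligible, which is
    reasonable for moderate or large df but not for very small df.
    """
    h = width / steps  # steps must be even for Simpson's rule
    total = t_pdf(t, df) + t_pdf(t + width, df)
    for i in range(1, steps):
        # Simpson weights: 4 for odd-numbered points, 2 for even-numbered
        total += (4 if i % 2 else 2) * t_pdf(t + i * h, df)
    return total * h / 3

def p_value(t, df, tails=2):
    """P-value for a t-score; tails=1 uses the tail on the side of t."""
    return tails * t_upper_tail(abs(t), df)

# Body-temperature data: t = -5.453 with n = 130, so df = 129 (assumed)
p = p_value(-5.453, df=129)
print(p < 0.001)
```

With `tails=1`, the function returns the area in the tail on the same side as the observed \(t\)-score. This is the one-tailed \(P\)-value only when the sample mean lies in the direction given by the alternative hypothesis.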
31.6 Making decisions with \(P\)-values
As seen in Sect. 30.6, \(P\)-values measure the probability of observing the sample statistic (or something more extreme), assuming the population parameter is the value given in \(H_0\). For the body-temperature data then, where \(P < 0.001\), the \(P\)-value is very small, so very strong evidence exists that the population mean body temperature is not \(37.0^\circ\text{C}\).
31.7 Writing conclusions
Communicating the results of any hypothesis test requires an answer to the RQ, a summary of the evidence used to reach that conclusion (such as the \(t\)-score and \(P\)-value, stating if it is a one- or two-tailed \(P\)-value), and some sample summary information (including a CI). So for the body-temperature example, write:
The sample provides very strong evidence (\(t = -5.45\); two-tailed \(P<0.001\)) that the population mean body temperature is not \(37.0^\circ\text{C}\) (\(\bar{x} = 36.81\); \(n = 130\); \(95\)% CI from 36.73\(^\circ\)C to 36.88\(^\circ\)C).
This statement contains the three components:
- The answer to the RQ. The sample provides very strong evidence... that the population mean body temperature is not \(37.0^\circ\text{C}\). The alternative hypothesis is two-tailed, so the conclusion is worded in terms of the population mean body temperature not being \(37.0^\circ\text{C}\).
- The evidence used to reach the conclusion: \(t = -5.45\); two-tailed \(P < 0.001\).
- Some sample summary information (including a CI, using details in Chap. 24): \(\bar{x} = 36.81\); \(n = 130\); \(95\)% CI from \(36.73^\circ\)C to \(36.88^\circ\)C.
Since the null hypothesis is initially assumed to be true, the onus is on the evidence to refute the null hypothesis. Hence, conclusions are worded in terms of how strongly the evidence (i.e., sample data) supports the alternative hypothesis.
The alternative hypothesis may or may not be true... but the evidence (data) available here strongly supports the alternative hypothesis.
31.8 Process overview
Let's recap the decision-making process, in this context about body temperatures:
- Step 1: Assumption: Write the null hypothesis about the parameter (based on the RQ): \(H_0\): \(\mu = 37.0\). In addition, write the alternative hypothesis: \(H_1\): \(\mu \ne 37.0\). (This alternative hypothesis is two-tailed.)
- Step 2: Expectation: The sampling distribution describes what to expect from the statistic if the null hypothesis is true.
- Step 3: Observation: Compute the \(t\)-score: \(t = -5.45\). The \(t\)-score can be computed by software, or using the general equation in Eq. (31.2).
- Step 4: Decision: Determine if the data are consistent with the assumption, by computing the \(P\)-value. Here, the \(P\)-value is much smaller than \(0.001\). The \(P\)-value can be computed by software, or approximated using the \(68\)--\(95\)--\(99.7\) rule. The conclusion is that there is very strong evidence that \(\mu\) is not \(37.0\).
31.9 Statistical validity conditions
All hypothesis tests have underlying conditions to be met so that the results are statistically valid; that is, the \(P\)-values can be found accurately because the sampling distribution is an approximate normal distribution. For a hypothesis test for one mean, these conditions are the same as for the CI for one mean (Sect. 24.4).
Statistical validity can be assessed using these criteria:
- When \(n > 25\), the test is statistically valid provided the distribution of data is not highly skewed.
- When \(n \le 25\), the test is statistically valid only if the data come from a population with a normal distribution.
The sample size of \(25\) is a rough figure; some books give other values (such as \(30\)). Data with severe skewness or large outliers may need a larger sample size for the test to be statistically valid.
This condition ensures that the distribution of the sample means has an approximate normal distribution (so that, for example, the \(68\)--\(95\)--\(99.7\) rule can be used). Provided the sample size is larger than about \(25\), this will be approximately true even if the distribution of the individuals in the population does not have a normal distribution. That is, when \(n > 25\) the sample means generally have an approximate normal distribution, even if the data themselves do not have a normal distribution.
The units of analysis are also assumed to be independent (e.g., from a simple random sample).
If the statistical validity conditions are not met, other similar options include a sign test or a Wilcoxon signed-rank test (Conover 2003), or using resampling methods (Efron and Hastie 2021).
Example 31.1 (Statistical validity) The hypothesis test regarding body temperature is statistically valid since the sample size is larger than \(25\) (\(n = 130\)). (The data do not need to come from a population with a normal distribution.)
31.10 Example: student IQs
Standard IQ scores are designed to have a mean in the general population of \(100\). Reilly, Neumann, and Andrews (2022) studied \(n = 224\) students at Griffith University, and found the sample mean IQ was \(111.19\), with a standard deviation of \(14.21\). Is this evidence that students at Griffith University (GU) have a higher mean IQ than the general population?
The RQ is:
For students at Griffith University, is the mean IQ higher than \(100\)?
The parameter is \(\mu\), the population mean IQ for students at GU. The statistical hypotheses are: \[ \text{$H_0$: $\mu = 100 \qquad \text{and} \qquad H_1$: $\mu > 100$.} \] This test is one-tailed, since the RQ asks if the IQ of GU students is greater than \(100\). (Writing \(H_0\): \(\mu\le 100\) is also correct (and equivalent), though the test still proceeds as if \(\mu = 100\).)
We do not have the original data, but the summary data are sufficient: \(\bar{x} = 111.19\) with \(s = 14.21\) from a sample of size \(n = 224\). The sample mean is higher than \(100\), but we know sample means vary. If \(\mu\) really is \(100\), the sample means vary with an approximate normal distribution, with mean \(100\) and a standard deviation of \[ \text{s.e.}(\bar{x}) = \frac{s}{\sqrt{n}} = \frac{14.21}{\sqrt{224}} = 0.94945. \] The \(t\)-score is \[ t = \frac{\bar{x} - \mu}{\text{s.e.}(\bar{x})} = \frac{111.19 - 100}{0.94945} = 11.786. \]
This \(t\)-score is huge: a sample mean as large as \(111.19\) would be highly unlikely to occur simply by sampling variation in a sample of size \(n = 224\) if the population mean really was \(100\). Since the alternative hypothesis is one-tailed, and specifically asking if \(\mu > 100\), the \(P\)-value is the area in the right-side tail of the distribution (Fig. 31.7); it will be extremely small.
We conclude (where the CI is found using the ideas in Sect. 24.3):
Very strong evidence exists in the sample (\(t = 11.79\); one-tailed \(P < 0.001\)) that the population mean IQ in students at Griffith University is greater than \(100\) (mean \(111.19\); \(n = 224\); \(95\)% CI from \(109.29\) to \(113.09\)).
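The standard error, \(t\)-score and confidence interval for this example can be checked from the summary statistics alone. The Python sketch below is one way to do so; the variable names are illustrative, and the approximate \(95\)% CI uses a multiplier of \(2\) (an assumption based on the \(68\)--\(95\)--\(99.7\) rule).

```python
from math import sqrt

xbar = 111.19   # sample mean IQ
s = 14.21       # sample standard deviation
n = 224         # sample size
mu_0 = 100      # population mean assumed in the null hypothesis

se = s / sqrt(n)          # standard error of the sample mean
t = (xbar - mu_0) / se    # t-score, as in Eq. (31.2)

# Approximate 95% CI (assumption: multiplier of 2)
ci = (xbar - 2 * se, xbar + 2 * se)

print(f"s.e. = {se:.5f}")
print(f"t = {t:.3f}")
print(f"Approximate 95% CI: {ci[0]:.2f} to {ci[1]:.2f}")
```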
The test is about the mean IQ; individual students may have IQs less than \(100\).
Since the sample size is much larger than \(25\), this conclusion is statistically valid. The sample is not a true random sample from the population of all GU students (the students are mostly first-year students, and most were enrolled in an undergraduate psychological science degree). However, these students may be somewhat representative of all GU students; those in the sample are probably not that different to students not in the sample. That is, the sample may be externally valid.
The difference between the general population IQ of \(100\) and the sample mean IQ of GU students is only small: about \(11\) IQ units (less than one standard deviation). Possibly, this difference has very little practical importance, even though the statistical evidence suggests that the difference cannot be explained by chance.
IQ scores are designed to have a standard deviation of \(\sigma = 15\) in the general population. If we accept that this applies for university students too (we do not know if it does), the standard error is \(\text{s.e.} = \sigma/\sqrt{n} = 15/\sqrt{224} = 1.0022\), and the test statistic is a \(z\)-score: \[ z = \frac{\bar{x} - \mu}{\text{s.e.}(\bar{x})} = \frac{111.19 - 100}{1.0022} = 11.17; \] the conclusions do not change.
31.11 Chapter summary
To test a hypothesis about a population mean \(\mu\):
- Write the null hypothesis (\(H_0\)) and the alternative hypothesis (\(H_1\)).
- Initially assume the value of \(\mu\) in the null hypothesis to be true.
- Then, describe the sampling distribution, which describes what to expect from the sample mean based on this assumption: under certain statistical validity conditions, the sample mean varies with:
- an approximate normal distribution,
- with sampling mean whose value is the value of \(\mu\) (from \(H_0\)), and
- having a standard deviation of \(\displaystyle \text{s.e.}(\bar{x}) =\frac{s}{\sqrt{n}}\).
- Compute the value of the test statistic: \[ t = \frac{ \bar{x} - \mu}{\text{s.e.}(\bar{x})}, \] where \(\mu\) is the hypothesised value given in the null hypothesis.
- The \(t\)-value is like a \(z\)-score, and so an approximate \(P\)-value can be estimated using the \(68\)--\(95\)--\(99.7\) rule, or found using software.
31.12 Quick review questions
The usual engineering recommendation is that the safe gap between travelling vehicles in traffic (a 'headway') is at least \(1.9\) s (often conveniently rounded to \(2\) s). Majeed et al. (2014) studied \(n = 28\) streams of traffic in Birmingham, Alabama, and found the mean headway was \(1.1915\) s, with a standard deviation of \(0.231\) s. The researchers wanted to test if the mean headway in Birmingham was less than the recommended \(1.9\) s.
- True or false? The test is one-tailed.
- What is the standard error of the mean? (Use five decimal places.)
- What is the null hypothesis?
- What is the value of the test statistic? (Use two decimal places.)
- What is the value of the one-tailed \(P\)-value?
- True or false? There is no evidence to accept the alternative hypothesis (that the mean headway is less than \(1.9\) s).
31.13 Exercises
Answers to odd-numbered exercises are available in App. E.
Exercise 31.1 Azwari and Hamsa (2021) studied driving speeds in Malaysia, and recorded the speeds of vehicles on various roads. One RQ is whether the mean speed of cars on one road was the posted speed limit of \(90\) km.h\(^{-1}\), or whether it was higher.
The researchers recorded the speed of \(n = 400\) vehicles on this road, and found the mean and standard deviation of the speeds of individual vehicles were \(\bar{x} = 96.56\) and \(s = 13.874\).
- Define the parameter of interest.
- Write the statistical hypotheses.
- Compute the standard error of the sample mean.
- Sketch the sampling distribution of the sample mean.
- Compute the test statistic, a \(t\)-score.
- Determine the \(P\)-value.
- Write a conclusion.
- Is the test statistically valid?
Exercise 31.2 Most dental associations recommend brushing teeth for at least two minutes. I. D. M. Macgregor and Rugg-Gunn (1979) studied the brushing time for \(85\) uninstructed school children from England (\(11\) to \(13\) years old) and found the mean brushing time was \(60.3\) s, with a standard deviation of \(23.8\) s.
- Define the parameter of interest.
- Write the statistical hypotheses.
- Compute the standard error of the sample mean.
- Sketch the sampling distribution of the sample mean.
- Compute the test statistic, a \(t\)-score.
- Determine the \(P\)-value.
- Write a conclusion.
- Is the test statistically valid?
Exercise 31.3 Greenlee, DeLucia, and Newton (2018) conducted a study of human--automation interaction with automated vehicles. They were interested in whether the average mental demand of 'drivers' of automated vehicles was higher than the average mental demand for ordinary tasks.
In the study, the \(n = 22\) participants 'drove' (in a simulator) an automated vehicle for \(40\) mins. While driving, the drivers monitored the road for hazards. The researchers assessed the 'mental demand' placed on these drivers, where scores over \(50\) 'typically indicate substantial levels of workload' (p. 471). For the sample, the mean score was \(84.00\) with a standard deviation of \(22.05\).
Is there evidence of a 'substantial workload' associated with monitoring roadways while 'driving' automated vehicles?
Exercise 31.4 Health departments recommend that hot water be stored at \(60^\circ\)C or higher, to kill legionella bacteria (for example, Health and Safety Executive, UK). Alary and Joly (1991) studied \(n = 178\) Quebec homes with electric water heaters to see if the water temperature was less than \(60^\circ\)C (i.e., at risk).
They found the mean temperature was \(56.6^\circ\)C, with a standard error of \(0.4^\circ\)C. Is there evidence the mean water temperature in Quebec is too low?
Exercise 31.5 A Cherry Ripe is a popular chocolate bar in Australia. In 2017, 2018 and 2019, I sampled some Cherry Ripe Fun Size bars. The packaging claimed that the Fun Size bars weigh \(14\) g (on average). Use the jamovi summary of the data (Fig. 31.8) to perform a hypothesis test to determine if the mean weight really is \(14\) g or not.
Exercise 31.6 (This study was also seen in Exercise 24.6.) B. Williams and Boyle (2007) asked \(n = 199\) paramedics to estimate the amount of blood on four different surfaces. When the actual amount of blood spilt on concrete was \(1000\) ml, the mean guess was \(846.4\) ml (with a standard deviation of \(651.1\) ml).
Is there evidence that the mean guess really is \(1000\) ml (the true amount)? Is this test statistically valid?
Exercise 31.7 Lin et al. (2021) compared the average sleep times of Taiwanese pre-school children to the recommendation (of at least \(10\) hours per night). The summary of the data for weekend sleep-times is shown in Table 31.1, for both boys and girls. On average, do girls get at least \(10\) hours of sleep per night? Do boys?
| | Sample size | Sample mean (hours) | Sample std. dev. (hours) |
|---|---|---|---|
| Boys | \(47\) | \(8.50\) | \(0.48\) |
| Girls | \(39\) | \(8.64\) | \(0.37\) |
Exercise 31.8 [Dataset: BloodLoss]
A quality-control study (Feng, Huang, and Ma 2017) assessed the accuracy of two instruments from a clinical laboratory, by comparing the reported luteotropic hormone (LH) concentrations to known pre-determined values (Table 31.2).
Perform a series of tests to determine how well the two instruments perform, for both high- and mid-level LH concentrations, using the data in Table 31.2.
| | Instrument 1: high level | Instrument 1: mid level | Instrument 2: high level | Instrument 2: mid level |
|---|---|---|---|---|
| Mean of data | \(64.310\) | \(19.240\) | \(64.970\) | \(19.400\) |
| Std. dev. of data | \(\phantom{0}1.700\) | \(\phantom{0}0.588\) | \(\phantom{0}1.029\) | \(\phantom{0}0.413\) |
| Pre-determined target | \(64.220\) | \(19.010\) | \(65.050\) | \(19.450\) |
Exercise 31.9 (This study was also seen in Exercise 24.10.) In 2011, Eagle Boys' Pizza ran a campaign that claimed that Eagle Boys' pizzas were 'Real size \(12\)-inch large pizzas' (P. K. Dunn 2012). Eagle Boys' made the data from the campaign publicly available.
A summary of the diameters of a sample of \(125\) of their large pizzas is shown in Fig. 31.9. We would like to test the company's claim, and ask the RQ:
For Eagle Boys' pizzas, is the mean diameter actually \(12\) inches, or not?
- What is the parameter of interest?
- Write down the values of \(\bar{x}\) and \(s\).
- Determine the value of the standard error of the mean.
- Write the hypotheses to test if the mean pizza diameter is \(12\) inches.
- Is the alternative hypothesis one- or two-tailed? Why?
- Draw the normal distribution that shows how the sample mean pizza diameter would vary by chance, even if the population mean diameter was 12 inches.
- Compute the \(t\)-score for testing the hypotheses.
- What is the approximate \(P\)-value using the \(68\)--\(95\)--\(99.7\) rule?
- Write a conclusion. (The CI was found in Exercise 24.10.)
- Is it reasonable to assume the statistical validity conditions are satisfied?
- Do you think that the pizzas do have a mean diameter of 12 inches in the population, as Eagle Boys' claim? Explain.