28 Tests for one mean

So far, you have learnt to ask a RQ, identify different ways of obtaining data, design the study, collect the data, describe the data, summarise data graphically and numerically, understand the tools of inference, and form confidence intervals.

In this chapter, you will learn about hypothesis tests for one mean. You will learn to:

  • conduct hypothesis tests for one sample mean, using a \(t\)-test.
  • determine whether the conditions for using these methods apply in a given situation.

28.1 Introduction: Body temperatures

The average internal body temperature is commonly believed to be \(\mu= 37.0^\circ\text{C}\), a guideline based on data over 150 years old.437 More recently, researchers wanted to re-examine this claim438 to see if this benchmark is still appropriate.

In this example, a decision is sought about the value of the population mean body temperature \(\mu\). The value of \(\mu\) will never be known: the internal body temperature of every person alive would need to be measured... and even those not yet born.

The parameter is \(\mu\), the population mean internal body temperature.

However, a sample of people can be taken to determine whether or not there is evidence that the population mean internal body temperature is still \(37.0^\circ\text{C}\).

To make this decision, the decision-making process (Sect. 15.3) is used. Begin by assuming that \(\mu=37.0^\circ\text{C}\) (as there is no evidence that this accepted standard is wrong), and then determine if the evidence supports this claim or not. The RQ could be stated as:

Is the population mean internal body temperature \(37.0^\circ\text{C}\)?

28.2 Statistical hypotheses and notation: One mean

The decision making process begins by assuming that the population mean internal body temperature is \(37.0^\circ\text{C}\).

The sample mean \(\bar{x}\) is likely to be different for every sample (sampling variation). The sampling distribution of \(\bar{x}\) describes how the value of \(\bar{x}\) varies from sample to sample. Because \(\bar{x}\) varies, the sample mean \(\bar{x}\) probably won't be exactly \(37.0^\circ\text{C}\), even if \(\mu\) is \(37.0^\circ\text{C}\).

If \(\bar{x}\) is not \(37.0^\circ\text{C}\), two broad reasons could explain why:

  1. The population mean body temperature is \(37.0^\circ\text{C}\), but \(\bar{x}\) isn't exactly \(37.0^\circ\text{C}\) due to sampling variation (that is, the sample mean varies and is likely to be different in every sample); or
  2. The population mean body temperature is not \(37.0^\circ\text{C}\), and the sample mean body temperature reflects this.

These two possible explanations are called statistical hypotheses. More formally, the two statistical hypotheses above are:

  1. The null hypothesis (\(H_0\)): \(\mu=37.0^\circ\text{C}\); the population mean body temperature is \(37.0^\circ\text{C}\); and
  2. The alternative hypothesis (\(H_1\)): \(\mu \ne 37.0^\circ\text{C}\); the population mean body temperature is not \(37.0^\circ\text{C}\).

Since the null hypothesis is assumed true, the evidence is evaluated to determine whether the data support this assumption, or not.

Note that the alternative hypothesis asks if \(\mu\) is \(37.0^\circ\text{C}\) or not: the value of \(\mu\) may be smaller or larger than \(37.0^\circ\text{C}\). Two possibilities are considered: for this reason, this alternative hypothesis is called a two-tailed alternative hypothesis.

28.3 Sampling distribution: One mean

A RQ is answered using data (this is partly what is meant by evidence-based research). Fortunately, for the body-temperature study, data are available from a comprehensive American study.439

Summarising the data is important, because the data are the means by which the RQ is answered (data below).

A graphical summary (Fig. 28.1) shows that the internal body temperature of individuals varies from person to person: this is natural variation. A numerical summary (from software) shows that:

  • The sample mean is \(\bar{x} = 36.8051^\circ\)C;
  • The sample standard deviation is \(s = 0.40732^\circ\)C;
  • The sample size is \(n=130\).

The sample mean is less than the assumed value of \(\mu = 37^\circ\text{C}\)... The question is why: can the difference reasonably be explained by sampling variation, or not?

A 95% CI can also be computed (using software or manually): the 95% CI for \(\mu\) is from \(36.73^\circ\) to \(36.88^\circ\)C. This CI is narrow, implying that \(\mu\) has been estimated with precision, so detecting even small deviations of \(\mu\) from \(37^\circ\) should be possible.
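As a rough check, this CI can be reproduced from the summary statistics. The sketch below assumes the approximate 95% multiplier of 2 from the 68--95--99.7 rule (software uses a slightly different multiplier), and the variable names are illustrative only.

```python
import math

# Summary statistics reported for the body-temperature sample
x_bar = 36.8051   # sample mean (degrees C)
s = 0.40732       # sample standard deviation (degrees C)
n = 130           # sample size

# Standard error of the sample mean
se = s / math.sqrt(n)

# Approximate 95% CI, assuming a multiplier of 2 (68-95-99.7 rule)
lower = x_bar - 2 * se
upper = x_bar + 2 * se

print(f"Approximate 95% CI: {lower:.2f} to {upper:.2f} degrees C")
```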


FIGURE 28.1: The histogram of the body temperature data

The decision-making process assumes that the population mean temperature is \(\mu=37.0^\circ\text{C}\), as stated in the null hypothesis. Because of sampling variation, the value of \(\bar{x}\) sometimes would be smaller than \(37.0^\circ\text{C}\) and sometimes greater than \(37.0^\circ\text{C}\).

How much variation in the value of \(\bar{x}\) could be expected, simply due to sampling variation, when \(\mu = 37.0^\circ\text{C}\)? This variation is described by the sampling distribution.

The sampling distribution of \(\bar{x}\) was discussed in Sect. 22.2 (and Def. 22.1 specifically). From this, if \(\mu\) really was \(37.0^\circ\)C and if certain conditions are true, the possible values of the sample means can be described using:

  • An approximate normal distribution;
  • With mean \(37.0^\circ\text{C}\) (from \(H_0\));
  • With standard deviation of \(\displaystyle \text{s.e.}(\bar{x}) = \frac{s}{\sqrt{n}} = \frac{0.40732}{\sqrt{130}} = 0.035724\). This is the standard error of the sample means.
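The standard error above can be checked numerically. The sketch below computes \(\text{s.e.}(\bar{x})\) from the summary statistics and then, purely as an illustration, simulates many samples of size \(n = 130\) from a hypothetical normal population with mean \(37.0\) and standard deviation equal to the sample value \(s\) (an assumption, since the population standard deviation is unknown). The standard deviation of the simulated sample means should be close to the standard error.

```python
import math
import random
import statistics

s = 0.40732   # sample standard deviation (degrees C)
n = 130       # sample size
mu_0 = 37.0   # population mean assumed in the null hypothesis

# Standard error of the sample mean: s / sqrt(n)
se = s / math.sqrt(n)
print(f"s.e. = {se:.6f}")

# Illustration only: assume a normal population with mean mu_0 and
# standard deviation s, and draw many samples of size n from it.
random.seed(1)  # fixed seed so the run is repeatable
sample_means = []
for _ in range(2000):
    sample = [random.gauss(mu_0, s) for _ in range(n)]
    sample_means.append(statistics.mean(sample))

# The sample means should vary around mu_0 with a spread close to se
print(f"Mean of sample means: {statistics.mean(sample_means):.3f}")
print(f"Std dev of sample means: {statistics.stdev(sample_means):.4f}")
```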

A picture of this sampling distribution (Fig. 28.2) shows how the sample mean varies when \(n = 130\), simply due to sampling variation, when \(\mu = 37^\circ\text{C}\). This enables questions to be asked about the likely values of \(\bar{x}\) that would be found in the sample, when the population mean is \(\mu = 37^\circ\text{C}\).


FIGURE 28.2: The distribution of sample mean body temperatures, if the population mean is \(37^\circ\)C and \(n=130\). The grey vertical lines are 1, 2 and 3 standard deviations from the mean.

Given the sampling distribution shown in Fig. 28.2, use the 68--95--99.7 rule to determine how often \(\bar{x}\) will be larger than \(37.036^\circ\)C just because of sampling variation, if \(\mu\) really is \(37^\circ\)C.

About 16% of the time.
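This answer can be checked with a short calculation. The sketch below uses Python's `statistics.NormalDist` to model the sampling distribution as normal with mean \(37.0\) and standard deviation \(0.035724\), and then finds the proportion of sample means above \(37.036\) (about one standard error above the mean).

```python
from statistics import NormalDist

# Sampling distribution of the sample mean under H0 (approximately normal)
sampling_dist = NormalDist(mu=37.0, sigma=0.035724)

# Proportion of sample means larger than 37.036
prob_above = 1 - sampling_dist.cdf(37.036)

print(f"P(sample mean > 37.036) = {prob_above:.3f}")
```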

28.4 The test statistic and \(t\)-scores: One mean

The sampling distribution describes what to expect from the sample mean, assuming \(\mu = 37.0^\circ\text{C}\). The value of \(\bar{x}\) that is observed, however, is \(\bar{x} = 36.8051^\circ\text{C}\). How likely is it that such a value could occur by chance?

The value of the observed sample mean can be located on the picture of the sampling distribution (Fig. 28.3). The value \(\bar{x} = 36.8051^\circ\text{C}\) is unusually small. About how many standard deviations is \(\bar{x}\) away from \(\mu = 37\)? A lot...

FIGURE 28.3: The sample mean of \(\bar{x}=36.8051^\circ\)C is very unlikely to have been observed if the population mean really was \(37^\circ\)C, and \(n=130\)

Relatively speaking, the distance that the observed sample mean (of \(\bar{x} = 36.8051\)) is from the mean of the sampling distribution (Fig. 28.3) is found by computing how many standard deviations the value of \(\bar{x}\) is from the mean of the distribution; that is, computing something like a \(z\)-score. (Remember that the standard deviation in Fig. 28.3 is the standard error: the amount of variation in the sample means.)

Since the mean and standard deviation (i.e., the standard error) of this normal distribution are known, the number of standard deviations that \(\bar{x} = 36.8051\) is from the mean is

\[ \frac{36.8051 - 37.0}{0.035724} = -5.453. \] This value is like a \(z\)-score. However, this is actually called a \(t\)-score because it has been computed when the population standard deviation is unknown, and the best estimate (the sample standard deviation) is used when \(\text{s.e.}(\bar{x})\) was computed.

Both \(t\) and \(z\) scores measure the number of standard deviations that an observation is from the mean: \(z\)-scores use \(\sigma\) and \(t\)-scores use \(s\). Here, the distribution of the sample statistic is relevant, so the appropriate standard deviation is the standard deviation of the sampling distribution: the standard error.

Like \(z\)-scores, \(t\)-scores measure the number of standard deviations that an observation is from the mean. \(z\)-scores are calculated using the population standard deviation, and \(t\)-scores are calculated using the sample standard deviation.

In hypothesis testing, \(t\)-scores are more commonly used than \(z\)-scores, because almost always the population standard deviation is unknown, and the sample standard deviation is used instead.

In this book, it is sufficient to think of \(z\)-scores and \(t\)-scores as approximately the same. Unless sample sizes are small, this is a reasonable approximation.

So the calculation is:

\[ t = \frac{36.8051 - 37.0}{0.035724} = -5.453; \] the observed sample mean is more than five standard deviations below the population mean. This is highly unusual based on the 68--95--99.7 rule, as seen in Fig. 28.3.

In general, a \(t\)-score in hypothesis testing is

\[\begin{equation} t = \frac{\text{sample statistic} - \text{assumed population parameter}} {\text{standard error of the sample statistic}}. \tag{28.1} \end{equation}\]
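Equation (28.1) translates directly into a few lines of code. In the sketch below, the function name `t_score` and its arguments are illustrative (they are not from any particular package). It is applied to the body-temperature summary values; because these values are rounded, the result may differ slightly in the last decimal place from the value reported by software.

```python
def t_score(sample_statistic, assumed_parameter, standard_error):
    """Number of standard errors the statistic is from the assumed value."""
    return (sample_statistic - assumed_parameter) / standard_error


# Body-temperature example: sample mean, value of mu in H0, and s.e.
t = t_score(36.8051, 37.0, 0.035724)
print(f"t = {t:.2f}")
```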

28.5 \(P\)-values: One mean

This is the decision-making process so far:

  1. Assume that the population mean is \(37.0^\circ\text{C}\) (this is \(H_0\)).
  2. Based on this assumption, describe what to expect from the sample means (Fig. 28.2).
  3. The observed statistic is computed, relative to what is expected using a \(t\)-score (Fig. 28.3): \(t = -5.453\).

The value of the \(t\)-score shows that the value of \(\bar{x}\) is highly unusual. How unusual can be assessed more precisely using a \(P\)-value, which is used widely in scientific research. The \(P\)-value is a way of measuring how unusual an observation is (if \(H_0\) is true).

\(P\) values can be approximated using the 68--95--99.7 rule and a diagram (Sect. 28.5.1), but more commonly by using software (Sect. 28.5.2).

28.5.1 Approximating \(P\)-values using the 68--95--99.7 rule

The \(P\)-value is the area more extreme than the calculated \(t\)-score. For example:

  • If the calculated \(t\)-score was \(t = -1\), the two-tailed \(P\)-value would be the shaded area in Fig. 28.4 (top panel): About 32%, based on the 68--95--99.7 rule. Because the alternative hypothesis is two-tailed, both sides of the mean are considered: the \(P\)-value would be the same if \(t = +1\).

  • If the calculated \(t\)-score was \(t = -2\), the two-tailed \(P\)-value would be the shaded area shown in Fig. 28.4 (bottom panel): About 5%, based on the 68--95--99.7 rule. Because the alternative hypothesis is two-tailed, both sides of the mean are considered: the \(P\)-value would be the same if \(t = +2\).

Clearly, from what the \(P\)-value means, a \(P\)-value is always between 0 and 1.


FIGURE 28.4: Computing \(P\)-values for the body temperature data

In practice, \(t\)-scores are rarely exactly 1 or 2.

Consider when the \(t\)-score is a little larger than \(t = 1\), say \(t = 1.2\). Then the tail area will be a little smaller than the tail area when \(t = 1\) (Fig. 28.5), so the two-tailed \(P\)-value will be a little smaller than 0.32.

Similarly, when the \(t\)-score is a little smaller than \(t = 2\), say \(t = 1.9\), the tail area will be a little larger than the tail area when \(t = 2\) (Fig. 28.5), so the two-tailed \(P\)-value will be a little larger than 0.05.


FIGURE 28.5: Computing \(P\)-values for the body temperature data
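These approximate areas can be checked against a standard normal distribution (treating the \(t\)-score like a \(z\)-score, as this book does). The helper name `two_tailed_p` below is illustrative only.

```python
from statistics import NormalDist


def two_tailed_p(t):
    """Two-tailed area beyond +/- t, using a standard normal distribution."""
    return 2 * (1 - NormalDist().cdf(abs(t)))


# Compare t-scores of exactly 1 and 2 with nearby values
for t in (1, 1.2, 1.9, 2):
    print(f"t = {t}: two-tailed P is about {two_tailed_p(t):.3f}")
```

The areas for \(t = 1.2\) and \(t = 1.9\) should fall between those for \(t = 1\) and \(t = 2\), in line with the reasoning above.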

What do you think the \(P\)-value will be for \(t = -5.45\) (using Fig. 28.3)?

Based on the 68--95--99.7 rule, the \(P\)-value will be extremely small.

28.5.2 Finding \(P\)-values using software

Software computes the \(t\)-score and a precise \(P\)-value (jamovi: Fig. 28.6; SPSS: Fig. 28.7). The output (in jamovi, under the heading p; in SPSS, under the heading Sig. (2-tailed)) shows that the \(P\)-value is indeed very small.

Although SPSS reports the \(P\)-value as 0.000, \(P\)-values can never be exactly zero, so we interpret this as 'zero to three decimal places', or that \(P\) is less than 0.001 (written as \(P < 0.001\), as jamovi reports).

When software reports a \(P\)-value of 0.000, it really means (and we should write) \(P < 0.001\): That is, the \(P\)-value is smaller than 0.001.

This \(P\)-value means that, assuming \(\mu = 37.0^\circ\)C, observing a sample mean as low as \(36.8051^\circ\)C just through sampling variation (from a sample size of \(n = 130\)) is almost impossible. And yet, we did...

Using the decision-making process, this implies that the initial assumption (the null hypothesis) is contradicted by the data: The evidence suggests that the population mean body temperature is not \(37.0^\circ\text{C}\).
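To get a sense of how small this \(P\)-value is, the tail area can be approximated with a normal distribution. This is only an approximation: software such as jamovi and SPSS uses a \(t\)-distribution, so the exact value differs, though it is still far below 0.001.

```python
from statistics import NormalDist

t = -5.453  # t-score for the body-temperature data

# Two-tailed area beyond +/- |t| under a standard normal distribution
# (a normal approximation to the t-distribution used by software)
p_approx = 2 * NormalDist().cdf(-abs(t))

print(f"Approximate two-tailed P-value: {p_approx:.1e}")
print("Report as P < 0.001" if p_approx < 0.001 else f"P = {p_approx:.3f}")
```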


FIGURE 28.6: jamovi output for conducting the \(t\)-test for the body temperature data


FIGURE 28.7: SPSS output for conducting the \(t\)-test for the body temperature data

SPSS always produces two-tailed \(P\)-values, calls them Significance values, and labels them as Sig.

jamovi can produce one- or two-tailed \(P\)-values.

28.6 Making decisions with \(P\)-values

\(P\)-values tell us the likelihood of observing the sample statistic (or something more extreme), based on the assumption about the population parameter being true.

In this context, the \(P\)-value tells us the likelihood of observing the value of \(\bar{x}\) (or something more extreme), just through sampling variation (chance) if \(\mu = 37\).

The \(P\)-value is a probability, albeit a probability of something quite specific, so it is a value between 0 and 1. Then:

  • 'Big' \(P\)-values mean that the sample statistic (i.e., \(\bar{x}\)) could reasonably have occurred through sampling variation, if the assumption about the parameter (stated in \(H_0\)) was true (Fig. 28.8, top panel). The data do not contradict the assumption (\(H_0\)).

  • 'Small' \(P\)-values mean that the sample statistic (i.e., \(\bar{x}\)) is unlikely to have occurred through sampling variation, if the assumption about the parameter (stated in \(H_0\)) was true (Fig. 28.8, bottom panel). The data contradict the assumption.

What is meant by 'small' and 'big'? It is arbitrary: no definitive rules exist. Commonly, a \(P\)-value smaller than 1% (that is, smaller than 0.01) is considered 'small', and a \(P\)-value larger than 10% (that is, larger than 0.10) is considered 'big'. \(P\)-values between 1% and 10% are often a 'grey area'.


FIGURE 28.8: A picture of large (top) and small (bottom) \(P\)-value situations

Traditionally, a \(P\)-value is 'small' if it is less than 5% (less than 0.05), and 'big' if greater than 5% (greater than 0.05).

However, again this is arbitrary, and binary decision making (either big or small) is unreasonable.

More reasonably, \(P\)-values should be interpreted as providing varying strength of evidence in support of the alternative hypothesis \(H_1\) (Table 29.1).

These are not definitive, but are only guidelines. Of course, conclusions should be written in the context of the problem.

For one-tailed tests, the \(P\)-value is half the value of the two-tailed \(P\)-value.

SPSS always produces two-tailed \(P\)-values, usually calls them 'Significance values', and labels them as Sig., and sometimes explicitly notes that they are two-tailed.

For the body-temperature data then, where \(P < 0.001\), the \(P\)-value is very small, so there is very strong evidence that the population mean body temperature is not \(37.0^\circ\text{C}\).

28.7 Communicating results: One mean

In general, to communicate the results of any hypothesis test, report:

  • An answer to the RQ;
  • The evidence used to reach that conclusion (such as the \(t\)-score and \(P\)-value---including if it is a one- or two-tailed \(P\)-value); and
  • Some sample summary information, including a CI, summarising the data used to make the decision.

So write:

The sample provides very strong evidence (\(t = -5.45\); two-tailed \(P<0.001\)) that the population mean body temperature is not \(37.0^\circ\text{C}\) (\(\bar{x} = 36.81\); \(n = 130\); 95% CI from 36.73\(^\circ\)C to 36.88\(^\circ\)C).

The components are:

  • The answer to the RQ: 'The sample provides very strong evidence... that the population mean body temperature is not \(37.0^\circ\text{C}\)'.
  • The evidence used to reach the conclusion: '\(t = -5.45\); two-tailed \(P<0.001\)'.
  • Some sample summary information (including a CI): '\(\bar{x} = 36.81\); \(n=130\); 95% CI from 36.73\(^\circ\)C to 36.88\(^\circ\)C'.

Notice how the conclusion is worded: There is evidence to support the alternative hypothesis. In fact, the alternative hypothesis may or may not be true... but the evidence (data) available supports the alternative hypothesis.

28.8 Hypothesis testing for one mean: A summary

Let's recap the decision-making process seen earlier, in this context about body temperatures:

  • Step 1: Assumption: Write the null hypothesis about the parameter (based on the RQ): \(H_0\): \(\mu = 37.0^\circ\text{C}\). In addition, write the alternative hypothesis \(H_1\): \(\mu \ne 37.0^\circ\text{C}\). (This alternative hypothesis is two-tailed.)
  • Step 2: Expectation: The sampling distribution describes what to expect from the sample statistic if the null hypothesis is true: under certain circumstances, the sample means will vary with an approximate normal distribution around a mean of \(\mu = 37.0^\circ\text{C}\) with a standard deviation of \(\text{s.e.}(\bar{x}) = 0.03572\) (Fig. 28.3).
  • Step 3: Observation: Compute the \(t\)-score: \(t = -5.45\). The \(t\)-score can be computed by software, or using the general equation (28.1).
  • Step 4: Consistency?: Determine if the data are consistent with the assumption, by computing the \(P\)-value. Here, the \(P\)-value is much smaller than 0.001. The \(P\)-value can be computed by software, or approximated using the 68--95--99.7 rule.

The conclusion is that there is very strong evidence that \(\mu\) is not \(37.0^\circ\text{C}\).

Example 28.1 (Mean driving speeds) A study of driving speeds in Malaysia440 recorded the speeds of vehicles on various roads.

One RQ of interest was whether the mean speed of cars on one road was the posted speed limit of 90 km.h\(^{-1}\), or whether it was higher. The parameter of interest is \(\mu\), the mean speed in the population.

The statistical hypotheses are:

  • \(H_0\): \(\mu = 90\); and
  • \(H_1\): \(\mu > 90\) (since the researchers were interested in whether the mean speed was higher than the posted speed limit).

The researchers recorded the speed of \(n = 400\) vehicles on this road, and found \(\bar{x} = 96.56\), but this value is likely to vary from sample to sample. The sample standard deviation was \(s = 13.874\), so that

\[ \text{s.e.}(\bar{x}) = \frac{s}{\sqrt{n}} = \frac{13.874}{\sqrt{400}} = 0.6937. \] Hence, the test statistic is

\[ t = \frac{\bar{x} - \mu}{\text{s.e.}(\bar{x})} = \frac{96.56 - 90}{0.6937} = 9.46, \] where (as usual) the value of \(\mu\) is taken from the null hypothesis (which we always assume to be true).

This is a huge value, suggesting that the (one-tailed) \(P\)-value is very small.

We write (remembering the alternative hypothesis is one-tailed):

There is very strong evidence (\(t = 9.46\); one-tailed \(P<0.001\)) that the mean speed of vehicles on this road (sample mean: 96.56; standard deviation: 13.874) is greater than 90 km.h\(^{-1}\).

Of course, this statement refers to the mean speed; there may be individual vehicles travelling below the speed limit.
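The arithmetic in this example can be reproduced from the summary statistics alone. The sketch below is a minimal illustration; the variable names are not from the original study.

```python
import math

n = 400        # number of vehicles recorded
x_bar = 96.56  # sample mean speed (km/h)
s = 13.874     # sample standard deviation (km/h)
mu_0 = 90      # posted speed limit: the value of mu in H0

# Standard error of the sample mean, then the t-score
se = s / math.sqrt(n)
t = (x_bar - mu_0) / se

print(f"s.e. = {se:.4f}, t = {t:.2f}")
```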

28.9 Statistical validity conditions: One mean

As with any inference procedure, the underlying mathematics requires certain conditions to be met so that the results are statistically valid. For a hypothesis test for one mean, these conditions are the same as for the CI for one mean (Sect. 22.4).

The test will be statistically valid if one of these is true:

  1. The sample size is at least 25, or
  2. The sample size is smaller than 25 and the population data has an approximate normal distribution.

The sample size of 25 is a rough figure here, and some books give other values (such as 30).

This condition ensures that the distribution of the sample means has an approximate normal distribution so that the 68--95--99.7 rule can be used. Provided the sample size is larger than about 25, this will be approximately true even if the distribution of the individuals in the population does not have a normal distribution. That is, when \(n > 25\) the sample means generally have an approximate normal distribution, even if the data themselves don't have a normal distribution.
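The validity condition can be written as a simple rule. The sketch below is illustrative only: the function name and arguments are hypothetical, the threshold of 25 is the rough figure used in this book, and whether a population is 'approximately normal' remains a judgement made by the analyst.

```python
def one_mean_test_is_valid(n, population_approx_normal=False, threshold=25):
    """Rough statistical-validity check for a one-sample t-test.

    Valid if n is at least the threshold (about 25 in this book), or if
    n is smaller but the population is approximately normal.
    """
    return n >= threshold or population_approx_normal


print(one_mean_test_is_valid(130))                               # large sample
print(one_mean_test_is_valid(17))                                # small sample
print(one_mean_test_is_valid(17, population_approx_normal=True))
```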

In addition to the statistical validity condition, the test will be

Example 28.2 (Statistical validity) The hypothesis test regarding body temperature is statistically valid since the sample size is large (\(n = 130\)).

Since the sample size is large, we do not require the data to come from a population with a normal distribution.

Suppose we are performing a one-sample \(t\)-test about a mean. The random sample is of size \(n = 45\), and the histogram of the data is skewed right.

Is the test likely to be statistically valid?

Example 28.3 (Driving speeds) In Example 28.1 about mean driving speeds, the sample size was 400, much larger than 25.

The test will be statistically valid.

28.10 Example: Recovery times

Seventeen patients were treated for medial collateral ligament (MCL) and anterior cruciate ligament (ACL) tears using a new treatment method.441

The current existing treatment has an average recovery time of 15 days. The RQ is:

For patients with this type of injury, does the new treatment method lead to shorter mean recovery times?

The parameter is \(\mu\), the population mean recovery time.

The statistical hypotheses (Step 1: Assumption) about the parameter are, from the RQ:

  • \(H_0\): \(\mu = 15\) (the population mean is \(15\), but \(\bar{x}\) is not \(15\) due to sampling variation): This is the initial assumption.
  • \(H_1\): \(\mu < 15\) (\(\mu\) is less than \(15\); the new treatment really does produce shorter recovery times, on average).

This test is one-tailed: the RQ only asks if the new method produces shorter recovery times.

The evidence (Table 28.1) can be summarised numerically, using software or (since the data set is small) a calculator. Either way, \(\bar{x} = 13.29\) and \(s = 8.887\).

TABLE 28.1: The recovery times (in days) for a new treatment
14 18 12 10 8 28 24 3 9
9 26 0 4 21 24 2 14

If the null hypothesis is true (and \(\mu = 15\)), the values of the sample mean that are likely to occur through sampling variation can be described (Step 2: Expectation).

The sample means are likely to vary with an approximate normal distribution (under certain assumptions), with mean \(\mu=15\) and a standard deviation of

\[ \text{s.e.}(\bar{x}) = \frac{s}{\sqrt{n}} = \frac{8.887}{\sqrt{17}} = 2.155. \] This describes what values of \(\bar{x}\) we should expect in the sample if the mean recovery time \(\mu\) really was 15 days (Fig. 28.11).

The sample mean is \(\bar{x} = 13.29\), so the \(t\)-score to determine where the sample mean is located (Step 3: Observation), relative to what is expected, is

\[ t = \frac{\bar{x} - \mu}{\text{s.e.}(\bar{x})} = \frac{13.29 - 15}{2.155} = -0.79. \] Software could also be used (jamovi: Fig. 28.9; SPSS: Fig. 28.10); in either case, \(t = -0.79\).

A \(z\)-score of \(-0.79\) is not unusual, and (since \(t\)-scores are like \(z\)-scores) this \(t\)-score is not unusual either (Step 4: Consistency). The \(P\)-value in the jamovi output or SPSS output confirms this: the two-tailed \(P\)-value is \(0.440\), so the one-tailed \(P\)-value is \(0.440\div 2 = 0.220\).
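Since the data set is small, the whole calculation can be reproduced from the values in Table 28.1. The sketch below uses only the Python standard library. The helper functions `t_density` and `one_tailed_p` are illustrative: they approximate the tail area of a \(t\)-distribution with \(n - 1 = 16\) degrees of freedom by numerical integration (Simpson's rule), as a stand-in for the software output.

```python
import math
import statistics

# Recovery times (days) from Table 28.1
times = [14, 18, 12, 10, 8, 28, 24, 3, 9,
         9, 26, 0, 4, 21, 24, 2, 14]
mu_0 = 15  # mean recovery time assumed in H0

n = len(times)
x_bar = statistics.mean(times)
s = statistics.stdev(times)   # sample standard deviation
se = s / math.sqrt(n)         # standard error of the sample mean
t = (x_bar - mu_0) / se       # t-score


def t_density(x, df):
    """Probability density of the t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def one_tailed_p(t_value, df, steps=2000):
    """Tail area beyond |t_value|, using Simpson's rule on [0, |t_value|]."""
    b = abs(t_value)
    h = b / steps  # steps must be even for Simpson's rule
    total = t_density(0, df) + t_density(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    return 0.5 - total * h / 3


p_one = one_tailed_p(t, n - 1)
print(f"mean = {x_bar:.2f}, s = {s:.3f}, s.e. = {se:.3f}, t = {t:.2f}")
print(f"one-tailed P = {p_one:.3f}, two-tailed P = {2 * p_one:.3f}")
```

Because the observed \(t\)-score is negative and \(H_1\) is \(\mu < 15\), the one-tailed \(P\)-value is the area in the lower tail, which equals the area beyond \(|t|\) computed here.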


FIGURE 28.9: jamovi output for the \(t\)-test for the recovery-times data


FIGURE 28.10: SPSS output for the \(t\)-test for the recovery-times data

Recall: For one-tailed tests, the \(P\)-value is half the value of the two-tailed \(P\)-value.

This 'large' \(P\)-value suggests that a sample mean of 13.29 could reasonably have been observed just through sampling variation: there is no evidence to support the alternative hypothesis \(H_1\).

If \(\mu\) really was 15, then about 22% of the time \(\bar{x}\) would be less than 13.29 through sampling variation alone.


FIGURE 28.11: The sampling distribution for the recovery-times data

To summarise:

  • Step 1 (Assumption):
    • \(H_0\): \(\mu=15\), the initial assumption;
    • \(H_1\): \(\mu<15\) (note: one-tailed)
  • Step 2 (Expectation): The sample means will vary, and this sampling variation is described by an approximate normal distribution with mean \(15\) and standard deviation \(\text{s.e.}(\bar{x}) = 2.155\).
  • Step 3 (Observation): \(t = -0.791\).
  • Step 4 (Consistency?): The one-tailed \(P\)-value is \(0.220\): The data are consistent with \(H_0\), so there is no evidence to support the alternative hypothesis.

To write a conclusion, include an answer to the question, evidence leading to the conclusion, and some sample summary information:

No evidence exists in the sample (one sample \(t = -0.79\); one-tailed \(P = 0.220\)) that the population mean recovery time is less than 15 days (mean 13.29 days; \(n=17\); 95% CI from 8.73 to 17.86 days) using the new treatment method.

(The CI is found using the ideas in Sect. 22.3, or manually.)

Notice the wording: The new method may be better, but no evidence exists of this in the sample. The onus is on the new method to demonstrate that it is better than the current method.

The sample size is small (\(n = 17\)), so the test may not be statistically valid (but the \(P\)-value is so large that it probably won't affect the conclusions).

28.11 Example: Student IQs

Standard IQ scores are designed to have a mean in the general population of 100, with a standard deviation of 15.

A study of \(n = 224\) students at Griffith University442 found that the sample IQ scores were approximately normally distributed, with a mean of 111.19 and a standard deviation of 14.21.

Is this evidence that students at Griffith University (GU) have a higher mean IQ than the general population?

The RQ is:

For students at Griffith University, is the mean IQ higher than 100?

The parameter is \(\mu\), the population mean IQ for students at GU.

The statistical hypotheses (Step 1: Assumption) about the parameter are, from the RQ:

  • \(H_0\): \(\mu = 100\) (the population mean is \(100\), but \(\bar{x}\) is not \(100\) due to sampling variation): This is the initial assumption.
  • \(H_1\): \(\mu > 100\) (\(\mu\) is greater than \(100\); GU students really do have a higher mean IQ than the general population).

This test is one-tailed: the RQ only asks if the IQ of GU students is greater than 100.

We do not have the original data, but the summary data are sufficient: \(\bar{x} = 111.19\) with a standard deviation of \(s = 14.21\) from a sample of size \(n = 224\).

The sample mean is higher than 100, which:

...might be expected from a university subject pool...

Reilly, Neumann, and Andrews443, p. 7.

Since the sample mean varies for each sample, the sample mean has a standard error:

\[ \text{s.e.}(\bar{x}) = \frac{s}{\sqrt{n}} = \frac{14.21}{\sqrt{224}} = 0.9494456. \] The \(t\)-score is

\[ t = \frac{\bar{x} - \mu}{\text{s.e.}(\bar{x})} = \frac{111.19 - 100}{0.9494456} = 11.78. \]

This is a huge \(t\)-score, which means that a sample mean as large as 111.19 would be highly unlikely to occur simply by chance in a sample of size \(n = 224\) if the population mean really was 100. The \(P\)-value will be extremely small.
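Because only summary data are available, the calculation can be sketched directly from \(\bar{x}\), \(s\) and \(n\). The code below is illustrative; the approximate 95% CI assumes the multiplier of 2 from the 68--95--99.7 rule, and results computed from rounded summary values may differ slightly in the last decimal place from published figures.

```python
import math

n = 224         # number of students
x_bar = 111.19  # sample mean IQ
s = 14.21       # sample standard deviation
mu_0 = 100      # general-population mean IQ: the value of mu in H0

# Standard error and t-score
se = s / math.sqrt(n)
t = (x_bar - mu_0) / se

# Approximate 95% CI, assuming a multiplier of 2
lower = x_bar - 2 * se
upper = x_bar + 2 * se

print(f"s.e. = {se:.4f}, t = {t:.1f}")
print(f"Approximate 95% CI: {lower:.2f} to {upper:.2f}")
```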

To conclude (where the CI is found using the ideas in Sect. 22.3, or manually):

Very strong evidence exists in the sample (one sample \(t = 11.78\); one-tailed \(P < 0.001\)) that the population mean IQ in students at Griffith University is greater than 100 (mean 111.19; \(n = 224\); 95% CI from 109.29 to 113.09).

(Note: IQ scores do not have units of measurement.)

Since the sample size is much larger than 25, this conclusion is statistically valid (i.e., the mathematics behind the computations will be sound).

The sample is not a true random sample from the population of all GU students (the students are mostly first-year students, and most were enrolled in an undergraduate psychological science degree). However, these students may be somewhat representative of all GU students. That is, the sample may be externally valid.

The difference between the general population IQ of 100 and the sample mean IQ of GU students is only small: about 11 IQ units. Possibly, this difference has very little practical significance, even though the statistical evidence suggests that the difference cannot be explained by chance.

28.12 Summary

To test a hypothesis about a population mean \(\mu\):

  • Initially assume the value of \(\mu\) in the null hypothesis to be true.
  • Then, describe the sampling distribution, which describes what to expect from the sample statistic based on this assumption: under certain statistical validity conditions, the sample mean varies with:
    • an approximate normal distribution,
    • centered around the hypothesised value of \(\mu\),
    • with a standard deviation of

\[ \text{s.e.}(\bar{x}) =\frac{s}{\sqrt{n}}. \]

  • The observations are then summarised, and the test statistic computed:

\[ t = \frac{ \bar{x} - \mu}{\text{s.e.}(\bar{x})}, \] where \(\mu\) is the hypothesised value given in the null hypothesis.

  • The \(t\)-value is like a \(z\)-score, and so an approximate \(P\)-value can be estimated using the 68--95--99.7 rule, or found using software.


28.13 Quick review questions

A study444 compared the nutritional intake of \(n = 50\) anaemic infants in Lahore (Pakistan) with the amounts recommended.

The mean daily protein intake in the sample was 14g, with a standard deviation of 3g. The recommended protein intake was 13g.

The researchers wanted to see if the mean intake met the recommendation, or not.

  1. The standard error of the mean (to four decimal places) is

  2. The null hypothesis is:

  3. The test statistic (to two decimal places) is

  4. The two-tailed \(P\)-value is

  5. True or false? We accept the null hypothesis.

  6. There is ______ to support the alternative hypothesis (that the mean daily protein intake is not 13g).


28.14 Exercises

Selected answers are available in Sect. D.26.

Exercise 28.1 The recommended daily energy intake for women is 7725kJ (for a particular cohort, in a particular country; Altman445). The daily energy intake for 11 women was measured to see if this is being adhered to. The RQ is

For this group of women, is the population mean daily energy intake 7725kJ?

The data collected are shown in Table 28.2.

  1. Write the hypotheses for answering this RQ.
  2. Using the jamovi output (Fig. 28.12), write down the values of \(\bar{x}\) and \(\text{s.e.}(\bar{x})\).
  3. Using this output, write down the \(t\)-value and the \(P\)-value.
  4. Write a suitable conclusion.
  5. Is the test statistically valid?
  6. Sketch the sampling distribution of \(\bar{x}\).
TABLE 28.2: Energy consumptions (in kJ) for women
5260 5640 6390 6515 7515 8770
5470 6180 7515 6805 8230

FIGURE 28.12: jamovi output for the energy-intake data
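
The values in the jamovi output can be cross-checked from the raw data in Table 28.2. The following is a minimal sketch using only the Python standard library; it is not part of the original exercise, and the variable names are illustrative. It computes the sample mean, the standard error and the \(t\)-score only; the \(P\)-value and the conclusion are left to the software output.

```python
from math import sqrt
from statistics import mean, stdev

# Energy intakes (in kJ) for the 11 women, from Table 28.2
energy = [5260, 5640, 6390, 6515, 7515, 8770,
          5470, 6180, 7515, 6805, 8230]
mu0 = 7725                 # hypothesised population mean (recommended intake)

n = len(energy)
xbar = mean(energy)        # sample mean
s = stdev(energy)          # sample standard deviation (n - 1 divisor)
se = s / sqrt(n)           # standard error of the sample mean
t = (xbar - mu0) / se      # t-score under the null hypothesis

print(f"n = {n}, mean = {xbar:.1f}, s = {s:.1f}, s.e. = {se:.1f}, t = {t:.2f}")
```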

Exercise 28.2 Most dental associations446 recommend brushing teeth for two minutes. A study447 of the brushing time for 85 uninstructed school children from England (11 to 13 years old) found the mean brushing time was 60.3 seconds, with a standard deviation of 23.8 seconds.

  1. Is there evidence that the mean brushing time for schoolchildren from England is two minutes (as recommended)?
  2. Sketch the sampling distribution of the sample mean.

Exercise 28.3 A study448 of human-automation interaction with automated vehicles aimed to

... determine whether monitoring the roadway for hazards during automated driving results in a vigilance decrement.

--- Greenlee, DeLucia, and Newton449, p. 465

(A 'decrement' is a reduction.) That is, they were interested in whether the average mental demand of 'drivers' of automated vehicles was higher than the average mental demand for ordinary tasks.

In the study, the \(n=22\) participants 'drove', in a simulator, an automated vehicle for 40 minutes. While driving, the drivers monitored the road for hazards. The researchers assessed the 'mental demand' placed on these drivers, where scores over 50 'typically indicate substantial levels of workload' (p. 471). For the sample, the mean score was 84.00 with a standard deviation of 22.05.

Is there evidence of a 'substantial workload' associated with monitoring roadways while 'driving' automated vehicles?

Exercise 28.4 A study explored the quality of life of patients receiving cavopulmonary shunts.450 'Quality of life' was assessed using a 36-question health survey, where the scale is standardised so that the mean of the general population is 50.

For the 14 patients in the study, the sample mean for the 'Physical component' of the survey was 47.2 (with a standard deviation of 8.2). The sample mean for the 'Mental component' of the survey was 52.7 (with a standard deviation of 5.6).

Is there evidence that the patients are different, on average, to the general population on the basis of the results?

Exercise 28.5 A Cherry Ripe is a popular chocolate bar in Australia. In 2017 and 2018, I 'sampled' some Cherry Ripe Fun Size bars. The packet claimed that the Fun Size bars weigh 12 g (on average). Use the jamovi summary of the data (Fig. 28.13) to perform a hypothesis test to determine if the mean weight really is 12 g or not.

FIGURE 28.13: jamovi output for the Cherry Ripes data

Exercise 28.6 A study of paramedics451 asked participants (\(n=199\)) to estimate the amount of blood on four different surfaces. When the actual amount of blood spilt on concrete was 1000 ml, the mean guess was 846.4 ml (with a standard deviation of 651.1 ml).

Is there evidence that the mean guess really is 1000 ml (the true amount)? Is this test likely to be valid?

Exercise 28.7 A quality-control study452 assessed the accuracy of two instruments from a clinical laboratory, by comparing the reported luteotropic hormone (LH) concentrations to known pre-determined values (data below).

Perform a series of tests to determine how well the two instruments perform, for both high- and mid-level LH concentrations (using Table 28.3).

TABLE 28.3: Summary of the quality-control data for LH levels (in mIU/mL) for two instruments
                        High level (Inst. 1)   Mid level (Inst. 1)   High level (Inst. 2)   Mid level (Inst. 2)
Mean of data            64.31                  19.240                64.970                 19.400
Std. dev. of data       1.70                   0.588                 1.029                  0.413
Pre-determined target   64.22                  19.010                65.050                 19.450