32 More about hypothesis testing
You have learnt to ask a RQ, design a study, classify and summarise the data, construct confidence intervals, and perform some hypothesis tests. In this chapter, you will learn more about hypothesis tests. You will learn to:
- understand the steps of hypothesis testing.
- communicate the results of hypothesis tests.
- interpret \(P\)-values.
32.1 Introduction
In Chaps. 30 and 31, hypothesis tests for one proportion and one mean were studied. Later chapters discuss hypothesis tests in other contexts, too. However, the general approach to hypothesis testing is the same for any hypothesis test. This chapter discusses some general ideas in hypothesis testing:
- the assumptions and forming hypotheses (Sect. 32.2).
- the expectations of the statistic, as described by the sampling distribution (Sect. 32.3).
- the observations and the test statistic (Sect. 32.4).
- quantifying the consistency between the statistic and parameter values: computing \(P\)-values (Sect. 32.5).
- interpreting \(P\)-values (Sect. 32.6).
- mistakes that can be made when making conclusions (Sect. 32.7).
- wording conclusions (Sect. 32.8).
- practical importance and statistical significance (Sect. 32.9).
- statistical validity in hypothesis testing (Sect. 32.10).
Hypothesis testing starts by assuming the null hypothesis is true. The onus is on the data to provide evidence to refute this default position. That is, the null hypothesis is retained unless the evidence suggests otherwise.
32.2 About hypotheses and assumptions
Two statistical hypotheses are stated about the population parameter: the null hypothesis \(H_0\), and the alternative hypothesis \(H_1\). Unless evidence suggests otherwise, the null hypothesis is retained.
The word hypothesis means 'a possible explanation'.
Scientific hypotheses refer to potential scientific explanations that can be tested by collecting data. For example, an engineer may hypothesise that replacing sand with glass in the manufacture of concrete will produce desirable characteristics (Devaraj et al. 2021).
Statistical hypotheses refer to statistical explanations that are required to determine whether the evidence (i.e., data) supports the scientific hypotheses. The statistical hypotheses are the foundation of the logic of hypothesis testing. One of the statistical hypotheses may align with the scientific hypothesis.
This book discusses statistical hypotheses.
32.2.1 Null hypotheses
Statistical hypotheses are always about a parameter. Hypothesising, for example, that the sample mean body temperature is equal to \(37.0^\circ\text{C}\) is silly, because the sample mean clearly is \(36.8052^\circ\text{C}\) (Chap. 31), and its value will vary from sample to sample anyway. The RQ is about the unknown population: the P in POCI stands for Population.
The null hypothesis \(H_0\) offers one possible reason why the value of the statistic (such as the sample mean) is not the same as the proposed value of the parameter (such as the population mean): sampling variation. Every sample is different, and we have data from just one of the many possible samples. The value of the statistic will vary from sample to sample; the statistic may not be equal to the parameter, just because of the random sample obtained.
Definition 32.1 (Null hypothesis) The null hypothesis proposes that sampling variation explains the difference between the proposed value of the parameter, and the observed value of the statistic.
Null hypotheses always contain an 'equals', because (as part of the decision making process) a specific value must be assumed for the parameter, so we can describe what we might expect from the sample. For example: the population mean equals \(100\), is less than or equal to \(100\), or is more than or equal to \(100\).
The null hypothesis always assumes the difference between the statistic and the assumed value of the parameter is due to sampling variation. This may mean:
- there is no change in the value of the parameter compared to an established or accepted value (for descriptive RQs);
- there is no change in the value of the parameter for the units of analysis (i.e., for repeated-measures RQs);
- there is no difference between the value of the parameter in two (or more) groups (i.e., for relational RQs); or
- there is no relationship between the variables, as measured by a parameter value (for correlational RQs).
The null hypothesis always has the form 'no difference, no change, no relationship' regarding the population parameter. It is the 'sampling variation' explanation.
Defining the parameter carefully is important!
For example, if a parameter is about the difference between two means (say, in Group A and Group B), then the parameter description must clarify if the parameter is the 'Group A mean minus the Group B mean', or the 'Group B mean minus the Group A mean'. Either is fine (though one may be easier to understand), but the direction used must be clearly stated.
32.2.2 Alternative hypotheses
The other statistical hypothesis is called the alternative hypothesis \(H_1\) (or \(H_a\)). The alternative hypothesis offers another possible reason why the value of the statistic (such as the sample proportion) is not the same as the proposed value of the parameter (such as the population proportion): the value of the parameter really is not the value claimed in the null hypothesis (but does not explain why).
Definition 32.2 (Alternative hypothesis) The alternative hypothesis proposes that the difference between the proposed value of the parameter and the observed value of the statistic cannot be explained by sampling variation. It proposes that the value of the parameter is not the value claimed in the null hypothesis.
Alternative hypotheses can be one-tailed or two-tailed. A two-tailed alternative hypothesis means, for example, that the population mean could be either smaller or larger than what is claimed. A one-tailed alternative hypothesis admits only one of those two possibilities. Most (but certainly not all) hypothesis tests are two-tailed.
The decision about whether the alternative hypothesis is one- or two-tailed is made by reading the RQ (not by looking at the data), and what the RQ asks. The RQ and hypotheses should (in principle) be formed before the data are obtained, or at least before looking at the data if the data are already collected.
The idea of hypothesis testing is the same whether the alternative hypothesis is one- or two-tailed: based on the data and the statistic, a decision is to be made about whether the alternative hypotheses is supported by the data.
Example 32.1 (Alternative hypotheses) For the body-temperature study, the alternative hypothesis is two-tailed: the RQ asks if the population mean is \(37.0^\circ\text{C}\) or not. Two possibilities are considered: that \(\mu\) could be either larger or smaller than \(37.0^\circ\text{C}\).
A one-tailed alternative hypothesis would be appropriate if the RQ asked 'Is the population mean internal body temperature greater than \(37.0^\circ\text{C}\)?', or 'Is the population mean internal body temperature smaller than \(37.0^\circ\text{C}\)?'. One-tailed RQs such as these would only be asked if there were good scientific reasons to suspect a difference in one direction specifically.
Important points about forming hypotheses:
- Hypotheses always concern a population parameter.
- Null hypothesis always have the form 'no difference, no change, no relationship'.
- Alternative hypothesis are one- or two-tailed, depending on the RQ.
- Null hypotheses always contain an 'equals'.
- Hypotheses emerge from the RQ (not the data): the RQ and the hypotheses could be written down before collecting the data.
32.3 About sampling distributions and expectations
The sampling distribution describes, approximately, how the value of the statistic (such as \(\hat{p}\) or \(\bar{x}\)) varies across all possible samples, when \(H_0\) is true; it describes the sampling variation. Under certain circumstances, many sampling distributions have an approximate normal distribution.
When the sampling distribution is described by a normal distribution, the mean of the normal distribution is the parameter value given in the assumption (\(H_0\)), and the standard deviation of the normal distribution is called the standard error. In some cases, the sample statistic does not have a normal distribution, but a quantity easily derived from the sample statistic does have a normal distribution (for example, for odds ratios^{9} in Chap. 35).
The variation in the sampling distribution (as measured by the standard error) depend on the sample size. For example, in one roll of a die, rolling a ⚁, and hence finding a sample proportion of \(\hat{p} = 1\), is not unreasonable. However, in \(20\ 000\) rolls, a sample proportion of \(\hat{p} = 1\) would be incredibly unlikely for a fair die.
32.4 About observations and the test statistic
The sampling distribution describes what values the statistic can take over all possible samples of a given size. When the sampling distribution has an approximate normal distribution, the observed value of the test statistic is \[ \text{test statistic} = \frac{\text{test statistic} - \text{centre of the sampling distribution}} {\text{standard deviation of the sampling distribution}}. \] This is called a test statistic, since the calculation is based on sample data (so it is a statistic) and used in a hypothesis test. This test statistic may be a \(z\)-score or a \(t\)-score. Other test statistics, when the sampling distribution is not a normal distribution, are used too (as in Chap. 35).
A \(t\)-score and \(z\)-score both measure the number of standard deviations that an observation is from the mean: \[ \frac{\text{a value that can vary} - \text{mean of the distribution}} {\text{standard deviation of the distribution}}. \] Then:
- If the quantity that varies is an individual observation \(x\), the measure of variation is the standard deviation of the individual observations.
- If the quantity that varies is a sample statistic, the measure of variation is a standard error, which measures the variation in the sample statistic.
When testing means, the test statistic is a \(t\)-score if the measure of variation use a sample standard deviation.
32.5 About finding \(P\)-values
As demonstrated in Sect. 30.5, \(P\)-values can be approximated (using the \(68\)--\(95\)--\(99.7\) rule or tables) when the sampling distribution has an approximate normal distribution. The \(P\)-value is the area more extreme than the calculated \(z\)- or \(t\)-score (i.e., in the tails of the distribution). The \(68\)--\(95\)--\(99.7\) rule can be used to approximate this tail area.
A lower-case \(p\) or upper-case \(P\) can be used to denote a \(P\)-value. We use an upper-case \(P\), since we use \(p\) to denote a population proportion.
For two-tailed tests, the \(P\)-value is the combined area in the left and right tails. For one-tailed tests, the \(P\)-value is the area in just the left or right tail (as appropriate, according to the alternative hypothesis; see Sect. 31.10).
If the sampling distribution has an approximate normal distribution, the one-tailed \(P\)-value is half the value of the two-tailed \(P\)-value.
Some software (such as SPSS) always reports two-tailed \(P\)-values.
More accurate approximations of the \(P\)-value can be found using tables. For more precise \(P\)-values, use the \(P\)-values from software output.
32.6 About interpreting \(P\)-values
A \(P\)-value is the probability of observing the value of the sample statistic (or something even more extreme) over repeated sampling, under the assumption that the null hypothesis is true. Since the null hypothesis is initially assumed true, the onus is on the data to present evidence to contradict the null hypothesis. That is, the null hypothesis is retained unless the evidence suggests otherwise.
Conclusion are always about the parameters. \(P\)-values tell us about the unknown parameters, based on what we observed from one of the many possible values of the statistic.
A 'big' \(P\)-value mean that the sample statistic (such as \(\bar{p}\)) could reasonably have occurred through sampling variation in one of the many possible samples, if the assumption made about the parameter (stated in \(H_0\)) was true. A 'small' \(P\)-value mean that the sample statistic (such as \(\hat{p}\)) is unlikely to have occurred through sampling variation in one of the many possible samples, if the assumption made about the parameter (stated in \(H_0\)) was true.
Commonly, a \(P\)-value smaller than \(5\)% is considered 'small', but this is arbitrary and sometimes the threshold is discipline-dependent. More reasonably, \(P\)-values should be interpreted as giving varying degrees of evidence in support of the alternative hypothesis (Table 32.1), but these are only guidelines.
If the \(P\)-value is... | Write the conclusion as... |
---|---|
Larger than 0.10 | Insufficient evidence to support \(H_1\) |
Between 0.05 and 0.10 | Slight evidence to support \(H_1\) |
Between 0.01 and 0.05 | Moderate evidence to support \(H_1\) |
Between 0.001 and 0.01 | Strong evidence to support \(H_1\) |
Smaller than 0.001 | Very strong evidence to support \(H_1\) |
Definition 32.3 ($P$-value) A \(P\)-value is the likelihood of observing the sample statistic (or something more extreme) over repeated sampling, under the assumption that the null hypothesis about the population parameter is true.
\(P\)-values are never exactly zero. Some software (e.g., jamovi) reports very small \(P\)-values as '\(P < 0.001\)' (i.e., the \(P\)-value is smaller than \(0.001\)). Some software (e.g., SPSS) reports very small \(P\)-values as '\(P = 0.000\)'. In either case, we should still write \(P < 0.001\).
\(P\)-values are commonly used in research, but must be used and interpreted correctly (Greenland et al. 2016). Specifically:
- A \(P\)-value is not the probability that the null hypothesis is true.
- A \(P\)-value does not prove anything (only one possible sample was studied).
- A big \(P\)-value does not mean the null hypothesis \(H_0\) is true, or that \(H_1\) is false.
- A small \(P\)-value does not mean the null hypothesis \(H_0\) is false, or that \(H_1\) is true.
- A small \(P\)-value does not mean the results are practically important (Sect. 32.9).
- A small \(P\)-value does not necessarily mean a large difference between the statistic and parameter; it means that the difference (whether large or small) could not reasonably be attributed to sampling variation (chance).
Sometimes the results of a study are reported as being statistically significant. This usually means that the \(P\)-value is less than \(0.05\), though a different \(P\)-value is sometimes used as the 'threshold', so check!
To avoid confusion, the word 'significant' should be avoided in writing about research unless 'statistical significance' is actually meant. In other situations, consider using words like 'substantial'.
32.7 About mistakes that can be made in reaching conclusions
Hypothesis testing is about making a decision about a population using a sample. Since the sample is just one of countless possible samples that could have been observed, making an incorrect conclusion is always a possibility.
Two mistakes can be made:
- Incorrectly concluding that evidence supports the alternative hypothesis. Of course, the researchers do not know they are incorrect, but the possibility of making this mistake is always present. This is a false positive, or a Type I error.
- Incorrectly concluding there is no evidence to support the alternative hypothesis. Of course, the researchers do not know they are incorrect, but the possibility of making this mistake is always present. This is a false negative, or a Type II error.
Ideally, neither of these errors would be made; however, sampling variation means that neither can ever be completely eliminated. In practice, hypothesis testing begins by assuming the null hypothesis is true, and hence places the onus on the data to provide compelling evidence in favour of the alternative hypothesis. This means researchers usually prioritise minimising the chance of a Type I error.
A Type I error is like declaring an innocent person guilty (recall: innocence is presumed). Similarly, a Type II error is like declaring a guilty person innocent. The law generally sees a Type I error as more grievous that a Type II error, just as in research.
32.8 About writing conclusions
In general, communicating the result of a hypothesis test requires stating:
- the answer to the RQ;
- the evidence used to reach that conclusion (such as the \(t\)-score and \(P\)-value, clarifying if the \(P\)-value is one-tailed or two-tailed); and
- sample summary statistics (such as sample means and sample sizes), including a CI (indicating the precision with which the parameter has been estimated).
Since we assume the null hypothesis is true, conclusions are worded (in context) in terms of how strongly the evidence supports the alternative hypothesis.
Since the null hypothesis is initially assumed to be true, the onus is on the data to provide evidence in support of the alternative hypothesis. That is, the null hypothesis is retained unless the evidence suggests otherwise. Hence, conclusions are always worded in terms of how much evidence supports the alternative hypothesis.
We do not say whether the evidence supports the null hypothesis; the null hypothesis is already assumed to be true. Even if the current sample presents no evidence to contradict the assumption, future evidence may emerge. That is:
'No evidence of a difference' is not the same as 'evidence of no difference'.
Suppose we conclude that 'there is no evidence that the mean IQ is greater than \(100\) in football players'. This does not mean there is evidence of no difference between the mean IQ for football players and the population mean IQ of \(100\).
32.9 About practical importance and statistical significance
Hypothesis tests assess statistical significance, which answers the question: 'Can sampling variation reasonably explain the difference between the value of the statistic and the assumed value of the parameter?' Even very small differences between the statistic and the parameter can be statistically different if the sample size is sufficiently large.
In contrast, practical importance answers the question: 'Is the difference between the values of the statistic and the parameter of any importance in practice?' Whether a results is of practical importance depends upon the context: what the data are being used for. 'Practical importance' and 'statistical significance' are separate issues.
Example 32.2 (Practical importance) In the body-temperature study (Sect. 31.1), very strong evidence exists that the mean body temperature had changed ('statistical significance'). But the change was so small that, for most purposes, it has no practical importance. In other (e.g., medical) situations, it may have practical importance however.
Example 32.3 (Practical importance) Maunder et al. (2020) studied the use of herbal medicines for weight loss, and found that the intervention:
... resulted in a statistically significant weight loss compared to placebo, although this was not considered clinically significant.
This means that the difference in weight loss between placebo and intervention was unlikely to be explained by chance (\(P < 0.001\); i.e., 'statistical significant'), but the difference was so small that it was unlikely to be of any use in practice ('practical importance'). In this context, the researchers decided that a weight loss of at least \(2.5\) kg was of practical importance. In the study, the sample mean weight loss was \(1.61\) kg.
32.10 About statistical validity
When performing hypothesis tests, statistical validity conditions must be true to ensure that the mathematics behind computing the \(P\)-value is sound. For instance, the statistical validity conditions may ensure that the sampling distribution is sufficiently like a normal distribution for the \(68\)--\(95\)--\(99.7\) rule to apply.
If the statistical validity conditions are not met, the \(P\)-values (and hence conclusions) maybe inappropriate. These statistical validity conditions are usually the same as, or similar to, those for the corresponding CIs (Sect. 25.4).
32.11 Chapter summary
Hypothesis testing formalises the decision-making process. Starting with an assumption about a population parameter of interest, a description of what values the sample statistic might take is produced: this describes what values the statistic is expected to take over all possible samples. This sampling distribution is often a normal distribution, or related to a normal distribution.
The statistic (the sample estimate) is then observed, and a test statistic is computed to describe this sample statistic. Using a \(P\)-value, a decision is made about whether the sample evidence supports or contradicts the initial assumption, and hence a conclusion is made. When the sampling distribution is an approximate normal distribution, the test statistic is a \(t\)-score or \(z\)-score, and \(P\)-values can often be approximated using the \(68\)--\(95\)--\(99.7\) rule.
32.12 Quick review questions
- True or false?
When a \(P\)-value is very small, a very large difference exists between the statistic and parameter.
- True or false?
The alternative hypothesis is one-tailed if the sample statistic is larger than the hypothesised population mean.
- What is wrong (if anything) with this null hypothesis: \(H_0 = 37\)?
- True or false: When the sampling distribution is a normal distribution, the standard deviation of this normal distribution is called the standard error.
- True or false?
Both \(z\)-scores and \(t\)-scores can be test statistics.
- True or false? \(P\)-values can never be exactly zero.
- True or false? A \(P\)-value is the probability
that the null hypothesis is true.
32.13 Exercises
Answers to odd-numbered exercises are available in App. E.
Exercise 32.1 Assuming the tests are statistically valid, use the \(68\)--\(95\)--\(99.7\) rule to approximate the two-tailed \(P\)-value if:
- the \(t\)-score is \(3.4\).
- the \(t\)-score is \(-2.9\).
- the \(t\)-score is \(-2.1\).
- the \(t\)-score is \(-6.7\).
Exercise 32.2 Assuming the tests are statistically valid, use the \(68\)--\(95\)--\(99.7\) rule to approximate the two-tailed \(P\)-value if:
- the \(t\)-score is \(1.05\).
- the \(t\)-score is \(-1.3\).
- the \(t\)-score is \(6.7\).
- the \(t\)-score is \(0.1\).
Exercise 32.3 Consider the \(t\)-scores in Exercise 32.1. Use the \(68\)--\(95\)--\(99.7\) rule to approximate the one-tailed \(P\)-values in each case.
Exercise 32.4 Consider the \(t\)-scores in Exercise 32.2. Use the \(68\)--\(95\)--\(99.7\) rule to approximate the one-tailed \(P\)-values in each case.
Exercise 32.5 Suppose a hypothesis test results in a \(P\)-value of \(0.0501\). What would we conclude? What if the \(P\)-value was \(0.0499\)?
Exercise 32.6 Suppose a hypothesis test results in a \(P\)-value of \(0.011\). What would we conclude? What if the \(P\)-value was \(0.009\)?
Exercise 32.7 Consider the study to determine the mean body temperature (Chap. 31), where \(\bar{x} = 36.8052^{\circ}\text{C}\). Explain why each of these sets of hypotheses are incorrect.
- \(H_0\): \(\bar{x} = 37.0\); \(H_1\): \(\bar{x} \ne 37.0\).
- \(H_0\): \(\mu = 37\); \(H_1\): \(\mu > 37\).
- \(H_0\): \(\mu = 37\); \(H_1\): \(\mu = 36.8052\).
Exercise 32.8 Consider the study to determine the mean body temperature (Chap. 31), where \(\bar{x} = 36.8052^{\circ}\text{C}\). Explain why each of these sets of hypotheses are incorrect.
- \(H_0\): \(\bar{x} = 36.8052\); \(H_1\): \(\bar{x} > 36.8052\).
- \(H_0\): \(\mu = 36.8052\); \(H_1\): \(\mu \ne 36.8052\).
- \(H_0\): \(\mu > 37.0\); \(H_1\): \(\bar{x} > 37.0\).
Exercise 32.9 The recommended daily energy intake for women is \(7725\) kJ (for a particular cohort, in a particular country; Altman (1991)). The daily energy intake for 11 women was measured to see if this is being adhered to. The RQ was 'Is the population mean daily energy intake \(7725\) kJ?'
The test produced \(P = 0.018\). What, if anything, is wrong with these conclusions after completing the hypothesis test?
- There is moderate evidence (\(P = 0.018\)) that the energy intake is not meeting the recommended daily energy intake.
- There is moderate evidence (\(P = 0.018\)) that the sample mean energy intake is not meeting the recommended daily energy intake.
- There is moderate evidence (\(P = 0.018\)) that the population energy intake is not meeting the recommended daily energy intake.
- The study proves that the population energy intake is not meeting the recommended daily energy intake (\(P = 0.018\)).
Exercise 32.10 A study compared ALDI batteries to another brand of battery. In one test (comparing the time taken for \(1.5\) volt AA batteries to reach \(1.1\) volts), the ALDI brand battery took \(5.73\) hours, and the other brand (Energizer) took \(5.44\) hours (P. K. Dunn 2013).
- What is the null hypothesis for the test?
- The \(P\)-value for comparing these two means is about \(P = 0.70\). What does this mean?
- Is this difference likely to be of any practical importance? Explain.
- What would be a correct conclusion for ALDI to report from the study? Explain.
- What else would be useful to know when comparing the two brands of batteries?
Exercise 32.11 An ecologist is studying two different grasses to help combat soil salinity, by comparing to a new grass (Grass A) to a native grass (Grass B). She uses \(50\) different sites, allocating the two grasses at random to the sites (\(25\) sites for each grass).
After \(12\) months, the ecologist records whether the soil salinity at each site has improved, and hence computes the odds that each grass will improve the salinity. She finds a statistically significant difference between the odds in the two groups.
Which of these statements is consistent with this conclusion?
- The \(\text{OR} = 4.1\) and \(P = 0.36\)
- The \(\text{OR} = 4.1\) and \(P = 0.0001\)
- The \(\text{OR} = 0.91\) and \(P = 0.36\)
- The \(\text{OR} = 0.91\) and \(P = 0.0001\)
How would the other statements be interpreted then?