29 More about hypothesis testing

So far, you have learnt to ask a RQ, identify different ways of obtaining data, design the study, collect the data, describe the data, summarise data graphically and numerically, understand the tools of inference, and form confidence intervals.

In this chapter, you will learn about hypothesis tests. You will learn to:

  • communicate the results of hypothesis tests.
  • interpret \(P\)-values.

29.1 Introduction

In Chap. 28, hypothesis tests for one mean were studied. In later chapters, hypothesis tests are discussed in other contexts, too.

The general approach to hypothesis testing is the same for any hypothesis test, and so some general ideas are discussed in this chapter. All hypothesis tests answer questions about unknown population quantities (such as the population mean \(\mu\)), based on sample statistics (such as the sample mean \(\bar{x}\)).

The sections that follow discuss:

  • The assumptions and forming hypotheses (Sect. 29.2).
  • The sampling distribution, and the expectations (Sect. 29.3).
  • The observations and the test statistic (Sect. 29.4).
  • Weighing the evidence for consistency: \(P\)-values (Sect. 29.6).
  • Wording conclusions (Sect. 29.7).

When raw data are provided, begin by producing graphical and numerical summaries of the data. The statistical validity conditions, which vary for different hypothesis tests, should always be checked to see if the test is statistically valid.

29.2 About hypotheses and assumptions

Two hypotheses are made about the population parameter: the null hypothesis and the alternative hypothesis.

29.2.1 Null hypotheses

Hypotheses always concern a population parameter. Hypothesising, for example, that the sample mean body temperature is equal to \(37.0^\circ\text{C}\) is pointless, because it clearly isn't: the sample mean is \(36.8051^\circ\text{C}\). Besides, the RQ is about the unknown population: the P in POCI stands for Population.

The null hypothesis \(H_0\) offers one possible reason why the value of the sample statistic (such as the sample mean) is not the same as the value of the proposed population parameter (such as the population mean): sampling variation.

Every sample is different, and so the sample statistic will vary from sample to sample; it may not be equal to the population parameter, just because of the sample used by chance.

Null hypotheses always have an 'equals' in them (for example, the population mean equals 100, is less than or equal to 100, or is more than or equal to 100), because (as part of the decision-making process) something specific must be assumed for the population parameter.

The parameter can take many different forms, depending on the context. The null hypothesis about the parameter is the default value of that parameter; for example,

  • there is no difference between the parameter value in two (or more) groups;
  • there is no change in the parameter value; or
  • there is no relationship as measured by a parameter value.

Hypothesis testing starts by assuming that the null hypothesis is true.

The onus is on the data to provide evidence to refute this default position.

The null hypothesis is always about a population parameter, and always has the form 'no difference, no change, no relationship'.

Definition 29.1 (Null hypothesis) The null hypothesis proposes that sampling variation explains the difference between the proposed value of the parameter, and the observed value of the statistic.

29.2.2 Alternative hypotheses

The other hypothesis is called the alternative hypothesis \(H_1\). The alternative hypothesis offers another possible reason why the value of the sample statistic (such as the sample mean) is not the same as the value of the proposed population parameter (such as the population mean).

The alternative hypothesis proposes that the value of the population parameter really is not the value claimed in the null hypothesis.

Definition 29.2 (Alternative hypothesis) The alternative hypothesis proposes that the difference between the proposed value of the parameter and the observed value of the statistic cannot be explained by sampling variation: the proposed value of the parameter is probably not true.

Alternative hypotheses can be one-tailed or two-tailed. A two-tailed alternative hypothesis means, for example, that the population mean could be either smaller or larger than what is claimed. A one-tailed alternative hypothesis admits only one of those two possibilities. Most (but not all) hypothesis tests are two-tailed.

The decision about whether the alternative hypothesis is one- or two-tailed is made by reading the RQ (not by looking at the data).

Indeed, the RQ and hypotheses should (in principle) be formed before the data are obtained, or at least before looking at the data if the data are already collected.

The ideas are the same whether the alternative hypothesis is one- or two-tailed: based on the data and the sample statistic, a decision is to be made about whether the alternative hypothesis is supported by the data.

Example 29.1 (Alternative hypotheses) For the body-temperature study, the alternative hypothesis is two-tailed: The RQ asks if the population mean is \(37.0^\circ\text{C}\) or not. That is, two possibilities are considered: that \(\mu\) could be either larger or smaller than \(37.0^\circ\text{C}\).

A one-tailed alternative hypothesis would be appropriate if the RQ was: 'Is the population mean internal body temperature greater than \(37.0^\circ\text{C}\)?', or 'Is the population mean internal body temperature smaller than \(37.0^\circ\text{C}\)?'.
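In symbols, the two-tailed hypotheses described in Example 29.1 could be written as follows, where \(\mu\) is the population mean internal body temperature (in degrees C): \[ H_0\text{: } \mu = 37.0 \quad\text{and}\quad H_1\text{: } \mu \ne 37.0. \]

For the first of the one-tailed RQs, one reasonable way to write the hypotheses (a sketch, using the simplest 'equals' form of the null hypothesis) is \[ H_0\text{: } \mu = 37.0 \quad\text{and}\quad H_1\text{: } \mu > 37.0, \] with \(H_1\text{: } \mu < 37.0\) used instead for the second of the one-tailed RQs.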

Important points about forming hypotheses:

  • Hypotheses always concern a population parameter.
  • Null hypotheses always contain an 'equals'.
  • Alternative hypotheses are one-tailed or two-tailed, depending on the RQ.
  • Hypotheses emerge from the RQ (not the data): The RQ and the hypotheses could be written down before collecting the data.

29.3 About sampling distributions and expectations

The sampling distribution describes, approximately, how the sample statistic (such as \(\bar{x}\)) is likely to vary from sample to sample over many repeated samples, when \(H_0\) is true: it describes the sampling variation.

Under certain circumstances, sampling distributions often have an approximate normal distribution, which is the basis for computing \(P\)-values (or approximating \(P\)-values using the 68--95--99.7 rule).

When the sampling distribution is described by a normal distribution, the mean of the normal distribution is the parameter value given in the assumption (\(H_0\)), and the standard deviation of the normal distribution is called the standard error.

In some cases, the sample statistic may not have a normal distribution, but a quantity easily derived from the sample statistic does have a normal distribution (for example, the odds ratio467).
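The idea of a sampling distribution can be explored by simulation. The following is a minimal sketch, not a method used in this book: it assumes a normally distributed population in which \(H_0\) is true, and the values of `mu0`, `sigma` and `n` are hypothetical, chosen only for illustration. It draws many samples, computes each sample mean, and compares the mean and standard deviation of those sample means with the assumed parameter value and the standard error (which, for a sample mean, is \(\sigma/\sqrt{n}\)).

```python
import math
import random
import statistics


def simulate_sample_means(mu0, sigma, n, num_samples, seed=1):
    """Draw repeated samples from a normal population in which H0 is true,
    and return the sample mean of each sample."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = []
    for _ in range(num_samples):
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return means


# Hypothetical population values, chosen only for illustration.
mu0, sigma, n = 37.0, 0.4, 25
means = simulate_sample_means(mu0, sigma, n, num_samples=5000)

# Theoretical standard error of the sample mean: sigma / sqrt(n).
standard_error = sigma / math.sqrt(n)

print("mean of the sample means:", round(statistics.fmean(means), 3))
print("sd of the sample means:  ", round(statistics.stdev(means), 3))
print("standard error:          ", round(standard_error, 3))
```

Under these assumptions, the mean of the simulated sample means should be close to `mu0`, and their standard deviation should be close to the standard error.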

29.4 About observations and the test statistic

The sampling distribution describes what values the sample statistic can reasonably be expected to have, over many repeated samples.

Since the sampling distribution of the statistic has an approximate normal distribution under certain conditions, the observed value of the sample statistic can be expressed as something like a \(z\)-score (called a \(t\)-score when the population standard deviation is unknown). In general, \(t\)-scores always have the same form:

\[ t = \frac{\text{sample statistic} - \text{assumed population parameter}} {\text{measure of variation of the sample statistic}}. \]

The \(t\)-score here is the test statistic, since it is based on sample data ('a statistic') and used in a hypothesis test.

A \(t\)-score is similar to a \(z\)-score; both the \(z\)- and \(t\)-scores have the same form: \[ \frac{\text{sample value} - \text{population value}} {\text{measure of variation of the sample value}}. \]

Then:

  • If the 'sample value' refers to an individual observation \(x\), the measure of variation is the standard deviation, because the standard deviation measures the variation in the individual observations.
  • If the 'sample value' is a sample statistic, the measure of variation is a standard error, because the standard error measures the variation in the sample statistic.

In both cases, if the measure of variation uses a known population value, a \(z\)-score is found; if the measure of variation uses a sample value, a \(t\)-score is found.
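For a test about one mean, the general form above might be computed as in the following sketch. The function name `t_score` is made up for this illustration. The sample mean and the assumed population mean are taken from the body-temperature example; the sample standard deviation and sample size are hypothetical values, used only so that the code runs.

```python
import math


def t_score(sample_mean, assumed_mean, sample_sd, n):
    """Return (sample statistic - assumed parameter) / standard error.

    For a sample mean, the standard error is s / sqrt(n).
    """
    standard_error = sample_sd / math.sqrt(n)
    return (sample_mean - assumed_mean) / standard_error


# sample_mean and assumed_mean come from the body-temperature example;
# sample_sd and n are hypothetical values for illustration only.
t = t_score(sample_mean=36.8051, assumed_mean=37.0, sample_sd=0.41, n=130)
print(round(t, 2))
```

A negative \(t\)-score simply means that the sample statistic is smaller than the assumed value of the parameter.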

29.5 About finding \(P\)-values

As demonstrated in Sect. 28.5.1, often \(P\)-values can be approximated by using the 68--95--99.7 rule and using a diagram of a normal distribution. The \(P\)-value is the area more extreme than the calculated \(t\)-score; the 68--95--99.7 rule can be used to approximate this tail area.
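The reasoning behind the 68--95--99.7 rule can be written out as a small function. This is only a sketch: the function name is hypothetical, the \(t\)-score is treated like a \(z\)-score, and the rule only brackets the two-tailed \(P\)-value between two bounds rather than giving its value.

```python
def bracket_two_tailed_p(t_score):
    """Bracket the two-tailed P-value using the 68-95-99.7 rule.

    Returns (lower, upper) bounds, treating the t-score like a z-score.
    """
    size = abs(t_score)
    if size < 1:
        return (0.32, 1.0)    # less than 68% of the area lies within |t|
    if size < 2:
        return (0.05, 0.32)   # between 68% and 95% lies within |t|
    if size < 3:
        return (0.003, 0.05)  # between 95% and 99.7% lies within |t|
    return (0.0, 0.003)       # more than 99.7% lies within |t|


print(bracket_two_tailed_p(2.5))   # (0.003, 0.05)
print(bracket_two_tailed_p(-0.5))  # (0.32, 1.0)
```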

For two-tailed tests, the \(P\)-value is the combined area in the left and right tails. For one-tailed tests, the \(P\)-value is the area in just the left or right tail.

When software reports two-tailed \(P\)-values, a one-tailed \(P\)-value is found by halving the two-tailed \(P\)-value, provided the sample statistic lies in the direction proposed by the alternative hypothesis.
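The relationship between one- and two-tailed \(P\)-values can be sketched as follows. The function names are hypothetical, and the tail areas are taken from a standard normal distribution (that is, the score is treated as a \(z\)-score); statistical software would usually use a \(t\)-distribution instead, so the values will differ slightly for small samples.

```python
import math


def two_tailed_p(z):
    """Two-tailed P-value: combined area in both tails beyond |z|."""
    return math.erfc(abs(z) / math.sqrt(2))


def one_tailed_p(z, alternative):
    """One-tailed P-value; `alternative` is 'greater' or 'less',
    chosen from the RQ (not from the data)."""
    if alternative not in ("greater", "less"):
        raise ValueError("alternative must be 'greater' or 'less'")
    half = two_tailed_p(z) / 2
    # Halving is only correct when the score lies in the direction of H1.
    in_direction = (z > 0) if alternative == "greater" else (z < 0)
    return half if in_direction else 1 - half


print(round(two_tailed_p(1.96), 3))
print(round(one_tailed_p(1.96, "greater"), 3))
print(round(one_tailed_p(1.96, "less"), 3))
```

When the score lies in the opposite direction to that proposed by \(H_1\), the one-tailed \(P\)-value is large (more than 0.5), so simply halving the two-tailed \(P\)-value would be misleading.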

More accurate estimates of the \(P\)-value can be found using \(z\)-tables, though we do not demonstrate this in this book. Even more precise estimates of \(P\)-values can be found using specially-prepared \(t\)-tables. Again, we do not do so in this book.

For more precise \(P\)-values, we will take the \(P\)-values from software output.

When using software to obtain \(P\)-values, be sure to check if the software reports one- or two-tailed \(P\)-values.

For example, some software (such as SPSS) always reports two-tailed \(P\)-values.

29.6 About interpreting \(P\)-values

A \(P\)-value is the likelihood of observing the sample statistic (or something even more extreme) over repeated sampling, under the assumption that the null hypothesis about the population parameter is true.

\(P\)-values can be computed because the sampling distribution often has an approximate normal distribution.

TABLE 29.1: A guideline for interpreting \(P\)-values. \(P\)-values should be interpreted in context.

If the \(P\)-value is...     Write the conclusion as...
Larger than 0.10             Insufficient evidence to support \(H_1\)
Between 0.05 and 0.10        Slight evidence to support \(H_1\)
Between 0.01 and 0.05        Moderate evidence to support \(H_1\)
Between 0.001 and 0.01       Strong evidence to support \(H_1\)
Smaller than 0.001           Very strong evidence to support \(H_1\)
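The guideline in Table 29.1 can be expressed as a simple lookup. This is a sketch only: the function name is hypothetical, and because the table does not say which category a \(P\)-value exactly on a boundary (such as 0.05) belongs to, the choice made in the code is an assumption. As the table caption says, the wording is a guideline, and conclusions should still be written in context.

```python
def evidence_wording(p_value):
    """Map a P-value to the guideline wording in Table 29.1.

    Boundary values (e.g. exactly 0.05) are placed in the stronger
    category; the table itself does not specify this.
    """
    if not 0 <= p_value <= 1:
        raise ValueError("a P-value must lie between 0 and 1")
    if p_value > 0.10:
        return "Insufficient evidence to support H1"
    if p_value > 0.05:
        return "Slight evidence to support H1"
    if p_value > 0.01:
        return "Moderate evidence to support H1"
    if p_value > 0.001:
        return "Strong evidence to support H1"
    return "Very strong evidence to support H1"


print(evidence_wording(0.018))  # Moderate evidence to support H1
print(evidence_wording(0.70))   # Insufficient evidence to support H1
```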

Conclusions are always about the population values.

No-one needs \(P\)-values to see if the sample values are the same: We can just look at them, and see.

\(P\)-values are needed to determine what we learn about the unknown population values, based on what we see in the sample values.

Commonly, a \(P\)-value smaller than 5% is considered 'small', but this is arbitrary. More reasonably, \(P\)-values should be interpreted as giving varying degrees of evidence in support of the alternative hypothesis (Table 29.1), but these are only guidelines.

Conclusions should be written in the context of the problem. Sometimes, authors will write that the results are 'statistically significant' when \(P<0.05\).

Definition 29.3 (P-value) A \(P\)-value is the likelihood of observing the sample statistic (or something more extreme) over repeated sampling, under the assumption that the null hypothesis about the population parameter is true.

\(P\)-values are never exactly zero. When SPSS reports that '\(P=0.000\)', it means that the \(P\)-value is less than 0.001, which we write as '\(P<0.001\)'.

jamovi usually reports very small \(P\)-values as '\(P<0.001\)'.

\(P\)-values are commonly used in research, but they need to be used and interpreted correctly.468 Specifically:

  • A \(P\)-value is not the probability that the null hypothesis is true.
  • A \(P\)-value does not prove anything.
  • A big \(P\)-value does not mean that the null hypothesis \(H_0\) is true, or that \(H_1\) is false.
  • A small \(P\)-value does not mean that the null hypothesis \(H_0\) is false, or that \(H_1\) is true.
  • A small \(P\)-value does not indicate that the results are practically important (Sect. 29.8).
  • A small \(P\)-value does not mean a large difference between the statistic and parameter; it means that the difference could not reasonably be attributed to sampling variation (chance).
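One way to see why a small \(P\)-value does not prove that \(H_0\) is false is to simulate many samples from a population in which \(H_0\) really is true. The sketch below makes several assumptions for illustration only: a normal population with hypothetical values for `mu0`, `sigma` and `n`, a known population standard deviation (so a \(z\)-score is used), and \(P\)-values taken from a normal distribution.

```python
import math
import random
import statistics


def simulate_p_values(mu0, sigma, n, num_samples, seed=1):
    """Two-tailed P-values for repeated samples drawn when H0 is true."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    standard_error = sigma / math.sqrt(n)
    p_values = []
    for _ in range(num_samples):
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        z = (statistics.fmean(sample) - mu0) / standard_error
        p_values.append(math.erfc(abs(z) / math.sqrt(2)))
    return p_values


# Hypothetical population values, chosen only for illustration.
p_values = simulate_p_values(mu0=100, sigma=15, n=30, num_samples=4000)
share_small = sum(p < 0.05 for p in p_values) / len(p_values)
print("proportion of samples with P < 0.05:", round(share_small, 3))
```

Under these assumptions, roughly 5% of the samples produce \(P < 0.05\) even though \(H_0\) is true in every case: small \(P\)-values can arise through sampling variation alone.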

Sometimes, the results from hypothesis tests are called "significant" or "statistically significant".

This means that the \(P\)-value is small (traditionally, but arbitrarily, \(P < 0.05\)), and hence the evidence supports the alternative hypothesis.

To avoid confusion, the word "significant" should be avoided in writing about research unless "statistical significance" is what is actually meant. In other situations, consider using words like "substantial".

29.7 About writing conclusions

When reporting a conclusion, three things should be included:

  1. The answer to the RQ;
  2. The evidence used to reach that conclusion (such as the \(t\)-score and \(P\)-value, clarifying if the \(P\)-value is one-tailed or two-tailed); and
  3. Some sample summary statistics (such as sample means and sample sizes), including a CI (which indicates the precision with which the statistic has been estimated).

Conclusions can never be made with certainty from one sample. Partly this is because a sample has been studied, while the RQ asks about the whole population: The entire population wasn't studied.

For this reason, care must be taken when answering the RQ. A hypothesis test never proves anything: It might conclude that evidence exists (perhaps weak evidence; perhaps strong evidence) to support the alternative hypothesis. Of course, there may be no evidence to support the alternative hypothesis either.

Since the value of the parameter in the null hypothesis is assumed true, the onus is on the data to provide evidence to refute this default position. For this reason, conclusions are worded in terms of the level of support for the alternative hypothesis.

Conclusions are always made in terms of how much evidence supports the alternative hypothesis. Hypothesis tests assume the null hypothesis is true, so the onus is on the data to provide evidence in support of the alternative hypothesis.

What is wrong with the following conclusion?

The evidence proves that the mean internal body temperature has changed.

29.8 About practical importance and statistical significance

Hypothesis tests assess statistical significance, which answers the question: 'Is there evidence of a difference between the value of the statistic and the value of the assumed parameter?'. Even very small differences between the sample statistic and the population parameter can be statistically significant if the sample size is large enough.

In contrast, practical importance asks the question:

Is the difference between the value of the statistic and the value of the assumed parameter of any practical importance?

'Practical importance' and 'statistical significance' are two separate (but both important) issues. Whether a result is of practical importance depends upon the context: what the data are being used for, by whom, and for what purpose.
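The effect of sample size on statistical significance can be illustrated numerically. The sketch below is based on hypothetical values only: the same small difference between the sample mean and the assumed population mean (0.05 units, with a standard deviation of 1 unit) is tested with three different sample sizes, using a normal distribution to approximate the two-tailed \(P\)-value.

```python
import math


def z_and_p(difference, sd, n):
    """Return the z-score and approximate two-tailed P-value for a given
    difference between the sample mean and the assumed population mean."""
    standard_error = sd / math.sqrt(n)
    z = difference / standard_error
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# Hypothetical values: the difference stays the same; only n changes.
for n in (25, 2500, 250000):
    z, p = z_and_p(difference=0.05, sd=1.0, n=n)
    print(f"n = {n:6d}: z = {z:5.2f}, two-tailed P = {p:.3g}")
```

The difference is identical in all three cases, so its practical importance has not changed; only the strength of the statistical evidence changes as the sample size increases.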

Example 29.2 (Practical importance) In the body-temperature study, very strong evidence exists that the mean body temperature had changed ('statistical significance').

But the change was so small that, for most purposes, it has no practical importance. (There may be other situations (e.g., medical) where it does have practical importance, however.)

Practical importance depends on the context in which the results will be used.

Example 29.3 (Practical importance) A study of some herbal medicines469 for weight loss found:

Phaseolus vulgaris resulted in a statistically significant weight loss compared to placebo, although this was not considered clinically significant.

In other words, although the difference in weight loss between placebo and Phaseolus vulgaris was unlikely to be explained by chance (\(P<0.001\), which is 'statistically significant'), the difference was so small in size (a mean weight loss of just 1.61 kg) that it was unlikely to be of any use in practice ('practical importance').

In this context, a weight loss of at least 2.5 kg was considered to be of practical importance.

29.9 Validity and hypothesis testing

When performing hypothesis tests, certain statistical validity conditions must be true. These conditions ensure that the sampling distribution is sufficiently close to a normal distribution for the 68--95--99.7 rule to apply and hence for \(P\)-values to be computed470.

If these conditions are not met, the sampling distribution may not be normally distributed, so the \(P\)-values (and hence conclusions) may be inappropriate.

In addition to the statistical validity conditions, the internal validity and external validity of the study should also be discussed (Fig. 29.1). These are usually (but not always) the same as for CIs (Sect. 21.3).

Regarding external validity, all the computations in this book assume a simple random sample. If the sample is from a random sampling method, but not from a simple random sample, then methods exist for conducting hypothesis tests that are externally valid, but are more complicated than those described in this book.

If the sample is a non-random sample, then the hypothesis test may be reasonable for the quite specific population that is represented by the sample; however, the sample probably does not represent the more general population that is probably intended.

External validity requires that a study is also internally valid. Internal validity can only be discussed if details are known about the study design.

FIGURE 29.1: Three types of validities for studies.

In addition, hypothesis tests also require that the sample size is less than 10% of the population size; however this is almost always the case.

29.10 Summary

Hypothesis testing formalises the steps of the decision-making process. Starting with an assumption about a population parameter of interest, a description of what values the sample statistic might take (based on this assumption) is produced: this describes what values the statistic is expected to take, just through sampling variation. This sampling distribution is often a normal distribution, or related to a normal distribution.

The sample statistic (the estimate) is then observed, and a test statistic, which often is a \(t\)-score, is computed to describe this sample statistic. Using a \(P\)-value, a decision is made about whether the sample evidence supports or contradicts the initial assumption, and hence a conclusion is made. Since \(t\)-scores are like \(z\)-scores, \(P\)-values can often be approximated using the 68--95--99.7 rule.

29.11 Quick review questions

  1. True or false? When a \(P\)-value is very small, a very large difference exists between the statistic and parameter.

  2. True or false? The alternative hypothesis is one-tailed if the sample statistic is larger than the hypothesised population mean.

  3. What is wrong (if anything) with this null hypothesis: \(H_0=37\)?

  4. True or false? When the sampling distribution is a normal distribution, the standard deviation of this normal distribution is called the standard error.

  5. True or false? Both \(z\)-scores and \(t\)-scores can be test statistics.

  6. True or false? \(P\)-values can never be exactly zero.

  7. True or false? A \(P\)-value is the probability that the null hypothesis is true.

29.12 Exercises

Selected answers are available in Sect. D.27.

Exercise 29.1 Use the 68--95--99.7 rule to approximate the two-tailed \(P\)-value if:

  1. the \(t\)-score is \(3.4\).
  2. the \(t\)-score is \(-2.9\).
  3. the \(t\)-score is \(1.2\).
  4. the \(t\)-score is \(-0.95\).
  5. the \(t\)-score is \(-0.2\).
  6. the \(t\)-score is \(6.7\).

Exercise 29.2 Consider the \(t\)-scores in Exercise 29.1. Use the 68--95--99.7 rule to approximate the one-tailed \(P\)-values in each case.

Exercise 29.3 Suppose a hypothesis test results in a \(P\)-value of 0.0501. What would we conclude? What about if the \(P\)-value was 0.0499?

Exercise 29.4 Consider again the study to determine the mean body temperature, where \(\bar{x} = 36.8051^{\circ}\text{C}\). What, if anything, is wrong with these hypotheses? Explain.

  1. \(H_0\): \(\bar{x} = 36\) and \(H_1\): \(\bar{x} \ne 36\).
  2. \(H_0\): \(\bar{x} = 36.8051\) and \(H_1\): \(\bar{x} > 36.8051\).
  3. \(H_0\): \(\mu = 36.8051\) and \(H_1\): \(\mu \ne 36.8051\).
  4. \(H_0\): \(\mu = 36\) and \(H_1\): \(\mu = 36.8051\).
  5. \(H_0\): \(\mu > 36\) and \(H_1\): \(\bar{x} > 36\).
  6. \(H_0\): \(\mu = 36\) and \(H_1\): \(\mu > 36\).

Exercise 29.5 The recommended daily energy intake for women is 7725 kJ (for a particular cohort, in a particular country; Altman471). The daily energy intake for 11 women was measured to see if this is being adhered to. The RQ was:

Is the population mean daily energy intake 7725 kJ?

The test produced \(P=0.018\). What, if anything, is wrong with these conclusions after completing the hypothesis test?

  1. There is moderate evidence (\(P = 0.018\)) that the energy intake is not meeting the recommended daily energy intake.
  2. There is moderate evidence (\(P = 0.018\)) that the sample mean energy intake is not meeting the recommended daily energy intake.
  3. There is moderate evidence (\(P = 0.018\)) that the population energy intake is not meeting the recommended daily energy intake.

Exercise 29.6 A study compared ALDI batteries to another brand of battery. In one test comparing the length of time it takes for 1.5 volt AA batteries to reach 1.1 volts, the ALDI brand battery took 5.73 hours, and the other brand (Energizer) took 5.44 hours.472

  1. The \(P\)-value for comparing these two means is about \(P=0.70\). What does this mean?
  2. Is this difference likely to be of any practical importance? Explain.
  3. What would be a useful, but correct, conclusion for ALDI to report from the study? Explain.
  4. What else would be useful to know in comparing the two brands of batteries?