Chapter 16: Hypothesis Testing

Are your observations unexpected?

Motivating Scenarios: .

We want to understand the standard by which scientists take results seriously versus chalking them up to sampling error.
We are visiting a doctor, reading a newspaper, or a scientific paper, and are told results are “statistically significant” or have a p-value of XXX, and we want to know what that means.
We have done a statistical analysis and want to guard against the possibility that our seemingly exceptional results are merely attributable to sampling error.

There is one MAJOR GOAL here: You should understand a p-value and its limitations. But to break this down:
Learning goals: By the end of this chapter, you should be able to:

Figure 1: My brief (about five minute long) introduction to the idea behind null hypothesis significance testing. This is optional.

Explain why we create null models and what makes a good one.
Explain the motivation for Null Hypothesis Significance Testing (NHST).
Explain the role of a test statistic in hypothesis testing.
Describe a p-value in relation to the sampling distribution under the null.
Explain what a false positive and a false negative are, and how likely we are to observe one or the other as sample size increases.
Explain why a p-value is not “the probability that the null is true.”
Explain the concept of statistical power and how it relates to sample size and effect size.

In addition to this reading, the other assignment is to read the Statement on P-values from the American Statistical Association (Wasserstein and Lazar 2016), and the article The p-value statement, five years on(Matthews 2021), then watch the video from “Calling Bullshit” on the “Prosecutor’s Fallacy.”

Review and Motivation for NHST

In our introduction to statistics and our section on summarizing data, we introduced estimation as a central goal of statistics—summarizing what we observe to estimate population parameters from a sample.

We also discussed, in our introduction, sampling, and uncertainty sections, that all estimates are subject to sampling error. The goal of null hypothesis significance testing (NHST) is to determine whether our observations can be reasonably explained by sampling error. Let’s work through an example:

Imagine we conducted an experiment where 15,000 people received the Moderna Covid vaccine, and another 15,000 received a placebo. This experimental design is intended to:

Estimate parameters of interest (e.g., the probability of contracting Covid, or the frequency of severe Covid among those who contracted it) in vaccinated and unvaccinated populations.
Compare these parameters between the vaccine and placebo groups.

Here are the estimates from the data:

Vaccine group: 11 Covid cases, 0 severe cases.
Placebo group: 185 Covid cases, 30 severe cases.

So, did the vaccine work? There are certainly fewer Covid cases in the vaccine group.

However, these are estimates, not parameters. We didn’t study entire populations but rather drew samples from them. As a result, our estimates reflect both sampling variability and potential true differences between populations. Before rolling out a vaccination campaign, we need to determine whether these results can be explained by something other than a real effect.

What leads samples to deviate from a population? In principle, in addition to a real effect, sampling bias, non-independent sampling, and sampling error could all lead to a deviation between estimates and true population parameters. Our goal in null hypotheses significance testing is to see if results are easily explained by sampling error. That is, NHST helps us assess whether the differences between our observations and expectations can be attributed to sampling error.

Null Hypothesis Significance Testing

Figure 2: My brief (about five minute long) introduction to the steps that make up null-hypothesis significance testing.

In null hypothesis significance testing, we aim to determine how easily our results can be explained by a “null model.” To do this, we follow four key steps:

State the null hypothesis and its alternative.
Calculate a test statistic to summarize the data.
Compare the observed test statistic to the sampling distribution of this statistic under the null model.
Interpret the results. If the test statistic falls in an extreme tail of the sampling distribution, we reject the null hypothesis; otherwise, we do not.

Steps 1-3 are relatively straightforward.

⚠️ However, Step 4 is quite strange and represents one of the more challenging concepts in statistics. Part of the difficulty lies in the fact that what we traditionally do in the field doesn’t entirely make sense ⚠️.

Statistical Hypotheses

Scientific hypotheses are exciting.

They should be well-motivated by our understanding of the biological question.
There are potentially infinite biological hypotheses, and we’ll explore many over the course of this term.
The answers to these hypotheses should be meaningful and consequential.
Strong biological hypotheses are grounded in an understanding of the process being studied.
We usually have some sense of the expected strength of the effect our biological hypothesis predicts.

We examine biological hypotheses through statistical hypotheses.

Statistical hypotheses, on the other hand, are boring.
In null hypothesis significance testing, we repeatedly examine the same two statistical hypotheses.
Statistical hypotheses don’t care about your theories or have any sense of effect size.

While we can’t perfectly align biological and statistical hypotheses, a good study makes its best effort!

Unfortunately, as scientists, we’re usually trying to evaluate support for an exciting scientific hypothesis. But in frequentist statistics, we do this in a somewhat backwards way—by testing the plausibility of a boring statistical hypothesis, known as the null hypothesis.

The Null Hypothesis

So, what is the null hypothesis? The Null Hypothesis, also called $H_0$ , skeptically argues that the data come from a boring population described by the null model.

The null model is extremely specific. It is meant to represent the process of sampling as accurately as possible, and claims that this process can explain any interesting observations in our data. For the case above, the null hypothesis for the effect of the vaccine on Covid cases is:

$H_0$ : The frequency of Covid cases does not differ between vaccinated and unvaccinated populations.

Russell Westbrook dramatically yawning, with the text 'Cool story, bro.' at the bottom.

Figure 3: The null hypothesis is unimpressed by your sampling error.

The Alternative Hypothesis

The alternative hypothesis, $H_A$ , is much vaguer than the null. While the null hypothesis claims that nothing is happening, the alternative hypothesis claims that something is happening, but it doesn’t specify exactly what. For the case above, the alternative hypothesis for the effect of the vaccine is:

$H_A$ : The frequency of Covid cases differs between vaccinated and unvaccinated populations.

One-tailed and Two-tailed Tests

Notice that the alternative hypothesis above is that the frequency of Covid cases differs between the vaccinated and unvaccinated groups. What if we only cared to know whether the vaccine provided greater immunity than no vaccine? In theory, we could look at only one side of the null distribution. However, in practice, one-tailed tests are almost always a bad idea. For example, we would definitely want to know if the vaccine somehow made us more susceptible to Covid. For that reason, we generally avoid one-tailed tests.

Rare cases when a one-tailed test is appropriate occur when both extremes of the outcome are on the same side of the null distribution. For instance, if I were studying the absolute value of something, the null hypothesis would be that it’s zero, and the alternative would be that it’s greater than zero. We’ll see that some test statistics, like the

$F$ statistic and (often) the

$\chi^2$ statistic, only have one relevant tail.

A comic shows two stick figures talking. One says, *I cant believe schools are still teaching kids about the null hypothesis.* The second figure responds, *I remember reading a big study that conclusively disproved it years ago.* In the background, a child sits at a desk, appearing to work on something.

Figure 4: From xkcd. Rollover text: Heck, my eighth grade science class managed to conclusively reject it just based on a classroom experiment. Its pretty sad to hear about million-dollar research teams who can’t even manage that.

The Test Statistic

So how do we evaluate whether our data could easily be produced by the null model? The first step is to summarize the data with a test statistic. For now, let’s summarize our results as the number of cases in the vaccine group divided by the total number of cases in the study. For the Moderna study on Covid cases, this equals $\frac{11}{11+185} = \frac{11}{196} \approx 0.0561$ .

A test statistic can be just about anything that makes sense for the data, but commonly used test statistics, such as $F$ , $\chi^2$ , $Z$ , and $t$ , are popular because their behavior is well understood.

The Sampling Distribution Under the Null Hypothesis

Because the null model is specific, we can generate the expected distribution of the test statistic by creating a sampling distribution from the null model. For now, I will provide you with sampling distributions of test statistics under the null. Later, we’ll learn more about how to generate these distributions ourselves.

We can visualize the sampling distribution under the null as a histogram, just like any other sampling distribution:

A histogram visualizes the sampling distribution of a test statistic under the null hypothesis. The x-axis represents the values of the test statistic, while the y-axis shows the relative frequency of these values. The bars of the histogram are filled with a light salmon color, and there are 100 bins. A label *Null sampling distribution* is placed near the center of the plot.

Figure 5: A sampling distribution. The code used to generate it is available here.

Next, we compare our actual test statistic to its sampling distribution under the null hypothesis. For example, figure 6 shows that the null model for the Moderna Phase II vaccine trial would almost never generate such a low incidence of Covid infection among the vaccinated compared to the unvaccinated, as observed in the Phase 3 trial.

A histogram shows the null sampling distribution for Moderna data, representing the proportion of total COVID-19 cases observed in the vaccinated sample. The x-axis is labeled *Proportion of total covid cases observed in the vaccinated sample* and ranges from 0 to 1. The y-axis is labeled *prob* for probability. Most of the distribution is centered around 0.50, with the bars filled in light salmon color. An arrow labeled *Estimate from Phase III trial* points to a value near 0.056 on the x-axis, indicating a much lower proportion of cases in the vaccinated group compared to the null distribution.

Figure 6: Data from Moderna Press Release. The code to generate this figure is available here.

P-values

The observed test statistic shown in figure 7a is centered within the null sampling distribution. If this value arose from the null model, it wouldn’t stand out. By contrast, like in the Moderna trial, the null model rarely generates a test statistic as extreme as the one shown in Figure 7b.

Two histograms are shown side by side, each representing a null sampling distribution. Both histograms display a distribution of test statistics with light orange bars filling most of the area and a few darker blue bars indicating the test statistic's extreme values. In the left plot titled *Unremarkable test statistic,* a test statistic of 0.2 is highlighted. An arrow points to the x-axis where the test statistic lies. The areas beyond this value, representing the tail probabilities, are shaded darker blue, with small annotations labeling the areas under the left and right tails. The subtitle reads, *Test Stat = 0.2; P-value = 0.055,* indicating the total area in the tails and the corresponding p-value. In the right plot titled *Surprising test statistic,* a more extreme test statistic of 0.5 is highlighted with another arrow. The regions beyond this value, which are much smaller, are also shaded in blue. Annotations show the calculated areas for these extreme values. The subtitle reads, *Test Stat = 0.5; P-value = 0.01.*

Figure 7: A sampling distribution with an unremarkable test statistic (a) versus a highly unlikely one (b).

We use the P-value to quantify how surprising it would be to observe a test statistic as extreme (or more extreme) under the null model. To calculate the P-value, we sum (or integrate) the area under the curve from our observation outward to the tails of the distribution. Since we are equally surprised by extreme values on both the lower (left) and upper (right) tails, we typically sum the extremes on both sides.

In Figure 7a, we sum the areas as or more extreme than our observation on both the lower and upper tails, yielding a P-value of 0.185 + 0.185 = 0.37. This confirms what we can already see: the test statistic is unremarkable, and 37% of samples from the null would be as or more extreme.

Similarly, in Figure 7b, summing the areas as or more extreme than the observation on both tails gives a P-value of 0.012 + 0.013 = 0.025. This confirms that the test statistic is surprising, as only 2.5% of samples from the null would be as or more extreme.

Figure 8: OPTIONAL: A brief (approx 6 minutes) lecture on p-values and drawing conclusions

Interpretting results and drawing a conclusion

A P-value can make or break a scientist’s research.
P-values are often the measuring stick by which the gauge the significance of their work.

— From the video in this fivethirtyeight.com article.

	p-value > α Fail to Reject H₀	p-value ≤ α Reject H₀
True null hypothesis (H₀ is TRUE)	Fail to reject true null hypothesis Occurs with probability 1 - α This is the correct decision	Reject true null hypothesis Occurs with probability α (Type I Error)
False null hypothesis (H₀ is FALSE)	Fail to reject false null hypothesis Occurs with probability β (Type II Error)	Reject false null hypothesis Occurs with probability 1 - β (akaPower) This is the correct decision

So what do scientists do with a p-value? Why can it “make or break” your research?

By convention, if our p-value is smaller than $\alpha$ (a number which is traditionally set, somewhat arbitrarily, to 0.05), we reject the null hypothesis and say the result is statistically significant. What does this mean? It means that if we set $\alpha$ at the customary cut-off value of 0.05, we will get a false positive—that is, we will reject a true null hypothesis—in 5% of studies where the null is actually true.

By convention, if our p-value is greater than $\alpha$ , we say we fail to reject the null hypothesis. However, failing to reject the null does not mean the null is true. In fact, there are many times when we will fail to reject a false null hypothesis. The probability of these false negatives, denoted as $\beta$ , is not a value we choose; rather, it depends on both the size of our sample and the magnitude of the true effect (i.e., the difference between the null hypothesis and the true population parameter). Power, the probability of rejecting a false null hypothesis, equals $1-\beta$ . Often, in planning an experiment, we select a sample size large enough to ensure we have sufficient power to reject the null for an effect size that we’re interested in.

These customary rituals are taken quite seriously by some scientists. For certain audiences, the difference between a p-value of 0.051 and 0.049 is the difference between a non-significant and significant result—and potentially between publication or rejection. I, and many others (e.g., this article by Amrhein, Greenland, and McShane (2019)), think this is a bad custom, and not all scientists adhere to it. Nonetheless, this is the world you will navigate, so you should be aware of these customs.

The comic titled *p-value Interpretation* humorously depicts various p-value ranges and their subjective interpretations, showing how researchers often adjust their conclusions based on how close the values are to common significance thresholds. For p-values of 0.001, 0.01, 0.02, 0.03, 0.04, and 0.049, the interpretation is *highly significant* or *significant,* but when the p-value is exactly 0.050, there's a sense of uncertainty with a need to *redo calculations.* At 0.051 and 0.06, the result is on the *edge of significance.* When p-values range from 0.07 to 0.099, they are labeled as *highly suggestive* or significant at a more lenient level like p<0.10, while p-values greater than or equal to 0.1 are dismissed with an emphasis on subgroup analysis.

Figure 9: xkcd poking fun at null hypothesis significance testing. Rollover text: If all else fails use significance at the $lpha$ = 0.5 level and hope no one notices.

The Effect of Sample Size

When the null hypothesis is true, p-values will be uniformly distributed, and the false positive rate will be $\alpha$ regardless of sample size.

When the null hypothesis is false, we will observe more smaller p-values as the sample size increases, and the true positive rate will increase with the sample size.

This figure contains two panels, labeled *a* and *b.* **Panel *a*** displays the cumulative distribution of p-values across different sample sizes for two scenarios: when the null hypothesis is true and when the null hypothesis is false. Each curve is color-coded by sample size, with smaller p-values more common when the null hypothesis is false. A vertical line is drawn at p = 0.05. **Panel *b*** shows how the probability of rejecting the null hypothesis (y-axis) changes with increasing sample size (x-axis on a log2 scale). When the null hypothesis is true, the rejection probability stays at the alpha level of 0.05, while when the null hypothesis is false, the rejection probability increases as the sample size grows, illustrating increased statistical power.

Figure 10: a) When the null hypothesis is true, p-values are uniformly distributed regardless of sample size. b) When the null hypothesis is false, we observe more low p-values as sample size increases. Therefore, the probability of rejecting the null remains constant at alpha when the null is true, but increases as sample size grows when the null is false. In both a and b, the null hypothesis assumes a value of 0, while the true parameter is 0.3 standard deviations away from zero in the panel on the right. Code is available here

Problems with P-values and Their Interpretation

Why the Interpretation of P-values is Hard

A meme from The Simpsons showing Principal Skinner in two panels. In the first panel, labeled *P-VALUES,* Skinner thinks, *Maybe I don't make sense.* In the second panel, Skinner reassures himself, *No, it's all the scientists who don't understand me who are wrong.* This meme humorously represents the confusion and debate around p-value interpretation in statistics.

Figure 11: Why are p-values so confusing?

Students struggle to understand the meaning of a p-value, and I think it’s because:

None of the customs mentioned above make sense,
P-values don’t measure the thing we care about.
- They have nothing to do with the alternative hypothesis.
- They have even less to do with the motivating biological hypothesis.
Null hypothesis significance testing is not logically sound. We want to know something about what isn’t the null, but we never actually consider the alternative model. We simply assume that the alternative hypothesis is likely correct when the observations are unusual for the null model.

The common misinterpretations below, I believe, reflect people wishing that a p-value reported something useful or logical.

A P-VALUE IS NOT “the probability that the null hypothesis is true” NOR IS IT “the probability that what we observed is due to chance.” These are both incorrect because the p-value is simply the probability that we would observe our data, or something more extreme, assuming the null hypothesis is true. But I sympathize with these misinterpretations, because it would be great if that was what a p-value told us.

A P-VALUE DOES NOT say anything about the alternative hypothesis. A p-value simply describes how unusual it would be for the null model to generate such an extreme result. Again, I understand the desire to have the p-value tell us about the alternative hypothesis, because this is usually more interesting. Sadly, p-values can’t do that.

A P-VALUE DOES NOT measure the importance of a result. Again, such a measure would be great to have, but we don’t have that. The importance of a result depends on its effect size and its role in the biological problem we’re investigating.

What does this mean for us as scientists? It means that we have two challenging responsibilities:

We need to understand the process of null hypothesis testing and be able to participate in the associated customs and rituals.
At the same time, we need to responsibly interpret our statistics. Some important things to remember are:
- Rejecting $H_0$ does not mean $H_0$ is false.
- Failing to reject $H_0$ does not mean $H_0$ is true, nor does it mean there is no effect.
- We can reject or fail to reject $H_0$ for many reasons unrelated to our biological hypothesis. A good practice is to think about a plausible effect size of our biological hypothesis and see if the size of the reported effect is consistent with our biological model.

The Prosecutor’s Fallacy and Bayesian Approaches

Figure 12: The prosecutor’s fallacy as explained by Calling Bullshit. 11 min and 58 seconds. REQUIRED.

The video above (Fig 12) makes a clear point:

We calculate $P = P(\text{Data or more extreme}|\text{H_0})$ , but we really want to know $P = P(\text{H_0}|\text{Data or more extreme})$ .

We must always remind ourselves that with a p-value, we have $P = P(\text{Data or more extreme}|\text{H_0})$ .

Later in the term, we’ll see that Bayes’ theorem sets up a different way to do stats, one that answers questions like “what’s the probability of the null hypothesis given my data?” by flipping these conditional probabilities. However, for most of this class, we cover classic frequentist statistics, so we have to remember that we are not answering that question.

This is a good time to remember the other assignment and read the Statement on P-values from the American Statistical Association (Wasserstein and Lazar 2016), and the article The p-value statement, five years on(Matthews 2021).

Why Do We Still Use Null Hypothesis Significance Testing?

Q: Why do so many colleges and grad schools teach p = 0.05?
A: Because that’s still what the scientific community and journal editors use.
Q: Why do so many people still use p = 0.05?
A: Because that’s what they were taught in college or grad school.

— George Cobb, on the American Statistics Association forum

So, with all the issues with p-values and null hypothesis testing, why am I teaching it, and why do we still use it as a field?

The first answer is well summarized by George Cobb above. I teach this because this is how science is often done, and you should understand the culture of science and its rituals. When you read studies, you will see p-values, and when you write results, people will expect p-values. At the same time, you should recognize that this isn’t the only way to do statistics. For example, Bayesian stats is quite mainstream. We will return to Bayesian stats at the end of the term.

The second answer is that null hypothesis significance testing works well. Scientists have used this approach for decades and have made continual progress. So, although the theoretical underpinnings of null hypothesis significance testing are shaky, it’s practically quite useful. Unlike George Cobb, I believe we keep using p-values and p = 0.05 because it seems to work well enough. That said, I believe that the nuanced understanding I’ve tried to equip you with in this chapter helps us make even better use of p-values.

Hypothesis testing quiz

Figure 13: The accompanying hypothesis testing quiz link.

Hypothesis testing: Definitions

The Null Hypothesis: A skeptical explanation, made for the sake of argument, which suggests that data come from an uninteresting or “boring” population.

The Alternative Hypothesis: A claim that the data do not come from the “boring” population.

The Test Statistic: A single number that summarizes the data. We compare the observed test statistic to its sampling distribution under the null model.

P-value: The probability that a random sample drawn from the null model would be as extreme or more extreme than what is observed.

$\alpha$ : The probability that we reject a true null hypothesis. We can decide what this is, but by convention, $\alpha$ is usually set at 0.05.

False Positive: Rejecting the null when the null is true.

False Negative: Failing to reject the null when it is false.

Power: The probability of rejecting a false null hypothesis. We cannot directly set this — it depends on sample size and the size of the effect, but we can design experiments aiming for a certain level of power.

OPTIONAL: Alternatives to Null Hypothesis Significance Testing

Due to the issues surrounding p-values, such as the arbitrary distinction between “significant” and “non-significant” results, some have proposed alternative approaches to statistics.

These alternatives include banning p-values, replacing them with confidence intervals, and conducting Bayesian analyses, among others. I highly recommend the paper, Some Natural Solutions to the p-Value Communication Problem—and Why They Won’t Work (Gelman and Carlin 2017), for a fun take on these proposals.

Bayesian Statistics as an Alternative Approach

Here, I briefly introduce Bayesian statistics. Bayesian statistics aims to find the probability of a model given the data, using Bayes’ theorem. This is often the type of question we want to answer. However, a word of caution—frequentists believe there is no probability associated with the true parameter, as populations have fixed parameters. In contrast, Bayesians believe that parameters have probability distributions that reflect uncertainty or prior knowledge about them. This represents a fundamentally different way of thinking about the world.

$P(\text{Model}|\text{Data}) = \frac{P(\text{Data}|\text{Model}) \times P(\text{Model})}{P(\text{Data})}$

We can break this down with new terminology:

$P(\text{Model}|\text{Data})$ : the “posterior probability” — the probability of the model given the observed data.
$P(\text{Data}|\text{Model})$ : the “likelihood” — the probability of observing the data under the given model.
$P(\text{Model})$ : the “prior” — our prior belief about the probability of the model or parameter values before seeing the data. This can come from previous studies, expert knowledge, or assumptions.
$P(\text{Data})$ : the “evidence” — the overall probability of the data, which serves to normalize the result. It can be computationally intensive to calculate, but there are methods like Markov Chain Monte Carlo (MCMC) to approximate this value.

$\text{Posterior Probability} = \frac{\text{Likelihood}(\text{Model}|\text{Data}) \times \text{Prior}}{\text{Evidence}}$

Notably, Bayesian methods allow us to study “credible intervals” — regions with a 95% probability of containing the true population parameter, as opposed to “confidence intervals,” which in frequentist statistics only describe the frequency with which the interval will contain the true parameter in repeated samples.

References

Amrhein, V, S Greenland, and B McShane. 2019. “Scientists Rise up Against Statistical Significance.” Nature 567 (7748): 305.

Gelman, Andrew, and John Carlin. 2017. “Some Natural Solutions to the p-Value Communication Problem—and Why They Won’t Work.” Journal of the American Statistical Association 112 (519): 899–901. https://doi.org/10.1080/01621459.2017.1311263.

Matthews, Robert. 2021. “The p-Value Statement, Five Years On.” Significance 18 (2): 16–19. https://doi.org/https://doi.org/10.1111/1740-9713.01505.

Wasserstein, Ronald L., and Nicole A. Lazar. 2016. “The ASA Statement on p-Values: Context, Process, and Purpose.” The American Statistician 70 (2): 129–33. https://doi.org/10.1080/00031305.2016.1154108.