Are your observations unexpected?
Motivating Scenarios:
We want to understand the standard by which scientists take results seriously versus chalking them up to sampling error.
We are visiting a doctor, or reading a newspaper or a scientific paper, and are told the results are “statistically significant” or have a p-value of XXX, and we want to know what that means.
We have done a statistical analysis and want to guard against the possibility that our seemingly exceptional results are merely attributable to sampling error.
There is one MAJOR GOAL here: You should understand a p-value and its limitations. But to break this down:
Learning goals: By the end of this chapter, you should be able to:
In our introduction to statistics and our section on summarizing data, we introduced estimation as a central goal of statistics—summarizing what we observe to estimate population parameters from a sample.
We also discussed, in our introduction, sampling, and uncertainty sections, that all estimates are subject to sampling error. The goal of null hypothesis significance testing (NHST) is to determine whether our observations can be reasonably explained by sampling error. Let’s work through an example:
Imagine we conducted an experiment where 15,000 people received the Moderna Covid vaccine, and another 15,000 received a placebo. This experimental design is intended to:
Here are the estimates from the data:
So, did the vaccine work? There are certainly fewer Covid cases in the vaccine group.
However, these are estimates, not parameters. We didn’t study entire populations but rather drew samples from them. As a result, our estimates reflect both sampling variability and potential true differences between populations. Before rolling out a vaccination campaign, we need to determine whether these results can be explained by something other than a real effect.
What leads samples to deviate from a population? In principle, in addition to a real effect, sampling bias, non-independent sampling, and sampling error could all lead to a deviation between estimates and true population parameters. Our goal in null hypothesis significance testing is to see whether the differences between our observations and our expectations can be easily explained by sampling error alone.
In null hypothesis significance testing, we aim to determine how easily our results can be explained by a “null model.” To do this, we follow four key steps:
Steps 1-3 are relatively straightforward.
⚠️ However, Step 4 is quite strange and represents one of the more challenging concepts in statistics. Part of the difficulty lies in the fact that what we traditionally do in the field doesn’t entirely make sense ⚠️.
Scientific hypotheses are exciting.
We examine biological hypotheses through statistical hypotheses.
While we can’t perfectly align biological and statistical hypotheses, a good study makes its best effort!
Unfortunately, as scientists, we’re usually trying to evaluate support for an exciting scientific hypothesis. But in frequentist statistics, we do this in a somewhat backwards way—by testing the plausibility of a boring statistical hypothesis, known as the null hypothesis.
[Embedded TikTok video from @canoodleson: “Hypotheses and hypothesis testing in stats.”]
So, what is the null hypothesis? The Null Hypothesis, also called \(H_0\), skeptically argues that the data come from a boring population described by the null model.
The null model is extremely specific. It is meant to represent the process of sampling as accurately as possible, and claims that this process can explain any interesting observations in our data. For the case above, the null hypothesis for the effect of the vaccine on Covid cases is:
\(H_0\): The frequency of Covid cases does not differ between vaccinated and unvaccinated populations.
The alternative hypothesis, \(H_A\), is much vaguer than the null. While the null hypothesis claims that nothing is happening, the alternative hypothesis claims that something is happening, but it doesn’t specify exactly what. For the case above, the alternative hypothesis for the effect of the vaccine is:
\(H_A\): The frequency of Covid cases differs between vaccinated and unvaccinated populations.
One-tailed and Two-tailed Tests
Notice that the alternative hypothesis above is that the frequency of Covid cases differs between the vaccinated and unvaccinated groups. What if we only cared to know whether the vaccine provided greater immunity than no vaccine? In theory, we could look at only one side of the null distribution. However, in practice, one-tailed tests are almost always a bad idea. For example, we would definitely want to know if the vaccine somehow made us more susceptible to Covid. For that reason, we generally avoid one-tailed tests.
Rare cases when a one-tailed test is appropriate occur when both extremes of the outcome are on the same side of the null distribution. For instance, if I were studying the absolute value of something, the null hypothesis would be that it’s zero, and the alternative would be that it’s greater than zero. We’ll see that some test statistics, like the \(F\) statistic and (often) the \(\chi^2\) statistic, only have one relevant tail.

So how do we evaluate whether our data could easily be produced by the null model? The first step is to summarize the data with a test statistic. For now, let’s summarize our results as the number of cases in the vaccine group divided by the total number of cases in the study. For the Moderna study on Covid cases, this equals \(\frac{11}{11+185} = \frac{11}{196} \approx 0.0561\).
A test statistic can be just about anything that makes sense for the data, but commonly used test statistics, such as \(F\), \(\chi^2\), \(Z\), and \(t\), are popular because their behavior is well understood.
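To make this concrete, here is a minimal Python sketch of that arithmetic (the counts are the ones from the Moderna example above; the variable names are my own):

```python
# Counts from the Moderna example above.
cases_vaccine = 11    # Covid cases among the ~15,000 vaccinated participants
cases_placebo = 185   # Covid cases among the ~15,000 placebo participants

# Test statistic: the proportion of all cases that occurred in the vaccine group.
test_stat = cases_vaccine / (cases_vaccine + cases_placebo)
print(round(test_stat, 4))  # 0.0561
```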
Because the null model is specific, we can generate the expected distribution of the test statistic by creating a sampling distribution from the null model. For now, I will provide you with sampling distributions of test statistics under the null. Later, we’ll learn more about how to generate these distributions ourselves.
We can visualize the sampling distribution under the null as a histogram, just like any other sampling distribution:
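To give a flavor of how such a null sampling distribution can be built, here is a rough Python sketch. It assumes a deliberately simple null model: because the two groups are the same size, if the vaccine did nothing, each of the 196 cases would be (essentially) equally likely to land in either group, so under the null the number of cases in the vaccine group is approximately Binomial(196, 0.5). This is an illustration of the idea, not the exact procedure behind the figures.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

n_cases = 11 + 185   # total Covid cases in the trial
n_reps = 100_000     # number of samples to draw from the null model

# Under the null, each case is equally likely to be in the (equally sized)
# vaccine or placebo group, so the count of cases in the vaccine group is
# roughly Binomial(n_cases, 0.5).
null_cases_vaccine = rng.binomial(n=n_cases, p=0.5, size=n_reps)

# Express each simulated sample as the same test statistic we computed above:
# the proportion of all cases that fall in the vaccine group.
null_test_stats = null_cases_vaccine / n_cases

# A histogram of null_test_stats is the sampling distribution under the null.
print(null_test_stats.mean())  # close to 0.5, as the null model predicts
```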
Next, we compare our actual test statistic to its sampling distribution under the null hypothesis. For example, Figure 6 shows that the null model would almost never generate as low an incidence of Covid infection among the vaccinated, relative to the unvaccinated, as was observed in Moderna’s Phase 3 trial.
The observed test statistic shown in Figure 7a is centered within the null sampling distribution. If this value arose from the null model, it wouldn’t stand out. By contrast, as in the Moderna trial, the null model rarely generates a test statistic as extreme as the one shown in Figure 7b.
We use the P-value to quantify how surprising it would be to observe a test statistic as extreme (or more extreme) under the null model. To calculate the P-value, we sum (or integrate) the area under the curve from our observation outward to the tails of the distribution. Since we are equally surprised by extreme values on both the lower (left) and upper (right) tails, we typically sum the extremes on both sides.
In Figure 7a, we sum the areas as or more extreme than our observation on both the lower and upper tails, yielding a P-value of 0.185 + 0.185 = 0.37. This confirms what we can already see: the test statistic is unremarkable, and 37% of samples from the null would be as or more extreme.
Similarly, in Figure 7b, summing the areas as or more extreme than the observation on both tails gives a P-value of 0.012 + 0.013 = 0.025. This confirms that the test statistic is surprising, as only 2.5% of samples from the null would be as or more extreme.
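Continuing the simulation sketch from above (again, an illustration of the logic, not the exact calculation behind Figures 7a and 7b), a two-tailed p-value is just the proportion of samples from the null model that land at least as far from the center of the null distribution as our observed test statistic:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Rebuild the simulated null distribution from the sketch above.
n_cases = 196
null_test_stats = rng.binomial(n=n_cases, p=0.5, size=100_000) / n_cases

observed = 11 / 196   # observed test statistic from the Moderna example
null_center = 0.5     # center of the null sampling distribution

# Two-tailed p-value: the fraction of null samples at least as far from the
# center as the observation, counting both the lower and upper tails.
p_value = np.mean(np.abs(null_test_stats - null_center)
                  >= np.abs(observed - null_center))
print(p_value)  # essentially zero -- the null almost never produces data this extreme
```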
A P-value can make or break a scientist’s research.
P-values are often the measuring stick by which scientists gauge the significance of their work.
|  | p-value > α: Fail to reject H₀ | p-value ≤ α: Reject H₀ |
|---|---|---|
| True null hypothesis (H₀ is TRUE) | Fail to reject a true null hypothesis. Occurs with probability 1 - α. This is the correct decision. | Reject a true null hypothesis. Occurs with probability α (Type I error). |
| False null hypothesis (H₀ is FALSE) | Fail to reject a false null hypothesis. Occurs with probability β (Type II error). | Reject a false null hypothesis. Occurs with probability 1 - β (aka power). This is the correct decision. |
So what do scientists do with a p-value? Why can it “make or break” your research?
By convention, if our p-value is smaller than \(\alpha\) (a number which is traditionally set, somewhat arbitrarily, to 0.05), we reject the null hypothesis and say the result is statistically significant. What does this mean? It means that if we set \(\alpha\) at the customary cut-off value of 0.05, we will get a false positive—that is, we will reject a true null hypothesis—in 5% of studies where the null is actually true.
By convention, if our p-value is greater than \(\alpha\), we say we fail to reject the null hypothesis. However, failing to reject the null does not mean the null is true. In fact, there are many times when we will fail to reject a false null hypothesis. The probability of these false negatives, denoted as \(\beta\), is not a value we choose; rather, it depends on both the size of our sample and the magnitude of the true effect (i.e., the difference between the null hypothesis and the true population parameter). Power, the probability of rejecting a false null hypothesis, equals \(1-\beta\). Often, in planning an experiment, we select a sample size large enough to ensure we have sufficient power to reject the null for an effect size that we’re interested in.
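To make power less abstract, here is a toy Python simulation of my own (not an example from the chapter): we repeatedly simulate two-group experiments in which the null really is false, test each one with a standard two-sample t-test, and record how often we reject at \(\alpha = 0.05\). The rejection rate estimates the power, and it climbs with sample size.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)

def estimated_power(n_per_group, true_diff, sd=1.0, alpha=0.05, n_reps=2000):
    """Estimate power by simulating experiments where the null is false."""
    rejections = 0
    for _ in range(n_reps):
        a = rng.normal(loc=0.0, scale=sd, size=n_per_group)
        b = rng.normal(loc=true_diff, scale=sd, size=n_per_group)  # real effect
        if stats.ttest_ind(a, b).pvalue <= alpha:
            rejections += 1
    return rejections / n_reps

# Power grows with sample size for a fixed true effect (here, 0.5 sd).
for n in (10, 50, 200):
    print(n, estimated_power(n_per_group=n, true_diff=0.5))
```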
These customary rituals are taken quite seriously by some scientists. For certain audiences, the difference between a p-value of 0.051 and 0.049 is the difference between a non-significant and significant result—and potentially between publication or rejection. I, and many others (e.g., this article by Amrhein, Greenland, and McShane (2019)), think this is a bad custom, and not all scientists adhere to it. Nonetheless, this is the world you will navigate, so you should be aware of these customs.
When the null hypothesis is true, p-values will be uniformly distributed, and the false positive rate will be \(\alpha\) regardless of sample size.
When the null hypothesis is false, small p-values become more common as the sample size increases, and the true positive rate will increase with the sample size.
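A quick simulation of my own illustrates the first claim (the second claim is essentially the power sketch above): when the null hypothesis is true, the p-values from repeated experiments are roughly uniform between 0 and 1, so the proportion below \(\alpha = 0.05\) stays near 5% no matter how large the sample is.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=7)

def false_positive_rate(n_per_group, alpha=0.05, n_reps=5000):
    """Simulate experiments where the null is TRUE and count rejections."""
    pvals = np.empty(n_reps)
    for i in range(n_reps):
        a = rng.normal(size=n_per_group)  # both groups come from the same
        b = rng.normal(size=n_per_group)  # population, so the null is true
        pvals[i] = stats.ttest_ind(a, b).pvalue
    return np.mean(pvals <= alpha)

# The false positive rate hovers around alpha regardless of sample size.
for n in (10, 100, 1000):
    print(n, false_positive_rate(n))
```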
Students struggle to understand the meaning of a p-value, and I think it’s because:
The common misinterpretations below, I believe, reflect people wishing that a p-value reported something useful or logical.
A P-VALUE IS NOT “the probability that the null hypothesis is true” NOR IS IT “the probability that what we observed is due to chance.” These are both incorrect because the p-value is simply the probability that we would observe our data, or something more extreme, assuming the null hypothesis is true. But I sympathize with these misinterpretations, because it would be great if that was what a p-value told us.
A P-VALUE DOES NOT say anything about the alternative hypothesis. A p-value simply describes how unusual it would be for the null model to generate such an extreme result. Again, I understand the desire to have the p-value tell us about the alternative hypothesis, because this is usually more interesting. Sadly, p-values can’t do that.
A P-VALUE DOES NOT measure the importance of a result. Again, such a measure would be great to have, but we don’t have that. The importance of a result depends on its effect size and its role in the biological problem we’re investigating.
What does this mean for us as scientists? It means that we have two challenging responsibilities:
The video above (Fig 12) makes a clear point:
We calculate \(P = P(\text{Data or more extreme} \mid H_0)\), but we really want to know \(P(H_0 \mid \text{Data or more extreme})\).
We must always remind ourselves that with a p-value, we have \(P = P(\text{Data or more extreme} \mid H_0)\).
Later in the term, we’ll see that Bayes’ theorem sets up a different way to do stats, one that answers questions like “what’s the probability of the null hypothesis given my data?” by flipping these conditional probabilities. However, for most of this class, we cover classic frequentist statistics, so we have to remember that we are not answering that question.
Q: Why do so many colleges and grad schools teach p = 0.05?
A: Because that’s still what the scientific community and journal editors use.
Q: Why do so many people still use p = 0.05?
A: Because that’s what they were taught in college or grad school.
So, with all the issues with p-values and null hypothesis testing, why am I teaching it, and why do we still use it as a field?
The first answer is well summarized by George Cobb above. I teach this because this is how science is often done, and you should understand the culture of science and its rituals. When you read studies, you will see p-values, and when you write results, people will expect p-values. At the same time, you should recognize that this isn’t the only way to do statistics. For example, Bayesian stats is quite mainstream. We will return to Bayesian stats at the end of the term.
The second answer is that null hypothesis significance testing works well. Scientists have used this approach for decades and have made continual progress. So, although the theoretical underpinnings of null hypothesis significance testing are shaky, it’s practically quite useful. Unlike George Cobb, I believe we keep using p-values and p = 0.05 because it seems to work well enough. That said, I believe that the nuanced understanding I’ve tried to equip you with in this chapter helps us make even better use of p-values.
The Null Hypothesis: A skeptical explanation, made for the sake of argument, which suggests that data come from an uninteresting or “boring” population.
The Alternative Hypothesis: A claim that the data do not come from the “boring” population.
The Test Statistic: A single number that summarizes the data. We compare the observed test statistic to its sampling distribution under the null model.
P-value: The probability that a random sample drawn from the null model would be as extreme or more extreme than what is observed.
\(\alpha\): The probability that we reject a true null hypothesis. We can decide what this is, but by convention, \(\alpha\) is usually set at 0.05.
False Positive: Rejecting the null when the null is true.
False Negative: Failing to reject the null when it is false.
Power: The probability of rejecting a false null hypothesis. We cannot directly set this; it depends on both sample size and the size of the effect, but we can design experiments aiming for a certain level of power.

Due to the issues surrounding p-values, such as the arbitrary distinction between “significant” and “non-significant” results, some have proposed alternative approaches to statistics.
These alternatives include banning p-values, replacing them with confidence intervals, and conducting Bayesian analyses, among others. I highly recommend the paper, Some Natural Solutions to the p-Value Communication Problem—and Why They Won’t Work (Gelman and Carlin 2017), for a fun take on these proposals.
Here, I briefly introduce Bayesian statistics. Bayesian statistics aims to find the probability of a model given the data, using Bayes’ theorem. This is often the type of question we want to answer. However, a word of caution—frequentists believe there is no probability associated with the true parameter, as populations have fixed parameters. In contrast, Bayesians believe that parameters have probability distributions that reflect uncertainty or prior knowledge about them. This represents a fundamentally different way of thinking about the world.
\[P(\text{Model}|\text{Data}) = \frac{P(\text{Data}|\text{Model}) \times P(\text{Model})}{P(\text{Data})}\]
We can break this down with new terminology:
\[\text{Posterior Probability} = \frac{\text{Likelihood}(\text{Model}|\text{Data}) \times \text{Prior}}{\text{Evidence}}\]
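As a tiny worked example of this machinery (the numbers are made up purely for illustration), suppose a disease has a prevalence of 1% (the prior), a test detects it 95% of the time when it is present, and falsely signals it 5% of the time when it is absent. Bayes’ theorem gives the posterior probability of disease given a positive test:

```python
# Prior: P(Model) -- probability of disease before seeing the test result.
prior = 0.01

# Likelihoods: P(Data | Model) -- probability of a positive test under each model.
p_pos_given_disease = 0.95
p_pos_given_healthy = 0.05

# Evidence: P(Data) -- the total probability of a positive test.
evidence = p_pos_given_disease * prior + p_pos_given_healthy * (1 - prior)

# Posterior: P(Model | Data), by Bayes' theorem.
posterior = p_pos_given_disease * prior / evidence
print(round(posterior, 3))  # about 0.161 -- far from certain, despite the positive test
```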
Notably, Bayesian methods allow us to study “credible intervals” — regions with a 95% probability of containing the true population parameter, as opposed to “confidence intervals,” which in frequentist statistics only describe the frequency with which the interval will contain the true parameter in repeated samples.
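For a flavor of how a credible interval might be computed, here is a hedged Python sketch (my own toy analysis, not one from the chapter): with a flat Beta(1, 1) prior on the proportion of cases in the vaccine group and the 11-out-of-196 counts from above, the posterior is Beta(12, 186), and a central 95% credible interval comes directly from that distribution’s quantiles.

```python
from scipy import stats

# Flat Beta(1, 1) prior on the proportion of cases in the vaccine group,
# updated with 11 "successes" out of 196 total cases (counts from above).
posterior = stats.beta(a=1 + 11, b=1 + 185)

# Central 95% credible interval: given this model and prior, there is a 95%
# posterior probability that the true proportion lies in this interval.
lower, upper = posterior.ppf(0.025), posterior.ppf(0.975)
print(round(lower, 3), round(upper, 3))  # roughly 0.03 to 0.10
```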