26.6 Natural Experiments
A natural experiment is an observational study in which an exogenous event, policy change, or external factor creates as-if random variation in treatment assignment across units. Unlike randomized controlled trials (RCTs)—where researchers actively manipulate treatment assignment—natural experiments leverage naturally occurring circumstances that approximate randomization.
In many fields, including economics, marketing, political science, and epidemiology, natural experiments provide an indispensable tool for causal inference, particularly when RCTs are impractical, unethical, or prohibitively expensive.
Key Characteristics of Natural Experiments
- Exogenous Shock: Treatment assignment is determined by an external event, policy, or regulation rather than by researchers.
- As-If Randomization: The event must create variation that is plausibly unrelated to unobserved confounders, mimicking an RCT.
- Comparability of Treatment and Control Groups: The study design should ensure that treated and untreated units are comparable except for their exposure to the intervention.
Examples of Natural Experiments in Economics and Marketing
- Minimum Wage Policy and Employment
A classic example comes from Card and Krueger's (1993) study of the minimum wage. When New Jersey raised its minimum wage while neighboring Pennsylvania did not, the policy change created a natural experiment. By comparing fast-food employment trends in the two states, the study estimated the causal effect of the minimum wage increase on employment.
- Advertising Bans and Consumer Behavior
Suppose a country bans advertising for a particular product, such as tobacco or alcohol, while a similar neighboring country does not. This policy creates a natural experiment: researchers can compare sales trends before and after the ban in both countries to estimate the causal impact of advertising restrictions on consumer demand.
- The Facebook Outage as a Natural Experiment
In October 2021, Facebook experienced a global outage, making its advertising platform temporarily unavailable. For businesses that relied on Facebook ads, this outage created an exogenous shock in digital marketing strategies. Researchers could compare advertisers’ sales and website traffic before, during, and after the outage to assess the impact of social media advertising.
- Lottery-Based Admission to Schools
In many cities, students apply to competitive schools via lottery-based admissions. Since admission is randomly assigned, this creates a natural experiment for studying the causal effect of elite schooling on future earnings, college attendance, or academic performance.
These examples illustrate how natural experiments are leveraged to estimate causal effects when randomization is infeasible.
Why Are Natural Experiments Important?
Natural experiments are powerful tools for identifying causal relationships because they often eliminate selection bias—a major issue in observational studies. However, they also present challenges:
- Treatment Assignment is Not Always Perfectly Random: Unlike RCTs, natural experiments rely on assumptions about the as-if randomness of treatment.
- Potential for Confounding: Even if treatment appears random, hidden factors might still bias results.
- Repeated Use of the Same Natural Experiment: When researchers analyze the same natural experiment many times, the risk of false discoveries due to multiple hypothesis testing increases.
These statistical challenges, especially the risks of false positives, are crucial to understand and address.
26.6.1 The Problem of Reusing Natural Experiments
Recent simulations demonstrate that when the number of estimated outcomes far exceeds the number of true effects ($N_{\text{Outcome}} \gg N_{\text{True effect}}$), the proportion of false positive findings can exceed 50% (Heath et al. 2023, 2331). This problem arises due to:
- Data Snooping: If multiple hypotheses are tested on the same dataset, the probability of finding at least one statistically significant result purely by chance increases.
- Researcher Degrees of Freedom: The flexibility in defining outcomes, selecting models, and specifying robustness checks can lead to p-hacking and publication bias.
- Dependence Across Tests: Many estimated outcomes are correlated, meaning traditional multiple testing corrections may not adequately control for Type I errors.
This problem is exacerbated when:
- Studies use the same policy change, regulatory event, or shock across different settings.
- Multiple subgroups and model specifications are tested without proper corrections.
- P-values are interpreted without adjusting for multiple testing bias.
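To make the scale of the problem concrete, the stylized simulation below regresses many outcomes on the same binary "treatment" when only a handful of outcomes have a real effect. The sample size, effect size, and outcome counts are arbitrary choices for illustration, not values from Heath et al. (2023); the exact share varies with the seed, but false positives can easily account for more than half of the "significant" findings.
# Stylized simulation: one treatment, many outcomes, few true effects
set.seed(123)
n      <- 200                      # units
treat  <- rbinom(n, 1, 0.5)        # as-if random treatment assignment
n_true <- 5                        # outcomes with a genuine effect
n_null <- 95                       # outcomes that are pure noise

p_true <- replicate(n_true, {
  y <- 0.3 * treat + rnorm(n)      # modest true effect
  summary(lm(y ~ treat))$coefficients["treat", 4]
})
p_null <- replicate(n_null, {
  y <- rnorm(n)                    # no effect at all
  summary(lm(y ~ treat))$coefficients["treat", 4]
})

sig_true <- sum(p_true < 0.05)
sig_null <- sum(p_null < 0.05)
# Share of "significant" results that are false positives
sig_null / (sig_true + sig_null)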
26.6.2 Statistical Challenges in Reusing Natural Experiments
When the same natural experiment is analyzed in multiple studies, or even within a single study across many different outcomes, the probability of obtaining spurious significant results increases. Key statistical challenges include:
- Family-Wise Error Rate (FWER) Inflation
Each additional hypothesis tested increases the probability of at least one false rejection of the null hypothesis (Type I error). If we test $m$ independent hypotheses at the nominal significance level $\alpha$, the probability of making at least one Type I error is:
$$P(\text{at least one false positive}) = 1 - (1 - \alpha)^m.$$
For example, with $\alpha = 0.05$ and $m = 20$ tests:
$$P(\text{at least one false positive}) = 1 - (0.95)^{20} \approx 0.64.$$
This means that even if all 20 null hypotheses are true, there is roughly a 64% chance of falsely rejecting at least one of them.
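As a quick check, this probability can be computed directly in R; the values of alpha and m below simply reproduce the example above.
# Probability of at least one false positive across m independent tests
alpha <- 0.05
m     <- 20
1 - (1 - alpha)^m   # approximately 0.64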
- False Discovery Rate (FDR) and Dependent Tests
FWER corrections such as Bonferroni are conservative and may be too stringent when outcomes are correlated. In cases where researchers test multiple related hypotheses, False Discovery Rate (FDR) control provides an alternative by limiting the expected proportion of false discoveries among rejected hypotheses.
- Multiple Testing in Sequential Experiments
In many longitudinal or rolling studies, results are reported over time as more data becomes available. Chronological testing introduces additional biases:
- Repeated interim analyses increase the probability of stopping early on false positives.
- Outcomes tested at different times require corrections that adjust for sequential dependence.
To address these issues, researchers must apply multiple testing corrections.
26.6.3 Solutions: Multiple Testing Corrections
To mitigate the risks of false positives in natural experiment research, various statistical corrections can be applied.
- Family-Wise Error Rate (FWER) Control
The most conservative approach controls the probability of at least one false positive:
- Bonferroni Correction: each raw p-value is scaled by the number of tests,
$$p_i^* = \min(1, \; m \cdot p_i),$$
where $p_i^*$ is the adjusted p-value and $m$ is the total number of hypotheses tested.
- Holm-Bonferroni Method (Holm 1979): Less conservative than Bonferroni, adjusting significance thresholds in a stepwise fashion.
- Šidák Correction (Šidák 1967): Accounts for multiple comparisons assuming independent tests.
- Romano-Wolf Stepwise Correction (Romano and Wolf 2005, 2016): Recommended for natural experiments because it accounts for dependence across tests.
- Hochberg's Sharper FWER Control (Hochberg 1988): A step-up procedure that is more powerful than Holm's method.
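The Bonferroni, Holm, and Hochberg adjustments are available in base R through p.adjust(); the p-values below are simulated purely for illustration (Šidák-type and Romano-Wolf corrections are handled by the dedicated packages shown later in this section).
# Simulated raw p-values, one per hypothesis tested on the same natural experiment
set.seed(1)
p_raw <- sort(runif(5, 0, 0.10))

p.adjust(p_raw, method = "bonferroni")  # most conservative
p.adjust(p_raw, method = "holm")        # stepwise; uniformly more powerful than Bonferroni
p.adjust(p_raw, method = "hochberg")    # step-up; valid under independence or positive dependence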
- False Discovery Rate (FDR) Control
FDR-controlling methods are less conservative than FWER-based approaches, allowing for some false positives while controlling their expected proportion.
- Benjamini-Hochberg (BH) Procedure (Benjamini and Hochberg 1995): Limits the expected proportion of false discoveries among rejected hypotheses.
- Adaptive Benjamini-Hochberg (Benjamini and Hochberg 2000): Adjusts for situations where the true proportion of null hypotheses is unknown.
- Benjamini-Yekutieli (BY) Correction (Benjamini and Yekutieli 2001): Accounts for arbitrary dependence between tests.
- Two-Stage Benjamini-Hochberg (Benjamini, Krieger, and Yekutieli 2006): A more powerful version of BH, particularly useful in large-scale studies.
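For comparison, the BH and BY procedures are also available through p.adjust(); this small example only contrasts the two (the adaptive and two-stage variants require the multtest package shown at the end of this section).
# A few small p-values among mostly noise
set.seed(2)
p_raw <- c(runif(3, 0, 0.01), runif(7))

p.adjust(p_raw, method = "BH")  # Benjamini-Hochberg: controls FDR under independence or positive dependence
p.adjust(p_raw, method = "BY")  # Benjamini-Yekutieli: valid under arbitrary dependence, but more conservative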
- Sequential Approaches to Multiple Testing
Two major frameworks exist for applying multiple testing corrections over time:
- Chronological Sequencing
  - Outcomes are ordered by the date they were first reported.
  - Multiple testing corrections are applied sequentially, progressively increasing the statistical significance threshold over time.
- Best Foot Forward Policy
  - Outcomes are ranked from most to least likely to be rejected based on experimental data.
  - Frequently used in clinical trials, where primary outcomes are given priority.
  - New outcomes are added only if linked to primary treatment effects.
- Alternatively, refer to the rules of thumb from Table AI (Heath et al. 2023, 2356).
These approaches ensure that the p-value correction is consistent with the temporal structure of natural experiments.
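The chronological-sequencing idea can be sketched as follows. This is only an illustrative implementation under simplified assumptions (outcomes ordered by report date, with a Holm correction re-applied to the cumulative set each time a new outcome appears), not the exact procedure in Heath et al. (2023); the dates and p-values are made up.
# Hypothetical outcomes studied on the same natural experiment over time
outcomes <- data.frame(
  outcome     = paste0("Y", 1:6),
  report_date = as.Date(c("2015-01-01", "2016-06-01", "2017-03-01",
                          "2018-09-01", "2019-05-01", "2020-11-01")),
  p_raw       = c(0.001, 0.030, 0.004, 0.049, 0.020, 0.012)
)

# Re-adjust the cumulative set of p-values each time a new outcome is reported
outcomes <- outcomes[order(outcomes$report_date), ]
outcomes$p_seq <- sapply(seq_len(nrow(outcomes)), function(k) {
  p.adjust(outcomes$p_raw[1:k], method = "holm")[k]
})

# Later outcomes face a larger cumulative family of tests,
# so their adjusted p-values rise relative to the raw ones.
outcomes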
- Romano-Wolf Correction
The Romano-Wolf correction is highly recommended for handling multiple testing in natural experiments:
# Install required packages
# install.packages("fixest")
# install.packages("wildrwolf")
library(fixest)
library(wildrwolf)
# Load example data
data(iris)
# Fit multiple regression models
fit1 <- feols(Sepal.Width ~ Sepal.Length, data = iris)
fit2 <- feols(Petal.Length ~ Sepal.Length, data = iris)
fit3 <- feols(Petal.Width ~ Sepal.Length, data = iris)
# Apply Romano-Wolf stepwise correction
res <- rwolf(
models = list(fit1, fit2, fit3),
param = "Sepal.Length",
B = 500
)
res
#> model Estimate Std. Error t value Pr(>|t|) RW Pr(>|t|)
#> 1 1 -0.0618848 0.04296699 -1.440287 0.1518983 0.141716567
#> 2 2 1.858433 0.08585565 21.64602 1.038667e-47 0.001996008
#> 3 3 0.7529176 0.04353017 17.29645 2.325498e-37 0.001996008
- General Multiple Testing Adjustments
For other multiple testing adjustments, use the multtest package:
# Install package if necessary
# BiocManager::install("multtest")
library(multtest)
# Define multiple correction procedures
procs <- c("Bonferroni", "Holm", "Hochberg", "SidakSS", "SidakSD",
           "BH", "BY", "ABH", "TSBH")
# Generate random p-values for demonstration
p_values <- runif(10)
# Apply multiple testing corrections
adj_pvals <- mt.rawp2adjp(p_values, procs)
# Print results in a readable format
adj_pvals |> causalverse::nice_tab()
#> adjp.rawp adjp.Bonferroni adjp.Holm adjp.Hochberg adjp.SidakSS adjp.SidakSD
#> 1 0.12 1 1 0.75 0.72 0.72
#> 2 0.22 1 1 0.75 0.92 0.89
#> 3 0.24 1 1 0.75 0.94 0.89
#> 4 0.29 1 1 0.75 0.97 0.91
#> 5 0.36 1 1 0.75 0.99 0.93
#> 6 0.38 1 1 0.75 0.99 0.93
#> 7 0.44 1 1 0.75 1.00 0.93
#> 8 0.59 1 1 0.75 1.00 0.93
#> 9 0.65 1 1 0.75 1.00 0.93
#> 10 0.75 1 1 0.75 1.00 0.93
#> adjp.BH adjp.BY adjp.ABH adjp.TSBH_0.05 index h0.ABH h0.TSBH
#> 1 0.63 1 0.63 0.63 2 10 10
#> 2 0.63 1 0.63 0.63 6 10 10
#> 3 0.63 1 0.63 0.63 8 10 10
#> 4 0.63 1 0.63 0.63 3 10 10
#> 5 0.63 1 0.63 0.63 10 10 10
#> 6 0.63 1 0.63 0.63 1 10 10
#> 7 0.63 1 0.63 0.63 7 10 10
#> 8 0.72 1 0.72 0.72 9 10 10
#> 9 0.72 1 0.72 0.72 5 10 10
#> 10 0.75 1 0.75 0.75 4 10 10
# adj_pvals$adjp