26.6 Natural Experiments

A natural experiment is an observational study in which an exogenous event, policy change, or external factor creates as-if random variation in treatment assignment across units. Unlike randomized controlled trials (RCTs)—where researchers actively manipulate treatment assignment—natural experiments leverage naturally occurring circumstances that approximate randomization.

In many fields, including economics, marketing, political science, and epidemiology, natural experiments provide an indispensable tool for causal inference, particularly when RCTs are impractical, unethical, or prohibitively expensive.

Key Characteristics of Natural Experiments

  1. Exogenous Shock: Treatment assignment is determined by an external event, policy, or regulation rather than by researchers.
  2. As-If Randomization: The event must create variation that is plausibly unrelated to unobserved confounders, mimicking an RCT.
  3. Comparability of Treatment and Control Groups: The study design should ensure that treated and untreated units are comparable except for their exposure to the intervention.

Examples of Natural Experiments in Economics and Marketing

  1. Minimum Wage Policy and Employment

A classic example comes from Card and Krueger's (1993) study of the minimum wage. When New Jersey increased its minimum wage while neighboring Pennsylvania did not, the policy change created a natural experiment. By comparing fast-food employment trends in the two states, the study estimated the causal effect of the minimum wage increase on employment.
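
The comparison is a difference-in-differences design, which takes only a few lines of R. The data below are simulated purely for illustration; the variable names and effect sizes are invented, not taken from Card and Krueger:

library(fixest)

# Simulated employment data: NJ (treated) raised its minimum wage, PA did not
set.seed(1)
df <- data.frame(
  treated = rep(c(1, 0), each = 200),             # 1 = New Jersey, 0 = Pennsylvania
  post    = rep(rep(0:1, each = 100), times = 2)  # 0 = before, 1 = after the increase
)
df$employment <- 20 + 2 * df$post + 1.5 * df$treated +
  0.8 * df$treated * df$post + rnorm(400)         # true DiD effect = 0.8

# The coefficient on treated:post is the difference-in-differences estimate
feols(employment ~ treated * post, data = df)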

  2. Advertising Bans and Consumer Behavior

Suppose a country bans advertising for a particular product, such as tobacco or alcohol, while a similar neighboring country does not. This policy creates a natural experiment: researchers can compare sales trends before and after the ban in both countries to estimate the causal impact of advertising restrictions on consumer demand.

  3. The Facebook Outage as a Natural Experiment

In October 2021, Facebook experienced a global outage that made its advertising platform temporarily unavailable. For businesses that relied on Facebook ads, the outage was an exogenous shock to their digital marketing strategy. Researchers could compare advertisers’ sales and website traffic before, during, and after the outage to assess the impact of social media advertising.

  4. Lottery-Based Admission to Schools

In many cities, students apply to competitive schools via lottery-based admissions. Since admission is randomly assigned, this creates a natural experiment for studying the causal effect of elite schooling on future earnings, college attendance, or academic performance.
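
Because the lottery randomizes admission offers, a simple comparison of lottery winners and losers identifies the effect of receiving an offer (the intent-to-treat effect). A minimal sketch with simulated data, where every name and magnitude is invented:

library(fixest)

# Simulated lottery: winning an admission offer is randomly assigned
set.seed(2)
n <- 1000
lottery <- data.frame(won = rbinom(n, 1, 0.5))
lottery$earnings <- 50 + 3 * lottery$won + rnorm(n, sd = 10)  # true ITT effect = 3

# Randomization makes this simple comparison causal for the offer effect
feols(earnings ~ won, data = lottery)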

These examples illustrate how natural experiments are leveraged to estimate causal effects when randomization is infeasible.


Why Are Natural Experiments Important?

Natural experiments are powerful tools for identifying causal relationships because credible as-if randomization mitigates selection bias, a major issue in observational studies. However, they also present challenges:

  • Treatment Assignment is Not Always Perfectly Random: Unlike RCTs, natural experiments rely on assumptions about the as-if randomness of treatment.
  • Potential for Confounding: Even if treatment appears random, hidden factors might still bias results.
  • Repeated Use of the Same Natural Experiment: When researchers analyze the same natural experiment multiple times, it increases the risk of false discoveries due to multiple hypothesis testing.

Understanding and addressing these statistical challenges, especially the risk of false positives, is therefore crucial.


26.6.1 The Problem of Reusing Natural Experiments

Recent simulations demonstrate that when the number of estimated outcomes far exceeds the number of true effects ($N_{\text{Outcome}} \gg N_{\text{True effect}}$), the proportion of false positive findings can exceed 50% (Heath et al. 2023, 2331). This problem arises due to:

  • Data Snooping: If multiple hypotheses are tested on the same dataset, the probability of finding at least one statistically significant result purely by chance increases.
  • Researcher Degrees of Freedom: The flexibility in defining outcomes, selecting models, and specifying robustness checks can lead to p-hacking and publication bias.
  • Dependence Across Tests: Many estimated outcomes are correlated, meaning traditional multiple testing corrections may not adequately control for Type I errors.

This problem is exacerbated when:

  • Studies use the same policy change, regulatory event, or shock across different settings.

  • Multiple subgroups and model specifications are tested without proper corrections.

  • P-values are interpreted without adjusting for multiple testing bias.

26.6.2 Statistical Challenges in Reusing Natural Experiments

When the same natural experiment is analyzed in multiple studies, or even within a single study across many different outcomes, the probability of obtaining spurious significant results increases. Key statistical challenges include:

  1. Family-Wise Error Rate (FWER) Inflation

Each additional hypothesis tested increases the probability of at least one false rejection of the null hypothesis (Type I error). If we test $m$ independent hypotheses, each at the nominal significance level $\alpha$, the probability of making at least one Type I error is:

$$P(\text{at least one false positive}) = 1 - (1 - \alpha)^m.$$

For example, with $\alpha = 0.05$ and $m = 20$ tests:

$$P(\text{at least one false positive}) = 1 - (0.95)^{20} \approx 0.64.$$

This means that even if all 20 null hypotheses are true, there is roughly a 64% chance of falsely rejecting at least one.
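
The arithmetic is easy to verify directly in R:

# Probability of at least one false positive across m independent tests
alpha <- 0.05
m <- 20
1 - (1 - alpha)^m
#> [1] 0.6415141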

  2. False Discovery Rate (FDR) and Dependent Tests

FWER corrections such as Bonferroni are conservative and may be too stringent when outcomes are correlated. In cases where researchers test multiple related hypotheses, False Discovery Rate (FDR) control provides an alternative by limiting the expected proportion of false discoveries among rejected hypotheses.

  3. Multiple Testing in Sequential Experiments

In many longitudinal or rolling studies, results are reported over time as more data become available. This chronological testing introduces additional biases:

  • Repeated interim analyses increase the probability of stopping early on a false positive (illustrated in the simulation below).
  • Outcomes tested at different times require corrections that adjust for sequential dependence.
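
The inflation from repeated looks is easy to demonstrate by simulation. The sketch below is purely illustrative (the sample size and look schedule are invented): it tests a true null hypothesis every 20 observations and records whether any interim look rejects at the 5% level.

# Simulate repeated interim testing of a true null (mean = 0)
set.seed(3)
one_run <- function(n_max = 200, look_every = 20) {
  x <- rnorm(n_max)
  looks <- seq(look_every, n_max, by = look_every)
  any(sapply(looks, function(n) t.test(x[1:n])$p.value < 0.05))
}

# Share of runs with at least one rejection: well above the nominal 5%
mean(replicate(2000, one_run()))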

To address these issues, researchers must apply multiple testing corrections.


26.6.3 Solutions: Multiple Testing Corrections

To mitigate the risks of false positives in natural experiment research, various statistical corrections can be applied.

  1. Family-Wise Error Rate (FWER) Control

The most conservative approach controls the probability of at least one false positive:

  • Bonferroni Correction:

    $$p_i^{\text{adj}} = m \cdot p_i,$$

    where $p_i^{\text{adj}}$ is the adjusted p-value, $p_i$ the raw p-value, and $m$ the total number of hypotheses tested.

  • Holm-Bonferroni Method (Holm 1979): Less conservative than Bonferroni, adjusting significance thresholds in a stepwise fashion.

  • Šidák Correction (Šidák 1967): Accounts for multiple comparisons under the assumption that tests are independent.

  • Romano-Wolf Stepwise Correction (Romano and Wolf 2005, 2016): Recommended for natural experiments as it controls for dependence across tests.

  • Hochberg’s Sharper FWER Control (Hochberg 1988): A step-up procedure, more powerful than Holm’s method.
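
Several of these FWER corrections are available in base R through stats::p.adjust, for example:

# Raw p-values from five hypothetical tests
p <- c(0.001, 0.012, 0.034, 0.040, 0.210)

p.adjust(p, method = "bonferroni")
#> [1] 0.005 0.060 0.170 0.200 1.000
p.adjust(p, method = "holm")
#> [1] 0.005 0.048 0.102 0.102 0.210
p.adjust(p, method = "hochberg")
#> [1] 0.005 0.048 0.080 0.080 0.210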

  2. False Discovery Rate (FDR) Control

FDR-controlling methods are less conservative than FWER-based approaches, allowing some false positives while controlling their expected proportion among rejected hypotheses (Benjamini and Hochberg 1995).
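
The Benjamini-Hochberg procedure is also available through stats::p.adjust:

# Benjamini-Hochberg adjusted p-values (FDR control)
p <- c(0.001, 0.012, 0.034, 0.040, 0.210)
p.adjust(p, method = "BH")
#> [1] 0.005 0.030 0.050 0.050 0.210
# Under arbitrary dependence, method = "BY" (Benjamini and Yekutieli 2001)
# yields a more conservative adjustment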

  3. Sequential Approaches to Multiple Testing

Two major frameworks exist for applying multiple testing corrections over time:

  1. Chronological Sequencing
  • Outcomes are ordered by the date they were first reported.

  • Multiple testing corrections are applied sequentially, progressively tightening the significance threshold as more outcomes accumulate (a sketch follows this list).

  2. Best Foot Forward Policy
  • Outcomes are ranked from most to least likely to be rejected based on experimental data.

  • Frequently used in clinical trials where primary outcomes are given priority.

  • New outcomes are added only if linked to primary treatment effects.

  3. Alternatively, refer to the rules of thumb from Table AI (Heath et al. 2023, 2356).

These approaches ensure that the p-value correction is consistent with the temporal structure of natural experiments.
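
As a concrete illustration of chronological sequencing, the sketch below (a simplified version of the idea, not Heath et al.'s exact procedure) judges the j-th outcome reported against a cumulative Bonferroni threshold of α/j:

# Hypothetical outcomes from one natural experiment, ordered by report date
report_year <- c(2010, 2014, 2018, 2021)
p_values    <- c(0.004, 0.020, 0.015, 0.060)  # invented raw p-values

# The j-th reported outcome faces the tighter threshold alpha / j
alpha <- 0.05
j <- seq_along(p_values)
data.frame(
  year        = report_year,
  p           = p_values,
  threshold   = alpha / j,
  significant = p_values < alpha / j
)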


  4. Romano-Wolf Correction

The Romano-Wolf correction is highly recommended for handling multiple testing in natural experiments because it accounts for dependence across tests. In R, the wildrwolf package implements it for fixest models:

# Install required packages
# install.packages("fixest")
# install.packages("wildrwolf")

library(fixest)
library(wildrwolf)

# Load example data
data(iris)

# Fit multiple regression models
fit1 <- feols(Sepal.Width ~ Sepal.Length, data = iris)
fit2 <- feols(Petal.Length ~ Sepal.Length, data = iris)
fit3 <- feols(Petal.Width ~ Sepal.Length, data = iris)

# Apply Romano-Wolf stepwise correction
res <- rwolf(
  models = list(fit1, fit2, fit3), 
  param = "Sepal.Length",  
  B = 500
)

res
#>   model   Estimate Std. Error   t value     Pr(>|t|) RW Pr(>|t|)
#> 1     1 -0.0618848 0.04296699 -1.440287    0.1518983 0.141716567
#> 2     2   1.858433 0.08585565  21.64602 1.038667e-47 0.001996008
#> 3     3  0.7529176 0.04353017  17.29645 2.325498e-37 0.001996008

The RW Pr(>|t|) column reports the Romano-Wolf adjusted p-values, which account for dependence across the three tests.

  5. General Multiple Testing Adjustments

For other multiple testing adjustments, use the multtest package:

# Install package if necessary
# BiocManager::install("multtest")

library(multtest)

# Define multiple correction procedures
procs <-
    c("Bonferroni",
      "Holm",
      "Hochberg",
      "SidakSS",
      "SidakSD",
      "BH",
      "BY",
      "ABH",
      "TSBH")

# Generate random p-values for demonstration (no seed is set, so results vary across runs)
p_values <- runif(10)

# Apply multiple testing corrections
adj_pvals <- mt.rawp2adjp(p_values, procs)

# Print results in a readable format
adj_pvals |> causalverse::nice_tab()
#>    adjp.rawp adjp.Bonferroni adjp.Holm adjp.Hochberg adjp.SidakSS adjp.SidakSD
#> 1       0.12               1         1          0.75         0.72         0.72
#> 2       0.22               1         1          0.75         0.92         0.89
#> 3       0.24               1         1          0.75         0.94         0.89
#> 4       0.29               1         1          0.75         0.97         0.91
#> 5       0.36               1         1          0.75         0.99         0.93
#> 6       0.38               1         1          0.75         0.99         0.93
#> 7       0.44               1         1          0.75         1.00         0.93
#> 8       0.59               1         1          0.75         1.00         0.93
#> 9       0.65               1         1          0.75         1.00         0.93
#> 10      0.75               1         1          0.75         1.00         0.93
#>    adjp.BH adjp.BY adjp.ABH adjp.TSBH_0.05 index h0.ABH h0.TSBH
#> 1     0.63       1     0.63           0.63     2     10      10
#> 2     0.63       1     0.63           0.63     6     10      10
#> 3     0.63       1     0.63           0.63     8     10      10
#> 4     0.63       1     0.63           0.63     3     10      10
#> 5     0.63       1     0.63           0.63    10     10      10
#> 6     0.63       1     0.63           0.63     1     10      10
#> 7     0.63       1     0.63           0.63     7     10      10
#> 8     0.72       1     0.72           0.72     9     10      10
#> 9     0.72       1     0.72           0.72     5     10      10
#> 10    0.75       1     0.75           0.75     4     10      10

# The full matrix of adjusted p-values is also available as adj_pvals$adjp

References

Benjamini, Yoav, and Yosef Hochberg. 1995. “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” Journal of the Royal Statistical Society: Series B (Methodological) 57 (1): 289–300.
———. 2000. “On the Adaptive Control of the False Discovery Rate in Multiple Testing with Independent Statistics.” Journal of Educational and Behavioral Statistics 25 (1): 60–83.
Benjamini, Yoav, Abba M Krieger, and Daniel Yekutieli. 2006. “Adaptive Linear Step-up Procedures That Control the False Discovery Rate.” Biometrika 93 (3): 491–507.
Benjamini, Yoav, and Daniel Yekutieli. 2001. “The Control of the False Discovery Rate in Multiple Testing Under Dependency.” The Annals of Statistics 29 (4): 1165–88.
Card, David, and Alan B Krueger. 1993. “Minimum Wages and Employment: A Case Study of the Fast Food Industry in New Jersey and Pennsylvania.” Working Paper 4509. Cambridge, MA: National Bureau of Economic Research.
Heath, Davidson, Matthew C Ringgenberg, Mehrdad Samadi, and Ingrid M Werner. 2023. “Reusing Natural Experiments.” The Journal of Finance 78 (4): 2329–64.
Hochberg, Yosef. 1988. “A Sharper Bonferroni Procedure for Multiple Tests of Significance.” Biometrika 75 (4): 800–802.
Holm, Sture. 1979. “A Simple Sequentially Rejective Multiple Test Procedure.” Scandinavian Journal of Statistics 6 (2): 65–70.
Romano, Joseph P, and Michael Wolf. 2005. “Stepwise Multiple Testing as Formalized Data Snooping.” Econometrica 73 (4): 1237–82.
———. 2016. “Efficient Computation of Adjusted p-Values for Resampling-Based Stepdown Multiple Testing.” Statistics & Probability Letters 113: 38–40.
Šidák, Zbyněk. 1967. “Rectangular Confidence Regions for the Means of Multivariate Normal Distributions.” Journal of the American Statistical Association 62 (318): 626–33.