30.3 Simple Difference-in-Differences
Difference-in-Differences (DID) originated as a tool to analyze natural experiments, but its applications extend far beyond that. DID is built on the fixed effects estimator, making it a fundamental approach for policy evaluation and causal inference in observational studies.
DID leverages inter-temporal variation between groups:
- Cross-sectional comparison: Helps avoid omitted variable bias due to common trends.
- Time-series comparison: Helps mitigate omitted variable bias due to cross-sectional heterogeneity.
30.3.1 Basic Setup of DID
Consider a simple setting with:
- Treatment Group (Di=1)
- Control Group (Di=0)
- Pre-Treatment Period (T=0)
- Post-Treatment Period (T=1)
Group | After Treatment (T=1) | Before Treatment (T=0) |
---|---|---|
Treated ($D_i=1$) | $E[Y_{1i}(1) \mid D_i=1]$ | $E[Y_{0i}(0) \mid D_i=1]$ |
Control ($D_i=0$) | $E[Y_{0i}(1) \mid D_i=0]$ | $E[Y_{0i}(0) \mid D_i=0]$ |
The fundamental challenge: we cannot observe $E[Y_{0i}(1) \mid D_i=1]$, i.e., the counterfactual post-period outcome for the treated group had they not received treatment.
DID estimates the Average Treatment Effect on the Treated using the following formula:
$$
E[Y_1(1) - Y_0(1) \mid D=1] = \{E[Y(1) \mid D=1] - E[Y(1) \mid D=0]\} - \{E[Y(0) \mid D=1] - E[Y(0) \mid D=0]\}
$$
This formulation differences out time-invariant unobserved factors and common time shocks. It identifies the treatment effect only if the parallel trends assumption holds.
- For the treated group, we isolate the difference between being treated and not being treated.
- If, absent treatment, the treated group would have followed a different trajectory from the control group (a violation of parallel trends), the DID estimate may be biased.
- Since no unit in the control group is ever treated, we cannot infer the treatment effect for that group.
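To see where parallel trends enters, the identity can be derived in a few steps. The sketch below uses the same potential-outcome notation as the table. The parallel trends condition is written in its standard form, which the text invokes but does not spell out, so treat the first line as a stated assumption.

$$
\begin{aligned}
\text{Parallel trends:}\quad & E[Y_0(1) - Y_0(0) \mid D=1] = E[Y_0(1) - Y_0(0) \mid D=0] \\
\text{ATT} &= E[Y_1(1) \mid D=1] - E[Y_0(1) \mid D=1] \\
&= E[Y_1(1) \mid D=1] - \{E[Y_0(0) \mid D=1] + E[Y_0(1) - Y_0(0) \mid D=0]\} \\
&= \{E[Y(1) \mid D=1] - E[Y(0) \mid D=1]\} - \{E[Y(1) \mid D=0] - E[Y(0) \mid D=0]\}
\end{aligned}
$$

The last line replaces potential outcomes with observed ones ($Y(1) = Y_1(1)$ for the treated after treatment, $Y(t) = Y_0(t)$ otherwise). Rearranging its four terms gives the difference of cross-sectional gaps shown in the formula.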
```r
# Load required libraries
library(dplyr)
library(ggplot2)
set.seed(1)

# Simulated dataset for illustration
data <- data.frame(
    time = rep(c(0, 1), each = 50),     # Pre (0) and Post (1)
    treated = rep(c(0, 1), times = 50), # Control (0) and Treated (1)
    error = rnorm(100)
)

# Generate outcome variable
data$outcome <- 5 + 3 * data$treated + 2 * data$time + 4 * data$treated * data$time + data$error

# Compute averages for 2x2 table
table_means <- data %>%
    group_by(treated, time) %>%
    summarize(mean_outcome = mean(outcome), .groups = "drop") %>%
    mutate(
        group = paste0(ifelse(treated == 1, "Treated", "Control"), ", ",
                       ifelse(time == 1, "Post", "Pre"))
    )

# Display the 2x2 table
table_2x2 <- table_means %>%
    select(group, mean_outcome) %>%
    tidyr::spread(key = group, value = mean_outcome)

print("2x2 Table of Mean Outcomes:")
#> [1] "2x2 Table of Mean Outcomes:"
print(table_2x2)
#> # A tibble: 1 × 4
#> `Control, Post` `Control, Pre` `Treated, Post` `Treated, Pre`
#> <dbl> <dbl> <dbl> <dbl>
#> 1 7.19 5.20 14.0 8.00

# Calculate Diff-in-Diff manually
Y11 <- table_means$mean_outcome[table_means$group == "Treated, Post"] # Treated, Post
Y10 <- table_means$mean_outcome[table_means$group == "Treated, Pre"]  # Treated, Pre
Y01 <- table_means$mean_outcome[table_means$group == "Control, Post"] # Control, Post
Y00 <- table_means$mean_outcome[table_means$group == "Control, Pre"]  # Control, Pre
diff_in_diff_formula <- (Y11 - Y10) - (Y01 - Y00)

# Estimate DID using OLS
model <- lm(outcome ~ treated * time, data = data)
ols_estimate <- coef(model)["treated:time"]

# Print results
results <- data.frame(
    Method = c("Diff-in-Diff Formula", "OLS Estimate"),
    Estimate = c(diff_in_diff_formula, ols_estimate)
)
print("Comparison of DID Estimates:")
#> [1] "Comparison of DID Estimates:"
print(results)
#> Method Estimate
#> Diff-in-Diff Formula 4.035895
#> treated:time OLS Estimate 4.035895

# Visualization
ggplot(data, aes(x = as.factor(time), y = outcome,
                 color = as.factor(treated), group = treated)) +
    stat_summary(fun = mean, geom = "point", size = 3) +
    stat_summary(fun = mean, geom = "line", linetype = "dashed") +
    labs(
        title = "Difference-in-Differences Visualization",
        x = "Time (0 = Pre, 1 = Post)",
        y = "Outcome",
        color = "Group"
    ) +
    scale_color_manual(labels = c("Control", "Treated"), values = c("blue", "red")) +
    theme_minimal()
```
Period | Control (0) | Treated (1) |
---|---|---|
Pre (0) | $\bar{Y}_{00} \approx 5$ | $\bar{Y}_{10} \approx 8$ |
Post (1) | $\bar{Y}_{01} \approx 7$ | $\bar{Y}_{11} \approx 14$ |
The table organizes the mean outcomes into four cells:
- Control Group, Pre-period ($\bar{Y}_{00}$): Mean outcome for the control group before the intervention.
- Control Group, Post-period ($\bar{Y}_{01}$): Mean outcome for the control group after the intervention.
- Treated Group, Pre-period ($\bar{Y}_{10}$): Mean outcome for the treated group before the intervention.
- Treated Group, Post-period ($\bar{Y}_{11}$): Mean outcome for the treated group after the intervention.
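The DID treatment effect calculated from the simple formula of averages is identical to the estimate from an OLS regression with an interaction term. There are two routes to the same number:
- Compute manually from the cell means:

$$
\text{DID} = (\bar{Y}_{11} - \bar{Y}_{10}) - (\bar{Y}_{01} - \bar{Y}_{00})
$$

- Use OLS regression, where $\beta_3$ is the DID estimate:

$$
Y_{it} = \beta_0 + \beta_1 \text{treated}_i + \beta_2 \text{time}_t + \beta_3 (\text{treated}_i \cdot \text{time}_t) + \epsilon_{it}
$$

Using the rounded means from the simulated table:

$$
\text{DID} = (14 - 8) - (7 - 5) = 6 - 2 = 4
$$

This matches the true interaction coefficient used to simulate the data ($\beta_3 = 4$). With the unrounded means, the manual formula and the OLS regression both give 4.035895, so the two methods agree exactly.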
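The agreement is not a coincidence. In a regression with `treated`, `time`, and their interaction, four coefficients fit four cell means exactly, and the interaction is whatever remains after the baseline gap and the control group's time change are accounted for. A minimal sketch in base R, assuming a hypothetical four-row data set (`cells`) that holds only the rounded cell means from the table:

```r
# Hypothetical data: one row per cell, using the rounded means from the 2x2 table
cells <- expand.grid(treated = c(0, 1), time = c(0, 1))
cells$outcome <- c(5, 8, 7, 14)  # row order: (0,0), (1,0), (0,1), (1,1)

# Saturated model: four coefficients reproduce the four cell means exactly
fit <- lm(outcome ~ treated * time, data = cells)
b <- coef(fit)

# DID computed directly from the cell means
cell_mean <- function(d, t) cells$outcome[cells$treated == d & cells$time == t]
did_manual <- (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))

# Compare with the interaction coefficient
all.equal(did_manual, b[["treated:time"]])
```

Here the intercept equals the control pre-period mean, `treated` is the pre-period gap between groups, `time` is the control group's change, and `treated:time` is the DID.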
30.3.2 Extensions of DID
30.3.2.1 DID with More Than Two Groups or Time Periods
DID can be extended to multiple treatments, multiple controls, and more than two periods:
$$
Y_{igt} = \alpha_g + \gamma_t + \beta I_{gt} + \delta X_{igt} + \epsilon_{igt}
$$

where:
- $\alpha_g$ = Group-Specific Fixed Effects (e.g., firm, region).
- $\gamma_t$ = Time-Specific Fixed Effects (e.g., year, quarter).
- $\beta$ = DID Effect.
- $I_{gt}$ = Interaction Terms (Treatment × Post-Treatment).
- $\delta X_{igt}$ = Additional Covariates.
This is known as the Two-Way Fixed Effects (TWFE) DID model. However, TWFE can give biased estimates under staggered treatment adoption, where different groups receive treatment at different times, particularly when treatment effects differ across groups or change over time.
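As an illustration of the TWFE specification in the benign case, the sketch below simulates a panel in which every treated unit starts treatment in the same period and the effect is constant. All names (`panel`, `treat_post`, the unit and period counts, the true effect of 2) are assumptions made for the example. Fixed effects are entered as dummy variables in base R `lm()`.

```r
set.seed(42)

# Hypothetical balanced panel: 40 units observed over 6 periods
panel <- expand.grid(unit = 1:40, period = 1:6)
panel$treated_group <- as.integer(panel$unit <= 20)   # units 1-20 are treated
panel$post <- as.integer(panel$period >= 4)           # treatment starts in period 4 for all
panel$treat_post <- panel$treated_group * panel$post  # the Treatment x Post indicator

# Outcome = unit effect + period effect + assumed true effect of 2 + noise
unit_fe <- rnorm(40)
period_fe <- seq(0, 2.5, by = 0.5)
panel$y <- unit_fe[panel$unit] + period_fe[panel$period] +
  2 * panel$treat_post + rnorm(nrow(panel), sd = 0.2)

# TWFE: unit and period fixed effects entered as dummies
twfe <- lm(y ~ treat_post + factor(unit) + factor(period), data = panel)
beta_hat <- coef(twfe)[["treat_post"]]
beta_hat
```

With simultaneous adoption and a homogeneous effect, `beta_hat` should land close to the simulated value of 2. The sketch does not cover the staggered case, where this specification can mislead.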
30.3.2.2 Examining Long-Term Effects (Dynamic DID)
To examine dynamic treatment effects in a design without rollout or staggered adoption (all treated units start treatment in the same period), we can create a centered time variable.
Centered Time Variable | Interpretation |
---|---|
t=−2 | Two periods before treatment |
t=−1 | One period before treatment |
t=0 | Last pre-treatment period right before treatment period (Baseline/Reference Group) |
t=1 | Treatment period |
t=2 | One period after treatment |
Dynamic Treatment Model Specification
By interacting this factor variable, we can examine the dynamic effect of treatment (i.e., whether it’s fading or intensifying):
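$$
Y = \alpha_0 + \alpha_1 \text{Group} + \alpha_2 \text{Time} + \beta_{-T_1} \text{Treatment} + \beta_{-(T_1-1)} \text{Treatment} + \cdots + \beta_{-1} \text{Treatment} + \beta_1 \text{Treatment} + \cdots + \beta_{T_2} \text{Treatment}
$$

where:
- Each Treatment term is the treatment indicator interacted with the dummy for the corresponding centered period.
- $\beta_0$ (Baseline Period) is the reference group (i.e., it is dropped from the model).
- $T_1$ = number of pre-treatment periods.
- $T_2$ = number of post-treatment periods.
- Treatment coefficients ($\beta_t$) measure the effect in each period relative to the baseline.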
Key Observations:
- Pre-treatment coefficients should be close to zero ($\beta_{-T_1}, \dots, \beta_{-1} \approx 0$), which is evidence against pre-trend bias.
- Post-treatment coefficients are expected to differ significantly from zero ($\beta_1, \dots, \beta_{T_2} \neq 0$) if the treatment has an effect; they trace the treatment effect over time.
- Higher standard errors (SEs) with more interactions: Including too many leads and lags can reduce precision.
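The specification can be sketched on simulated data. Everything below is hypothetical: the data frame `es_data`, the choice of period 3 as the last pre-treatment period, and the assumed true effects (zero before treatment, growing afterwards). The centered time variable is turned into a factor with $t = 0$ as the reference level, so its interaction drops out of the model.

```r
set.seed(7)

# Hypothetical panel: 60 units, 6 periods, treatment begins in period 4
es_data <- expand.grid(unit = 1:60, period = 1:6)
es_data$treated <- as.integer(es_data$unit <= 30)
last_pre_period <- 3
es_data$center_time <- es_data$period - last_pre_period  # runs from -2 to 3

# Assumed true effects by centered period
true_effect <- c("-2" = 0, "-1" = 0, "0" = 0, "1" = 1, "2" = 1.5, "3" = 2)
effect <- unname(true_effect[as.character(es_data$center_time)])
es_data$y <- 1 + 0.5 * es_data$treated + 0.3 * es_data$period +
  es_data$treated * effect + rnorm(nrow(es_data), sd = 0.1)

# Make t = 0 the reference level so its interaction is omitted
es_data$center_time_f <- relevel(factor(es_data$center_time), ref = "0")

# Group effect, period effects, and treated-by-period interactions
es <- lm(y ~ treated * center_time_f, data = es_data)
dyn <- coef(es)[grep("^treated:", names(coef(es)))]
round(dyn, 2)
```

The two pre-treatment interactions should sit near zero and the three post-treatment interactions should rise with time, mirroring the assumed effects.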
30.3.3 Goals of DID
- Pre-Treatment Coefficients Should Be Insignificant
  - Ensure that $\beta_{-T_1}, \dots, \beta_{-1} = 0$ (similar to a Placebo Test).
- Post-Treatment Coefficients Should Be Significant
  - Verify that $\beta_1, \dots, \beta_{T_2} \neq 0$.
  - Examine whether the trend in post-treatment coefficients is increasing or decreasing over time.
```r
library(tidyverse)
library(fixest)

od <- causaldata::organ_donations %>%
    # Treatment variable
    dplyr::mutate(California = State == 'California') %>%
    # Centered time variable
    dplyr::mutate(center_time = as.factor(Quarter_Num - 3))
# Quarter 3 is the reference period: the last period before treatment

class(od$California)
#> [1] "logical"
class(od$State)
#> [1] "character"

cali <- feols(Rate ~ i(center_time, California, ref = 0) |
                  State + center_time,
              data = od)

etable(cali)
#> cali
#> Dependent Var.: Rate
#>
#> California x center_time = -2 -0.0029 (0.0051)
#> California x center_time = -1 0.0063** (0.0023)
#> California x center_time = 1 -0.0216*** (0.0050)
#> California x center_time = 2 -0.0203*** (0.0045)
#> California x center_time = 3 -0.0222* (0.0100)
#> Fixed-Effects: -------------------
#> State Yes
#> center_time Yes
#> _____________________________ ___________________
#> S.E.: Clustered by: State
#> Observations 162
#> R2 0.97934
#> Within R2 0.00979
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Plot the interaction coefficients, joining the point estimates
iplot(cali, pt.join = TRUE)
```
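The first goal, insignificant pre-treatment coefficients, can also be checked jointly instead of one coefficient at a time. The sketch below is one possible way to do this in base R on simulated data, not the method used in the example above. The data frame `pt_data` and its variables are hypothetical, and the F-test uses classical standard errors, not the state-clustered ones reported by `feols`.

```r
set.seed(11)

# Hypothetical panel with a constant post-treatment effect and no pre-trend
pt_data <- expand.grid(unit = 1:60, period = 1:6)
pt_data$treated <- as.integer(pt_data$unit <= 30)
pt_data$center_time <- pt_data$period - 3

# Treated-by-period indicators (t = 0 is the omitted baseline)
pt_data$pre2 <- pt_data$treated * (pt_data$center_time == -2)
pt_data$pre1 <- pt_data$treated * (pt_data$center_time == -1)
pt_data$post <- pt_data$treated * (pt_data$center_time > 0)

pt_data$y <- 1 + 0.5 * pt_data$treated + 0.3 * pt_data$period +
  2 * pt_data$post + rnorm(nrow(pt_data), sd = 0.2)

# Restricted model forces both pre-treatment coefficients to zero
restricted <- lm(y ~ treated + factor(period) + post, data = pt_data)
unrestricted <- update(restricted, . ~ . + pre2 + pre1)

# Joint F-test of H0: pre2 = pre1 = 0
pretrend_test <- anova(restricted, unrestricted)
p_value <- pretrend_test[["Pr(>F)"]][2]
p_value
```

A large p-value means the data do not reject the hypothesis that all pre-treatment coefficients are zero. A failure to reject is not proof that parallel trends holds.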