30.6 Two-Way Fixed Effects
A generalization of the Difference-in-Differences model is the two-way fixed effects (TWFE) model, which accounts for multiple groups and multiple time periods by including both unit and time fixed effects. In practice, TWFE is frequently used to estimate causal effects in panel data settings. However, it is not a design-based, non-parametric causal estimator (Imai and Kim 2021), and it can suffer from severe biases if the treatment effect is heterogeneous across units or time.
When applying TWFE to datasets with multiple treatment groups and staggered treatment timing, the estimated causal coefficient is a weighted average of all possible two-group, two-period DiD comparisons. Crucially, some of these weights can be negative (Goodman-Bacon 2021), which leads to potential biases. The weighting scheme depends on:
- Group sizes
- Variation in treatment timing
- Whether treatment occurs near the middle of the panel (units treated mid-panel tend to get the highest weight)
30.6.1 Canonical TWFE Model
The canonical TWFE model is typically written as:
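$$
Y_{it} = \alpha_i + \lambda_t + \tau W_{it} + \beta X_{it} + \epsilon_{it},
$$
where:
- $Y_{it}$ = Outcome for unit $i$ at time $t$
- $\alpha_i$ = Unit fixed effect
- $\lambda_t$ = Time fixed effect
- $\tau$ = Causal effect of treatment
- $W_{it}$ = Treatment indicator (1 if treated, 0 otherwise)
- $X_{it}$ = Covariates
- $\epsilon_{it}$ = Error term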
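As a minimal sketch of this specification, the simulation below fits the canonical model with `lm()` and checks that, in a balanced panel, the coefficient on the treatment indicator matches the one obtained after two-way demeaning the outcome and the treatment. The panel dimensions, adoption dates, and the effect size of 2 are illustrative assumptions, not values taken from any dataset in this chapter.

```r
set.seed(123)

# Hypothetical balanced panel: 30 units observed over 10 periods
n_units <- 30
n_periods <- 10
panel <- expand.grid(id = 1:n_units, t = 1:n_periods)

# Assumed adoption dates: ids 1-10 adopt at t = 4, ids 11-20 at t = 8,
# ids 21-30 are never treated (Inf)
first_treat <- c(rep(4, 10), rep(8, 10), rep(Inf, 10))
panel$w <- as.numeric(panel$t >= first_treat[panel$id])

# Data-generating process with a constant (homogeneous) effect tau = 2
alpha <- rnorm(n_units)     # unit fixed effects
lambda <- rnorm(n_periods)  # time fixed effects
panel$y <- alpha[panel$id] + lambda[panel$t] + 2 * panel$w +
  rnorm(nrow(panel), sd = 0.1)

# (1) TWFE via unit and time dummies
fit_twfe <- lm(y ~ w + factor(id) + factor(t), data = panel)
tau_dummy <- coef(fit_twfe)[["w"]]

# (2) Same coefficient via two-way demeaning (valid for balanced panels)
demean2 <- function(x, id, t) x - ave(x, id) - ave(x, t) + mean(x)
y_dm <- demean2(panel$y, panel$id, panel$t)
w_dm <- demean2(panel$w, panel$id, panel$t)
tau_within <- sum(w_dm * y_dm) / sum(w_dm^2)

c(dummy = tau_dummy, within = tau_within)
```

Because the simulated effect is homogeneous, both numbers should be close to 2 even though adoption is staggered. The problems discussed in this section arise once that homogeneity fails.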
An illustrative TWFE event-study model (Stevenson and Wolfers 2006):
$$
Y_{it} = \sum_k \beta_k \cdot \text{Treatment}^k_{it} + \eta_i + \lambda_t + \text{Controls}_{it} + \epsilon_{it},
$$
where:
- $\text{Treatment}^k_{it}$: Indicator for whether unit $i$ is in its $k$-th year relative to treatment at time $t$.
- $\eta_i$: Unit fixed effects, controlling for time-invariant unobserved heterogeneity.
- $\lambda_t$: Time fixed effects, capturing overall macro shocks.
Standard Errors: Typically clustered at the group or cohort level.
Usually, researchers drop the period immediately before treatment ($k = -1$) to avoid collinearity. However, dropping this or another period inappropriately can shift or bias the estimates.
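The event-study specification can be sketched in base R as follows. This is a noise-free toy panel with a single treated cohort and a never-treated group; the adoption date (t = 5), the effect path, and all variable names are assumptions chosen for illustration. The period $k = -1$ is the omitted reference category, and standard errors (which would normally be clustered) are not addressed here.

```r
# Hypothetical noise-free panel: ids 1-3 adopt at t = 5, ids 4-6 never adopt
panel <- expand.grid(id = 1:6, t = 1:8)
panel$treated <- panel$id <= 3

# Relative time k = t - 5 for treated units; never-treated units are placed
# in the reference category (k = -1), so they only help identify the FEs
panel$k <- ifelse(panel$treated, panel$t - 5, -1)

# Assumed dynamic effect: k + 1 in relative period k >= 0, zero before
effect <- ifelse(panel$treated & panel$k >= 0, panel$k + 1, 0)
panel$y <- panel$id + 0.5 * panel$t + effect

# Relative-time dummies with k = -1 as the omitted reference period
panel$rel_time <- relevel(factor(panel$k), ref = "-1")
fit_es <- lm(y ~ rel_time + factor(id) + factor(t), data = panel)

# Event-study coefficients beta_k (leads first, then lags)
beta <- coef(fit_es)[grep("^rel_time", names(coef(fit_es)))]
round(beta, 3)
```

In this constructed example the lead coefficients ($k = -4, -3, -2$) are zero and the lag coefficients ($k = 0, \dots, 3$) recover the assumed path 1, 2, 3, 4. With several cohorts and heterogeneous effects, the same regression no longer has this clean interpretation.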
When there are only two time periods ($T = 2$), TWFE simplifies to the traditional DiD model. Under homogeneous treatment effects and if the parallel trends assumption holds, $\hat{\tau}_{OLS}$ is unbiased. Specifically, the model assumes (Imai and Kim 2021):
- Homogeneous treatment effects across units and time periods, meaning:
- No dynamic treatment effects (i.e., treatment effects do not evolve over time).
- The treatment effect is constant across units (Goodman-Bacon 2021; Clément De Chaisemartin and d’Haultfoeuille 2020; L. Sun and Abraham 2021; Borusyak, Jaravel, and Spiess 2021).
- Parallel trends assumption
- Linear additive effects are valid (Imai and Kim 2021).
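The two-period equivalence can be checked numerically. The sketch below uses simulated data (sample size, effect size, and noise level are arbitrary assumptions) and compares the TWFE coefficient with the difference in mean first differences between treated and control units.

```r
set.seed(1)

# Hypothetical two-period panel: ids 1-25 are treated in period 2 only
n <- 50
d <- expand.grid(id = 1:n, t = 1:2)
d$group <- as.integer(d$id <= 25)
d$w <- d$group * (d$t == 2)

unit_fe <- rnorm(n)
d$y <- unit_fe[d$id] + 0.3 * (d$t == 2) + 1.5 * d$w +
  rnorm(nrow(d), sd = 0.5)

# TWFE estimate with unit and time dummies
tau_twfe <- coef(lm(y ~ w + factor(id) + factor(t), data = d))[["w"]]

# Classic 2x2 DiD: difference in mean first differences
# (rows are ordered by id within each period, so the subtraction aligns units)
dy <- d$y[d$t == 2] - d$y[d$t == 1]
grp <- d$group[d$t == 1]
tau_did <- mean(dy[grp == 1]) - mean(dy[grp == 0])

all.equal(tau_twfe, tau_did)
```

The two estimates agree up to floating-point error, which is why the concerns in this section only arise with more than two periods or variation in treatment timing.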
However, in practice, treatment effects are often heterogeneous. If effects vary by cohort or over time, standard TWFE estimates can be biased, particularly under staggered adoption or dynamic treatment effects (Goodman-Bacon 2021; Clément De Chaisemartin and d’Haultfoeuille 2020; L. Sun and Abraham 2021; Borusyak, Jaravel, and Spiess 2021). Hence, researchers who use TWFE need to argue that effects are plausibly homogeneous and should check the following:
- Assess treatment heterogeneity: If heterogeneity exists, TWFE may produce biased estimates. Researchers should:
- Plot treatment timing across units.
- Decompose the treatment effect using the Goodman-Bacon decomposition to identify negative weights.
- Check the proportion of never-treated observations: When 80% or more of the sample is never treated, TWFE bias tends to be small.
- Beware of bias worsening with long-run effects.
- Dropping relative time periods:
- If all units eventually receive treatment, two relative time periods must be dropped to avoid multicollinearity.
- Some software packages drop periods randomly; if a post-treatment period is dropped, bias may result.
- The standard approach is to drop periods -1 and -2.
- Sources of treatment heterogeneity:
- Delayed treatment effects: The impact of treatment may take time to manifest.
- Evolving effects: Treatment effects can increase or change over time (e.g., phase-in effects).
TWFE compares different types of treatment/control groups:
- Valid comparisons:
- Newly treated units vs. control units
- Newly treated units vs. not-yet treated units
- Problematic comparisons:
- Newly treated units vs. already treated units (since already treated units do not represent the correct counterfactual).
- Strict exogeneity violations:
- Presence of time-varying confounders
- Feedback from past outcomes to treatment (Imai and Kim 2019)
- Functional form restrictions:
- Assumes treatment effect homogeneity.
- No carryover effects or anticipation effects (Imai and Kim 2019).
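To see why comparisons against already-treated units are problematic, consider a noise-free toy panel with two cohorts and no never-treated units. The adoption dates (t = 4 and t = 8) and the effect path (the effect grows by one each period after adoption) are assumptions made purely for illustration.

```r
# Hypothetical two-unit panel: "early" adopts at t = 4, "late" at t = 8
d <- expand.grid(id = c("early", "late"), t = 1:10,
                 stringsAsFactors = FALSE)
d$g <- ifelse(d$id == "early", 4, 8)        # first treated period
d$w <- as.integer(d$t >= d$g)               # treatment indicator

# Assumed dynamic effect: 1 in the first treated period, then +1 per period
d$effect <- d$w * (d$t - d$g + 1)
d$y <- ifelse(d$id == "early", 1, 3) + 0.2 * d$t + d$effect

# Static TWFE coefficient vs. the average effect among treated observations
tau_twfe <- coef(lm(y ~ w + factor(id) + factor(t), data = d))[["w"]]
true_att <- mean(d$effect[d$w == 1])

round(c(twfe = tau_twfe, true_att = true_att), 3)
#>     twfe true_att
#>      0.5      3.4
```

Every treated observation has an effect of at least 1, yet the TWFE coefficient is 0.5. In periods 8 to 10 the early cohort serves as a control for the late cohort, so its still-growing treatment effect is subtracted from the estimate.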
30.6.2 Limitations of TWFE
TWFE DiD is valid only under the strong assumption that the treatment effect does not vary across units or over time, even though in practice some form of treatment heterogeneity is almost always present. Specifically, the model requires:
- No dynamic treatment effects: The model requires that the treatment effect not evolve over time.
- No unit-level differences: The treatment effect must be constant across all units.
- Linear additive effects: TWFE assumes that the underlying data-generating process is captured by additive fixed effects plus a constant treatment effect (Imai and Kim 2021).
If any of these assumptions are violated, TWFE can produce biased estimates. Specifically:
- Negative Weights & Biased Estimates: With multiple groups and staggered timing, the TWFE estimate becomes a complicated average of “two-group, two-period” DiD comparisons, some of which can receive negative weights (Goodman-Bacon 2021).
- Potential Bias from Dropping Relative Time Periods: If all units eventually get treated, software often drops a reference period (or periods) to avoid multicollinearity. If the dropped period is post-treatment, the bias can worsen. Researchers often drop relative time −1 or −2.
- Delayed or Evolving Treatment Effects: If the effect of treatment takes time to manifest or changes over time, TWFE’s single coefficient τ can be misleading.
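One way to make the negative weights concrete is to use the Frisch-Waugh-Lovell result: the TWFE coefficient equals a weighted sum of outcomes, with weights proportional to the treatment indicator residualized on the unit and time fixed effects. The sketch below computes those implicit weights for a hypothetical noise-free two-cohort panel (adoption dates and effect sizes are illustrative assumptions).

```r
# Hypothetical two-unit panel: "early" adopts at t = 4, "late" at t = 8
d <- expand.grid(id = c("early", "late"), t = 1:10,
                 stringsAsFactors = FALSE)
d$g <- ifelse(d$id == "early", 4, 8)
d$w <- as.integer(d$t >= d$g)
d$effect <- d$w * (d$t - d$g + 1)           # assumed growing effect
d$y <- ifelse(d$id == "early", 1, 3) + 0.2 * d$t + d$effect

# Residualize treatment on the fixed effects, then normalize
d$w_res <- resid(lm(w ~ factor(id) + factor(t), data = d))
d$weight <- d$w_res / sum(d$w_res^2)

# Implicit weight that each treated observation receives
treated <- d[d$w == 1, c("id", "t", "effect", "weight")]
treated[order(treated$id, treated$t), ]

# The weighted sum of outcomes reproduces the TWFE coefficient
tau_from_weights <- sum(d$weight * d$y)
tau_twfe <- coef(lm(y ~ w + factor(id) + factor(t), data = d))[["w"]]
```

In this example the weights on treated observations sum to one, but the early cohort's observations in periods 8 to 10 (where its effects are largest) receive a weight of about -0.167 each. That is how a set of strictly positive effects can produce a coefficient smaller than any of them.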
When only two time periods exist, TWFE collapses to the traditional DiD model, making these problems far less severe. But as soon as there is more than one treatment period or variation in treatment timing, these issues become critical.
Several authors (L. Sun and Abraham 2021; Callaway and Sant’Anna 2021; Goodman-Bacon 2021) have raised the following concerns about TWFE DiD regressions under staggered adoption:
- Mixes Cohorts: May unintentionally compare newly treated units to already treated units, conflating post-treatment behavior of early adopters with the pre-treatment trends of later adopters.
- Negative Weights: Some group comparisons receive negative weights, which can reverse the sign of the overall estimate.
- Pre-Treatment Leads: Leads may appear non-zero if earlier-treated groups remain in the sample while later adopters are still untreated.
- Long-Run Effects: Heterogeneity in lagged (long-run) effects can exacerbate bias.
In fields like finance and accounting, newer estimators often reveal null or much smaller effects than standard TWFE once bias is properly accounted for (Baker, Larcker, and Wang 2022).
30.6.3 Diagnosing and Addressing Bias in TWFE
Researchers can identify and mitigate the biases arising from heterogeneous treatment effects through diagnostic checks and alternative estimators:
- Goodman-Bacon Decomposition
  - Purpose: Expresses the TWFE DiD estimate as a weighted average of all two-group, two-period comparisons.
  - Insight: Reveals which comparisons have negative weights and how much each comparison contributes to the overall estimate (Goodman-Bacon 2021).
  - Implementation: Identify subgroups by treatment timing, then examine each group–time pair to see how it contributes to the aggregate TWFE coefficient.
- Plotting Treatment Timing
  - Visual Inspection: Always plot the distribution of treatment timing across units.
  - High Risk of Bias: If treatment is staggered and many units differ in their adoption times, standard TWFE will often be biased.
- Assessing Treatment Heterogeneity Directly
  - Check for Variation in Effects: If there is a theoretical or empirical reason to believe that treatment effects differ by subgroup or over time, TWFE might not be appropriate.
  - Size of Never-Treated Sample: When 80% or more of the sample is never treated, the potential for bias in TWFE is smaller. However, large shares of treated units with varied adoption times raise red flags.
  - Long-Run Effects: Bias can worsen if the treatment effect accumulates or changes over time.
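The timing and never-treated checks above can be sketched in a few lines of base R. The panel here is hypothetical (five units, six years, made-up adoption years); with real data, only the `unit`, `year`, and `post` columns would be needed.

```r
# Hypothetical panel with a binary post-treatment indicator
panel <- data.frame(
  unit = rep(c("A", "B", "C", "D", "E"), each = 6),
  year = rep(2001:2006, times = 5),
  stringsAsFactors = FALSE
)
first_treat <- c(A = 2003, B = 2003, C = 2005, D = NA, E = NA)  # assumed
adopt <- first_treat[panel$unit]
panel$post <- as.integer(!is.na(adopt) & panel$year >= adopt)

# Recover each unit's adoption cohort from the post indicator alone
cohort <- sapply(split(panel, panel$unit), function(u) {
  if (any(u$post == 1)) min(u$year[u$post == 1]) else NA
})

# Distribution of treatment timing (NA = never treated)
timing <- table(cohort, useNA = "ifany")
timing

# Share of units that are never treated
share_never <- mean(is.na(cohort))
share_never
```

A call such as `barplot(timing)` gives a quick visual of how staggered adoption is. Here 40% of units are never treated, well below the 80% rule of thumb mentioned above.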
30.6.3.1 Goodman-Bacon Decomposition
The Goodman-Bacon decomposition (Goodman-Bacon 2021) is a powerful diagnostic tool for understanding the TWFE estimator in settings with staggered treatment adoption. This approach clarifies how the TWFE DiD estimate is a weighted average of many 2×2 difference-in-differences comparisons between groups treated at different times (or never treated).
Key Takeaways
- A pairwise DiD estimate (τ) receives more weight when:
- The treatment happens closer to the midpoint of the observation window.
- The comparison involves more observations (e.g., more units or more years).
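- Comparisons that use already-treated groups as controls (later-treated vs. earlier-treated) can effectively place negative weights on some treatment effects when those effects change over time, potentially biasing the aggregate TWFE estimate.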
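For reference, the decomposition can be written compactly as follows. This is a sketch in notation loosely following Goodman-Bacon (2021), with no covariates: $k$ and $\ell$ index timing groups ($k$ treated before $\ell$), $U$ is the never-treated group, $n_k$ is a group's sample share, $n_{kU} = n_k / (n_k + n_U)$, and $\bar{D}_k$ is the share of periods in which group $k$ is treated.

```latex
\hat{\tau}^{TWFE}
  = \sum_{k \neq U} s_{kU}\, \hat{\tau}^{2\times 2}_{kU}
  + \sum_{k \neq U} \sum_{\ell > k}
    \left[ s^{k}_{k\ell}\, \hat{\tau}^{2\times 2,\,k}_{k\ell}
         + s^{\ell}_{k\ell}\, \hat{\tau}^{2\times 2,\,\ell}_{k\ell} \right],
\qquad
\sum_{k \neq U} s_{kU}
  + \sum_{k \neq U} \sum_{\ell > k} \left( s^{k}_{k\ell} + s^{\ell}_{k\ell} \right) = 1,

s_{kU} \;\propto\; (n_k + n_U)^2 \; n_{kU}\,(1 - n_{kU}) \; \bar{D}_k\,(1 - \bar{D}_k).
```

The factor $\bar{D}_k (1 - \bar{D}_k)$ is largest when $\bar{D}_k = 0.5$, which is why groups treated near the middle of the panel receive the most weight, and the size terms explain why comparisons with more observations count for more.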
We illustrate the decomposition using the `castle` dataset from the `bacondecomp` package:
library(bacondecomp)
library(tidyverse)
# Load and inspect the castle dataset
castle <- bacondecomp::castle %>%
dplyr::select(l_homicide, post, state, year)
head(castle)
#> l_homicide post state year
#> 1 2.027356 0 Alabama 2000
#> 2 2.164867 0 Alabama 2001
#> 3 1.936334 0 Alabama 2002
#> 4 1.919567 0 Alabama 2003
#> 5 1.749841 0 Alabama 2004
#> 6 2.130440 0 Alabama 2005
Running the Goodman-Bacon Decomposition
# Apply Goodman-Bacon decomposition
df_bacon <- bacon(
formula = l_homicide ~ post,
data = castle,
id_var = "state",
time_var = "year"
)
#> type weight avg_est
#> 1 Earlier vs Later Treated 0.05976 -0.00554
#> 2 Later vs Earlier Treated 0.03190 0.07032
#> 3 Treated vs Untreated 0.90834 0.08796
# Display weighted average of the decomposition
weighted_avg <- sum(df_bacon$estimate * df_bacon$weight)
weighted_avg
#> [1] 0.08181162
Comparing with the TWFE Estimate
library(broom)
# Fit a TWFE model
fit_tw <- lm(l_homicide ~ post + factor(state) + factor(year), data = castle)
tidy(fit_tw)
#> # A tibble: 61 × 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 1.95 0.0624 31.2 2.84e-118
#> 2 post 0.0818 0.0317 2.58 1.02e- 2
#> 3 factor(state)Alaska -0.373 0.0797 -4.68 3.77e- 6
#> 4 factor(state)Arizona 0.0158 0.0797 0.198 8.43e- 1
#> 5 factor(state)Arkansas -0.118 0.0810 -1.46 1.44e- 1
#> 6 factor(state)California -0.108 0.0810 -1.34 1.82e- 1
#> 7 factor(state)Colorado -0.696 0.0810 -8.59 1.14e- 16
#> 8 factor(state)Connecticut -0.785 0.0810 -9.68 2.08e- 20
#> 9 factor(state)Delaware -0.547 0.0810 -6.75 4.18e- 11
#> 10 factor(state)Florida -0.251 0.0798 -3.14 1.76e- 3
#> # ℹ 51 more rows
Interpretation: The TWFE estimate (approx. 0.08) equals the weighted average of the Bacon decomposition estimates, confirming the decomposition’s validity.
Visualizing the Decomposition
library(ggplot2)
ggplot(df_bacon) +
aes(
x = weight,
y = estimate,
color = type
) +
geom_point() +
labs(
x = "Weight",
y = "Estimate",
color = "Comparison Type"
) +
causalverse::ama_theme()
Insight: This plot shows the contribution of each 2×2 DiD comparison, highlighting how estimates with large weights dominate the overall TWFE coefficient.
Interpretation and Practical Implications
- Purpose: Expresses the TWFE DiD estimate as a weighted average of all two-group, two-period comparisons.
- Insight: Reveals how much each comparison contributes to the overall estimate and whether any have negative or misleading effects.
- Implementation:
- Identify subgroups by treatment timing.
- Compute DiD for each 2×2 comparison (early vs. late, late vs. never, etc.).
- Evaluate how these contribute to the final TWFE estimate.
When time-varying covariates are included that allow for identification within treatment timing groups, certain problematic comparisons (like “early vs. late”) may no longer influence the TWFE estimator directly. These scenarios may collapse into simpler within-group estimates, improving identification.
Summary Table: Goodman-Bacon Comparison Types
Comparison Type | Description | Common Issue |
---|---|---|
Treated vs. Never | Clean comparisons if never-treated units exist | Often reliable |
Early vs. Late | Later (not-yet-treated) group is control in earlier period | Valid under parallel trends |
Late vs. Early | Early (already-treated) group is control in later period | Biased if effects evolve over time; can flip the sign |
Treated vs. Treated | Within-treatment variation by timing | Sensitive to dynamics |
30.6.4 Remedies for TWFE’s Shortcomings
This section outlines alternative estimators and design-based approaches that explicitly handle heterogeneous treatment effects, staggered adoption (Baker, Larcker, and Wang 2022), and dynamic treatment effects better than standard TWFE (e.g., Modern Estimators for Staggered Adoption).
Callaway and Sant’Anna (2021) propose a two-step approach:
- Group-time treatment effects: In each time period, estimate the effect for the cohort that first received treatment in that period (compared to a never-treated group).
- Aggregate: Use a bootstrap procedure to account for autocorrelation and clustering, then aggregate across groups.
- Advantages: Allows for heterogeneous treatment effects across groups and over time; compares treated groups only with never-treated units (or well-chosen controls).
- Implementation: `did` package in R.
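The group-time logic can be illustrated by hand on a toy panel. The sketch below is not the `did` package's implementation: it is a bare-bones, noise-free version of the first step (no covariates, no inference) using the period just before adoption as the base period and a never-treated unit as the control. Cohort dates and effect sizes are assumptions.

```r
# Hypothetical panel: "early" adopts at t = 4, "late" at t = 8, "never" never
d <- expand.grid(id = c("early", "late", "never"), t = 1:10,
                 stringsAsFactors = FALSE)
first_treat <- c(early = 4, late = 8, never = Inf)
unit_fe <- c(early = 1, late = 3, never = 2)
d$g <- unname(first_treat[d$id])
d$w <- d$t >= d$g
d$effect <- ifelse(d$w, d$t - d$g + 1, 0)   # assumed growing effect
d$y <- unname(unit_fe[d$id]) + 0.2 * d$t + d$effect

mean_y <- function(ids, period) mean(d$y[d$id %in% ids & d$t == period])

# ATT(g, t): change since period g - 1 for cohort g,
# minus the same change for never-treated units
att_gt <- function(g, t) {
  cohort <- unique(d$id[d$g == g])
  never <- unique(d$id[is.infinite(d$g)])
  (mean_y(cohort, t) - mean_y(cohort, g - 1)) -
    (mean_y(never, t) - mean_y(never, g - 1))
}

# All post-treatment (g, t) cells
cells <- subset(unique(d[is.finite(d$g), c("g", "t")]), t >= g)
cells$att <- mapply(att_gt, cells$g, cells$t)

# Aggregations: overall average and average by time since adoption
overall_att <- mean(cells$att)
by_event_time <- tapply(cells$att, cells$t - cells$g, mean)
```

Each cell recovers the assumed effect exactly (for example, `att_gt(4, 6)` is 3), and the simple average over cells is 3.4 because the two cohorts are the same size. In applied work, `did::att_gt()` and `did::aggte()` handle estimation, aggregation weights, and bootstrap inference.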
L. Sun and Abraham (2021) build on Callaway and Sant’Anna (2021) to handle event-study settings:
- Lags and Leads: Capture dynamic treatment effects by including time lags and leads relative to the event (treatment).
- Cohort-Specific Estimates: Estimate separate paths of outcomes for each cohort, controlling for other cohorts carefully.
- Interaction-Weighted Estimator: Adjusts for differences in when treatment began.
- Implementation: `fixest` package in R.
Imai and Kim (2021) develop methods allowing units to switch in and out of treatment:
- Matching to create a weighted version of TWFE, addressing some of the bias from heterogeneous effects.
- Implementation: `wfe` and `PanelMatch` R packages.
- Two-Stage Difference-in-Differences (DiD2S)
Gardner (2022) proposes two-stage DiD:
- Idea: Partial out fixed effects first, then perform a second-stage regression that focuses on within-group/time variation.
- Strength: Handles heterogeneous treatment effects well, especially when never-treated units are present.
- Implementation: `did2s` R package.
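The two-stage idea can be sketched in base R without the package. This is a simplified illustration on a hypothetical noise-free panel (assumed cohort dates and effect path): the first stage estimates unit and time fixed effects from untreated observations only, and the second stage regresses the residualized outcome on treatment. It produces point estimates only; the `did2s` package additionally corrects the second-stage standard errors for first-stage estimation.

```r
# Hypothetical panel: "early" adopts at t = 4, "late" at t = 8, "never" never
d <- expand.grid(id = c("early", "late", "never"), t = 1:10,
                 stringsAsFactors = FALSE)
first_treat <- c(early = 4, late = 8, never = Inf)
unit_fe <- c(early = 1, late = 3, never = 2)
d$g <- unname(first_treat[d$id])
d$w <- as.numeric(d$t >= d$g)
d$effect <- ifelse(d$w == 1, d$t - d$g + 1, 0)   # assumed growing effect
d$y <- unname(unit_fe[d$id]) + 0.2 * d$t + d$effect
d$id_f <- factor(d$id)
d$t_f <- factor(d$t)

# Stage 1: fixed effects estimated on untreated observations only
stage1 <- lm(y ~ id_f + t_f, data = d[d$w == 0, ])
d$y_resid <- d$y - predict(stage1, newdata = d)

# Stage 2: regress the residualized outcome on treatment
att_2s <- coef(lm(y_resid ~ 0 + w, data = d))[["w"]]

# Static TWFE on the same data, for comparison
tau_twfe <- coef(lm(y ~ w + id_f + t_f, data = d))[["w"]]

round(c(two_stage = att_2s, twfe = tau_twfe), 3)
```

In this constructed example the two-stage estimate equals the true average effect among treated observations (3.4), while static TWFE falls below it. The first stage relies on the never-treated unit to identify the time effects in later periods, which is why never-treated (or not-yet-treated) observations matter for this approach.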
- If a study has never-treated units, Clément De Chaisemartin and d’Haultfoeuille (2020) suggest a switching DiD estimator to recover the average treatment effect.
- Caveat: This approach still fails to detect heterogeneity if treatment effects vary with exposure length (L. Sun and Shapiro 2022).
- Design-Based Approaches: Arkhangelsky et al. (2024) offer further refinements that incorporate inverse probability weighting.
- Goal: Improve balance and reduce bias from non-random treatment timing.
- Stacked DID (Simpler but Biased)
- Build stacked datasets for each treatment cohort, running separate regressions for each “event window.”
- This approach is simpler but can still carry biases if the underlying assumptions are violated (Gormley and Matsa 2011; Cengiz et al. 2019; Deshpande and Li 2019).
- Doubly Robust Difference-in-Differences Estimators (DR-DID) (Sant’Anna and Zhao 2020)
- DR-DID estimators combine outcome regression and propensity score weighting to identify treatment effects, remaining consistent if either model is correctly specified.
- They achieve local efficiency under joint correctness and can be applied to both panel and repeated cross-section data.
- Nonlinear Difference-in-Differences
30.6.5 Best Practices and Recommendations
Below are practical guidelines for deciding when to use TWFE and how to diagnose or address potential bias.
- When is TWFE Appropriate?
- Single Treatment Period: TWFE DiD works well if there is only one treatment period for all treated units (no variation in timing).
- Homogeneous Effects: If strong theoretical or empirical reasons suggest constant treatment effects across cohorts and over time, TWFE remains a reasonable choice.
- Diagnosing and Addressing Bias with Staggered Adoption
- Plot Treatment Timing: Examine the distribution of treatment timing across units. If treatment adoption is highly staggered, TWFE is likely to produce biased estimates.
- Decomposition Methods: Use the Goodman-Bacon Decomposition (Goodman-Bacon 2021) to see how TWFE pools comparisons (and whether negative weights emerge). If decomposition is infeasible (e.g., unbalanced panels), the share of never-treated units can indicate potential bias severity.
- Decomposes the TWFE DiD estimate into two-group, two-period comparisons.
- Identifies which comparisons receive negative weights, which can lead to biased estimates.
- Helps determine the influence of specific groups on the overall estimate.
- Discuss Heterogeneity: Explicitly state the likelihood of treatment effect heterogeneity; incorporate it into the research design.
- Event-Study Specifications within TWFE
- Avoid Arbitrary Binning: Do not collapse multiple time periods into a single bin unless you can justify homogeneous effects within that bin.
- Full Relative-Time Indicators: Include fully flexible relative-time indicators and justify the reference period (commonly $l = -1$, the period just before treatment).
- Beware of Multicollinearity: Including leads and lags can cause multicollinearity and artificially produce significant “pre-trends.”
- Drop the Right Periods: If all units eventually get treated, dropping post-treatment periods accidentally can bias results.
- Consider Alternative Estimators