30.6 Two-Way Fixed Effects

A generalization of the Difference-in-Differences model is the two-way fixed effects (TWFE) model, which accounts for multiple groups and multiple time periods by including both unit and time fixed effects. In practice, TWFE is frequently used to estimate causal effects in panel data settings. However, it is not a design-based, non-parametric causal estimator (Imai and Kim 2021), and it can suffer from severe biases if the treatment effect is heterogeneous across units or time.

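When applying TWFE to datasets with multiple treatment groups and staggered treatment timing, the estimated causal coefficient is a weighted average of all possible two-group, two-period DiD comparisons (Goodman-Bacon 2021). The weights on these 2×2 comparisons are positive, but some comparisons use already-treated units as controls. When treatment effects change over time, this is equivalent to placing negative weights on some underlying treatment effects (Clément De Chaisemartin and d’Haultfoeuille 2020), which leads to potential biases. The weighting scheme depends on: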

Group sizes
Variation in treatment timing
Timing of treatment within the panel (groups treated near the middle of the panel tend to get the highest weight)

30.6.1 Canonical TWFE Model

The canonical TWFE model is typically written as:

\[ Y_{it} = \alpha_i + \lambda_t + \tau W_{it} + \beta X_{it} + \epsilon_{it}, \]

where:

\(Y_{it}\) = Outcome for unit \(i\) at time \(t\)
\(\alpha_i\) = Unit fixed effect
\(\lambda_t\) = Time fixed effect
\(\tau\) = Causal effect of treatment
\(W_{it}\) = Treatment indicator (\(1\) if treated, \(0\) otherwise)
\(X_{it}\) = Covariates
\(\epsilon_{it}\) = Error term
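
As a minimal sketch of the canonical model, the following simulates a small balanced panel and estimates \(\tau\) in two equivalent ways: with unit and time dummies, and with two-way demeaning. The data-generating process, the variable names, and the helper `demean2()` are illustrative assumptions, and covariates \(X_{it}\) are omitted for brevity.

```r
# Simulated balanced panel (all names and parameter values are assumptions)
set.seed(1)
n_units <- 30
n_periods <- 6
panel <- expand.grid(unit = 1:n_units, time = 1:n_periods)

alpha  <- rnorm(n_units)[panel$unit]   # unit fixed effects (alpha_i)
lambda <- 0.3 * panel$time             # time fixed effects (lambda_t)

# Units 1-15 are treated from period 4 onward; the rest are never treated
panel$W <- as.integer(panel$unit <= 15 & panel$time >= 4)
tau <- 2                               # constant (homogeneous) treatment effect
panel$Y <- alpha + lambda + tau * panel$W + rnorm(nrow(panel), sd = 0.5)

# (1) TWFE with explicit unit and time dummies
fit_dummy <- lm(Y ~ W + factor(unit) + factor(time), data = panel)
tau_dummy <- unname(coef(fit_dummy)["W"])

# (2) Same estimate via two-way demeaning (exact only for balanced panels)
demean2 <- function(x, i, t) x - ave(x, i) - ave(x, t) + mean(x)
Y_dm <- demean2(panel$Y, panel$unit, panel$time)
W_dm <- demean2(panel$W, panel$unit, panel$time)
tau_within <- sum(W_dm * Y_dm) / sum(W_dm^2)

c(dummy = tau_dummy, within = tau_within)
```

Both numbers coincide because the dummies and the demeaning remove the same unit and time means. With a single adoption date and a constant effect, as assumed here, the estimate should land close to the true value of 2.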

An illustrative TWFE event-study model (Stevenson and Wolfers 2006):

\[ \begin{aligned} Y_{it} &= \sum_{k} \beta_{k} \cdot Treatment_{it}^{k} + \eta_{i} + \lambda_{t} + Controls_{it} + \epsilon_{it}, \end{aligned} \]

where:

\(Treatment_{it}^k\): Indicator for whether unit \(i\) is in its \(k\)-th year relative to treatment at time \(t\).
\(\eta_i\): Unit fixed effects, controlling for time-invariant unobserved heterogeneity.
\(\lambda_t\): Time fixed effects, capturing overall macro shocks.
Standard Errors: Typically clustered at the group or cohort level.

Usually, researchers drop the period immediately before treatment (\(k=-1\)) to avoid collinearity. However, the choice matters: every \(\beta_k\) is measured relative to the omitted period, so dropping this or another period inappropriately can shift or bias the estimates.
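
The event-study specification can be sketched in base R as follows. The simulated design (one adopting cohort plus never-treated units), the variable names, and the effect path are assumptions made for illustration. Never-treated units are assigned to the reference level \(k=-1\), so they contribute only to the fixed effects.

```r
# Simulated panel: units 1-20 adopt in period 5, units 21-40 never adopt
set.seed(2)
n_units <- 40
es <- expand.grid(unit = 1:n_units, time = 1:8)
es$adopt <- ifelse(es$unit <= 20, 5, NA)
es$k <- es$time - es$adopt                  # relative time; NA if never treated

# Assumed dynamic effect: 1 on impact, growing by 0.5 per period
true_effect <- function(k) ifelse(!is.na(k) & k >= 0, 1 + 0.5 * k, 0)
es$Y <- rnorm(n_units)[es$unit] + 0.2 * es$time + true_effect(es$k) +
  rnorm(nrow(es), sd = 0.3)

# Relative-time factor with k = -1 as the omitted reference period
es$k_f <- factor(ifelse(is.na(es$k), -1, es$k))
es$k_f <- relevel(es$k_f, ref = "-1")

fit_es <- lm(Y ~ k_f + factor(unit) + factor(time), data = es)
beta_k <- coef(fit_es)[grep("^k_f", names(coef(fit_es)))]
round(beta_k, 2)
```

The lead coefficients (\(k < -1\)) serve as a pre-trend check and the lag coefficients trace the dynamic effect. Standard errors from `lm()` are not clustered; in applied work they would be clustered as described above.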

When there are only two time periods \((T=2)\), TWFE simplifies to the traditional DiD model. Under homogeneous treatment effects and if the parallel trends assumption holds, \(\hat{\tau}_{OLS}\) is unbiased. Specifically, the model assumes (Imai and Kim 2021):

Homogeneous treatment effects across units and time periods, meaning:
- No dynamic treatment effects (i.e., treatment effects do not evolve over time).
- The treatment effect is constant across units (Goodman-Bacon 2021; Clément De Chaisemartin and d’Haultfoeuille 2020; L. Sun and Abraham 2021; Borusyak, Jaravel, and Spiess 2021).
Parallel trends assumption
Linear additive effects are valid (Imai and Kim 2021).
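
To see why the two-period case reduces to the traditional DiD, first-difference the canonical model. This sketch omits covariates and introduces \(G_i\) as shorthand (an assumption of this illustration) for membership in the treated group, so that \(W_{i1}=0\) and \(W_{i2}=G_i\):

\[ \begin{aligned} Y_{i2} - Y_{i1} &= (\lambda_2 - \lambda_1) + \tau G_i + (\epsilon_{i2} - \epsilon_{i1}), \\ \hat{\tau}_{OLS} &= \left(\bar{Y}_{1,2} - \bar{Y}_{1,1}\right) - \left(\bar{Y}_{0,2} - \bar{Y}_{0,1}\right), \end{aligned} \]

where \(\bar{Y}_{g,t}\) denotes the mean outcome of group \(g \in \{0,1\}\) in period \(t\). Differencing removes \(\alpha_i\), the intercept absorbs the common time change, and the OLS slope on a binary regressor is the difference in mean changes between the two groups, which is the usual DiD estimator.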

However, in practice, treatment effects are often heterogeneous. If effects vary by cohort or over time, standard TWFE estimates can be biased, particularly under staggered adoption or dynamic treatment effects (Goodman-Bacon 2021; Clément De Chaisemartin and d’Haultfoeuille 2020; L. Sun and Abraham 2021; Borusyak, Jaravel, and Spiess 2021). Hence, researchers who use TWFE need to justify the homogeneity assumption explicitly and check the following:

Assess treatment heterogeneity: If heterogeneity exists, TWFE may produce biased estimates. Researchers should:
- Plot treatment timing across units.
- Decompose the TWFE estimate using the Goodman-Bacon decomposition to identify problematic comparisons and the weight they carry.
- Check the proportion of never-treated observations: as a rule of thumb, when 80% or more of the sample is never treated, TWFE bias tends to be small.
- Beware of bias worsening with long-run effects.
Dropping relative time periods:
- If all units eventually receive treatment, two relative time periods must be dropped to avoid multicollinearity.
- Some software packages drop collinear periods automatically (in effect arbitrarily); if a post-treatment period is dropped, bias may result.
- The standard approach is to drop periods -1 and -2.
Sources of treatment heterogeneity:
- Delayed treatment effects: The impact of treatment may take time to manifest.
- Evolving effects: Treatment effects can increase or change over time (e.g., phase-in effects).
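
The first two checks above (treatment timing and the never-treated share) can be sketched in base R. The toy panel, the adoption years, and the column names `state`, `year`, and `post` are assumptions for illustration; with real data only the last few steps would be needed.

```r
# Toy panel with a 0/1 treatment indicator `post` (assumed structure)
toy <- expand.grid(state = paste0("S", 1:10), year = 2000:2009,
                   stringsAsFactors = FALSE)
adopt_year <- c(S1 = 2003, S2 = 2003, S3 = 2006, S4 = 2008)  # S5-S10 never adopt
toy$post <- as.integer(!is.na(adopt_year[toy$state]) &
                         toy$year >= adopt_year[toy$state])

# First treated year per unit (Inf marks never-treated units)
first_treat <- tapply(seq_len(nrow(toy)), toy$state, function(idx) {
  yrs <- toy$year[idx][toy$post[idx] == 1]
  if (length(yrs) == 0) Inf else min(yrs)
})

timing_table <- table(first_treat)           # number of units per adoption year
share_never  <- mean(is.infinite(first_treat))

timing_table
share_never

# Quick visual check of how staggered adoption is
barplot(timing_table, xlab = "First treated year (Inf = never)", ylab = "Units")
```

A wide spread of adoption years combined with a small never-treated share is the configuration in which the concerns in this section matter most.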

TWFE compares different types of treatment/control groups:

Valid comparisons:
- Newly treated units vs. control units
- Newly treated units vs. not-yet treated units
Problematic comparisons:
- Newly treated units vs. already treated units (since already treated units do not represent the correct counterfactual).
Beyond the choice of comparison group, TWFE also relies on:
- Strict exogeneity, which is violated by:
  - Time-varying confounders
  - Feedback from past outcomes to treatment (Imai and Kim 2019)
- Functional form restrictions:
  - Treatment effect homogeneity
  - No carryover effects or anticipation effects (Imai and Kim 2019)
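
A deliberately tiny, noise-free example shows why using already-treated units as controls is a problem. The design (two cohorts, three periods, no never-treated units) and all names are assumptions chosen so the arithmetic can be checked by hand.

```r
# Two cohorts, three periods, no never-treated units, no noise
d <- data.frame(
  unit        = rep(c("early", "late"), each = 3),
  time        = rep(1:3, times = 2),
  first_treat = rep(c(2, 3), each = 3)
)
d$W <- as.integer(d$time >= d$first_treat)

# Effect grows with exposure: 1 in the first treated period, 2 in the second
d$effect <- d$W * (d$time - d$first_treat + 1)
d$Y <- d$effect                 # unit and time effects set to zero for clarity

twfe <- unname(coef(lm(Y ~ W + factor(unit) + factor(time), data = d))["W"])
true_att <- mean(d$effect[d$W == 1])

c(twfe = twfe, true_att = true_att)
#> twfe = 0.5, true_att = 1.333
```

The "early vs. not-yet-treated late" comparison between periods 1 and 2 recovers the correct effect of 1. The "late vs. already-treated early" comparison between periods 2 and 3 yields 1 - 1 = 0, because the early cohort's own effect grows by 1 over that window and is differenced out as if it were a trend. TWFE averages the two comparisons and reports 0.5, well below the average effect on treated cells of 4/3.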

30.6.2 Limitations of TWFE

TWFE DiD is valid only under strong assumptions that rule out variation in the treatment effect across units or over time, even though some form of treatment heterogeneity is almost always present in practice. The key assumptions are:

No dynamic treatment effects: The model requires that the treatment effect not evolve over time.
No unit-level differences: The treatment effect must be constant across all units.
Linear additive effects: TWFE assumes that the underlying data-generating process is captured by additive fixed effects plus a constant treatment effect (Imai and Kim 2021).

If any of these assumptions is violated, TWFE can produce biased estimates. Specifically:

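Negative Weights & Biased Estimates: With multiple groups and staggered timing, the TWFE estimate becomes a complicated weighted average of “two-group, two-period” DiD comparisons (Goodman-Bacon 2021). Comparisons that use already-treated units as controls effectively assign negative weights to some treatment effects when those effects change over time.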
Potential Bias from Dropping Relative Time Periods: If all units eventually get treated, software often drops a reference period (or periods) to avoid multicollinearity. If the dropped period is post-treatment, the bias can worsen. Researchers often drop relative time \(-1\) or \(-2\).
Delayed or Evolving Treatment Effects: If the effect of treatment takes time to manifest or changes over time, TWFE’s single coefficient \(\tau\) can be misleading.

When only two time periods exist, TWFE collapses back to the traditional DiD model, making these problems far less severe. But as soon as one moves beyond a single treatment period or has variation in treatment timing, these issues become critical.

Several authors (L. Sun and Abraham 2021; Callaway and Sant’Anna 2021; Goodman-Bacon 2021) have raised the following concerns about TWFE DiD regressions under staggered adoption:

Mixes Cohorts: May unintentionally compare newly treated units to already treated units, conflating post-treatment behavior of early adopters with the pre-treatment trends of later adopters.
Negative Weights: Some treatment effects enter the estimate with negative weights, which can reverse the sign of the overall estimate.
Pre-Treatment Leads: Leads may appear non-zero if earlier-treated groups remain in the sample while later adopters are still untreated.
Long-Run Effects: Heterogeneity in lagged (long-run) effects can exacerbate bias.
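
The negative-weight concern can be made concrete with the Frisch-Waugh-Lovell theorem: the TWFE coefficient equals \(\sum_{it} \tilde{W}_{it} Y_{it} / \sum_{it} \tilde{W}_{it}^2\), where \(\tilde{W}_{it}\) is treatment residualized on the unit and time fixed effects. Under parallel trends, each treated cell's effect therefore enters with weight \(\tilde{W}_{it} / \sum \tilde{W}_{it}^2\). The sketch below, in the spirit of the weighting results in Clément De Chaisemartin and d’Haultfoeuille (2020), uses an assumed design with three cohorts (one unit each) and no never-treated units.

```r
# Staggered design: cohorts first treated in periods 2, 3, and 4 (illustrative)
grid <- expand.grid(cohort = c(2, 3, 4), time = 1:4)
grid$W <- as.integer(grid$time >= grid$cohort)

# Residualize treatment on the two sets of fixed effects
W_res <- resid(lm(W ~ factor(cohort) + factor(time), data = grid))

# Implied weight on the treatment effect of each treated cell
grid$weight <- ifelse(grid$W == 1, W_res / sum(W_res^2), NA)
treated_cells <- grid[grid$W == 1, c("cohort", "time", "weight")]
treated_cells <- treated_cells[order(treated_cells$cohort, treated_cells$time), ]
treated_cells
```

The weights sum to one, but the earliest cohort's last period receives a negative weight. If the effect for that cell is large (for example, because effects accumulate), it pulls the TWFE coefficient down rather than up.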

In fields like finance and accounting, newer estimators often reveal null or much smaller effects than standard TWFE once bias is properly accounted for (Baker, Larcker, and Wang 2022).

30.6.3 Diagnosing and Addressing Bias in TWFE

Researchers can identify and mitigate the biases arising from heterogeneous treatment effects through diagnostic checks and alternative estimators:

Goodman-Bacon Decomposition

Purpose: Decomposes the TWFE DiD estimate into a weighted average of all two-group, two-period comparisons.
Insight: Reveals how much each comparison contributes to the overall estimate and which comparisons rely on already-treated units as controls, the source of negative weighting (Goodman-Bacon 2021).
Implementation: Identify subgroups by treatment timing, then examine each group–time pair to see how it contributes to the aggregate TWFE coefficient.

Plotting Treatment Timing

Visual Inspection: Always plot the distribution of treatment timing across units.
High Risk of Bias: If treatment is staggered and many units differ in their adoption times, standard TWFE is likely to be biased unless treatment effects are homogeneous.

Assessing Treatment Heterogeneity Directly

Check for Variation in Effects: If there is a theoretical or empirical reason to believe that treatment effects differ by subgroup or over time, TWFE might not be appropriate.
Size of Never-Treated Sample: When 80% or more of the sample is never treated, the potential for bias in TWFE is smaller. However, large shares of treated units with varied adoption times raise red flags.
Long-Run Effects: Bias can worsen if the treatment effect accumulates or changes over time.

30.6.3.1 Goodman-Bacon Decomposition

The Goodman-Bacon decomposition (Goodman-Bacon 2021) is a powerful diagnostic tool for understanding the TWFE estimator in settings with staggered treatment adoption. This approach clarifies how the TWFE DiD estimate is a weighted average of many 2×2 difference-in-differences comparisons between groups treated at different times (or never treated).

Key Takeaways

A pairwise DiD estimate (\(\tau\)) receives more weight when:
- The treatment happens closer to the midpoint of the observation window.
- The comparison involves more observations (e.g., more units or more years).
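Comparisons that use earlier-treated (already-treated) groups as controls for later-treated groups are the problematic ones: the 2×2 weights themselves are positive, but when effects change over time these comparisons place negative weight on some treatment effects, potentially biasing the aggregate TWFE estimate.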
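
The role of timing and sample size can be read off the weight on a treated-versus-untreated comparison. As a sketch following the notation of Goodman-Bacon (2021), with symbols introduced here for illustration: let \(n_k\) and \(n_U\) be the sample shares of timing group \(k\) and the untreated group, \(n_{kU} = n_k / (n_k + n_U)\), and \(\bar{D}_k\) the share of periods in which group \(k\) is treated. Then, up to a common normalizing constant,

\[ s_{kU} \propto (n_k + n_U)^2 \; n_{kU}(1 - n_{kU}) \; \bar{D}_k (1 - \bar{D}_k). \]

The term \(\bar{D}_k(1-\bar{D}_k)\) is the variance of the treatment indicator over time and is largest at \(\bar{D}_k = 1/2\), which is why groups treated near the midpoint of the panel receive the most weight. The first two terms grow with the size of the subsample and with how evenly it is split between treated and untreated units.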

We illustrate the decomposition using the castle dataset from the bacondecomp package:

library(bacondecomp)
library(tidyverse)

# Load and inspect the castle dataset
castle <- bacondecomp::castle %>% 
  dplyr::select(l_homicide, post, state, year)
head(castle)
#>   l_homicide post   state year
#> 1   2.027356    0 Alabama 2000
#> 2   2.164867    0 Alabama 2001
#> 3   1.936334    0 Alabama 2002
#> 4   1.919567    0 Alabama 2003
#> 5   1.749841    0 Alabama 2004
#> 6   2.130440    0 Alabama 2005

Running the Goodman-Bacon Decomposition

# Apply Goodman-Bacon decomposition
df_bacon <- bacon(
  formula = l_homicide ~ post,
  data = castle,
  id_var = "state",
  time_var = "year"
)
#>                       type  weight  avg_est
#> 1 Earlier vs Later Treated 0.05976 -0.00554
#> 2 Later vs Earlier Treated 0.03190  0.07032
#> 3     Treated vs Untreated 0.90834  0.08796

# Display weighted average of the decomposition
weighted_avg <- sum(df_bacon$estimate * df_bacon$weight)
weighted_avg
#> [1] 0.08181162

Comparing with the TWFE Estimate

library(broom)

# Fit a TWFE model
fit_tw <- lm(l_homicide ~ post + factor(state) + factor(year), data = castle)
tidy(fit_tw)
#> # A tibble: 61 × 5
#>    term                     estimate std.error statistic   p.value
#>    <chr>                       <dbl>     <dbl>     <dbl>     <dbl>
#>  1 (Intercept)                1.95      0.0624    31.2   2.84e-118
#>  2 post                       0.0818    0.0317     2.58  1.02e-  2
#>  3 factor(state)Alaska       -0.373     0.0797    -4.68  3.77e-  6
#>  4 factor(state)Arizona       0.0158    0.0797     0.198 8.43e-  1
#>  5 factor(state)Arkansas     -0.118     0.0810    -1.46  1.44e-  1
#>  6 factor(state)California   -0.108     0.0810    -1.34  1.82e-  1
#>  7 factor(state)Colorado     -0.696     0.0810    -8.59  1.14e- 16
#>  8 factor(state)Connecticut  -0.785     0.0810    -9.68  2.08e- 20
#>  9 factor(state)Delaware     -0.547     0.0810    -6.75  4.18e- 11
#> 10 factor(state)Florida      -0.251     0.0798    -3.14  1.76e-  3
#> # ℹ 51 more rows

Interpretation: The TWFE coefficient on post (0.0818) equals the weighted average of the Bacon decomposition estimates (0.0818), as the decomposition implies. Treated-versus-untreated comparisons carry about 91% of the total weight.

Visualizing the Decomposition (Figure 30.8)

library(ggplot2)

ggplot(df_bacon) +
  aes(
    x = weight,
    y = estimate,
    color = type
  ) +
  geom_point() +
  labs(
    x = "Weight",
    y = "Estimate",
    color = "Comparison Type"
  ) +
  causalverse::ama_theme()

Figure description: scatter plot of the 2×2 DiD estimates (y-axis) against their weights (x-axis), colored by comparison type (earlier vs. later treated, later vs. earlier treated, treated vs. untreated). Most comparisons cluster near zero weight, but a few treated-vs-untreated points have large weights and large positive estimates, indicating that comparisons with untreated units drive much of the overall effect.

Figure 30.8: Decomposition of Treatment Effects by Comparison Type and Weight

Insight: This plot shows the contribution of each 2×2 DiD comparison, highlighting how estimates with large weights dominate the overall TWFE coefficient.

Interpretation and Practical Implications

Purpose: Decomposes the TWFE DiD estimate into a weighted average of all two-group, two-period comparisons.
Insight: Reveals how much each comparison contributes to the overall estimate and whether any rely on problematic controls or produce misleading effects.
Implementation:
- Identify subgroups by treatment timing.
- Compute DiD for each 2×2 comparison (early vs. late, late vs. never, etc.).
- Evaluate how these contribute to the final TWFE estimate.

When time-varying covariates are included, part of the identifying variation comes from within treatment-timing groups, so the TWFE coefficient is no longer a pure weighted average of the 2×2 timing comparisons; the decomposition then contains an additional within-group component (Goodman-Bacon 2021). Table 30.6 summarizes the comparison types.

Table 30.6: Goodman-Bacon Comparison Types
Comparison Type	Description	Common Issue
Treated vs. Never	Clean comparisons if never-treated units exist	Often reliable
Early vs. Late	Later (not-yet-treated) group is control in earlier period	Valid under parallel trends
Late vs. Early	Early (already-treated) group is control in later period	Biased if effects change over time; may flip the sign
Treated vs. Treated	Within-treatment variation by timing	Sensitive to dynamics

30.6.4 Remedies for TWFE’s Shortcomings

This section outlines alternative estimators and design-based approaches that handle heterogeneous treatment effects, staggered adoption (Baker, Larcker, and Wang 2022), and dynamic treatment effects better than standard TWFE (see also Modern Estimators for Staggered Adoption).

Group-Time Average Treatment Effects

Callaway and Sant’Anna (2021) propose a two-step approach:

Group-time treatment effects: For each cohort \(g\) (units first treated in period \(g\)) and each period \(t\), estimate the average treatment effect \(ATT(g,t)\) by comparing the cohort with never-treated (or not-yet-treated) units.
Aggregate: Combine the group-time effects into summary measures (e.g., by cohort or by event time), using a bootstrap procedure for inference that accounts for autocorrelation and clustering.

Advantages: Allows for heterogeneous treatment effects across groups and over time; compares treated groups only with never-treated or not-yet-treated units.
Implementation: did package in R.
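
To make the first step concrete, the sketch below computes unconditional group-time effects by hand with never-treated controls: the outcome change from period \(g-1\) to \(t\) for cohort \(g\), minus the same change for never-treated units. The noise-free data and the helper `att_gt_manual()` are assumptions for illustration. This is not the did package's implementation, which also handles covariates, not-yet-treated controls, and inference (via `did::att_gt()` and `did::aggte()`).

```r
# Noise-free staggered panel with never-treated units (all names illustrative)
cs <- expand.grid(unit = 1:9, time = 1:5)
cs$g <- c(3, 3, 3, 4, 4, 4, 0, 0, 0)[cs$unit]   # first treated period; 0 = never
cs$W <- as.integer(cs$g > 0 & cs$time >= cs$g)
# Assumed effect: differs by cohort and grows with exposure
cs$tau <- cs$W * (cs$g - 2) * (cs$time - cs$g + 1)
cs$Y <- cs$unit / 10 + 0.5 * cs$time + cs$tau

# ATT(g, t) with never-treated controls and base period g - 1
att_gt_manual <- function(data, g, t) {
  change <- function(units) {
    y_t   <- data$Y[data$unit %in% units & data$time == t]
    y_ref <- data$Y[data$unit %in% units & data$time == g - 1]
    mean(y_t - y_ref)
  }
  treated <- unique(data$unit[data$g == g])
  never   <- unique(data$unit[data$g == 0])
  change(treated) - change(never)
}

res <- expand.grid(g = c(3, 4), t = 1:5)
res <- res[res$t >= res$g, ]
res$att <- mapply(function(g, t) att_gt_manual(cs, g, t), res$g, res$t)
res
```

Because the simulated data satisfy parallel trends exactly, each \(ATT(g,t)\) matches the effect built into the data. No already-treated unit ever serves as a control, which is what separates this approach from pooled TWFE.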

Event-Study Design with Cohort-Specific Estimates

L. Sun and Abraham (2021) build on Callaway and Sant’Anna (2021) to handle event-study settings:

Lags and Leads: Capture dynamic treatment effects by including time lags and leads relative to the event (treatment).
Cohort-Specific Estimates: Estimate separate paths of outcomes for each cohort, controlling for other cohorts carefully.
Interaction-Weighted Estimator: Adjusts for differences in when treatment began.
Implementation: fixest package in R.
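
A minimal sketch of the interaction-weighted estimator using fixest's `sunab()` helper on `base_stagg`, a simulated staggered-adoption dataset shipped with the package. Clustering by `id` is an assumption of this sketch, not a recommendation from the source.

```r
library(fixest)

# Simulated staggered-adoption data bundled with fixest
data(base_stagg)

# sunab(cohort, period) creates cohort-by-relative-period interactions and
# aggregates them into interaction-weighted event-study coefficients
est_sa <- feols(y ~ x1 + sunab(year_treated, year) | id + year,
                data = base_stagg, cluster = ~id)

summary(est_sa, agg = "att")   # single aggregated post-treatment effect
iplot(est_sa)                  # event-study plot of the aggregated coefficients
```

The default summary reports one coefficient per relative period, while `agg = "att"` collapses the post-treatment periods into a single average effect.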

Panel Match DiD Estimator with In-and-Out Treatment Conditions

Imai and Kim (2021) develop methods allowing units to switch in and out of treatment:

Matching to create a weighted version of TWFE, addressing some of the bias from heterogeneous effects.
Implementation: wfe and PanelMatch R packages.

Two-Stage Difference-in-Differences (DiD2S)

Gardner (2022) proposes two-stage DiD:

Idea: Estimate the unit and time fixed effects from untreated observations only, remove them from the outcome, and then regress the residualized outcome on treatment in a second stage.
Strength: Handles heterogeneous treatment effects well, especially when never-treated units are present.
Implementation: did2s R package.
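
The two stages can be sketched by hand in base R. The noise-free data and variable names are assumptions, and the naive second-stage standard errors from `lm()` ignore first-stage estimation error, which the did2s package is designed to correct.

```r
# Noise-free panel with early, late, and never-treated units (illustrative)
ts <- expand.grid(unit = 1:9, time = 1:5)
ts$g <- c(3, 3, 3, 4, 4, 4, 0, 0, 0)[ts$unit]   # first treated period; 0 = never
ts$W <- as.integer(ts$g > 0 & ts$time >= ts$g)
ts$tau <- ts$W * (ts$time - ts$g + 1)            # effect grows with exposure
ts$Y <- ts$unit / 10 + 0.5 * ts$time + ts$tau
ts$unit_f <- factor(ts$unit)
ts$time_f <- factor(ts$time)

# Stage 1: unit and time effects estimated from untreated observations only
stage1 <- lm(Y ~ unit_f + time_f, data = ts[ts$W == 0, ])
ts$Y_resid <- ts$Y - predict(stage1, newdata = ts)

# Stage 2: regress the residualized outcome on treatment
stage2 <- lm(Y_resid ~ 0 + W, data = ts)

two_stage <- unname(coef(stage2)["W"])
true_att  <- mean(ts$tau[ts$W == 1])
twfe      <- unname(coef(lm(Y ~ W + unit_f + time_f, data = ts))["W"])
c(two_stage = two_stage, true_att = true_att, twfe = twfe)
```

In this design the two-stage estimate reproduces the average effect on treated observations, while the one-step TWFE coefficient falls below it because it puts more weight on early-exposure periods, when the effect is small. The first stage needs every unit and every period to appear among the untreated observations, which is why never-treated units help.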

Switching DiD Estimator

If a study has never-treated units, Clément De Chaisemartin and d’Haultfoeuille (2020) suggest a switching DiD estimator to recover the average treatment effect.
Caveat: This approach still fails to detect heterogeneity if treatment effects vary with exposure length (L. Sun and Shapiro 2022).

Design-Based Approaches: Arkhangelsky et al. (2024) offer further refinements that incorporate inverse probability weighting.
Goal: Improve balance and reduce bias from non-random treatment timing.

Stacked DiD (Simpler but Biased)
- Build stacked datasets for each treatment cohort, running separate regressions for each “event window.”
- This approach is simpler but can still carry biases if the underlying assumptions are violated (Gormley and Matsa 2011; Cengiz et al. 2019; Deshpande and Li 2019).
Doubly Robust Difference-in-Differences Estimators (DR-DID) (Sant’Anna and Zhao 2020)
- DR-DID estimators combine outcome regression and propensity score weighting to identify treatment effects, remaining consistent if either model is correctly specified.
- They achieve local efficiency under joint correctness and can be applied to both panel and repeated cross-section data.
Nonlinear Difference-in-Differences
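
The stacked approach listed above can be sketched as follows. For each cohort, keep an event window around its adoption date together with "clean" controls (never-treated units, or units not treated until after the window), stack the sub-datasets, and interact the fixed effects with the stack identifier. The data, the window length, and the helper `make_stack()` are assumptions for illustration.

```r
# Noise-free panel: cohorts adopt in periods 3 and 5; units 7-9 never adopt
st <- expand.grid(unit = 1:9, time = 1:6)
st$g <- c(3, 3, 3, 5, 5, 5, 0, 0, 0)[st$unit]
st$W <- as.integer(st$g > 0 & st$time >= st$g)
# Assumed effects: 1 for the first cohort, 3 for the second
st$Y <- st$unit / 10 + 0.5 * st$time + st$W * ifelse(st$g == 3, 1, 3)

# One sub-dataset per cohort: event window plus clean controls only
make_stack <- function(cohort, data, window = 1) {
  in_window  <- data$time >= cohort - window & data$time <= cohort + window
  clean_ctrl <- data$g == 0 | data$g > cohort + window
  sub <- data[in_window & (data$g == cohort | clean_ctrl), ]
  sub$stack <- cohort
  sub
}
stacked <- do.call(rbind, lapply(c(3, 5), make_stack, data = st))

# Unit-by-stack and time-by-stack fixed effects
fit_stack <- lm(Y ~ W + interaction(unit, stack, drop = TRUE) +
                  interaction(time, stack, drop = TRUE), data = stacked)
tau_stack <- unname(coef(fit_stack)["W"])
tau_stack
```

Already-treated units never act as controls, but the single coefficient is still a variance-weighted blend of the cohort-specific effects (here between 1 and 3), which is one sense in which stacking can remain biased for the average effect. Because units can appear in several stacks, standard errors would normally be clustered by unit.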

30.6.5 Best Practices and Recommendations

Below are practical guidelines for deciding when to use TWFE and how to diagnose or address potential bias.

When is TWFE Appropriate?
- Single Treatment Period: TWFE DiD works well if there is only one treatment period for all treated units (no variation in timing).
- Homogeneous Effects: If strong theoretical or empirical reasons suggest constant treatment effects across cohorts and over time, TWFE remains a reasonable choice.
Diagnosing and Addressing Bias with Staggered Adoption
- Plot Treatment Timing: Examine the distribution of treatment timing across units. If treatment adoption is highly staggered, TWFE is likely to produce biased estimates.
- Decomposition Methods: Use the Goodman-Bacon Decomposition (Goodman-Bacon 2021) to see how TWFE pools comparisons and how much weight falls on comparisons that use already-treated units as controls. If decomposition is infeasible (e.g., unbalanced panels), the share of never-treated units can indicate potential bias severity.
  - Decomposes the TWFE DiD estimate into two-group, two-period comparisons.
  - Identifies comparisons that use already-treated units as controls, which can lead to negative weighting of treatment effects and biased estimates.
  - Helps determine the influence of specific groups on the overall estimate.
- Discuss Heterogeneity: Explicitly state the likelihood of treatment effect heterogeneity; incorporate it into the research design.
Event-Study Specifications within TWFE
- Avoid Arbitrary Binning: Do not collapse multiple time periods into a single bin unless you can justify homogeneous effects within that bin.
- Full Relative-Time Indicators: Include fully flexible relative-time indicators and justify the reference period (usually \(k = -1\), the period just prior to treatment).
- Beware of Multicollinearity: Including leads and lags can cause multicollinearity and artificially produce significant “pre-trends.”
- Drop the Right Periods: If all units eventually get treated, dropping post-treatment periods accidentally can bias results.
Consider Alternative Estimators

References

Arkhangelsky, Dmitry, Guido W Imbens, Lihua Lei, and Xiaoman Luo. 2024. “Design-Robust Two-Way-Fixed-Effects Regression for Panel Data.” Quantitative Economics 15 (4): 999–1034.

Baker, Andrew C, David F Larcker, and Charles CY Wang. 2022. “How Much Should We Trust Staggered Difference-in-Differences Estimates?” Journal of Financial Economics 144 (2): 370–95.

Borusyak, Kirill, Xavier Jaravel, and Jann Spiess. 2021. “Revisiting Event Study Designs: Robust and Efficient Estimation.” arXiv Preprint arXiv:2108.12419.

Callaway, Brantly, and Pedro HC Sant’Anna. 2021. “Difference-in-Differences with Multiple Time Periods.” Journal of Econometrics 225 (2): 200–230.

Cengiz, Doruk, Arindrajit Dube, Attila Lindner, and Ben Zipperer. 2019. “The Effect of Minimum Wages on Low-Wage Jobs.” The Quarterly Journal of Economics 134 (3): 1405–54.

De Chaisemartin, Clément, and Xavier d’Haultfoeuille. 2020. “Two-Way Fixed Effects Estimators with Heterogeneous Treatment Effects.” American Economic Review 110 (9): 2964–96.

Deshpande, Manasi, and Yue Li. 2019. “Who Is Screened Out? Application Costs and the Targeting of Disability Programs.” American Economic Journal: Economic Policy 11 (4): 213–48.

Gardner, John. 2022. “Two-Stage Differences in Differences.” arXiv Preprint arXiv:2207.05943.

Goodman-Bacon, Andrew. 2021. “Difference-in-Differences with Variation in Treatment Timing.” Journal of Econometrics 225 (2): 254–77.

Gormley, Todd A, and David A Matsa. 2011. “Growing Out of Trouble? Corporate Responses to Liability Risk.” The Review of Financial Studies 24 (8): 2781–821.

Imai, Kosuke, and In Song Kim. 2019. “When Should We Use Unit Fixed Effects Regression Models for Causal Inference with Longitudinal Data?” American Journal of Political Science 63 (2): 467–90.

———. 2021. “On the Use of Two-Way Fixed Effects Regression Models for Causal Inference with Panel Data.” Political Analysis 29 (3): 405–15.

Sant’Anna, Pedro HC, and Jun Zhao. 2020. “Doubly Robust Difference-in-Differences Estimators.” Journal of Econometrics 219 (1): 101–22.

Stevenson, Betsey, and Justin Wolfers. 2006. “Bargaining in the Shadow of the Law: Divorce Laws and Family Distress.” The Quarterly Journal of Economics 121 (1): 267–88.

Sun, Liyang, and Sarah Abraham. 2021. “Estimating Dynamic Treatment Effects in Event Studies with Heterogeneous Treatment Effects.” Journal of Econometrics 225 (2): 175–99.

Sun, Liyang, and Jesse M Shapiro. 2022. “A Linear Panel Model with Heterogeneous Coefficients and Variation in Exposure.” Journal of Economic Perspectives 36 (4): 193–204.