18.8 Interaction Debate: Binning Estimators vs. Generalized Additive Models
While the classical moderation framework, as outlined above, is the dominant approach in applied research, it assumes that the specified interaction term fully captures how the relationship between \(X\) and \(Y\) changes with \(M\). In practice, this approach relies on strong functional form assumptions, particularly linearity in both the main effects and the interaction. If these assumptions are violated, the estimated interaction effect may be biased or misleading.
It is precisely these concerns that have motivated a recent and influential methodological debate about how interactions should be estimated and interpreted in observational data. At the center of this debate are two competing approaches: binning-based estimators, which aim to relax functional form assumptions through localized estimation, and generalized additive models (GAMs), which model nonlinear relationships directly. Understanding this debate is critical, because the choice of method can fundamentally change the conclusions we draw from moderation analyses.
Imagine you’re a business researcher studying how advertising effectiveness varies with market competition. You run a regression with an interaction term between advertising spend and competitive intensity. Your results show statistical significance, but are they valid? This question lies at the heart of a methodological debate that has profound implications for how we analyze interactions in observational data.
In 2019, political scientists Jens Hainmueller, Jonathan Mummolo, and Yiqing Xu (HMX) published a highly influential paper proposing the “binning estimator” as a solution to problems with multiplicative interaction models (Hainmueller, Mummolo, and Xu 2019). Their approach has been widely adopted, accumulating over 1,200 citations as of this writing. In 2024, however, Uri Simonsohn challenged this method, arguing that it can produce severely biased results when the underlying relationships are nonlinear, a condition that is arguably the norm rather than the exception in real-world data (Simonsohn 2024).
This debate affects thousands of studies across business, economics, and social sciences. The choice between methods can determine whether you conclude that:
- Marketing effectiveness increases with firm size (or doesn’t)
- Employee training has differential effects across experience levels (or doesn’t)
- Product quality matters more in competitive markets (or doesn’t)
18.8.1 The Stakes
Consider that approximately 71% of articles in top journals test for interactions (Simonsohn 2024). If the standard methods are flawed, this represents a massive potential for incorrect conclusions. The debate centers on a fundamental question: What exactly are we trying to estimate when we probe an interaction?
- Team HMX (Hainmueller, Mummolo, and Xu, joined by Liu and Liu) (Hainmueller, Mummolo, and Xu 2019; Hainmueller et al. 2025):
  - Advocate for the binning estimator and kernel methods
  - Focus on flexible estimation without strong functional form assumptions
  - Emphasize practical diagnostics for applied researchers
- Team Simonsohn (Simonsohn 2024; blog posts 1 and 2):
  - Champions Generalized Additive Models (GAMs)
  - Argues that binning violates the ceteris paribus principle
  - Emphasizes the importance of handling nonlinearities correctly
Before diving into the debate, let’s establish a solid foundation. An interaction effect occurs when the relationship between two variables depends on the value of a third variable.
The standard linear interaction model is the workhorse model in social sciences:
\[Y = \beta_0 + \beta_1 D + \beta_2 X + \beta_3 D \times X + \epsilon\]
Where:
- \(Y\) = outcome variable (e.g., sales revenue)
- \(D\) = treatment/focal variable (e.g., advertising spend)
- \(X\) = moderator (e.g., market competition)
- \(D \times X\) = interaction term
The marginal effect of \(D\) on \(Y\) is:
\[\frac{\partial Y}{\partial D} = \beta_1 + \beta_3 X\]
This tells us that the effect of advertising on sales is \(\beta_1\) when competition is zero, and changes by \(\beta_3\) for each unit increase in competition.
For example, consider a retail business studying how price changes affect sales, moderated by customer loyalty status:
\[\text{Sales} = \beta_0 + \beta_1 \text{Price} + \beta_2 \text{Loyalty} + \beta_3 (\text{Price} \times \text{Loyalty}) + \epsilon\]
If \(\beta_3\) is positive, it suggests loyal customers are less price-sensitive.
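To make this concrete, here is a minimal sketch in Python (all data and coefficient values are simulated assumptions, not estimates from any real study) that fits the model with statsmodels and recovers the marginal effect of price at each loyalty level:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 1_000

# Hypothetical data: binary loyalty status, continuous price
loyalty = rng.integers(0, 2, n)
price = rng.uniform(5, 15, n)
# Assumed DGP: loyal customers are less price-sensitive (beta_3 = +2)
sales = 100 - 4 * price + 10 * loyalty + 2 * price * loyalty + rng.normal(0, 5, n)

df = pd.DataFrame({"sales": sales, "price": price, "loyalty": loyalty})
fit = smf.ols("sales ~ price * loyalty", data=df).fit()

# Marginal effect of price: beta_1 + beta_3 * loyalty
b = fit.params
print("dSales/dPrice, non-loyal:", round(b["price"], 2))                       # ~ -4
print("dSales/dPrice, loyal:    ", round(b["price"] + b["price:loyalty"], 2))  # ~ -2
```

With the assumed coefficients, the price slope is about \(-4\) for non-loyal customers and \(-2\) for loyal ones, matching \(\beta_1 + \beta_3 \times \text{Loyalty}\).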
However, the standard model assumes all relationships are linear. This means:
- The effect of price on sales changes at a constant rate with loyalty
- The relationships don’t curve or bend
- Effects are symmetric (increases and decreases have opposite but equal effects)
But what if these assumptions are violated?
18.8.2 The Core Problem: When Linearity Fails
The real world is frustratingly nonlinear. Consider these business realities:
Common Nonlinearities in Business:
- Diminishing Returns: Marketing effectiveness often follows a logarithmic pattern
- Threshold Effects: Quality improvements may not matter until they cross a perceptibility threshold
- Saturation Points: Customer satisfaction can’t exceed 100%
- Network Effects: Value may increase exponentially with user base size
The Three Problems Identified
- Problem 1 (HMX): Researchers often probe interactions at extreme or impossible values of the moderator.
- Problem 2 (HMX): The interaction itself may be nonlinear.
- Problem 3 (Simonsohn): When predictors are correlated and have nonlinear effects, the interaction term captures these nonlinearities, leading to false positives.
To understand problem 3, consider the true model: \[Y = D^2 + \epsilon\]
where \(D\) and \(X\) are correlated (\(r = 0.5\)) but \(X\) does not affect \(Y\) at all.
If we estimate: \[Y = \beta_0 + \beta_1 D + \beta_2 X + \beta_3 D \times X + \epsilon\]
The interaction coefficient \(\beta_3\) will typically come out statistically significant even though there is no true interaction! This happens because:
- The omitted \(D^2\) term correlates with \(D \times X\) (due to the correlation between \(D\) and \(X\))
- The interaction term acts as a proxy for the missing nonlinearity
- We mistakenly conclude that the effect of \(D\) depends on \(X\)
Imagine studying whether employee training effectiveness depends on prior experience. If both training hours and experience affect productivity nonlinearly, and they’re correlated (more experienced employees often receive more training), you might falsely conclude that training works better for experienced employees when really you’re just capturing the nonlinear effect of training itself.
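The following short simulation (a sketch under the assumptions stated above: standardized, jointly normal \(D\) and \(X\) with \(r = 0.5\), and a purely quadratic effect of \(D\)) shows the false positive appearing and then vanishing once the omitted \(D^2\) term is included:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 5_000

# D and X correlated at r = 0.5; X plays no role in the outcome
cov = [[1.0, 0.5], [0.5, 1.0]]
D, X = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
Y = D**2 + rng.normal(0, 1, n)  # true model: Y = D^2 + noise

df = pd.DataFrame({"Y": Y, "D": D, "X": X})

# Misspecified linear interaction model: D:X soaks up the omitted D^2 term
wrong = smf.ols("Y ~ D * X", data=df).fit()
print(wrong.pvalues["D:X"])  # typically "significant"

# Adding the omitted quadratic term makes the spurious interaction vanish
right = smf.ols("Y ~ D * X + I(D**2)", data=df).fit()
print(right.pvalues["D:X"])  # typically not significant
```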
18.8.3 Binning Estimator Approach
HMX proposed the binning estimator as a practical solution. Here’s how it works:
- Split the moderator into bins (typically terciles: low, medium, high)
- Estimate separate regressions within each bin
- Compare effects across bins
For three bins, estimate: \[Y = \sum_{j=1}^{3} \{\mu_j + \alpha_j D + \eta_j (X-\bar{x}_j) + \beta_j(X-\bar{x}_j)D\}G_j + \epsilon\]
Where:
- \(G_j\) = indicator for bin \(j\)
- \(\bar{x}_j\) = median value of \(X\) in bin \(j\)
- \(\alpha_j\) = effect of \(D\) at the median of bin \(j\)
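HMX distribute this estimator in their interflex package for R; below is a simplified do-it-yourself sketch in Python with hypothetical column names. Because the bins partition the sample and every coefficient is bin-specific, fitting each bin separately reproduces the point estimates of the joint specification above (pooled standard errors differ slightly):

```python
import pandas as pd
import statsmodels.formula.api as smf

def binning_estimates(df, y="Y", d="D", x="X", n_bins=3):
    """Within-bin effect of d on y, with x centered at each bin's median."""
    df = df.assign(_bin=pd.qcut(df[x], q=n_bins, labels=False))
    out = {}
    for j, sub in df.groupby("_bin"):
        sub = sub.assign(Xc=sub[x] - sub[x].median())
        fit = smf.ols(f"{y} ~ {d} * Xc", data=sub).fit()
        out[j] = (fit.params[d], fit.bse[d])  # alpha_j and its s.e.
    return out
```

Applied to the simulated data from the previous sketch, the within-bin effects of \(D\) would differ across bins even though \(X\) never enters the data-generating process, which previews the critique below.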
Advantages Claimed by HMX
- Simplicity: Easy to implement and understand
- Flexibility: Doesn’t impose strict functional form
- Diagnostics: Reveals nonlinearities in the interaction
- Common Support: Only estimates effects where data exist
18.8.4 Simonsohn’s Critique
Simonsohn provides a devastating example that illustrates the core problem with mathematical precision. Consider PhD admissions, in which professors rate applicants:
The True Data Generating Process: \(\text{Rating} = \log(\text{GRE}) + \epsilon\)
Key Facts:
- Research experience does NOT enter the rating function
- But \(\text{Corr}(\text{GRE}, \text{Experience}) = 0.5\) (more experienced applicants tend to have higher GRE scores)
Researchers want to know: Does research experience moderate the GRE-rating relationship?
What the Linear Model Estimates: \(\text{Rating} = \beta_0 + \beta_1 \text{GRE} + \beta_2 \text{Experience} + \beta_3 (\text{GRE} \times \text{Experience}) + \epsilon\)
The researcher finds \(\beta_3 < 0\) and significant! Interpretation: “GRE matters less for experienced applicants.”
What the Binning Estimator Shows:
Let’s work through the math. Suppose:
- Low experience bin: Mean GRE = 400
- Medium experience bin: Mean GRE = 550
- High experience bin: Mean GRE = 700
Within each bin, the binning estimator approximately recovers the local slope of the true curve at that bin’s typical GRE value: \(\frac{\partial \text{Rating}}{\partial \text{GRE}} \big|_{\text{bin}} \approx \frac{d \log(\text{GRE})}{d\,\text{GRE}} \big|_{\text{mean GRE in bin}}\)
Since Rating = log(GRE), the true marginal effect is: \(\frac{\partial \text{Rating}}{\partial \text{GRE}} = \frac{1}{\text{GRE}}\)
Therefore:
- Low bin (GRE \(\approx\) 400): Marginal effect \(\approx 1/400 = 0.0025\)
- Medium bin (GRE \(\approx\) 550): Marginal effect \(\approx 1/550 = 0.0018\)
- High bin (GRE \(\approx\) 700): Marginal effect \(\approx 1/700 = 0.0014\)
The Spurious Finding:
The binning estimator shows a declining marginal effect across experience levels (0.0025 \(\to\) 0.0018 \(\to\) 0.0014), leading to the false conclusion that “GRE matters less for experienced applicants.”
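A quick simulation reproduces this spurious decline (a sketch with assumed scales: GRE uniform on 300-800, Experience constructed to correlate roughly 0.5 with GRE, and noise levels chosen purely for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 10_000

# Hypothetical scales: GRE uniform on [300, 800]; Experience is irrelevant
# to Rating but built to correlate (roughly r = 0.5) with GRE
gre = rng.uniform(300, 800, n)
z = (gre - gre.mean()) / gre.std()
experience = 0.5 * z + np.sqrt(1 - 0.25) * rng.normal(0, 1, n)
rating = np.log(gre) + rng.normal(0, 0.05, n)  # true DGP: Rating = log(GRE) + noise

df = pd.DataFrame({"rating": rating, "gre": gre, "exp": experience})
df["bin"] = pd.qcut(df["exp"], 3, labels=["low", "med", "high"])

for label, sub in df.groupby("bin", observed=True):
    slope = smf.ols("rating ~ gre", data=sub).fit().params["gre"]
    print(f"{label}: slope = {slope:.5f}  vs  1/mean(GRE) = {1 / sub['gre'].mean():.5f}")
# Slopes decline from the low- to the high-experience bin, even though
# experience never enters the rating function
```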
Why This Happens
- Omitted Variable Bias: The true model contains log(GRE). Expanding the logarithm (e.g., \(\log(1+u) = u - u^2/2 + u^3/3 - \dots\)) shows that it embeds quadratic and higher-order terms, which a purely linear specification omits
- Correlation Structure: Since Experience correlates with GRE, it also correlates with GRE²
- The Interaction Term as Proxy: The interaction GRE × Experience partially captures the omitted GRE² term
- Binning Doesn’t Help: Comparing across bins, we still have:
- Different average GRE levels in each bin
- The same nonlinear relationship within each bin
- A violation of ceteris paribus
18.8.5 Simonsohn’s Core Criticism
The binning estimator violates ceteris paribus. When comparing across bins, you’re changing both:
- The moderator value (intentionally)
- The average value of correlated predictors (unintentionally)
This confounding makes it impossible to isolate the true interaction effect.
18.8.6 Generalized Additive Models Alternative
Simonsohn advocates for GAMs as a superior alternative. Let’s understand what they are and why he believes they solve the problem.
A GAM extends the linear model by replacing linear terms with smooth functions:
Linear Model: \(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon\)
GAM: \(Y = \beta_0 + f_1(X_1) + f_2(X_2) + \epsilon\)
Where \(f_1\) and \(f_2\) are smooth functions estimated from the data.
GAMs can model interactions flexibly: \[Y = f_1(D) + f_2(X) + f_3(D, X) + \epsilon\]
Where \(f_3(D, X)\) captures any interaction beyond the main effects.
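Here is a minimal sketch of this structure in Python using the pygam library (Simonsohn’s own materials use R, where mgcv’s te() or ti() terms play the role of \(f_3\)); the data-generating process below is an assumption for illustration:

```python
import numpy as np
from pygam import LinearGAM, s, te

rng = np.random.default_rng(3)
n = 2_000
D = rng.uniform(-2, 2, n)
X = rng.uniform(-2, 2, n)
# Assumed DGP: nonlinear main effects plus a genuine interaction
y = np.sin(D) + X**2 + 0.5 * D * X + rng.normal(0, 0.3, n)

features = np.column_stack([D, X])

# f1(D) + f2(X) + f3(D, X): smooth main effects and a tensor interaction
gam = LinearGAM(s(0) + s(1) + te(0, 1)).fit(features, y)
gam.summary()
```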
How GAMs Work
- Basis Expansion: Each smooth function is represented as a weighted sum of basis functions (like splines)
- Penalized Estimation: A penalty prevents overfitting by controlling wiggliness
- Automatic Selection: The degree of smoothness is determined by the data
A smooth function in a GAM is represented as: \[f(x) = \sum_{k=1}^{K} \beta_k b_k(x)\]
Where:
- \(b_k(x)\) are basis functions (e.g., cubic splines)
- \(\beta_k\) are coefficients to be estimated
- \(K\) determines the maximum complexity
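To demystify the basis expansion, the following self-contained sketch builds a truncated-power cubic spline basis by hand and fits it by penalized least squares (a plain ridge penalty stands in for the curvature penalty that production GAM software uses):

```python
import numpy as np

def cubic_spline_basis(x, knots):
    """Truncated-power basis: 1, x, x^2, x^3, plus (x - k)^3_+ for each knot."""
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.clip(x - k, 0, None) ** 3 for k in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 300))
y = np.log1p(x) + rng.normal(0, 0.1, 300)

knots = np.quantile(x, np.linspace(0.1, 0.9, 8))
B = cubic_spline_basis(x, knots)  # the b_k(x) columns

# Penalized least squares: beta = (B'B + lam * I)^(-1) B'y
lam = 1.0
beta = np.linalg.solve(B.T @ B + lam * np.eye(B.shape[1]), B.T @ y)
f_hat = B @ beta  # f(x) = sum_k beta_k * b_k(x)
```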
Advantages of GAMs
- Flexibility: Can capture any smooth relationship
- Ceteris Paribus: Properly isolates effects
- No Binning: Uses all data efficiently
- Automatic Complexity: Data determines the functional form
Simonsohn proposes “GAM Simple Slopes” for probing interactions:
- Fit the GAM with interaction
- Calculate predicted values at specific moderator values
- Plot the relationship between \(D\) and \(Y\) at each moderator value
This maintains ceteris paribus by holding other variables constant.
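A sketch of the procedure using pygam (the simulated data-generating process and the three moderator values are illustrative assumptions; the slope in \(D\) is approximated by a finite difference on the fitted surface while \(X\) is held fixed):

```python
import numpy as np
from pygam import LinearGAM, s, te

rng = np.random.default_rng(3)
D = rng.uniform(-2, 2, 2_000)
X = rng.uniform(-2, 2, 2_000)
y = np.sin(D) + X**2 + 0.5 * D * X + rng.normal(0, 0.3, 2_000)
gam = LinearGAM(s(0) + s(1) + te(0, 1)).fit(np.column_stack([D, X]), y)

# Simple slopes: derivative of the fitted surface in D, holding X fixed
d_grid = np.linspace(-2, 2, 100)
eps = 1e-3
for x_val in (-1.5, 0.0, 1.5):  # "low", "medium", "high" moderator values
    lo = np.column_stack([d_grid - eps, np.full_like(d_grid, x_val)])
    hi = np.column_stack([d_grid + eps, np.full_like(d_grid, x_val)])
    slope = (gam.predict(hi) - gam.predict(lo)) / (2 * eps)
    print(f"X = {x_val:+.1f}: mean dY/dD over the grid = {slope.mean():.2f}")
```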
18.8.7 Mathematical Foundations of the Disagreement
The core disagreement is philosophical: What is the estimand (target of estimation)?
Conditional Marginal Effect (CME)
- What HMX target: \[\theta(x) = E\left[\frac{\partial Y_i(d)}{\partial d} \bigg| X_i = x\right]\]
This averages over the distribution of \(D\) (and any other covariates \(Z\)) among units with \(X = x\).
Conditional Average Partial Effect (CAPE)
- What Simonsohn argues GAMs estimate: \[\rho(d, x) = E\left[\frac{\partial Y_i(d)}{\partial d} \bigg| D_i = d, X_i = x\right]\]
This conditions on specific values of \(D\).
The Fundamental Difference
HMX argue their estimand answers: “What’s the average effect of \(D\) for units with \(X = x\)?”
Simonsohn argues researchers want: “What’s the effect of \(D\) at \(X = x\), holding all else constant?”
A Business Translation
HMX Estimand: “What’s the average effect of price changes for stores in high-competition markets?” (Includes the fact that high-competition stores might have different pricing patterns)
Simonsohn Estimand: “If we took a store and changed only its price, how would the effect differ in high vs. low competition?” (Pure ceteris paribus effect)
18.8.8 Mathematical Example
True model: \(Y = D^2 - 0.5D + \epsilon\)
With \(D\) and \(X\) standardized and \(\text{Corr}(D, X) = 0.5\):
HMX’s CME: \(\theta(x) = x - 0.5\) (Not zero! Increases with \(X\))
Why? As \(X\) increases, the distribution of \(D\) shifts up. Since \(\frac{\partial Y}{\partial D} = 2D - 0.5\) increases with \(D\), the average effect increases with \(X\).
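To make the step explicit, assume (as above) that \(D\) and \(X\) are standardized and jointly normal, so that \(E[D \mid X = x] = 0.5x\). Then:

\[\theta(x) = E\left[\frac{\partial Y}{\partial D} \,\middle|\, X = x\right] = E\left[2D - 0.5 \mid X = x\right] = 2(0.5x) - 0.5 = x - 0.5\]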
Simonsohn’s Interpretation: There’s no interaction because \(X\) doesn’t appear in the true model.
18.8.9 When to Use Each Method
Use Traditional Linear Models When:
- You have strong theoretical reasons to expect linear relationships
- Sample size is small (< 200)
- Interpretability is paramount
- You’ve tested and confirmed linearity assumptions

Use Binning Estimator When:
- You have experimental data (random assignment)
- You need a quick diagnostic tool
- You’re presenting to non-technical audiences
- You want a robustness check, not a primary analysis

Use GAMs When:
- You have observational data
- Sample size is adequate (> 500 preferred)
- You suspect nonlinear relationships
- You need to maintain ceteris paribus
18.8.10 Best Practices for Any Method
- Always visualize your data first
- Test for nonlinearity
- Check correlations among predictors
- Consider theoretical expectations:
  - Are diminishing returns plausible?
  - Could there be threshold effects?
  - Is the scale bounded?
- Report multiple approaches:
  - Primary analysis with GAM
  - Robustness check with binning
  - Show how conclusions change (or don’t)
18.8.11 A Decision Tree
Start: Do I need to test an interaction?
│
├─ Is at least one variable randomly assigned?
│ ├─ Yes → Experimental/Quasi-experimental
│ │ ├─ Sample size < 200 → Use linear model (with caution)
│ │ ├─ Sample size 200-500 → Use binning as diagnostic + GAM
│ │ └─ Sample size > 500 → Use GAM simple slopes
│ │
│ └─ No → Observational Data
│ │
│ ├─ Are predictors likely correlated?
│ │ ├─ Yes (usually) → Strong nonlinearity concern
│ │ │ ├─ Can implement GAM? → Use GAM
│ │ │ └─ Cannot implement GAM? → Add quadratic controls
│ │ │
│ │ └─ No (rare) → Proceed with caution
│ │
│ └─ Check for nonlinearity
│ ├─ Theory suggests nonlinearity → Use GAM
│ ├─ Bounded scales → Use GAM
│ └─ Previous literature → Use GAM