35.9 Selection on Observables
In observational studies, treatment assignment is typically not randomized. This poses a challenge when estimating causal effects, as treated and control groups may differ systematically. A central assumption that allows us to estimate causal effects from such data is selection on observables, also known as unconfoundedness or conditional independence.
Suppose we observe a binary treatment indicator Ti∈{0,1} and an outcome Yi. Each unit i has two potential outcomes:
- Yi(1): outcome if treated
- Yi(0): outcome if untreated
However, only one of these outcomes is observed for each unit. The average treatment effect on the treated (ATT) is:
\text{ATT} = E[Y(1) - Y(0) \mid T = 1]
To identify this from data, we invoke the conditional independence assumption:
(Y(0), Y(1)) \perp T \mid X
This assumption means that after controlling for covariates X, treatment is as good as randomly assigned. A secondary assumption is overlap:
0 < P(T = 1 \mid X = x) < 1 \quad \text{for all } x
This ensures that for every covariate profile, there is a positive probability of being both treated and untreated.
Matching attempts to approximate the conditions of a randomized experiment by creating a comparison group that is similar to the treated group in terms of observed covariates. Instead of relying solely on model-based adjustment (e.g., regression), matching balances the covariate distribution across treatment groups before estimation.
35.9.1 Matching with MatchIt
We demonstrate the matching procedure using the `lalonde` dataset, a classic example in the causal inference literature, which investigates the effect of job training on subsequent earnings. We focus on estimating the effect of the treatment (`treat`) on earnings in 1978 (`re78`), conditional on covariates.
Step 1: Planning the Analysis
Before conducting matching, several strategic decisions must be made:
- Estimand: Do you want the ATT (effect on the treated), the ATE (effect in the population), or the ATC (effect on the controls)? Matching typically targets the ATT.
- Covariate selection: Include only pre-treatment variables that are potential confounders, i.e., variables that affect both treatment assignment and the outcome (Austin 2011; T. J. VanderWeele 2019).
- Distance measure: Choose how to quantify similarity between units (e.g., propensity score, Mahalanobis distance).
- Matching method: Determine the method (e.g., nearest neighbor, full matching, genetic matching).
For our demonstration, we focus on ATT using propensity score matching.
Step 2: Assessing Initial Imbalance
We first assess imbalance between treatment and control groups before matching.
# Estimate propensity scores with logistic regression
library(MatchIt)

m.out0 <- matchit(
  treat ~ age + educ + race + married + nodegree + re74 + re75,
  data = MatchIt::lalonde,
  method = NULL,     # no matching; only estimate propensity scores
  distance = "glm"
)
# Summary of balance statistics before matching
summary(m.out0)
#>
#> Call:
#> matchit(formula = treat ~ age + educ + race + married + nodegree +
#> re74 + re75, data = MatchIt::lalonde, method = NULL, distance = "glm")
#>
#> Summary of Balance for All Data:
#> Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
#> distance 0.5774 0.1822 1.7941 0.9211 0.3774
#> age 25.8162 28.0303 -0.3094 0.4400 0.0813
#> educ 10.3459 10.2354 0.0550 0.4959 0.0347
#> raceblack 0.8432 0.2028 1.7615 . 0.6404
#> racehispan 0.0595 0.1422 -0.3498 . 0.0827
#> racewhite 0.0973 0.6550 -1.8819 . 0.5577
#> married 0.1892 0.5128 -0.8263 . 0.3236
#> nodegree 0.7081 0.5967 0.2450 . 0.1114
#> re74 2095.5737 5619.2365 -0.7211 0.5181 0.2248
#> re75 1532.0553 2466.4844 -0.2903 0.9563 0.1342
#> eCDF Max
#> distance 0.6444
#> age 0.1577
#> educ 0.1114
#> raceblack 0.6404
#> racehispan 0.0827
#> racewhite 0.5577
#> married 0.3236
#> nodegree 0.1114
#> re74 0.4470
#> re75 0.2876
#>
#> Sample Sizes:
#> Control Treated
#> All 429 185
#> Matched 429 185
#> Unmatched 0 0
#> Discarded 0 0
This summary provides:
- Standardized mean differences
- Variance ratios
- Propensity score distributions
These diagnostics help us understand the extent of covariate imbalance.
Step 3: Implementing Matching
- Nearest Neighbor Matching (1:1 without Replacement)
m.out1 <- matchit(
  treat ~ age + educ + race + married + nodegree + re74 + re75,
  data = MatchIt::lalonde,
  method = "nearest",
  distance = "glm"
)
Matching is based on estimated propensity scores. Each treated unit is matched to the closest control unit.
Assess Balance After Matching
summary(m.out1, un = FALSE) # only show post-matching stats
#>
#> Call:
#> matchit(formula = treat ~ age + educ + race + married + nodegree +
#> re74 + re75, data = lalonde, method = "nearest", distance = "glm")
#>
#> Summary of Balance for Matched Data:
#> Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
#> distance 0.5774 0.3629 0.9739 0.7566 0.1321
#> age 25.8162 25.3027 0.0718 0.4568 0.0847
#> educ 10.3459 10.6054 -0.1290 0.5721 0.0239
#> raceblack 0.8432 0.4703 1.0259 . 0.3730
#> racehispan 0.0595 0.2162 -0.6629 . 0.1568
#> racewhite 0.0973 0.3135 -0.7296 . 0.2162
#> married 0.1892 0.2108 -0.0552 . 0.0216
#> nodegree 0.7081 0.6378 0.1546 . 0.0703
#> re74 2095.5737 2342.1076 -0.0505 1.3289 0.0469
#> re75 1532.0553 1614.7451 -0.0257 1.4956 0.0452
#> eCDF Max Std. Pair Dist.
#> distance 0.4216 0.9740
#> age 0.2541 1.3938
#> educ 0.0757 1.2474
#> raceblack 0.3730 1.0259
#> racehispan 0.1568 1.0743
#> racewhite 0.2162 0.8390
#> married 0.0216 0.8281
#> nodegree 0.0703 1.0106
#> re74 0.2757 0.7965
#> re75 0.2054 0.7381
#>
#> Sample Sizes:
#> Control Treated
#> All 429 185
#> Matched 185 185
#> Unmatched 244 0
#> Discarded 0 0
# Visual diagnostic: jitter plot of propensity scores
plot(m.out1, type = "jitter", interactive = FALSE)
Interpretation: Good matches will show overlapping distributions of covariates across groups, and standardized differences should be below 0.1 in absolute value.
- Full Matching
Allows many-to-one or one-to-many matches, minimizing overall distance.
m.out2 <- matchit(
  treat ~ age + educ + race + married + nodegree + re74 + re75,
  data = MatchIt::lalonde,
  method = "full",
  distance = "glm",
  link = "probit"
)
summary(m.out2, un = FALSE)
#>
#> Call:
#> matchit(formula = treat ~ age + educ + race + married + nodegree +
#> re74 + re75, data = MatchIt::lalonde, method = "full", distance = "glm",
#> link = "probit")
#>
#> Summary of Balance for Matched Data:
#> Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
#> distance 0.5773 0.5764 0.0045 0.9949 0.0043
#> age 25.8162 25.5347 0.0393 0.4790 0.0787
#> educ 10.3459 10.5381 -0.0956 0.6192 0.0253
#> raceblack 0.8432 0.8389 0.0119 . 0.0043
#> racehispan 0.0595 0.0492 0.0435 . 0.0103
#> racewhite 0.0973 0.1119 -0.0493 . 0.0146
#> married 0.1892 0.1633 0.0660 . 0.0259
#> nodegree 0.7081 0.6577 0.1110 . 0.0504
#> re74 2095.5737 2100.2150 -0.0009 1.3467 0.0314
#> re75 1532.0553 1561.4420 -0.0091 1.5906 0.0536
#> eCDF Max Std. Pair Dist.
#> distance 0.0486 0.0198
#> age 0.2742 1.2843
#> educ 0.0730 1.2179
#> raceblack 0.0043 0.0162
#> racehispan 0.0103 0.4412
#> racewhite 0.0146 0.3454
#> married 0.0259 0.4473
#> nodegree 0.0504 0.9872
#> re74 0.1881 0.8387
#> re75 0.1984 0.8240
#>
#> Sample Sizes:
#> Control Treated
#> All 429. 185
#> Matched (ESS) 50.76 185
#> Matched 429. 185
#> Unmatched 0. 0
#> Discarded 0. 0
- Exact Matching
Only matches units with exactly the same covariate values (usually categorical):
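A minimal sketch, restricted to the categorical covariates in `lalonde` (exact matching on continuous variables would leave few or no matches); the object name `m.out3` is illustrative:
m.out3 <- matchit(
  treat ~ race + married + nodegree,
  data = MatchIt::lalonde,
  method = "exact"
)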
- Optimal Matching
Minimizes the total distance between matched units across the sample.
m.out4 <- matchit(
  treat ~ age + educ + re74 + re75,
  data = MatchIt::lalonde,
  method = "optimal",
  ratio = 2
)
- Genetic Matching
Searches over weights assigned to covariates to optimize balance.
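A sketch via `MatchIt`'s interface to genetic matching (it calls the `Matching` and `rgenoud` packages under the hood); `pop.size = 50` is illustrative and should typically be much larger in practice:
m.out5 <- matchit(
  treat ~ age + educ + race + married + nodegree + re74 + re75,
  data = MatchIt::lalonde,
  method = "genetic",
  pop.size = 50
)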
Step 4: Estimating Treatment Effects
Once matching is complete, we use the matched data to estimate treatment effects.
library(lmtest)
library(sandwich)
# Extract matched data
matched_data <- match.data(m.out1)
# Estimate ATT with robust standard errors
model <- lm(re78 ~ treat + age + educ + race + re74 + re75,
            data = matched_data,
            weights = weights)
coeftest(model, vcov. = vcovCL, cluster = ~subclass)
#>
#> t test of coefficients:
#>
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -437.664937 1912.759171 -0.2288 0.819143
#> treat 1398.134870 723.745751 1.9318 0.054164 .
#> age -0.343085 39.256789 -0.0087 0.993032
#> educ 470.767350 147.892765 3.1832 0.001583 **
#> racehispan 1518.303924 1035.083141 1.4668 0.143287
#> racewhite 557.295853 897.121013 0.6212 0.534856
#> re74 0.017244 0.166298 0.1037 0.917470
#> re75 0.226076 0.165722 1.3642 0.173357
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The coefficient on `treat` is our estimate of the ATT. The weights and subclass clustering account for the matched design.
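An alternative recommended in recent `MatchIt` vignettes is g-computation via the `marginaleffects` package. A sketch targeting the ATT, assuming `marginaleffects` is installed, and reusing `model` and `matched_data` from above:
library(marginaleffects)

# Average treated-vs-control contrast over the treated units only (ATT),
# with cluster-robust variance by matched subclass
avg_comparisons(
  model,
  variables = "treat",
  vcov = ~subclass,
  newdata = subset(matched_data, treat == 1),
  wts = "weights"
)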
35.9.2 Reporting Standards
To ensure transparency and reproducibility, always report the following:
- Matching method (e.g., nearest neighbor, genetic)
- Distance metric (e.g., propensity score estimated via logistic regression)
- Covariates matched on and the justification for their inclusion
- Balance statistics (e.g., standardized mean differences before and after matching)
- Sample sizes: total, matched, unmatched, discarded
- Estimation model: whether treatment effects are estimated using regression adjustment, with or without weights
- Assumptions: especially unconfoundedness and overlap
35.9.3 Optimization-Based Matching via designmatch
For more advanced applications, the `designmatch` package provides matching methods based on combinatorial optimization.
Notable methods:
- `distmatch()`: distance-based matching with custom constraints
- `bmatch()`: bipartite matching using linear programming
- `cardmatch()`: cardinality matching for maximum matched sample size with balance constraints
- `profmatch()`: profile matching for stratified treatment allocation
- `nmatch()`: non-bipartite matching (e.g., in interference settings)
35.9.4 MatchingFrontier
As mentioned in `MatchIt`, you have to make a trade-off (the bias-variance trade-off) between balance and sample size. An automated procedure that optimizes this trade-off is implemented in `MatchingFrontier` (G. King, Lucas, and Nielsen 2017), which solves the joint optimization problem.
Following the `MatchingFrontier` guide:
# library(devtools)
# install_github('ChristopherLucas/MatchingFrontier')
library(MatchingFrontier)
data("lalonde", package = "MatchIt")
# Choose variables to match on
match.on <-
    colnames(lalonde)[!(colnames(lalonde) %in% c('re78', 'treat'))]
match.on

# Mahalanobis frontier (default)
mahal.frontier <-
makeFrontier(
dataset = lalonde,
treatment = "treat",
match.on = match.on
)
mahal.frontier
# L1 frontier
L1.frontier <-
makeFrontier(
dataset = lalonde,
treatment = 'treat',
match.on = match.on,
QOI = 'SATT',
metric = 'L1',
ratio = 'fixed'
)
L1.frontier
# estimate effects along the frontier
# Set base form (using the variable names in MatchIt's lalonde)
my.form <-
    as.formula(re78 ~ treat + age + educ + race
               + married + nodegree + re74 + re75)
# Estimate effects for the mahalanobis frontier
mahal.estimates <-
estimateEffects(
mahal.frontier,
're78 ~ treat',
mod.dependence.formula = my.form,
continuous.vars = c('age', 'educ', 're74', 're75'),
prop.estimated = .1,
means.as.cutpoints = TRUE
)
# Estimate effects for the L1 frontier
L1.estimates <-
estimateEffects(
L1.frontier,
're78 ~ treat',
mod.dependence.formula = my.form,
continuous.vars = c('age', 'educ', 're74', 're75'),
prop.estimated = .1,
means.as.cutpoints = TRUE
)
# Plot covariates means
# plotPrunedMeans()
# Plot estimates (deprecated)
# plotEstimates(
# L1.estimates,
# ylim = c(-10000, 3000),
# cex.lab = 1.4,
# cex.axis = 1.4,
# panel.first = grid(NULL, NULL, lwd = 2,)
# )
# Plot covariate means along the frontier
plotMeans(L1.frontier)
# parallel plot
parallelPlot(
L1.frontier,
N = 400,
variables = c('age', 're74', 're75', 'married'),
treated.col = 'blue',
control.col = 'gray'
)
# export matched dataset
# take 400 units
matched.data <- generateDataset(L1.frontier, N = 400)
35.9.5 Propensity Scores
Propensity score methods are widely used for estimating causal effects in observational studies, where random assignment to treatment and control groups is not feasible. The core idea is to mimic a randomized experiment by adjusting for confounding variables that predict treatment assignment. Formally, the propensity score is defined as the probability of assignment to treatment conditional on observed covariates (Rosenbaum and Rubin 1983, 1985):
e_i(X_i) = P(T_i = 1 \mid X_i)
where:
- T_i \in \{0, 1\} is the binary treatment indicator for unit i,
- X_i is a vector of observed pre-treatment covariates for unit i.
The key insight from Rosenbaum and Rubin (1983) is that conditioning on the propensity score is sufficient to remove bias due to confounding from observed covariates, under certain assumptions.
35.9.5.1 Assumptions for Identification
To identify causal effects using propensity scores, the following assumptions must hold:
- Unconfoundedness / Conditional Independence Assumption (CIA):
(Y_i(0), Y_i(1)) \perp T_i \mid X_i
- Positivity (Overlap):
0 < P(T_i = 1 \mid X_i) < 1 \quad \text{for all } i
These assumptions ensure that for each unit, we can observe comparable treated and untreated units in the sample. Violations of positivity, especially in high-dimensional covariate spaces, are a critical weakness of propensity score matching.
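A quick empirical check of positivity is to compare the range of estimated propensity scores across treatment arms; a minimal sketch using the `lalonde` data (the object name `ps` is illustrative):
ps <- glm(
  treat ~ age + educ + race + married + nodegree + re74 + re75,
  data = MatchIt::lalonde,
  family = binomial
)$fitted.values

# Ranges bounded away from 0 and 1 in both arms suggest no gross violation
tapply(ps, MatchIt::lalonde$treat, range)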
35.9.5.2 Why PSM Is Not Recommended Anymore
Despite its intuitive appeal, recent literature has strongly cautioned against using propensity score matching for causal inference (G. King and Nielsen 2019). The main criticisms are as follows:
- Imbalance: PSM often fails to achieve covariate balance better than simpler techniques like covariate adjustment via regression or exact matching. Matching on the propensity score is a scalar reduction of a multivariate distribution, and this reduction can distort multivariate relationships.
- Inefficiency: Discarding unmatched units reduces statistical efficiency, especially when better estimators (e.g., inverse probability weighting or doubly robust estimators) can use all data.
- Model dependence: Small changes in the specification of the propensity score model can lead to large changes in matches and estimated treatment effects.
- Bias: Poor matches and irrelevant covariates can introduce additional bias rather than reduce it.
Abadie and Imbens (2016) show that the asymptotic distribution of treatment effect estimators is sensitive to the estimation of the propensity score itself:
- The estimated propensity score can improve efficiency over using the true propensity score when estimating the average treatment effect (ATE). Formally, the adjustment to the asymptotic variance is non-positive.
- However, for the average treatment effect on the treated (ATT), the sign of the adjustment is data-dependent. Estimation error in the propensity score can lead to misestimated confidence intervals: they may be too wide or too narrow.
This result suggests that even in large samples, failure to account for the estimation uncertainty in the propensity score can produce misleading inference.
A fundamental flaw in PSM is the asymmetry of match quality:
- If X_c = X_t, then it must be that e(X_c) = e(X_t).
- However, the converse does not hold:
e(X_c) = e(X_t) \not\Rightarrow X_c = X_t
Therefore, two units with identical propensity scores may still differ substantially in covariate space. This undermines the matching goal of achieving similarity across all covariates.
35.9.5.3 Estimating the Propensity Score
Estimation of the propensity score is typically carried out using:
- Parametric models:
  - Logistic regression:
\hat{e}_i = \frac{1}{1 + \exp(-X_i^\top \hat{\beta})}
- Nonparametric / machine learning methods:
- Generalized Boosted Models (GBM)
- Boosted Classification and Regression Trees (CART)
- Random forests or Bayesian Additive Regression Trees (BART)
These machine learning approaches often yield better balance due to flexible functional forms, but they also complicate interpretation and inference.
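A minimal sketch of the parametric route; the `glm()` fit below could be swapped for a boosted model or BART without changing the downstream workflow (object names `ps_model` and `e_hat` are illustrative):
ps_model <- glm(
  treat ~ age + educ + race + married + nodegree + re74 + re75,
  data = MatchIt::lalonde,
  family = binomial(link = "logit")
)
e_hat <- fitted(ps_model)  # estimated propensity scores
summary(e_hat)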
35.9.5.4 Matching Algorithms
- Reduce the high-dimensional vector X_i to the scalar \hat{e}_i.
- Calculate distances between treated and control units based on \hat{e}_i: d(i, j) = |\hat{e}_i - \hat{e}_j|
- Match each treated unit i to the control unit j with the closest propensity score.
- Prune:
- Control units not used in any match are discarded.
- Matches exceeding a caliper (maximum distance threshold) are also discarded.
- No replacement is typically assumed — each control unit is matched at most once.
Let c > 0 be a caliper. Then a treated unit i is matched to control unit j only if:
|\hat{e}_i - \hat{e}_j| < c
Caliper matching helps reduce poor matches, but may result in random pruning, further reducing balance and efficiency.
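In `MatchIt`, a caliper can be imposed directly; by default the value is interpreted in standard-deviation units of the propensity score, so the sketch below discards matches more than 0.1 SD apart (the caliper value is illustrative):
m.cal <- matchit(
  treat ~ age + educ + race + married + nodegree + re74 + re75,
  data = MatchIt::lalonde,
  method = "nearest",
  distance = "glm",
  caliper = 0.1
)
summary(m.cal, un = FALSE)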
35.9.5.5 Practical Recommendations
- Do not include irrelevant covariates: Including variables that are unrelated to the outcome can increase the variability of the estimated propensity score and reduce matching quality.
- Avoid instrumental variables (IVs) (Bhattacharya and Vogt 2007): Including IVs in the propensity score model can inflate bias by introducing variation in treatment assignment unrelated to potential outcomes.
- Focus on covariates that are confounders, i.e., those that affect both treatment and outcome.
The sample that remains after pruning is often more consequential than the initial covariate set: matching can discard a large portion of the sample, which distorts representativeness and increases variance.
35.9.5.6 Diagnostics and Evaluation
After matching, the primary diagnostic tool is covariate balance. Key diagnostics include:
- Standardized mean differences (SMD) between treatment groups before and after matching
- Variance ratios
- Visual inspection via Love plots or density plots
Statistical model fit criteria (e.g., AIC, BIC, c-statistics) are not valid for evaluating propensity score models, since the goal is not prediction, but achieving balance.
There is no need to worry about collinearity in the covariates when estimating propensity scores — unlike in outcome regression models, where multicollinearity can inflate standard errors.
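For example, a Love plot of standardized mean differences can be drawn directly from a `matchit` summary (here using the `m.out1` object from the nearest-neighbor example above):
plot(summary(m.out1), abs = TRUE)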
35.9.5.7 Applications in Business and Finance
Propensity score methods have been used in empirical finance and marketing, though increasingly replaced by more robust approaches. One illustrative application is found in Hirtle, Kovner, and Plosser (2020), which investigates the causal effect of regulatory bank supervision on firm-level outcomes:
- Treatment: Degree of supervisory attention.
- Outcomes: Loan risk, profitability, volatility, and firm growth.
- Method: Propensity score matching to construct treated and control groups of banks with comparable observed characteristics.
Their matched sample analysis reveals that intensified supervision leads to:
- Lower risk (more conservative loan portfolios)
- Reduced volatility
- No significant loss in profitability or growth
35.9.5.8 Conclusion
While the theoretical motivation behind propensity scores remains sound, their application via naive matching methods is no longer considered best practice. The statistical community increasingly favors alternatives such as:
- Covariate adjustment via regression
- Inverse probability weighting (IPW)
- Doubly robust estimators
- Targeted Maximum Likelihood Estimation (TMLE)
These approaches better utilize the full dataset, yield more efficient estimators, and offer more transparent diagnostics. Matching on estimated propensity scores may still be useful for illustration or sensitivity analysis, but should not be the primary method for causal inference in applied research.
35.9.6 Mahalanobis Distance Matching
Mahalanobis Distance Matching is a method for matching units in observational studies based on the multivariate similarity of covariates. Unlike propensity score matching, which reduces the covariate space to a single scalar, Mahalanobis distance operates in the full multivariate space. As a result, Mahalanobis matching can be interpreted as approximating a fully blocked design, where each treated unit is paired with a control unit that has nearly identical covariate values.
- In the ideal case: X_t = X_c, implying a perfect match.
- In practice, Mahalanobis distance allows for “near-exact” matches in high-dimensional space.
Because it preserves the multivariate structure of the covariates, Mahalanobis matching more faithfully emulates randomization within covariate strata, making it more robust to specification error than propensity score matching.
This method is particularly appealing when the number of covariates is relatively small and when these covariates are continuous and well-measured.
Given a set of p covariates X_i \in \mathbb{R}^p for unit i, the Mahalanobis distance between a treated unit t and a control unit c is defined as:
D_M(X_t, X_c) = \sqrt{(X_t - X_c)^\top S^{-1}(X_t - X_c)}
where:
- X_t and X_c are the p-dimensional vectors of covariates for treated and control units, respectively,
- S is the sample covariance matrix of the covariates across all units (treated and control),
- S^{-1} serves to standardize and decorrelate the covariate space, accounting for both scale and correlation among covariates.
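To make the formula concrete, here is a sketch computing the Mahalanobis distance between the first treated and first control unit of `lalonde` on four continuous covariates (object names are illustrative):
X <- as.matrix(MatchIt::lalonde[, c("age", "educ", "re74", "re75")])
S_inv <- solve(cov(X))  # inverse of the pooled covariance matrix

x_t <- X[MatchIt::lalonde$treat == 1, ][1, ]  # first treated unit
x_c <- X[MatchIt::lalonde$treat == 0, ][1, ]  # first control unit

d <- x_t - x_c
sqrt(drop(t(d) %*% S_inv %*% d))  # Mahalanobis distance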
Why Not Use Euclidean Distance?
The Euclidean distance:
D_E(X_t, X_c) = \sqrt{(X_t - X_c)^\top (X_t - X_c)}
does not adjust for different variances among covariates or for correlation between them. Mahalanobis distance corrects for this by incorporating the inverse covariance matrix S^{-1}, effectively transforming the data to a space where the covariates are standardized and orthogonal.
This makes the Mahalanobis distance scale-invariant, i.e., invariant under affine transformations of the data, which is essential when matching on variables of different units (e.g., income in dollars and age in years).
35.9.6.1 Mahalanobis Matching Algorithm
Let \mathcal{T} be the set of treated units and \mathcal{C} the set of control units. The procedure for Mahalanobis matching can be described as follows:
- Compute the covariance matrix S of the covariates X across all units.
- For each treated unit i \in \mathcal{T}, compute D_M(X_i, X_j) for all j \in \mathcal{C}.
- Match treated unit i to the control unit j with the smallest Mahalanobis distance.
- Prune:
- Control units not used in any match are discarded.
- Matches where D_M > \delta (a caliper threshold) are also discarded.
A caliper \delta is used to enforce a maximum allowable distance. That is, a match is made between i and j only if:
D_M(X_i, X_j) < \delta
This prevents poor-quality matches and helps ensure that treated and control units are meaningfully similar. If no match falls within the caliper for a given treated unit, that unit is left unmatched, and potentially discarded from the analysis.
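The full procedure is available in `MatchIt`; a sketch of 1:1 nearest-neighbor matching on the Mahalanobis distance (the object name `m.mahal` is illustrative):
m.mahal <- matchit(
  treat ~ age + educ + re74 + re75,
  data = MatchIt::lalonde,
  method = "nearest",
  distance = "mahalanobis"
)
summary(m.mahal, un = FALSE)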
35.9.6.2 Properties
- Scale-invariant: Standardizes variables using S^{-1}, ensuring that variables with large scales do not dominate the distance metric.
- Correlation-adjusted: Accounts for linear relationships among covariates, which is critical in multivariate contexts.
- Non-parametric: No model for treatment assignment is estimated; purely based on observed covariates.
35.9.6.3 Limitations
- Sensitive to multicollinearity: If the covariates are highly collinear, the covariance matrix S may be nearly singular, making S^{-1} unstable or non-invertible. Regularization techniques may be required.
- Not suitable for high-dimensional covariate spaces: As p increases, exact or even near-exact matches become harder to find. Dimensionality reduction techniques (e.g., PCA) may help.
- Inefficient with categorical variables: Since Mahalanobis distance is based on continuous covariates and assumes multivariate normality (implicitly), it’s less effective when most covariates are categorical or binary.
35.9.6.4 Hybrid Approaches
Mahalanobis matching can be combined with propensity scores for improved performance:
- Within propensity score calipers: Match using Mahalanobis distance only within groups of units that fall within a specified caliper of each other in propensity score space.
- Stratified matching: Divide the sample into strata based on propensity scores and apply Mahalanobis matching within each stratum.
These hybrid methods aim to preserve the robustness of Mahalanobis matching while mitigating its weaknesses in high dimensions.
35.9.6.5 Practical Considerations
- Estimate S from the pooled sample (treated + control) to ensure consistency.
- Standardize all covariates before applying Mahalanobis distance to ensure comparability and numerical stability.
- Use diagnostic plots (e.g., QQ plots or multivariate distance histograms) to assess match quality.
35.9.7 Coarsened Exact Matching (CEM)
Coarsened Exact Matching (CEM) is a monotonic imbalance bounding matching method designed to improve covariate balance between treated and control units in observational studies. CEM operates by coarsening continuous or high-cardinality covariates into discrete bins and then applying exact matching on this coarsened space. The result is a non-parametric method that prioritizes covariate balance and robustness over modeling assumptions.
Introduced and formalized in Iacus, King, and Porro (2012), CEM is well-suited for causal inference when covariates are noisy or when traditional modeling-based approaches (e.g., propensity scores) are unstable or hard to interpret.
The central idea is to discretize covariates into meaningful bins (either automatically or manually), then sort observations into strata or subclasses based on the unique combinations of these binned covariate values. Exact matching is applied within each stratum.
Let X_i \in \mathbb{R}^p be the p-dimensional covariate vector for unit i. Define C(X_i) to be a coarsened version of X_i, where:
- Continuous variables are binned into intervals (e.g., age 20–29, 30–39, etc.)
- Categorical variables may be grouped (e.g., Likert scales aggregated into fewer categories)
Then, matching proceeds as follows:
- Temporarily coarsen the covariate space X \to C(X).
- Sort observations into strata defined by unique values of C(X).
- Prune any stratum that contains only treated or only control units.
- Retain all original (uncoarsened) units in matched strata for analysis.
This process produces a matched sample in which each stratum includes at least one treated and one control unit, improving internal validity by design.
As with other matching methods, CEM requires the ignorability assumption (also known as unconfoundedness or selection on observables):
(Y_i(0), Y_i(1)) \perp T_i \mid X_i
Under this assumption and sufficient overlap in the coarsened strata, CEM yields unbiased estimates of causal effects within the matched sample.
35.9.7.1 Mathematical Properties
- Monotonic Imbalance Bounding
CEM is a Monotonic Imbalance Bounding method, meaning that the user can pre-specify the maximum level of imbalance allowed on each covariate. The imbalance measure is guaranteed to be weakly decreasing as coarsening becomes finer. This provides:
- Transparency: The imbalance tradeoff is determined ex ante by the user.
- Control: Users can make a direct decision about how much imbalance is tolerable.
Formally, let \mathcal{L}(C(X)) denote a measure of imbalance under coarsened covariates. Then, for any refinements of C(X):
\text{if } C_1(X) \preceq C_2(X), \text{ then } \mathcal{L}(C_1(X)) \leq \mathcal{L}(C_2(X))
That is, finer coarsenings (more bins) cannot increase imbalance.
- Congruence Principle
CEM respects the congruence principle, which asserts that analysis should not be more precise than the data allow. By coarsening, CEM protects against overfitting and artificial precision, especially when measurement error is present.
- Robustness
- Robust to measurement error: Discretization makes matching less sensitive to noise in covariates.
- Works with missing data: The `cem` R package supports partial handling of missingness.
- Multiple imputation: CEM can be applied within imputation routines to preserve matching structure across imputed datasets.
- Supports multi-valued treatments: Matching can be extended to treatments beyond binary T \in \{0, 1\}.
- Load and Prepare Data
library(cem)
data(LeLonde)
# Remove missing values
Le <- na.omit(LeLonde)
# Treatment and control indices
tr <- which(Le$treated == 1)
ct <- which(Le$treated == 0)
ntr <- length(tr)
nct <- length(ct)
# Unadjusted bias (naïve difference in means)
mean(Le$re78[tr]) - mean(Le$re78[ct])
#> [1] 759.0479
- Define Pre-Treatment Covariates
vars <- c(
"age", "education", "black", "married", "nodegree",
"re74", "re75", "hispanic", "u74", "u75", "q1"
)
# Pre-treatment imbalance
imbalance(group = Le$treated, data = Le[vars]) # L1 imbalance = 0.902
#>
#> Multivariate Imbalance Measure: L1=0.902
#> Percentage of local common support: LCS=5.8%
#>
#> Univariate Imbalance Measures:
#>
#> statistic type L1 min 25% 50% 75%
#> age -0.252373042 (diff) 5.102041e-03 0 0 0.0000 -1.0000
#> education 0.153634710 (diff) 8.463851e-02 1 0 1.0000 1.0000
#> black -0.010322734 (diff) 1.032273e-02 0 0 0.0000 0.0000
#> married -0.009551495 (diff) 9.551495e-03 0 0 0.0000 0.0000
#> nodegree -0.081217371 (diff) 8.121737e-02 0 -1 0.0000 0.0000
#> re74 -18.160446880 (diff) 5.551115e-17 0 0 284.0715 806.3452
#> re75 101.501761679 (diff) 5.551115e-17 0 0 485.6310 1238.4114
#> hispanic -0.010144756 (diff) 1.014476e-02 0 0 0.0000 0.0000
#> u74 -0.045582186 (diff) 4.558219e-02 0 0 0.0000 0.0000
#> u75 -0.065555292 (diff) 6.555529e-02 0 0 0.0000 0.0000
#> q1 7.494021189 (Chi2) 1.067078e-01 NA NA NA NA
#> max
#> age -6.0000
#> education 1.0000
#> black 0.0000
#> married 0.0000
#> nodegree 0.0000
#> re74 -2139.0195
#> re75 490.3945
#> hispanic 0.0000
#> u74 0.0000
#> u75 0.0000
#> q1 NA
- Automatically Coarsen and Match
mat <- cem(
  treatment = "treated",
  data = Le,
  drop = "re78",    # outcome variable
  keep.all = TRUE   # retain unmatched units in output
)
#>
#> Using 'treated'='1' as baseline group
mat
#> G0 G1
#> All 392 258
#> Matched 95 84
#> Unmatched 297 174
`mat$w` contains the weights for matched units; printing `mat` gives a summary of the matched strata and the units retained.
- Manual Coarsening (User-Controlled)
Users may coarsen variables explicitly based on theoretical knowledge or data distribution.
Example: Grouped Categorical and Binned Continuous Covariates
# Inspect levels for grouping
levels(Le$q1)
#> [1] "agree" "disagree" "neutral"
#> [4] "no opinion" "strongly agree" "strongly disagree"
# Group Likert responses
q1.grp <- list(
  c("strongly agree", "agree"),
  c("neutral", "no opinion"),
  c("strongly disagree", "disagree")
)
# Custom cutpoints for education
table(Le$education)
#>
#> 3 4 5 6 7 8 9 10 11 12 13 14 15
#> 1 5 4 6 12 55 106 146 173 113 19 9 1
educut <- c(0, 6.5, 8.5, 12.5, 17)
# Run CEM with manual coarsening
mat1 <- cem(
  treatment = "treated",
  data = Le,
  drop = "re78",
  cutpoints = list(education = educut),
  grouping = list(q1 = q1.grp)
)
#>
#> Using 'treated'='1' as baseline group
mat1
#> G0 G1
#> All 392 258
#> Matched 158 115
#> Unmatched 234 143
This allows for domain-specific discretion and enhances interpretability.
35.9.7.2 Progressive Coarsening
CEM supports progressive coarsening, where matching is attempted with fine coarsening first. If insufficient matches are found, coarsening is gradually relaxed until a suitable number of matched units is retained.
This strategy balances:
- Precision (finer bins)
- Sample size retention (coarser bins)
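The `cem` package implements this via `relax.cem()`, which re-runs the match at progressively coarser solutions and reports the matched sample sizes; a sketch reusing the `mat` object from above:
relax.cem(mat, Le, depth = 1)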
35.9.7.3 Summary
Strengths:
- Transparent and tunable matching procedure
- Non-parametric: does not require estimating a treatment assignment model
- Robust to measurement error and model misspecification
- Supports multi-valued treatments and partial missingness
- Theoretically grounded via monotonic imbalance bounding
Limitations:
- Loss of information due to coarsening
- Arbitrary binning may influence results if not theoretically motivated
- Not designed for high-dimensional settings without careful variable selection
- Cannot account for unobserved confounding
Coarsened Exact Matching offers a powerful, transparent, and theoretically principled method for preprocessing observational data before estimating causal effects. It overcomes several limitations of traditional matching approaches, particularly propensity score matching, by giving researchers direct control over the quality of matches and by anchoring inference in empirical balance rather than model-dependent estimates.
35.9.8 Genetic Matching
Genetic Matching (GM) is a generalization of propensity score and Mahalanobis distance matching, leveraging a search algorithm inspired by evolutionary biology to optimize covariate balance. Unlike classical matching techniques, which rely on pre-specified distance metrics, Genetic Matching learns an optimal distance metric through an iterative process aimed at minimizing imbalance between treated and control groups.
Genetic Matching was introduced by Diamond and Sekhon (2013) and is designed to improve balance on observed covariates by selecting optimal weights for each covariate. The core idea is to define a generalized Mahalanobis distance metric, where the weights used in the distance calculation are chosen by a genetic search algorithm. This process adaptively explores the space of possible weighting matrices and evolves toward a set of weights that result in optimal covariate balance across treatment groups.
The method combines two components:
- Propensity Score Matching: Balances on the estimated probability of treatment assignment given covariates.
- Mahalanobis Distance Matching: Accounts for correlations among covariates and ensures geometric proximity in multivariate space.
The Genetic Matching approach treats the selection of covariate weights as an optimization problem, solved using techniques inspired by natural selection—mutation, crossover, and survival of the fittest.
Compared to traditional matching methods such as:
- Nearest Neighbor Matching: May result in poor balance in high-dimensional or imbalanced datasets.
- Full Matching: More efficient but not always feasible with severe imbalance.
Genetic Matching is particularly powerful in situations where traditional distance metrics are insufficient to achieve balance. It adaptively searches the weight space to ensure better covariate overlap, which is crucial for unbiased causal inference.
In empirical settings, this often results in superior balance on observed covariates and more credible estimates of treatment effects.
Let:
- T_i \in \{0, 1\} be the binary treatment indicator for unit i.
- \mathbf{X}_i \in \mathbb{R}^p be the covariate vector for unit i, where p is the number of observed covariates.
- \mathbf{W} be a p \times p positive definite weight matrix optimized by the algorithm.
The generalized Mahalanobis distance between unit i and j is defined as:
D_{ij} = \sqrt{(\mathbf{X}_i - \mathbf{X}_j)^\top \mathbf{W} (\mathbf{X}_i - \mathbf{X}_j)}
Here, \mathbf{W} is not necessarily the inverse of the covariance matrix (as in standard Mahalanobis distance), but a data-driven weighting matrix found by Genetic Matching to optimize covariate balance.
The genetic algorithm optimizes the weights in \mathbf{W} to minimize a global imbalance metric \mathcal{B} across all covariates. This can involve different balance criteria:
- Paired t-tests for continuous or dichotomous variables
- Kolmogorov–Smirnov (K–S) test statistics for distributional similarity
- Standardized mean differences
Let b_k denote the balance metric for the kth covariate. Then the total imbalance could be:
\mathcal{B} = \sum_{k=1}^{p} w_k b_k^2
The optimization objective becomes finding \mathbf{W} that minimizes \mathcal{B}.
Genetic Matching is implemented in the `Matching` package in R via the `GenMatch()` function. This function:
- Accepts a treatment indicator (`Tr`) and a matrix of covariates (`X`).
- Optionally takes a balance matrix (`BalanceMatrix`) specifying which variables should be balanced.
- Uses a genetic search to find the weight matrix that best achieves balance on the specified variables.
Balance is assessed via:
- Paired t-tests for dichotomous or continuous variables
- Kolmogorov-Smirnov (K–S) tests for distributional differences
Matching can be performed:
- With or without replacement (with replacement often improves match quality at the cost of increased variance).
- For different estimands: Average Treatment Effect, Average Treatment Effect on the Treated, or Average Treatment Effect on the Control.
library(Matching)

# Use the lalonde data shipped with the Matching package; its variable names
# (black, hisp, nodegr, u74, u75) differ from those in MatchIt::lalonde
data(lalonde)
attach(lalonde)

# Define the covariates to match on
X <- cbind(age, educ, black, hisp, married, nodegr, u74, u75, re75, re74)

# Define the covariates to balance on
BalanceMat <- cbind(
  age, educ, black, hisp, married, nodegr,
  u74, u75, re75, re74,
  I(re74 * re75)  # include an interaction term
)
# Genetic Matching to optimize covariate balance
# Note: pop.size = 16 is too small for real applications
genout <- GenMatch(
  Tr = treat,
  X = X,
  BalanceMatrix = BalanceMat,
  estimand = "ATE",
  M = 1,
  pop.size = 16,
  max.generations = 10,
  wait.generations = 1
)
# Define the outcome variable
Y <- re78 / 1000
# Perform matching with the optimized weights
mout <- Match(
  Y = Y,
  Tr = treat,
  X = X,
  estimand = "ATE",
  Weight.matrix = genout
)
# Summarize the treatment effect estimates
summary(mout)
# Assess post-matching balance
mb <- MatchBalance(
  treat ~ age + educ + black + hisp + married + nodegr +
    u74 + u75 + re75 + re74 + I(re74 * re75),
  match.out = mout,
  nboots = 500
)
Extensions and Considerations
- The method can be extended to multinomial treatments (via generalized entropy balancing).
- For longitudinal or panel data, entropy balancing can be adapted to match across time points.
- Avoid balancing on post-treatment variables, which would bias estimates.
- Entropy balancing fits naturally into modern workflows for causal inference and is especially valuable when researchers want to prioritize design over modeling.
35.9.9 Entropy Balancing
Entropy balancing is a data preprocessing technique for causal inference in observational studies. Introduced by Hainmueller (2012), it is particularly useful in settings with binary treatments where covariate imbalance between treated and control groups threatens the validity of causal estimates.
The method applies a maximum entropy reweighting scheme to assign unit weights to control group observations such that the reweighted covariate distribution in the control group matches the sample moments (e.g., means, variances) of the treated group.
Entropy balancing directly targets covariate balance and avoids the trial-and-error of iterative propensity score or matching methods. It also reduces reliance on outcome models by ensuring pre-treatment covariates are aligned between groups.
- Goal: Create a set of weights for the control group such that its covariate distribution matches that of the treated group.
- Approach: Solve a constrained optimization problem that minimizes information loss (measured via Shannon entropy) subject to balance constraints on covariates.
Entropy balancing:
- Achieves exact balance on the specified covariate moments (e.g., means, variances).
- Produces unique, data-driven weights without the need for iterative matching.
- Is compatible with any outcome model in the second stage (e.g., weighted regression, IPW, DID).
Advantages of Entropy Balancing
- Exact balance: Guarantees moment balance without iterative diagnostics.
- Flexibility: Can balance on higher moments, interactions, or non-linear transformations.
- Model independence: Reduces reliance on correct specification of the outcome model.
- Stability: The optimization procedure is smooth and produces stable, interpretable weights.
Entropy balancing is especially useful in high-dimensional settings or when working with small treated groups, where matching can perform poorly or fail entirely.
Suppose we have n_1 treated units and n_0 control units, with covariates \mathbf{X}_i \in \mathbb{R}^p for each unit i. Let T_i \in \{0,1\} be the treatment indicator.
Let the treatment group’s sample moment vector be:
\bar{\mathbf{X}}_T = \frac{1}{n_1} \sum_{i: T_i=1} \mathbf{X}_i
We seek a set of weights \{w_i\}_{i: T_i=0} for control units such that:
- Covariate balancing constraints (e.g., mean balance):
\sum_{i: T_i=0} w_i \mathbf{X}_i = \bar{\mathbf{X}}_T
- Minimum entropy divergence from uniform weights:
We minimize the Kullback-Leibler divergence between the new weights w_i and uniform base weights w_i^{(0)} = \frac{1}{n_0}:
\min_{\{w_i\}} \sum_{i: T_i=0} w_i \log \left( \frac{w_i}{w_i^{(0)}} \right)
subject to:
- \sum w_i = 1 (weights must sum to 1)
- \sum w_i \mathbf{X}_i = \bar{\mathbf{X}}_T
- (Optionally) higher-order moments, e.g., variances, skewness, etc.
This is a convex optimization problem, ensuring a unique global solution.
# Load package
library(ebal)
# Simulate data
set.seed(123)
n_treat <- 50
n_control <- 100
# Covariates
X_treat <- matrix(rnorm(n_treat * 3, mean = 1), ncol = 3)
X_control <- matrix(rnorm(n_control * 3, mean = 0), ncol = 3)
X <- rbind(X_control, X_treat) # Order: control first, then treated
# Treatment vector: 0 = control, 1 = treated
treatment <- c(rep(0, n_control), rep(1, n_treat))
# Apply entropy balancing
eb_out <- ebalance(Treatment = treatment, X = X)
# Simulate outcome variable
Y_control <- rnorm(n_control, mean = 1)
Y_treat <- rnorm(n_treat, mean = 3)
Y <- c(Y_control, Y_treat)
# Construct weights: treated get weight 1, control get weights from ebalance
weights <- c(eb_out$w, rep(1, n_treat))
# Estimate the ATT using weighted linear regression (entropy balancing
# reweights controls toward the treated group, so the estimand is the ATT)
df <- data.frame(Y = Y, treat = treatment, weights = weights)
fit <- lm(Y ~ treat, data = df, weights = weights)
summary(fit)
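To verify that the balancing worked, compare the treated means with the reweighted control means; they should agree on every covariate up to the solver tolerance:
treated_means <- colMeans(X_treat)
reweighted_control_means <- colSums(X_control * eb_out$w) / sum(eb_out$w)
rbind(treated = treated_means, reweighted_control = reweighted_control_means)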
35.9.10 Matching for high-dimensional data
One could reduce the number of dimensions using methods such as:
- Lasso (Gordon et al. 2019)
- Penalized logistic regression (Eckles and Bakshy 2021)
- Principal Component Analysis (PCA)
- Locality Preserving Projections (LPP) (S. Li et al. 2016)
- Random projection
- Autoencoders (Ramachandra 2018)
Additionally, one could jointly perform dimension reduction while balancing the distributions of the control and treated groups (Yao et al. 2018).
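A sketch of one such reduction: replace the numeric covariates with their first two principal components and match on those (purely illustrative; the number of components should be chosen with diagnostics, and object names are hypothetical):
num_vars <- c("age", "educ", "re74", "re75")
pc <- prcomp(MatchIt::lalonde[, num_vars], scale. = TRUE)

dat <- cbind(MatchIt::lalonde, pc$x[, 1:2])  # append PC1 and PC2
m.pca <- matchit(treat ~ PC1 + PC2, data = dat, method = "nearest")
summary(m.pca, un = FALSE)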
35.9.11 Matching for multiple treatments
In cases where you have multiple treatment groups and want to perform matching, it is important to use the same baseline (control) group. For more details, see Q.-Y. Zhao et al. (2021), which also covers continuous treatments. If you insist on using the `MatchIt` package, see this answer.
35.9.12 Matching for multi-level treatments
See Yang et al. (2016). An R implementation is available in the `shuyang1987/multilevelMatching` package on GitHub.
35.9.13 Matching for repeated treatments
See the `twang` package in R and its vignette on inverse probability of treatment weighting for time-varying treatments: https://cran.r-project.org/web/packages/twang/vignettes/iptw.pdf