A useful reference: Halekoh, U., S. Højsgaard, and J. Yan. 2006. "The R Package geepack for Generalized Estimating Equations." Journal of Statistical Software 15(2): 1–11.
Marginal Models: assess population-level associations in the presence of heteroscedastic and/or correlated errors, without requiring assumptions about subject-specific random effects. Common settings include:
Longitudinal data
Clustered data
Multilevel data
For example, suppose we have the linear regression problem given below:
\[
\mathbf{y} = \mathbf{X}\beta + \boldsymbol{\epsilon}, \qquad \operatorname{Var}(\boldsymbol{\epsilon}) = \mathbf{V}.
\]
We are interested in learning the population-level associations between \(\mathbf{y}\) and \(\mathbf{X}\) across \(n\) subjects, i.e., how can we find the least squares estimate of \(\beta\) while accounting for within-subject correlation?
Motivating Example: Revisiting TLC
The Treatment of Lead-Exposed Children (TLC) trial was a placebo-controlled, randomized study of succimer (a chelating agent) in children with blood lead levels of 20–44 µg/dL. These data consist of four repeated measurements of blood lead level obtained at baseline (week 0), week 1, week 4, and week 6 on 100 children who were randomly assigned to chelation treatment with succimer or placebo.
Recall the random-effects (mixed model) approach: random effects are assumed to be normally distributed, and inference depends on correct specification of the random-effects distribution and covariance structure.
Estimates population-level associations (fixed effects) conditional on random effects: \(E(y_{ij} \mid b_i) = X_{ij}^T\beta + b_i\).
\(V\) = var-cov matrix, which includes within-subject correlations, but only through the assumed random-effects structure; i.e., we treat \(V\) as model-based, with diagonal elements \(\tau^2 + \sigma^2\) (and, for a random intercept, within-subject off-diagonal elements \(\tau^2\)).
The Better Approach: Weighted Least Squares
In WLS, we extend the methods from regular OLS (for normal data) to account for heteroscedasticity (unequal variance). Each observation is given a weight based on how much reliability we place on it. This allows a better fit when some points are more variable than others.
Let \(\mathbf{W}\) be a diagonal matrix with \(\operatorname{diag}(\mathbf{W}) = (v_1^{-1/2}, v_2^{-1/2}, v_3^{-1/2}, \ldots, v_n^{-1/2})\), where \(v_i = \operatorname{Var}(\epsilon_i)\). Note that \(\mathbf{W}\cdot\mathbf{W} = \mathbf{V}^{-1}\).
Now let’s derive the weighted least squares estimator \(\hat{\beta}_{WLS}\):
Derivation of WLS estimator
Consider the transformed response \(y^* = \mathbf{W}y\) and transformed design matrix \(X^* = \mathbf{W}X\).
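Applying the ordinary OLS formula to the transformed data \((y^*, X^*)\) then yields the WLS estimator, since the transformed errors \(\mathbf{W}\epsilon\) have covariance \(\mathbf{W}\mathbf{V}\mathbf{W} = \mathbf{I}\):
\[
\begin{align*}
\hat{\beta}_{WLS} &= (X^{*\top} X^{*})^{-1} X^{*\top} y^{*} \\
&= (X^\top \mathbf{W}^\top \mathbf{W} X)^{-1} X^\top \mathbf{W}^\top \mathbf{W} y \\
&= (X^\top \mathbf{V}^{-1} X)^{-1} X^\top \mathbf{V}^{-1} y.
\end{align*}
\]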
Comparing WLS vs. OLS: \(E[\hat{\beta}_{WLS}] = E[\hat{\beta}_{OLS}] = \beta\), so both are unbiased and consistent for the true \(\beta\); in the limit, as the sample size grows, both estimators converge to the same population parameter.
If the errors are heteroscedastic, the WLS estimator is more efficient (lower variance).
If the errors are heteroscedastic (\(V \neq \sigma^2 I\)), the usual OLS covariance is not correct \(\rightarrow\) let's look at the OLS estimator. Its true covariance is
\[
\operatorname{Cov}(\hat{\beta}_{OLS}) = (X^\top X)^{-1} X^\top V X (X^\top X)^{-1}.
\]
However, our usual estimate of the covariance is \(\widehat{\operatorname{cov}}(\hat{\beta}_{OLS}) = \hat{\sigma}^2(X^\top X)^{-1}\), which assumes \(V = \sigma^2 I\). Then, if \(V \neq \sigma^2 I\), inference based on this estimate is incorrect.
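A quick simulation illustrates the problem. This is a minimal sketch on simulated (not TLC) data, comparing the empirical sampling variability of the slope under heteroscedastic errors with the nominal OLS standard error:

```r
set.seed(1)
B <- 2000
n <- 200
x <- runif(n, 0, 3)

# Empirical sampling SD of the slope when the error SD grows with x
beta_hat <- replicate(B, {
  y <- 1 + 2 * x + rnorm(n, sd = x)  # V != sigma^2 I
  coef(lm(y ~ x))[["x"]]
})
sd(beta_hat)

# Nominal OLS SE from a single fit (assumes V = sigma^2 I);
# it will generally disagree with the empirical SD above
y <- 1 + 2 * x + rnorm(n, sd = x)
summary(lm(y ~ x))$coefficients["x", "Std. Error"]
```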
WLS: TLC example
We’ll now apply WLS to the TLC blood lead dataset to account for heteroskedasticity across weeks.
Visual inspection of residuals by week (not shown here yet) suggests that variability in lead levels increases over time.
We'll allow a non-constant variance for each week using `nlme::gls` with a `varIdent` variance function.
```r
library(nlme)

# Different variance per week; no correlation structure
m_wls <- gls(
  lead ~ treatment * week_f,
  data = tlcdata,
  weights = varIdent(form = ~ 1 | week_f),  # variance differs by week
  method = "REML"
)
summary(m_wls)
```
Intercept: Placebo group at baseline = 24.66 (SE = 0.47, p < 0.001).
Succimer (baseline difference): At baseline, succimer-treated children had lead levels 5.58 µg/dL lower than placebo (SE = 0.66, p < 0.001).
Time trend (placebo group):
Linear = –1.89 (SE = 0.92, p = 0.040) → significant decline over time.
Quadratic = 0.59 (SE = 0.94, p = 0.53) → no evidence of curvature.
Cubic = –0.19 (SE = 0.95, p = 0.84) → no higher-order time effects.
Succimer × Time interactions:
Linear = –1.54 (SE = 1.30, p = 0.24) → not significant.
Quadratic = 8.54 (SE = 1.33, p < 0.001) → succimer group shows a strong positive quadratic trend, diverging from placebo.
Cubic = –2.44 (SE = 1.35, p = 0.072) → marginal evidence for cubic effects.
Variance function
Residual variability increased relative to baseline (week 0):
Week 1: ~33% higher
Week 4: ~37% higher
Week 6: ~52% higher
GLS (via `varIdent`) down-weights observations from weeks with higher variance.
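To inspect the estimated standard-deviation ratios directly, you can look at the variance-function component of the fitted object. A minimal sketch, assuming the `m_wls` fit from above:

```r
# Estimated residual SD ratios relative to the reference week (week 0)
m_wls$modelStruct$varStruct

# The same ratios as a named vector
coef(m_wls$modelStruct$varStruct, unconstrained = FALSE, allCoef = TRUE)
```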
Generalized Least Squares
In WLS, we assumed the residual covariance matrix \(\mathbf{V}\) was known and diagonal—errors were uncorrelated but had unequal variances.
Generalized Least Squares (GLS) drops that restriction: \(\mathbf{V}\) can be any known, positive definite covariance matrix, allowing for both:
Unequal variances (heteroskedasticity), and
Correlations among observations.
GLS solves the same weighted least squares problem, but now the “weights” come from \(\mathbf{V}^{-1}\), which accounts for the full error structure (variances and covariances).
If \(\mathbf{V}\) is diagonal ⟶ GLS = WLS.
If \(\mathbf{V} = \sigma^2 \mathbf{I}\) ⟶ GLS = OLS.
So we already know:
\[
\begin{align*}
\hat{\beta}_{GLS}& = (X^T V^{-1} X)^{-1} X^T V^{-1}y\\
\operatorname{Cov}(\hat{\beta}_{GLS})& = (X^T V^{-1} X)^{-1}
\end{align*}
\] This is because \(V\) is a known positive definite matrix, so we can always find a matrix \(\mathbf{W} = \mathbf{V}^{-1/2}\) with \(\mathbf{W}\mathbf{W}^T=\mathbf{V}^{-1}\); the WLS derivation above then goes through unchanged.
This means that \(\hat{\beta}_{GLS} = (X^T V^{-1}X)^{-1}X^T V^{-1}y\) is the best (minimum-variance) linear unbiased estimator (BLUE).
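For instance, on the TLC data we could combine the week-specific variances from the WLS fit with a within-child correlation structure. A sketch, assuming an `id` column identifies children in `tlcdata`:

```r
# GLS with unequal variances by week AND within-subject correlation
m_gls <- gls(
  lead ~ treatment * week_f,
  data = tlcdata,
  weights = varIdent(form = ~ 1 | week_f),     # heteroscedasticity by week
  correlation = corCompSymm(form = ~ 1 | id),  # exchangeable within child
  method = "REML"
)
summary(m_gls)
```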
Robust Inference
We have outlined above when to use WLS vs. GLS, assuming \(V\) is known and correctly specified. However, when \(V\) is unknown or misspecified, the OLS coefficients remain unbiased but their standard errors may be wrong. In this case, robust (sandwich) standard errors provide valid inference without needing the exact form of \(\mathbf{V}\).
Robust (Sandwich) Standard Errors (OLS: The Simpler Case)
For OLS, the covariance of the coefficient estimate is:
\[
\operatorname{Cov}(\hat{\beta}_{OLS}) = (X^\top X)^{-1} X^\top \mathbf{V} X (X^\top X)^{-1},
\]
where, under the OLS assumptions, \(\mathbf{V} = \sigma^2 \mathbf{I}\) is the var-cov matrix of the errors (constant diagonal, zero off-diagonals... so simple!), and the expression collapses to the familiar \(\sigma^2 (X^\top X)^{-1}\). When \(\mathbf{V}\) is unknown, we can estimate it and substitute the estimate into this formula. One natural choice for \(\hat{\mathbf{V}}\) is the empirical residual covariance matrix:
\[
\hat{\mathbf{V}} = (y - X\hat{\beta}_{OLS})(y - X\hat{\beta}_{OLS})^\top.
\]
This is called the sandwich estimator because the middle term (the “meat”)—the residual covariance—is sandwiched between two \((\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top\) terms (the “bread”).
The robust (sandwich) standard errors are obtained by taking the square roots of the diagonal elements of \(\widehat{\operatorname{Cov}}(\hat{\beta}_{OLS})_{\text{robust}}\).
Definition:
It is called the sandwich estimator because of its "bread-meat-bread" structure, with the residual covariance matrix being "sandwiched" between terms involving the design matrix \(\mathbf{X}\).
The robust standard error is based on an estimate of the covariance matrix \(\operatorname{cov}(\hat{\beta})\) given by:
\[
\widehat{\operatorname{Cov}}(\hat{\beta}_{OLS})_{\text{robust}} = (X^\top X)^{-1} X^\top \hat{\mathbf{V}} X (X^\top X)^{-1}.
\]
Note: The empirical estimate of the "meat" component uses only a single observation for each element (the outer product of one residual vector), so it is a poor, i.e., inaccurate, estimator of \(\mathbf{V}\). We can get a better estimate if we have some idea of the structure of \(\mathbf{V}\).
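A minimal sketch of the bread-meat-bread computation in R. Note that the common practical fix for the noisy meat estimate above is to keep only the diagonal of the residual outer product, i.e., use \(\operatorname{diag}(\hat{\epsilon}_i^2)\) as the meat (the HC0 estimator); the model and data names below are placeholders:

```r
fit <- lm(y ~ x, data = dat)     # placeholder OLS fit
X <- model.matrix(fit)
e <- resid(fit)

bread <- solve(crossprod(X))     # (X'X)^{-1}
meat  <- crossprod(X * e)        # X' diag(e^2) X  (HC0 "meat")
vcov_robust <- bread %*% meat %*% bread
sqrt(diag(vcov_robust))          # robust standard errors

# Equivalent result from the sandwich package:
# sandwich::vcovHC(fit, type = "HC0")
```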
Trade-Offs in Modeling
If the assumptions are true: stronger assumptions = more accurate (efficient) estimates; conversely, weaker assumptions = less accurate estimates.
When the assumptions are false: stronger assumptions = incorrect inference; weaker assumptions = more robust inference.
Robust (Sandwich) Standard Errors (GLS: Assuming Some Structure on \(V\))
For GLS, recall that the coefficient estimator is:
\[
\hat{\beta}_{GLS} = (X^\top V^{-1} X)^{-1} X^\top V^{-1} y.
\]
If \(V\) is unknown or misspecified, naive GLS standard errors are invalid.
In this case, we again use a robust sandwich estimator, plugging in empirical residuals instead of assuming the true \(V\) is known:
\[
\widehat{\operatorname{Cov}}(\hat{\beta}_{GLS})_{\text{robust}} = (X^\top V^{-1} X)^{-1} X^\top V^{-1} \hat{\Omega}\, V^{-1} X\, (X^\top V^{-1} X)^{-1},
\]
where \(V\) is the assumed (working) covariance, \(\hat{\Omega}\) estimates the "true" covariance structure \(\tilde{V}\), and the empirical variance is given by:
\[
\hat{\Omega} = (y - X\hat{\beta}_{GLS})(y - X\hat{\beta}_{GLS})^\top.
\] Coming back to our WLS example: we assumed heteroscedastic errors that were uncorrelated across subjects. GLS then weighted observations accordingly, gaining efficiency if that structure was correct. But if some subjects were family members, and thus correlated, our covariance assumption would be wrong, and so would our standard errors. GLS with robust SEs protects against this by keeping the weighted estimates while still providing valid inference.
Definition:
The GLS robust (sandwich) estimator has the same bread–meat–bread structure as OLS, but the "bread" now reflects the weighting by \(V^{-1}\), as in the formula above.
If \(V\) is wrong → the robust form still yields valid inference.
That is, even if \(\hat{V}\) is misspecified, the robust standard errors remain valid for inference.
Takeaway:
- OLS + robust SE = no assumptions about \(V\), robust but potentially inefficient.
- GLS + robust SE = assumes structure in \(V\) for efficiency, but robust SEs protect against misspecification.
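In R, cluster-robust SEs for a `gls` fit can be obtained with the clubSandwich package. A sketch, assuming an `id` column identifying children and that clubSandwich is installed:

```r
library(clubSandwich)

# Cluster-robust (sandwich) covariance matrix for the WLS/GLS fit
vcovCR(m_wls, cluster = tlcdata$id, type = "CR0")

# Coefficient tests using the robust covariance
coef_test(m_wls, vcov = "CR0", cluster = tlcdata$id)
```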
Choosing a Covariance Structure for Grouped Data
When data are grouped or repeatedly measured (e.g., patients within clinics, time points within subjects), the error terms are usually correlated. In GLS, we model this correlation by specifying the covariance matrix:
\[
V_i = \sigma^2 R_i(\alpha),
\]
where \(\sigma^2\) is the common variance and \(R_i(\alpha)\) is the group-level correlation matrix, which depends on parameters \(\alpha\) (e.g., a correlation \(\rho\)).
\(R_i(\alpha)=\) working correlation structure where \(R_i[j, j'] = cor(y_{ij}, y_{ij'})\).
For example, for a group (or cluster) of size 3, common choices for the correlation matrix \(R_i(\alpha)\) look as follows.

Independent

Assumes no correlation among observations within a group:
\[
R_i = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\]

Exchangeable

Assumes a common correlation \(\rho\) for every pair of observations within a group:
\[
R_i(\rho) = \begin{pmatrix} 1 & \rho & \rho \\ \rho & 1 & \rho \\ \rho & \rho & 1 \end{pmatrix}
\]

AR(1)

Assumes correlation decays with lag, \(R_i[j, j'] = \rho^{|j - j'|}\):
\[
R_i(\rho) = \begin{pmatrix} 1 & \rho & \rho^2 \\ \rho & 1 & \rho \\ \rho^2 & \rho & 1 \end{pmatrix}
\]

Unstructured

Allows each pair of observations its own correlation:
\[
R_i(\alpha) = \begin{pmatrix} 1 & \rho_{12} & \rho_{13} \\ \rho_{12} & 1 & \rho_{23} \\ \rho_{13} & \rho_{23} & 1 \end{pmatrix}
\]
Note: As we learned above, if we use GLS + robust SEs, even if we specify an incorrect correlation matrix, we still have valid inference (though possibly with some loss of efficiency).
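These working correlation structures are exactly what GEE software asks you to specify. A sketch using geepack (the reference cited at the top), again assuming an `id` column and rows sorted by child and week:

```r
library(geepack)

# Marginal model with an exchangeable working correlation;
# geeglm reports robust (sandwich) SEs by default
m_gee <- geeglm(
  lead ~ treatment * week_f,
  id = id,
  data = tlcdata,
  corstr = "exchangeable"
)
summary(m_gee)
```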
Summary

| Section | Description |
|---|---|
| What is the model? | Marginal models estimate population-level associations without specifying random effects. The general form is \(y = X\beta + \epsilon\), with \(\operatorname{Var}(\epsilon) = V\). |
| Motivation | Standard OLS assumes independent, homoskedastic errors. Real-world data (longitudinal, clustered, multilevel) often violate these assumptions, leading to biased SEs and invalid inference. |
| OLS vs. WLS vs. GLS | OLS ignores correlation/heteroskedasticity → unbiased coefficients, wrong SEs if \(V \neq \sigma^2 I\). WLS handles heteroskedasticity with diagonal \(V\). GLS allows general \(V\) with variances + covariances. |
| Estimators | OLS: \(\hat{\beta}_{OLS} = (X^\top X)^{-1}X^\top y\). WLS: \(\hat{\beta}_{WLS} = (X^\top V^{-1}X)^{-1} X^\top V^{-1} y\) with \(V\) diagonal. GLS: same formula with full \(V\). |
| Properties | All are unbiased and consistent for \(\beta\). Efficiency differs: WLS/GLS improve efficiency when \(V\) is correctly specified. |
| Robust Inference | OLS + robust SEs: no assumptions about \(V\), robust but inefficient. GLS + robust SEs: assumes structure in \(V\) for efficiency; robust SEs protect against misspecification. |
| Covariance Structures | \(V_i = \sigma^2 R_i(\alpha)\) with group correlation matrix \(R_i(\alpha)\). Independent: no correlation. Exchangeable: equal correlation \(\rho\). Unstructured: free variances/covariances. AR(1): correlation decays with lag. |
| Example (TLC data) | WLS with a `varIdent` model: succimer significantly reduced baseline lead levels (–5.6 µg/dL, p < 0.001) and showed a strong quadratic trajectory difference vs. placebo. Residual variability increased over time, modeled via week-specific variances. |
| Takeaway | Marginal models give valid population-level inference when correlation/heteroskedasticity is present. Efficiency gains come from modeling \(V\), while robust SEs keep inference valid even if \(V\) is misspecified. |