Module 3A Marginal Models

Readings

Basic Concept

Weighted Least Squares

Motivation and Examples

Motivation:

Marginal Models: assess population-level associations in the presence of heteroscedastic and/or correlated errors without requiring assumptions about subject-specific random effects.

  • Longitudinal data
  • Clustered data
  • Multilevel data

For example, suppose we have the linear regression problem given below:

\[ \mathbf{y} = \mathbf{X} \beta + \epsilon, \quad \epsilon \sim N(0, \mathbf{V}) \]

Here we assume the residual covariance matrix \(\mathbf{V}\) is known and diagonal, meaning the error variances are non-constant across units but the errors are uncorrelated.

\[ \mathbf{V} = \begin{bmatrix} v_1 & 0 & \cdots & 0 \\ 0 & v_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & v_n \end{bmatrix} \]

We are interested in learning the population-level associations between \(\mathbf{y}\) and \(\mathbf{X}\) across \(n\) subjects, i.e., how can we find the least squares estimate of \(\beta\) while accounting for these non-constant error variances (and, more generally, for correlated errors)?

Motivating Example: Revisiting TLC

The Treatment of Lead-Exposed Children (TLC) trial was a placebo-controlled, randomized study of succimer (a chelating agent) in children with blood lead levels of 20-44 micrograms/dL. The data consist of four repeated measurements of blood lead level obtained at baseline (week 0), week 1, week 4, and week 6 on 100 children who were randomly assigned to chelation treatment with succimer or placebo.

Variable    Description
id          child identifier
treatment   succimer or placebo
week        measurement week
lead        lead level (outcome)
library(dplyr)
library(ggplot2)
library(nlme)        # gls(), varIdent(), corAR1(), corCompSymm()
library(sandwich)    # vcovHC(), vcovCL()
library(lmtest)      # coeftest()
library(clubSandwich) # cluster-robust SEs for gls/lme

tlcdata <- readRDS("data/tlc.RDS") %>%
  mutate(
    treatment = factor(treatment),          # succimer / placebo
    week_f    = factor(week, ordered = TRUE),
    week_num  = as.numeric(as.character(week)) # for AR(1) correlation
  )
head(tlcdata)
    id treatment week lead week_f week_num
1    1   placebo    0 30.8      0        0
101  1   placebo    1 26.9      1        1
201  1   placebo    4 25.8      4        4
301  1   placebo    6 23.8      6        6
2    2  succimer    0 26.5      0        0
102  2  succimer    1 14.8      1        1

We are interested in assessing the population-level effect of treatment on the outcome of lead levels.


Why Not OLS/LMM? The Wrong Tool for the Job

  1. OLS: If we ignore clustering and fit the standard OLS model:

\[ y_{ij} = \beta_0 + \beta_1 \text{trt}_i + \beta_2 \text{week}_j + \epsilon_{ij} \]

  • OLS requirements: linearity, independence, homoscedasticity. OLS does not require a distributional assumption on the errors.
  • Errors are assumed independent, which ignores the repeated measures within the same child.
  • We have seen that this does not bias \(\mathbf{\hat{\beta}}\), but it produces incorrect SEs, leading to incorrect inference (see the sketch after this list).
  2. LMM: If we account for clustering using an LMM, we include a subject-specific random effect in the mean function as follows:

\[ \begin{align*} y_{ij} &= \beta_0 + \beta_1 \text{trt}_i + \beta_2 \text{week}_j + \theta_i + \epsilon_{ij}\\ \epsilon_{ij} & \sim N(0, \sigma^2)\\ \theta_i & \sim N(0, \tau^2) \end{align*} \]

  • Random effects are assumed to be normally distributed. Inference depends on correct specification of the random-effects distribution and covariance structure.

  • Estimates population-level associations (fixed effects) conditional on the random effects: \(E(y_{ij} \mid \theta_i) = X_{ij}^T\beta + \theta_i\).

  • \(V\) = var-cov matrix, which includes within-subject correlation, but only through the assumed random-effects structure, i.e., the model-based covariance is compound symmetric, with \(Var(y_{ij}) = \tau^2 + \sigma^2\) on the diagonal and \(Cov(y_{ij}, y_{ij'}) = \tau^2\) off the diagonal within a subject.
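As a quick check of point 1 above, here is a minimal sketch (output not shown) that fits the naive OLS model to the TLC data and compares its model-based standard errors with cluster-robust standard errors that account for repeated measures within each child, using the sandwich and lmtest packages loaded earlier:

# Naive OLS: ignores that each child contributes four correlated measurements
m_ols <- lm(lead ~ treatment + week_num, data = tlcdata)

# Model-based SEs (assume independent, homoscedastic errors)
coeftest(m_ols)

# Cluster-robust (sandwich) SEs, clustering on child id
coeftest(m_ols, vcov = vcovCL(m_ols, cluster = ~ id))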


The Better Approach: Weighted Least Squares

In WLS, we extend the methods we learned in regular OLS (for normal data) to account for heteroscedasticity (unequal variances). Each observation is given a weight based on how much reliability we place on it. This allows a better fit to data where some points have more variability than others.

Let \(\mathbf{W}\) be a diagonal matrix with \(diag(\mathbf{W}) = (v_1^{-1/2}, v_2^{-1/2}, v_3^{-1/2}, ..., v_n^{-1/2})\). Note that \(\mathbf{W}\cdot\mathbf{W} = \mathbf{V}^{-1}\).

Now let’s derive the weighted least squares estimator \(\hat{\beta}_{WLS}\):

Consider the transformed \(y^*\) and transformed \(X^*\):

\[ \begin{align*} y^* &= \mathbf{W}y\\ X^* &= \mathbf{W} X\\ \epsilon^* & = \mathbf{W} \epsilon \end{align*} \]

We now have a regression problem with equal variance.

\[ \begin{align*} y^* &= \mathbf{X^*}\beta + \epsilon^*, \quad \epsilon^* \sim N(0, I)\\ \mathbf{W}\mathbf{y} &= \mathbf{W} \mathbf{X} \beta + \mathbf{W} \epsilon \end{align*} \]

Using the OLS form of the estimator:

\[ \begin{align*} \hat{\beta}_{WLS}& = (X^{*T} X^*)^{-1} X^{*T}y^*\\ & =([WX]^T WX)^{-1}[WX]^T Wy\\ & = (X^T W^T WX)^{-1}X^T W^T Wy\\ & = (X^T V^{-1}X)^{-1}X^T V^{-1}y \end{align*} \]

We can also derive the covariance of the WLS estimator:

\[ \begin{align*} \operatorname{cov}(\hat{\beta}_{WLS}) &= \operatorname{cov}\!\left[(X^{T} V^{-1} X)^{-1}X^{T} V^{-1}y\right] \\ &= (X^{T}V^{-1}X)^{-1}X^{T} V^{-1}\operatorname{cov}(Y)\left[(X^{T}V^{-1}X)^{-1}X^{T}\right]^{T} \\ &= (X^{T} V^{-1} X)^{-1} X^{T} V^{-1} V V^{-1} X (X^{T}V^{-1}X)^{-1} \\ &= (X^{T} V^{-1} X)^{-1} X^{T} V^{-1}X (X^{T} V^{-1}X)^{-1} \\ &= (X^{T} V^{-1} X)^{-1} \end{align*} \]
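To make the algebra concrete, here is a minimal simulated sketch (all variable names hypothetical, not from the TLC data) verifying that the matrix formula \(\hat{\beta}_{WLS} = (X^T V^{-1} X)^{-1} X^T V^{-1} y\) matches lm() with weights equal to \(1/v_i\):

set.seed(1)
n <- 200
x <- runif(n)
v <- exp(1 + 2 * x)                       # known, non-constant error variances
y <- 2 + 3 * x + rnorm(n, sd = sqrt(v))   # heteroscedastic errors

X     <- cbind(1, x)
V_inv <- diag(1 / v)

# WLS estimate and its covariance, computed directly from the formulas above
beta_wls <- solve(t(X) %*% V_inv %*% X) %*% t(X) %*% V_inv %*% y
cov_wls  <- solve(t(X) %*% V_inv %*% X)

# Same point estimate via lm(): weights proportional to 1 / variance
fit_wls <- lm(y ~ x, weights = 1 / v)
cbind(matrix_formula = drop(beta_wls), lm_weights = coef(fit_wls))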

Properties of the WLS Estimator:

  • Comparing WLS vs. OLS: \(E[\hat{\beta}_{WLS}] = E[\hat{\beta}_{OLS}]= \beta\). Both are unbiased and consistent for the true \(\beta\); in the limit, as the sample size grows, both estimators converge to the same population parameter.

  • If there are heteroscedastic errors the WLS estimator is more efficient (lower variance).

  • If there are heteroscedastic errors (\(V \neq \sigma^2 I\)), the usual OLS covariance estimate is not correct \(\rightarrow\) let's look at the true covariance of the OLS estimator:

\[ \begin{align*} cov(\hat{\beta}_{OLS}) & = cov[(X^T X)^{-1} X^Ty]\\ & = (X^T X)^{-1}X^T V X(X^T X)^{-1} \end{align*} \]

However, the usual OLS estimate of this covariance is \(\widehat{cov}(\hat{\beta}_{OLS}) = \hat{\sigma}^2_{OLS}(X^T X)^{-1}\).

Thus, if \(V \neq \sigma^2 I\), inference based on this estimate is incorrect.
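A small simulation sketch (hypothetical data-generating setup) illustrating these properties: under heteroscedastic errors both estimators are unbiased, the WLS slope varies less across replications, and the naive OLS standard error does not match the actual sampling variability of \(\hat{\beta}_{OLS}\):

set.seed(2)
n_sim <- 2000
n     <- 100
x     <- runif(n)
v     <- exp(1 + 2 * x)                   # error variance grows with x

res <- replicate(n_sim, {
  y       <- 2 + 3 * x + rnorm(n, sd = sqrt(v))
  fit_ols <- lm(y ~ x)
  fit_wls <- lm(y ~ x, weights = 1 / v)
  c(ols      = coef(fit_ols)[["x"]],
    wls      = coef(fit_wls)[["x"]],
    se_naive = summary(fit_ols)$coefficients["x", "Std. Error"])
})

rowMeans(res[c("ols", "wls"), ])       # both near the true slope of 3 (unbiased)
apply(res[c("ols", "wls"), ], 1, sd)   # WLS slope has smaller sampling SD (efficiency)
mean(res["se_naive", ])                # compare to the empirical SD of the OLS slope above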


WLS: TLC example

We’ll now apply WLS to the TLC blood lead dataset to account for heteroskedasticity across weeks.

Visual inspection of residuals by week (not shown here yet) suggests that variability in lead levels increases over time.

We’ll allow a non-constant variance for each week using nlme::gls() with a varIdent() variance function.

# Different variance per week; no correlation structure
m_wls <- gls(
  lead ~ treatment * week_f,
  data = tlcdata,
  weights = varIdent(form = ~ 1 | week_f),  # Var differs by week
  method = "REML"
)
summary(m_wls)
Generalized least squares fit by REML
  Model: lead ~ treatment * week_f 
  Data: tlcdata 
       AIC      BIC    logLik
  2635.581 2683.237 -1305.791

Variance function:
 Structure: Different standard deviations per stratum
 Formula: ~1 | week_f 
 Parameter estimates:
       0        1        4        6 
1.000000 1.325880 1.370457 1.524810 

Coefficients:
                               Value Std.Error  t-value p-value
(Intercept)                24.662000 0.4685087 52.63936  0.0000
treatmentsuccimer          -5.577500 0.6625714 -8.41796  0.0000
week_f.L                   -1.893502 0.9201165 -2.05789  0.0403
week_f.Q                    0.594000 0.9370174  0.63393  0.5265
week_f.C                   -0.191407 0.9536188 -0.20072  0.8410
treatmentsuccimer:week_f.L -1.537073 1.3012413 -1.18124  0.2382
treatmentsuccimer:week_f.Q  8.539000 1.3251427  6.44383  0.0000
treatmentsuccimer:week_f.C -2.436867 1.3486206 -1.80693  0.0715

 Correlation: 
                           (Intr) trtmnt wk_f.L wk_f.Q wk_f.C tr:_.L tr:_.Q
treatmentsuccimer          -0.707                                          
week_f.L                    0.268 -0.189                                   
week_f.Q                   -0.045  0.032  0.252                            
week_f.C                    0.061 -0.043 -0.027  0.106                     
treatmentsuccimer:week_f.L -0.189  0.268 -0.707 -0.178  0.019              
treatmentsuccimer:week_f.Q  0.032 -0.045 -0.178 -0.707 -0.075  0.252       
treatmentsuccimer:week_f.C -0.043  0.061  0.019 -0.075 -0.707 -0.027  0.106

Standardized residuals:
       Min         Q1        Med         Q3        Max 
-2.1756515 -0.6849999 -0.1515551  0.5294188  5.6327724 

Residual standard error: 5.022524 
Degrees of freedom: 400 total; 392 residual

Model Interpretation

  • Fixed effects
    • Intercept: Placebo group at baseline = 24.66 (SE = 0.47, p < 0.001).
    • Succimer (baseline difference): At baseline, succimer-treated children had lead levels 5.58 µg/dL lower than placebo (SE = 0.66, p < 0.001).
    • Time trend (placebo group):
      • Linear = –1.89 (SE = 0.92, p = 0.040) → significant decline over time.
      • Quadratic = 0.59 (SE = 0.94, p = 0.53) → no evidence of curvature.
      • Cubic = –0.19 (SE = 0.95, p = 0.84) → no higher-order time effects.
    • Succimer × Time interactions:
      • Linear = –1.54 (SE = 1.30, p = 0.24) → not significant.
      • Quadratic = 8.54 (SE = 1.33, p < 0.001) → succimer group shows a strong positive quadratic trend, diverging from placebo.
      • Cubic = –2.44 (SE = 1.35, p = 0.072) → marginal evidence for cubic effects.
  • Variance function
    • Residual variability increased relative to baseline (week 0):
      • Week 1: ~33% higher
      • Week 4: ~37% higher
      • Week 6: ~52% higher
    • GLS (via varIdent) down-weights observations from weeks with higher variance.

Generalized Least Squares

In WLS, we assumed the residual covariance matrix \(\mathbf{V}\) was known and diagonal—errors were uncorrelated but had unequal variances.

Generalized Least Squares (GLS) drops that restriction: \(\mathbf{V}\) can be any known, positive definite covariance matrix, allowing for both:

  • Unequal variances (heteroskedasticity), and
  • Correlations among observations.

GLS solves the same weighted least squares problem, but now the “weights” come from \(\mathbf{V}^{-1}\), which accounts for the full error structure (variances and covariances).

  • If \(\mathbf{V}\) is diagonal ⟶ GLS = WLS.
  • If \(\mathbf{V} = \sigma^2 \mathbf{I}\) ⟶ GLS = OLS.

So we already know:

\[ \begin{align*} \hat{\beta}_{GLS}& = (X^T V^{-1} X)^{-1} X^T V^{-1}y\\ cov(\hat{\beta}_{GLS})& = (X^T V^{-1} X)^{-1} \end{align*} \] This is because \(V\) is a known positive definite matrix, so we can always find a matrix \(\mathbf{W} = \mathbf{V}^{-1/2}\) with \(\mathbf{W}\mathbf{W}^T=\mathbf{V}^{-1}\), and the WLS derivation above carries over unchanged.

This means that \(\hat{\beta}_{GLS} = (X^T V^{-1}X)^{-1}X^T V^{-1}y\) is the best (minimum variance) linear unbiased estimator (BLUE).
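As a sketch of a full GLS fit on the TLC data (the AR(1)-plus-varIdent structure here is one illustrative choice, not the only reasonable one), we can add a within-child correlation structure to the earlier varIdent specification:

# GLS: week-specific variances + AR(1) correlation within each child
m_gls <- gls(
  lead ~ treatment * week_f,
  data        = tlcdata,
  weights     = varIdent(form = ~ 1 | week_f),     # unequal variances by week
  correlation = corAR1(form = ~ week_num | id),    # correlation decays with the gap in weeks
  method      = "REML"
)
summary(m_gls)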


Robust Inference

We have outlined above when to use WLS vs. GLS, assuming \(V\) is known and correctly specified. However, when \(V\) is unknown or misspecified, the OLS coefficients remain unbiased but their standard errors may be wrong. In this case, robust (sandwich) standard errors provide valid inference without needing the exact form of \(\mathbf{V}\).

Robust (Sandwich) Standard Errors (OLS: The Simpler Case)

For OLS, the covariance of the coefficient estimate is:

\[ Cov(\hat{\beta}_{OLS}) = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{V} \mathbf{X}(\mathbf{X}^T \mathbf{X})^{-1}, \]

where \(\mathbf{V}\) is the var-cov matrix of the errors; under the usual OLS assumptions, \(\mathbf{V} = \sigma^2 \mathbf{I}\) (constant variance on the diagonal and zeros off the diagonal…soooo simple!!).

When \(\mathbf{V}\) is unknown, we can estimate it and substitute the estimate into this formula. One natural choice for \(\hat{\mathbf{V}}\) is the empirical residual covariance matrix:

\[ \hat{\mathbf{V}} = \left[ \mathbf{y} - \mathbf{X} \hat{\beta}_{OLS} \right] \left[ \mathbf{y} - \mathbf{X} \hat{\beta}_{OLS} \right]^\top, \]

or, in expanded form,

\[ \hat{\mathbf{V}} = \begin{pmatrix} (y_1 - \mathbf{x}_1^\top \hat{\beta}_{OLS})^2 & (y_1 - \mathbf{x}_1^\top \hat{\beta}_{OLS})(y_2 - \mathbf{x}_2^\top \hat{\beta}_{OLS}) & \cdots \\ (y_2 - \mathbf{x}_2^\top \hat{\beta}_{OLS})(y_1 - \mathbf{x}_1^\top \hat{\beta}_{OLS}) & (y_2 - \mathbf{x}_2^\top \hat{\beta}_{OLS})^2 & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix}. \]

Plugging \(\hat{\mathbf{V}}\) into the covariance formula yields the robust covariance matrix:

\[ \widehat{\operatorname{Cov}}(\hat{\beta}_{OLS})_{\text{robust}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \left[ (\mathbf{y} - \mathbf{X} \hat{\beta}_{OLS})(\mathbf{y} - \mathbf{X} \hat{\beta}_{OLS})^\top \right] \mathbf{X} (\mathbf{X}^\top \mathbf{X})^{-1}. \]

This is called the sandwich estimator because the middle term (the “meat”)—the residual covariance—is sandwiched between two \((\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top\) terms (the “bread”).

The robust (sandwich) standard errors are obtained by taking the square roots of the diagonal elements of \(\widehat{\operatorname{Cov}}(\hat{\beta}_{OLS})_{\text{robust}}\).

Definition:

This is called the sandwich estimator because of its “bread-meat-bread” structure, with the covariance matrix being “sandwiched” between terms involving the design matrix \(\mathbf{X}\).

The robust standard error provides an estimate of the covariance matrix \(cov(\hat{\beta})\) given by:

\[ \widehat{cov}(\hat{\beta}_{OLS})_{robust} = (X^T X)^{-1}X^T\left[(y - X\hat{\beta}_{OLS})(y-X\hat{\beta}_{OLS})^T\right]X(X^T X)^{-1} \]

Note: The empirical estimate of the “meat” component uses only one observation for each element. Thus it is a poor (inaccurate) estimator of \(\mathbf{V}\). We can obtain a better estimate if we have some idea of the structure of \(\mathbf{V}\).
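A minimal sketch of the sandwich computation on the TLC data (refitting the naive OLS model so the chunk is self-contained). In practice, software such as sandwich::vcovHC() uses only the diagonal of the residual outer product (the squared residuals; the HC0 estimator) as the “meat”, which is appropriate when observations are independent; for clustered data the meat is instead summed within clusters (e.g., vcovCL(), as used earlier).

# Naive OLS fit (as in the earlier sketch)
m_ols <- lm(lead ~ treatment + week_num, data = tlcdata)

X     <- model.matrix(m_ols)
e     <- resid(m_ols)
bread <- solve(t(X) %*% X)

# HC0 "meat": X^T diag(e_i^2) X, the diagonal of the residual outer product
meat_hc0   <- t(X) %*% diag(e^2) %*% X
cov_robust <- bread %*% meat_hc0 %*% bread

# Matches the sandwich package's HC0 estimator
all.equal(cov_robust, vcovHC(m_ols, type = "HC0"), check.attributes = FALSE)

# Robust SEs = square roots of the diagonal elements
sqrt(diag(cov_robust))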

Trade-Offs in Modeling

  • If assumptions are true: stronger assumptions = more accurate estimates and conversely, weaker assumptions = less accurate estimates.
  • When assumptions are false: stronger assumptions = incorrect inference, weaker assumptions = more robust.

Robust (Sandwich) Standard Errors (GLS: Assuming Some Structure on \(V\))

For GLS, recall that the coefficient estimator is:

\[ \hat{\beta}_{GLS} = (X^\top V^{-1} X)^{-1} X^\top V^{-1} y, \]

with covariance

\[ \operatorname{Var}(\hat{\beta}_{GLS}) = (X^\top V^{-1} X)^{-1} X^\top V^{-1} \operatorname{Var}(Y) V^{-1} X (X^\top V^{-1} X)^{-1}. \]

  • If \(V\) is correctly specified, then \(\operatorname{Var}(Y) = V\), and this simplifies to the familiar efficient GLS covariance:

\[ \operatorname{Cov}(\hat{\beta}_{GLS}) = (X^\top V^{-1} X)^{-1}. \]

  • If \(V\) is unknown or misspecified, naive GLS standard errors are invalid.
    In this case, we again can use a robust sandwich estimator, plugging in empirical residuals instead of assuming the true \(V\) is known.

The robust covariance for GLS is:

\[ \widehat{\operatorname{Cov}}(\hat{\beta}_{GLS})_{\text{robust}} = (X^\top \tilde{V}^{-1} X)^{-1} X^\top \tilde{V}^{-1} \hat{\Omega} \tilde{V}^{-1} X (X^\top \tilde{V}^{-1} X)^{-1}, \]

where \(\tilde{V}\) is the assumed (working) covariance structure and the empirical variance is given by:

\[ \hat{\Omega} = (y - X\hat{\beta}_{GLS})(y - X\hat{\beta}_{GLS})^\top. \]

Coming back to our WLS example: we assumed heteroscedastic errors that were uncorrelated across subjects. GLS then weighted observations accordingly, gaining efficiency if that structure was correct. But if some subjects were family members and thus correlated, our covariance assumption would be wrong, and so would our standard errors. GLS with robust SEs protects against this by keeping the weighted estimates while still providing valid inference.

Definition:

The GLS robust (sandwich) estimator has the same bread–meat–bread structure as OLS, but the “bread” now reflects the weighting by \(V^{-1}\):

\[ \hat{cov}(\hat{\beta}_{GLS})_{robust} = (X^\top \tilde{V}^{-1} X)^{-1} X^\top \tilde{V}^{-1} \left[(y - X\hat{\beta}_{GLS})(y - X\hat{\beta}_{GLS})^\top\right] \tilde{V}^{-1} X (X^\top \tilde{V}^{-1} X)^{-1}. \]

  • If \(V\) is correct → GLS is efficient.
  • If \(V\) is wrong → the robust form still yields valid inference.

The idea is that even if \(\tilde{V}\) is misspecified, the standard errors remain valid for inference.
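As a concrete sketch of “GLS + robust SEs” on the TLC data (assuming the m_gls fit from the GLS sketch above; the clubSandwich package was loaded at the top for this purpose), we keep the model-based weighting but report cluster-robust standard errors clustered on child:

# Cluster-robust (CR2) covariance for the GLS fit, clustered on child id
vcr <- vcovCR(m_gls, cluster = tlcdata$id, type = "CR2")

# Coefficient tests using the robust covariance (Satterthwaite degrees of freedom)
coef_test(m_gls, vcov = vcr)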

Takeaway:
- OLS + robust SE = no assumptions about \(V\), robust but potentially inefficient.
- GLS + robust SE = assumes structure in \(V\) for efficiency, but robust SEs protect against misspecification.


Choosing a Covariance Structure for Grouped Data

When data are grouped or repeatedly measured (e.g., patients within clinics, time points within subjects), the error terms are usually correlated. In GLS, we model this correlation by specifying the covariance matrix:

\[ V_i = \sigma^2 R_i(\alpha), \]

where \(\sigma^2\) is the common variance and \(R_i(\alpha)\) is the within-group correlation matrix, which depends on parameters \(\alpha\) (e.g., a correlation \(\rho\)).

\(R_i(\alpha)=\) working correlation structure where \(R_i[j, j'] = cor(y_{ij}, y_{ij'})\).

For example, for a group (or cluster) of size 3, common choices for correlation matrices \(R_i(\alpha)\) look as follows:

  • Independent

    • Assumes no correlation among observations within a group.
    • \[ R_i = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}. \]
  • Exchangeable (Compound Symmetry)

    • All pairs of observations in a group share the same correlation \(\rho\).
    • \[ R_i(\alpha) = \begin{bmatrix} 1 & \alpha & \alpha \\ \alpha & 1 & \alpha\\ \alpha & \alpha & 1 \end{bmatrix}. \]
  • Unstructured

    • Every pairwise correlation is estimated freely (and, in the full covariance \(V_i\), every variance as well). For 3 observations we need 3 variances and 3 unique correlations.
    • \[ R_i = \begin{bmatrix} 1 & \alpha_{12} & \alpha_{13} \\ \alpha_{12} & 1 & \alpha_{23} \\ \alpha_{13} & \alpha_{23} & 1 \end{bmatrix}. \]
  • AR(1) (First-Order Autoregressive)

    • Correlation decays with time lag: \(\text{Corr}(\varepsilon_t,\varepsilon_{t+k}) = \alpha^k\).
    • \[ R_i(\alpha) = \begin{bmatrix} 1 & \alpha & \alpha^2 \\ \alpha & 1 & \alpha \\ \alpha^2 & \alpha & 1 \end{bmatrix}. \]

    Note: As we learned above, if we use GLS + robust SEs, then even if we specify an incorrect correlation matrix, we still obtain valid inference (though we may lose some efficiency).
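In nlme these working correlation structures map onto the correlation argument of gls(). A brief sketch using the TLC model (the structure choices are for illustration; with identical fixed effects they can be compared, e.g., by AIC):

# Independent: no correlation argument (the default)
m_ind <- gls(lead ~ treatment * week_f, data = tlcdata)

# Exchangeable / compound symmetry within child
m_cs  <- gls(lead ~ treatment * week_f, data = tlcdata,
             correlation = corCompSymm(form = ~ 1 | id))

# AR(1): correlation decays with the gap in weeks
m_ar1 <- gls(lead ~ treatment * week_f, data = tlcdata,
             correlation = corAR1(form = ~ week_num | id))

# Unstructured: one free correlation per pair of weeks (plus week-specific variances)
m_un  <- gls(lead ~ treatment * week_f, data = tlcdata,
             correlation = corSymm(form = ~ 1 | id),
             weights = varIdent(form = ~ 1 | week_f))

AIC(m_ind, m_cs, m_ar1, m_un)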

Summary

  • What is the model? Marginal models estimate population-level associations without specifying random effects. The general form is \(y = X\beta + \epsilon\), with \(\operatorname{Var}(\epsilon) = V\).
  • Motivation: Standard OLS assumes independent, homoskedastic errors. Real-world data (longitudinal, clustered, multilevel) often violate these assumptions, leading to biased SEs and invalid inference.
  • OLS vs. WLS vs. GLS: OLS ignores correlation/heteroskedasticity → unbiased coefficients, wrong SEs if \(V \neq \sigma^2 I\). WLS handles heteroskedasticity with diagonal \(V\). GLS allows general \(V\) with variances + covariances.
  • Estimators: OLS: \(\hat{\beta}_{OLS} = (X^\top X)^{-1}X^\top y\). WLS: \(\hat{\beta}_{WLS} = (X^\top V^{-1}X)^{-1} X^\top V^{-1} y\), with \(V\) diagonal. GLS: same formula with full \(V\).
  • Properties: All are unbiased and consistent for \(\beta\). Efficiency differs: WLS/GLS improve efficiency when \(V\) is correctly specified.
  • Robust inference: OLS + robust SEs: no assumptions about \(V\), robust but inefficient. GLS + robust SEs: assumes structure in \(V\) for efficiency; robust SEs protect against misspecification.
  • Covariance structures: \(V_i = \sigma^2 R_i(\alpha)\) with group correlation matrix \(R_i(\alpha)\). Independent: no correlation. Exchangeable: equal correlation \(\rho\). Unstructured: free variances/covariances. AR(1): correlation decays with lag.
  • Example (TLC data): In the WLS model with varIdent, succimer significantly reduced baseline lead levels (–5.6 µg/dL, p < 0.001) and showed a strong quadratic trajectory difference vs. placebo. Residual variability increased over time, modeled via week-specific variances.
  • Takeaway: Marginal models give valid population-level inference when correlation/heteroskedasticity is present. Efficiency gains come from modeling \(V\), while robust SEs ensure inference remains valid even if \(V\) is misspecified.