Module 3B: Marginal Models

Motivation and Examples

Important

Generalized Estimating Equations (GEE) extend regression models to handle correlated response data while still targeting population-level associations. GEE does not require a full likelihood specification; instead, it uses estimating equations together with a working correlation structure to improve efficiency. Both WLS and GLS can be seen as special cases of this framework, and GEE extends these ideas beyond linear models to correlated, non-Gaussian outcomes through a quasi-likelihood approach.

  • Longitudinal data with repeated measures
  • Clustered outcomes (e.g., patients within hospitals, students within schools)
  • Multilevel or correlated designs
  • Non-normal outcomes (e.g., binary or count data)

For example, suppose we have repeated binary outcomes \(y_{it}\) for subject \(i\) at time \(t\):

\[ \text{logit}\{\Pr(y_{it}=1)\} = \mathbf{x}_{it}^\top \beta \]

If we fit this with standard logistic regression, we would incorrectly assume independence across time within each subject. GEE addresses this by introducing a working covariance matrix \(\mathbf{V}_i\) for subject \(i\) that accounts for correlation between repeated measures:

\[ \mathbf{V}_i = \mathbf{A}_i^{1/2}\,\mathbf{R}(\alpha)\,\mathbf{A}_i^{1/2}, \]

where \(\mathbf{A}_i\) is a diagonal variance matrix (from the mean model) and \(\mathbf{R}(\alpha)\) is a correlation structure (e.g., exchangeable, AR(1)).
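To make this concrete, here is a minimal R sketch (with made-up fitted means and a hypothetical \(\alpha = 0.5\)) of how \(\mathbf{V}_i\) is assembled for one subject with three binary measurements:

mu_i <- c(0.2, 0.4, 0.6)                        # hypothetical fitted means for one subject
A_half <- diag(sqrt(mu_i * (1 - mu_i)))         # A_i^{1/2}: Bernoulli variance is mu(1 - mu)
R_exch <- matrix(0.5, 3, 3); diag(R_exch) <- 1  # exchangeable R(alpha) with alpha = 0.5
V_i <- A_half %*% R_exch %*% A_half             # working covariance V_i
V_i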

We are interested in learning the average effect of covariates on the outcome across the population, while accounting for within-subject correlation. GEE provides a way to solve for \(\beta\) using the estimating equation:

\[ U(\beta) = \sum_i D_i^\top V_i^{-1}(y_i - \mu_i) = 0, \]

where \(D_i = \partial \mu_i/\partial \beta^\top\).

Motivating Example: A Toy Example

Suppose we collect three measurements, two from the same subject and one from a different subject. Because two measurements come from the same subject, they might be correlated, while the third is independent. How should we combine these observations to estimate the overall population mean?

Consider data \(y_1, y_2, y_3\). Mathematically, we can represent this with a covariance matrix in which \(y_1\) and \(y_2\) are correlated while \(y_3\) is independent:

\[ \begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix} \sim N \left(\begin{pmatrix} \mu\\ \mu \\ \mu\end{pmatrix}, \; \Sigma = \begin{bmatrix} \sigma^2 & R & 0\\ R & \sigma^2 & 0 \\ 0 & 0 & \sigma^2 \end{bmatrix} \right) \]

Things to note: \(y_1\) and \(y_2\) come from the same cluster and so may be correlated (\(R \neq 0\)); \(y_3\) comes from a different cluster and is independent of both.

If we simply take the average of the three values, we still get an unbiased estimate of the population mean, \(\hat{\mu} = \frac{1}{3}(y_1 + y_2 + y_3)\). But when \(y_1\) and \(y_2\) are correlated, this average does not make the best use of the data: the variance of the estimator depends on the correlation \(R\). The above estimator has variance

\[ \left[\tfrac{1}{3},\tfrac{1}{3}, \tfrac{1}{3}\right]\Sigma\left[\tfrac{1}{3},\tfrac{1}{3}, \tfrac{1}{3}\right]^\top = \frac{3\sigma^2 + 2R}{9}, \]

so as \(R\) increases, the variance of the estimator increases.
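A quick numerical check of this variance, as a sketch with illustrative values \(\sigma^2 = 1\) and \(R = 0.5\):

sigma2 <- 1; R <- 0.5                  # illustrative values
Sigma <- matrix(c(sigma2, R,      0,
                  R,      sigma2, 0,
                  0,      0,      sigma2), nrow = 3)
w <- rep(1/3, 3)
drop(t(w) %*% Sigma %*% w)             # variance computed directly: 0.444...
(3 * sigma2 + 2 * R) / 9               # the closed-form expression; matches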

In a general context, we wish to find a set of weights \(\mathbf{w} = (w_1, w_2, w_3)\) such that the weights meet these conditions:

  1. \(\sum_k w_k = 1\) (so that \(\hat{\mu}\) is unbiased);
  2. \(w_k > 0\) for all \(k\);
  3. the variance of the estimator is minimized:

\[ \begin{align*} \operatorname{Var}(\hat{\mu}) &= [w_1, w_2, w_3]\,\Sigma\,[w_1, w_2, w_3]^\top\\ & = [w_1, w_2, w_3] \begin{bmatrix} \sigma^2 & R & 0\\ R & \sigma^2 & 0\\ 0 & 0 & \sigma^2 \end{bmatrix} [w_1, w_2, w_3]^\top\\ & = [w_1 \sigma^2 + w_2 R,\; w_1 R + w_2 \sigma^2,\; w_3 \sigma^2] \cdot [w_1, w_2, w_3]^\top\\ & = (w_1^2 + w_2^2 + w_3^2)\sigma^2 + 2 w_1 w_2 R. \end{align*} \] When \(R = 0\), the variance is minimized by \(w_1 = w_2 = w_3\) (the simple sample average), but this is no longer true when \(R \neq 0\). So let’s look at some other options:

  1. Simple average:

\[ \mathbf{w} = \left[\frac{1}{3},\frac{1}{3}, \frac{1}{3}\right] \]

  2. Two-stage: average the correlated pair \((y_1, y_2)\), then average the result with \(y_3\):

\[ \mathbf{w} = \left[\frac{1}{4},\frac{1}{4}, \frac{1}{2}\right] \]

  3. Optimal (“secret”) weights, assuming \(R\) and \(\sigma^2\) are known:

\[ \mathbf{w} = \left( \frac{1}{3 + R/\sigma^2},\frac{1}{3 + R/\sigma^2}, \frac{1 + R/\sigma^2}{3 + R/\sigma^2}\right) \]

Toy example: efficiency under different weighting schemes

| Estimator | Weights | \(\mathrm{Var}(\hat{\mu})\) | \(R=0\) | \(R=0.5\sigma^2\) | \(R=\sigma^2\) |
|---|---|---|---|---|---|
| 1. Simple average | \(\left[\tfrac{1}{3},\tfrac{1}{3},\tfrac{1}{3}\right]\) | \(\tfrac{3\sigma^2+2R}{9}\) | \(\tfrac{1}{3}\sigma^2\) | \(\tfrac{4}{9}\sigma^2\) | \(\tfrac{5}{9}\sigma^2\) |
| 2. Two-stage | \(\left[\tfrac{1}{4},\tfrac{1}{4},\tfrac{1}{2}\right]\) | \(\tfrac{3}{8}\sigma^2+\tfrac{1}{8}R\) | \(\tfrac{3}{8}\sigma^2\) | \(\tfrac{7}{16}\sigma^2\) | \(\tfrac{1}{2}\sigma^2\) |
| 3. Optimal (“secret”) | \(\left[\tfrac{1}{3+R/\sigma^2},\tfrac{1}{3+R/\sigma^2},\tfrac{1+R/\sigma^2}{3+R/\sigma^2}\right]\) | \(\tfrac{(1+R/\sigma^2)}{3+R/\sigma^2}\,\sigma^2\) | \(\tfrac{1}{3}\sigma^2\) | \(\tfrac{3}{7}\sigma^2\) | \(\tfrac{1}{2}\sigma^2\) |

Notes.

  • Scenario 3 comes from minimizing \(\operatorname{Var}(\hat\mu)\) subject to \(\sum_k w_k=1\), which yields \(w_1=w_2=\sigma^2/(R+3\sigma^2)\), \(w_3=(R+\sigma^2)/(R+3\sigma^2)\), and \(\operatorname{Var}(\hat\mu)=\sigma^2(R+\sigma^2)/(R+3\sigma^2)\).
  • Scenario 3 (the optimal “secret” weights) has the smallest variance across the entire range of \(R\) values.
  • Scenario 3 reduces to Scenario 1 when \(R=0\) (independence).
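The table values can be reproduced numerically. A small sketch, taking \(\sigma^2 = 1\) so that \(R\) is expressed in units of \(\sigma^2\):

var_w <- function(w, R, sigma2 = 1) {        # Var(mu_hat) = w' Sigma w
  Sigma <- matrix(c(sigma2, R, 0, R, sigma2, 0, 0, 0, sigma2), 3, 3)
  drop(t(w) %*% Sigma %*% w)
}
for (R in c(0, 0.5, 1)) {
  w_opt <- c(1, 1, 1 + R) / (3 + R)          # optimal weights when sigma2 = 1
  print(round(c(simple    = var_w(rep(1/3, 3), R),
                two_stage = var_w(c(1/4, 1/4, 1/2), R),
                optimal   = var_w(w_opt, R)), 4))
}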
Estimating Equations for \(\hat{\mu}\)

In the toy example with three observations \(y_1, y_2, y_3\), we define an estimator of the population mean as a weighted average:

\[ \hat{\mu} = \sum_{k=1}^3 w_k y_k, \qquad \sum_{k=1}^3 w_k = 1. \]

Each estimator arises as the solution to an estimating equation of the form

\[ U(\mu) = \sum_{k=1}^3 w_k (y_k - \mu) = 0. \]

Solving \(U(\mu)=0\) gives \(\hat{\mu} = \sum w_k y_k\).
The difference between estimators lies in the choice of weights \(w_k\).


1. Equal Weights Estimator

Weights: \(w_1=w_2=w_3=\tfrac{1}{3}\)

Estimating equation:

\[ U(\mu) = \tfrac{1}{3}(y_1 - \mu) + \tfrac{1}{3}(y_2 - \mu) + \tfrac{1}{3}(y_3 - \mu) = 0. \]

Solution:

\[ \hat{\mu} = \tfrac{1}{3}(y_1+y_2+y_3). \]


2. Two-Stage Estimator (average the correlated pair, then average with \(y_3\))

Weights: \(w_1=w_2=\tfrac{1}{4}, \; w_3=\tfrac{1}{2}\)

Estimating equation:

\[ U(\mu) = \tfrac{1}{4}(y_1 - \mu) + \tfrac{1}{4}(y_2 - \mu) + \tfrac{1}{2}(y_3 - \mu) = 0. \]

Solution:

\[ \hat{\mu} = \tfrac{1}{4}(y_1+y_2) + \tfrac{1}{2}y_3. \]


3. Variance-Optimal Estimator

Weights chosen to minimize \(\mathrm{Var}(\hat{\mu})\):

\[ w_1 = w_2 = \frac{\sigma^2}{R+3\sigma^2}, \quad w_3 = \frac{R+\sigma^2}{R+3\sigma^2}. \]

Estimating equation:

\[ U(\mu) = w_1(y_1 - \mu) + w_2(y_2 - \mu) + w_3(y_3 - \mu) = 0. \]

Solution:

\[ \hat{\mu} = \frac{\sigma^2}{R+3\sigma^2}(y_1+y_2) + \frac{R+\sigma^2}{R+3\sigma^2}y_3. \]


Key Point

All three scenarios fit the estimating equation framework:

\[ \sum_{k=1}^3 w_k (y_k - \mu) = 0, \]

but the efficiency of \(\hat{\mu}\) depends on the choice of weights \((w_1,w_2,w_3)\), which in turn reflects how we model correlation \(R\).
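To see this concretely, a tiny sketch (with made-up data) confirming that solving \(U(\mu) = 0\) numerically recovers the weighted average:

y <- c(1.2, 0.9, 2.0)                        # made-up observations
U <- function(mu, w) sum(w * (y - mu))       # the estimating function
w <- c(1/4, 1/4, 1/2)                        # two-stage weights
uniroot(U, interval = range(y), w = w)$root  # root of U(mu) = 0
sum(w * y)                                   # same answer in closed form: 1.525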

Estimating Equations

The method of estimating equations provides a unifying way to obtain parameter estimates in generalized linear models (GLMs).

For models in the exponential family (Normal, Bernoulli/Logistic, Poisson, etc.), the score equations—the first derivatives of the log-likelihood with respect to the parameters—take a common form.

In general, if \(\mu_i(\beta)=E[y_i \mid x_i;\beta]\) denotes the mean and \(V_i\) the variance of \(y_i\), then the score equations are:

\[ \sum_{i=1}^n \frac{\partial \mu_i(\beta)}{\partial \beta_k}\, V_i^{-1}\, \big(y_i - \mu_i(\beta)\big) = 0, \qquad k=1,\dots,p. \]

  • \(\mu(\beta)\): the \(n \times 1\) vector of means \(E(y_i \mid x_i; \beta)\).
  • \(V(y; \beta)\): the \(n \times n\) diagonal matrix of variances \(V_i\).
  • \(\partial \mu(\beta)/\partial \beta\): the \(n \times p\) matrix of partial derivatives.

These equations are solved simultaneously to obtain the maximum likelihood estimate (MLE) \(\hat\beta\).

The key point is that, regardless of the outcome distribution, the score function always looks like a weighted sum of covariates multiplied by residuals. What changes across models is:

  • the mean function \(\mu_i(\beta)\),
  • the link function \(g(\cdot)\) that relates \(\mu_i\) to the linear predictor \(x_i^\top \beta\), and
  • the variance function \(V_i\) associated with the distribution.

\[ cov(\hat{\beta}) = \left(\left[ \frac{\partial \mu(\beta)}{\partial \beta}\right]^T V(y; \beta)^{-1}\left[ \frac{\partial \mu(\beta)}{\partial \beta}\right]\right)^{-1} \]

Below we illustrate this general form for three common GLMs: Normal regression, Logistic regression, and Poisson log-linear regression.

Normal Regression

Model: \(y_{ij}=\eta_{ij}+\varepsilon_{ij}\), \(\eta_{ij}=x_{ij}^\top\beta\), \(g(\mu)=\mu\), \(v(\mu)=1\), typically \(\phi=\sigma^2\).

Then \(A_i=\phi I_{n_i}\), \(D_i=X_i\), and \[ U(\beta)=\sum_{i=1}^m X_i^\top V_i^{-1}(y_i-X_i\beta)=0, \] which reduces to GLS for a chosen working \(R(\alpha)\).
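Because the identity link makes \(U(\beta)=0\) linear in \(\beta\), the solution is available in closed form. A simulated sketch with a known \(V\) (toy data, not the course dataset):

set.seed(3)
n <- 50
X <- cbind(1, rnorm(n))
V <- 0.5^abs(outer(1:n, 1:n, "-"))           # an AR(1)-style working covariance
y <- drop(X %*% c(1, 2) + t(chol(V)) %*% rnorm(n))  # correlated Gaussian errors
beta_gls <- solve(t(X) %*% solve(V, X), t(X) %*% solve(V, y))
drop(beta_gls)                               # close to the true (1, 2)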

Logistic Regression

Model: \(y_{i} \sim \mathrm{Bernoulli}(p_{i})\) with
\[ p_{i} = \frac{e^{x_{i}^\top \beta}}{1 + e^{x_{i}^\top \beta}}. \]

For a single observation, the log-likelihood is \[ \ell(\beta \mid y_{i}, x_{i}) = y_{i} \log(p_{i}) + (1 - y_{i}) \log(1 - p_{i}). \]

Score (gradient) for a single observation: \[ U_{i}(\beta) = \frac{\partial \ell_{i}(\beta)}{\partial \beta} = x_{i}\,\big(y_{i}-p_{i}\big). \]

Stacked over all observations (independent case): \[ U(\beta)=\sum_{i=1}^n x_{i}\,\big(y_{i}-p_{i}\big) = X^\top (y-\mu), \quad \text{with } \mu_i=p_i. \]

Therefore, the MLE \(\hat\beta\) is obtained by solving \(p\) equations in \(p\) unknowns: \[ \sum_{i=1}^n x_{ik}\,\Big(y_i - p_i(\beta)\Big) = 0, \qquad k=1,\dots,p, \] or equivalently, \[ \sum_{i=1}^n x_{ik}\,\left(y_i - \frac{e^{x_i^\top \beta}}{1 + e^{x_i^\top \beta}}\right) = 0, \qquad k=1,\dots,p. \]

In compact matrix form the estimating equation is given by: \[ X^\top\!\big(y-\mu(\beta)\big)=0. \]

There is no closed-form solution; numerically one can use Newton–Raphson or Iteratively Reweighted Least Squares (IRLS).
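As an illustration, a bare-bones Newton–Raphson solver for the logistic score equations on simulated data (a sketch, not production code):

set.seed(42)
n <- 500
X <- cbind(1, rnorm(n))
y <- rbinom(n, 1, plogis(X %*% c(-0.5, 1)))  # simulate from a known beta
beta <- c(0, 0)                              # starting value
for (iter in 1:25) {
  p <- plogis(drop(X %*% beta))
  U <- drop(t(X) %*% (y - p))                # score: X^T (y - mu)
  H <- t(X) %*% (X * (p * (1 - p)))          # Fisher information X^T W X
  beta <- beta + solve(H, U)
  if (max(abs(U)) < 1e-10) break
}
fit_glm <- glm(y ~ X[, 2], family = binomial)
cbind(newton = beta, glm = coef(fit_glm))    # the two solutions agree
solve(H)                                     # matches vcov(fit_glm): the model-based covariance above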

Poisson Log-Linear Regression

Model: \(y_{i} \sim \mathrm{Poisson}(\mu_{i})\) with
\[ \mu_{i} = e^{x_{i}^\top \beta}, \qquad g(\mu_i) = \log(\mu_i). \]

For a single observation, the log-likelihood is \[ \ell(\beta \mid y_{i}, x_{i}) = y_{i}\,x_i^\top \beta - e^{x_i^\top \beta} - \log(y_i!). \]

Score (gradient) for a single observation: \[ U_{i}(\beta) = \frac{\partial \ell_{i}(\beta)}{\partial \beta} = x_{i}\,\big(y_{i}-\mu_{i}\big). \]

Stacked over all observations (independent case): \[ U(\beta)=\sum_{i=1}^n x_{i}\,\big(y_{i}-\mu_{i}\big) = X^\top (y-\mu), \quad \text{with } \mu_i = e^{x_i^\top \beta}. \]

Therefore, the MLE \(\hat\beta\) is obtained by solving \(p\) equations in \(p\) unknowns: \[ \sum_{i=1}^n x_{ik}\,\Big(y_i - \mu_i(\beta)\Big) = 0, \qquad k=1,\dots,p, \] or equivalently, \[ \sum_{i=1}^n x_{ik}\,\Big(y_i - e^{x_i^\top \beta}\Big) = 0, \qquad k=1,\dots,p. \]

In compact matrix form the estimating equation is given by: \[ X^\top\!\big(y-\mu(\beta)\big)=0. \]

There is no closed-form solution; numerically one can use Newton–Raphson or Iteratively Reweighted Least Squares (IRLS).
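A quick sanity check on simulated data that glm() indeed solves \(X^\top\big(y-\mu(\hat\beta)\big)=0\) for the Poisson model:

set.seed(7)
x <- rnorm(300)
y <- rpois(300, exp(0.2 + 0.5 * x))          # simulate from a known beta
fit_pois <- glm(y ~ x, family = poisson)
X <- model.matrix(fit_pois)
drop(t(X) %*% (y - fitted(fit_pois)))        # numerically zero at the MLE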


Motivating Example: Coming back to the Crossover Design

We come back to our crossover study:

Data were obtained from a crossover trial on cerebrovascular deficiency. The goal is to investigate the side effects of an active drug compared to a placebo.

Design:

  • 34 patients received the active drug (A) followed by the placebo (B).

  • 33 patients received the placebo (B) followed by the active drug (A).

  • Outcome: a normal (0) or abnormal (1) electrocardiogram (ECG).

  • Each patient has a binary observation at period 1 and period 2.

  • Crossover designs can have “carryover” effects that confound treatment-effect estimation; the design lets us test whether the washout period was adequate.

Variables:

  • ID \(i\): subject id

  • period \(j\): 0 = period 1, 1 = period 2.

  • group: 0 = B then A, 1 = A then B

  • outcome \(y_{ij}\): 0 = normal ECG response; 1 = abnormal ECG response.

  • trt: 0 = placebo, 1 = active drug.

library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(gee)
library(lme4)
Loading required package: Matrix
data <- readRDS("data/crossover.RDS")

head(data)
  ID group period trt outcome
1  1     1      0   0       0
2  1     1      1   1       0
3  2     1      0   0       0
4  2     1      1   1       0
5  3     1      0   0       0
6  3     1      1   1       0

We fit a marginal (population-average) logistic model for abnormal ECG as a function of treatment, period, and their interaction, clustering on subject. The working correlation captures the within-subject dependence across the two periods; with only two measurements per person, the exchangeable and AR(1) structures coincide, so either is a reasonable choice. Robust (“sandwich”) standard errors ensure valid inference even if the working correlation is mis-specified.

#Marginal model with independence working correlation
fit_indep <- gee(
  outcome ~ trt * period,
  id = ID, data = data,
  family = binomial(link = "logit"),
  corstr = "independence"
)
(Intercept)         trt      period  trt:period 
 -1.5404450   1.1096621   0.8472979  -1.0226507 
#Marginal model with exchangeable working correlation
fit_exch <- gee(
  outcome ~ trt * period,
  id = ID, data = data,
  family = binomial(link = "logit"),
  corstr = "exchangeable"
)
(Intercept)         trt      period  trt:period 
 -1.5404450   1.1096621   0.8472979  -1.0226507 
#GLMM (CONDITIONAL) model with a subject-level random intercept
fit_glmm <- glmer(outcome ~ trt * period + (1|ID),
                  data = data, family= binomial, nAGQ = 25)




summary(fit_indep)

 GEE:  GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
 gee S-function, version 4.13 modified 98/01/27 (1998) 

Model:
 Link:                      Logit 
 Variance to Mean Relation: Binomial 
 Correlation Structure:     Independent 

Call:
gee(formula = outcome ~ trt * period, id = ID, data = data, family = binomial(link = "logit"), 
    corstr = "independence")

Summary of Residuals:
       Min         1Q     Median         3Q        Max 
-0.3939394 -0.3529412 -0.1764706  0.6060606  0.8235294 


Coefficients:
              Estimate Naive S.E.   Naive z Robust S.E.  Robust z
(Intercept) -1.5404450  0.4567363 -3.372723   0.4498677 -3.424218
trt          1.1096621  0.5826118  1.904634   0.5738502  1.933714
period       0.8472979  0.5909040  1.433901   0.5820177  1.455794
trt:period  -1.0226507  0.7827812 -1.306432   0.9789663 -1.044623

Estimated Scale Parameter:  1.030769
Number of Iterations:  1

Working Correlation
     [,1] [,2]
[1,]    1    0
[2,]    0    1
summary(fit_exch)

 GEE:  GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
 gee S-function, version 4.13 modified 98/01/27 (1998) 

Model:
 Link:                      Logit 
 Variance to Mean Relation: Binomial 
 Correlation Structure:     Exchangeable 

Call:
gee(formula = outcome ~ trt * period, id = ID, data = data, family = binomial(link = "logit"), 
    corstr = "exchangeable")

Summary of Residuals:
       Min         1Q     Median         3Q        Max 
-0.3939394 -0.3529412 -0.1764706  0.6060606  0.8235294 


Coefficients:
              Estimate Naive S.E.   Naive z Robust S.E.  Robust z
(Intercept) -1.5404450  0.4567363 -3.372723   0.4498677 -3.424218
trt          1.1096621  0.5826118  1.904634   0.5738502  1.933714
period       0.8472979  0.5909040  1.433901   0.5820177  1.455794
trt:period  -1.0226507  0.9997117 -1.022946   0.9789663 -1.044623

Estimated Scale Parameter:  1.030769
Number of Iterations:  1

Working Correlation
          [,1]      [,2]
[1,] 1.0000000 0.6401548
[2,] 0.6401548 1.0000000
summary(fit_glmm)
Generalized linear mixed model fit by maximum likelihood (Adaptive
  Gauss-Hermite Quadrature, nAGQ = 25) [glmerMod]
 Family: binomial  ( logit )
Formula: outcome ~ trt * period + (1 | ID)
   Data: data

     AIC      BIC   logLik deviance df.resid 
   145.1    159.6    -67.5    135.1      129 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.1399 -0.2140 -0.1462  0.2435  1.3149 

Random effects:
 Groups Name        Variance Std.Dev.
 ID     (Intercept) 24.4     4.94    
Number of obs: 134, groups:  ID, 67

Fixed effects:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)   -5.004      2.176  -2.299   0.0215 *
trt            3.595      2.140   1.680   0.0929 .
period         2.786      2.042   1.364   0.1726  
trt:period    -3.338      3.303  -1.011   0.3122  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
           (Intr) trt    period
trt        -0.831              
period     -0.791  0.901       
trt:period  0.659 -0.903 -0.916

Model Interpretation

  • GEE (Independence vs Exchangeable)
    • Estimates and robust SEs are nearly identical.
    • Treatment effect ≈ 1.1 on the log-odds scale (OR ≈ 3), borderline significant (\(p \approx 0.05\)).
    • Period and treatment–period interaction effects are not statistically significant.
    • Working correlation choice does not change inference, reflecting robustness of sandwich SEs.
  • GLMM (Random Intercept)
    • Fixed effects show the same qualitative pattern but with larger magnitudes (treatment ≈ 3.6 on log-odds scale).
    • Larger coefficients reflect subject-specific (conditional) associations.
    • Inference indicates moderate treatment effect, but with wide uncertainty.
  • Key Contrast
    • GEE estimates population-average effects, robust to correlation specification.
    • GLMM estimates subject-specific effects, often stronger in magnitude but less generalizable (see the sketch below).
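The gap between the two scales can be sanity-checked against a well-known approximation for logit links, \(\beta_{\text{marginal}} \approx \beta_{\text{conditional}} / \sqrt{1 + c^2 \sigma_b^2}\) with \(c^2 \approx 0.346\) (Zeger, Liang, and Albert, 1988). A rough sketch using the fitted values above:

sigma_b2 <- 24.4                       # random-intercept variance from the GLMM
3.595 / sqrt(1 + 0.346 * sigma_b2)     # approx 1.17, close to the GEE estimate of 1.11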

Summary

| Section | Description |
|---|---|
| What is the model? | Generalized Estimating Equations (GEE) extend GLMs to correlated outcomes (longitudinal or clustered). The mean structure is the same as a GLM, \(\mathbb{E}[y_{it}] = g^{-1}(x_{it}^\top \beta)\), but estimation incorporates within-cluster correlation through a working covariance matrix \(\mathbf{V}_i = \mathbf{A}_i^{1/2} R(\alpha)\mathbf{A}_i^{1/2}\). |
| Motivation | Standard GLMs assume independent outcomes, which fails with repeated measures or clustering. GEE provides population-averaged associations while accounting for correlation. Even if the working correlation is misspecified, inference remains valid using robust (sandwich) SEs. |
| Estimating equation | \(\hat{\beta}\) solves \(U(\beta) = \sum_i D_i^\top V_i^{-1}(y_i - \mu_i) = 0\), where \(D_i = \partial \mu_i / \partial \beta^\top\). This generalizes the score equations from GLMs. |
| Working correlation | Choices include independence, exchangeable, AR(1), unstructured, etc. A structure close to the truth improves efficiency, but robust SEs protect inference if it is wrong. |
| Sandwich variance | \(\widehat{\operatorname{Cov}}(\hat{\beta}) = B^{-1} M B^{-1}\), where \(B = \sum_i D_i^\top V_i^{-1} D_i\) and \(M = \sum_i D_i^\top V_i^{-1}(y_i-\mu_i)(y_i-\mu_i)^\top V_i^{-1} D_i\). Provides valid SEs under misspecification. |
| Key idea | GEE does not require a full likelihood; it relies on estimating equations. Efficiency comes from a good working correlation; validity comes from robust variance. |
| Interpretation | GEE coefficients give the average effect of covariates on the outcome across the population. Contrast with GLMM: GEE = marginal effects; GLMM = subject-specific effects. |
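As a final sketch tying the sandwich formula to the crossover example: under the independence working structure, \(D_i^\top V_i^{-1}\) reduces to \(X_i^\top\) (up to the scale factor, which cancels in the sandwich), so the robust SEs can be assembled by hand. The values should closely match the “Robust S.E.” column reported above:

X <- model.matrix(~ trt * period, data = data)
p <- plogis(drop(X %*% coef(fit_indep)))     # fitted means from the independence fit
res <- data$outcome - p
U_i <- rowsum(X * res, group = data$ID)      # cluster-level score contributions
M <- crossprod(U_i)                          # meat: sum of u_i u_i^T
B <- crossprod(X * (p * (1 - p)), X)         # bread: X^T A X
sqrt(diag(solve(B) %*% M %*% solve(B)))      # robust SEs; compare with summary(fit_indep)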