8.2 Estimation in Linear Mixed Models
The general Linear Mixed Model is:
\mathbf{Y}_i = \mathbf{X}_i \beta + \mathbf{Z}_i \mathbf{b}_i + \epsilon_i,
where:
- \beta: Fixed effects (parameters shared across subjects).
- \mathbf{b}_i: Random effects (subject-specific deviations).
- \mathbf{X}_i: Design matrix for the fixed effects.
- \mathbf{Z}_i: Design matrix for the random effects.
- \mathbf{D}: Covariance matrix of the random effects.
- \mathbf{\Sigma}_i: Covariance matrix of the residual errors.
Since \beta, \mathbf{b}_i, \mathbf{D}, and \mathbf{\Sigma}_i are unknown, they must be estimated or predicted from the data.
- \beta, \mathbf{D}, \mathbf{\Sigma}_i are fixed parameters → they must be estimated.
- \mathbf{b}_i is a random variable → it must be predicted rather than estimated; a random quantity cannot be "estimated" in the usual sense.
Thus, we define:
- Estimator of \beta: \hat{\beta} (fixed-effect estimation).
- Predictor of \mathbf{b}_i: \hat{\mathbf{b}}_i (random-effect prediction).
Then:
The population-level estimate of \mathbf{Y}_i is:
\hat{\mathbf{Y}}_i = \mathbf{X}_i \hat{\beta}.
The subject-specific prediction is:
\hat{\mathbf{Y}}_i = \mathbf{X}_i \hat{\beta} + \mathbf{Z}_i \hat{\mathbf{b}}_i.
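To make the distinction concrete, here is a minimal numerical sketch (a hypothetical subject with four visits and made-up values for \hat{\beta} and \hat{\mathbf{b}}_i) contrasting the two predictions:

```python
import numpy as np

# Hypothetical subject i with 4 visits: intercept + time as fixed effects,
# and a subject-specific random intercept and slope.
time = np.array([0.0, 1.0, 2.0, 3.0])
X_i = np.column_stack([np.ones(4), time])   # fixed-effects design (4 x 2)
Z_i = X_i.copy()                            # random-effects design (4 x 2)

beta_hat = np.array([10.0, 2.0])            # hypothetical fixed-effect estimates
b_i_hat = np.array([-1.5, 0.3])             # hypothetical BLUP for subject i

y_pop = X_i @ beta_hat                      # population-level estimate  X_i beta_hat
y_subj = X_i @ beta_hat + Z_i @ b_i_hat     # subject-specific prediction

print(y_pop)    # [10. 12. 14. 16.]
print(y_subj)   # [ 8.5 10.8 13.1 15.4]
```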
For all N subjects, we stack the subject-level equations into a single model, the basis for the Mixed Model Equations (Henderson 1975):
\mathbf{Y} = \mathbf{X}\beta + \mathbf{Z}\mathbf{b} + \epsilon.
and
\mathbf{Y} \sim N(\mathbf{X}\beta, \mathbf{Z}\mathbf{B}\mathbf{Z}' + \mathbf{\Sigma}),
where:
\mathbf{Y} = \begin{bmatrix} \mathbf{y}_1 \\ \vdots \\ \mathbf{y}_N \end{bmatrix}, \quad \mathbf{X} = \begin{bmatrix} \mathbf{X}_1 \\ \vdots \\ \mathbf{X}_N \end{bmatrix}, \quad \mathbf{b} = \begin{bmatrix} \mathbf{b}_1 \\ \vdots \\ \mathbf{b}_N \end{bmatrix}, \quad \epsilon = \begin{bmatrix} \epsilon_1 \\ \vdots \\ \epsilon_N \end{bmatrix}.
The covariance structure is:
\text{Cov}(\mathbf{b}) = \mathbf{B}, \quad \text{Cov}(\epsilon) = \mathbf{\Sigma}, \quad \text{Cov}(\mathbf{b}, \epsilon) = \mathbf{0}.
Expanding \mathbf{Z} and \mathbf{B} as block-diagonal matrices:
\mathbf{Z} = \begin{bmatrix} \mathbf{Z}_1 & \mathbf{0} & \dots & \mathbf{0} \\ \mathbf{0} & \mathbf{Z}_2 & \dots & \mathbf{0} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \dots & \mathbf{Z}_N \end{bmatrix}, \quad \mathbf{B} = \begin{bmatrix} \mathbf{D} & \mathbf{0} & \dots & \mathbf{0} \\ \mathbf{0} & \mathbf{D} & \dots & \mathbf{0} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \dots & \mathbf{D} \end{bmatrix}.
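As a small sketch of how the stacked matrices can be assembled (assuming, purely for illustration, N = 3 subjects with a random intercept and slope each), scipy.linalg.block_diag builds \mathbf{Z} and a Kronecker product builds \mathbf{B}:

```python
import numpy as np
from scipy.linalg import block_diag

N, q = 3, 2                                  # 3 subjects, q = 2 random effects each
n_i = [4, 5, 3]                              # unbalanced cluster sizes

# Per-subject random-effects design matrices Z_i (intercept + time)
Z_list = [np.column_stack([np.ones(n), np.arange(n)]) for n in n_i]

# Common covariance D of the random effects (q x q), illustrative values
D = np.array([[1.0, 0.2],
              [0.2, 0.5]])

Z = block_diag(*Z_list)                      # (sum n_i) x (N*q) block-diagonal
B = np.kron(np.eye(N), D)                    # (N*q) x (N*q) with D repeated on the diagonal

print(Z.shape, B.shape)                      # (12, 6) (6, 6)
```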
The best linear unbiased estimator (BLUE) of \beta and the best linear unbiased predictor (BLUP) of \mathbf{b} are obtained by solving (Henderson 1975):
\begin{bmatrix} \hat{\beta} \\ \hat{\mathbf{b}} \end{bmatrix} = \begin{bmatrix} \mathbf{X}'\mathbf{\Sigma}^{-1}\mathbf{X} & \mathbf{X}'\mathbf{\Sigma}^{-1}\mathbf{Z} \\ \mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{X} & \mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{Z} + \mathbf{B}^{-1} \end{bmatrix}^{-1} \begin{bmatrix} \mathbf{X}'\mathbf{\Sigma}^{-1}\mathbf{Y} \\ \mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{Y} \end{bmatrix},
where:
- \hat{\beta} is the Generalized Least Squares estimator of \beta.
- \hat{\mathbf{b}} is the BLUP of \mathbf{b}.
If we define:
\mathbf{V} = \mathbf{Z}\mathbf{B}\mathbf{Z}' + \mathbf{\Sigma}.
Then, the solutions to the Mixed Model Equations are:
\hat{\beta} = (\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1}\mathbf{Y}, \quad \hat{\mathbf{b}} = \mathbf{B}\mathbf{Z}'\mathbf{V}^{-1}(\mathbf{Y} - \mathbf{X}\hat{\beta}),
where:
- \hat{\beta} is obtained by Generalized Least Squares.
- \hat{\mathbf{b}} is a weighted least squares predictor, with weights determined by \mathbf{B} and \mathbf{V}.
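A minimal numpy sketch of these two formulas, assuming a simulated random-intercept model in which the variance components are treated as known (in practice they must be estimated, as discussed in Section 8.2.4):

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 10, 6                                   # 10 subjects, 6 observations each
sigma2, tau2 = 1.0, 2.0                        # residual and random-intercept variances

# Fixed effects: intercept + one covariate; random effect: a subject-specific intercept
X = np.column_stack([np.ones(N * n), rng.normal(size=N * n)])
Z = np.kron(np.eye(N), np.ones((n, 1)))        # maps each subject's intercept to its rows
beta_true = np.array([1.0, 0.5])
b_true = rng.normal(0.0, np.sqrt(tau2), N)
Y = X @ beta_true + Z @ b_true + rng.normal(0.0, np.sqrt(sigma2), N * n)

B = tau2 * np.eye(N)                           # Cov(b)
Sigma = sigma2 * np.eye(N * n)                 # Cov(eps)
V = Z @ B @ Z.T + Sigma

V_inv = np.linalg.inv(V)
beta_hat = np.linalg.solve(X.T @ V_inv @ X, X.T @ V_inv @ Y)   # GLS estimator of beta
b_hat = B @ Z.T @ V_inv @ (Y - X @ beta_hat)                   # BLUP of b
print(beta_hat, b_hat[:3])
```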
Properties of the Estimators
For \hat{\beta}:
E(\hat{\beta}) = \beta, \quad \text{Var}(\hat{\beta}) = (\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}.
For \hat{\mathbf{b}}:
E(\hat{\mathbf{b}}) = \mathbf{0}.
The variance of the prediction error (Mean Squared Prediction Error, MSPE) is:
\text{Var}(\hat{\mathbf{b}} - \mathbf{b}) = \mathbf{B} - \mathbf{B}\mathbf{Z}'\mathbf{V}^{-1}\mathbf{Z}\mathbf{B} + \mathbf{B}\mathbf{Z}'\mathbf{V}^{-1}\mathbf{X}(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1}\mathbf{Z}\mathbf{B}.
🔹 Key Insight:
The Mean Squared Prediction Error is more meaningful than \text{Var}(\hat{\mathbf{b}}) alone: because \mathbf{b} is itself random (and E(\hat{\mathbf{b}}) = \mathbf{0}), what matters is how far \hat{\mathbf{b}} is from \mathbf{b}, not merely how variable \hat{\mathbf{b}} is.
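A self-contained sketch that evaluates \text{Var}(\hat{\beta}) and the MSPE matrix for a hypothetical random-intercept design with known variance components, mirroring the formulas above:

```python
import numpy as np

rng = np.random.default_rng(2)
N, n = 10, 6
sigma2, tau2 = 1.0, 2.0
X = np.column_stack([np.ones(N * n), rng.normal(size=N * n)])
Z = np.kron(np.eye(N), np.ones((n, 1)))
B = tau2 * np.eye(N)
V = Z @ B @ Z.T + sigma2 * np.eye(N * n)
V_inv = np.linalg.inv(V)

# Var(beta_hat) = (X' V^{-1} X)^{-1}
var_beta_hat = np.linalg.inv(X.T @ V_inv @ X)

# MSPE of the BLUP: B - B Z' V^{-1} Z B + B Z' V^{-1} X (X' V^{-1} X)^{-1} X' V^{-1} Z B
mspe = (B - B @ Z.T @ V_inv @ Z @ B
        + B @ Z.T @ V_inv @ X @ var_beta_hat @ X.T @ V_inv @ Z @ B)

print(np.diag(var_beta_hat))   # sampling variances of the fixed-effect estimates
print(np.diag(mspe)[:3])       # prediction-error variances for the first few subjects
```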
8.2.1 Interpretation of the Mixed Model Equations
The system:
\begin{bmatrix} \mathbf{X}'\mathbf{\Sigma}^{-1}\mathbf{X} & \mathbf{X}'\mathbf{\Sigma}^{-1}\mathbf{Z} \\ \mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{X} & \mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{Z} + \mathbf{B}^{-1} \end{bmatrix} \begin{bmatrix} \beta \\ \mathbf{b} \end{bmatrix} = \begin{bmatrix} \mathbf{X}'\mathbf{\Sigma}^{-1}\mathbf{Y} \\ \mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{Y} \end{bmatrix}
can be understood as:
- Fixed Effects Estimation (\hat{\beta})
- Uses Generalized Least Squares.
- Adjusted for both random effects and correlated errors.
- Random Effects Prediction (\hat{\mathbf{b}})
- Computed using the BLUP formula.
- Shrinks subject-specific estimates toward the population mean.
Component | Equation | Interpretation |
---|---|---|
Fixed effects (\hat{\beta}) | (\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1}\mathbf{Y} | Generalized Least Squares estimator |
Random effects (\hat{\mathbf{b}}) | \mathbf{B}\mathbf{Z}'\mathbf{V}^{-1}(\mathbf{Y} - \mathbf{X}\hat{\beta}) | Best Linear Unbiased Predictor (BLUP) |
Variance of \hat{\beta} | (\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1} | Uncertainty in the fixed-effect estimates |
Variance of prediction error | \mathbf{B} - \mathbf{B}\mathbf{Z}'\mathbf{V}^{-1}\mathbf{Z}\mathbf{B} + \mathbf{B}\mathbf{Z}'\mathbf{V}^{-1}\mathbf{X}(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1}\mathbf{Z}\mathbf{B} | Uncertainty in predicting the random \mathbf{b} |
8.2.2 Derivation of the Mixed Model Equations
To derive the Mixed Model Equations, consider:
\epsilon = \mathbf{Y} - \mathbf{X}\beta - \mathbf{Z}\mathbf{b}.
Define:
- T = \sum_{i=1}^{N} n_i: the total number of observations.
- Nq: the total number of random effects (N subjects, each with q random effects).
The joint distribution of (b,ϵ) is:
f(\mathbf{b}, \epsilon) = \frac{1}{(2\pi)^{(T + Nq)/2}} \left| \begin{matrix} \mathbf{B} & \mathbf{0} \\ \mathbf{0} & \mathbf{\Sigma} \end{matrix} \right|^{-1/2} \exp\left( -\frac{1}{2} \begin{bmatrix} \mathbf{b} \\ \mathbf{Y} - \mathbf{X}\beta - \mathbf{Z}\mathbf{b} \end{bmatrix}' \begin{bmatrix} \mathbf{B} & \mathbf{0} \\ \mathbf{0} & \mathbf{\Sigma} \end{bmatrix}^{-1} \begin{bmatrix} \mathbf{b} \\ \mathbf{Y} - \mathbf{X}\beta - \mathbf{Z}\mathbf{b} \end{bmatrix} \right).
Maximizing f(\mathbf{b}, \epsilon) with respect to \mathbf{b} and \beta requires minimizing:
Q = \begin{bmatrix} \mathbf{b} \\ \mathbf{Y} - \mathbf{X}\beta - \mathbf{Z}\mathbf{b} \end{bmatrix}' \begin{bmatrix} \mathbf{B} & \mathbf{0} \\ \mathbf{0} & \mathbf{\Sigma} \end{bmatrix}^{-1} \begin{bmatrix} \mathbf{b} \\ \mathbf{Y} - \mathbf{X}\beta - \mathbf{Z}\mathbf{b} \end{bmatrix} = \mathbf{b}'\mathbf{B}^{-1}\mathbf{b} + (\mathbf{Y} - \mathbf{X}\beta - \mathbf{Z}\mathbf{b})'\mathbf{\Sigma}^{-1}(\mathbf{Y} - \mathbf{X}\beta - \mathbf{Z}\mathbf{b}).
Setting the derivatives of Q with respect to b and β to zero leads to the system of equations:
\begin{aligned} \mathbf{X}'\mathbf{\Sigma}^{-1}\mathbf{X}\beta + \mathbf{X}'\mathbf{\Sigma}^{-1}\mathbf{Z}\mathbf{b} &= \mathbf{X}'\mathbf{\Sigma}^{-1}\mathbf{Y} \\ \mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{X}\beta + (\mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{Z} + \mathbf{B}^{-1})\mathbf{b} &= \mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{Y} \end{aligned}
Rearranging:
\begin{bmatrix} \mathbf{X}'\mathbf{\Sigma}^{-1}\mathbf{X} & \mathbf{X}'\mathbf{\Sigma}^{-1}\mathbf{Z} \\ \mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{X} & \mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{Z} + \mathbf{B}^{-1} \end{bmatrix} \begin{bmatrix} \beta \\ \mathbf{b} \end{bmatrix} = \begin{bmatrix} \mathbf{X}'\mathbf{\Sigma}^{-1}\mathbf{Y} \\ \mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{Y} \end{bmatrix}
Thus, solving the Mixed Model Equations gives:
\begin{bmatrix} \hat{\beta} \\ \hat{\mathbf{b}} \end{bmatrix} = \begin{bmatrix} \mathbf{X}'\mathbf{\Sigma}^{-1}\mathbf{X} & \mathbf{X}'\mathbf{\Sigma}^{-1}\mathbf{Z} \\ \mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{X} & \mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{Z} + \mathbf{B}^{-1} \end{bmatrix}^{-1} \begin{bmatrix} \mathbf{X}'\mathbf{\Sigma}^{-1}\mathbf{Y} \\ \mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{Y} \end{bmatrix}
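As a numerical sanity check, the sketch below (hypothetical random-intercept data) forms Henderson's coefficient matrix directly, solves the system, and verifies that the result matches the \mathbf{V}-based formulas for \hat{\beta} and \hat{\mathbf{b}}:

```python
import numpy as np

rng = np.random.default_rng(3)
N, n = 5, 4
sigma2, tau2 = 1.0, 2.0
X = np.column_stack([np.ones(N * n), rng.normal(size=N * n)])
Z = np.kron(np.eye(N), np.ones((n, 1)))
B = tau2 * np.eye(N)
Sigma = sigma2 * np.eye(N * n)
Y = X @ np.array([1.0, 0.5]) + Z @ rng.normal(0.0, np.sqrt(tau2), N) \
    + rng.normal(0.0, np.sqrt(sigma2), N * n)

Si = np.linalg.inv(Sigma)
# Henderson's Mixed Model Equations: coefficient matrix and right-hand side
C = np.block([[X.T @ Si @ X, X.T @ Si @ Z],
              [Z.T @ Si @ X, Z.T @ Si @ Z + np.linalg.inv(B)]])
rhs = np.concatenate([X.T @ Si @ Y, Z.T @ Si @ Y])
sol = np.linalg.solve(C, rhs)
beta_hat_mme, b_hat_mme = sol[:2], sol[2:]

# Same answers from the V-based formulas
V_inv = np.linalg.inv(Z @ B @ Z.T + Sigma)
beta_hat = np.linalg.solve(X.T @ V_inv @ X, X.T @ V_inv @ Y)
b_hat = B @ Z.T @ V_inv @ (Y - X @ beta_hat)
print(np.allclose(beta_hat_mme, beta_hat), np.allclose(b_hat_mme, b_hat))  # True True
```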
8.2.3 Bayesian Interpretation of Linear Mixed Models
In a Bayesian framework, the posterior distribution of the random effects b given the observed data Y is derived using Bayes’ theorem:
f(\mathbf{b} \mid \mathbf{Y}) = \frac{f(\mathbf{Y} \mid \mathbf{b}) \, f(\mathbf{b})}{\int f(\mathbf{Y} \mid \mathbf{b}) \, f(\mathbf{b}) \, d\mathbf{b}},
where:
- f(Y|b) is the likelihood function, describing how the data are generated given the random effects.
- f(b) is the prior distribution of the random effects.
- The denominator ∫f(Y|b)f(b)db is the normalizing constant that ensures the posterior integrates to 1.
- f(b|Y) is the posterior distribution, which updates our belief about b given the observed data Y.
In the Linear Mixed Model, we assume:
\mathbf{Y} \mid \mathbf{b} \sim N(\mathbf{X}\beta + \mathbf{Z}\mathbf{b}, \mathbf{\Sigma}), \quad \mathbf{b} \sim N(\mathbf{0}, \mathbf{B}).
This means:
- Likelihood: Given b, the data Y follows a multivariate normal distribution with mean Xβ+Zb and covariance Σ.
- Prior for b: The random effects are assumed to follow a multivariate normal distribution with mean 0 and covariance B.
By applying Bayes’ theorem, the posterior distribution of b given Y is:
\mathbf{b} \mid \mathbf{Y} \sim N\left( \mathbf{B}\mathbf{Z}'\mathbf{V}^{-1}(\mathbf{Y} - \mathbf{X}\beta), \; (\mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{Z} + \mathbf{B}^{-1})^{-1} \right),
where:
- Mean: \mathbf{B}\mathbf{Z}'\mathbf{V}^{-1}(\mathbf{Y} - \mathbf{X}\beta)
- This is the BLUP.
- It is the optimal predictor of \mathbf{b} given \mathbf{Y} under squared-error loss (the posterior mean).
- Covariance: (\mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{Z} + \mathbf{B}^{-1})^{-1}
- This posterior variance accounts for both prior uncertainty (B) and data uncertainty (Σ).
Thus, the Bayesian posterior mean of \mathbf{b} coincides with the BLUP:
E(\mathbf{b} \mid \mathbf{Y}) = \mathbf{B}\mathbf{Z}'\mathbf{V}^{-1}(\mathbf{Y} - \mathbf{X}\beta).
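The equality of the conditional-posterior form (\mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{Z} + \mathbf{B}^{-1})^{-1}\mathbf{Z}'\mathbf{\Sigma}^{-1}(\mathbf{Y} - \mathbf{X}\beta) and the BLUP form \mathbf{B}\mathbf{Z}'\mathbf{V}^{-1}(\mathbf{Y} - \mathbf{X}\beta) is a standard matrix identity; here is a quick numerical check on simulated random-intercept data:

```python
import numpy as np

rng = np.random.default_rng(4)
N, n = 5, 4
sigma2, tau2 = 1.0, 2.0
X = np.column_stack([np.ones(N * n), rng.normal(size=N * n)])
Z = np.kron(np.eye(N), np.ones((n, 1)))
B = tau2 * np.eye(N)
Sigma = sigma2 * np.eye(N * n)
beta = np.array([1.0, 0.5])
Y = X @ beta + Z @ rng.normal(0.0, np.sqrt(tau2), N) + rng.normal(0.0, 1.0, N * n)

Si = np.linalg.inv(Sigma)
V_inv = np.linalg.inv(Z @ B @ Z.T + Sigma)
resid = Y - X @ beta

post_cov = np.linalg.inv(Z.T @ Si @ Z + np.linalg.inv(B))   # posterior covariance
post_mean = post_cov @ Z.T @ Si @ resid                      # conditional-posterior mean
blup = B @ Z.T @ V_inv @ resid                               # BLUP form

print(np.allclose(post_mean, blup))                          # True (same quantity)
```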
Interpretation of the Posterior Distribution
- Posterior Mean as a Shrinkage Estimator (BLUP)
- The expectation E(b|Y) shrinks individual estimates toward the population mean.
- Subjects with less data or more variability will have estimates closer to zero.
- This is similar to Ridge Regression in penalized estimation.
- Posterior Variance Quantifies Uncertainty
- The matrix (\mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{Z} + \mathbf{B}^{-1})^{-1} captures the remaining uncertainty in \mathbf{b} after seeing \mathbf{Y}.
- If \mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{Z} is large, the data provide strong information about \mathbf{b}, reducing the posterior variance.
- If \mathbf{B}^{-1} dominates, the prior heavily influences the estimates.
- Connection to Bayesian Inference
- The random effects b follow a Gaussian posterior due to conjugacy.
- This is analogous to Bayesian hierarchical models, where random effects are latent variables estimated from data.
Step | Equation | Interpretation |
---|---|---|
Likelihood | \mathbf{Y} \mid \mathbf{b} \sim N(\mathbf{X}\beta + \mathbf{Z}\mathbf{b}, \mathbf{\Sigma}) | Data given random effects |
Prior | \mathbf{b} \sim N(\mathbf{0}, \mathbf{B}) | Random effects distribution |
Posterior | \mathbf{b} \mid \mathbf{Y} \sim N\left(\mathbf{B}\mathbf{Z}'\mathbf{V}^{-1}(\mathbf{Y} - \mathbf{X}\beta), (\mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{Z} + \mathbf{B}^{-1})^{-1}\right) | Updated belief about \mathbf{b} |
Posterior Mean (BLUP) | E(\mathbf{b} \mid \mathbf{Y}) = \mathbf{B}\mathbf{Z}'\mathbf{V}^{-1}(\mathbf{Y} - \mathbf{X}\beta) | Best predictor under squared-error loss |
Posterior Variance | (\mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{Z} + \mathbf{B}^{-1})^{-1} | Uncertainty in predictions |
8.2.4 Estimating the Variance-Covariance Matrix
If we have an estimate \tilde{\mathbf{V}} of \mathbf{V}, we can estimate the fixed and random effects as:
\hat{\beta} = (\mathbf{X}'\tilde{\mathbf{V}}^{-1}\mathbf{X})^{-1}\mathbf{X}'\tilde{\mathbf{V}}^{-1}\mathbf{Y}, \quad \hat{\mathbf{b}} = \tilde{\mathbf{B}}\mathbf{Z}'\tilde{\mathbf{V}}^{-1}(\mathbf{Y} - \mathbf{X}\hat{\beta}),
where:
- \hat{\beta} is the estimate of the fixed effects.
- \hat{\mathbf{b}} is the Empirical Best Linear Unbiased Predictor (EBLUP), also called the Empirical Bayes estimate of \mathbf{b}.
- \tilde{\mathbf{B}} is the plug-in estimate of \mathbf{B} used in forming \tilde{\mathbf{V}} = \mathbf{Z}\tilde{\mathbf{B}}\mathbf{Z}' + \tilde{\mathbf{\Sigma}}.
Properties of \hat{\beta} and Variance Estimation
- Consistency: \widehat{\text{Var}}(\hat{\beta}) is a consistent estimator of \text{Var}(\hat{\beta}) if \tilde{\mathbf{V}} is a consistent estimator of \mathbf{V}.
- Bias Issue: \widehat{\text{Var}}(\hat{\beta}) is biased because it does not account for the uncertainty in estimating \mathbf{V}.
- Implication: \widehat{\text{Var}}(\hat{\beta}) therefore understates the true variability of \hat{\beta}.
To estimate V, several approaches can be used:
- Maximum Likelihood Estimation (MLE)
- Restricted Maximum Likelihood (REML)
- Estimated Generalized Least Squares (EGLS)
- Bayesian Hierarchical Models (BHM)
8.2.4.1 Maximum Likelihood Estimation
MLE finds parameter estimates by maximizing the likelihood function.
Define a parameter vector \theta that collects all unknown variance components in \mathbf{\Sigma} and \mathbf{B}. Then, we assume:
\mathbf{Y} \sim N(\mathbf{X}\beta, \mathbf{V}(\theta)).
The log-likelihood function (ignoring constant terms) is:
-2 \log L(\mathbf{y}; \theta, \beta) = \log |\mathbf{V}(\theta)| + (\mathbf{Y} - \mathbf{X}\beta)'\mathbf{V}(\theta)^{-1}(\mathbf{Y} - \mathbf{X}\beta).
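A sketch of this criterion as a function, assuming a hypothetical random-intercept parameterization \theta = (\sigma^2, \tau^2) so that \mathbf{V}(\theta) = \tau^2 \mathbf{Z}\mathbf{Z}' + \sigma^2 \mathbf{I}:

```python
import numpy as np

def neg2_loglik(theta, Y, X, Z, beta):
    """-2 log-likelihood (up to an additive constant), with V(theta) = tau2 * Z Z' + sigma2 * I."""
    sigma2, tau2 = theta
    V = tau2 * Z @ Z.T + sigma2 * np.eye(len(Y))
    resid = Y - X @ beta
    _, logdet = np.linalg.slogdet(V)           # numerically stable log|V|
    return logdet + resid @ np.linalg.solve(V, resid)
```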
Steps for MLE Estimation
Estimate \hat{\beta}, assuming \theta is known:
\hat{\beta}_{MLE} = (\mathbf{X}'\mathbf{V}(\theta)^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}(\theta)^{-1}\mathbf{Y}.
Obtain \hat{\theta}_{MLE} by maximizing the resulting (profile) log-likelihood:
\hat{\theta}_{MLE} = \arg\max_{\theta} \log L(\mathbf{y}; \theta, \hat{\beta}(\theta)).
Substitute \hat{\theta}_{MLE} to get updated estimates:
\hat{\beta}_{MLE} = (\mathbf{X'V(\hat{\theta}_{MLE})^{-1}X})^{-1} \mathbf{X'V(\hat{\theta}_{MLE})^{-1}Y}.
Predict random effects:
\hat{\mathbf{b}}_{MLE} = \mathbf{B}(\hat{\theta}_{MLE}) \mathbf{Z'V}(\hat{\theta}_{MLE})^{-1} (\mathbf{Y - X \hat{\beta}_{MLE}}).
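A minimal end-to-end sketch of these steps for a simulated random-intercept model, profiling \beta out and optimizing \theta = (\sigma^2, \tau^2) numerically with scipy (log-parameterized to keep the variances positive); a real analysis would use dedicated mixed-model software, this is only meant to mirror the algebra above:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
N, n = 30, 5
X = np.column_stack([np.ones(N * n), rng.normal(size=N * n)])
Z = np.kron(np.eye(N), np.ones((n, 1)))
Y = X @ np.array([1.0, 0.5]) + Z @ rng.normal(0.0, np.sqrt(2.0), N) \
    + rng.normal(0.0, 1.0, N * n)

def profile_neg2_loglik(log_theta):
    sigma2, tau2 = np.exp(log_theta)
    V = tau2 * Z @ Z.T + sigma2 * np.eye(len(Y))
    V_inv = np.linalg.inv(V)
    beta = np.linalg.solve(X.T @ V_inv @ X, X.T @ V_inv @ Y)   # step 1: beta given theta
    resid = Y - X @ beta
    return np.linalg.slogdet(V)[1] + resid @ V_inv @ resid

res = minimize(profile_neg2_loglik, x0=np.log([1.0, 1.0]), method="Nelder-Mead")
sigma2_hat, tau2_hat = np.exp(res.x)                            # step 2: theta_MLE

V_hat = tau2_hat * Z @ Z.T + sigma2_hat * np.eye(len(Y))
V_hat_inv = np.linalg.inv(V_hat)
beta_hat = np.linalg.solve(X.T @ V_hat_inv @ X, X.T @ V_hat_inv @ Y)   # step 3
B_hat = tau2_hat * np.eye(N)
b_hat = B_hat @ Z.T @ V_hat_inv @ (Y - X @ beta_hat)                    # step 4: EBLUP
print(sigma2_hat, tau2_hat, beta_hat)
```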
Key Observations about MLE
- MLE tends to underestimate the variance components in \theta because it does not account for the degrees of freedom used to estimate the fixed effects.
- Bias in variance estimates can be corrected using REML.
8.2.4.2 Restricted Maximum Likelihood
Restricted Maximum Likelihood (REML) is an estimation method that improves upon Maximum Likelihood Estimation by accounting for the loss of degrees of freedom due to the estimation of fixed effects.
Unlike MLE, which estimates both fixed effects (\beta) and variance components (\theta) simultaneously, REML focuses on estimating variance components by considering linear combinations of the data that are independent of the fixed effects.
Consider the Linear Mixed Model:
\mathbf{y} = \mathbf{X} \beta + \mathbf{Z} \mathbf{b} + \epsilon,
where:
- \mathbf{y}: Response vector of length N
- \mathbf{X}: Design matrix for fixed effects (N \times p)
- \beta: Fixed effects parameter vector (p \times 1)
- \mathbf{Z}: Design matrix for random effects
- \mathbf{b} \sim N(\mathbf{0, D}): Random effects
- \epsilon \sim N(\mathbf{0, \Sigma}): Residual errors
The marginal distribution of \mathbf{y} is:
\mathbf{y} \sim N(\mathbf{X} \beta, \mathbf{V}(\theta)),
where:
\mathbf{V}(\theta) = \mathbf{Z D Z'} + \mathbf{\Sigma}.
To eliminate dependence on \beta, consider linear transformations of \mathbf{y} that are orthogonal to the fixed effects.
Let \mathbf{K} be a full-rank contrast matrix of size N \times (N - p) such that:
\mathbf{K}' \mathbf{X} = 0.
Then, we consider the transformed data:
\mathbf{K}' \mathbf{y} \sim N(\mathbf{0}, \mathbf{K}' \mathbf{V}(\theta) \mathbf{K}).
- This transformation removes \beta from the likelihood, focusing solely on the variance components \theta.
- Importantly, the choice of \mathbf{K} does not affect the final REML estimates.
The REML log-likelihood is:
-2 \log L_{REML}(\theta) = \log |\mathbf{K}' \mathbf{V}(\theta) \mathbf{K}| + \mathbf{y}' \mathbf{K} (\mathbf{K}' \mathbf{V}(\theta) \mathbf{K})^{-1} \mathbf{K}' \mathbf{y}.
An equivalent form of the REML log-likelihood, avoiding explicit use of \mathbf{K}, is:
-2 \log L_{REML}(\theta) = \log |\mathbf{V}(\theta)| + \log |\mathbf{X}' \mathbf{V}(\theta)^{-1} \mathbf{X}| + (\mathbf{y} - \mathbf{X} \hat{\beta})' \mathbf{V}(\theta)^{-1} (\mathbf{y} - \mathbf{X} \hat{\beta}),
where:
\hat{\beta} = (\mathbf{X}' \mathbf{V}(\theta)^{-1} \mathbf{X})^{-1} \mathbf{X}' \mathbf{V}(\theta)^{-1} \mathbf{y}.
This form highlights how REML adjusts for the estimation of fixed effects via the second term \log |\mathbf{X}' \mathbf{V}^{-1} \mathbf{X}|.
Steps for REML Estimation
Transform the data using \mathbf{K}' \mathbf{y} to remove \beta from the likelihood.
Maximize the restricted likelihood to estimate \hat{\theta}_{REML}.
Estimate fixed effects using:
\hat{\beta}_{REML} = (\mathbf{X}' \mathbf{V}(\hat{\theta}_{REML})^{-1} \mathbf{X})^{-1} \mathbf{X}' \mathbf{V}(\hat{\theta}_{REML})^{-1} \mathbf{y}.
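A sketch of these steps using the equivalent REML criterion that avoids an explicit \mathbf{K}, again for a hypothetical random-intercept model with \theta = (\sigma^2, \tau^2); the only change from the MLE sketch above is the extra \log |\mathbf{X}'\mathbf{V}^{-1}\mathbf{X}| term:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
N, n = 30, 5
X = np.column_stack([np.ones(N * n), rng.normal(size=N * n)])
Z = np.kron(np.eye(N), np.ones((n, 1)))
y = X @ np.array([1.0, 0.5]) + Z @ rng.normal(0.0, np.sqrt(2.0), N) \
    + rng.normal(0.0, 1.0, N * n)

def neg2_reml(log_theta):
    sigma2, tau2 = np.exp(log_theta)
    V = tau2 * Z @ Z.T + sigma2 * np.eye(len(y))
    V_inv = np.linalg.inv(V)
    XtVX = X.T @ V_inv @ X
    beta = np.linalg.solve(XtVX, X.T @ V_inv @ y)
    resid = y - X @ beta
    # log|V| + log|X' V^{-1} X| + (y - X beta)' V^{-1} (y - X beta)
    return (np.linalg.slogdet(V)[1] + np.linalg.slogdet(XtVX)[1]
            + resid @ V_inv @ resid)

res = minimize(neg2_reml, x0=np.log([1.0, 1.0]), method="Nelder-Mead")
sigma2_reml, tau2_reml = np.exp(res.x)

V_hat = tau2_reml * Z @ Z.T + sigma2_reml * np.eye(len(y))
V_inv = np.linalg.inv(V_hat)
beta_reml = np.linalg.solve(X.T @ V_inv @ X, X.T @ V_inv @ y)   # GLS with REML variances
print(sigma2_reml, tau2_reml, beta_reml)
```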
Properties of REML
- Reduced Bias in Variance Components: REML accounts for the degrees of freedom used to estimate the fixed effects, yielding variance component estimates that are far less biased than MLE (and unbiased in balanced designs).
- Invariance to Fixed Effects: The restricted likelihood is constructed to be independent of the fixed effects \beta.
- Asymptotic Normality: REML estimates are consistent and asymptotically normal under standard regularity conditions.
- Efficiency: While REML estimates variance components efficiently, it does not maximize the joint likelihood of all parameters, so \beta estimates are slightly less efficient compared to MLE.
Criterion | MLE | REML |
---|---|---|
Approach | Maximizes full likelihood | Maximizes likelihood of contrasts (removes \beta) |
Estimates Fixed Effects? | Yes, jointly with \theta | Not directly (\beta is estimated afterward by GLS) |
Bias in Variance Estimates | Biased downward (ignores loss of degrees of freedom) | Much less biased (corrects for loss of degrees of freedom) |
Effect of Changing \mathbf{X} | Affects variance estimates | No effect on variance estimates |
Consistency | Yes | Yes |
Asymptotic Normality | Yes | Yes |
Efficiency | Efficient under normality | More efficient for variance components |
Model Comparison (AIC/BIC) | Suitable for comparing models with different fixed effects | Not valid for comparing models with different fixed effects |
Performance in Small Samples | Sensitive to small sample bias | More robust to small sample bias |
Handling Outliers | More sensitive | Less sensitive |
Equivalent to ANOVA? | No | Yes, in balanced designs |
8.2.4.3 Estimated Generalized Least Squares
MLE and REML rely on the Gaussian assumption, which may not always hold.
EGLS provides an alternative by relying only on the first two moments (mean and variance).
The LMM framework is:
\mathbf{Y}_i = \mathbf{X}_i \beta + \mathbf{Z}_i \mathbf{b}_i + \epsilon_i.
where:
- Random effects: \mathbf{b}_i \sim N(\mathbf{0, D}).
- Residual errors: \epsilon_i \sim N(\mathbf{0, \Sigma_i}).
- Independence assumption: \text{Cov}(\epsilon_i, \mathbf{b}_i) = 0.
Thus, the first two moments are:
E(\mathbf{Y}_i) = \mathbf{X}_i \beta, \quad \text{Var}(\mathbf{Y}_i) = \mathbf{V}_i.
The EGLS estimator is:
\hat{\beta}_{GLS} = \left\{ \sum_{i=1}^N \mathbf{X}'_i \mathbf{V}_i(\theta)^{-1} \mathbf{X}_i \right\}^{-1} \sum_{i=1}^N \mathbf{X}'_i \mathbf{V}_i(\theta)^{-1} \mathbf{Y}_i.
Writing in matrix form:
\hat{\beta}_{GLS} = \left\{ \mathbf{X'V(\theta)^{-1}X} \right\}^{-1} \mathbf{X'V(\theta)^{-1}Y}.
Since \mathbf{V}(\theta) is unknown, we estimate it as \hat{\mathbf{V}}, leading to the EGLS estimator:
\hat{\beta}_{EGLS} = \left\{ \mathbf{X'\hat{V}^{-1}X} \right\}^{-1} \mathbf{X'\hat{V}^{-1}Y}.
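A sketch of EGLS for a simulated random-intercept model, where \hat{\mathbf{V}} is formed from crude moment-based estimates of the variance components (the particular moment estimators below are purely illustrative, not part of any standard package):

```python
import numpy as np

rng = np.random.default_rng(7)
N, n = 30, 5
X = np.column_stack([np.ones(N * n), rng.normal(size=N * n)])
Z = np.kron(np.eye(N), np.ones((n, 1)))
y = X @ np.array([1.0, 0.5]) + Z @ rng.normal(0.0, np.sqrt(2.0), N) \
    + rng.normal(0.0, 1.0, N * n)

# Crude moment-based variance estimates (illustration only):
# within-subject spread of OLS residuals -> sigma2; between-subject spread of means -> tau2
resid_ols = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
groups = resid_ols.reshape(N, n)
sigma2_hat = groups.var(axis=1, ddof=1).mean()
tau2_hat = max(groups.mean(axis=1).var(ddof=1) - sigma2_hat / n, 0.0)

# Plug the estimated V into the GLS formula -> EGLS estimator
V_hat = tau2_hat * Z @ Z.T + sigma2_hat * np.eye(len(y))
V_inv = np.linalg.inv(V_hat)
beta_egls = np.linalg.solve(X.T @ V_inv @ X, X.T @ V_inv @ y)
print(sigma2_hat, tau2_hat, beta_egls)
```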
Key Insights about EGLS
- Computational Simplicity:
- EGLS does not require iterative maximization of a likelihood function, making it computationally attractive.
- Same Form as MLE/REML:
- The fixed effects estimators for MLE, REML, and EGLS have the same form, differing only in how \mathbf{V} is estimated.
- Robust to Non-Gaussian Data:
- Since it only depends on first and second moments, it can handle cases where MLE and REML struggle with non-normality.
When to Use EGLS?
- When the normality assumption for MLE/REML is questionable.
- When \mathbf{V} can be estimated efficiently without requiring complex optimization.
- In non-iterative approaches, where computational simplicity is a priority.
8.2.4.4 Bayesian Hierarchical Models
Bayesian methods offer a fully probabilistic framework to estimate \mathbf{V} by incorporating prior distributions.
The joint distribution can be decomposed hierarchically:
f(A, B, C) = f(A | B, C) f(B | C) f(C).
Applying this to LMMs:
\begin{aligned} f(\mathbf{Y, \beta, b, \theta}) &= f(\mathbf{Y | \beta, b, \theta}) f(\mathbf{b | \theta, \beta}) f(\mathbf{\beta | \theta}) f(\mathbf{\theta}) \\ &= f(\mathbf{Y | \beta, b, \theta}) f(\mathbf{b | \theta}) f(\mathbf{\beta}) f(\mathbf{\theta}). \end{aligned}
where:
- The first equality follows from probability decomposition.
- The second equality assumes prior independence, meaning:
- Given \theta, no additional information about \mathbf{b} is obtained from knowing \beta, so f(\mathbf{b} \mid \theta, \beta) = f(\mathbf{b} \mid \theta).
- The fixed effects \beta are a priori independent of \theta, so f(\beta \mid \theta) = f(\beta).
Using Bayes’ theorem, the posterior distribution is:
f(\mathbf{\beta, b, \theta | Y}) \propto f(\mathbf{Y | \beta, b, \theta}) f(\mathbf{b | \theta}) f(\mathbf{\beta}) f(\mathbf{\theta}).
where:
\begin{aligned} \mathbf{Y | \beta, b, \theta} &\sim N(\mathbf{X\beta + Zb}, \mathbf{\Sigma(\theta)}), \\ \mathbf{b | \theta} &\sim N(\mathbf{0, B(\theta)}). \end{aligned}
To complete the Bayesian model, we specify prior distributions:
- f(\beta): Prior on fixed effects.
- f(\theta): Prior on variance components.
Since analytical solutions are generally unavailable, we use Markov Chain Monte Carlo (MCMC) to sample from the posterior:
- Gibbs sampling (if conjugate priors are used).
- Hamiltonian Monte Carlo (HMC) (for complex models).
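For intuition, here is a minimal hand-rolled Gibbs sampler sketch for a random-intercept model with conjugate choices (flat prior on \beta, inverse-gamma priors on \sigma^2 and \tau^2, both hypothetical); in practice one would use established tools such as Stan or PyMC together with convergence diagnostics:

```python
import numpy as np

rng = np.random.default_rng(8)
N, n = 30, 5
T = N * n
X = np.column_stack([np.ones(T), rng.normal(size=T)])
Z = np.kron(np.eye(N), np.ones((n, 1)))
y = X @ np.array([1.0, 0.5]) + Z @ rng.normal(0.0, np.sqrt(2.0), N) \
    + rng.normal(0.0, 1.0, T)

# Weakly informative inverse-gamma(a0, b0) priors on sigma2 and tau2 (an assumption for this sketch)
a0, b0 = 0.01, 0.01
beta = np.zeros(2); b = np.zeros(N); sigma2 = tau2 = 1.0
draws = []

XtX_inv = np.linalg.inv(X.T @ X)
for it in range(4000):
    # beta | rest ~ N((X'X)^{-1} X'(y - Zb), sigma2 (X'X)^{-1})  (flat prior on beta)
    mean_beta = XtX_inv @ X.T @ (y - Z @ b)
    beta = rng.multivariate_normal(mean_beta, sigma2 * XtX_inv)
    # b_i | rest ~ N(v * (sum of subject i residuals) / sigma2, v),  v = (n/sigma2 + 1/tau2)^{-1}
    resid = (y - X @ beta).reshape(N, n)
    v = 1.0 / (n / sigma2 + 1.0 / tau2)
    b = rng.normal(v * resid.sum(axis=1) / sigma2, np.sqrt(v))
    # sigma2 | rest and tau2 | rest are inverse-gamma (sampled as 1 / Gamma draws)
    e = y - X @ beta - Z @ b
    sigma2 = 1.0 / rng.gamma(a0 + T / 2, 1.0 / (b0 + 0.5 * e @ e))
    tau2 = 1.0 / rng.gamma(a0 + N / 2, 1.0 / (b0 + 0.5 * b @ b))
    if it >= 2000:                              # discard burn-in draws
        draws.append([*beta, sigma2, tau2])

print(np.mean(draws, axis=0))   # posterior means of beta, sigma2, tau2
```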
Advantages
- Accounts for Parameter Uncertainty
- Unlike MLE/REML, Bayesian methods propagate uncertainty in variance component estimation.
- Flexible Model Specification
- Can incorporate prior knowledge via informative priors.
- Extends naturally beyond Gaussian assumptions (e.g., Student-t distributions for heavy-tailed errors).
- Robustness in Small Samples
- Bayesian methods can stabilize variance estimation in small datasets where MLE/REML are unreliable.
Challenges
- Computational Complexity
- Requires MCMC algorithms, which can be computationally expensive.
- Convergence Issues
- MCMC chains must be checked for convergence (e.g., using R-hat diagnostic).
- Choice of Priors
- Poorly chosen priors can bias estimates or slow down convergence.
Comparison of Estimation Methods for \mathbf{V}
Method | Assumptions | Computational Cost | Handles Non-Normality? | Best Use Case |
---|---|---|---|---|
MLE | Gaussian errors | High (iterative) | ❌ No | Model selection (AIC/BIC) |
REML | Gaussian errors | High (iterative) | ❌ No | Variance estimation |
EGLS | First two moments | Low (non-iterative) | ✅ Yes | Large-scale models with correlated errors |
Bayesian (BHM) | Probabilistic | Very High (MCMC) | ✅ Yes | Small samples, prior information available |