8.2 Estimation in Linear Mixed Models

The general Linear Mixed Model is:

\mathbf{Y}_i = \mathbf{X}_i \beta + \mathbf{Z}_i \mathbf{b}_i + \epsilon_i,

where:

  • \beta: Fixed effects (parameters shared across subjects).
  • \mathbf{b}_i: Random effects (subject-specific deviations).
  • \mathbf{X}_i: Design matrix for fixed effects.
  • \mathbf{Z}_i: Design matrix for random effects.
  • \mathbf{D}: Covariance matrix of the random effects.
  • \mathbf{\Sigma}_i: Covariance matrix of the residual errors.

Since \beta, \mathbf{b}_i, \mathbf{D}, and \mathbf{\Sigma}_i are unknown, they must be determined from the data.

  • \beta, \mathbf{D}, \mathbf{\Sigma}_i are fixed, unknown parameters → they must be estimated.
  • \mathbf{b}_i is a random variable → it must be predicted rather than estimated, since a random variable has no fixed true value to estimate.

Thus, we define:

  • Estimator of \beta: \hat{\beta} (fixed effect estimation).
  • Predictor of \mathbf{b}_i: \hat{\mathbf{b}}_i (random effect prediction).

Then:

  • The population-level estimate of \mathbf{Y}_i is:

    \hat{\mathbf{Y}}_i = \mathbf{X}_i \hat{\beta}.

  • The subject-specific prediction is:

    \hat{\mathbf{Y}}_i = \mathbf{X}_i \hat{\beta} + \mathbf{Z}_i \hat{\mathbf{b}}_i.

For all N subjects, we stack the subject-level equations into a single matrix form, which underlies the Mixed Model Equations (Henderson 1975):

\mathbf{Y} = \mathbf{X}\beta + \mathbf{Z}\mathbf{b} + \epsilon,

and

\mathbf{Y} \sim N(\mathbf{X}\beta, \ \mathbf{Z B Z'} + \mathbf{\Sigma}),

where:

\mathbf{Y} = \begin{bmatrix} \mathbf{Y}_1 \\ \vdots \\ \mathbf{Y}_N \end{bmatrix}, \quad \mathbf{X} = \begin{bmatrix} \mathbf{X}_1 \\ \vdots \\ \mathbf{X}_N \end{bmatrix}, \quad \mathbf{b} = \begin{bmatrix} \mathbf{b}_1 \\ \vdots \\ \mathbf{b}_N \end{bmatrix}, \quad \epsilon = \begin{bmatrix} \epsilon_1 \\ \vdots \\ \epsilon_N \end{bmatrix}.

The covariance structure is:

\text{Cov}(\mathbf{b}) = \mathbf{B}, \quad \text{Cov}(\epsilon) = \mathbf{\Sigma}, \quad \text{Cov}(\mathbf{b}, \epsilon) = \mathbf{0}.
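
Since \mathbf{b} and \epsilon are uncorrelated, the marginal covariance stated above follows in one line:

\text{Var}(\mathbf{Y}) = \text{Var}(\mathbf{Z}\mathbf{b} + \epsilon) = \mathbf{Z}\,\text{Cov}(\mathbf{b})\,\mathbf{Z}' + \text{Cov}(\epsilon) = \mathbf{Z B Z'} + \mathbf{\Sigma}.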

Expanding Z and B as block diagonal matrices:

\mathbf{Z} = \begin{bmatrix} \mathbf{Z}_1 & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & \mathbf{Z}_2 & \cdots & \mathbf{0} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{Z}_N \end{bmatrix}, \quad \mathbf{B} = \begin{bmatrix} \mathbf{D} & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & \mathbf{D} & \cdots & \mathbf{0} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{D} \end{bmatrix}.

The best linear unbiased estimator (BLUE) for β and the best linear unbiased predictor (BLUP) for b are obtained by solving (Henderson 1975):

\begin{bmatrix} \hat{\beta} \\ \hat{\mathbf{b}} \end{bmatrix} = \begin{bmatrix} \mathbf{X}'\mathbf{\Sigma}^{-1}\mathbf{X} & \mathbf{X}'\mathbf{\Sigma}^{-1}\mathbf{Z} \\ \mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{X} & \mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{Z} + \mathbf{B}^{-1} \end{bmatrix}^{-1} \begin{bmatrix} \mathbf{X}'\mathbf{\Sigma}^{-1}\mathbf{Y} \\ \mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{Y} \end{bmatrix}.


If we define:

\mathbf{V} = \mathbf{Z B Z'} + \mathbf{\Sigma}.

Then, the solutions to the Mixed Model Equations are:

\hat{\beta} = (\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1}\mathbf{Y}, \quad \hat{\mathbf{b}} = \mathbf{B}\mathbf{Z}'\mathbf{V}^{-1}(\mathbf{Y} - \mathbf{X}\hat{\beta}).
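
A minimal numerical sketch of the two equivalent routes to these solutions, assuming small simulated matrices with known variance components (all names and values below are illustrative, not taken from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: 4 subjects, 5 observations each, random intercepts only.
N, n, p, q = 4, 5, 2, 1
X = np.column_stack([np.ones(N * n), rng.normal(size=N * n)])   # fixed-effects design
Z = np.kron(np.eye(N), np.ones((n, 1)))                         # block-diagonal random-intercept design
B = 1.5 * np.eye(N * q)                                         # Cov(b), assumed known here
Sigma = 0.5 * np.eye(N * n)                                     # Cov(eps), assumed known here
beta_true = np.array([2.0, -1.0])
Y = X @ beta_true + Z @ rng.multivariate_normal(np.zeros(N * q), B) \
    + rng.multivariate_normal(np.zeros(N * n), Sigma)

# Route 1: solve Henderson's mixed model equations directly.
Si = np.linalg.inv(Sigma)
lhs = np.block([[X.T @ Si @ X, X.T @ Si @ Z],
                [Z.T @ Si @ X, Z.T @ Si @ Z + np.linalg.inv(B)]])
rhs = np.concatenate([X.T @ Si @ Y, Z.T @ Si @ Y])
sol = np.linalg.solve(lhs, rhs)
beta_mme, b_mme = sol[:p], sol[p:]

# Route 2: closed form via V = Z B Z' + Sigma.
V = Z @ B @ Z.T + Sigma
Vi = np.linalg.inv(V)
beta_gls = np.linalg.solve(X.T @ Vi @ X, X.T @ Vi @ Y)   # BLUE of beta
b_blup = B @ Z.T @ Vi @ (Y - X @ beta_gls)               # BLUP of b

print(np.allclose(beta_mme, beta_gls), np.allclose(b_mme, b_blup))  # both True
```

Both routes return the same \hat{\beta} and \hat{\mathbf{b}}; the Henderson system only requires inverting \mathbf{\Sigma} and \mathbf{B}, which is why it is preferred computationally when \mathbf{V} is large.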

where:


Properties of the Estimators

For \hat{\beta}:

E(\hat{\beta}) = \beta, \quad \text{Var}(\hat{\beta}) = (\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}.

For \hat{\mathbf{b}}:

E(\hat{\mathbf{b}}) = \mathbf{0}.

The variance of the prediction error (Mean Squared Prediction Error, MSPE) is:

\text{Var}(\hat{\mathbf{b}} - \mathbf{b}) = \mathbf{B} - \mathbf{B Z' V^{-1} Z B} + \mathbf{B Z' V^{-1} X} (\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1} \mathbf{X' V^{-1} Z B}.

🔹 Key Insight:
The Mean Squared Prediction Error is more meaningful than \text{Var}(\hat{\mathbf{b}}) alone, since it measures how far the predictor \hat{\mathbf{b}} is from the realized random effects \mathbf{b}, not merely how variable \hat{\mathbf{b}} is around its own mean.


8.2.1 Interpretation of the Mixed Model Equations

The system:

\begin{bmatrix} \mathbf{X}'\mathbf{\Sigma}^{-1}\mathbf{X} & \mathbf{X}'\mathbf{\Sigma}^{-1}\mathbf{Z} \\ \mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{X} & \mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{Z} + \mathbf{B}^{-1} \end{bmatrix} \begin{bmatrix} \beta \\ \mathbf{b} \end{bmatrix} = \begin{bmatrix} \mathbf{X}'\mathbf{\Sigma}^{-1}\mathbf{Y} \\ \mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{Y} \end{bmatrix}.

can be understood as:

  1. Fixed Effects Estimation (\hat{\beta})
    • Computed using the generalized least squares (GLS) formula.
  2. Random Effects Prediction (\hat{\mathbf{b}})
    • Computed using the BLUP formula.
    • Shrinks subject-specific estimates toward the population mean.

| Component | Equation | Interpretation |
|---|---|---|
| Fixed effects (\hat{\beta}) | (\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1}\mathbf{Y} | Generalized Least Squares estimator |
| Random effects (\hat{\mathbf{b}}) | \mathbf{B}\mathbf{Z}'\mathbf{V}^{-1}(\mathbf{Y} - \mathbf{X}\hat{\beta}) | Best Linear Unbiased Predictor (BLUP) |
| Variance of \hat{\beta} | (\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1} | Accounts for uncertainty in fixed effect estimates |
| Variance of prediction error | \mathbf{B} - \mathbf{B Z' V^{-1} Z B} + \mathbf{B Z' V^{-1} X}(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X' V^{-1} Z B} | Includes both variance and bias in prediction |

8.2.2 Derivation of the Mixed Model Equations

To derive the Mixed Model Equations, consider:

\epsilon = \mathbf{Y} - \mathbf{X}\beta - \mathbf{Z}\mathbf{b}.

Define:

  • T = \sum_{i=1}^N n_i: Total number of observations.
  • Nq: Total number of random effects (N subjects with q random effects each).

The joint distribution of (\mathbf{b}, \epsilon) is:

f(\mathbf{b}, \epsilon) = \frac{1}{(2\pi)^{(T+Nq)/2} \left| \begin{matrix} \mathbf{B} & \mathbf{0} \\ \mathbf{0} & \mathbf{\Sigma} \end{matrix} \right|^{1/2}} \exp\left( -\frac{1}{2} \begin{bmatrix} \mathbf{b} \\ \mathbf{Y} - \mathbf{X}\beta - \mathbf{Z}\mathbf{b} \end{bmatrix}' \begin{bmatrix} \mathbf{B} & \mathbf{0} \\ \mathbf{0} & \mathbf{\Sigma} \end{bmatrix}^{-1} \begin{bmatrix} \mathbf{b} \\ \mathbf{Y} - \mathbf{X}\beta - \mathbf{Z}\mathbf{b} \end{bmatrix} \right).

Maximizing f(\mathbf{b}, \epsilon) with respect to \mathbf{b} and \beta requires minimizing:

Q = \begin{bmatrix} \mathbf{b} \\ \mathbf{Y} - \mathbf{X}\beta - \mathbf{Z}\mathbf{b} \end{bmatrix}' \begin{bmatrix} \mathbf{B} & \mathbf{0} \\ \mathbf{0} & \mathbf{\Sigma} \end{bmatrix}^{-1} \begin{bmatrix} \mathbf{b} \\ \mathbf{Y} - \mathbf{X}\beta - \mathbf{Z}\mathbf{b} \end{bmatrix} = \mathbf{b}'\mathbf{B}^{-1}\mathbf{b} + (\mathbf{Y} - \mathbf{X}\beta - \mathbf{Z}\mathbf{b})'\mathbf{\Sigma}^{-1}(\mathbf{Y} - \mathbf{X}\beta - \mathbf{Z}\mathbf{b}).
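
Writing out the gradients explicitly (a standard matrix-calculus step, added here for completeness):

\frac{\partial Q}{\partial \beta} = -2\,\mathbf{X}'\mathbf{\Sigma}^{-1}(\mathbf{Y} - \mathbf{X}\beta - \mathbf{Z}\mathbf{b}), \quad \frac{\partial Q}{\partial \mathbf{b}} = 2\,\mathbf{B}^{-1}\mathbf{b} - 2\,\mathbf{Z}'\mathbf{\Sigma}^{-1}(\mathbf{Y} - \mathbf{X}\beta - \mathbf{Z}\mathbf{b}).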

Setting the derivatives of Q with respect to b and β to zero leads to the system of equations:

\begin{aligned} \mathbf{X}'\mathbf{\Sigma}^{-1}\mathbf{X}\beta + \mathbf{X}'\mathbf{\Sigma}^{-1}\mathbf{Z}\mathbf{b} &= \mathbf{X}'\mathbf{\Sigma}^{-1}\mathbf{Y} \\ \mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{X}\beta + (\mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{Z} + \mathbf{B}^{-1})\mathbf{b} &= \mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{Y} \end{aligned}

Rearranging in matrix form:

\begin{bmatrix} \mathbf{X}'\mathbf{\Sigma}^{-1}\mathbf{X} & \mathbf{X}'\mathbf{\Sigma}^{-1}\mathbf{Z} \\ \mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{X} & \mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{Z} + \mathbf{B}^{-1} \end{bmatrix} \begin{bmatrix} \beta \\ \mathbf{b} \end{bmatrix} = \begin{bmatrix} \mathbf{X}'\mathbf{\Sigma}^{-1}\mathbf{Y} \\ \mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{Y} \end{bmatrix}

Thus, the solution to the Mixed Model Equations gives:

\begin{bmatrix} \hat{\beta} \\ \hat{\mathbf{b}} \end{bmatrix} = \begin{bmatrix} \mathbf{X}'\mathbf{\Sigma}^{-1}\mathbf{X} & \mathbf{X}'\mathbf{\Sigma}^{-1}\mathbf{Z} \\ \mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{X} & \mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{Z} + \mathbf{B}^{-1} \end{bmatrix}^{-1} \begin{bmatrix} \mathbf{X}'\mathbf{\Sigma}^{-1}\mathbf{Y} \\ \mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{Y} \end{bmatrix}


8.2.3 Bayesian Interpretation of Linear Mixed Models

In a Bayesian framework, the posterior distribution of the random effects b given the observed data Y is derived using Bayes’ theorem:

f(\mathbf{b} \mid \mathbf{Y}) = \frac{f(\mathbf{Y} \mid \mathbf{b})\, f(\mathbf{b})}{\int f(\mathbf{Y} \mid \mathbf{b})\, f(\mathbf{b})\, d\mathbf{b}}.

where:

  • f(Y|b) is the likelihood function, describing how the data are generated given the random effects.
  • f(b) is the prior distribution of the random effects.
  • The denominator \int f(\mathbf{Y} \mid \mathbf{b})\, f(\mathbf{b})\, d\mathbf{b} is the normalizing constant that ensures the posterior integrates to 1.
  • f(b|Y) is the posterior distribution, which updates our belief about b given the observed data Y.

In the Linear Mixed Model, we assume:

\mathbf{Y} \mid \mathbf{b} \sim N(\mathbf{X}\beta + \mathbf{Z}\mathbf{b}, \mathbf{\Sigma}), \quad \mathbf{b} \sim N(\mathbf{0}, \mathbf{B}).

This means:

  • Likelihood: Given b, the data Y follows a multivariate normal distribution with mean Xβ+Zb and covariance Σ.
  • Prior for b: The random effects are assumed to follow a multivariate normal distribution with mean 0 and covariance B.

By applying Bayes’ theorem, the posterior distribution of b given Y is:

\mathbf{b} \mid \mathbf{Y} \sim N\left( \mathbf{B}\mathbf{Z}'\mathbf{V}^{-1}(\mathbf{Y} - \mathbf{X}\beta), \ (\mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{Z} + \mathbf{B}^{-1})^{-1} \right).

where:

  • Mean: \mathbf{B}\mathbf{Z}'\mathbf{V}^{-1}(\mathbf{Y} - \mathbf{X}\beta)
    • This is the BLUP.
    • It is the expected value of \mathbf{b} given \mathbf{Y}, which is the optimal predictor under squared-error loss.
  • Covariance: (\mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{Z} + \mathbf{B}^{-1})^{-1}
    • This posterior variance accounts for both prior uncertainty (\mathbf{B}) and data uncertainty (\mathbf{\Sigma}).

Thus, the Bayesian posterior mean of b coincides with the BLUP predictor:

E(\mathbf{b} \mid \mathbf{Y}) = \mathbf{B}\mathbf{Z}'\mathbf{V}^{-1}(\mathbf{Y} - \mathbf{X}\beta).
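
The conjugate-normal calculation actually yields the posterior mean in the form (\mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{Z} + \mathbf{B}^{-1})^{-1}\mathbf{Z}'\mathbf{\Sigma}^{-1}(\mathbf{Y} - \mathbf{X}\beta); it coincides with the BLUP because of the standard matrix identity (a connecting step added here, not in the original text):

(\mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{Z} + \mathbf{B}^{-1})^{-1}\mathbf{Z}'\mathbf{\Sigma}^{-1} = \mathbf{B}\mathbf{Z}'(\mathbf{Z}\mathbf{B}\mathbf{Z}' + \mathbf{\Sigma})^{-1} = \mathbf{B}\mathbf{Z}'\mathbf{V}^{-1}.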


Interpretation of the Posterior Distribution

  1. Posterior Mean as a Shrinkage Estimator (BLUP)
    • The expectation E(b|Y) shrinks individual estimates toward the population mean.
    • Subjects with less data or more variability will have estimates closer to zero.
    • This is similar to Ridge Regression in penalized estimation.
  2. Posterior Variance Quantifies Uncertainty
    • The matrix (\mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{Z} + \mathbf{B}^{-1})^{-1} captures the remaining uncertainty in \mathbf{b} after seeing \mathbf{Y}.
    • If \mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{Z} is large, the data provide strong information about \mathbf{b}, reducing the posterior variance.
    • If \mathbf{B}^{-1} dominates, prior information heavily influences the estimates.
  3. Connection to Bayesian Inference
    • The random effects b follow a Gaussian posterior due to conjugacy.
    • This is analogous to Bayesian hierarchical models, where random effects are latent variables estimated from data.

| Step | Equation | Interpretation |
|---|---|---|
| Likelihood | \mathbf{Y} \mid \mathbf{b} \sim N(\mathbf{X}\beta + \mathbf{Z}\mathbf{b}, \mathbf{\Sigma}) | Data given random effects |
| Prior | \mathbf{b} \sim N(\mathbf{0}, \mathbf{B}) | Random effects distribution |
| Posterior | \mathbf{b} \mid \mathbf{Y} \sim N(\mathbf{B}\mathbf{Z}'\mathbf{V}^{-1}(\mathbf{Y} - \mathbf{X}\beta), (\mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{Z} + \mathbf{B}^{-1})^{-1}) | Updated belief about \mathbf{b} |
| Posterior mean (BLUP) | E(\mathbf{b} \mid \mathbf{Y}) = \mathbf{B}\mathbf{Z}'\mathbf{V}^{-1}(\mathbf{Y} - \mathbf{X}\beta) | Best predictor (squared-error loss) |
| Posterior variance | (\mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{Z} + \mathbf{B}^{-1})^{-1} | Uncertainty in predictions |

8.2.4 Estimating the Variance-Covariance Matrix

If we have an estimate \tilde{\mathbf{V}} of \mathbf{V}, we can estimate the fixed effects and predict the random effects as:

\hat{\beta} = (\mathbf{X}'\tilde{\mathbf{V}}^{-1}\mathbf{X})^{-1}\mathbf{X}'\tilde{\mathbf{V}}^{-1}\mathbf{Y}, \quad \hat{\mathbf{b}} = \mathbf{B}\mathbf{Z}'\tilde{\mathbf{V}}^{-1}(\mathbf{Y} - \mathbf{X}\hat{\beta}).

where:

  • \hat{\beta} is the estimated fixed-effects vector.
  • \hat{\mathbf{b}} is the Empirical Best Linear Unbiased Predictor (EBLUP), also called the Empirical Bayes estimate of \mathbf{b}.

Properties of \hat{\beta} and Variance Estimation

  • Consistency: \widehat{\text{Var}}(\hat{\beta}) is a consistent estimator of \text{Var}(\hat{\beta}) if \tilde{\mathbf{V}} is a consistent estimator of \mathbf{V}.
  • Bias Issue: \widehat{\text{Var}}(\hat{\beta}) is biased because it does not account for the uncertainty in estimating \mathbf{V}.
  • Implication: As a result, \widehat{\text{Var}}(\hat{\beta}) tends to underestimate the true variability of \hat{\beta}.

To estimate V, several approaches can be used:


8.2.4.1 Maximum Likelihood Estimation

MLE finds parameter estimates by maximizing the likelihood function.

Define a parameter vector θ that includes all unknown variance components in Σ and B. Then, we assume:

\mathbf{Y} \sim N(\mathbf{X}\beta, \mathbf{V}(\theta)).

Ignoring constant terms, minus twice the log-likelihood is:

-2 \log L(\mathbf{y}; \theta, \beta) = \log |\mathbf{V}(\theta)| + (\mathbf{Y} - \mathbf{X}\beta)'\mathbf{V}(\theta)^{-1}(\mathbf{Y} - \mathbf{X}\beta).

Steps for MLE Estimation

  1. Estimate \hat{\beta}, assuming \theta is known:

    \hat{\beta}_{MLE} = (\mathbf{X}'\mathbf{V}(\theta)^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}(\theta)^{-1}\mathbf{Y}.

  2. Obtain \hat{\theta}_{MLE} by maximizing the (profile) log-likelihood:

    \hat{\theta}_{MLE} = \arg\max_{\theta} \log L(\mathbf{y}; \theta, \hat{\beta}(\theta)).

  3. Substitute \hat{\theta}_{MLE} to get updated estimates:

    \hat{\beta}_{MLE} = (\mathbf{X'V(\hat{\theta}_{MLE})^{-1}X})^{-1} \mathbf{X'V(\hat{\theta}_{MLE})^{-1}Y}.

  4. Predict random effects:

    \hat{\mathbf{b}}_{MLE} = \mathbf{B}(\hat{\theta}_{MLE}) \mathbf{Z'V}(\hat{\theta}_{MLE})^{-1} (\mathbf{Y - X \hat{\beta}_{MLE}}).
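
A minimal from-scratch sketch of these four steps for a random-intercept model, assuming NumPy and SciPy are available (the simulated data and all variable names are illustrative, not from the text):

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical random-intercept data: N subjects, n observations each.
rng = np.random.default_rng(42)
N, n = 30, 6
subj = np.repeat(np.arange(N), n)
X = np.column_stack([np.ones(N * n), rng.normal(size=N * n)])
Z = np.zeros((N * n, N))
Z[np.arange(N * n), subj] = 1.0                               # random-intercept design
beta_true, tau2_true, sig2_true = np.array([1.0, 0.5]), 0.8, 0.4
y = X @ beta_true + Z @ (np.sqrt(tau2_true) * rng.normal(size=N)) \
    + np.sqrt(sig2_true) * rng.normal(size=N * n)

def neg2loglik(log_theta):
    """-2 log-likelihood with beta profiled out; theta = (tau^2, sigma^2) on the log scale."""
    tau2, sig2 = np.exp(log_theta)
    V = tau2 * (Z @ Z.T) + sig2 * np.eye(N * n)               # V(theta) = Z B Z' + Sigma
    Vi = np.linalg.inv(V)
    beta = np.linalg.solve(X.T @ Vi @ X, X.T @ Vi @ y)        # step 1: GLS given theta
    r = y - X @ beta
    _, logdetV = np.linalg.slogdet(V)
    return logdetV + r @ Vi @ r                               # step 2: criterion in theta

fit = minimize(neg2loglik, x0=np.log([1.0, 1.0]), method="Nelder-Mead")
tau2_hat, sig2_hat = np.exp(fit.x)                            # MLE of the variance components
V_hat = tau2_hat * (Z @ Z.T) + sig2_hat * np.eye(N * n)
Vi_hat = np.linalg.inv(V_hat)
beta_hat = np.linalg.solve(X.T @ Vi_hat @ X, X.T @ Vi_hat @ y)   # step 3: plug-in fixed effects
b_hat = tau2_hat * Z.T @ Vi_hat @ (y - X @ beta_hat)             # step 4: EBLUP (B = tau^2 I here)
print(tau2_hat, sig2_hat, beta_hat)
```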

Key Observations about MLE

  • MLE tends to underestimate the variance components in \theta because it does not account for the degrees of freedom used to estimate the fixed effects.
  • Bias in variance estimates can be corrected using REML.

8.2.4.2 Restricted Maximum Likelihood

Restricted Maximum Likelihood (REML) is an estimation method that improves upon Maximum Likelihood Estimation by accounting for the loss of degrees of freedom due to the estimation of fixed effects.

Unlike MLE, which estimates both fixed effects (\beta) and variance components (\theta) simultaneously, REML focuses on estimating variance components by considering linear combinations of the data that are independent of the fixed effects.

Consider the Linear Mixed Model:

\mathbf{y} = \mathbf{X} \beta + \mathbf{Z} \mathbf{b} + \epsilon,

where:

  • \mathbf{y}: Response vector of length N
  • \mathbf{X}: Design matrix for fixed effects (N \times p)
  • \beta: Fixed effects parameter vector (p \times 1)
  • \mathbf{Z}: Design matrix for random effects
  • \mathbf{b} \sim N(\mathbf{0, D}): Random effects
  • \epsilon \sim N(\mathbf{0, \Sigma}): Residual errors

The marginal distribution of \mathbf{y} is:

\mathbf{y} \sim N(\mathbf{X} \beta, \mathbf{V}(\theta)),

where:

\mathbf{V}(\theta) = \mathbf{Z D Z'} + \mathbf{\Sigma}.

To eliminate dependence on \beta, consider linear transformations of \mathbf{y} that are orthogonal to the fixed effects.

Let \mathbf{K} be a full-rank contrast matrix of size N \times (N - p) such that:

\mathbf{K}' \mathbf{X} = 0.

Then, we consider the transformed data:

\mathbf{K}' \mathbf{y} \sim N(\mathbf{0}, \mathbf{K}' \mathbf{V}(\theta) \mathbf{K}).

  • This transformation removes \beta from the likelihood, focusing solely on the variance components \theta.
  • Importantly, the choice of \mathbf{K} does not affect the final REML estimates.

The REML log-likelihood is:

-2 \log L_{REML}(\theta) = \log |\mathbf{K}' \mathbf{V}(\theta) \mathbf{K}| + \mathbf{y}' \mathbf{K} (\mathbf{K}' \mathbf{V}(\theta) \mathbf{K})^{-1} \mathbf{K}' \mathbf{y}.

An equivalent form of the REML log-likelihood, avoiding explicit use of \mathbf{K}, is:

-2 \log L_{REML}(\theta) = \log |\mathbf{V}(\theta)| + \log |\mathbf{X}' \mathbf{V}(\theta)^{-1} \mathbf{X}| + (\mathbf{y} - \mathbf{X} \hat{\beta})' \mathbf{V}(\theta)^{-1} (\mathbf{y} - \mathbf{X} \hat{\beta}),

where:

\hat{\beta} = (\mathbf{X}' \mathbf{V}(\theta)^{-1} \mathbf{X})^{-1} \mathbf{X}' \mathbf{V}(\theta)^{-1} \mathbf{y}.

This form highlights how REML adjusts for the estimation of fixed effects via the second term \log |\mathbf{X}' \mathbf{V}^{-1} \mathbf{X}|.

Steps for REML Estimation

  1. Transform the data using \mathbf{K}' \mathbf{y} to remove \beta from the likelihood.

  2. Maximize the restricted likelihood to estimate \hat{\theta}_{REML}.

  3. Estimate fixed effects using:

    \hat{\beta}_{REML} = (\mathbf{X}' \mathbf{V}(\hat{\theta}_{REML})^{-1} \mathbf{X})^{-1} \mathbf{X}' \mathbf{V}(\hat{\theta}_{REML})^{-1} \mathbf{y}.
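
Continuing the hypothetical from-scratch sketch used in the MLE section, the REML criterion differs from the profiled ML criterion only by the additional \log|\mathbf{X}'\mathbf{V}(\theta)^{-1}\mathbf{X}| term:

```python
import numpy as np

def neg2loglik_reml(log_theta, X, Z, y):
    """-2 restricted log-likelihood for a random-intercept model;
    theta = (tau^2, sigma^2) on the log scale (illustrative parameterization)."""
    tau2, sig2 = np.exp(log_theta)
    n_obs = y.shape[0]
    V = tau2 * (Z @ Z.T) + sig2 * np.eye(n_obs)      # V(theta) = Z B Z' + Sigma
    Vi = np.linalg.inv(V)
    XtViX = X.T @ Vi @ X
    beta = np.linalg.solve(XtViX, X.T @ Vi @ y)      # profiled-out fixed effects
    r = y - X @ beta
    _, logdetV = np.linalg.slogdet(V)
    _, logdetXtViX = np.linalg.slogdet(XtViX)        # REML adjustment for estimating beta
    return logdetV + logdetXtViX + r @ Vi @ r
```

Minimizing this criterion over log_theta (e.g., with scipy.optimize.minimize, as in the MLE sketch) gives \hat{\theta}_{REML}; the variance-component estimates are typically somewhat larger than their MLE counterparts.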


Properties of REML

  1. Unbiased Variance Component Estimates: REML produces unbiased estimates of variance components by accounting for the degrees of freedom used to estimate fixed effects.
  2. Invariance to Fixed Effects: The restricted likelihood is constructed to be independent of the fixed effects \beta.
  3. Asymptotic Normality: REML estimates are consistent and asymptotically normal under standard regularity conditions.
  4. Efficiency: While REML estimates variance components efficiently, it does not maximize the joint likelihood of all parameters, so \beta estimates are slightly less efficient compared to MLE.

Comparison of REML and MLE
| Criterion | MLE | REML |
|---|---|---|
| Approach | Maximizes full likelihood | Maximizes likelihood of contrasts (removes \beta) |
| Estimates fixed effects? | Yes | No (focuses on variance components) |
| Bias in variance estimates | Biased (underestimates variance components) | Unbiased (corrects for loss of degrees of freedom) |
| Effect of changing \mathbf{X} | Affects variance estimates | No effect on variance estimates |
| Consistency | Yes | Yes |
| Asymptotic normality | Yes | Yes |
| Efficiency | Efficient under normality | More efficient for variance components |
| Model comparison (AIC/BIC) | Suitable for comparing different fixed-effect models | Not suitable (restricted likelihoods are not comparable across different fixed-effect specifications) |
| Performance in small samples | Sensitive to small-sample bias | More robust to small-sample bias |
| Handling outliers | More sensitive | Less sensitive |
| Equivalent to ANOVA? | No | Yes, in balanced designs |
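
As a usage-level illustration, a sketch assuming the statsmodels package and a simulated long-format data frame (the column names, data, and values below are hypothetical, not from the text):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per observation, grouped by subject.
rng = np.random.default_rng(0)
n_subj, n_obs = 20, 8
df = pd.DataFrame({
    "subject": np.repeat(np.arange(n_subj), n_obs),
    "x": rng.normal(size=n_subj * n_obs),
})
df["y"] = 1.0 + 0.5 * df["x"] + np.repeat(rng.normal(0, 1, n_subj), n_obs) \
          + rng.normal(0, 0.5, n_subj * n_obs)

model = smf.mixedlm("y ~ x", data=df, groups=df["subject"])  # random intercept per subject
fit_ml = model.fit(reml=False)     # maximum likelihood
fit_reml = model.fit(reml=True)    # restricted maximum likelihood
print(fit_ml.summary())
print(fit_reml.cov_re)             # estimated random-effects variance under REML
```

Most mixed-model software defaults to REML precisely because of the small-sample bias of ML variance estimates summarized in the table above.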

8.2.4.3 Estimated Generalized Least Squares

MLE and REML rely on the Gaussian assumption, which may not always hold.
EGLS provides an alternative by relying only on the first two moments (mean and variance).

The LMM framework is:

\mathbf{Y}_i = \mathbf{X}_i \beta + \mathbf{Z}_i \mathbf{b}_i + \epsilon_i.

where:

  • Random effects: \mathbf{b}_i \sim N(\mathbf{0, D}).
  • Residual errors: \epsilon_i \sim N(\mathbf{0, \Sigma_i}).
  • Independence assumption: \text{Cov}(\epsilon_i, \mathbf{b}_i) = 0.

Thus, the first two moments are:

E(\mathbf{Y}_i) = \mathbf{X}_i \beta, \quad \text{Var}(\mathbf{Y}_i) = \mathbf{V}_i.

The EGLS estimator is:

\hat{\beta}_{GLS} = \left\{ \sum_{i=1}^n \mathbf{X'_iV_i(\theta)^{-1}X_i} \right\}^{-1} \sum_{i=1}^n \mathbf{X'_iV_i(\theta)^{-1}Y_i}.

Writing in matrix form:

\hat{\beta}_{GLS} = \left\{ \mathbf{X'V(\theta)^{-1}X} \right\}^{-1} \mathbf{X'V(\theta)^{-1}Y}.

Since \mathbf{V}(\theta) is unknown, we estimate it as \hat{\mathbf{V}}, leading to the EGLS estimator:

\hat{\beta}_{EGLS} = \left\{ \mathbf{X'\hat{V}^{-1}X} \right\}^{-1} \mathbf{X'\hat{V}^{-1}Y}.
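
A minimal two-step sketch of EGLS for a random-intercept structure, assuming a simple moment-based plug-in for \hat{\mathbf{V}} (the data, names, and the particular moment estimators below are illustrative choices, not prescribed by the text):

```python
import numpy as np

# Hypothetical random-intercept data in long format.
rng = np.random.default_rng(7)
N, n = 25, 6
subj = np.repeat(np.arange(N), n)
X = np.column_stack([np.ones(N * n), rng.normal(size=N * n)])
Z = np.zeros((N * n, N))
Z[np.arange(N * n), subj] = 1.0
y = X @ np.array([2.0, -0.5]) + Z @ rng.normal(0, 1.0, N) + rng.normal(0, 0.7, N * n)

# Step 1: OLS residuals (OLS ignores the correlation but is consistent for the mean).
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_ols

# Step 2: moment-based estimates of the variance components from within-subject residuals.
R = resid.reshape(N, n)                                        # one row of residuals per subject
cov_within = np.mean([np.outer(r, r) for r in R], axis=0)      # average within-subject outer product
tau2_hat = np.mean(cov_within[~np.eye(n, dtype=bool)])         # off-diagonals -> intercept variance
sig2_hat = np.mean(np.diag(cov_within)) - tau2_hat             # diagonals -> residual variance

# Step 3: plug V-hat into the GLS formula to obtain the EGLS estimator.
V_hat = tau2_hat * (Z @ Z.T) + sig2_hat * np.eye(N * n)
Vi = np.linalg.inv(V_hat)
beta_egls = np.linalg.solve(X.T @ Vi @ X, X.T @ Vi @ y)
print(beta_egls)
```

No likelihood is maximized anywhere in this procedure; only first and second moments of the residuals are used, which is the sense in which EGLS is non-iterative and robust to non-normality.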

Key Insights about EGLS

  • Computational Simplicity:
    • EGLS does not require iterative maximization of a likelihood function, making it computationally attractive.
  • Same Form as MLE/REML:
    • The fixed effects estimators for MLE, REML, and EGLS have the same form, differing only in how \mathbf{V} is estimated.
  • Robust to Non-Gaussian Data:
    • Since it only depends on first and second moments, it can handle cases where MLE and REML struggle with non-normality.

When to Use EGLS?

  • When the normality assumption for MLE/REML is questionable.
  • When \mathbf{V} can be estimated efficiently without requiring complex optimization.
  • In non-iterative approaches, where computational simplicity is a priority.

8.2.4.4 Bayesian Hierarchical Models

Bayesian methods offer a fully probabilistic framework to estimate \mathbf{V} by incorporating prior distributions.

The joint distribution can be decomposed hierarchically:

f(A, B, C) = f(A | B, C) f(B | C) f(C).

Applying this to LMMs:

\begin{aligned} f(\mathbf{Y, \beta, b, \theta}) &= f(\mathbf{Y | \beta, b, \theta}) f(\mathbf{b | \theta, \beta}) f(\mathbf{\beta | \theta}) f(\mathbf{\theta}) \\ &= f(\mathbf{Y | \beta, b, \theta}) f(\mathbf{b | \theta}) f(\mathbf{\beta}) f(\mathbf{\theta}). \end{aligned}

where:

  • The first equality follows from probability decomposition.
  • The second equality assumes conditional independence, meaning:
    • Given \theta, no additional information about \mathbf{b} is obtained from knowing \beta.

Using Bayes’ theorem, the posterior distribution is:

f(\mathbf{\beta, b, \theta | Y}) \propto f(\mathbf{Y | \beta, b, \theta}) f(\mathbf{b | \theta}) f(\mathbf{\beta}) f(\mathbf{\theta}).

where:

\begin{aligned} \mathbf{Y | \beta, b, \theta} &\sim N(\mathbf{X\beta + Zb}, \mathbf{\Sigma(\theta)}), \\ \mathbf{b | \theta} &\sim N(\mathbf{0, B(\theta)}). \end{aligned}

To complete the Bayesian model, we specify prior distributions:

  • f(\beta): Prior on fixed effects.
  • f(\theta): Prior on variance components.

Since analytical solutions are generally unavailable, we use Markov Chain Monte Carlo (MCMC) to sample from the posterior:

  • Gibbs sampling (if conjugate priors are used).
  • Hamiltonian Monte Carlo (HMC) (for complex models).
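
For instance, a minimal sketch of a Bayesian random-intercept model sampled with HMC/NUTS, assuming the PyMC package is available (the simulated data, variable names, and prior choices are illustrative, not from the text):

```python
import numpy as np
import pymc as pm

# Hypothetical long-format data: subject index, covariate, and response.
rng = np.random.default_rng(1)
n_subj, n_obs = 10, 5
subj = np.repeat(np.arange(n_subj), n_obs)
x = rng.normal(size=n_subj * n_obs)
b_true = rng.normal(0, 1.0, size=n_subj)
y = 2.0 + 0.5 * x + b_true[subj] + rng.normal(0, 0.5, size=n_subj * n_obs)

with pm.Model() as lmm:
    beta0 = pm.Normal("beta0", 0, 10)            # prior on fixed intercept
    beta1 = pm.Normal("beta1", 0, 10)            # prior on fixed slope
    sigma_b = pm.HalfNormal("sigma_b", 2.0)      # prior on random-intercept sd
    sigma_e = pm.HalfNormal("sigma_e", 2.0)      # prior on residual sd
    b = pm.Normal("b", 0, sigma_b, shape=n_subj) # random intercepts
    mu = beta0 + beta1 * x + b[subj]
    pm.Normal("y_obs", mu, sigma_e, observed=y)
    idata = pm.sample(1000, tune=1000, target_accept=0.9)  # NUTS (HMC) sampling
```

The posterior draws of sigma_b and sigma_e play the role of the variance components in \mathbf{V}, with their full posterior uncertainty retained rather than plugged in as point estimates.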

Advantages

  1. Accounts for Parameter Uncertainty
    • Unlike MLE/REML, Bayesian methods propagate uncertainty in variance component estimation.
  2. Flexible Model Specification
    • Can incorporate prior knowledge via informative priors.
    • Extends naturally beyond Gaussian assumptions (e.g., Student-t distributions for heavy-tailed errors).
  3. Robustness in Small Samples
    • Bayesian methods can stabilize variance estimation in small datasets where MLE/REML are unreliable.

Challenges

  1. Computational Complexity
    • Requires MCMC algorithms, which can be computationally expensive.
  2. Convergence Issues
    • MCMC chains must be checked for convergence (e.g., using R-hat diagnostic).
  3. Choice of Priors
    • Poorly chosen priors can bias estimates or slow down convergence.

Comparison of Estimation Methods for \mathbf{V}

| Method | Assumptions | Computational Cost | Handles Non-Normality? | Best Use Case |
|---|---|---|---|---|
| MLE | Gaussian errors | High (iterative) | ❌ No | Model selection (AIC/BIC) |
| REML | Gaussian errors | High (iterative) | ❌ No | Variance estimation |
| EGLS | First two moments only | Low (non-iterative) | ✅ Yes | Large-scale models with correlated errors |
| Bayesian (BHM) | Full probability model with priors | Very high (MCMC) | ✅ Yes | Small samples, prior information available |

References

Henderson, Charles R. 1975. “Best Linear Unbiased Estimation and Prediction Under a Selection Model.” Biometrics, 423–47.