8.2 Estimation in Linear Mixed Models
The general Linear Mixed Model is:
\mathbf{Y}_i = \mathbf{X}_i \beta + \mathbf{Z}_i \mathbf{b}_i + \epsilon_i,
where:
- \beta: Fixed effects (parameters shared across subjects).
- \mathbf{b}_i: Random effects (subject-specific deviations).
- \mathbf{X}_i: Design matrix for the fixed effects.
- \mathbf{Z}_i: Design matrix for the random effects.
- \mathbf{D}: Covariance matrix of the random effects.
- \mathbf{\Sigma}_i: Covariance matrix of the residual errors.
Since \beta, \mathbf{b}_i, \mathbf{D}, and \mathbf{\Sigma}_i are unknown, they must be estimated or predicted from the data.
- \beta, \mathbf{D}, \mathbf{\Sigma}_i are fixed parameters → they must be estimated.
- \mathbf{b}_i is a random variable → it must be predicted rather than estimated; a random quantity cannot be "estimated" in the usual sense.
Thus, we define:
- Estimator of \beta: \hat{\beta} (fixed-effect estimation).
- Predictor of \mathbf{b}_i: \hat{\mathbf{b}}_i (random-effect prediction).
Then:
The population-level estimate of \mathbf{Y}_i is:
\hat{\mathbf{Y}}_i = \mathbf{X}_i \hat{\beta}.
The subject-specific prediction is:
\hat{\mathbf{Y}}_i = \mathbf{X}_i \hat{\beta} + \mathbf{Z}_i \hat{\mathbf{b}}_i.
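To make the distinction concrete, here is a minimal numerical sketch (a hypothetical subject with four visits and made-up values for \hat{\beta} and \hat{\mathbf{b}}_i) contrasting the two predictions:

```python
import numpy as np

# Hypothetical subject i with 4 visits: intercept + time as fixed effects,
# and a subject-specific random intercept and slope.
time = np.array([0.0, 1.0, 2.0, 3.0])
X_i = np.column_stack([np.ones(4), time])   # fixed-effects design (4 x 2)
Z_i = X_i.copy()                            # random-effects design (4 x 2)

beta_hat = np.array([10.0, 2.0])            # hypothetical fixed-effect estimates
b_i_hat = np.array([-1.5, 0.3])             # hypothetical BLUP for subject i

y_pop = X_i @ beta_hat                      # population-level estimate  X_i beta_hat
y_subj = X_i @ beta_hat + Z_i @ b_i_hat     # subject-specific prediction

print(y_pop)    # [10. 12. 14. 16.]
print(y_subj)   # [ 8.5 10.8 13.1 15.4]
```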
For all N subjects, we stack the subject-level equations into a single model, the basis for the Mixed Model Equations (Henderson 1975):
\mathbf{Y} = \mathbf{X}\beta + \mathbf{Z}\mathbf{b} + \epsilon.
and
\mathbf{Y} \sim N(\mathbf{X}\beta, \mathbf{Z}\mathbf{B}\mathbf{Z}' + \mathbf{\Sigma}),
where:
\mathbf{Y} = \begin{bmatrix} \mathbf{y}_1 \\ \vdots \\ \mathbf{y}_N \end{bmatrix}, \quad \mathbf{X} = \begin{bmatrix} \mathbf{X}_1 \\ \vdots \\ \mathbf{X}_N \end{bmatrix}, \quad \mathbf{b} = \begin{bmatrix} \mathbf{b}_1 \\ \vdots \\ \mathbf{b}_N \end{bmatrix}, \quad \epsilon = \begin{bmatrix} \epsilon_1 \\ \vdots \\ \epsilon_N \end{bmatrix}.
The covariance structure is:
\text{Cov}(\mathbf{b}) = \mathbf{B}, \quad \text{Cov}(\epsilon) = \mathbf{\Sigma}, \quad \text{Cov}(\mathbf{b}, \epsilon) = \mathbf{0}.
Expanding \mathbf{Z} and \mathbf{B} as block-diagonal matrices:
\mathbf{Z} = \begin{bmatrix} \mathbf{Z}_1 & \mathbf{0} & \dots & \mathbf{0} \\ \mathbf{0} & \mathbf{Z}_2 & \dots & \mathbf{0} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \dots & \mathbf{Z}_N \end{bmatrix}, \quad \mathbf{B} = \begin{bmatrix} \mathbf{D} & \mathbf{0} & \dots & \mathbf{0} \\ \mathbf{0} & \mathbf{D} & \dots & \mathbf{0} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \dots & \mathbf{D} \end{bmatrix}.
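As a small sketch of how the stacked matrices can be assembled (assuming, purely for illustration, N = 3 subjects with a random intercept and slope each), scipy.linalg.block_diag builds \mathbf{Z} and a Kronecker product builds \mathbf{B}:

```python
import numpy as np
from scipy.linalg import block_diag

N, q = 3, 2                                  # 3 subjects, q = 2 random effects each
n_i = [4, 5, 3]                              # unbalanced cluster sizes

# Per-subject random-effects design matrices Z_i (intercept + time)
Z_list = [np.column_stack([np.ones(n), np.arange(n)]) for n in n_i]

# Common covariance D of the random effects (q x q), illustrative values
D = np.array([[1.0, 0.2],
              [0.2, 0.5]])

Z = block_diag(*Z_list)                      # (sum n_i) x (N*q) block-diagonal
B = np.kron(np.eye(N), D)                    # (N*q) x (N*q) with D repeated on the diagonal

print(Z.shape, B.shape)                      # (12, 6) (6, 6)
```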
The best linear unbiased estimator (BLUE) of \beta and the best linear unbiased predictor (BLUP) of \mathbf{b} are obtained by solving (Henderson 1975):
\begin{bmatrix} \hat{\beta} \\ \hat{\mathbf{b}} \end{bmatrix} = \begin{bmatrix} \mathbf{X}'\mathbf{\Sigma}^{-1}\mathbf{X} & \mathbf{X}'\mathbf{\Sigma}^{-1}\mathbf{Z} \\ \mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{X} & \mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{Z} + \mathbf{B}^{-1} \end{bmatrix}^{-1} \begin{bmatrix} \mathbf{X}'\mathbf{\Sigma}^{-1}\mathbf{Y} \\ \mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{Y} \end{bmatrix},
where:
- \hat{\beta} is the Generalized Least Squares estimator of \beta.
- \hat{\mathbf{b}} is the BLUP of \mathbf{b}.
If we define:
\mathbf{V} = \mathbf{Z}\mathbf{B}\mathbf{Z}' + \mathbf{\Sigma}.
Then, the solutions to the Mixed Model Equations are:
\hat{\beta} = (\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1}\mathbf{Y}, \quad \hat{\mathbf{b}} = \mathbf{B}\mathbf{Z}'\mathbf{V}^{-1}(\mathbf{Y} - \mathbf{X}\hat{\beta}),
where:
- \hat{\beta} is obtained by Generalized Least Squares.
- \hat{\mathbf{b}} is a weighted least squares predictor, with weights determined by \mathbf{B} and \mathbf{V}.
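A minimal numpy sketch of these two formulas, assuming a simulated random-intercept model in which the variance components are treated as known (in practice they must be estimated, as discussed in Section 8.2.4):

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 10, 6                                   # 10 subjects, 6 observations each
sigma2, tau2 = 1.0, 2.0                        # residual and random-intercept variances

# Fixed effects: intercept + one covariate; random effect: a subject-specific intercept
X = np.column_stack([np.ones(N * n), rng.normal(size=N * n)])
Z = np.kron(np.eye(N), np.ones((n, 1)))        # maps each subject's intercept to its rows
beta_true = np.array([1.0, 0.5])
b_true = rng.normal(0.0, np.sqrt(tau2), N)
Y = X @ beta_true + Z @ b_true + rng.normal(0.0, np.sqrt(sigma2), N * n)

B = tau2 * np.eye(N)                           # Cov(b)
Sigma = sigma2 * np.eye(N * n)                 # Cov(eps)
V = Z @ B @ Z.T + Sigma

V_inv = np.linalg.inv(V)
beta_hat = np.linalg.solve(X.T @ V_inv @ X, X.T @ V_inv @ Y)   # GLS estimator of beta
b_hat = B @ Z.T @ V_inv @ (Y - X @ beta_hat)                   # BLUP of b
print(beta_hat, b_hat[:3])
```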
Properties of the Estimators
For \hat{\beta}:
E(\hat{\beta}) = \beta, \quad \text{Var}(\hat{\beta}) = (\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}.
For \hat{\mathbf{b}}:
E(\hat{\mathbf{b}}) = \mathbf{0}.
The variance of the prediction error (Mean Squared Prediction Error, MSPE) is:
\text{Var}(\hat{\mathbf{b}} - \mathbf{b}) = \mathbf{B} - \mathbf{B}\mathbf{Z}'\mathbf{V}^{-1}\mathbf{Z}\mathbf{B} + \mathbf{B}\mathbf{Z}'\mathbf{V}^{-1}\mathbf{X}(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1}\mathbf{Z}\mathbf{B}.
🔹 Key Insight:
The Mean Squared Prediction Error is more meaningful than \text{Var}(\hat{\mathbf{b}}) alone: because \mathbf{b} is itself random (and E(\hat{\mathbf{b}}) = \mathbf{0}), what matters is how far \hat{\mathbf{b}} is from \mathbf{b}, not merely how variable \hat{\mathbf{b}} is.
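A self-contained sketch that evaluates \text{Var}(\hat{\beta}) and the MSPE matrix for a hypothetical random-intercept design with known variance components, mirroring the formulas above:

```python
import numpy as np

rng = np.random.default_rng(2)
N, n = 10, 6
sigma2, tau2 = 1.0, 2.0
X = np.column_stack([np.ones(N * n), rng.normal(size=N * n)])
Z = np.kron(np.eye(N), np.ones((n, 1)))
B = tau2 * np.eye(N)
V = Z @ B @ Z.T + sigma2 * np.eye(N * n)
V_inv = np.linalg.inv(V)

# Var(beta_hat) = (X' V^{-1} X)^{-1}
var_beta_hat = np.linalg.inv(X.T @ V_inv @ X)

# MSPE of the BLUP: B - B Z' V^{-1} Z B + B Z' V^{-1} X (X' V^{-1} X)^{-1} X' V^{-1} Z B
mspe = (B - B @ Z.T @ V_inv @ Z @ B
        + B @ Z.T @ V_inv @ X @ var_beta_hat @ X.T @ V_inv @ Z @ B)

print(np.diag(var_beta_hat))   # sampling variances of the fixed-effect estimates
print(np.diag(mspe)[:3])       # prediction-error variances for the first few subjects
```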
8.2.1 Interpretation of the Mixed Model Equations
The system:
\begin{bmatrix} \mathbf{X}'\mathbf{\Sigma}^{-1}\mathbf{X} & \mathbf{X}'\mathbf{\Sigma}^{-1}\mathbf{Z} \\ \mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{X} & \mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{Z} + \mathbf{B}^{-1} \end{bmatrix} \begin{bmatrix} \beta \\ \mathbf{b} \end{bmatrix} = \begin{bmatrix} \mathbf{X}'\mathbf{\Sigma}^{-1}\mathbf{Y} \\ \mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{Y} \end{bmatrix}
can be understood as:
- Fixed Effects Estimation (\hat{\beta})
- Uses Generalized Least Squares.
- Adjusted for both random effects and correlated errors.
- Random Effects Prediction (\hat{\mathbf{b}})
- Computed using the BLUP formula.
- Shrinks subject-specific estimates toward the population mean.
Component | Equation | Interpretation |
---|---|---|
Fixed effects (\hat{\beta}) | (\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1}\mathbf{Y} | Generalized Least Squares estimator |
Random effects (\hat{\mathbf{b}}) | \mathbf{B}\mathbf{Z}'\mathbf{V}^{-1}(\mathbf{Y} - \mathbf{X}\hat{\beta}) | Best Linear Unbiased Predictor (BLUP) |
Variance of \hat{\beta} | (\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1} | Uncertainty in the fixed-effect estimates |
Variance of prediction error | \mathbf{B} - \mathbf{B}\mathbf{Z}'\mathbf{V}^{-1}\mathbf{Z}\mathbf{B} + \mathbf{B}\mathbf{Z}'\mathbf{V}^{-1}\mathbf{X}(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1}\mathbf{Z}\mathbf{B} | Uncertainty in predicting the random \mathbf{b} |
8.2.2 Derivation of the Mixed Model Equations
To derive the Mixed Model Equations, consider:
\epsilon = \mathbf{Y} - \mathbf{X}\beta - \mathbf{Z}\mathbf{b}.
Define:
- T = \sum_{i=1}^{N} n_i: the total number of observations.
- Nq: the total number of random effects (N subjects, each with q random effects).
The joint distribution of (b,ϵ) is:
f(\mathbf{b}, \epsilon) = \frac{1}{(2\pi)^{(T + Nq)/2}} \left| \begin{matrix} \mathbf{B} & \mathbf{0} \\ \mathbf{0} & \mathbf{\Sigma} \end{matrix} \right|^{-1/2} \exp\left( -\frac{1}{2} \begin{bmatrix} \mathbf{b} \\ \mathbf{Y} - \mathbf{X}\beta - \mathbf{Z}\mathbf{b} \end{bmatrix}' \begin{bmatrix} \mathbf{B} & \mathbf{0} \\ \mathbf{0} & \mathbf{\Sigma} \end{bmatrix}^{-1} \begin{bmatrix} \mathbf{b} \\ \mathbf{Y} - \mathbf{X}\beta - \mathbf{Z}\mathbf{b} \end{bmatrix} \right).
Maximizing f(\mathbf{b}, \epsilon) with respect to \mathbf{b} and \beta requires minimizing:
Q = \begin{bmatrix} \mathbf{b} \\ \mathbf{Y} - \mathbf{X}\beta - \mathbf{Z}\mathbf{b} \end{bmatrix}' \begin{bmatrix} \mathbf{B} & \mathbf{0} \\ \mathbf{0} & \mathbf{\Sigma} \end{bmatrix}^{-1} \begin{bmatrix} \mathbf{b} \\ \mathbf{Y} - \mathbf{X}\beta - \mathbf{Z}\mathbf{b} \end{bmatrix} = \mathbf{b}'\mathbf{B}^{-1}\mathbf{b} + (\mathbf{Y} - \mathbf{X}\beta - \mathbf{Z}\mathbf{b})'\mathbf{\Sigma}^{-1}(\mathbf{Y} - \mathbf{X}\beta - \mathbf{Z}\mathbf{b}).
Setting the derivatives of Q with respect to b and β to zero leads to the system of equations:
\begin{aligned} \mathbf{X}'\mathbf{\Sigma}^{-1}\mathbf{X}\beta + \mathbf{X}'\mathbf{\Sigma}^{-1}\mathbf{Z}\mathbf{b} &= \mathbf{X}'\mathbf{\Sigma}^{-1}\mathbf{Y} \\ \mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{X}\beta + (\mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{Z} + \mathbf{B}^{-1})\mathbf{b} &= \mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{Y} \end{aligned}
Rearranging:
\begin{bmatrix} \mathbf{X}'\mathbf{\Sigma}^{-1}\mathbf{X} & \mathbf{X}'\mathbf{\Sigma}^{-1}\mathbf{Z} \\ \mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{X} & \mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{Z} + \mathbf{B}^{-1} \end{bmatrix} \begin{bmatrix} \beta \\ \mathbf{b} \end{bmatrix} = \begin{bmatrix} \mathbf{X}'\mathbf{\Sigma}^{-1}\mathbf{Y} \\ \mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{Y} \end{bmatrix}
Thus, solving the Mixed Model Equations gives:
\begin{bmatrix} \hat{\beta} \\ \hat{\mathbf{b}} \end{bmatrix} = \begin{bmatrix} \mathbf{X}'\mathbf{\Sigma}^{-1}\mathbf{X} & \mathbf{X}'\mathbf{\Sigma}^{-1}\mathbf{Z} \\ \mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{X} & \mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{Z} + \mathbf{B}^{-1} \end{bmatrix}^{-1} \begin{bmatrix} \mathbf{X}'\mathbf{\Sigma}^{-1}\mathbf{Y} \\ \mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{Y} \end{bmatrix}
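As a numerical sanity check, the sketch below (hypothetical random-intercept data) forms Henderson's coefficient matrix directly, solves the system, and verifies that the result matches the \mathbf{V}-based formulas for \hat{\beta} and \hat{\mathbf{b}}:

```python
import numpy as np

rng = np.random.default_rng(3)
N, n = 5, 4
sigma2, tau2 = 1.0, 2.0
X = np.column_stack([np.ones(N * n), rng.normal(size=N * n)])
Z = np.kron(np.eye(N), np.ones((n, 1)))
B = tau2 * np.eye(N)
Sigma = sigma2 * np.eye(N * n)
Y = X @ np.array([1.0, 0.5]) + Z @ rng.normal(0.0, np.sqrt(tau2), N) \
    + rng.normal(0.0, np.sqrt(sigma2), N * n)

Si = np.linalg.inv(Sigma)
# Henderson's Mixed Model Equations: coefficient matrix and right-hand side
C = np.block([[X.T @ Si @ X, X.T @ Si @ Z],
              [Z.T @ Si @ X, Z.T @ Si @ Z + np.linalg.inv(B)]])
rhs = np.concatenate([X.T @ Si @ Y, Z.T @ Si @ Y])
sol = np.linalg.solve(C, rhs)
beta_hat_mme, b_hat_mme = sol[:2], sol[2:]

# Same answers from the V-based formulas
V_inv = np.linalg.inv(Z @ B @ Z.T + Sigma)
beta_hat = np.linalg.solve(X.T @ V_inv @ X, X.T @ V_inv @ Y)
b_hat = B @ Z.T @ V_inv @ (Y - X @ beta_hat)
print(np.allclose(beta_hat_mme, beta_hat), np.allclose(b_hat_mme, b_hat))  # True True
```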
8.2.3 Bayesian Interpretation of Linear Mixed Models
In a Bayesian framework, the posterior distribution of the random effects b given the observed data Y is derived using Bayes’ theorem:
f(\mathbf{b} \mid \mathbf{Y}) = \frac{f(\mathbf{Y} \mid \mathbf{b}) \, f(\mathbf{b})}{\int f(\mathbf{Y} \mid \mathbf{b}) \, f(\mathbf{b}) \, d\mathbf{b}},
where:
- f(Y|b) is the likelihood function, describing how the data are generated given the random effects.
- f(b) is the prior distribution of the random effects.
- The denominator ∫f(Y|b)f(b)db is the normalizing constant that ensures the posterior integrates to 1.
- f(b|Y) is the posterior distribution, which updates our belief about b given the observed data Y.
In the Linear Mixed Model, we assume:
\mathbf{Y} \mid \mathbf{b} \sim N(\mathbf{X}\beta + \mathbf{Z}\mathbf{b}, \mathbf{\Sigma}), \quad \mathbf{b} \sim N(\mathbf{0}, \mathbf{B}).
This means:
- Likelihood: Given b, the data Y follows a multivariate normal distribution with mean Xβ+Zb and covariance Σ.
- Prior for b: The random effects are assumed to follow a multivariate normal distribution with mean 0 and covariance B.
By applying Bayes’ theorem, the posterior distribution of b given Y is:
\mathbf{b} \mid \mathbf{Y} \sim N\left( \mathbf{B}\mathbf{Z}'\mathbf{V}^{-1}(\mathbf{Y} - \mathbf{X}\beta), \; (\mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{Z} + \mathbf{B}^{-1})^{-1} \right),
where:
- Mean: \mathbf{B}\mathbf{Z}'\mathbf{V}^{-1}(\mathbf{Y} - \mathbf{X}\beta)
- This is the BLUP.
- It is the optimal predictor of \mathbf{b} given \mathbf{Y} under squared-error loss (the posterior mean).
- Covariance: (\mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{Z} + \mathbf{B}^{-1})^{-1}
- This posterior variance accounts for both prior uncertainty (B) and data uncertainty (Σ).
Thus, the Bayesian posterior mean of \mathbf{b} coincides with the BLUP:
E(\mathbf{b} \mid \mathbf{Y}) = \mathbf{B}\mathbf{Z}'\mathbf{V}^{-1}(\mathbf{Y} - \mathbf{X}\beta).
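The equality of the conditional-posterior form (\mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{Z} + \mathbf{B}^{-1})^{-1}\mathbf{Z}'\mathbf{\Sigma}^{-1}(\mathbf{Y} - \mathbf{X}\beta) and the BLUP form \mathbf{B}\mathbf{Z}'\mathbf{V}^{-1}(\mathbf{Y} - \mathbf{X}\beta) is a standard matrix identity; here is a quick numerical check on simulated random-intercept data:

```python
import numpy as np

rng = np.random.default_rng(4)
N, n = 5, 4
sigma2, tau2 = 1.0, 2.0
X = np.column_stack([np.ones(N * n), rng.normal(size=N * n)])
Z = np.kron(np.eye(N), np.ones((n, 1)))
B = tau2 * np.eye(N)
Sigma = sigma2 * np.eye(N * n)
beta = np.array([1.0, 0.5])
Y = X @ beta + Z @ rng.normal(0.0, np.sqrt(tau2), N) + rng.normal(0.0, 1.0, N * n)

Si = np.linalg.inv(Sigma)
V_inv = np.linalg.inv(Z @ B @ Z.T + Sigma)
resid = Y - X @ beta

post_cov = np.linalg.inv(Z.T @ Si @ Z + np.linalg.inv(B))   # posterior covariance
post_mean = post_cov @ Z.T @ Si @ resid                      # conditional-posterior mean
blup = B @ Z.T @ V_inv @ resid                               # BLUP form

print(np.allclose(post_mean, blup))                          # True (same quantity)
```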
Interpretation of the Posterior Distribution
- Posterior Mean as a Shrinkage Estimator (BLUP)
- The expectation E(b|Y) shrinks individual estimates toward the population mean.
- Subjects with less data or more variability will have estimates closer to zero.
- This is similar to Ridge Regression in penalized estimation.
- Posterior Variance Quantifies Uncertainty
- The matrix (\mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{Z} + \mathbf{B}^{-1})^{-1} captures the remaining uncertainty in \mathbf{b} after seeing \mathbf{Y}.
- If \mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{Z} is large, the data provide strong information about \mathbf{b}, reducing the posterior variance.
- If \mathbf{B}^{-1} dominates, the prior heavily influences the estimates.
- Connection to Bayesian Inference
- The random effects b follow a Gaussian posterior due to conjugacy.
- This is analogous to Bayesian hierarchical models, where random effects are latent variables estimated from data.
Step | Equation | Interpretation |
---|---|---|
Likelihood | \mathbf{Y} \mid \mathbf{b} \sim N(\mathbf{X}\beta + \mathbf{Z}\mathbf{b}, \mathbf{\Sigma}) | Data given random effects |
Prior | \mathbf{b} \sim N(\mathbf{0}, \mathbf{B}) | Random effects distribution |
Posterior | \mathbf{b} \mid \mathbf{Y} \sim N\left(\mathbf{B}\mathbf{Z}'\mathbf{V}^{-1}(\mathbf{Y} - \mathbf{X}\beta), (\mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{Z} + \mathbf{B}^{-1})^{-1}\right) | Updated belief about \mathbf{b} |
Posterior Mean (BLUP) | E(\mathbf{b} \mid \mathbf{Y}) = \mathbf{B}\mathbf{Z}'\mathbf{V}^{-1}(\mathbf{Y} - \mathbf{X}\beta) | Best predictor under squared-error loss |
Posterior Variance | (\mathbf{Z}'\mathbf{\Sigma}^{-1}\mathbf{Z} + \mathbf{B}^{-1})^{-1} | Uncertainty in predictions |
8.2.4 Estimating the Variance-Covariance Matrix
If we have an estimate \tilde{\mathbf{V}} of \mathbf{V}, we can estimate the fixed and random effects as:
\hat{\beta} = (\mathbf{X}'\tilde{\mathbf{V}}^{-1}\mathbf{X})^{-1}\mathbf{X}'\tilde{\mathbf{V}}^{-1}\mathbf{Y}, \quad \hat{\mathbf{b}} = \tilde{\mathbf{B}}\mathbf{Z}'\tilde{\mathbf{V}}^{-1}(\mathbf{Y} - \mathbf{X}\hat{\beta}),
where:
- \hat{\beta} is the estimate of the fixed effects.
- \hat{\mathbf{b}} is the Empirical Best Linear Unbiased Predictor (EBLUP), also called the Empirical Bayes estimate of \mathbf{b}.
- \tilde{\mathbf{B}} is the plug-in estimate of \mathbf{B} used in forming \tilde{\mathbf{V}} = \mathbf{Z}\tilde{\mathbf{B}}\mathbf{Z}' + \tilde{\mathbf{\Sigma}}.
Properties of \hat{\beta} and Variance Estimation
- Consistency: \widehat{\text{Var}}(\hat{\beta}) is a consistent estimator of \text{Var}(\hat{\beta}) if \tilde{\mathbf{V}} is a consistent estimator of \mathbf{V}.
- Bias Issue: \widehat{\text{Var}}(\hat{\beta}) is biased because it does not account for the uncertainty in estimating \mathbf{V}.
- Implication: \widehat{\text{Var}}(\hat{\beta}) therefore understates the true variability of \hat{\beta}.
To estimate V, several approaches can be used:
- Maximum Likelihood Estimation (MLE)
- Restricted Maximum Likelihood (REML)
- Estimated Generalized Least Squares (EGLS)
- Bayesian Hierarchical Models (BHM)
8.2.4.1 Maximum Likelihood Estimation
MLE finds parameter estimates by maximizing the likelihood function.
Define a parameter vector \theta that collects all unknown variance components in \mathbf{\Sigma} and \mathbf{B}. Then, we assume:
\mathbf{Y} \sim N(\mathbf{X}\beta, \mathbf{V}(\theta)).
The log-likelihood function (ignoring constant terms) is:
-2 \log L(\mathbf{y}; \theta, \beta) = \log |\mathbf{V}(\theta)| + (\mathbf{Y} - \mathbf{X}\beta)'\mathbf{V}(\theta)^{-1}(\mathbf{Y} - \mathbf{X}\beta).
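A sketch of this criterion as a function, assuming a hypothetical random-intercept parameterization \theta = (\sigma^2, \tau^2) so that \mathbf{V}(\theta) = \tau^2 \mathbf{Z}\mathbf{Z}' + \sigma^2 \mathbf{I}:

```python
import numpy as np

def neg2_loglik(theta, Y, X, Z, beta):
    """-2 log-likelihood (up to an additive constant), with V(theta) = tau2 * Z Z' + sigma2 * I."""
    sigma2, tau2 = theta
    V = tau2 * Z @ Z.T + sigma2 * np.eye(len(Y))
    resid = Y - X @ beta
    _, logdet = np.linalg.slogdet(V)           # numerically stable log|V|
    return logdet + resid @ np.linalg.solve(V, resid)
```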
Steps for MLE Estimation
Estimate \hat{\beta}, assuming \theta is known:
\hat{\beta}_{MLE} = (\mathbf{X}'\mathbf{V}(\theta)^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}(\theta)^{-1}\mathbf{Y}.
Obtain \hat{\theta}_{MLE} by maximizing the resulting (profile) log-likelihood:
\hat{\theta}_{MLE} = \arg\max_{\theta} \log L(\mathbf{y}; \theta, \hat{\beta}(\theta)).
Substitute \hat{\theta}_{MLE} to get updated estimates:
\hat{\beta}_{MLE} = (\mathbf{X'V(\hat{\theta}_{MLE})^{-1}X})^{-1} \mathbf{X'V(\hat{\theta}_{MLE})^{-1}Y}.
Predict random effects:
\hat{\mathbf{b}}_{MLE} = \mathbf{B}(\hat{\theta}_{MLE}) \mathbf{Z'V}(\hat{\theta}_{MLE})^{-1} (\mathbf{Y - X \hat{\beta}_{MLE}}).
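A minimal end-to-end sketch of these steps for a simulated random-intercept model, profiling \beta out and optimizing \theta = (\sigma^2, \tau^2) numerically with scipy (log-parameterized to keep the variances positive); a real analysis would use dedicated mixed-model software, this is only meant to mirror the algebra above:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
N, n = 30, 5
X = np.column_stack([np.ones(N * n), rng.normal(size=N * n)])
Z = np.kron(np.eye(N), np.ones((n, 1)))
Y = X @ np.array([1.0, 0.5]) + Z @ rng.normal(0.0, np.sqrt(2.0), N) \
    + rng.normal(0.0, 1.0, N * n)

def profile_neg2_loglik(log_theta):
    sigma2, tau2 = np.exp(log_theta)
    V = tau2 * Z @ Z.T + sigma2 * np.eye(len(Y))
    V_inv = np.linalg.inv(V)
    beta = np.linalg.solve(X.T @ V_inv @ X, X.T @ V_inv @ Y)   # step 1: beta given theta
    resid = Y - X @ beta
    return np.linalg.slogdet(V)[1] + resid @ V_inv @ resid

res = minimize(profile_neg2_loglik, x0=np.log([1.0, 1.0]), method="Nelder-Mead")
sigma2_hat, tau2_hat = np.exp(res.x)                            # step 2: theta_MLE

V_hat = tau2_hat * Z @ Z.T + sigma2_hat * np.eye(len(Y))
V_hat_inv = np.linalg.inv(V_hat)
beta_hat = np.linalg.solve(X.T @ V_hat_inv @ X, X.T @ V_hat_inv @ Y)   # step 3
B_hat = tau2_hat * np.eye(N)
b_hat = B_hat @ Z.T @ V_hat_inv @ (Y - X @ beta_hat)                    # step 4: EBLUP
print(sigma2_hat, tau2_hat, beta_hat)
```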
Key Observations about MLE
- MLE tends to underestimate the variance components in \theta because it does not account for the degrees of freedom used to estimate the fixed effects.
- Bias in variance estimates can be corrected using REML.
8.2.4.2 Restricted Maximum Likelihood
Restricted Maximum Likelihood (REML) is an estimation method that improves upon Maximum Likelihood Estimation by accounting for the loss of degrees of freedom due to the estimation of fixed effects.
Unlike MLE, which estimates both fixed effects (\beta) and variance components (\theta) simultaneously, REML focuses on estimating variance components by considering linear combinations of the data that are independent of the fixed effects.
Consider the Linear Mixed Model:
\mathbf{y} = \mathbf{X} \beta + \mathbf{Z} \mathbf{b} + \epsilon,
where:
- \mathbf{y}: Response vector of length N
- \mathbf{X}: Design matrix for fixed effects (N \times p)
- \beta: Fixed effects parameter vector (p \times 1)
- \mathbf{Z}: Design matrix for random effects
- \mathbf{b} \sim N(\mathbf{0, D}): Random effects
- \epsilon \sim N(\mathbf{0, \Sigma}): Residual errors
The marginal distribution of \mathbf{y} is:
\mathbf{y} \sim N(\mathbf{X} \beta, \mathbf{V}(\theta)),
where:
\mathbf{V}(\theta) = \mathbf{Z D Z'} + \mathbf{\Sigma}.
To eliminate dependence on \beta, consider linear transformations of \mathbf{y} that are orthogonal to the fixed effects.
Let \mathbf{K} be a full-rank contrast matrix of size N \times (N - p) such that:
\mathbf{K}' \mathbf{X} = 0.
Then, we consider the transformed data:
\mathbf{K}' \mathbf{y} \sim N(\mathbf{0}, \mathbf{K}' \mathbf{V}(\theta) \mathbf{K}).
- This transformation removes \beta from the likelihood, focusing solely on the variance components \theta.
- Importantly, the choice of \mathbf{K} does not affect the final REML estimates.
The REML log-likelihood is:
-2 \log L_{REML}(\theta) = \log |\mathbf{K}' \mathbf{V}(\theta) \mathbf{K}| + \mathbf{y}' \mathbf{K} (\mathbf{K}' \mathbf{V}(\theta) \mathbf{K})^{-1} \mathbf{K}' \mathbf{y}.
An equivalent form of the REML log-likelihood, avoiding explicit use of \mathbf{K}, is:
-2 \log L_{REML}(\theta) = \log |\mathbf{V}(\theta)| + \log |\mathbf{X}' \mathbf{V}(\theta)^{-1} \mathbf{X}| + (\mathbf{y} - \mathbf{X} \hat{\beta})' \mathbf{V}(\theta)^{-1} (\mathbf{y} - \mathbf{X} \hat{\beta}),
where:
\hat{\beta} = (\mathbf{X}' \mathbf{V}(\theta)^{-1} \mathbf{X})^{-1} \mathbf{X}' \mathbf{V}(\theta)^{-1} \mathbf{y}.
This form highlights how REML adjusts for the estimation of fixed effects via the second term \log |\mathbf{X}' \mathbf{V}^{-1} \mathbf{X}|.
Steps for REML Estimation
Transform the data using \mathbf{K}' \mathbf{y} to remove \beta from the likelihood.
Maximize the restricted likelihood to estimate \hat{\theta}_{REML}.
Estimate fixed effects using:
\hat{\beta}_{REML} = (\mathbf{X}' \mathbf{V}(\hat{\theta}_{REML})^{-1} \mathbf{X})^{-1} \mathbf{X}' \mathbf{V}(\hat{\theta}_{REML})^{-1} \mathbf{y}.
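A sketch of these steps using the equivalent REML criterion that avoids an explicit \mathbf{K}, again for a hypothetical random-intercept model with \theta = (\sigma^2, \tau^2); the only change from the MLE sketch above is the extra \log |\mathbf{X}'\mathbf{V}^{-1}\mathbf{X}| term:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
N, n = 30, 5
X = np.column_stack([np.ones(N * n), rng.normal(size=N * n)])
Z = np.kron(np.eye(N), np.ones((n, 1)))
y = X @ np.array([1.0, 0.5]) + Z @ rng.normal(0.0, np.sqrt(2.0), N) \
    + rng.normal(0.0, 1.0, N * n)

def neg2_reml(log_theta):
    sigma2, tau2 = np.exp(log_theta)
    V = tau2 * Z @ Z.T + sigma2 * np.eye(len(y))
    V_inv = np.linalg.inv(V)
    XtVX = X.T @ V_inv @ X
    beta = np.linalg.solve(XtVX, X.T @ V_inv @ y)
    resid = y - X @ beta
    # log|V| + log|X' V^{-1} X| + (y - X beta)' V^{-1} (y - X beta)
    return (np.linalg.slogdet(V)[1] + np.linalg.slogdet(XtVX)[1]
            + resid @ V_inv @ resid)

res = minimize(neg2_reml, x0=np.log([1.0, 1.0]), method="Nelder-Mead")
sigma2_reml, tau2_reml = np.exp(res.x)

V_hat = tau2_reml * Z @ Z.T + sigma2_reml * np.eye(len(y))
V_inv = np.linalg.inv(V_hat)
beta_reml = np.linalg.solve(X.T @ V_inv @ X, X.T @ V_inv @ y)   # GLS with REML variances
print(sigma2_reml, tau2_reml, beta_reml)
```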
Properties of REML
- Reduced Bias in Variance Components: REML accounts for the degrees of freedom used to estimate the fixed effects, yielding variance component estimates that are far less biased than MLE (and unbiased in balanced designs).
- Invariance to Fixed Effects: The restricted likelihood is constructed to be independent of the fixed effects \beta.
- Asymptotic Normality: REML estimates are consistent and asymptotically normal under standard regularity conditions.
- Efficiency: While REML estimates variance components efficiently, it does not maximize the joint likelihood of all parameters, so \beta estimates are slightly less efficient compared to MLE.
Criterion | MLE | REML |
---|---|---|
Approach | Maximizes full likelihood | Maximizes likelihood of contrasts (removes \beta) |
Estimates Fixed Effects? | Yes, jointly with \theta | Not directly (\beta is estimated afterward by GLS) |
Bias in Variance Estimates | Biased downward (ignores loss of degrees of freedom) | Much less biased (corrects for loss of degrees of freedom) |
Effect of Changing \mathbf{X} | Affects variance estimates | No effect on variance estimates |
Consistency | Yes | Yes |
Asymptotic Normality | Yes | Yes |
Efficiency | Efficient under normality | More efficient for variance components |
Model Comparison (AIC/BIC) | Suitable for comparing models with different fixed effects | Not valid for comparing models with different fixed effects |
Performance in Small Samples | Sensitive to small sample bias | More robust to small sample bias |
Handling Outliers | More sensitive | Less sensitive |
Equivalent to ANOVA? | No | Yes, in balanced designs |
8.2.4.3 Estimated Generalized Least Squares
MLE and REML rely on the Gaussian assumption, which may not always hold.
EGLS provides an alternative by relying only on the first two moments (mean and variance).
The LMM framework is:
\mathbf{Y}_i = \mathbf{X}_i \beta + \mathbf{Z}_i \mathbf{b}_i + \epsilon_i.
where:
- Random effects: \mathbf{b}_i \sim N(\mathbf{0, D}).
- Residual errors: \epsilon_i \sim N(\mathbf{0, \Sigma_i}).
- Independence assumption: \text{Cov}(\epsilon_i, \mathbf{b}_i) = 0.
Thus, the first two moments are:
E(\mathbf{Y}_i) = \mathbf{X}_i \beta, \quad \text{Var}(\mathbf{Y}_i) = \mathbf{V}_i.
The EGLS estimator is:
\hat{\beta}_{GLS} = \left\{ \sum_{i=1}^N \mathbf{X}'_i \mathbf{V}_i(\theta)^{-1} \mathbf{X}_i \right\}^{-1} \sum_{i=1}^N \mathbf{X}'_i \mathbf{V}_i(\theta)^{-1} \mathbf{Y}_i.
Writing in matrix form:
\hat{\beta}_{GLS} = \left\{ \mathbf{X'V(\theta)^{-1}X} \right\}^{-1} \mathbf{X'V(\theta)^{-1}Y}.
Since \mathbf{V}(\theta) is unknown, we estimate it as \hat{\mathbf{V}}, leading to the EGLS estimator:
\hat{\beta}_{EGLS} = \left\{ \mathbf{X'\hat{V}^{-1}X} \right\}^{-1} \mathbf{X'\hat{V}^{-1}Y}.
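A sketch of EGLS for a simulated random-intercept model, where \hat{\mathbf{V}} is formed from crude moment-based estimates of the variance components (the particular moment estimators below are purely illustrative, not part of any standard package):

```python
import numpy as np

rng = np.random.default_rng(7)
N, n = 30, 5
X = np.column_stack([np.ones(N * n), rng.normal(size=N * n)])
Z = np.kron(np.eye(N), np.ones((n, 1)))
y = X @ np.array([1.0, 0.5]) + Z @ rng.normal(0.0, np.sqrt(2.0), N) \
    + rng.normal(0.0, 1.0, N * n)

# Crude moment-based variance estimates (illustration only):
# within-subject spread of OLS residuals -> sigma2; between-subject spread of means -> tau2
resid_ols = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
groups = resid_ols.reshape(N, n)
sigma2_hat = groups.var(axis=1, ddof=1).mean()
tau2_hat = max(groups.mean(axis=1).var(ddof=1) - sigma2_hat / n, 0.0)

# Plug the estimated V into the GLS formula -> EGLS estimator
V_hat = tau2_hat * Z @ Z.T + sigma2_hat * np.eye(len(y))
V_inv = np.linalg.inv(V_hat)
beta_egls = np.linalg.solve(X.T @ V_inv @ X, X.T @ V_inv @ y)
print(sigma2_hat, tau2_hat, beta_egls)
```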
Key Insights about EGLS
- Computational Simplicity:
- EGLS does not require iterative maximization of a likelihood function, making it computationally attractive.
- Same Form as MLE/REML:
- The fixed effects estimators for MLE, REML, and EGLS have the same form, differing only in how \mathbf{V} is estimated.
- Robust to Non-Gaussian Data:
- Since it only depends on first and second moments, it can handle cases where MLE and REML struggle with non-normality.
When to Use EGLS?
- When the normality assumption for MLE/REML is questionable.
- When \mathbf{V} can be estimated efficiently without requiring complex optimization.
- In non-iterative approaches, where computational simplicity is a priority.
8.2.4.4 Bayesian Hierarchical Models
Bayesian methods offer a fully probabilistic framework to estimate \mathbf{V} by incorporating prior distributions.
The joint distribution can be decomposed hierarchically:
f(A, B, C) = f(A | B, C) f(B | C) f(C).
Applying this to LMMs:
\begin{aligned} f(\mathbf{Y, \beta, b, \theta}) &= f(\mathbf{Y | \beta, b, \theta}) f(\mathbf{b | \theta, \beta}) f(\mathbf{\beta | \theta}) f(\mathbf{\theta}) \\ &= f(\mathbf{Y | \beta, b, \theta}) f(\mathbf{b | \theta}) f(\mathbf{\beta}) f(\mathbf{\theta}). \end{aligned}
where:
- The first equality follows from probability decomposition.
- The second equality assumes prior independence, meaning:
- Given \theta, no additional information about \mathbf{b} is obtained from knowing \beta, so f(\mathbf{b} \mid \theta, \beta) = f(\mathbf{b} \mid \theta).
- The fixed effects \beta are a priori independent of \theta, so f(\beta \mid \theta) = f(\beta).
Using Bayes’ theorem, the posterior distribution is:
f(\mathbf{\beta, b, \theta | Y}) \propto f(\mathbf{Y | \beta, b, \theta}) f(\mathbf{b | \theta}) f(\mathbf{\beta}) f(\mathbf{\theta}).
where:
\begin{aligned} \mathbf{Y | \beta, b, \theta} &\sim N(\mathbf{X\beta + Zb}, \mathbf{\Sigma(\theta)}), \\ \mathbf{b | \theta} &\sim N(\mathbf{0, B(\theta)}). \end{aligned}
To complete the Bayesian model, we specify prior distributions:
- f(\beta): Prior on fixed effects.
- f(\theta): Prior on variance components.
Since analytical solutions are generally unavailable, we use Markov Chain Monte Carlo (MCMC) to sample from the posterior:
- Gibbs sampling (if conjugate priors are used).
- Hamiltonian Monte Carlo (HMC) (for complex models).
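For intuition, here is a minimal hand-rolled Gibbs sampler sketch for a random-intercept model with conjugate choices (flat prior on \beta, inverse-gamma priors on \sigma^2 and \tau^2, both hypothetical); in practice one would use established tools such as Stan or PyMC together with convergence diagnostics:

```python
import numpy as np

rng = np.random.default_rng(8)
N, n = 30, 5
T = N * n
X = np.column_stack([np.ones(T), rng.normal(size=T)])
Z = np.kron(np.eye(N), np.ones((n, 1)))
y = X @ np.array([1.0, 0.5]) + Z @ rng.normal(0.0, np.sqrt(2.0), N) \
    + rng.normal(0.0, 1.0, T)

# Weakly informative inverse-gamma(a0, b0) priors on sigma2 and tau2 (an assumption for this sketch)
a0, b0 = 0.01, 0.01
beta = np.zeros(2); b = np.zeros(N); sigma2 = tau2 = 1.0
draws = []

XtX_inv = np.linalg.inv(X.T @ X)
for it in range(4000):
    # beta | rest ~ N((X'X)^{-1} X'(y - Zb), sigma2 (X'X)^{-1})  (flat prior on beta)
    mean_beta = XtX_inv @ X.T @ (y - Z @ b)
    beta = rng.multivariate_normal(mean_beta, sigma2 * XtX_inv)
    # b_i | rest ~ N(v * (sum of subject i residuals) / sigma2, v),  v = (n/sigma2 + 1/tau2)^{-1}
    resid = (y - X @ beta).reshape(N, n)
    v = 1.0 / (n / sigma2 + 1.0 / tau2)
    b = rng.normal(v * resid.sum(axis=1) / sigma2, np.sqrt(v))
    # sigma2 | rest and tau2 | rest are inverse-gamma (sampled as 1 / Gamma draws)
    e = y - X @ beta - Z @ b
    sigma2 = 1.0 / rng.gamma(a0 + T / 2, 1.0 / (b0 + 0.5 * e @ e))
    tau2 = 1.0 / rng.gamma(a0 + N / 2, 1.0 / (b0 + 0.5 * b @ b))
    if it >= 2000:                              # discard burn-in draws
        draws.append([*beta, sigma2, tau2])

print(np.mean(draws, axis=0))   # posterior means of beta, sigma2, tau2
```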
Advantages
- Accounts for Parameter Uncertainty
- Unlike MLE/REML, Bayesian methods propagate uncertainty in variance component estimation.
- Flexible Model Specification
- Can incorporate prior knowledge via informative priors.
- Extends naturally beyond Gaussian assumptions (e.g., Student-t distributions for heavy-tailed errors).
- Robustness in Small Samples
- Bayesian methods can stabilize variance estimation in small datasets where MLE/REML are unreliable.
Challenges
- Computational Complexity
- Requires MCMC algorithms, which can be computationally expensive.
- Convergence Issues
- MCMC chains must be checked for convergence (e.g., using R-hat diagnostic).
- Choice of Priors
- Poorly chosen priors can bias estimates or slow down convergence.
Comparison of Estimation Methods for \mathbf{V}
Method | Assumptions | Computational Cost | Handles Non-Normality? | Best Use Case |
---|---|---|---|---|
MLE | Gaussian errors | High (iterative) | ❌ No | Model selection (AIC/BIC) |
REML | Gaussian errors | High (iterative) | ❌ No | Variance estimation |
EGLS | First two moments | Low (non-iterative) | ✅ Yes | Large-scale models with correlated errors |
Bayesian (BHM) | Probabilistic | Very High (MCMC) | ✅ Yes | Small samples, prior information available |