9.5 Estimation in Nonlinear and Generalized Linear Mixed Models
In Linear Mixed Models, the marginal likelihood of the observed data \mathbf{y} is derived by integrating out the random effects from the hierarchical formulation:
f(\mathbf{y}) = \int f(\mathbf{y} \mid \boldsymbol{\alpha}) \, f(\boldsymbol{\alpha}) \, d\boldsymbol{\alpha}
For LMMs, both component distributions are typically assumed to be Gaussian with linear relationships:
- the conditional distribution f(\mathbf{y} \mid \boldsymbol{\alpha}), and
- the random effects distribution f(\boldsymbol{\alpha}).
These assumptions imply that the marginal distribution of \mathbf{y} is also Gaussian, allowing the integral to be solved analytically using properties of the multivariate normal distribution.
In contrast:
- For GLMMs, the conditional distribution f(\mathbf{y} \mid \boldsymbol{\alpha}) belongs to the exponential family but is not Gaussian in general.
- For NLMMs, the relationship between the mean response and the random (and fixed) effects is nonlinear, which complicates the integral.
In both cases, the marginal likelihood integral:
L(\boldsymbol{\beta}; \mathbf{y}) = \int f(\mathbf{y} \mid \boldsymbol{\alpha}) \, f(\boldsymbol{\alpha}) \, d\boldsymbol{\alpha}
cannot be solved analytically. Consequently, estimation requires:
- Numerical Integration
- Linearization of the Model
9.5.1 Estimation by Numerical Integration
The marginal likelihood for parameter estimation is given by:
L(\boldsymbol{\beta}; \mathbf{y}) = \int f(\mathbf{y} \mid \boldsymbol{\alpha}) \, f(\boldsymbol{\alpha}) \, d\boldsymbol{\alpha}
To estimate the fixed effects \boldsymbol{\beta}, we often maximize the log-likelihood:
\ell(\boldsymbol{\beta}) = \log L(\boldsymbol{\beta}; \mathbf{y})
Optimization requires the score function (gradient):
\frac{\partial \ell(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}
Since the integral in L(\boldsymbol{\beta}; \mathbf{y}) is generally intractable, we rely on numerical techniques to approximate it.
9.5.1.1 Methods for Numerical Integration
Gaussian Quadrature
- Suitable for low-dimensional random effects (\dim(\boldsymbol{\alpha}) is small).
- Approximates the integral using weighted sums of function evaluations at specific points (nodes).
- Gauss-Hermite quadrature is commonly used when random effects are normally distributed.
Limitation: Computational cost grows exponentially with the dimension of \boldsymbol{\alpha} (curse of dimensionality).
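As a concrete illustration, the sketch below approximates the one-cluster marginal likelihood of a random-intercept logistic model with Gauss-Hermite quadrature. The data, the coefficient values, and the use of the statmod package for nodes and weights are assumptions made purely for illustration.

```r
## Sketch: Gauss-Hermite approximation of the marginal likelihood of one
## cluster in a random-intercept logistic model (hypothetical data and
## parameter values; assumes the 'statmod' package is available).
library(statmod)

y     <- c(1, 0, 1, 1)           # hypothetical binary responses for one cluster
x     <- c(0.2, -1.0, 0.5, 1.3)  # hypothetical covariate values
beta  <- c(-0.5, 1.0)            # fixed effects (intercept, slope)
sigma <- 0.8                     # random-intercept standard deviation

## Conditional likelihood f(y | alpha) for a given random intercept
cond_lik <- function(alpha) {
  p <- plogis(beta[1] + beta[2] * x + alpha)
  prod(dbinom(y, 1, p))
}

## Nodes/weights for the weight function e^{-t^2}; the change of variables
## alpha = sqrt(2) * sigma * t turns this into an integral against N(0, sigma^2)
gq <- gauss.quad(20, kind = "hermite")
sum(gq$weights * sapply(sqrt(2) * sigma * gq$nodes, cond_lik)) / sqrt(pi)
```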
Laplace Approximation
- Approximates the integral by expanding the log-likelihood around the mode of the integrand (i.e., the most likely value of \boldsymbol{\alpha}).
- Provides accurate results for moderate-sized random effects and large sample sizes.
- First-order Laplace approximation is commonly used; higher-order versions improve accuracy but increase complexity.
Key Idea: Approximate the integral as:
\int e^{h(\boldsymbol{\alpha})} d\boldsymbol{\alpha} \approx e^{h(\hat{\boldsymbol{\alpha}})} \sqrt{\frac{(2\pi)^q}{|\mathbf{H}|}}
where:
- \hat{\boldsymbol{\alpha}} is the mode of h(\boldsymbol{\alpha}),
- \mathbf{H} = -\left.\dfrac{\partial^2 h(\boldsymbol{\alpha})}{\partial \boldsymbol{\alpha} \, \partial \boldsymbol{\alpha}'}\right|_{\hat{\boldsymbol{\alpha}}} is the negative Hessian of h evaluated at \hat{\boldsymbol{\alpha}} (positive definite at the mode),
- q is the dimension of \boldsymbol{\alpha}.
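The minimal sketch below applies this formula to a one-dimensional integrand of the same form (q = 1), locating the mode numerically and approximating the curvature with a finite difference; the data and parameter values are again hypothetical.

```r
## Sketch: Laplace approximation of a one-cluster marginal likelihood in a
## random-intercept logistic model (hypothetical data; q = 1).
y     <- c(1, 0, 1, 1)
x     <- c(0.2, -1.0, 0.5, 1.3)
beta  <- c(-0.5, 1.0)
sigma <- 0.8

## Log of the integrand: h(alpha) = log f(y | alpha) + log f(alpha)
h <- function(alpha) {
  p <- plogis(beta[1] + beta[2] * x + alpha)
  sum(dbinom(y, 1, p, log = TRUE)) + dnorm(alpha, 0, sigma, log = TRUE)
}

opt <- optimize(h, c(-10, 10), maximum = TRUE)   # mode of h
alpha_hat <- opt$maximum

eps <- 1e-4
H <- -(h(alpha_hat + eps) - 2 * h(alpha_hat) + h(alpha_hat - eps)) / eps^2  # negative Hessian

exp(h(alpha_hat)) * sqrt(2 * pi / H)   # approximate marginal likelihood
```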
Monte Carlo Integration
- Uses random sampling to approximate the integral.
- Importance Sampling improves efficiency by sampling from a distribution that better matches the integrand.
- Markov Chain Monte Carlo methods, such as Gibbs sampling or Metropolis-Hastings, are used when the posterior distribution is complex.
Advantage: Scales better with high-dimensional random effects compared to quadrature methods.
Limitation: Computationally intensive, and variance of estimates can be high without careful tuning.
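A plain Monte Carlo version of the same one-cluster calculation is sketched below: the conditional likelihood is averaged over draws of the random intercept from its assumed normal prior (hypothetical values throughout).

```r
## Sketch: plain Monte Carlo approximation of the one-cluster marginal
## likelihood, averaging f(y | alpha) over draws from the random-effect prior.
set.seed(1)
y     <- c(1, 0, 1, 1)
x     <- c(0.2, -1.0, 0.5, 1.3)
beta  <- c(-0.5, 1.0)
sigma <- 0.8

cond_lik <- function(alpha) {
  p <- plogis(beta[1] + beta[2] * x + alpha)
  prod(dbinom(y, 1, p))
}

alpha_draws <- rnorm(10000, mean = 0, sd = sigma)
mean(sapply(alpha_draws, cond_lik))   # Monte Carlo estimate of the integral
```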
9.5.1.2 Choosing an Integration Method
| Method | Dimensionality | Accuracy | Computational Cost |
|---|---|---|---|
| Gaussian Quadrature | Low-dimensional (q \leq 3) | High (with sufficient nodes) | High (grows exponentially with q) |
| Laplace Approximation | Moderate-dimensional | Moderate to high | Moderate |
| Monte Carlo Methods | High-dimensional | Variable (depends on sample size) | High (but scalable) |
- For small random effect dimensions, quadrature methods are effective.
- For moderate dimensions, Laplace approximation offers a good balance.
- For high dimensions or complex models, Monte Carlo techniques are often the method of choice.
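In practice, the choice of integration method is often a single software option. For instance, in lme4::glmer the nAGQ argument switches between the Laplace approximation and adaptive Gauss-Hermite quadrature; the data set dat and the variable names below are hypothetical.

```r
## Sketch: choosing the integration method in lme4::glmer via nAGQ
## (data set 'dat' and variable names are hypothetical).
library(lme4)

## nAGQ = 1 (the default): Laplace approximation
fit_laplace <- glmer(y ~ x + (1 | cluster), family = binomial, data = dat)

## nAGQ > 1: adaptive Gauss-Hermite quadrature (only available for models
## with a single scalar random effect)
fit_aghq <- glmer(y ~ x + (1 | cluster), family = binomial, data = dat,
                  nAGQ = 25)
```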
9.5.2 Estimation by Linearization
When estimating parameters in NLMMs and GLMMs, one common and effective approach is linearization. This technique approximates the nonlinear or non-Gaussian components with linear counterparts, enabling the use of standard LMM estimation methods. Linearization not only simplifies the estimation process but also allows for leveraging well-established statistical tools and methods developed for linear models.
9.5.2.1 Concept of Linearization
The core idea is to create a linearized version of the response variable, known as the working response or pseudo-response, denoted as \tilde{y}_i. This pseudo-response is designed to approximate the original nonlinear relationship in a linear form, facilitating easier estimation of model parameters. The conditional mean of this pseudo-response is expressed as:
E(\tilde{y}_i \mid \boldsymbol{\alpha}) = \mathbf{x}_i' \boldsymbol{\beta} + \mathbf{z}_i' \boldsymbol{\alpha}
Here:
- \mathbf{x}_i is the fixed-effects design (covariate) vector for observation i,
- \boldsymbol{\beta} represents the fixed-effect parameters,
- \mathbf{z}_i is the random-effects design vector for observation i,
- \boldsymbol{\alpha} denotes the random effects.
In addition to the conditional mean, it is essential to estimate the conditional variance of the pseudo-response to fully characterize the linearized model:
\operatorname{Var}(\tilde{y}_i \mid \boldsymbol{\alpha})
This variance estimation ensures that the model accounts for the inherent variability in the data, maintaining the integrity of statistical inferences.
9.5.2.2 Application of Linearization
Once linearized, the model structure closely resembles that of a linear mixed model, allowing us to apply standard estimation techniques from LMMs. These techniques include methods such as MLE and REML, which are computationally efficient and statistically robust.
The primary difference between various linearization-based methods lies in how the linearization is performed. This often involves expanding the nonlinear function f(\mathbf{x}, \boldsymbol{\theta}, \boldsymbol{\alpha}) or the inverse link function g^{-1}(\cdot). The goal is to approximate these complex functions with simpler linear expressions while retaining as much of the original model’s characteristics as possible.
9.5.2.2.1 Taylor Series Expansion
A widely used method for linearization is the Taylor series expansion. This approach approximates the nonlinear mean function around initial estimates of the random effects. The first-order Taylor series expansion of the nonlinear function is given by:
f(\mathbf{x}_{ij}, \boldsymbol{\theta}, \boldsymbol{\alpha}_i) \approx f(\mathbf{x}_{ij}, \boldsymbol{\theta}, \hat{\boldsymbol{\alpha}}_i) + \frac{\partial f}{\partial \boldsymbol{\alpha}_i} \bigg|_{\hat{\boldsymbol{\alpha}}_i} (\boldsymbol{\alpha}_i - \hat{\boldsymbol{\alpha}}_i)
In this expression:
- f(\mathbf{x}_{ij}, \boldsymbol{\theta}, \hat{\boldsymbol{\alpha}}_i) is the function evaluated at the initial estimates \hat{\boldsymbol{\alpha}}_i of the random effects,
- \frac{\partial f}{\partial \boldsymbol{\alpha}_i} \big|_{\hat{\boldsymbol{\alpha}}_i} is the gradient (derivative) of the function with respect to the random effects, evaluated at \hat{\boldsymbol{\alpha}}_i,
- (\boldsymbol{\alpha}_i - \hat{\boldsymbol{\alpha}}_i) captures the deviation from the initial estimates.
The initial estimates \hat{\boldsymbol{\alpha}}_i are often set to zero for simplicity, especially in the early stages of model fitting. This approximation reduces the model to a linear form, making it amenable to standard LMM estimation techniques.
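The sketch below illustrates this expansion numerically for a hypothetical nonlinear mean function, comparing the exact curve with its first-order linearization around \hat{\boldsymbol{\alpha}}_i = 0; the functional form and parameter values are illustrative assumptions only.

```r
## Sketch: first-order Taylor linearization of a hypothetical nonlinear mean
## function in the random effect alpha, expanded around alpha_hat = 0.
f <- function(x, theta, alpha) (theta[1] + alpha) / (1 + exp(-(x - theta[2]) / theta[3]))

theta     <- c(10, 5, 2)   # hypothetical fixed-effect values
x         <- 1:10          # hypothetical covariate values
alpha_hat <- 0             # expansion point

## Numerical derivative of f with respect to alpha at alpha_hat
eps       <- 1e-6
df_dalpha <- (f(x, theta, alpha_hat + eps) - f(x, theta, alpha_hat - eps)) / (2 * eps)

## Linearized mean: f at the expansion point plus gradient times the deviation
f_lin <- function(alpha) f(x, theta, alpha_hat) + df_dalpha * (alpha - alpha_hat)

## Compare the exact curve and its linearization for a small deviation
cbind(exact = f(x, theta, 0.3), linearized = f_lin(0.3))
```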
9.5.2.2.2 Advantages and Considerations
Linearization offers several advantages:
- Simplified Computation: By transforming complex nonlinear relationships into linear forms, linearization reduces computational complexity.
- Flexibility: Despite the simplification, linearized models retain the ability to capture key features of the original data structure.
- Statistical Robustness: The use of established LMM estimation techniques ensures robust parameter estimation.
However, linearization also comes with considerations:
- Approximation Error: The accuracy of the linearized model depends on how well the linear approximation captures the original nonlinear relationship.
- Choice of Expansion Point: The selection of initial estimates \hat{\boldsymbol{\alpha}}_i can influence the quality of the approximation.
- Higher-Order Terms: Where the first-order approximation is insufficient, higher-order Taylor series terms may be needed, increasing model complexity.
9.5.2.3 Penalized Quasi-Likelihood
Penalized Quasi-Likelihood (PQL) is one of the most popular linearization-based estimation methods for GLMMs.
The linearization is achieved through a first-order Taylor expansion of the inverse link function around current estimates of the parameters. The working response at the k-th iteration is given by:
\tilde{y}_i^{(k)} = \hat{\eta}_i^{(k-1)} + \left(y_i - \hat{\mu}_i^{(k-1)}\right) \cdot \left.\frac{d \eta}{d \mu}\right|_{\mu = \hat{\mu}_i^{(k-1)}}
Where:
- \eta_i = g(\mu_i) is the linear predictor,
- \hat{\eta}_i^{(k-1)} and \hat{\mu}_i^{(k-1)} are the estimates from the previous iteration (k-1),
- g(\cdot) is the link function, and \mu_i = g^{-1}(\eta_i).
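For a logistic link, d\eta / d\mu = 1 / \{\mu(1 - \mu)\}, so the working response can be computed directly, as in the minimal sketch below; the responses and current linear-predictor values are hypothetical.

```r
## Sketch: working response for a logistic link, where d(eta)/d(mu) = 1 / (mu * (1 - mu)).
y       <- c(1, 0, 0, 1, 1)
eta_hat <- c(0.4, -0.8, -0.2, 1.1, 0.6)   # linear predictor from iteration k-1
mu_hat  <- plogis(eta_hat)                # fitted means from iteration k-1
y_work  <- eta_hat + (y - mu_hat) / (mu_hat * (1 - mu_hat))
y_work
```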
PQL Estimation Algorithm
1. Initialization: Start with initial estimates of \boldsymbol{\beta} and \boldsymbol{\alpha} (commonly set to zero).
2. Compute the Working Response: Use the formula above to compute \tilde{y}_i^{(k)} from the current parameter estimates.
3. Fit a Linear Mixed Model: Apply standard LMM estimation techniques to the pseudo-response \tilde{y}_i^{(k)} to update the estimates of \boldsymbol{\beta} and \boldsymbol{\alpha}.
4. Update Variance Components: Estimate \operatorname{Var}(\tilde{y}_i \mid \boldsymbol{\alpha}) from the updated parameter estimates.
5. Iteration: Repeat steps 2–4 until the estimates converge (a minimal software sketch follows below).
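In R, this iteration is implemented by MASS::glmmPQL, which alternates between constructing the working response and fitting a linear mixed model. A minimal call might look as follows; the data set dat and the variable names are hypothetical.

```r
## Sketch: PQL estimation of a logistic GLMM with MASS::glmmPQL
## (data set 'dat' and variable names are hypothetical).
library(MASS)

fit_pql <- glmmPQL(y ~ x, random = ~ 1 | cluster,
                   family = binomial, data = dat)
summary(fit_pql)
```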
Comments on PQL:
- Advantages:
- Easy to implement using existing LMM software.
- Fast convergence for many practical datasets.
- Limitations:
- Inference is only asymptotically correct due to the linearization approximation.
- Biased estimates are common, especially:
- For binomial responses with small group sizes,
- In Bernoulli models (worst-case scenario),
- In Poisson models with small counts. (Faraway 2016)
- Hypothesis testing and confidence intervals can be unreliable.
9.5.2.4 Generalized Estimating Equations
Generalized Estimating Equations (GEE) offer an alternative approach to parameter estimation in models with correlated data, particularly for marginal models where the focus is on population-averaged effects rather than subject-specific effects.
GEE estimates are obtained by solving estimating equations rather than maximizing a likelihood function.
Consider a marginal generalized linear model:
\operatorname{logit}(E(\mathbf{y})) = \mathbf{X} \boldsymbol{\beta}
Assuming a working covariance matrix \mathbf{V} for the elements of \mathbf{y}, the estimating equation for \boldsymbol{\beta} is:
\mathbf{X}' \mathbf{V}^{-1} (\mathbf{y} - E(\mathbf{y})) = 0
If \mathbf{V} correctly specifies the covariance structure, the estimator is unbiased. In practice, we often assume a simple structure (e.g., independence) and obtain robust standard errors even when the covariance is misspecified.
9.5.2.4.1 GEE for Repeated Measures
Let y_{ij} denote the j-th measurement on the i-th subject, with:
\mathbf{y}_i = \begin{pmatrix} y_{i1} \\ \vdots \\ y_{in_i} \end{pmatrix}, \quad \boldsymbol{\mu}_i = \begin{pmatrix} \mu_{i1} \\ \vdots \\ \mu_{in_i} \end{pmatrix}, \quad \mathbf{x}_{ij} = \begin{pmatrix} X_{ij1} \\ \vdots \\ X_{ijp} \end{pmatrix}
Define the working covariance matrix of \mathbf{y}_i as:
\mathbf{V}_i = \operatorname{Cov}(\mathbf{y}_i)
Following (Liang and Zeger 1986), the GEE for estimating \boldsymbol{\beta} is:
S(\boldsymbol{\beta}) = \sum_{i=1}^K \frac{\partial \boldsymbol{\mu}_i'}{\partial \boldsymbol{\beta}} \, \mathbf{V}_i^{-1} (\mathbf{y}_i - \boldsymbol{\mu}_i) = 0
Where:
- K is the number of subjects (or clusters),
- \boldsymbol{\mu}_i = E(\mathbf{y}_i),
- \mathbf{V}_i is the working covariance matrix.
9.5.2.4.2 Working Correlation Structures
The covariance matrix \mathbf{V}_i is modeled as:
\mathbf{V}_i = a(\phi) \, \mathbf{B}_i^{1/2} \, \mathbf{R}(\boldsymbol{c}) \, \mathbf{B}_i^{1/2}
- a(\phi) is a dispersion parameter,
- \mathbf{B}_i is a diagonal matrix with variance functions V(\mu_{ij}) on the diagonal,
- \mathbf{R}(\boldsymbol{c}) is the working correlation matrix, parameterized by \boldsymbol{c}. If \mathbf{R}(\boldsymbol{c}) is the true correlation matrix of \mathbf{y}_i, then \mathbf{V}_i is the true covariance matrix.
Common Working Correlation Structures:
- Independence: \mathbf{R} = \mathbf{I} (simplest, but ignores correlation).
- Exchangeable: Constant correlation between all pairs within a cluster.
- Autoregressive (AR(1)): Correlation decreases with time lag.
- Unstructured: Each pair has its own correlation parameter.
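A minimal sketch of how these working correlation structures are requested in geepack::geeglm is given below; the data set dat and the variable names are hypothetical.

```r
## Sketch: GEE fits under different working correlation structures with
## geepack::geeglm (data set 'dat' and variable names are hypothetical).
library(geepack)

fit_exch <- geeglm(y ~ x + time, id = subject, data = dat,
                   family = binomial, corstr = "exchangeable")
fit_ar1  <- geeglm(y ~ x + time, id = subject, data = dat,
                   family = binomial, corstr = "ar1")

summary(fit_exch)   # reports robust ("sandwich") standard errors
```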
9.5.2.4.3 Iterative Algorithm for GEE Estimation
1. Initialization: Compute an initial estimate of \boldsymbol{\beta} from a GLM fitted under the independence assumption (\mathbf{R} = \mathbf{I}).
2. Estimate the Working Correlation Matrix: Based on residuals from the current fit, estimate \mathbf{R}(\boldsymbol{c}).
3. Update the Covariance Matrix: Calculate \hat{\mathbf{V}}_i using the updated working correlation matrix.
4. Update \boldsymbol{\beta}:
\boldsymbol{\beta}^{(r+1)} = \boldsymbol{\beta}^{(r)} + \left(\sum_{i=1}^K \frac{\partial \boldsymbol{\mu}_i'}{\partial \boldsymbol{\beta}} \, \hat{\mathbf{V}}_i^{-1} \, \frac{\partial \boldsymbol{\mu}_i}{\partial \boldsymbol{\beta}} \right)^{-1} \left( \sum_{i=1}^K \frac{\partial \boldsymbol{\mu}_i'}{\partial \boldsymbol{\beta}} \, \hat{\mathbf{V}}_i^{-1} (\mathbf{y}_i - \boldsymbol{\mu}_i) \right)
5. Iteration: Repeat steps 2–4 until convergence (i.e., until changes in \boldsymbol{\beta} are negligible); a sketch of a single update step follows below.
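To make the update step concrete, the sketch below performs one such update for a logit marginal model under an independence working correlation, in which case \mathbf{V}_i is diagonal and the update reduces to a weighted least-squares step; the simulated toy data are purely illustrative.

```r
## Sketch: one GEE update step for a logit marginal model under an
## independence working correlation (simulated toy data; K clusters).
set.seed(2)
K   <- 30; n_i <- 4
toy <- data.frame(id = rep(1:K, each = n_i), x = rnorm(K * n_i))
toy$y <- rbinom(nrow(toy), 1, plogis(-0.3 + 0.7 * toy$x))

X    <- cbind(1, toy$x)
beta <- c(0, 0)                 # current estimate beta^(r)

mu <- plogis(drop(X %*% beta))  # marginal means mu_ij
W  <- mu * (1 - mu)             # Bernoulli variance function

## With R = I, V_i is diagonal, so the two sums in the update collapse to
## a score vector and a weighted information matrix
score <- t(X) %*% (toy$y - mu)
info  <- t(X) %*% (X * W)

beta_new <- drop(beta + solve(info, score))
beta_new                        # beta^(r+1)
```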
Comments on GEE:
- Advantages:
- Provides consistent estimates of \boldsymbol{\beta} even if the working correlation matrix is misspecified.
- Robust standard errors (also known as “sandwich” estimators) account for potential misspecification.
- Limitations:
- Not a likelihood-based method, so likelihood-ratio tests are not appropriate.
- Efficiency loss if the working correlation matrix is poorly specified.
- Estimation of random effects is not possible—GEE focuses on marginal (population-averaged) effects.
9.5.3 Estimation by Bayesian Hierarchical Models
Bayesian methods provide a flexible framework for estimating parameters in NLMMs and GLMMs. Unlike frequentist approaches that rely on MLE or linearization techniques, Bayesian estimation fully incorporates prior information and naturally accounts for uncertainty in both parameter estimation and predictions.
In the Bayesian context, we are interested in the posterior distribution of the model parameters, given the observed data \mathbf{y}:
f(\boldsymbol{\alpha}, \boldsymbol{\beta} \mid \mathbf{y}) \propto f(\mathbf{y} \mid \boldsymbol{\alpha}, \boldsymbol{\beta}) \, f(\boldsymbol{\alpha}) \, f(\boldsymbol{\beta})
Where:
- f(\mathbf{y} \mid \boldsymbol{\alpha}, \boldsymbol{\beta}) is the likelihood of the data,
- f(\boldsymbol{\alpha}) is the prior distribution for the random effects,
- f(\boldsymbol{\beta}) is the prior distribution for the fixed effects,
- f(\boldsymbol{\alpha}, \boldsymbol{\beta} \mid \mathbf{y}) is the posterior distribution, which combines prior beliefs with observed data.
Advantages of Bayesian Estimation
- No Need for Simplifying Approximations: Bayesian methods do not require linearization or asymptotic approximations.
- Full Uncertainty Quantification: Provides credible intervals for parameters and predictive distributions for new data.
- Flexible Modeling: Easily accommodates complex hierarchical structures, non-standard distributions, and prior information.
Computational Challenges
Despite its advantages, Bayesian estimation can be computationally intensive and complex, especially for high-dimensional models. Key implementation issues include:
- Non-Valid Joint Distributions: In some hierarchical models, specifying valid joint distributions for the data, random effects, and parameters can be challenging.
- Constraints from Mean-Variance Relationships: The inherent relationship between the mean and variance in GLMMs, combined with the random effects, imposes constraints on the marginal covariance structure.
- Computational Intensity: Fitting Bayesian models often requires advanced numerical techniques such as Markov Chain Monte Carlo, which can be slow to converge, especially for large datasets or complex models.
9.5.3.1 Bayesian Estimation Methods
Bayesian estimation can proceed through two general approaches:
- Approximating the Objective Function (Marginal Likelihood)
The marginal likelihood is typically intractable because it requires integrating over random effects:
f(\mathbf{y} \mid \boldsymbol{\beta}) = \int f(\mathbf{y} \mid \boldsymbol{\alpha}, \boldsymbol{\beta}) \, f(\boldsymbol{\alpha}) \, d\boldsymbol{\alpha}
Since this integral cannot be solved analytically, we approximate it using the following methods:
Laplace Approximation
- Approximates the integral by expanding the log-likelihood around the mode of the integrand.
- Provides an efficient, asymptotically accurate approximation when the posterior is approximately Gaussian near its mode.
Quadrature Methods
- Gaussian quadrature (e.g., Gauss-Hermite quadrature) is effective for low-dimensional random effects.
- Approximates the integral by summing weighted evaluations of the function at specific points.
Monte Carlo Integration
- Uses random sampling to approximate the integral.
- Importance sampling improves efficiency by drawing samples from a distribution that closely resembles the target distribution.
- Approximating the Model (Linearization)
Alternatively, we can approximate the model itself using Taylor series linearization around current estimates of the parameters. This approach simplifies the model, making Bayesian estimation computationally more feasible, though at the cost of some approximation error.
9.5.3.2 Markov Chain Monte Carlo Methods
The most common approach for fully Bayesian estimation is MCMC, which generates samples from the posterior distribution through iterative simulation. Popular MCMC algorithms include:
- Gibbs Sampling: Efficient when full conditional distributions are available in closed form.
- Metropolis-Hastings Algorithm: More general and flexible; used when full conditionals are not easily sampled.
- Hamiltonian Monte Carlo (HMC): Implemented in packages like Stan; provides faster convergence for complex models by leveraging gradient information.
The posterior distribution is then approximated using the generated samples:
f(\boldsymbol{\alpha}, \boldsymbol{\beta} \mid \mathbf{y}) \approx \frac{1}{N} \sum_{i=1}^N \delta(\boldsymbol{\alpha} - \boldsymbol{\alpha}^{(i)}, \boldsymbol{\beta} - \boldsymbol{\beta}^{(i)})
Where N is the number of MCMC samples and \delta(\cdot) is the Dirac delta function.
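A minimal fully Bayesian fit of a logistic GLMM via brms::brm, which samples the posterior with HMC through Stan, is sketched below; the data set dat, the variable names, and the prior choice are hypothetical.

```r
## Sketch: fully Bayesian logistic GLMM via brms::brm
## (data set 'dat', variable names, and the prior are hypothetical).
library(brms)

fit_bayes <- brm(
  y ~ x + (1 | cluster),
  family = bernoulli(),
  data   = dat,
  prior  = prior(normal(0, 5), class = "b"),
  chains = 4, iter = 2000
)

summary(fit_bayes)   # posterior summaries and credible intervals
```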
9.5.4 Practical Implementation in R
Several R packages facilitate the estimation of GLMMs and NLMMs, using both frequentist and Bayesian approaches:
- GLMM Estimation:
  - `MASS::glmmPQL` — Penalized Quasi-Likelihood for GLMMs.
  - `lme4::glmer` — Frequentist estimation for GLMMs using the Laplace approximation.
  - `glmmTMB` — Handles complex random effects structures efficiently.
- NLMM Estimation:
  - `nlme::nlme` — Nonlinear mixed-effects modeling.
  - `lme4::nlmer` — Extends `lme4` to nonlinear mixed models.
  - `brms::brm` — Bayesian estimation via Stan, supporting NLMMs.
- Bayesian Estimation:
  - `MCMCglmm` — Implements MCMC algorithms for GLMMs.
  - `brms::brm` — High-level interface for Bayesian regression models, leveraging Stan for efficient MCMC sampling.
Example: Non-Gaussian Repeated Measurements
Consider the case of repeated measurements:
- If the data are Gaussian: Use Linear Mixed Models.
- If the data are non-Gaussian: Use Nonlinear and Generalized Linear Mixed Models.
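A minimal sketch of this distinction in lme4 is shown below: a Gaussian repeated-measures outcome is fitted with lmer, while a binary outcome from the same design calls for glmer; the data set dat and the variable names are hypothetical.

```r
## Sketch: choosing between an LMM and a GLMM for repeated measurements
## (data set 'dat' and variable names are hypothetical).
library(lme4)

## Gaussian repeated measurements: linear mixed model
fit_lmm  <- lmer(y_cont ~ time + (1 | subject), data = dat)

## Binary (non-Gaussian) repeated measurements: generalized linear mixed model
fit_glmm <- glmer(y_bin ~ time + (1 | subject), family = binomial, data = dat)
```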