8.2 Estimation

\[ \mathbf{Y}_i = \mathbf{X}_i \beta + \mathbf{Z}_i \mathbf{b}_i + \epsilon_i \]

where the quantities \(\beta, \mathbf{b}_i, \mathbf{D}, \mathbf{\Sigma}_i\) must be obtained from the data:

  • \(\beta, \mathbf{D}, \mathbf{\Sigma}_i\) are unknown but fixed parameters and must be estimated from the data
  • \(\mathbf{b}_i\) is a random variable, so it cannot be estimated, only predicted (i.e., you can’t estimate a random quantity).

If we have

  • \(\hat{\beta}\) as an estimator of \(\beta\)
  • \(\hat{\mathbf{b}}_i\) as a predictor of \(\mathbf{b}_i\)

Then,

  • The population-average estimate of \(\mathbf{Y}_i\) is \(\hat{\mathbf{Y}}_i = \mathbf{X}_i \hat{\beta}\)
  • The subject-specific prediction is \(\hat{\mathbf{Y}}_i = \mathbf{X}_i \hat{\beta} + \mathbf{Z}_i \hat{\mathbf{b}}_i\)

Following Henderson (1975), solving the estimating equations known as the mixed model equations gives:

\[ \left[ \begin{array} {c} \hat{\beta} \\ \hat{\mathbf{b}} \end{array} \right] = \left[ \begin{array} {cc} \mathbf{X'\Sigma^{-1}X} & \mathbf{X'\Sigma^{-1}Z} \\ \mathbf{Z'\Sigma^{-1}X} & \mathbf{Z'\Sigma^{-1}Z +B^{-1}} \end{array} \right]^{-1} \left[ \begin{array} {c} \mathbf{X'\Sigma^{-1}Y} \\ \mathbf{Z'\Sigma^{-1}Y} \end{array} \right] \]

where

\[ \begin{aligned} \mathbf{Y} &= \left[ \begin{array} {c} \mathbf{y}_1 \\ \vdots \\ \mathbf{y}_N \end{array} \right] ; \mathbf{X} = \left[ \begin{array} {c} \mathbf{X}_1 \\ \vdots \\ \mathbf{X}_N \end{array} \right]; \mathbf{b} = \left[ \begin{array} {c} \mathbf{b}_1 \\ \vdots \\ \mathbf{b}_N \end{array} \right] ; \epsilon = \left[ \begin{array} {c} \epsilon_1 \\ \vdots \\ \epsilon_N \end{array} \right] \\ cov(\epsilon) &= \mathbf{\Sigma}, \mathbf{Z} = \left[ \begin{array} {cccc} \mathbf{Z}_1 & 0 & \dots & 0 \\ 0 & \mathbf{Z}_2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \mathbf{Z}_N \end{array} \right], \mathbf{B} = \left[ \begin{array} {cccc} \mathbf{D} & 0 & \dots & 0 \\ 0 & \mathbf{D} & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \mathbf{D} \end{array} \right] \end{aligned} \]

The model has the form

\[ \begin{aligned} \mathbf{Y} &= \mathbf{X \beta + Z b + \epsilon} \\ \mathbf{Y} &\sim N(\mathbf{X \beta, ZBZ' + \Sigma}) \end{aligned} \]
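To make the stacked notation concrete, here is a minimal numpy sketch (a small balanced design; the dimensions, the covariance \(\mathbf{D}\), and the residual variance are all illustrative) that assembles \(\mathbf{X}\), \(\mathbf{Z}\), \(\mathbf{B}\), and \(\mathbf{V} = \mathbf{ZBZ' + \Sigma}\) using scipy.linalg.block_diag:

```python
import numpy as np
from scipy.linalg import block_diag

n_subj, n_obs = 5, 4                        # N subjects, n_i observations each (illustrative)
t = np.arange(n_obs, dtype=float)

Xi = np.column_stack([np.ones(n_obs), t])   # X_i: intercept + time (n_i x p)
Zi = Xi.copy()                              # Z_i: random intercept + slope (n_i x q)

X = np.vstack([Xi] * n_subj)                # row-stacked fixed-effects design (T x p)
Z = block_diag(*[Zi] * n_subj)              # block-diagonal random-effects design (T x Nq)
D = np.array([[1.0, 0.3],
              [0.3, 0.5]])                  # illustrative q x q random-effects covariance
B = block_diag(*[D] * n_subj)               # Nq x Nq
Sigma = 0.8 * np.eye(n_subj * n_obs)        # illustrative T x T residual covariance

V = Z @ B @ Z.T + Sigma                     # marginal covariance of the stacked response
print(X.shape, Z.shape, B.shape, V.shape)   # (20, 2) (20, 10) (10, 10) (20, 20)
```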

If \(\mathbf{V = ZBZ' + \Sigma}\), then the solutions to the estimating equations can be written as

\[ \begin{aligned} \hat{\beta} &= \mathbf{(X'V^{-1}X)^{-1}X'V^{-1}Y} \\ \hat{\mathbf{b}} &= \mathbf{BZ'V^{-1}(Y-X\hat{\beta}}) \end{aligned} \]
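As a numerical illustration, the sketch below simulates data from a small random-intercept-and-slope model (all parameter values and dimensions are illustrative) and computes \(\hat{\beta}\), \(\hat{\mathbf{b}}\), and the two kinds of fitted values from the formulas above, treating \(\mathbf{V}\) as known:

```python
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(1)
n_subj, n_obs, sigma2 = 6, 5, 0.5
beta_true = np.array([2.0, -1.0])
D = np.array([[1.0, 0.2],
              [0.2, 0.4]])

t = np.linspace(0, 1, n_obs)
Xi = np.column_stack([np.ones(n_obs), t])
Zi = Xi.copy()                               # random intercept and slope
X = np.vstack([Xi] * n_subj)
Z = block_diag(*[Zi] * n_subj)
B = block_diag(*[D] * n_subj)
Sigma = sigma2 * np.eye(n_subj * n_obs)
V = Z @ B @ Z.T + Sigma

# Simulate Y = X beta + Z b + eps
b = rng.multivariate_normal(np.zeros(2), D, size=n_subj).ravel()
eps = rng.normal(0.0, np.sqrt(sigma2), n_subj * n_obs)
Y = X @ beta_true + Z @ b + eps

# GLS estimate of beta and BLUP of b (V treated as known here)
Vinv = np.linalg.inv(V)
beta_hat = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ Y)
b_hat = B @ Z.T @ Vinv @ (Y - X @ beta_hat)

Y_pop = X @ beta_hat                 # population-average fitted values
Y_subj = X @ beta_hat + Z @ b_hat    # subject-specific predictions
print(beta_hat)
```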

The estimate \(\hat{\beta}\) is a generalized least squares (GLS) estimate, and the predictor \(\hat{\mathbf{b}}\) is the best linear unbiased predictor (BLUP) of \(\mathbf{b}\). These quantities have the following properties:

\[ \begin{aligned} E(\hat{\beta}) &= \beta \\ var(\hat{\beta}) &= (\mathbf{X'V^{-1}X})^{-1} \\ E(\hat{\mathbf{b}}) &= 0 \end{aligned} \]

\[ var(\mathbf{\hat{b}-b}) = \mathbf{B-BZ'V^{-1}ZB + BZ'V^{-1}X(X'V^{-1}X)^{-1}X'V^{-1}ZB} \]

The variance here is the variance of the prediction error (mean squared prediction error, MSPE), which is more meaningful than \(var(\hat{\mathbf{b}})\), since MSPE accounts for both variance and bias in the prediction.

To derive the mixed model equations, consider

\[ \mathbf{\epsilon = Y - X\beta - Zb} \]

Let \(T = \sum_{i=1}^N n_i\) be the total number of observations (i.e., the length of \(\mathbf{Y},\epsilon\)) and \(Nq\) the length of \(\mathbf{b}\). The joint distribution of \(\mathbf{b, \epsilon}\) is

\[ f(\mathbf{b,\epsilon})= \frac{1}{(2\pi)^{(T+ Nq)/2}} \left| \begin{array} {cc} \mathbf{B} & 0 \\ 0 & \mathbf{\Sigma} \end{array} \right| ^{-1/2} \exp \left( -\frac{1}{2} \left[ \begin{array} {c} \mathbf{b} \\ \mathbf{Y - X \beta - Zb} \end{array} \right]' \left[ \begin{array} {cc} \mathbf{B} & 0 \\ 0 & \mathbf{\Sigma} \end{array} \right]^{-1} \left[ \begin{array} {c} \mathbf{b} \\ \mathbf{Y - X \beta - Zb} \end{array} \right] \right) \]

Maximization of \(f(\mathbf{b},\epsilon)\) with respect to \(\mathbf{b}\) and \(\beta\) requires minimization of

\[ \begin{aligned} Q &= \left[ \begin{array} {c} \mathbf{b} \\ \mathbf{Y - X \beta - Zb} \end{array} \right]' \left[ \begin{array} {cc} \mathbf{B} & 0 \\ 0 & \mathbf{\Sigma} \end{array} \right]^{-1} \left[ \begin{array} {c} \mathbf{b} \\ \mathbf{Y - X \beta - Zb} \end{array} \right] \\ &= \mathbf{b'B^{-1}b+(Y-X \beta-Zb)'\Sigma^{-1}(Y-X \beta-Zb)} \end{aligned} \]
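Explicitly, the gradients of \(Q\) are

\[ \begin{aligned} \frac{\partial Q}{\partial \beta} &= -2\mathbf{X'\Sigma^{-1}(Y - X\beta - Zb)} \\ \frac{\partial Q}{\partial \mathbf{b}} &= 2\mathbf{B^{-1}b} - 2\mathbf{Z'\Sigma^{-1}(Y - X\beta - Zb)} \end{aligned} \]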

Setting the derivatives of Q with respect to \(\mathbf{b}\) and \(\mathbf{\beta}\) to zero leads to the system of equations:

\[ \begin{aligned} \mathbf{X'\Sigma^{-1}X\beta + X'\Sigma^{-1}Zb} &= \mathbf{X'\Sigma^{-1}Y}\\ \mathbf{(Z'\Sigma^{-1}Z + B^{-1})b + Z'\Sigma^{-1}X\beta} &= \mathbf{Z'\Sigma^{-1}Y} \end{aligned} \]

Rearranging

\[ \left[ \begin{array} {cc} \mathbf{X'\Sigma^{-1}X} & \mathbf{X'\Sigma^{-1}Z} \\ \mathbf{Z'\Sigma^{-1}X} & \mathbf{Z'\Sigma^{-1}Z + B^{-1}} \end{array} \right] \left[ \begin{array} {c} \beta \\ \mathbf{b} \end{array} \right] = \left[ \begin{array} {c} \mathbf{X'\Sigma^{-1}Y} \\ \mathbf{Z'\Sigma^{-1}Y} \end{array} \right] \]

Thus, solving the mixed model equations gives:

\[ \left[ \begin{array} {c} \hat{\beta} \\ \hat{\mathbf{b}} \end{array} \right] = \left[ \begin{array} {cc} \mathbf{X'\Sigma^{-1}X} & \mathbf{X'\Sigma^{-1}Z} \\ \mathbf{Z'\Sigma^{-1}X} & \mathbf{Z'\Sigma^{-1}Z + B^{-1}} \end{array} \right] ^{-1} \left[ \begin{array} {c} \mathbf{X'\Sigma^{-1}Y} \\ \mathbf{Z'\Sigma^{-1}Y} \end{array} \right] \]
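The agreement between Henderson’s block system and the closed-form GLS/BLUP expressions can be checked numerically; the sketch below does so for an illustrative random-intercept model with known variance components:

```python
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(2)
n_subj, n_obs, sigma2 = 4, 3, 1.0
D = np.array([[0.9]])                            # random intercept only

Xi = np.column_stack([np.ones(n_obs), np.arange(n_obs, dtype=float)])
Zi = np.ones((n_obs, 1))
X = np.vstack([Xi] * n_subj)
Z = block_diag(*[Zi] * n_subj)
B = block_diag(*[D] * n_subj)
Sigma = sigma2 * np.eye(n_subj * n_obs)
Y = (X @ np.array([1.0, 0.5])
     + Z @ rng.normal(0.0, np.sqrt(D[0, 0]), n_subj)
     + rng.normal(0.0, np.sqrt(sigma2), n_subj * n_obs))

# Henderson's mixed model equations: C [beta; b] = rhs
Si = np.linalg.inv(Sigma)
C = np.block([
    [X.T @ Si @ X, X.T @ Si @ Z],
    [Z.T @ Si @ X, Z.T @ Si @ Z + np.linalg.inv(B)],
])
rhs = np.concatenate([X.T @ Si @ Y, Z.T @ Si @ Y])
sol = np.linalg.solve(C, rhs)
beta_mme, b_mme = sol[:X.shape[1]], sol[X.shape[1]:]

# Closed-form GLS / BLUP solution for comparison
V = Z @ B @ Z.T + Sigma
Vinv = np.linalg.inv(V)
beta_gls = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ Y)
b_blup = B @ Z.T @ Vinv @ (Y - X @ beta_gls)

print(np.allclose(beta_mme, beta_gls), np.allclose(b_mme, b_blup))  # True True
```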

Equivalently, by Bayes’ theorem,

\[ f(\mathbf{b}| \mathbf{Y}) = \frac{f(\mathbf{Y}|\mathbf{b})f(\mathbf{b})}{\int f(\mathbf{Y}|\mathbf{b})f(\mathbf{b}) d\mathbf{b}} \]

where

  • \(f(\mathbf{Y}|\mathbf{b})\) is the “likelihood”
  • \(f(\mathbf{b})\) is the prior
  • the denominator is the “normalizing constant”
  • \(f(\mathbf{b}|\mathbf{Y})\) is the posterior distribution

In this case

\[ \begin{aligned} \mathbf{Y} | \mathbf{b} &\sim N(\mathbf{X\beta+Zb,\Sigma}) \\ \mathbf{b} &\sim N(\mathbf{0,B}) \end{aligned} \]

The posterior distribution has the form

\[ \mathbf{b}|\mathbf{Y} \sim N(\mathbf{BZ'V^{-1}(Y-X\beta),(Z'\Sigma^{-1}Z + B^{-1})^{-1}}) \]

Hence, the best predictor (under squared-error loss) is

\[ E(\mathbf{b}|\mathbf{Y}) = \mathbf{BZ'V^{-1}(Y-X\beta)} \]
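A direct Bayesian calculation gives this mean in the form \(\mathbf{(Z'\Sigma^{-1}Z + B^{-1})^{-1}Z'\Sigma^{-1}(Y-X\beta)}\); it matches the BLUP form above because

\[ \mathbf{Z'\Sigma^{-1}(ZBZ' + \Sigma) = Z'\Sigma^{-1}ZBZ' + Z' = (Z'\Sigma^{-1}Z + B^{-1})BZ'}, \]

so that \(\mathbf{(Z'\Sigma^{-1}Z + B^{-1})^{-1}Z'\Sigma^{-1} = BZ'(ZBZ' + \Sigma)^{-1} = BZ'V^{-1}}\).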

8.2.1 Estimating \(\mathbf{V}\)

If we have an estimate \(\tilde{\mathbf{V}}\) of \(\mathbf{V}\), then we can compute:

\[ \begin{aligned} \hat{\beta} &= \mathbf{(X'\tilde{V}^{-1}X)^{-1}X'\tilde{V}^{-1}Y} \\ \hat{\mathbf{b}} &= \mathbf{BZ'\tilde{V}^{-1}(Y-X\hat{\beta})} \end{aligned} \]

where \(\hat{\mathbf{b}}\) is the EBLUP (estimated BLUP), also called the empirical Bayes estimate.

Note:

  • \(\hat{var}(\hat{\beta})\) is a consistent estimator of \(var(\hat{\beta})\) if \(\tilde{\mathbf{V}}\) is a consistent estimator of \(\mathbf{V}\)
  • However, \(\hat{var}(\hat{\beta})\) is biased because the variability arising from estimating \(\mathbf{V}\) is not accounted for in the estimate.
  • Hence, \(\hat{var}(\hat{\beta})\) underestimates the true variability

Ways to estimate \(\mathbf{V}\)

8.2.1.1 Maximum Likelihood Estimation (MLE)

Group the unknown parameters in \(\Sigma\) and \(\mathbf{B}\) into a parameter vector \(\theta\). Under MLE, \(\hat{\theta}\) and \(\hat{\beta}\) maximize the likelihood of \(\mathbf{y} \sim N(\mathbf{X\beta, V(\theta))}\), or equivalently minimize \(-2\log L(\mathbf{y;\theta,\beta})\) (a numerical sketch follows the steps below):

\[ -2l(\mathbf{\beta,\theta,y}) = \log |\mathbf{V(\theta)}| + \mathbf{(y-X\beta)'V(\theta)^{-1}(y-X\beta)} + T \log(2\pi) \]

  • Step 1: Replace \(\beta\) with its maximum likelihood estimator for known \(\theta\): \(\hat{\beta}(\theta)= (\mathbf{X'V(\theta)^{-1}X)^{-1}X'V(\theta)^{-1}y}\)
  • Step 2: Minimize the above equation with respect to \(\theta\) to get the estimator \(\hat{\theta}_{MLE}\)
  • Step 3: Substitute \(\hat{\theta}_{MLE}\) back to get \(\hat{\beta}_{MLE} = (\mathbf{X'V(\hat{\theta}_{MLE})^{-1}X)^{-1}X'V(\hat{\theta}_{MLE})^{-1}y}\)
  • Step 4: Get \(\hat{\mathbf{b}}_{MLE} = \mathbf{B(\hat{\theta}_{MLE})Z'V(\hat{\theta}_{MLE})^{-1}(y-X\hat{\beta}_{MLE})}\)
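Below is a minimal sketch of Steps 1–4 for a random-intercept model, where \(\theta = (\sigma^2_b, \sigma^2)\); the simulated data, starting values, and optimizer choice are all illustrative:

```python
import numpy as np
from scipy.linalg import block_diag
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n_subj, n_obs = 10, 6
beta_true, sb2_true, s2_true = np.array([1.0, 2.0]), 1.5, 0.7

Xi = np.column_stack([np.ones(n_obs), np.linspace(0, 1, n_obs)])
Zi = np.ones((n_obs, 1))                       # random intercept only
X = np.vstack([Xi] * n_subj)
Z = block_diag(*[Zi] * n_subj)
y = (X @ beta_true
     + Z @ rng.normal(0.0, np.sqrt(sb2_true), n_subj)
     + rng.normal(0.0, np.sqrt(s2_true), n_subj * n_obs))
T = len(y)

def neg2loglik(log_theta):
    """-2 log-likelihood with beta profiled out; theta = (sigma_b^2, sigma^2)."""
    sb2, s2 = np.exp(log_theta)                # log scale keeps the variances positive
    V = sb2 * (Z @ Z.T) + s2 * np.eye(T)
    Vinv = np.linalg.inv(V)
    beta = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)   # Step 1
    r = y - X @ beta
    _, logdet = np.linalg.slogdet(V)
    return logdet + r @ Vinv @ r + T * np.log(2 * np.pi)

opt = minimize(neg2loglik, x0=np.log([1.0, 1.0]), method="Nelder-Mead")  # Step 2
sb2_mle, s2_mle = np.exp(opt.x)

V = sb2_mle * (Z @ Z.T) + s2_mle * np.eye(T)                 # Step 3
Vinv = np.linalg.inv(V)
beta_mle = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)

B_hat = sb2_mle * np.eye(n_subj)                             # Step 4: EBLUP of b
b_mle = B_hat @ Z.T @ Vinv @ (y - X @ beta_mle)
print(sb2_mle, s2_mle, beta_mle)
```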

Note:

  • The variance estimates \(\hat{\theta}\) are typically negatively biased because the degrees of freedom used to estimate the fixed effects are not accounted for; REML (next section) addresses this.

8.2.1.2 Restricted Maximum Likelihood (REML)

REML accounts for the number of estimated fixed-effect (mean) parameters by adjusting the objective function. Specifically, it maximizes the likelihood of linear combinations of the elements of \(\mathbf{y}\) whose distribution does not depend on \(\beta\).

We work with \(\mathbf{K'y}\), where \(\mathbf{K}\) is any \(T \times (T - p)\) full-rank contrast matrix whose columns are orthogonal to the columns of the \(\mathbf{X}\) matrix (that is, \(\mathbf{K'X} = 0\)). Then,

\[ \mathbf{K'y} \sim N(0,\mathbf{K'V(\theta)K}) \]

where \(\beta\) is no longer in the distribution

We then maximize the likelihood of these contrasts to get \(\hat{\theta}_{REML}\), which does not depend on the particular choice of \(\mathbf{K}\). The fixed effects \(\hat{\beta}\) are obtained from the GLS formula with \(\hat{\theta}_{REML}\) plugged in.
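A minimal sketch of the REML computation, for the same kind of random-intercept model as above: the contrast matrix \(\mathbf{K}\) is built with scipy.linalg.null_space so that \(\mathbf{K'X} = 0\), and \(\hat{\theta}_{REML}\) is obtained by maximizing the likelihood of \(\mathbf{K'y}\) (data and starting values are illustrative):

```python
import numpy as np
from scipy.linalg import block_diag, null_space
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n_subj, n_obs = 10, 6
Xi = np.column_stack([np.ones(n_obs), np.linspace(0, 1, n_obs)])
Zi = np.ones((n_obs, 1))
X = np.vstack([Xi] * n_subj)
Z = block_diag(*[Zi] * n_subj)
y = (X @ np.array([1.0, 2.0])
     + Z @ rng.normal(0.0, 1.2, n_subj)
     + rng.normal(0.0, 0.8, n_subj * n_obs))
T, p = X.shape

K = null_space(X.T)            # T x (T - p) with K'X = 0
Ky = K.T @ y                   # contrasts whose distribution is free of beta

def neg2loglik_reml(log_theta):
    """-2 log-likelihood of K'y ~ N(0, K'V(theta)K)."""
    sb2, s2 = np.exp(log_theta)
    V = sb2 * (Z @ Z.T) + s2 * np.eye(T)
    W = K.T @ V @ K
    _, logdet = np.linalg.slogdet(W)
    return logdet + Ky @ np.linalg.solve(W, Ky) + (T - p) * np.log(2 * np.pi)

opt = minimize(neg2loglik_reml, x0=np.log([1.0, 1.0]), method="Nelder-Mead")
sb2_reml, s2_reml = np.exp(opt.x)

# Fixed effects: GLS at theta_REML
V = sb2_reml * (Z @ Z.T) + s2_reml * np.eye(T)
Vinv = np.linalg.inv(V)
beta_reml = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)
print(sb2_reml, s2_reml, beta_reml)
```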

Comparison of REML and MLE (a brief software illustration follows this list):

  • Both methods are based upon the likelihood principle, and have desired properties for the estimates:

    • consistency

    • asymptotic normality

    • efficiency

  • ML estimation provides estimates of the fixed effects directly, while REML by itself does not (the fixed effects are then obtained by GLS at \(\hat{\theta}_{REML}\))

  • In balanced designs, the REML estimates of the variance components are identical to the ANOVA (method-of-moments) estimates

  • REML accounts for the degrees of freedom used by the fixed effects in the model, which is important when the number of columns of \(\mathbf{X}\) is large relative to the sample size

  • The REML estimates of \(\theta\) are unaffected by the value of \(\beta\)

  • REML is less sensitive to outliers than MLE

  • MLE (rather than REML) should be used for model comparisons that involve different fixed-effects structures (e.g., via AIC or BIC)
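In practice these estimators are rarely coded by hand. The sketch below, assuming the statsmodels and pandas packages (the simulated data and variable names are illustrative), fits the same random-intercept model by REML and by ML via the reml flag of MixedLM’s fit():

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n_subj, n_obs = 30, 5
g = np.repeat(np.arange(n_subj), n_obs)
x = np.tile(np.linspace(0, 1, n_obs), n_subj)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, n_subj)[g] + rng.normal(0.0, 0.7, n_subj * n_obs)
df = pd.DataFrame({"y": y, "x": x, "g": g})

# Random-intercept model, fit by REML (statsmodels' default) and by ML
model = smf.mixedlm("y ~ x", df, groups=df["g"])
fit_reml = model.fit(reml=True)
fit_ml = model.fit(reml=False)

# The ML variance components tend to be slightly smaller (negative bias);
# use the ML fits when comparing different fixed-effects structures via AIC/BIC.
print(fit_reml.cov_re, fit_reml.scale)
print(fit_ml.cov_re, fit_ml.scale)
```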

8.2.1.3 Estimated Generalized Least Squares

MLE and REML rely on the Gaussian assumption. To relax this assumption, EGLS uses only the first two moments.

\[ \mathbf{Y}_i = \mathbf{X}_i \beta + \mathbf{Z}_i \mathbf{b}_i + \epsilon_i \]

where

  • \(\epsilon_i \sim (\mathbf{0,\Sigma_i})\)
  • \(\mathbf{b}_i \sim (\mathbf{0,D})\)
  • \(cov(\epsilon_i, \mathbf{b}_i) = 0\)

Then the EGLS estimator is

\[ \begin{aligned} \hat{\beta}_{GLS} &= \{\sum_{i=1}^n \mathbf{X'_iV_i(\theta)^{-1}X_i} \}^{-1} \sum_{i=1}^n \mathbf{X'_iV_i(\theta)^{-1}Y_i} \\ &=\{\mathbf{X'V(\theta)^{-1}X} \}^{-1} \mathbf{X'V(\theta)^{-1}Y} \end{aligned} \]

The estimator depends only on the first two moments:

  • \(E(\mathbf{Y}_i) = \mathbf{X}_i \beta\)
  • \(var(\mathbf{Y}_i)= \mathbf{V}_i\)

EGLS substitutes an estimate \(\hat{\mathbf{V}}\) for \(\mathbf{V(\theta)}\):

\[ \hat{\beta}_{EGLS} = \{ \mathbf{X'\hat{V}^{-1}X} \}^{-1} \mathbf{X'\hat{V}^{-1}Y} \]

Hence, the fixed-effects estimators for MLE, REML, and EGLS all have the same form; they differ only in the estimate of \(\mathbf{V}\).

When \(\mathbf{V}\) can be estimated non-iteratively and without much computational burden, EGLS is particularly appealing (a sketch follows).
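A minimal sketch of a non-iterative EGLS fit for a random-intercept model: \(\mathbf{V}\) is estimated with simple ANOVA-type moment estimators computed from OLS residuals (an illustrative choice that ignores the degrees of freedom used in estimating \(\beta\)), and then plugged into the GLS formula:

```python
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(6)
n_subj, n_obs = 20, 5
Xi = np.column_stack([np.ones(n_obs), np.linspace(0, 1, n_obs)])
X = np.vstack([Xi] * n_subj)
g = np.repeat(np.arange(n_subj), n_obs)
y = X @ np.array([1.0, 2.0]) + rng.normal(0.0, 1.1, n_subj)[g] + rng.normal(0.0, 0.6, n_subj * n_obs)

# Stage 1: OLS residuals
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta_ols

# Stage 2: ANOVA-type moment estimates of the variance components
e_bar = np.array([e[g == i].mean() for i in range(n_subj)])
s2_hat = sum(((e[g == i] - e_bar[i]) ** 2).sum() for i in range(n_subj)) / (n_subj * (n_obs - 1))
sb2_hat = max(e_bar.var(ddof=1) - s2_hat / n_obs, 0.0)        # truncate at zero

# Stage 3: plug V_hat into the GLS formula
Vi_hat = sb2_hat * np.ones((n_obs, n_obs)) + s2_hat * np.eye(n_obs)
V_hat = block_diag(*[Vi_hat] * n_subj)
Vinv = np.linalg.inv(V_hat)
beta_egls = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)
print(s2_hat, sb2_hat, beta_egls)
```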

8.2.1.4 Bayesian Hierarchical Models (BHM)

A joint distribution can be decomposed hierarchically into a product of conditional distributions and a marginal distribution:

\[ f(A,B,C) = f(A|B,C) f(B|C)f(C) \]

Applying this decomposition to the mixed model:

\[ \begin{aligned} f(\mathbf{Y, \beta, b, \theta}) &= f(\mathbf{Y|\beta,b, \theta})f(\mathbf{b|\theta,\beta})f(\mathbf{\beta|\theta})f(\mathbf{\theta}) & \text{based on probability decomposition} \\ &= f(\mathbf{Y|\beta,b, \theta})f(\mathbf{b|\theta})f(\mathbf{\beta})f(\mathbf{\theta}) & \text{based on simplifying modeling assumptions} \end{aligned} \]

To elaborate on the second equality: if we assume conditional independence (e.g., given \(\theta\), knowing \(\beta\) provides no additional information about \(\mathbf{b}\)), the first line simplifies to the second.

Using Bayes’ rule

\[ f(\mathbf{\beta, b, \theta|Y}) \propto f(\mathbf{Y|\beta,b, \theta})f(\mathbf{b|\theta})f(\mathbf{\beta})f(\mathbf{\theta}) \]

where

\[ \begin{aligned} \mathbf{Y| \beta, b, \theta} &\sim \mathbf{N(X\beta+ Zb, \Sigma(\theta))} \\ \mathbf{b | \theta} &\sim \mathbf{N(0, B(\theta))} \end{aligned} \]

and we must also specify prior distributions \(f(\beta)\) and \(f(\theta)\).

With the normalizing constant, we can obtain the posterior distribution. An analytical solution is typically not available, so we use Markov chain Monte Carlo (MCMC) to draw samples from the posterior distribution, as sketched below.
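A minimal sketch of such a hierarchical model for a random-intercept design, assuming the PyMC package; the priors, data, and sampler settings are all illustrative:

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(7)
n_subj, n_obs = 15, 5
g = np.repeat(np.arange(n_subj), n_obs)
x = np.tile(np.linspace(0, 1, n_obs), n_subj)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, n_subj)[g] + rng.normal(0.0, 0.7, n_subj * n_obs)

with pm.Model() as bhm:
    # Priors f(beta), f(theta) -- weakly informative, illustrative choices
    beta = pm.Normal("beta", mu=0.0, sigma=10.0, shape=2)
    sigma_b = pm.HalfNormal("sigma_b", sigma=5.0)
    sigma = pm.HalfNormal("sigma", sigma=5.0)
    # Random intercepts: b | theta ~ N(0, B(theta))
    b = pm.Normal("b", mu=0.0, sigma=sigma_b, shape=n_subj)
    # Likelihood: Y | beta, b, theta ~ N(X beta + Z b, Sigma(theta))
    mu = beta[0] + beta[1] * x + b[g]
    pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y)
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=7)
```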

Bayesian Methods:

  • account for the uncertainty in parameter estimates and propagate that uncertainty through the model
  • can incorporate prior information about the parameters
  • can extend beyond Gaussian distributions
  • but the algorithms can be hard to implement and may have convergence problems

References

Henderson, Charles R. 1975. “Best Linear Unbiased Estimation and Prediction Under a Selection Model.” Biometrics, 423–47.