F The Linear Model

Suppose we have \(i=1,\ldots,n\) continuous outcomes \(Y_i\) with covariates \(\boldsymbol{X}_i\), a vector of length \(p\). To simplify notation, the first entry of \(\boldsymbol{X}_i\) is assumed to be always 1. The linear model relates the covariates \(\boldsymbol{X}_i\) to the outcomes \(Y_i\) in a linear fashion. This is most easily described if we stack all outcomes into a vector \(\boldsymbol{Y}\) of length \(n\) and stack the covariates \(\boldsymbol{X}_i\) into a matrix \[ \boldsymbol{X} = \left( \begin{array}{c} \boldsymbol{X}_1^{\top} \\ \boldsymbol{X}_2^{\top} \\ \vdots \\ \boldsymbol{X}_n^{\top} \end{array} \right). \] Likewise, additional measurement error terms \(\epsilon_i\), \(i=1,\ldots,n\), are also stacked into a vector \(\boldsymbol{\epsilon}\) of length \(n\).

F.1 Standard Formulation

The standard formulation of the linear model assumes that the error terms \(\epsilon_i\) have mean zero and are mutually independent with common variance \(\sigma^2\). The relationship between \(\boldsymbol{X}\) and \(\boldsymbol{Y}\) is described by the regression coefficients \(\boldsymbol{\beta}\), a vector of length \(p\), and we obtain the standard linear model \[ \boldsymbol{Y} = \boldsymbol{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}. \] The least squares estimate of \(\boldsymbol{\beta}\) is \[ \widehat{\boldsymbol{\beta}} = (\boldsymbol{X}^{\top} \boldsymbol{X})^{-1}\boldsymbol{X}^{\top} \boldsymbol{Y}. \tag{F.1} \] Note that this estimate does not require knowledge of \(\sigma^2\), but the covariance matrix of (F.1) depends on \(\sigma^2\): \[ \mathop{\mathrm{Cov}}(\widehat{\boldsymbol{\beta}}) = \sigma^2 (\boldsymbol{X}^{\top} \boldsymbol{X})^{-1}. \tag{F.2} \] The variance \(\sigma^2\) can be estimated in a second step based on the sum of the squared residuals \(e_i = Y_i - \boldsymbol{X}_i^\top \widehat{\boldsymbol{\beta}}\), \(\boldsymbol{e}=(e_1, \ldots, e_n)^\top\): \[ \hat \sigma^2 = \frac{1}{n-p} \boldsymbol{e}^\top \boldsymbol{e} = \frac{1}{n-p} \left(\boldsymbol{Y} - \boldsymbol{X}\widehat{\boldsymbol{\beta}}\right)^{\top}\left(\boldsymbol{Y} - \boldsymbol{X}\widehat{\boldsymbol{\beta}}\right). \] Plugging \(\hat \sigma^2\) into (F.2) and taking the square roots of the diagonal elements yields standard errors for the coefficients of \(\widehat{\boldsymbol{\beta}}\).
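
As a small illustration, the estimates above can be computed directly with a few lines of base R; the simulated data and object names below are purely illustrative.

```r
## Minimal base-R sketch of (F.1), (F.2) and the variance estimate,
## using simulated data; all names and values are illustrative.
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
X  <- cbind(1, x1, x2)                 # design matrix, first column all ones
p  <- ncol(X)
y  <- drop(X %*% c(2, 0.5, -1) + rnorm(n, sd = 1.5))

beta_hat   <- solve(t(X) %*% X, t(X) %*% y)                 # (F.1)
e          <- y - X %*% beta_hat                            # residuals
sigma2_hat <- sum(e^2) / (n - p)                            # estimate of sigma^2
se         <- sqrt(sigma2_hat * diag(solve(t(X) %*% X)))    # plug into (F.2)

## The same numbers are returned by lm():
## summary(lm(y ~ x1 + x2))$coefficients
```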

An example of a linear model is the ANCOVA model discussed in Chapter 7.

F.2 The General Linear Model

This section introduces generalized estimating equations (GEE) and generalized least squares (GLS) for the analysis of longitudinal data; both approaches account for correlation between measurements from the same individual.

Definition F.1 Longitudinal data are called balanced if there is the same number of observations \(n_i=n\) for each individual \(i= 1,...,m\), taken at the same time points \(t_{ij}=t_j, j=1,...,n.\)

F.2.1 GEE

Model formulation \[\begin{equation} \tag{F.3} \boldsymbol{Y} \sim \boldsymbol{\mathop{\mathrm{N}}}(\boldsymbol{X\beta}, \boldsymbol{\Sigma}) \end{equation}\] where \(\boldsymbol{\Sigma}\) is block-diagonal with \(m\) blocks \(\boldsymbol{\Sigma}_0\) of size \(n \times n\), one per individual. This means that \[ \boldsymbol{\Sigma} = \begin{pmatrix} \boldsymbol{\Sigma}_0 & & & \\ & \boldsymbol{\Sigma}_0 & & \\ & & \ddots & \\ & & &\boldsymbol{\Sigma}_0 \end{pmatrix} \] with zero entries elsewhere.
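
For balanced data (Definition F.1) this block-diagonal matrix is easy to construct explicitly; here is a minimal R sketch in which the block \(\boldsymbol{\Sigma}_0\), \(m\) and \(n\) are made-up illustrative values.

```r
## Block-diagonal covariance matrix for m individuals with n time points each;
## Sigma0, m and n are illustrative values.
m <- 4                                  # number of individuals
n <- 3                                  # number of time points per individual
Sigma0 <- matrix(c(1.0, 0.5, 0.3,
                   0.5, 1.0, 0.5,
                   0.3, 0.5, 1.0), nrow = n)
Sigma  <- kronecker(diag(m), Sigma0)    # (mn) x (mn) matrix, zero outside the blocks
```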

The weighted least squares estimate of \(\boldsymbol{\beta}\) with weight matrix \(\boldsymbol{W}\) is \[ \widehat{\boldsymbol{\beta}}_{\boldsymbol{W}} = (\boldsymbol{X}^{\top}\boldsymbol{W}\boldsymbol{X})^{-1}\boldsymbol{X}^{\top}\boldsymbol{W}\boldsymbol{Y}. \] For any weight matrix \(\boldsymbol{W}\), \(\widehat{\boldsymbol{\beta}}_{\boldsymbol{W}}\) is unbiased, i.e. \(\mathop{\mathrm{\mathsf{E}}}(\widehat {\boldsymbol{\beta}}_{\boldsymbol{W}}) = {\boldsymbol{\beta}}\). The weight matrix \(\boldsymbol{W}\) is block-diagonal, with each block \(\boldsymbol{W}_0\) of size \(n \times n\) for the \(n\) time points.

The associated covariance matrix is (“sandwich formula”):

\[\begin{align*} \mathop{\mathrm{Cov}}(\widehat{\boldsymbol{\beta}}_{\boldsymbol{W}}) &= \boldsymbol{V} \, \boldsymbol{\Sigma} \, \boldsymbol{V}^{\top} \\ \mbox{ where } \boldsymbol{V} &= (\boldsymbol{X}^{\top}\boldsymbol{W}\boldsymbol{X})^{-1}\boldsymbol{X}^{\top}\boldsymbol{W} \end{align*}\]

and requires knowledge of \(\boldsymbol{\Sigma}\) (or an estimate \(\widehat{\boldsymbol{\Sigma}}\)).

Notice that \(\boldsymbol{W}^{-1}\) is often called the working covariance matrix, to distinguish it from the true covariance matrix \(\boldsymbol{\Sigma}\): it serves as an approximation to the true covariance matrix of the outcomes. Some special cases of \(\boldsymbol{W}^{-1}\) are:

  • If \(\boldsymbol{W}^{-1} = \boldsymbol{I}\), then we obtain the least squares estimate \(\widehat {\boldsymbol{\beta}}_{\boldsymbol{W}} = (\boldsymbol{X}^{\top} \boldsymbol{X})^{-1}\boldsymbol{X}^{\top} \boldsymbol{Y}\).
  • If \(\boldsymbol{W}^{-1} = \boldsymbol{\Sigma}\), then \(\mathop{\mathrm{Cov}}(\widehat{\boldsymbol{\beta}}_{\boldsymbol{W}})\) simplifies to \[ \mathop{\mathrm{Cov}}(\widehat{\boldsymbol{\beta}}_{\boldsymbol{W}}) = (\boldsymbol{X}^{\top}\boldsymbol{W}\boldsymbol{X})^{-1}. \] If we set \({\boldsymbol{\Sigma}}\) to \(\boldsymbol{W}^{-1}\) in the sandwich formula for \(\mathop{\mathrm{Cov}}(\widehat{\boldsymbol{\beta}}_{\boldsymbol{W}})\), we assume that \(\mathop{\mathrm{Cov}}(\boldsymbol{Y})= \boldsymbol{W}^{-1}\) (up to scale). This gives the naive standard errors of \(\widehat{\boldsymbol{\beta}}_{\boldsymbol{W}}\).

The sandwich estimate \(\widehat {\boldsymbol R}_{\boldsymbol{W}} = \boldsymbol{V} \, \widehat{\boldsymbol{\Sigma}} \, \boldsymbol{V}^{\top}\) is a consistent estimate of \(\mathop{\mathrm{Cov}}(\widehat{\boldsymbol{\beta}}_{\boldsymbol{W}})\). Here \(\widehat{\boldsymbol{\Sigma}}\) is block-diagonal with blocks \(\widehat{\boldsymbol{\Sigma}}_0\), where \(\widehat{\boldsymbol{\Sigma}}_0\) is the empirical covariance matrix of the residual vectors \(\boldsymbol{r}_i = {\boldsymbol Y}_i - \boldsymbol{X}_i\widehat {\boldsymbol \beta}_{\boldsymbol{W}}\). The robust standard errors can be extracted from the diagonal: \(\sqrt{\mathop{\mathrm{diag}}(\widehat {\boldsymbol R}_{\boldsymbol{W}})}\).

If the naive standard errors are close to the robust standard errors, this indicates that the working covariance matrix \(\boldsymbol{W}^{-1}\) is close to \(\mathop{\mathrm{Cov}}(\boldsymbol{Y})\) (up to scale).

Generalized estimating equations work in the following way (a small R sketch follows the list):

  1. assume some working covariance matrix \(\boldsymbol{W}^{-1}\) (equivalently, a weight matrix \(\boldsymbol{W}\))
  2. estimate the coefficients with the weighted least squares estimate \(\widehat{\boldsymbol{\beta}}_{\boldsymbol{W}}\)
  3. estimate the covariance \(\widehat{\boldsymbol{\Sigma}}\) of the residuals
  4. compute robust standard errors (taking residual correlation into account) from the covariance matrix \[\widehat {\boldsymbol{R}}_{\boldsymbol{W}} = \boldsymbol{V} \, \widehat{\boldsymbol{\Sigma}} \, \boldsymbol{V}^{\top}\] based on the “sandwich” formula applied to the “naive” covariance matrix \(\boldsymbol{V}\) of the weighted least-squares estimate \(\widehat{\boldsymbol{\beta}}_{\boldsymbol{W}}\).
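
These four steps can be carried out by hand with basic matrix operations. Below is a minimal sketch under working independence (\(\boldsymbol{W} = \boldsymbol{I}\)) on simulated balanced data; all object names and data-generating values are illustrative, and in practice one would usually use a dedicated package such as \(\texttt{geepack}\).

```r
## Minimal sketch of GEE steps 1-4 with working independence (W = I);
## simulated balanced data, illustrative values throughout.
set.seed(1)
m <- 50; n <- 4                                    # individuals, time points
id   <- rep(1:m, each = n)
time <- rep(1:n, times = m)
X    <- cbind(1, time)                             # design matrix
p    <- ncol(X)
y    <- drop(X %*% c(1, 0.5)) + rnorm(m)[id] + rnorm(m * n)   # correlated outcomes

## Steps 1 and 2: weighted least squares with W = I (ordinary least squares)
beta_w <- solve(t(X) %*% X, t(X) %*% y)
V      <- solve(t(X) %*% X) %*% t(X)               # V = (X'WX)^{-1} X'W for W = I

## Step 3: empirical covariance of the residual vectors r_i
r          <- matrix(y - X %*% beta_w, nrow = n)   # one column per individual
Sigma0_hat <- tcrossprod(r - rowMeans(r)) / (m - 1)
Sigma_hat  <- kronecker(diag(m), Sigma0_hat)

## Step 4: sandwich covariance matrix and robust standard errors
R_w       <- V %*% Sigma_hat %*% t(V)
robust_se <- sqrt(diag(R_w))

## For comparison: naive standard errors, assuming Cov(Y) = sigma^2 * I
sigma2_hat <- sum((y - X %*% beta_w)^2) / (m * n - p)
naive_se   <- sqrt(sigma2_hat * diag(solve(t(X) %*% X)))
```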

F.2.2 GLS

Generalized Least Squares (GLS) is a parametric method for estimating regression parameters when errors are correlated. In contrast to GEE, GLS estimates the parameters by Maximum Likelihood (\(\texttt{ML}\)) or Restricted Maximum Likelihood (\(\texttt{REML}\)).

Model formulation \[\begin{equation} \boldsymbol{Y} \sim \boldsymbol{\mathop{\mathrm{N}}}(\boldsymbol{X\beta}, \boldsymbol{\Sigma}) \end{equation}\] where \(\boldsymbol{\Sigma}=\sigma^2 \boldsymbol{V}\) is block-diagonal with blocks \(\boldsymbol{\Sigma}_0=\sigma^2 \boldsymbol{V}_0\) of size \(n \times n\). Here \(\boldsymbol{V}_0\) is a correlation matrix whose entries correspond to correlations between observations from the same individual. This correlation matrix often depends on only a few parameters.

Below are some commonly used parametric correlation structures; a short R sketch constructing them follows the list.

  • Uniform correlation model, compound symmetry \[ \boldsymbol{V}_0 = (1 - \rho) \boldsymbol{I} + \rho \boldsymbol{J}, \] where \(\boldsymbol{I}\) is the identity matrix, \(\boldsymbol{J}\) is a matrix of ones and \(\rho \in [0,1)\) is an unknown correlation parameter. This is also the \(\texttt{exchangeable}\) correlation structure.

  • Exponential correlation model

    Set \(\rho_{jk} = \exp(- \phi d_{jk})\) where \(d_{jk}=|t_j - t_k|\) is the distance (in some unit) between two time points \(t_j\) and \(t_k\): \[ \boldsymbol{V}_0 = \left\{\exp(- \phi d_{jk})\right\}_{j,k}. \] The parameter \(\phi\) controls the rate of exponential decay of the correlation. The inverse \(1/\phi\) is often called the range parameter: for a distance \(d_{jk}>1/\phi\) we have “small” correlation \(\rho_{jk} < \exp(-1) \approx 0.37\).

  • Continuous time AR-1 model: This is a reparameterisation of the exponential correlation model with \(\alpha=\exp(-\phi)\), so that \(\rho_{jk} = \alpha^{d_{jk}}\) and \(\boldsymbol{V}_0 = \left\{\alpha^{d_{jk}}\right\}_{j,k}\). This is also the \(\texttt{AR-1}\) correlation structure mentioned earlier.
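
Here is a short R sketch constructing these three correlation matrices for a handful of time points; the time points and parameter values are made up.

```r
## Illustrative correlation matrices V_0 for four time points;
## times, rho and phi are made-up values.
times <- c(0, 1, 2, 4)                    # observation times
n     <- length(times)
d     <- abs(outer(times, times, "-"))    # distances d_jk = |t_j - t_k|

## Uniform correlation (compound symmetry / exchangeable)
rho   <- 0.4
V0_cs <- (1 - rho) * diag(n) + rho * matrix(1, n, n)

## Exponential correlation model
phi    <- 0.8
V0_exp <- exp(-phi * d)

## Continuous-time AR-1 parameterisation: alpha = exp(-phi)
alpha  <- exp(-phi)
V0_ar1 <- alpha^d                         # identical to V0_exp
```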

To choose which correlation structure to use, we may use the model choice criteria \[\begin{align*} \mbox{AIC} &= - 2\, l(\widehat{\boldsymbol{\theta}}_{\text{ML}}) + 2\tilde p \\ \mbox{BIC} &= - 2\, l(\widehat{\boldsymbol{\theta}}_{\text{ML}}) + \tilde p\log(\tilde n) \end{align*}\] where \(l(\cdot)\) is the log-likelihood, \(\widehat{\boldsymbol{\theta}}_{\text{ML}}\) is the maximum likelihood estimate, \(\tilde p\) is the total number of unknown parameters in \(\boldsymbol{\theta}\) (including the correlation parameters), and \(\tilde n\) is the total number of observations. Smaller values of \(\mbox{AIC}\) and \(\mbox{BIC}\) are preferred.
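
In R, this comparison can be done with \(\texttt{gls()}\) from the \(\texttt{nlme}\) package. The sketch below uses a simulated data frame with columns \(\texttt{y}\), \(\texttt{time}\) and \(\texttt{id}\); all names and values are illustrative, and the models are fitted with \(\texttt{method = "ML"}\) so that the AIC/BIC formulas above apply directly.

```r
## Comparing correlation structures with gls() from the nlme package;
## 'dat' is simulated and purely illustrative.
library(nlme)

set.seed(1)
m <- 30; n <- 5
dat <- data.frame(id   = factor(rep(1:m, each = n)),
                  time = rep(0:(n - 1), times = m))
dat$y <- 1 + 0.5 * dat$time + rep(rnorm(m), each = n) + rnorm(m * n)

## Same mean model, different correlation structures, all fitted by ML
fit_cs  <- gls(y ~ time, data = dat, method = "ML",
               correlation = corCompSymm(form = ~ 1 | id))     # exchangeable
fit_exp <- gls(y ~ time, data = dat, method = "ML",
               correlation = corExp(form = ~ time | id))       # exponential
fit_ar1 <- gls(y ~ time, data = dat, method = "ML",
               correlation = corCAR1(form = ~ time | id))      # continuous AR-1

## Smaller values are preferred
AIC(fit_cs, fit_exp, fit_ar1)
BIC(fit_cs, fit_exp, fit_ar1)
```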

ML and REML estimation

In model (F.3), the blocks \(\boldsymbol{\Sigma}_0=\sigma^2 \boldsymbol{V}_0\) of \(\boldsymbol{\Sigma}=\sigma^2 \boldsymbol{V}\) often depend on only a few parameters \(\boldsymbol{\alpha}\) under a parametric model, and these have to be estimated with numerical routines.

For fixed \(\boldsymbol{V}_0\) we obtain the ML estimate of \(\boldsymbol{\beta}\): \[\begin{equation} \widehat{\boldsymbol{\beta}} (\boldsymbol{V}_0)= (\boldsymbol{X}^{\top}\boldsymbol{V}^{-1}\boldsymbol{X})^{-1}\boldsymbol{X}^{\top}\boldsymbol{V}^{-1}\boldsymbol{y}, \tag{F.4} \end{equation}\] which can be identified as the weighted least-squares estimate with \(\boldsymbol{W}=\boldsymbol{V}^{-1}\).

We can also obtain the ML estimate of \(\sigma^2\): \[ \widehat{\sigma}^2 (\boldsymbol{V}_0)= \mbox{RSS}(\boldsymbol{V}_0)/(nm), \] where \[ \mbox{RSS}(\boldsymbol{V}_0) = {(\boldsymbol{y}-\boldsymbol{X}\widehat{\boldsymbol{\beta}})^{\top}\boldsymbol{V}^{-1}(\boldsymbol{y}-\boldsymbol{X}\widehat{\boldsymbol{\beta}})} \] is the (weighted) residual sum of squares and \(\widehat{\boldsymbol{\beta}} = \widehat{\boldsymbol{\beta}} (\boldsymbol{V}_0)\) is defined in (F.4).

However, ML estimates of \(\sigma^2\) and \(\boldsymbol{V}_0\) are typically biased, and the bias can be substantial if the dimension \(p\) of \(\boldsymbol{\beta}\) is large. For example, in the classical linear model (\(\boldsymbol{V}_0=\boldsymbol{I}\)) the ML estimate \(\widehat{\sigma}^2_{\text{ML}} = \mbox{RSS}/(nm)\) is biased, whereas \(\widehat{\sigma}^2_{\text{REML}} = \mbox{RSS}/(nm-p)\) is unbiased. The latter estimate is a Restricted Maximum Likelihood (REML) estimate, and REML is the default in the function \(\texttt{gls()}\). Also note that if the sample size \(nm\) is large, then ML and REML estimates are very similar.
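
The difference between the two variance estimates is easy to check numerically in the classical linear model (\(\boldsymbol{V}_0 = \boldsymbol{I}\)); in the base-R sketch below, \(N\) plays the role of the total sample size \(nm\) and all values are illustrative.

```r
## ML vs REML estimates of sigma^2 in the classical linear model (V_0 = I);
## simulated data with true sigma^2 = 4, N plays the role of nm.
set.seed(1)
N <- 30; p <- 5
X <- cbind(1, matrix(rnorm(N * (p - 1)), N))
y <- drop(X %*% rep(1, p) + rnorm(N, sd = 2))

fit <- lm(y ~ X - 1)                 # X already contains the intercept column
RSS <- sum(residuals(fit)^2)

sigma2_ML   <- RSS / N               # biased downwards, especially for large p
sigma2_REML <- RSS / (N - p)         # unbiased; equals summary(fit)$sigma^2
```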

F.3 Random Effects Model

F.3.1 General formulation

Suppose there are \(p\) covariates and we want to associate \(q\) of them with random effects. We then include individual-specific random effects \(\boldsymbol{U}_i\) in the model: \[\begin{equation} Y_{ij} \,\vert\,\boldsymbol{U}_i,{\epsilon}_{ij} = \boldsymbol{x}_{ij}^{\top}\boldsymbol{\beta} + \boldsymbol{d}_{ij}^{\top}\boldsymbol{U}_i + \epsilon_{ij} \end{equation}\] where the \(\boldsymbol{U}_i\) are mutually independent with \(\boldsymbol{U}_i\sim \mathop{\mathrm{N}}_q(\boldsymbol{0}, \boldsymbol{G})\), and \(\epsilon_{ij} \sim \mathop{\mathrm{N}}(0, \tau^2)\). The vector \(\boldsymbol{d}_{ij}\) (\(q \times 1\)) is known and in general a sub-vector of the covariates \(\boldsymbol{x}_{ij}\) (\(p \times 1\)). We thus obtain a linear mixed effects model with fixed effects \(\boldsymbol{\beta}\) and random effects \(\boldsymbol{U}_i\).

We might want to include random effects when we want to capture individual heterogeneity in the mean response. A random effects model is also useful because it induces marginal correlation between observations \(Y_{ij}\) and \(Y_{ik}\) from the same individual (see equation (F.6)).

Parameter estimation The marginal moments \[\begin{align*} \mathop{\mathrm{\mathsf{E}}}(\boldsymbol{Y}_i) & = \boldsymbol{X}_i \boldsymbol{\beta} \tag{F.5} \\ \mathop{\mathrm{Cov}}(\boldsymbol{Y}_i) & = \boldsymbol{D}_i \boldsymbol{G}\boldsymbol{D}_i^{\top} + \tau^2 \boldsymbol{I}_{n_i} \tag{F.6} \end{align*}\] determine the (normal) likelihood to estimate \(\boldsymbol{\beta}\), \(\boldsymbol{G}\) and \(\tau^2\) via (RE)ML. Here \(\boldsymbol{D}_i = (\boldsymbol{d}_{i,1},\ldots,\boldsymbol{d}_{i,n_i})^{\top}\) (\(n_i \times q\)).
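
As an illustration of (F.6), the marginal covariance of one individual can be computed directly; the time points, \(\boldsymbol{G}\) and \(\tau^2\) below are made-up values for a model with a random intercept and slope (\(q = 2\)).

```r
## Marginal covariance (F.6) of one individual under a random intercept and
## slope specification; all values are illustrative.
t_i  <- c(0, 1, 2, 3)                       # time points of individual i
D_i  <- cbind(1, t_i)                       # n_i x q matrix of random-effect covariates
G    <- matrix(c(1.0, 0.2,
                 0.2, 0.5), nrow = 2)       # covariance matrix of the random effects
tau2 <- 0.8                                 # measurement error variance

Cov_Yi <- D_i %*% G %*% t(D_i) + tau2 * diag(length(t_i))
cov2cor(Cov_Yi)                             # induced marginal correlations
```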

It is also possible to estimate (“predict”) the random effects \(\boldsymbol{U}_i\) and to compute standard errors for them; this is done with the empirical Bayes (EB) estimate, also known as the empirical BLUP.

F.3.2 Random intercept model

The random intercept model is a special case of a random effects model. In this case, \(q=1\) and \({d}_{ij}=1\).

With \(\boldsymbol{x}_{ij}^{\top} = (1, \tilde{\boldsymbol{x}}_{ij}^{\top})\) we have \[ Y_{ij}\,\vert\,{U}_i,{\epsilon}_{ij} = (1, \tilde{\boldsymbol{x}}_{ij}^{\top}) \cdot \boldsymbol{\beta} + {U}_i + \epsilon_{ij}. \]

This can be rewritten as \[ Y_{ij}\,\vert\,{\tilde U}_i,{\epsilon}_{ij} = {\tilde U}_i + \tilde{\boldsymbol{x}}_{ij}^{\top}\tilde{\boldsymbol{\beta}} + \epsilon_{ij} \] where \({\tilde U}_i = \beta_0 + U_i \sim \mathop{\mathrm{N}}(\beta_0, \nu^2)\) represents the random intercept, varying between individuals with mean \(\beta_0\) and variance \(\nu^2\), and \(\tilde{\boldsymbol{\beta}}\) collects the remaining regression coefficients.
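
A random intercept model of this form can be fitted, for example, with \(\texttt{lme()}\) from the \(\texttt{nlme}\) package; the data frame and variable names below are illustrative. The \(\texttt{ranef()}\) call returns the empirical Bayes (BLUP) predictions of the random intercepts mentioned in Section F.3.1.

```r
## Random intercept model with lme() from the nlme package;
## 'dat' is simulated and purely illustrative.
library(nlme)

set.seed(1)
m <- 40; n <- 4
dat <- data.frame(id = factor(rep(1:m, each = n)),
                  x  = rnorm(m * n))
dat$y <- 2 + 0.5 * dat$x + rep(rnorm(m, sd = 1), each = n) + rnorm(m * n, sd = 0.7)

fit_ri <- lme(y ~ x, random = ~ 1 | id, data = dat)   # REML by default
fixef(fit_ri)        # fixed effects beta
VarCorr(fit_ri)      # variance components: between-individual and residual
head(ranef(fit_ri))  # empirical Bayes ("BLUP") predictions of the random intercepts
```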

F.3.3 Random slope model

Suppose observations \(Y_{ij}\) are made at time points \(t_{ij}\), which are included in the covariate vector \(\boldsymbol{x}_{ij}=(1,t_{ij},\tilde{\boldsymbol{x}}_{ij}^{\top})^{\top}\).

Suppose \(q=2\) and \(\boldsymbol{d}_{ij}=(1, t_{ij})^{\top}\), then the model becomes \[ Y_{ij}\,\vert\,\boldsymbol{U}_i,{\epsilon}_{ij} = (1, t_{ij},\tilde{\boldsymbol{x}}_{ij}^{\top}) \cdot \boldsymbol{\beta} + (1, t_{ij}) \cdot \boldsymbol{U}_i + \epsilon_{ij} \] where \(\boldsymbol{U}_i\) is a 2-dimensional random effect with mean zero and \(2 \times 2\) symmetric covariance matrix \(\boldsymbol{G}\). This is the random slope model (strictly speaking, a random intercept and random slope model).
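
A corresponding sketch for the random intercept and slope model, again with \(\texttt{lme()}\) and illustrative data: specifying a random intercept and a random slope for \(\texttt{time}\) within \(\texttt{id}\) corresponds to \(\boldsymbol{d}_{ij} = (1, t_{ij})^{\top}\) with an unstructured \(2 \times 2\) covariance matrix \(\boldsymbol{G}\).

```r
## Random intercept and slope model with lme(); 'dat' is simulated and
## purely illustrative.
library(nlme)

set.seed(1)
m <- 40; n <- 5
dat <- data.frame(id   = factor(rep(1:m, each = n)),
                  time = rep(0:(n - 1), times = m))
U <- cbind(rnorm(m, sd = 1), rnorm(m, sd = 0.3))   # individual intercepts and slopes
i <- as.integer(dat$id)
dat$y <- 1 + 0.4 * dat$time + U[i, 1] + U[i, 2] * dat$time + rnorm(m * n, sd = 0.5)

fit_rs <- lme(y ~ time, random = ~ time | id, data = dat)
fixef(fit_rs)      # fixed intercept and slope
VarCorr(fit_rs)    # variance components of (U_i0, U_i1), their correlation, residual
```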