Module 2 Linear Mixed Models

Reading

Galecki & Burzykowski (2013). Linear Mixed Effects Models using R.
Fox J (2008). Applied Regression Analysis and Generalized Linear Models 2nd ed.
Wood, S. Generalized Additive Models. Ch 2.
https://m-clark.github.io/mixed-models-with-R/random_intercepts.html
https://cran.r-project.org/web/packages/lme4/vignettes/lmer.pdf

Basic Concepts

Mixed models for data that are not independent, e.g. repeated measures or clustered data.
Correlated data: multiple observations on individuals, i.e., “units”, “groups”, “clusters”.
Correlated data allows us (at minimum) to gain information about the variability of measurements on individuals.
The cost of correlated data is that each additional correlated individual contributes less information than the previous one, and much less than a single person in a new cluster.
Longitudinal data: Measurements on the same individual are made repeatedly over time- a type of correlated data.
Multilevel data: Individuals are nested within groups.
Random intercept model: motivation and interpretation
Shrinkage estimation and BLUPs of random effects.
Random slope model
Hierarchical formulation of random effect model.

Examples of clustered data

Longitudinal data: Repeated measures data on each subject
- $y_{i t}$ denotes the measure at time $t$ for individual $i$ .
Clustered or multilevel data: Observations within nested group structures.
- A collection of units from a population of similar units, e.g., $y_{i j}$ denotes measure from person $i$ in demographic group $j$ .

Motivating Example

Example of both longitudinal data and grouped/nested data:

The Treatment of Lead-Exposed Children (TLC) was examined using a placebo-controlled randomized study of succimer (a chelating agent) in children with blood lead levels of 20-44 micrograms/dL. These data consist of four repeated measurements of blood lead levels obtained at baseline (or week 0) , week 1, week 4, and week 6 on 100 children who were randomly assigned to chelation treament with succimer or placebo.

Variables:

ID: Child ID number
Treatment Group: Placebo vs. Succimer
Week: week X (x=0, 1, 4, 6)
Lead: lead level

tlcdata <- readRDS("data/tlc.RDS")
head(tlcdata)

    id treatment week lead
1    1   placebo    0 30.8
101  1   placebo    1 26.9
201  1   placebo    4 25.8
301  1   placebo    6 23.8
2    2  succimer    0 26.5
102  2  succimer    1 14.8

library(ggplot2)
library(dplyr)
library(gridExtra) 

placebo <- tlcdata %>% 
  filter(treatment == "placebo") %>%
  ggplot() +
  geom_line(aes(x = week, y = lead, col = id)) +
  labs(x = "Week", y="Lead", title = "Placebo") +
  theme(legend.position = "none")

succimer <- tlcdata %>% 
  filter(treatment == "succimer") %>%
  ggplot() +
  geom_line(aes(x = week, y = lead, col = id, linetype = treatment)) +
  labs(x = "Week", y="Lead",title = "Succimer") +
  theme(legend.position = "none")

gridExtra::grid.arrange(placebo, succimer)

What are some scientific questions we may be interested in studying?

The Wrong Approach : Ignoring clustered or longitudinal structure= The wrong way to do it!!!

$\begin{aligned} y_{i j} & = β_{0} + {\hat{β}}_{1} x_{i j} + ϵ_{i j} \\ ϵ_{i j} & \overset{i i d}{\sim} N (0, σ^{2}) \end{aligned}$

fit.lm = lm(lead ~ week, data = tlcdata)
summary(fit.lm)


Call:
lm(formula = lead ~ week, data = tlcdata)

Residuals:
    Min      1Q  Median      3Q     Max 
-19.775  -3.749   0.328   4.454  43.330 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  22.9760     0.6079  37.793   <2e-16 ***
week         -0.4010     0.1670  -2.401   0.0168 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.966 on 398 degrees of freedom
Multiple R-squared:  0.01428,   Adjusted R-squared:  0.0118 
F-statistic: 5.765 on 1 and 398 DF,  p-value: 0.01681

What assumptions of OLS are we violating?

If observations were uncorrelated, i.e., if we had measured different subjects at 0, 1, 4, and 6 weeks post-exposure, then we could fit the standard linear model:

$y_{i t} = β_{0} + β_{1} I (t = 1) + β_{2} I (t = 4) + β_{3} I (t = 6) + e_{i t}$
- There is an indicator for each time point, where $t = 0$ is treated as the reference group.
- BUT the data are correlated. How do we handle this?
- We extend the linear model to allow correlation (covariance).
- A Better Approach: Linear Mixed Models

Approach 2: Mixed Models

Linear Mixed Models or LMMs

LMMs are extensions of regression in which intercepts and/or slopes are allowed to vary across individuals.
Mixed models were developed to deal with correlated data.
They allow for the inclusion of both fixed and random effects.
- Fixed effect: a global or constant parameter that does not vary.
- Random effect: Parameters that are random variables and are assigned a distributional assumption, e.g., $β \sim N (μ, σ^{2})$ .

Fixed effects parametrization

Taking our motivating example, we can assign a separate intercept for each child:

$\begin{aligned} y_{i j} & = β_{0 i} + β_{1} x_{i j} + ϵ_{i j}, \\ ϵ_{i j} & \overset{i i d}{\sim} N (0, σ^{2}) \end{aligned}$

Interpretation:

$β_{0 i}$ is the child-specific weight at zero and $β_{1}$ is the constant slope.
$σ^{2}$ captures within child variability.

Limitations:
- Estimating lots of parameters: subject-specific coefficients do not leverage overall population information and have less precision because of smaller sample size.
- Cannot forecast the grow curve of a new child.

Fitting a fixed effects model in R:

We treat the “group” or “nested” unit as a factor/dummy variable. Therefore, we DO NOT include the intercept (see Module 1). Including the “cluster” as a factor allows for individual intercepts across children. Use the $f a c t o r (i d)$ function to create a factor variable for child ID.

fit.strat = lm(lead ~week + factor(id)-1, data =tlcdata) #do not include the intercept!!
summary(fit.strat$coefficients[1])

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -0.401  -0.401  -0.401  -0.401  -0.401  -0.401

summary(fit.strat$coefficients[2])

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  27.93   27.93   27.93   27.93   27.93   27.93

Data are balanced (same number of observations for each child).
The slope of week is the same in the model with a single intercept and the model with child-specific intercepts.
Controlling for group-specific intercepts gives a smaller standard error for the slope of week because week is the same for all children.
Note that oftentimes, the standard error will be larger if there is sparse or unbalanced data.

In class exercise

What is the interpretation of $\hat{β_{j}}$ in this model?

Random Intercept Model

Consider the random intercept model with a vector of predictors $x_{i j}$ :

$\begin{aligned} y_{i j} & = μ + θ_{i} + x_{i j}^{T} β + ϵ_{i j} \\ ϵ_{i j} & \overset{i i d}{\sim} N (0, σ^{2}) \\ θ_{i} & \overset{i i d}{\sim} N (0, τ^{2}) \\ θ_{i} & ⊥ ϵ_{i j} \end{aligned}$

$μ$ denotes the overall intercept (population mean when $x_{i j} = 0$ .
$θ_{i}$ denotes the child specific difference from $μ$ .
$β_{0 i} = μ + θ_{i}$ participant specific intercept or child specific average.
$β$ time coefficient that does not vary across participants.
$τ^{2}$ denotes the between participant variability and here is the random effect variance.
$σ^{2}$ denotes residual variance or within group variability “measurement error”.

What makes this a mixed model?

$β_{0 i}$ is random vs. $β$ is global or fixed.

The following two models are equivalent:

Model 1: $y_{i j} = (μ + θ_{i}) + x_{i j}^{T} β + ϵ_{i j}, θ_{i} \sim N (0, τ^{2}) a n d ϵ_{i j} \sim N (0, σ^{2})$

Model 2: The hierarchical model framework

$\begin{aligned} y_{i j} & = β_{0 i} + x_{i j}^{'} β + ϵ_{i j} \\ β_{0 i} & \sim N (μ, τ^{2}) \\ ϵ_{i j} & \sim N (0, σ^{2}) \end{aligned}$

Model 2 is an example of a hierarchical model in which there are hierarchical structures placed on random effects. Here the random coeffcient $β_{0 i}$ have a higher level mean $μ$ .
Assumptions:
- $ϵ_{i j} ⊥ θ_{i}$ for all $i$ and $j$ .
- $θ_{i}$ are indentical and independent and normal for all $i$ .
- $ϵ_{i j}$ are identical and independent and normal for all $i$ and $j$ .

Properties of the Random Intercept Model

In-class exercise

What is the overall (average) trend?
What is total variability around the overall trend?
What is the conditional (group-specific) trend?
What is the conditional (within-group) residual variance?

Fitting a mixed effects model with a random intercept in R:

library(lmerTest)

Loading required package: lme4

Loading required package: Matrix


Attaching package: 'lmerTest'

The following object is masked from 'package:lme4':

    lmer

The following object is masked from 'package:stats':

    step

fit.randomeffects = lmer(lead ~ week + (1|id), data = tlcdata)
summary(fit.randomeffects)

Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: lead ~ week + (1 | id)
   Data: tlcdata

REML criterion at convergence: 2695.1

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.6337 -0.3461  0.0601  0.4131  6.1167 

Random effects:
 Groups   Name        Variance Std.Dev.
 id       (Intercept) 29.63    5.444   
 Residual             33.98    5.829   
Number of obs: 400, groups:  id, 100

Fixed effects:
            Estimate Std. Error       df t value Pr(>|t|)    
(Intercept)  22.9760     0.7030 161.6442  32.683  < 2e-16 ***
week         -0.4010     0.1222 299.0000  -3.281  0.00116 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
     (Intr)
week -0.478

Compared to a model fitted with group dummy variables, the week slope estimate and SE are identical.

Model Interpretation

Interpretation: The fixed effect captures the global or average change in lead level for every one unit change in time, i.e., from week 1 to week 2 ${\hat{β}}_{1} = - 0.410$ meaning that there is an average 0.4 decrease in lead for every unit increase in week which was found to be statistically significant $p < 0.01$ .

Now the intercept 22.97 corresponds to the population-average lead at week 0.

The SE of the slope $s e (\hat{β_{1}}) = 0.122$ capturing uncertainty associated with the estimate.

The random effect of child $β_{0 i}$ allows for child-specific random intercepts, where the variance of the intercepts is 29.63.

#Extracting the random effects theta_i's
random.eff = ranef(fit.randomeffects)$id
hist(random.eff$`(Intercept)`)

Let’s fit an alternative model to the data including a treatment effect:

library(lmerTest)
fit.randomeffects = lmer(lead ~ week + treatment +  (1|id), data = tlcdata)
summary(fit.randomeffects)

Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: lead ~ week + treatment + (1 | id)
   Data: tlcdata

REML criterion at convergence: 2670.2

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.4885 -0.3686 -0.0176  0.4514  6.3428 

Random effects:
 Groups   Name        Variance Std.Dev.
 id       (Intercept) 22.09    4.700   
 Residual             33.98    5.829   
Number of obs: 400, groups:  id, 100

Fixed effects:
                  Estimate Std. Error       df t value Pr(>|t|)    
(Intercept)        25.7648     0.8512 136.0165  30.269  < 2e-16 ***
week               -0.4010     0.1222 299.0000  -3.281  0.00116 ** 
treatmentsuccimer  -5.5775     1.1060  98.0000  -5.043  2.1e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            (Intr) week  
week        -0.395       
trtmntsccmr -0.650  0.000

In class exercise

What are the overall findings from this model? And how does this model differ from the previous model?

Covariance Structure

A random intercept model is also known as a two-level variance component model:

$\begin{aligned} y_{i j} & = μ + θ_{i} + β x_{i j} + ϵ_{i j} \\ θ_{i} & \overset{i i d}{\sim} N (0, τ^{2}) \\ ϵ_{i j} & \overset{i i d}{\sim} N (0, σ^{2}) \end{aligned}$

Random Intercept Model in Matrix Form

Random intercept model

Consider the mixed model with random intercepts for $n$ groups and define $N = \sum_{i = 1}^{n} r_{i}$ .

$y = Z θ + X β + ϵ$

where

$y = N \times 1$ vector of response
$Z = N \times n$ design matrix of indicator variables for each group.
$θ = n \times 1$ vector of random intercepts.
$X = N \times p$ design matrix of fixed effects (including overall intercept).
$β = p \times 1$ vector of fixed effects.
$ϵ = N \times 1$ vector of residual error.

Assumptions of the above model

$θ \sim N (0, τ^{2} I_{n \times n}$ .
$ϵ \sim N (0, σ^{2} I_{N \times N}$ ).

Summary I

What is the model?	In the random intercept model, we regress a vector of outcomes $y_{i j}$ for individual $i$ and within measure $j$ on values on a vector of predictor values $x_{i j}$ with: $Y = μ + θ_{i} + x_{i j}^{'} β + ϵ_{i j}$ $μ$ is the population mean or overall average across all subjects-times.
Model Assumptions	$ϵ_{i j} \sim N (0, σ^{2})$ $V a r (ϵ) = σ^{2}$ constant within subject variance $θ_{i} \sim N (0, τ^{2})$ . $V a r (θ_{i}) = τ^{2}$ across subject variance. $θ_{i} ⊥ ϵ_{i j}$
Interpretation	$θ_{i}$ individual deviations away from the population mean $μ$ . $β_{1}$ the global value of the change of* $Y$ for every one unit change of $X$ . $σ^{2} + τ^{2}$ is the total variance.

Intraclass Correlation

Note that the within-group covariance is: $C o v (y_{i j}, y_{i j^{'}}) = τ^{2}$ .
So the correlation between observations within within the same group is

$ρ = C o r r (y_{i j}, y_{i j^{'}}) = \frac{τ^{2}}{τ^{2} + σ^{2}} for all j \neq j^{'} .$

The value $ρ$ is often called the intraclass correlation. It measures the degree of similarity among same-group observations compared to the residual error $σ^{2}$ .
Describes how strongly units in the same group resemble eachother.
$ρ \to 0$ when $τ^{2} \to 0$ (i.e. same intercept)
$ρ \to 0$ when $σ^{2} \to \infty$ (i.e., growing measurement error).
$ρ \to 1$ when $τ^{2} \to \infty$ (i.e., large separation in intercepts).
$ρ \to 1$ when $σ^{2} \to 0$ (i.e., zero measurement error).
The above ICC has an exchangeable structure because the correlation is constant between any pair of within-group observations.

Simulation Varying Between Subject Variability

Simulation Varying Measurement Error

Shrinkage and Random Effects

To simplify the derivation first consider a random effects model without fixed effects:

$y_{i j} = θ_{i} + ϵ_{i j}, θ_{i} \overset{i i d}{\sim} N (0, τ^{2}), ϵ_{i j} \overset{i i d}{\sim} N (0, σ^{2})$ - The join density of the data and random effects is given by:

$\prod_{i, j} f (y_{i j}, θ_{i}) = \prod_{i, j} f (y_{i j} | θ_{i}) \cdot \prod_{i} g (θ_{i})$

Now consider the matrix formulation $y = Z θ + ϵ$ where $Z =$ deisgn matrix of indicator variables denoting the $i j^{t h}$ observation belongs to group $i$ .
Given the values of $σ^{2}$ and $τ^{2}$ , it is easy to find the closed form solution to this.

Note that:
- ${\hat{θ}}_{i} \to 0$ when $τ^{2} \to 0$ (shrinks all random intercepts to zero).
- ${\hat{θ}}_{i} \to {\bar{y}}_{i .}$ when $τ^{2} \to \infty$ (no shrinkage = raw group mean estimates).
- ${\hat{θ}}_{i} \to {\bar{y}}_{i .}$ when $σ^{2} \to 0$ (no shrinkage = raw group mean estimates).
- ${\hat{θ}}_{i} \to {\bar{y}}_{i .}$ when $r \to \infty$ (no shrinkage = raw group mean estimates).
$τ^{2}$ controls the amount of shrinkage and how much information to borrow across groups.

Hierarchical Formulation

Now we assume the random effects are centered around a common mean. $μ$ . Let’s take look at the hierarchical formulation.

$y_{i j} = θ_{i} + ϵ_{i j}, θ_{i} \sim N (μ, τ^{2}), ϵ_{i j} \sim N (0, σ^{2})$

The joint density of the data and random effects is a multivariate normal given by:

$\begin{aligned} \prod_{i, j} f (y_{i j}, θ_{i}) & = \prod_{i, j} f (y_{i j} | θ_{i}) \cdot \prod_{i} g (θ_{i}) \\ \propto e x p (- \frac{1}{2 σ^{2}} [(y - Z θ)^{T} (y - Z θ) + \frac{σ^{2}}{τ^{2}} (θ - μ)^{T} (θ - μ)]) \end{aligned}$

Taking the first derivative, setting to 0, and solve! (On your own time).

Borrowing of Information

We can express ${\hat{θ}}_{i}$ as

${\hat{θ}}_{i} = \frac{(1 / τ^{2}) μ + (r_{i} / σ^{2}) {\bar{y}}_{i .}}{1 / τ^{2} + (r_{i} / σ^{2})}$

Since $σ^{2} / r_{i}$ is the sample variance of the estimated sample mean ${\bar{y}}_{i .}$ , the above form shows that random effects can be viewed as weighted average of:
- subject-specific mean: ${\bar{y}}_{i .}$ .
- overall mean $μ$
We can also express ${\hat{θ}}_{i}$ in terms of the intraclass correlation $ρ = τ^{2} / (τ^{2} + σ^{2})$ .

${\hat{θ}}_{i} = \frac{ρ^{- 1} μ + r_{i} (1 - ρ)^{- 1} {\bar{y}}_{i .}}{ρ^{- 1} + r_{i} (1 - ρ)^{- 1}}$
Less shrinkage is expected for $ρ \to 1$ .
For unknown variance parameters ${\hat{θ}}_{i}$ is the BLUP: Best Linear Unbiased Predictor.
- Unbiased: $E ({\hat{θ}}_{i}) = μ$ .
- Best Linear Unbiased Prediction (BLUP) = Best means unbiased and minimum variance: $E ({\hat{θ}}_{i} - θ_{i})^{2}$ is minimized.

Note that
- ${\hat{θ}}_{i} \to μ$ when you $τ^{2} \to 0$ (shrink ALL random intercepts to a common mean $μ$ ).
- ${\hat{θ}}_{i} \to {\bar{y}}_{i .}$ when $τ^{2} \to \infty$ (no shrinkage = raw mean estimates).

Parameter Estimation

$y = Z θ + X β + ϵ$ - Since $θ$ and $ϵ$ are random effects, we can rewrite the above as

$\begin{aligned} y & = X β + ϵ^{*} \\ ϵ * & = Z θ + ϵ \end{aligned}$ - This is equivalent to integrating out the random effects. Then the marginal model is:

$y \sim N (X β, V)$

The MLE approach from simple linear regression does not work because data are now correlated. We extend the simple OLS estimate to include covariance $V$ :

MLE estimation

$\hat{β} = (X^{T} V^{- 1} X)^{- 1} X^{T} V^{- 1} y$

The MLE estimate of the variances is biased! (Similar to the estimate in simple linear regression).
An alternative is the restricted maximum likelihood (REML):

${\hat{σ}}_{M L E}^{2} = \frac{1}{n} \sum_{i = 1}^{n} (x_{i} - \bar{x})^{2}$

vs.

${\hat{σ}}_{R E M L}^{2} = \frac{1}{n - 1} \sum_{i = 1}^{n} (x_{i} - \bar{x})^{2}$

Random Intercept and Random Slope Model

What are some scientific questions we may be interested in studying?

Consider the mixed effects model with a vector of predictors $x_{i j}$ :

$y_{i j} = (β_{0} + θ_{0 i}) + (β_{1} + θ_{1 i}) x_{i j} + ϵ_{i j}, ϵ_{i j} \overset{i i d}{\sim} N (0, σ^{2})$

$_{0i} = _0 + $ participant specific intercept or child specific average.
$β_{1 i} = β_{1} + θ_{1 i}$ participant specific rate of change or slope.
$β_{0}, β_{1}$ the intercept and slope that does not vary across participants.
$σ^{2}$ within subject variability ie measurement error.

In class exercise

What do we do for $θ_{1 i}$ ?

Fitting the Random Intercept and Slope model in R

Let’s fit the model using the tlc data:

placebo_data = tlcdata %>% filter(treatment == "placebo")
fit_random <- lmer(lead ~ week + (week|id), data = placebo_data)
summary(fit_random)

Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: lead ~ week + (week | id)
   Data: placebo_data

REML criterion at convergence: 1053.3

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.20808 -0.49837 -0.02268  0.47100  2.30541 

Random effects:
 Groups   Name        Variance Std.Dev. Corr
 id       (Intercept) 23.3113  4.8282       
          week         0.1581  0.3976   0.02
 Residual              4.5016  2.1217       
Number of obs: 200, groups:  id, 50

Fixed effects:
            Estimate Std. Error       df t value Pr(>|t|)    
(Intercept) 25.68536    0.72018 49.01076  35.665  < 2e-16 ***
week        -0.37213    0.08437 49.00066  -4.411 5.64e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
     (Intr)
week -0.164

Interpretation:

Child-specific intercept: $(β_{0} + θ_{0 i}) \sim N (25.68, 23.31)$ .
Child-specific slope: $(β_{1} + θ_{1 i}) \sim N (- 0.372, 0.158)$ .
The correlation $r h o$ between $(β_{0 i} β_{1 i}) = 0.02$ .
The 95% CIs of the subject-specific slopes does not include zero in the distribution. This we reject the null that a child’s lead levels do not increase with time.

Comparing Models

Is the random intercept plus slope preferred over random intercept only?

AIC

AIC: Akaike’s Information Criterion.
AIC is often used in model selection. It is a popular approach to choosing which variables should be in the model. $A I C = - 2 l (θ) + 2 p$ where $l (θ)$ is the log-likelihood for all parameters $θ$ and $p$ is the number of parameters.
The MLE estimate of the variances is biased! (Similar to the estimate in simple linear regression).
The general rule of thumb is a differenc of 2+ is substantially better. A lower AIC is better.
The AIC uses the maximum likelihood so use $R E M L = F A L S E$ to set the parameter estimation to MLE.

BREAKOUT SESSION 2

Read in the Nepalese Children data from Canvas.

Motivating Example

Study Design:

Time-varying variables of 200 children collected up to 5 time points about 4 months apart.

age (month), indicator for current breastfeeding status, arm circumference (cm), height (cm), weight(kg).

Time invariant baseline information:

sex of the child, mother’s age at birth, indicator of mother’s literacy.

load("/Users/emilypeterson/Library/CloudStorage/OneDrive-EmoryUniversity/BIOS 526/BIOS526_Book/data/Nepal200.RData")

Review the data to make sure you understand the variables. \
Fit the random intercept model using the code given above where arm circumference is regressed on age .
Interpret the findings from this model.
Fit the same model, but using MLE instead. Do this by setting $R E M L = F A L S E$ as a parameter in your code.
Identify the differences in coefficient estimates.
Fit the random intercept and slope model using the code given above where arm circumference is regressed on age.
Interpret the findings of this model.
Compare the random intercept only, and the random intercept plus slope model using AIC metrics. $R E M L = F A L S E$ must be included in the $f i t < - l m e r (y \sim . . ., R E M L = F)$ function call. To get the AIC from a model fit using the function $A I C (f i t)$ .
Based on the AIC findings, which model is better?