Note 2 Simulating Multilevel Data

# Load required packages
library(tidyverse)
theme_set(theme_classic() +
            theme(panel.grid.major.y = element_line(color = "grey92")))
library(mnormt)
library(lme4)

2.1 Linear Growth Model

General Mixed Model in Matrix form:

$\begin{align*} Y_{ij} & = \boldsymbol{\mathbf{x}}_{ij} \boldsymbol{\mathbf{\beta}}_{j} + e_{ij} \\ \boldsymbol{\mathbf{\beta}}_{j} & = \boldsymbol{\mathbf{\gamma }}+ \boldsymbol{\mathbf{u}}_j, \end{align*}$

with distributional assumptions:

$\begin{align*} e_{ij} & \sim \mathcal{N}(0, \sigma^2) \\ \boldsymbol{\mathbf{u}}_{j} & \sim \mathcal{N}(\boldsymbol{\mathbf{0}}, \boldsymbol{\mathbf{G}}) \end{align*}$

In a linear growth model, the predictor is time, and it’s assumed that everyone in the data is measured with the same number of time points. A random slope is included, meaning that the growth rates are different across individuals. The two random effects, $u_0$ (random intercepts) and $u_1$ (random slopes), have a variance-covariance matrix of $\boldsymbol{\mathbf{G}}$ such that:

$\begin{equation} \boldsymbol{\mathbf{G}} = \textrm{Var}(u_{j}) = \textrm{Var}\left(\begin{bmatrix} u_{0j} \\ u_{1j} \end{bmatrix}\right) = \begin{bmatrix} \tau_{00} & \tau_{01} \\ \tau_{01} & \tau_{11} \end{bmatrix}. \end{equation}$

We’ll look into each step for simulating the data. For now, check out an example simulated data set first:

set.seed(2208)  # set the seed
J <- 20  # number of individuals (clusters)
cs <- 4  # number of time points (cluster size)
gam <- c(0, 0.5)  # fixed effects
G <- matrix(c(0.25, 0, 
              0, 0.125), nrow = 2)  # random effect variances (G-matrix)
sigma2 <- 1  # within-person variance (lv-1)
X <- cbind(1, seq_len(cs) - 1)  # for each individual
X <- X[rep(seq_len(cs), J), ]  # repeat each row cs times
pid <- seq_len(J)  # individual id
pid <- rep(pid, each = cs)  # repeat each ID cs times
# Generate person-level (lv-2) random effects
uj <- rmnorm(J, mean = rep(0, 2), varcov = G)
# Generate repeated-measure-level (lv-1) error term
eij <- rnorm(J * cs, sd = sqrt(sigma2))
# Compute beta_j's
betaj <- matrix(gam, nrow = J, ncol = 2, byrow = TRUE) + uj
# Compute outcome:
y <- rowSums(X * betaj[pid, ]) + eij
# Form a data frame
sim_dat1 <- tibble(y, time = X[ , 2], pid)

# Plot the data:
sim_dat1 %>% 
  ggplot(aes(x = time, y = y, color = factor(pid), group = pid)) + 
  geom_line() +  # individual trajectory
  geom_smooth(aes(group = 1), method = "lm")  # overall trend line

# Run multilevel model:
summary(lmer(y ~ time + (time | pid), data = sim_dat1))

># Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl =
># control$checkConv, : Model failed to converge with max|grad| = 0.00340355
># (tol = 0.002, component 1)

># Linear mixed model fit by REML ['lmerMod']
># Formula: y ~ time + (time | pid)
>#    Data: sim_dat1
># 
># REML criterion at convergence: 254.2
># 
># Scaled residuals: 
>#      Min       1Q   Median       3Q      Max 
># -2.58337 -0.52403  0.00055  0.45941  2.26654 
># 
># Random effects:
>#  Groups   Name        Variance Std.Dev. Corr
>#  pid      (Intercept) 0.33239  0.5765       
>#           time        0.02529  0.1590   0.70
>#  Residual             0.99704  0.9985       
># Number of obs: 80, groups:  pid, 20
># 
># Fixed effects:
>#             Estimate Std. Error t value
># (Intercept)  0.01368    0.22697   0.060
># time         0.46195    0.10600   4.358
># 
># Correlation of Fixed Effects:
>#      (Intr)
># time -0.489
># convergence code: 0
># Model failed to converge with max|grad| = 0.00340355 (tol = 0.002, component 1)

2.1.1 Overview of Simulation Process for Linear Growth Model

Set the seed

set.seed(2208)  # set the seed

Define Simulation Parameters:

The parameters are:

Number of individuals/clusters (J)
Number of time points (cluster size, cs)
Fixed effects (gam):
- $\gamma_0$ = grand intercept
- $\gamma_1$ = average slope
Variance-covariance matrix of random effects (G matrix)
- There are two random effects, so a 2 x 2 matrix
Within-person error variance ( $\sigma^2$ , sigma2)

J <- 20  # number of individuals (clusters)
cs <- 4  # number of time points (cluster size)
gam <- c(0, 0.5)  # fixed effects
G <- matrix(c(0.25, 0, 
              0, 0.125), nrow = 2)  # random effect variances (G-matrix)
sigma2 <- 1  # within-person variance (lv-1)

Define the design matrix (i.e., predictor matrix)

X <- cbind(1, seq_len(cs) - 1)  # for each individual
X <- X[rep(seq_len(cs), J), ]  # repeat each row cs times

Generating ID variable

pid <- seq_len(J)  # individual id
pid <- rep(pid, each = cs)  # repeat each ID cs times

Simulating random components

In this example, the three random components are: $u_{0j}$ , $u_{1j}$ , and $e_{ij}$ . $u_{0j}$ and $u_{1j}$ are multivariate normal, each of length J, and $e_{ij}$ is normal of length J * cs.

# Generate person-level (lv-2) random effects
uj <- rmnorm(J, mean = rep(0, 2), varcov = G)
# Generate repeated-measure-level (lv-1) error term
eij <- rnorm(J * cs, sd = sqrt(sigma2))

6a. Compute $\beta$ s

# Compute beta_j's
betaj <- matrix(gam, nrow = J, ncol = 2, byrow = TRUE) + uj

6b. Compute $y$ s

# Compute outcome:
y <- rowSums(X * betaj[pid, ]) + eij

Combine into a data frame

# Form a data frame
sim_dat1 <- tibble(y, time = X[ , 2], pid)

2.1.2 Define a Data Generating Function

Because the data generating process will be repeated numerous time, it is beneficial to make it into a function. There are, however, some decisions to be made regarding which part of the simulation should go into a function, and which part(s) should become the arguments of the function. For example, I can set a function that create the number of clusters, cluster sizes, and everything inside the function, and the function basically just rerun steps 1 to 7. Such a function, however, will not be very useful as it is likely that one may want to change these parameters to be different in various simulation conditions. In this example, I have decided to create a function that has the following arguments (parameters):

J, cs, gam, G, sigma2

I will treat X as the same, meaning that the simulation function likely will be only used for generating data with linear growth. If other growth shape will be compared in the simulation study, then it may be better to include X as an argument of the function so that it’s easier to change it.

After all, the fewer parameters there are in a function, the more specific the function is; the more parameters, the more general it is. There should be a balance between the two.

Okay, here is the function:

gen_lg_data <- function(J, cs, gam, G, sigma2 = 1) {
  X <- cbind(1, seq_len(cs) - 1)  # for each individual
  X <- X[rep(seq_len(cs), J), ]  # repeat each row cs times
  pid <- seq_len(J)  # individual id
  pid <- rep(pid, each = cs)  # repeat each ID cs times
  # Generate person-level (lv-2) random effects
  uj <- rmnorm(J, mean = rep(0, 2), varcov = G)
  # Generate repeated-measure-level (lv-1) error term
  eij <- rnorm(J * cs, sd = sqrt(sigma2))
  # Compute beta_j's
  betaj <- matrix(gam, nrow = J, ncol = 2, byrow = TRUE) + uj
  # Compute outcome:
  y <- rowSums(X * betaj[pid, ]) + eij
  # Form a data frame
  sim_dat1 <- tibble(y, time = X[ , 2], pid)
  # Return data
  return(sim_dat1)
}

I can generate some data with 30 people and 4 time points:

gen_lg_data(30, 4, gam = c(0, 0.5), 
            G = matrix(c(0.1, 0, 
                         0, 0.01), nrow = 2))  # sigma2 is default to 1.0

># # A tibble: 120 x 3
>#          y  time   pid
>#      <dbl> <dbl> <int>
>#  1  1.60       0     1
>#  2  0.0834     1     1
>#  3  1.21       2     1
>#  4  2.37       3     1
>#  5 -0.386      0     2
>#  6 -0.276      1     2
>#  7  0.298      2     2
>#  8  0.443      3     2
>#  9 -0.0782     0     3
># 10 -0.0982     1     3
># # … with 110 more rows

Note that I set the default of sigma2 to 1 as this can be constant across conditions. This is useful for a parameter that are mostly constant, but I just want to make the function work in case I need the parameter to be something different.

I can generate with 50 people and 6 time points:

gen_lg_data(50, 6, gam = c(0, 0.5), 
            G = matrix(c(0.1, 0, 
                         0, 0.01), nrow = 2))  # sigma2 is default to 1.0

># # A tibble: 300 x 3
>#         y  time   pid
>#     <dbl> <dbl> <int>
>#  1  1.54      0     1
>#  2  1.34      1     1
>#  3  1.91      2     1
>#  4  0.978     3     1
>#  5  2.04      4     1
>#  6  2.02      5     1
>#  7  0.128     0     2
>#  8 -0.393     1     2
>#  9  0.457     2     2
># 10  0.140     3     2
># # … with 290 more rows

2.1.3 Define a Function to Run the Analysis

I just realized I haven’t told you the goal of the simulation. Let’s say I’m interested in what will happen to the fixed effect estimates when the random slope term is ignored. To do that, for each generated data set, I need to fit a multilevel model ignoring the random slope. Here’s a function to do that:

# Function to run a random intercept model
run_ri <- function(df) {
  # Only requires input of a data frame
  lmer(y ~ time + (1 | pid), data = df)
}
# Test it on our simulated data set:
run_ri(sim_dat1)

># Linear mixed model fit by REML ['lmerMod']
># Formula: y ~ time + (1 | pid)
>#    Data: df
># REML criterion at convergence: 255.329
># Random effects:
>#  Groups   Name        Std.Dev.
>#  pid      (Intercept) 0.7554  
>#  Residual             1.0187  
># Number of obs: 80, groups:  pid, 20
># Fixed Effects:
># (Intercept)         time  
>#     0.01368      0.46195

2.1.4 Run the Simulation

Now, we can create a for loop to run the simulation:

# Set the seed
set.seed(2208)
# Note: use capital letters for GLOBAL VARIABLES
NREP <- 100  # number of replications; should be increased in practice
J <- 20  # number of individuals (clusters)
CS <- 4  # number of time points (cluster size)
GAM <- c(0, 0.5)  # fixed effects
G <- matrix(c(0.25, 0, 
              0, 0.125), nrow = 2)  # random effect variances (G-matrix)
# Initialize place holders for results
sim_result <- vector("list", length = NREP)
# Create a for loop:
for (i in seq_len(NREP)) {
  sim_dat <- gen_lg_data(J, CS, GAM, G)
  sim_result[[i]] <- run_ri(sim_dat)
}
# Check the last result:
sim_result[[NREP]]

># Linear mixed model fit by REML ['lmerMod']
># Formula: y ~ time + (1 | pid)
>#    Data: df
># REML criterion at convergence: 264.3811
># Random effects:
>#  Groups   Name        Std.Dev.
>#  pid      (Intercept) 0.6601  
>#  Residual             1.1184  
># Number of obs: 80, groups:  pid, 20
># Fixed Effects:
># (Intercept)         time  
>#     -0.2072       0.6748

2.1.5 Extract Target Statistics

Let’s extract the fixed effect for the time variable for each result. We can use the purrr::map() function (or you can do a for loop too):

# The `fixef` function can extract the fixed effect coefficients
fixef(sim_result[[1]])

># (Intercept)        time 
>#  0.01367995  0.46194936

# The following extracts the standard error
sqrt(diag(vcov(sim_result[[1]])))

># [1] 0.2546643 0.1018698

# And confidence intervals:
confint(sim_result[[1]], parm = "time")

>#          2.5 %    97.5 %
># time 0.2607471 0.6631516

# Now get the fixed effect and the standard error for all replications
fixefs_time <- map(sim_result, 
                   ~ tibble(est = fixef(.x)[2], 
                            se = sqrt(diag(vcov(.x))[2]), 
                            ci = confint(.x, parm = "time")) %>% 
                     # expand CI to two columns
                     transmute(est, se, li = ci[1], ui = ci[2])) %>% 
  # merge the results
  bind_rows()

2.1.6 Summarize the Results

We can plot the sampling distribution of the fixed effect coefficient

fixefs_time %>% 
  ggplot(aes(x = est)) + 
  geom_density()

The main evaluation is the bias and SE bias:

# bias:
summarize(fixefs_time, 
          ave_est = mean(est), 
          ave_se = mean(se), 
          sd_est = sd(est), 
          ci_coverage = mean(li <= GAM[2] & ui >= GAM[2])) %>% 
  # Compute bias and SE bias
  mutate(bias = ave_est - GAM[2], 
         rbias = bias / GAM[2], 
         se_bias = ave_se - sd_est, 
         rse_bias = se_bias / sd_est, 
         rmse = bias^2 + sd_est^2)

># # A tibble: 1 x 9
>#   ave_est ave_se sd_est ci_coverage     bias   rbias se_bias rse_bias
>#     <dbl>  <dbl>  <dbl>       <dbl>    <dbl>   <dbl>   <dbl>    <dbl>
># 1   0.492  0.110  0.129        0.86 -0.00826 -0.0165 -0.0191   -0.149
># # … with 1 more variable: rmse <dbl>

2.2 Exercise

Create a new function, run_rs(), which analyze the data with the random slope term, by modifying the run_ri() function.
Replicate the previous simulation, but with the new function.
Compare the results with and without random slopes.