Chapter 9 Linear Mixed Models

Example 9.1 (Dependent Samples on the Mean) Consider inference on a population’s mean. Supposedly, more observations imply more information. This, however, is not the case if samples are completely dependent. More observations do not add any new information. From this example one may think that dependence reduces information. This is a false intuition: negative correlations imply oscillations about the mean, so they are actually more informative on the mean than independent observations.
Example 9.2 (Repeated Measures) Consider a prospective study, i.e., data that originates from selecting a set of subjects and making measurements on them over time. Also assume that some subjects received some treatment, and other did not. When we want to infer on the population from which these subjects have been sampled, we need to recall that some series of observations came from the same subject. If we were to ignore the subject of origin, and treat each observation as an independent sample point, we will think we have more information on treatment effects than we actually do, i.e., we will have a false sense of security in our inference.

Sources of variability, i.e. noise, are known in the statistical literature as “random effects”. Specifying these sources determines the correlation structure in our measurements. In the simplest linear models of Chapter 7, we thought of the variability as originating from measurement error, thus independent of anything else. No-correlation, and fixed variability is known as sphericity. Sphericity is of great mathematical convenience, but quite often, unrealistic.

The effects we want to infer on are assumingly non-random, and known “fixed-effects”. Sources of variability in our measurements, known as “random-effects” are usually not the object of interest. A model which has both random-effects, and fixed-effects, is known as a “mixed effects” model. If the model is also linear, it is known as a linear mixed model (LMM). Here are some examples where LMMs arise.

Example 9.3 (Fixed and Random Machine Effect) Consider a problem from industrial process control: testing for a change in diamteters of manufactured bottle caps. We want to study the fixed effect of time: before versus after. Bottle caps are produced by several machines. Clearly there is variablity in the diameters within-machine and between-machines. Given a sample of bottle caps from many machines, we could standardize measurements by removing each machine’s average. This implies we treat machines as fixed effects, subtract them, and consider within-machine variability is the only source of variability. The subtraction of the machine effect, removed information on between-machine variability.
Alternatively, we could consider between-machine variability as another source of uncertainty when inferring on the temporal fixed effect. In which case, would not subtract the machine-effect, bur rather, treat it as a random-effect, in the LMM framework.
Example 9.4 (Fixed and Random Subject Effect) Consider an experimenal design where each subject is given 2 types of diets, and his health condition is recorded. We could standardize over subjects by removing the subject-wise average, before comparing diets. This is what a paired (t-)test does. This also implies the within-subject variability is the only source of variability we care about. Alternatively, for inference on the population of “all subjects” we need to adress the between-subject variability, and not only the within-subject variability.

The unifying theme of the above examples, is that the variability in our data has several sources. Which are the sources of variability that need to concern us? This is a delicate matter which depends on your goals. As a rule of thumb, we will suggest the following view: If information of an effect will be available at the time of prediction, treat it as a fixed effect. If it is not, treat it as a random-effect.

LMMs are so fundamental, that they have earned many names:

  • Mixed Effects: Because we may have both fixed effects we want to estimate and remove, and random effects which contribute to the variability to infer against.

  • Variance Components: Because as the examples show, variance has more than a single source (like in the Linear Models of Chapter 7).

  • Hierarchical Models: Because as Example 9.4 demonstrates, we can think of the sampling as hierarchical– first sample a subject, and then sample its response.

  • Multilevel Analysis: For the same reasons it is also known as Hierarchical Models.

  • Repeated Measures: Because we make several measurements from each unit, like in Example 9.4.

  • Longitudinal Data: Because we follow units over time, like in Example 9.4.

  • Panel Data: Is the term typically used in econometric for such longitudinal data.

Whether we are aiming to infer on a generative model’s parameters, or to make predictions, there is no “right” nor “wrong” approach. Instead, there is always some implied measure of error, and an algorithm may be good, or bad, with respect to this measure (think of false and true positives, for instance). This is why we care about dependencies in the data: ignoring the dependence structure will probably yield inefficient algorithms. Put differently, if we ignore the statistical dependence in the data we will probably me making more errors than possible/optimal.

We now emphasize:

  1. Like in previous chapters, by “model” we refer to the assumed generative distribution, i.e., the sampling distribution.

  2. In a LMM we specify the dependence structure via the hierarchy in the sampling scheme E.g. caps within machine, students within class, etc. Not all dependency models can be specified in this way! Dependency structures that are not hierarchical include temporal dependencies (AR, ARIMA, ARCH and GARCH), spatial, Markov Chains, and more. To specify dependency structures that are no hierarchical, see Chapter 8 in (the excellent) Weiss (2005).

  3. If you are using LMMs for predictions, and not for inference on the fixed effects or variance components, then see the Supervised Learning Chapter 10. Also recall that machine learning from non-independent observations (such as LMMs) is a delicate matter.

9.1 Problem Setup

We denote an outcome with \(y\) and assume its sampling distribution is given by (the generative process) \[\begin{align} y|x,u = x'\beta + z'u + \varepsilon \tag{9.1} \end{align}\] where \(x\) are the factors with (fixed) effects we want to study, and\(\beta\) denotes these effects. The factors \(z\), with effects \(u\), merely contribute to variability in \(y|x\).

In matrix notation:

\[ y = X \beta + Zu + \epsilon,\] where:

here \(y\) is a vector of observations (not scalar as above), with mean \(E(y) = X \beta\); \(\beta\) is unknown vector of fixed effects; \(u\) is an unknown vector of random effects, with mean \(E(u)=0\), and covariance matrix \(var(u) = G\); \(\epsilon\) is an unknown vector (again, not scalar) of random errors, with mean \(E(\epsilon)=0\) and variance \(var(\epsilon)=R\) (usually \(R = \sigma^2 I_{N \times N}\)); \(X\) and \(Z\) are known design matrices. See this example for better understanding the matrix notation.

In LMM we typically assume: \(u \sim \mathcal{N}(0,G);\ \epsilon \sim \mathcal(0,R); \ Cov(u,\epsilon)=0\). Note: the random effects are assumed to be sampled from a multivariate Gaussian distribution \(\mathcal{N}(0,G)\). The fixed and random effects \(\beta\) and \(u\) are given by maximizing the joint density \(f(y,u)\).

In our repeated measures example (9.2) the treatment is a fixed effect, and the subject is a random effect. In our bottle-caps example (9.3) the time (before vs. after) is a fixed effect, and the machines may be either a fixed or a random effect (depending on the purpose of inference). In our diet example (9.4) the diet is the fixed effect and the subject is a random effect.

Notice that we state \(y|x,z\) merely as a convenient way to do inference on \(y|x\). We could, instead, specify \(Var[y|x]\) directly. The second approach seems less convenient. This is the power of LMMs! We specify the covariance not via the matrix \(Var[z'u|x]\), or \(Var[y|x]\), but rather via the sampling hierarchy.

Given a sample of \(n\) observations \((y_i,x_i,z_i)\) from model (9.1), we will want to estimate \((\beta,u)\). Under the assumption on the distribution of \(\varepsilon\) and \(z\) mentioned above, we can use maximum likelihood (ML). In the context of LMMs, however, ML is typically replaced with restricted maximum likelihood (ReML), because it returns unbiased estimates of \(Var[y|x]\) and ML does not.

9.1.1 Non-Linear Mixed Models

The idea of random-effects can also be extended to non-linear mean models. Formally, this means that \(y|x,z=f(x,z,\varepsilon)\) for some non-linear \(f\). This is known as non-linear-mixed-models, which will not be discussed in this text.

9.1.2 Generalized Linear Mixed Models (GLMM)

You can marry the ideas of random effects, with non-linear link functions, and non-Gaussian distribution of the response. These are known as Generalized Linear Mixed Models (GLMM), which will not be discussed in this text.

9.2 LMMs in R

We will fit LMMs with the lme4::lmer function. The lme4 is an excellent package, written by the mixed-models Guru Douglas Bates. We start with a small simulation demonstrating the importance of acknowledging your sources of variability. Our demonstration consists of fitting a linear model that assumes independence, when data is clearly dependent.

n.groups <- 4 # number of groups
n.repeats <- 2 # samples per group
groups <- rep(1:n.groups, each=n.repeats) %>% as.factor
n <- length(groups)
z0 <- rnorm(n.groups, 0, 10) 
(z <- z0[as.numeric(groups)]) # generate and inspect random group effects
## [1]   8.242268   8.242268 -13.706530 -13.706530  10.091668  10.091668
## [7]  -2.141764  -2.141764
epsilon <- rnorm(n,0,1) # generate measurement error
beta0 <- 2 # this is the actual parameter of interest! The global mean.
y <- beta0 + z + epsilon # sample from an LMM

We can now fit the linear and LMM.

# fit a linear model assuming independence
lm.5 <- lm(y~1)  
# fit a mixed-model that deals with the group dependence
lme.5 <- lmer(y~1|groups) 

The summary of the linear model

summary.lm.5 <- summary(lm.5)
## Call:
## lm(formula = y ~ 1)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.949  -7.275   1.629   8.668  10.005 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    3.317      3.500   0.948    0.375
## Residual standard error: 9.898 on 7 degrees of freedom

The summary of the LMM

summary.lme.5 <- summary(lme.5)
## Linear mixed model fit by REML ['lmerMod']
## Formula: y ~ 1 | groups
## REML criterion at convergence: 41
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -1.15395 -0.50048  0.04306  0.55891  0.99797 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  groups   (Intercept) 111.962  10.581  
##  Residual               2.012   1.418  
## Number of obs: 8, groups:  groups, 4
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)    3.317      5.314   0.624

Look at the standard error of the global mean, i.e., the intercept: for lm it is 3.5, and for lme it is 5.314. Why this difference? Because lm treats the group effect as fixed, while the mixed model treats the group effect as a source of noise/uncertainty. Inference using lm underestimates our uncertainty in the estimated population mean (\(\beta_0\)). This is that false-sense of security we may have when ignoring correlations. Relation to Paired t-test

Recall the paired t-test. Our two-sample–per-group example of the LMM is awfully similar to a paired t-test. It would be quite troubling if the well-known t-test and the oh-so-powerful LMM would lead to diverging conclusions. In the previous, we inferred on the global mean; a quantity that cancels out when pairing. For a fair comparison, let’s infer on some temporal effect. Compare the t-statistic below, to the t value in the summary of lme.6. Luckily, as we demonstrate, the paired t-test and the LMM are equivalent. So if you follow authors like (???) that recommend LMMs instead of pairing, remember, these things are sometimes equivalent.

time.fixed.effect <- rep(c('Before','After'), times=4) %>% factor
##              y groups time.fixed.effect
## [1,]  10.76759      1                 2
## [2,]  10.36614      1                 1
## [3,] -11.57154      2                 2
## [4,] -10.24338      2                 1
## [5,]  13.18838      3                 2
## [6,]  13.04653      3                 1
lme.6 <- lmer(y~time.fixed.effect+(1|groups)) 
##                            Estimate Std. Error    t value
## (Intercept)              3.31805208  5.4773030  0.6057821
## time.fixed.effectBefore -0.04630831  0.4374409 -0.1058619
t.test(y~time.fixed.effect, paired=TRUE)$statistic
##         t 
## 0.1058618

9.2.1 A Single Random Effect

We will use the Dyestuff data from the lme4 package, which encodes the yield, in grams, of a coloring solution (dyestuff), produced in 6 batches using 5 different preparations.

data(Dyestuff, package='lme4')
##   Batch Yield
## 1     A  1545
## 2     A  1440
## 3     A  1440
## 4     A  1520
## 5     A  1580
## 6     B  1540

And visually


The plot confirms that Yield varies between Batchs. We do not want to study this batch effect, but we want our inference to apply to new, unseen, batches15. We thus need to account for the two sources of variability when inferring on the (global) mean: the within-batch variability, and the between-batch variability We thus fit a mixed model, with an intercept and random batch effect.

lme.1<- lmer( Yield ~ 1 + (1|Batch)  , Dyestuff )
## Linear mixed model fit by REML ['lmerMod']
## Formula: Yield ~ 1 + (1 | Batch)
##    Data: Dyestuff
## REML criterion at convergence: 319.7
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.4117 -0.7634  0.1418  0.7792  1.8296 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Batch    (Intercept) 1764     42.00   
##  Residual             2451     49.51   
## Number of obs: 30, groups:  Batch, 6
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  1527.50      19.38    78.8

Things to note:

  • The syntax Yield ~ (1|Batch) tells lme4::lmer to fit a model with a global intercept (1) and a random Batch effect (1|Batch). The | operator is the cornerstone of random effect modeling with lme4::lmer.
  • 1+ isn’t really needed. lme4::lmer, like stats::lm adds it be default. We put it there to remind you it is implied.
  • As usual, summary is content aware and has a different behavior for lme class objects.
  • The output distinguishes between random effects (\(u\)), a source of variability, and fixed effect (\(\beta\)), which we want to study. The mean of the random effect is not reported because it is unassumingly 0.
  • Were we not interested in standard errors, lm(Yield ~ Batch) would have returned the same (fixed) effects estimates.

Some utility functions let us query the lme object. The function coef will work, but will return a cumbersome output. Better use fixef to extract the fixed effects, and ranef to extract the random effects. The model matrix (of the fixed effects alone), can be extracted with model.matrix, and predictions with predict.

9.2.2 A Full Mixed-Model

In the sleepstudy data, we recorded the reaction times to a series of tests (Reaction), after various subject (Subject) underwent various amounts of sleep deprivation (Day).

We now want to estimate the (fixed) effect of the days of sleep deprivation on response time, while allowing each subject to have his/hers own effect. Put differently, we want to estimate a random slope for the effect of day. The fixed Days effect can be thought of as the average slope over subjects.

lme.3 <- lmer ( Reaction ~ Days + ( Days | Subject ) , data= sleepstudy )

Things to note:

  • ~Days specifies the fixed effect.
  • We used the (Days|Subject) syntax to tell lme4::lmer we want to fit the model ~Days within each subject. Just like when modeling with stats::lm, (Days|Subject) is interpreted as (1+Days|Subject), so we get a random intercept and slope, per subject.
  • Were we not interested in standard errors, stats::lm(Reaction~Days*Subject) would have returned (almost) the same effects. Why “almost”? See below…

The fixed day effect is:

## (Intercept)        Days 
##   251.40510    10.46729

The variability in the average response (intercept) and day effect is

ranef(lme.3) %>% lapply(head)
## $Subject
##     (Intercept)       Days
## 308    2.258565  9.1989719
## 309  -40.398577 -8.6197032
## 310  -38.960246 -5.4488799
## 330   23.690498 -4.8143313
## 331   22.260203 -3.0698946
## 332    9.039526 -0.2721707

Did we really need the whole lme machinery to fit a within-subject linear regression and then average over subjects? The short answer is that if we have a enough data for fitting each subject with it’s own lm, we don’t need lme. The longer answer is that the assumptions on the distribution of random effect, namely, that they are normally distributed, allow us to pool information from one subject to another. In the words of John Tukey: “we borrow strength over subjects”. If the normality assumption is true, this is very good news. If, on the other hand, you have a lot of samples per subject, and you don’t need to “borrow strength” from one subject to another, you can simply fit within-subject linear models without the mixed-models machinery. This will avoid any assumptions on the distribution of effects over subjects. For a full discussion of the pro’s and con’s of hierarchical mixed models, consult our Bibliographic Notes.

To demonstrate the “strength borrowing”, here is a comparison of the lme, versus the effects of fitting a linear model to each subject separately.

Here is a comparison of the random-day effect from lme versus a subject-wise linear model. They are not the same.

After fitting a LMM, we can do some model diagnostics, to verify if the assumptions hold: plot residuals vs. fitted values, and residuals vs. other features: residuals should have no trend, and equal variance. plot residuals vs. fitted values with each random effect group: residuals should be normal.

9.3 Another LMM example

In this example we analyze air-temperature prediction with LMM. Note that LMM may not be the best approach to capture continuous spatial effects (see this paper for instance), yet, it is very convenient, computationally cheep, and with proper modeling, in many cases it can be quite accurate in relation to heavy machine learning artillery.

We load a dataset that contains the minimal temperature (tmin), some remotely sensed measures (as aqua_night_lst), and other spatial/ spatio-temporal data, for several coordinates (locations of monitoring stations) in France, and for 50 days:

db <- readRDS("./data/db_france.Rda")[, 1:21]
##    date  stn_id      lat      lon stn_type         grd_1km_id climate_type
## 1:    1 2031001 49.83389 4.200556        4 49.837500-4.205703            2
## 2:    1 2153001 49.55333 3.498333        4 49.554167-3.500435            3
## 3:    1 2278001 49.86667 4.066667        4 49.862500-4.065677            2
## 4:    1 2298001 49.98778 3.646944        4 49.995833-3.649153            3
## 5:    1 2301002 49.52722 3.461944        4 49.529167-3.460127            3
## 6:    1 2321001 49.83639 3.879444        4 49.829167-3.882015            2
##    tmin t_ordered t_orderedf reg   stn_elev aqua_night_lst   aqua_emis
## 1:  2.5         1          1  22 -0.4056336    -0.34267324 -0.05906734
## 2:  1.0         1          1  22 -0.6055388     0.09128599  0.24306404
## 3:  6.0         1          1  22 -0.3929814    -0.29138715  0.24306404
## 4:  0.0         1          1  22 -0.5523994    -0.20854039 -0.05906734
## 5:  0.6         1          1  22 -0.5979474     0.02816465 -0.05906734
## 6:  0.0         1          1  22 -0.3955119    -0.34267324 -0.05906734
##     aqua_ndvi       elev         pop clc_artificial  clc_water   clc_bare
## 1: 0.49445289 -0.3932409  0.02314416     0.61350869 -0.2176221 -0.1343987
## 2: 0.47081709 -0.6069353 -0.13374497     0.07692565 -0.2176221 -0.1343987
## 3: 1.28726504 -0.3957257 -0.22623930    -0.73660348 -0.2176221 -0.1343987
## 4: 0.05820361 -0.5000881 -0.11149949    -0.15674761 -0.2176221 -0.1343987
## 5: 0.38978008 -0.6094201 -0.13725741     0.33656261 -0.2176221 -0.1343987
## 6: 0.69974667 -0.4926336 -0.23502041     0.02499826 -0.2176221 -0.1343987
##    clc_vegetation
## 1:    -0.54251044
## 2:    -0.01846593
## 3:     0.77605315
## 4:     0.20974700
## 5:    -0.27203585
## 6:     0.03224805

Let’s check the average tmin per location over the time period:

db_mean <- db[, mean(tmin), by = .(lat,lon)]
ggplot(data=db_mean, aes(x=lon,y=lat,color = V1)) + geom_point() +
  scale_color_gradient2(high = "red", low = "blue", mid = "white")

In such data, one may expect several sources of variation. mostly: space, time, and space-time. We can model spatial random effects by regions:

coords <- unique(db[,.(stn_id, lon, lat, reg)])
ggplot(data=coords, aes(x=lon,y=lat,color = as.factor(reg))) + geom_point() 

And temporal random effects by date:

boxplot(tmin~date, data = db)

(note the trend and seasonality in the data)

In these problems the effects might not be stationary over space and time. In this prediction problem aqua_night_lst is the main feature. and it is known that it’s effect change over time and space. See for example its correlation with tmin over time or regions:

sapply(split(db, db$date), function(d) cor(d$tmin,d$aqua_night_lst)) %>% density() %>% plot()

sapply(split(db, db$reg), function(d) cor(d$tmin,d$aqua_night_lst)) %>% density() %>% plot()

We now split for training and test data according to stations (note that we split randomly over stations, not space-based):

stn_intrain <- sample(unique(db$stn_id), size = length(unique(db$stn_id))*0.6)
train <- db[stn_id %in% stn_intrain,]
test <- db[!stn_id %in% stn_intrain,]
points(unique(test[,.(lon,lat)]), col = "red")
legend("topleft", pch=1, col = 1:2, legend = c("Train","Test"))

Let’s start by fitting a LMM with random intercept for time and evaluate it on the test set. Formaly:

\[ tmin_{s,t} = \beta_0 + \sum_{k=1}^p \beta_k x^k_{s,t} + (u_t) + \beta_{LST} LST_{s,t} + \epsilon_{s,t}\] where: \(s\) indicate space, \(t\) indicate time, \(\beta_k\) is the k’th fixed-effect and \(x^k_{s,t}\) is the corresponding fiture, \(u_t\) is temporal random effect, \(\beta_{LST}\) is the effect of \(LST_{s,t}\), and \(\epsilon_{s,t}\) is an independent and normally distributed error term.

tmin_lmm1 <- lmer(tmin ~  
                    aqua_night_lst+ reg+
                    (1|date), data = train)
tmin_lmm1_prd <- predict(tmin_lmm1, newdata = test)
mean((test$tmin - tmin_lmm1_prd)^2)
## [1] 5.135398

Now let’s add a random intercept for region: \[ tmin_{s,t} = \beta_0 + \sum_{k=1}^p \beta_k x^k_{s,t} + (u_t + g_r) + \beta_{LST} LST_{s,t} + \epsilon_{s,t}\] \(g_r\) is region-wise random effect (that shoud capture spatial effects).

tmin_lmm2 <- lmer(tmin ~  
                    (1|date/reg), data = train)
tmin_lmm2_prd <- predict(tmin_lmm2, newdata = test,
mean((test$tmin - tmin_lmm2_prd)^2)
## [1] 3.444913

This is called nested random effects, as the region is nested within the date level: every reg:date level only occur at the higher date level.

Compare it to a fixed-effect model with fixed date:reg interaction (note: this requires much more computational resources):

tmin_lm <- lm(tmin ~
                date*reg, data = train)
tmin_lm_prd <- predict(tmin_lm, newdata = test,
mean((test$tmin - tmin_lm_prd)^2)
## [1] 3.470221

The results are a little less good.

Note: means that new levels (or NA values) in the test set are allowed. if (FALSE is default), then the prediction will use the unconditional (population-level) values for data with previously unobserved levels (or NAs). Why there are new levels at all? because there are date-reg interaction levels in the train that are not in the test.

Note: the (1|date/reg) specify the hirarchy (the order is important) of the random effects and is equivalent to (1|date) + (1|date:reg) (where : denotes an interaction).

Adding also random LST slope for region-time interaction: \[ tmin_{s,t} = \beta_0 + \sum_{k=1}^p \beta_k x^k_{s,t} + (u_t + g_r) + (\beta_{LST} + v_t + h_{r,t}) LST_{s,t} + \epsilon_{s,t}\] where \(v_t\) is random LST slope, and \(h_{r,t}\) is random daily-region slope.

tmin_lmm3 <- lmer(tmin ~  
                    (1+aqua_night_lst|date/reg), data = train)
tmin_lmm3_prd <- predict(tmin_lmm3, newdata = test,
mean((test$tmin - tmin_lmm3_prd)^2)
## [1] 2.951929

Can you think on more hierarchical levels of random effects?

Let’s see the estimated random slope over time and space:

reg_date_randef <- lme4::ranef(tmin_lmm3)$`reg:date`
reg_date_randef <- cbind(str_split_fixed(rownames(reg_date_randef), ":", 2),
colnames(reg_date_randef) <- c("reg", "date", "intc", "slope")
##      reg date       intc       slope
## 11:1  11    1 -0.6925359 -0.19053620
## 11:2  11    2  2.2432866 -0.94072561
## 11:3  11    3  0.2730236 -0.28907070
## 11:5  11    5 -2.0661077  0.11976649
## 11:6  11    6 -1.6479033  0.63986723
## 11:7  11    7 -1.0950623  0.07895034
test <- merge(x=test, y=reg_date_randef, by=c("reg","date"), all.x = T)
ggplot(data = test[date %in% 47:50], 
       aes(x=lon,y=lat, color = slope)) + geom_point() + facet_wrap(~date)

Let’s see the estimated region-wise random intercept:

ggplot(data = reg_date_randef, aes(x=intc)) + geom_density() + facet_wrap(~reg)

It seems that the regions’ random effects are differently distributed in different regions.

9.3.1 lmer formula

What would V1 ~ (1+V3*V4|V2) + V3*V4 estimate?

  • P1: A global intercept
  • P2: A single global estimate for the effect of V3
  • P3: A single global estimate for the effect of V4
  • P4: A single global estimate for the interaction between V3 and V4
  • P5: Deviations of the intercept from P1 in each level of V2
  • P6: Deviations of the V3 effect from P2 in each level of V2
  • P7: Deviations of the V4 effect from P3 in each level of V2
  • P8: Deviations of the V3-by-V4 interaction from P4 in each level of V2
  • P9 Correlation between P5 and P6 across levels of V2
  • P10 Correlation between P5 and P7 across levels of V2
  • P11 Correlation between P5 and P8 across levels of V2
  • P12 Correlation between P6 and P7 across levels of V2
  • P13 Correlation between P6 and P8 across levels of V2
  • P14 Correlation between P7 and P8 across levels of V2

9.3.2 Sparsity and Memory Efficiency

In Sparse Representations Chapter at R(BGU) course we discuss how to efficiently represent matrices in memory. At this point we can already hint that the covariance matrices implied by LMMs are sparse. This fact is exploited in the lme4 package, making it very efficient computationally.

9.4 Serial Correlations in Space/Time

As previously stated, a hierarchical model of the type \(y=x'\beta+z'u+\epsilon\) is a very convenient way to state the correlations of \(y|x\) instead of specifying the matrix \(Var[z'u+\epsilon|x]\) for various \(x\) and \(z\). The hierarchical sampling scheme implies correlations in blocks. What if correlations do not have a block structure? Temporal data or spatial data, for instance, tend to present correlations that decay smoothly in time/space. These correlations cannot be represented via a hierarchical sampling scheme.

One way to go about, is to find a dedicated package for space/time data. For instance, in the Spatio-Temporal Data task view, or the Ecological and Environmental task view.

Instead, we will show how to solve this matter using the nlme package. This is because nlme allows to compound the blocks of covariance of LMMs, with the smoothly decaying covariances of space/time models.

We now use an example from the help of nlme::corAR1. The nlme::Ovary data is panel data of number of ovarian follicles in different mares (female horse), at various times.
We fit a model with a random Mare effect, and correlations that decay geometrically in time. In the time-series literature, this is known as an auto-regression of order 1 model, or AR(1), in short.

## Grouped Data: follicles ~ Time | Mare
##   Mare        Time follicles
## 1    1 -0.13636360        20
## 2    1 -0.09090910        15
## 3    1 -0.04545455        19
## 4    1  0.00000000        16
## 5    1  0.04545455        13
## 6    1  0.09090910        10
fm1Ovar.lme <- nlme::lme(fixed=follicles ~ sin(2*pi*Time) + cos(2*pi*Time), 
                   data = Ovary, 
                   random = pdDiag(~sin(2*pi*Time)), 
                   correlation=corAR1() )
## Linear mixed-effects model fit by REML
##  Data: Ovary 
##        AIC     BIC   logLik
##   1563.448 1589.49 -774.724
## Random effects:
##  Formula: ~sin(2 * pi * Time) | Mare
##  Structure: Diagonal
##         (Intercept) sin(2 * pi * Time) Residual
## StdDev:    2.858385           1.257977 3.507053
## Correlation Structure: AR(1)
##  Formula: ~1 | Mare 
##  Parameter estimate(s):
##       Phi 
## 0.5721866 
## Fixed effects: follicles ~ sin(2 * pi * Time) + cos(2 * pi * Time) 
##                        Value Std.Error  DF   t-value p-value
## (Intercept)        12.188089 0.9436602 295 12.915760  0.0000
## sin(2 * pi * Time) -2.985297 0.6055968 295 -4.929513  0.0000
## cos(2 * pi * Time) -0.877762 0.4777821 295 -1.837159  0.0672
##  Correlation: 
##                    (Intr) s(*p*T
## sin(2 * pi * Time)  0.000       
## cos(2 * pi * Time) -0.123  0.000
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -2.34910093 -0.58969626 -0.04577893  0.52931186  3.37167486 
## Number of Observations: 308
## Number of Groups: 11

Things to note:

  • The fitting is done with the nlme::lme function, and not lme4::lmer.
  • sin(2*pi*Time) + cos(2*pi*Time) is a fixed effect that captures seasonality.
  • The temporal covariance, is specified using the correlations= argument.
  • AR(1) was assumed by calling correlation=corAR1(). See nlme::corClasses for a list of supported correlation structures.
  • From the summary, we see that a Mare random effect has also been added. Where is it specified? It is implied by the random= argument. Read ?lme for further details.

We can now inspect the contrivance implied by our model’s specification. As expected, we see the blocks of non-null covariance within Mare, but unlike “vanilla” LMMs, the covariance within mare is not fixed. Rather, it decays geometrically with time.

9.5 Extensions

9.5.1 Cluster Robust Standard Errors

As previously stated, random effects are nothing more than a convenient way to specify covariances within a level of a random effect, i.e., within a group/cluster. This is also the motivation underlying cluster robust inference, which is immensely popular with econometricians, but less so elsewhere. With cluster robust inference, we assume a model of type \(y=f(x)+\varepsilon\); unlike LMMs we assume independence (conditional on \(x\)), but we allow \(\varepsilon\) within clusters defined by \(x\). For a longer comparison between the two approaches, see Michael Clarck’s guide.

9.5.2 Linear Models for Panel Data

nlme and lme4 will probably provide you with all the functionality you need for panel data. If, however, you are trained as an econometrician, and prefer the econometric parlance, then the plm and panelr packages for panel linear models, are just for you. In particular, they allow for cluster-robust covariance estimates, and Durbin–Wu–Hausman test for random effects. The plm package vignette also has an interesting comparison to the nlme package.

9.5.3 Testing Hypotheses on Correlations

After working so hard to model the correlations in observation, we may want to test if it was all required. Douglas Bates, the author of nlme and lme4 wrote a famous cautionary note, found here, on hypothesis testing in mixed models, in particular hypotheses on variance components. Many practitioners, however, did not adopt Doug’s view. Many of the popular tests, particularly the ones in the econometric literature, can be found in the plm package (see Section 6 in the package vignette). These include tests for poolability, Hausman test, tests for serial correlations, tests for cross-sectional dependence, and unit root tests.

9.6 Bibliographic Notes

Most of the examples in this chapter are from the documentation of the lme4 package (Bates et al. 2015). For a general and very applied treatment, see Pinero and Bates (2000). As usual, a hands on view can be found in Venables and Ripley (2013), and also in an excellent blog post by Kristoffer Magnusson For a more theoretical view see Weiss (2005) or Searle, Casella, and McCulloch (2009). Sometimes it is unclear if an effect is random or fixed; on the difference between the two types of inference see the classics: (???), (???), and the more recent Rosset and Tibshirani (2018). For an interactive, beautiful visualization of the shrinkage introduced by mixed models, see Michael Clark’s blog. For more on predictions in linear mixed models see Robinson (1991), Rabinowicz and Rosset (2018), and references therein. See Michael Clarck’s guide for various ways of dealing with correlations within groups. For the geo-spatial view and terminology of correlated data, see Christakos (2000), Diggle, Tawn, and Moyeed (1998), Allard (2013), and Cressie and Wikle (2015).

9.7 Practice Yourself

  1. Computing the variance of the sample mean given dependent correlations. How does it depend on the covariance between observations? When is the sample most informative on the population mean?

  2. Think: when is a paired t-test not equivalent to an LMM with two measurements per group?

  3. Return to the Penicillin data set. Instead of fitting an LME model, fit an LM model with lm. I.e., treat all random effects as fixed.

    1. Compare the effect estimates.
    2. Compare the standard errors.
    3. Compare the predictions of the two models.
  4. [Very Advanced!] Return to the Penicillin data and use the gls function to fit a generalized linear model, equivalent to the LME model in our text.

  5. Read about the “oats” dataset using ? MASS::oats.Inspect the dependency of the yield (Y) in the Varieties (V) and the Nitrogen treatment (N).

    1. Fit a linear model, does the effect of the treatment significant? The interaction between the Varieties and Nitrogen is significant?
    2. An expert told you that could be a variance between the different blocks (B) which can bias the analysis. fit a LMM for the data.
    3. Do you think the blocks should be taken into account as “random effect” or “fixed effect”?
  6. Return to the temporal correlation in Section 9.4, and replace the AR(1) covariance, with an ARMA covariance. Visualize the data’s covariance matrix, and compare the fitted values.

See DataCamps’ Hierarchical and Mixed Effects Models for more self practice.


Allard, Denis. 2013. “J.-P. Chilès, P. Delfiner: Geostatistics: Modeling Spatial Uncertainty.” Springer.

Bates, Douglas, Martin Mächler, Ben Bolker, and Steve Walker. 2015. “Fitting Linear Mixed-Effects Models Using lme4.” Journal of Statistical Software 67 (1): 1–48.

Christakos, George. 2000. Modern Spatiotemporal Geostatistics. Vol. 6. Oxford University Press.

Cressie, Noel, and Christopher K Wikle. 2015. Statistics for Spatio-Temporal Data. John Wiley & Sons.

Diggle, Peter J, JA Tawn, and RA Moyeed. 1998. “Model-Based Geostatistics.” Journal of the Royal Statistical Society: Series C (Applied Statistics) 47 (3): 299–350.

Pinero, Jose, and Douglas Bates. 2000. “Mixed-Effects Models in S and S-Plus (Statistics and Computing).” Springer, New York.

Rabinowicz, Assaf, and Saharon Rosset. 2018. “Assessing Prediction Error at Interpolation and Extrapolation Points.” arXiv Preprint arXiv:1802.00996.

Robinson, George K. 1991. “That Blup Is a Good Thing: The Estimation of Random Effects.” Statistical Science, 15–32.

Rosset, Saharon, and Ryan J Tibshirani. 2018. “From Fixed-X to Random-X Regression: Bias-Variance Decompositions, Covariance Penalties, and Prediction Error Estimation.” Journal of the American Statistical Association, nos. just-accepted.

Searle, Shayle R, George Casella, and Charles E McCulloch. 2009. Variance Components. Vol. 391. John Wiley & Sons.

Venables, William N, and Brian D Ripley. 2013. Modern Applied Statistics with S-Plus. Springer Science & Business Media.

Weiss, Robert E. 2005. Modeling Longitudinal Data. Springer Science & Business Media.