13 Linear Mixed Models
13.1 Introduction
We have already dealt with one extension of the General Linear Model, the extension to outcome variables with different kinds of distributions via the Generalized Linear Model (see previous session). In this session, we will deal with another extension, one that you are likely to use very often. This extension is called the Linear Mixed Model.
Linear Mixed Models are used for cases where your observations are clustered. This means that certain groups of your observations share something in common other than the values of the predictor variables and covariates. The most common use of the Linear Mixed Model that you are likely to encounter is for studies where there are repeated measures from the same individual, like a longitudinal study or an experiment with an independent variable manipulated within subjects. This is not the only application however: there are many other possible sources of clustering in datasets, and you can use Linear Mixed Models in these cases too.
In the case of repeated measures from the same individuals, you know that all the observations from participant 1 share something in common (even if they involved different levels of the independent variable). What they share in common is that they all came from the same person, and maybe that person has a characteristic way of doing the task. Maybe they always get rather high scores, or always rather low ones. You want to account for this non-independence of sets of observations in your model. Doing so will have two advantages.
First, it will help you avoid over-weighting observations from the same cluster for between-cluster comparisons. Imagine you have an experiment on music learning where the scores in one group are 3, 5, 5, 7, 8, 8 and the scores in the other group are 4, 4, 5, 47, 46, 49. This looks like you might have a between-group difference (the means of the two groups are certainly very different). But if I now tell you that the 47, the 46 and the 49 all came from the same participant doing the task three times, and that participant was a professional musician, you will not be so impressed. Essentially, you will say, the 47, the 46 and the 49 should count as one observation since they are not independent of one another. The linear mixed model will help us not overweight these kinds of non-independent observations in between-group comparisons, by clustering our observations by participant.
Second, the Linear Mixed Model will help us avoid under-interpreting small effects when they occur within clusters. Imagine we have a dataset where participants A, B, C, D, and E do a memory task two times, once with and once without a cognitive enhancing drug. Their scores are as follows:
Participant | No drug | Drug |
A | 3 | 5 |
B | 9 | 10 |
C | 18 | 19 |
D | 20 | 22 |
E | 28 | 29 |
Here, if we just compare the averages of the groups without considering the clustering by participant, we could easily conclude that the drug does not do much. The means of the two groups only differ by 1.4, which seems a trivial amount when you consider the huge amount of variation in performance there is on this task (the scores range from 3 to 29).
However, when you look at the data set out in the table, it’s clear that most of the variation is between participants: some always get high scores (28 and 29) and some always get low (3 and 5). But comparing within a participant, you can see that every participant does a bit better in the drug than the no drug condition. So the better inferential quantity to use, to assess the causal effect of the drug, is each participant’s performance in the drug condition relative to their own baseline in the non-drug condition. The linear mixed model will automatically do this, exactly by specifying which observations belong to the same cluster.
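As a minimal R illustration using the made-up scores from the table above, contrasting the two ways of looking at the data:

no_drug <- c(3, 9, 18, 20, 28)   # participants A to E without the drug
drug    <- c(5, 10, 19, 22, 29)  # the same participants with the drug

mean(drug) - mean(no_drug)       # ignoring clustering: group means differ by only 1.4
drug - no_drug                   # respecting clustering: every participant improves (2 1 1 2 1)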
The Linear Mixed Model works by introducing into our familiar General Linear Model equation additional terms specifying what are called the random effects. This is a misleading term unless you are a mathematician, since the random effect terms are not generated randomly. But anyway, what the random effects terms do is identify which cases belong to the same cluster as which others. The model fitting then estimates a coefficient (or coefficients, in complex cases) for each cluster. In the simplest instances, you use a variable like participant name or participant number to identify which cases come from the same participant, and the model assigns each participant their own intercept. This ensures that any within-subjects differences will be interpreted as departures from that participant’s own baseline, and for between-subjects variables, multiple observations from the same participant will not get over-weighted.
All of this is much easier to demonstrate in practice than it is to explain theoretically, so let’s work through an example.
13.2 The study by Nettle and Saxe (2020) and its data
Our example today comes from some of my own work (Nettle and Saxe 2020). Specifically, we are looking at study 1 of this 7-study paper.
In this study, we were interested in people’s intuitions about how much of what people earn should be shared with other members of society, and how much individuals should get to keep for themselves. We studied this by asking 100 participants to imagine hypothetical villages in which different conditions obtained. In these villages people grew crops in their gardens. The dependent variable in each case was what percentage of their crop the villagers should be obliged to share out with other villagers, the rest being kept for themselves. The independent variables were: the extent to which variation in garden productivity was due to luck rather than hard work (three levels: low importance of luck; medium importance; high importance); and the extent to which the villagers were culturally homogeneous (two levels: homogeneous and heterogeneous). This latter independent variable captures the hypothesis that people think one should share more with one’s own group than with outgroup members. Thus, there are 2 x 3 = 6 conditions overall (low luck homogeneous, low luck heterogeneous, medium luck homogeneous, etc.).
Manipulation of luck and heterogeneity was within subjects, so every participant completed the dependent measure six times in total. The predictions were that the greater the importance of luck, the more participants would think villagers should share; and that participants would think villagers should share more in the homogeneous than in the heterogeneous villages.
13.2.1 Getting the data
The data for the study are available in the repository at: https://osf.io/xrqae/. It is the file ‘study1.data.csv’ that we are interested in. Download it from the repository and save it into your working directory.
Now let’s read the data in.
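One minimal way to do this, assuming you saved the file as study1.data.csv in your working directory, is with base R's read.csv() (the tidyverse's read_csv() would work equally well):

d <- read.csv("study1.data.csv")   # read the trial-level data into a data frame
nrow(d)                            # should be 600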
You should have a data frame d with 600 observations.
13.2.2 Wide versus long format
You might be forgiven for imagining that you were going to get 100 rows of data, because there are 100 participants. In other words, you might have expected the first few rows of the data frame to look something like this:
## Participant Low.Hom Low.Het Med.Hom Med.Het High.Hom
## 1 A 23 18 40 35 56
## 2 B 42 40 41 34 67
## 3 C 31 33 54 22 45
In fact, it looks something like this:
## participant level luck heterogeneity
## 1 R_0ijWqatE9iFA0Rr 80 High Heterogeneous
## 2 R_0ijWqatE9iFA0Rr 42 Low Heterogeneous
## 3 R_0ijWqatE9iFA0Rr 60 Medium Heterogeneous
## 4 R_0ijWqatE9iFA0Rr 70 High Homogeneous
## 5 R_0ijWqatE9iFA0Rr 20 Low Homogeneous
## 6 R_0ijWqatE9iFA0Rr 50 Medium Homogeneous
## 7 R_0TAfv2g6dKUBf9v 80 High Heterogeneous
## 8 R_0TAfv2g6dKUBf9v 31 Low Heterogeneous
## 9 R_0TAfv2g6dKUBf9v 50 Medium Heterogeneous
## 10 R_0TAfv2g6dKUBf9v 50 High Homogeneous
## 11 R_0TAfv2g6dKUBf9v 30 Low Homogeneous
## 12 R_0TAfv2g6dKUBf9v 50 Medium Homogeneous
The first format (1 row per participant, 6 columns corresponding to the dependent measures in the 6 conditions) is called wide format. The format the data are actually in (1 row per trial, so 600 rows overall, with one column giving all the dependent measure responses and three other columns identifying, respectively, which participant it was, whether the luck independent variable was low, medium or high, and whether the village was homogeneous or heterogeneous) is called long format.
In R, and in statistics and computation more generally, you will almost always want your data to be in long format. This is because in wide format, observations of the same variable (the dependent variable, the rating of the percentage that should be shared, called level in the data set) are spread over six different columns. You always want observations of the same variable to be in one column, never more than one. This is known as the principle of tidy data. You then use additional columns to identify other attributes of that observation, such as the participant and which levels of the independent or predictor variables it was generated under.
The dataset is thus already in the right long format for us. If you want to convert wide format data to long format, there is a tidyverse function called pivot_longer() that does this neatly. It has an inverse function, pivot_wider(), for converting from long to wide. We won't go through these functions in detail here, but a brief sketch is given below and you can easily find more examples on the web.
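Just to give a flavour, a hypothetical pivot_longer() call for lengthening a wide data frame like the one sketched earlier might look something like this (the data frame name wide_data, and the assumption that the six condition columns sit next to each other, are made up for illustration):

library(tidyverse)
long_data <- wide_data %>%
  pivot_longer(cols = Low.Hom:High.Het,      # the six condition columns
               names_to = "condition",       # new column identifying the condition
               values_to = "level")          # new column holding the responses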
13.3 Fitting a Linear Mixed Model
To recap, we now have a data frame d with 600 observations of the dependent variable level. The independent variables are luck and heterogeneity. A further variable, participant, identifies which of the 100 participants each response came from.
We fit Linear Mixed Models with a contributed package called lmerTest. In fact, this is a wrapper for another package, lme4, which actually fits the model; lmerTest just presents the results in a handy way. If we install and load lmerTest, it imports lme4 automatically.
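The installation and loading steps are the usual ones (installation only needs to be done once):

# install.packages("lmerTest")   # run once if the package is not already installed
library(lmerTest)                # loading lmerTest also makes lme4 available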
Let’s examine what seems to be going on in the data. Recall, the predictions were that level would be higher when luck was more important, and lower when the village was heterogeneous than when it was homogeneous. Let’s look at the means and standard deviations of level by condition:
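One way to produce such a table, assuming the tidyverse is loaded, is something like:

d %>%
  group_by(luck, heterogeneity) %>%              # one group per condition
  summarise(M = mean(level), SD = sd(level))     # mean and SD of level in each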
## `summarise()` has grouped output by 'luck'. You can
## override using the `.groups` argument.
## # A tibble: 6 × 4
## # Groups: luck [3]
## luck heterogeneity M SD
## <chr> <chr> <dbl> <dbl>
## 1 High Heterogeneous 58.8 27.5
## 2 High Homogeneous 63.4 25.5
## 3 Low Heterogeneous 42.4 29.6
## 4 Low Homogeneous 46.7 29.7
## 5 Medium Heterogeneous 48.1 20.9
## 6 Medium Homogeneous 50.2 18.6
It’s clear that the means are highest when luck is High and lowest when luck is Low, so that manipulation seems to have worked. But, for each level of luck, the mean level is very slightly higher for Homogeneous than Heterogeneous: 63 versus 59; 50 versus 48; 47 versus 42. So possibly there might be some support for the prediction concerning heterogeneity too. The thing is that the standard deviations within each condition are really big (between 18 and 30), which rather dwarfs the small differences in the means by heterogeneity.
We can start by fitting our habitual General Linear Model:
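Based on the call echoed in the output below, the model is fitted and summarised like this (the object is named g1, since that is how it is referred to later):

g1 <- lm(level ~ luck + heterogeneity, data = d)
summary(g1)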
##
## Call:
## lm(formula = level ~ luck + heterogeneity, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52.935 -20.029 -0.985 19.445 57.290
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 59.265 2.092 28.328
## luckLow -16.555 2.562 -6.461
## luckMedium -11.950 2.562 -4.664
## heterogeneityHomogeneous 3.670 2.092 1.754
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## luckLow 2.16e-10 ***
## luckMedium 3.84e-06 ***
## heterogeneityHomogeneous 0.0799 .
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.62 on 596 degrees of freedom
## Multiple R-squared: 0.0739, Adjusted R-squared: 0.06924
## F-statistic: 15.85 on 3 and 596 DF, p-value: 6.269e-10
You can see that there are massively significant differences between the Low and Medium levels of luck and the reference category High. There is also a higher average level when the village is Homogeneous, as predicted, but the coefficient is small (3.67) compared to its standard error (2.09), and hence the p-value indicates we cannot conclude it is significantly different from zero.
However, what we really want to do is to ask: do participants increase their level rating from their own baseline when the village is Homogeneous? To do this, we have to inform the model of the clustered structure of the responses by participant, so that it can fit a random intercept to account for this. Here is where we use the lmer() function from the lmerTest package rather than our habitual lm() function.
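The call, matching the formula shown in the output below (the fitted object is named g2, as it is referred to later), is:

g2 <- lmer(level ~ luck + heterogeneity + (1|participant), data = d)
summary(g2)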
## Linear mixed model fit by REML. t-tests use
## Satterthwaite's method [lmerModLmerTest]
## Formula: level ~ luck + heterogeneity + (1 | participant)
## Data: d
##
## REML criterion at convergence: 5513.9
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.42719 -0.62971 0.00556 0.63994 2.58273
##
## Random effects:
## Groups Name Variance Std.Dev.
## participant (Intercept) 167.9 12.96
## Residual 489.2 22.12
## Number of obs: 600, groups: participant, 100
##
## Fixed effects:
## Estimate Std. Error df
## (Intercept) 59.265 2.223 325.954
## luckLow -16.555 2.212 497.000
## luckMedium -11.950 2.212 497.000
## heterogeneityHomogeneous 3.670 1.806 497.000
## t value Pr(>|t|)
## (Intercept) 26.663 < 2e-16 ***
## luckLow -7.485 3.29e-13 ***
## luckMedium -5.403 1.02e-07 ***
## heterogeneityHomogeneous 2.032 0.0427 *
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) luckLw lckMdm
## luckLow -0.498
## luckMedium -0.498 0.500
## htrgntyHmgn -0.406 0.000 0.000
The differences in the syntax are simple:
* We now call lmer() instead of lm().
* We have added a term + (1|participant) to the formula. This is the random effect term, and it says in effect ‘add a separate intercept (represented by the 1) for each participant, as identified in the variable called participant’.
As for the output, it looks pretty familiar too. The first thing you will notice is that the coefficients are all the same. Yes, the parameter estimates come from the data in the same way as before: 3.67 for heterogeneity being Homogeneous, for example. However, their standard errors are different. This is because the standard errors now take account of the clustering in the data, whereas the standard errors in g1 did not.
In this case, the standard errors are smaller when we fit the Linear Mixed Model, because we are measuring the within-subjects deviations from each participant’s own baseline as the independent variables vary. For what it is worth, this means that the parameter estimate for heterogeneity has gone from being just non-significant to being just significant. The standard errors for the parameter estimates of the luck variable have also got smaller, though those effects were already so clearly significant that this does not change our interpretation.
There are some other additional features in the output of g2. The output of g1 informed us that the standard error of the residuals was 25.62. Squaring this up, the variance of the residuals would therefore be \(25.62^2 = 656.38\). So this means that the residuals, the bits of discrepancy left over after we fit all the predictors, have a variance of about 656 (their mean is zero by construction).
In g2, this residual variance is broken down into two parts. There is a variance of 167.9 between individuals; this is an indicator of how much the means of different participants differ from one another. And there is a variance of 489.2 that is described as Residual. This is how much variance is left over after the independent variables have been fitted and the differences in average response between individuals have been accounted for. In other words, it captures the variation within individuals that we cannot account for with anything we know about. In this case, since \(167.9/(167.9 + 489.2) \approx 0.26\), we can say that about 26% of the variation in the dataset not explained by the independent variables was due to individuals having different baseline values of level, with the remaining 74% representing the unexplained noise within participants.
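If you want to extract these variance components from the fitted model rather than reading them off the summary, a minimal sketch using lme4's VarCorr() function is:

vc <- as.data.frame(VarCorr(g2))   # one row per variance component
vc$vcov                            # 167.9 (participant) and 489.2 (Residual)
vc$vcov[1] / sum(vc$vcov)          # proportion between participants, about 0.26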
13.3.1 Reporting the model results
To report the results of our Linear Mixed Model g2, I would say something like the following:
“We fitted a Linear Mixed Model with level as the outcome variable, and luck and heterogeneity as the predictors, plus a random intercept for participant to account for the repeated measures. Low luck and medium luck elicited significantly reduced level compared to high luck (\(B_{low}\) = -16.56 (se 2.21), t = -7.49, p < 0.001; \(B_{medium}\) = -11.95 (se 2.21), t = -5.40, p < 0.001). Homogeneity elicited significantly higher level than heterogeneity (\(B\) = 3.67 (se 1.81), t = 2.03, p = 0.043). The participant and residual variances of 167.9 and 489.2 indicated that about 26% of the variation not explained by the predictors was between participants, with the remainder being within participants.”
13.4 More complex random effects structures
We have dealt with a case where there was one source of clustering in the data: by participant. But often there will be more sources. For example, imagine a study where participants respond to visual images. Each participant responds to several images, so there is clustering by participant. The images are either faces, or (let’s say) flowers. But there are ten different face images, and ten different flower images. The experimenters will have included this replication in face images and in flower images to avoid over-generalizing on the basis of the response to one particular face image.
So here, responses are clustered both in terms of which participant they belonged to, and which individual image they were a response to. This would be easily captured in a mixed model. The syntax would be something like m1 <- lmer(y ~ x + (1|participant) + (1|image), data=data). The estimation would automatically sort out the variance due to participant (people being characteristically high or low) and the variance due to image (certain images being more arresting than others), and estimate the effect of the predictor variables on top of this clustering.
In all the cases we have thought about so far, we have just wanted to add an intercept for each cluster (participant or image), to allow for the different baselines of different clusters. But sometimes you also want to estimate a cluster-specific response to some predictor variable. This case, which I will not deal with here, is called a random slopes model. Random slopes models often (though not only) occur when the predictor is time. For example, you might measure the height of children repeatedly as they grow, and want to allow for the fact that each individual child has a characteristic growth rate. Here you would fit each child with not just a random intercept (the children were different heights at the beginning), but also a random slope against time (the children were growing at characteristically different rates).
The syntax for fitting a random slopes model using lmer() is very similar to our cases so far, except that instead of + (1|participant) it will be something like + (time|participant).
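For the growth example above, a hypothetical random slopes call (the names height, time, child, and growth_data are made up for illustration) would look something like:

growth_model <- lmer(height ~ time + (time|child), data = growth_data)
# (time|child) gives each child their own intercept and their own slope for time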
Sometimes there are random effects that you are not sure whether to put in or not. For example, let’s say that as well as some responses being from the same participants, some responses were collected on a particular day, or by a particular experimenter. Should day or experimenter be included as random effects, or not?
This is not always straightforward, but there are some rules of thumb. The first is that if there are very few levels of the potential clustering variable, don’t include a random effect for it. Say there are two experimenters. You might want to include experimenter as a covariate in the non-random part of the model equation, but don’t try to include a random effect. Random effects are better suited for the case where there are many levels of the clustering variable (as is true of participant in the data we looked at, for example). The second rule of thumb is that if your potential random effect variable does have many levels, it’s probably fine (and prudent) to include it in cases where you are not sure. If there is in fact no clustering by that variable in the data, the model including this random effect will produce almost identical results to the one excluding it, and the variance estimated for that random effect will be close to or exactly zero.
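As a hypothetical illustration of these two rules of thumb (all variable names made up):

# Two experimenters: include experimenter as an ordinary covariate, not a random effect
m_a <- lmer(y ~ x + experimenter + (1|participant), data = data)

# Many testing days: a random intercept for day is a reasonable, prudent addition
m_b <- lmer(y ~ x + (1|participant) + (1|day), data = data)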
By the way, in R and lmerTest, if you get a red warning about a boundary (singular) fit, this is probably because your model contains a random effect that accounts for almost zero variation. If you remove that random effect (as you would be justified in doing), the warning will probably go away.
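You can check this directly: lme4 provides isSingular(), and VarCorr() shows which variance is (near) zero. Using the hypothetical m_b from the sketch above:

isSingular(m_b)   # TRUE if some random effect variance has been estimated as (almost) zero
VarCorr(m_b)      # inspect the variance components to see which one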
13.5 Generalized and Mixed?
So far in this course we have dealt with two extensions to the General Linear Model: the Generalized Linear Model for non-continuous outcome variables, and the Linear Mixed Model for when there is clustering in the data. You might be asking the question: what if I have both a non-continuous outcome variable and clustering?
That’s no problem: you need to use both extensions at the same time, making a Generalized Linear Mixed Model. You fit a Generalized Linear Mixed Model with lmerTest exactly as we have done so far this session. Instead of lmer(), the function you need is glmer(), and the additional thing that you need to put in the function call is the specification of a family of models (e.g. binomial, Poisson), exactly as we did in the unit on Generalized Linear Models. So the call will look something like: s1 <- glmer(y ~ x + (1|participant), family=binomial, data=data).
13.6 Summary
In this session, we have:
* Discussed the issue of clustering in datasets and how it can lead us to overweight some differences and underweight others;
* Introduced the Linear Mixed Model, an extension of the General Linear Model that handles clustering in datasets;
* Worked through a practical example where we used a Linear Mixed Model to allow for repeated measures from the same individuals, and saw how this changed the results;
* Discussed more complex applications, such as more than one source of clustering, random slopes, and Generalized Linear Mixed Models.