Chapter 8 Now for Advanced: logistic mixed effects
Here we will consider two examples, one will be quite general - we will study the predictors of lung cancer remission (adapted from UCLA Statistical Consulting Group) and the other one from linguistics - based on a dataset available in the package languageR
which consists of data sets and functions accompanying the book ’Analyzing Linguistic Data: A practical introduction to statistics’, Cambridge University Press, 2007. The latter one will be left for you to try to replicate some of the main findings in the paper. ( see section DIY)
The main stages will be illustrated on a simplistic dataset but you are welcome to explore either Dative or your own data to try what you have learnt later.
#We wil need quite a few packages (some for plots)
library(GGally)
library(reshape2)
library(compiler)
library(parallel)
library(boot)
library(ggplot2)
#Lets load the data
lung_cancer <- read.csv("lung_cancer.csv")
Explore the data a little before getting on with the analysis.You will see that we have quite a lot different variables. Remember that for good model, there is no need to include everything but key predictors should always be in. We can focus on Length of Stay, Experience and Cancer Stage.
#We can use str()
str(lung_cancer)
## 'data.frame': 8525 obs. of 28 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ tumorsize : num 68 64.7 51.6 86.4 53.4 ...
## $ co2 : num 1.53 1.68 1.53 1.45 1.57 ...
## $ pain : int 4 2 6 3 3 4 3 3 4 5 ...
## $ wound : int 4 3 3 3 4 5 4 3 4 4 ...
## $ mobility : int 2 2 2 2 2 2 2 3 3 3 ...
## $ ntumors : int 0 0 0 0 0 0 0 0 2 0 ...
## $ nmorphine : int 0 0 0 0 0 0 0 0 0 0 ...
## $ remission : int 0 0 0 0 0 0 0 0 0 0 ...
## $ lungcapacity: num 0.801 0.326 0.565 0.848 0.886 ...
## $ Age : num 65 53.9 53.3 41.4 46.8 ...
## $ Married : Factor w/ 2 levels "no","yes": 1 1 2 1 1 2 2 1 2 1 ...
## $ FamilyHx : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 2 1 ...
## $ SmokingHx : Factor w/ 3 levels "current","former",..: 2 2 3 2 3 3 1 2 2 3 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 2 2 2 1 2 2 2 ...
## $ CancerStage : Factor w/ 4 levels "I","II","III",..: 2 2 2 1 2 1 2 2 2 2 ...
## $ LengthofStay: int 6 6 5 5 6 5 4 5 6 7 ...
## $ WBC : num 6088 6700 6043 7163 6443 ...
## $ RBC : num 4.87 4.68 5.01 5.27 4.98 ...
## $ BMI : num 24.1 29.4 29.5 21.6 29.8 ...
## $ IL6 : num 3.7 2.63 13.9 3.01 3.89 ...
## $ CRP : num 8.086 0.803 4.034 2.126 1.349 ...
## $ DID : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Experience : int 25 25 25 25 25 25 25 25 25 25 ...
## $ School : Factor w/ 2 levels "average","top": 1 1 1 1 1 1 1 1 1 1 ...
## $ Lawsuits : int 3 3 3 3 3 3 3 3 3 3 ...
## $ HID : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Medicaid : num 0.606 0.606 0.606 0.606 0.606 ...
For main predictors we picked four variables to double check their distribution and to make sure that there are no multicolinearity in predictors. Check that you can do quick checks of how continuous values are grouped by categorical ones - ggpairs()
is quite handy for that.
library(ggplot2)
#We can study our variables in one go and check: distributions and correlations of the main predictors
ggpairs(lung_cancer[, c("CancerStage", "LengthofStay", "Experience")])
#We can also check how the continuous predictors relate to our outcome variable 'remission'
tmp <- melt(lung_cancer[, c("remission", "LengthofStay", "Experience")],
id.vars="remission")
#Plot
ggplot(tmp, aes(factor(remission), y = value, fill=factor(remission))) +
geom_boxplot() +
facet_wrap(~variable, scales="free_y")
# Estimate the model and store results in model_lc
model_lc <- glmer(remission ~ Age + LengthofStay + FamilyHx + CancerStage + CancerStage + Experience + (1 | DID) + (1 | HID),
data = lung_cancer, family = binomial, optimizer='Nelder_Mead') #note that I have updated the optimizer (to avoid warnings)
# Print the mod results:what do you find?
summary(model_lc)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula:
## remission ~ Age + LengthofStay + FamilyHx + CancerStage + CancerStage +
## Experience + (1 | DID) + (1 | HID)
## Data: lung_cancer
##
## AIC BIC logLik deviance df.resid
## 7225.6 7296.1 -3602.8 7205.6 8515
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.9563 -0.4229 -0.1890 0.3629 8.2299
##
## Random effects:
## Groups Name Variance Std.Dev.
## DID (Intercept) 3.8471 1.9614
## HID (Intercept) 0.2419 0.4919
## Number of obs: 8525, groups: DID, 407; HID, 35
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.960353 0.568451 -3.449 0.000564 ***
## Age -0.015968 0.005904 -2.705 0.006836 **
## LengthofStay -0.043030 0.035505 -1.212 0.225524
## FamilyHxyes -1.295706 0.093369 -13.877 < 2e-16 ***
## CancerStageII -0.321924 0.076376 -4.215 2.50e-05 ***
## CancerStageIII -0.857475 0.100025 -8.573 < 2e-16 ***
## CancerStageIV -2.137657 0.162281 -13.173 < 2e-16 ***
## Experience 0.125360 0.026881 4.664 3.11e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) Age LngthS FmlyHx CncSII CnSIII CncSIV
## Age -0.390
## LengthofSty -0.141 -0.319
## FamilyHxyes -0.013 0.103 -0.112
## CancerStgII 0.084 -0.181 -0.186 -0.052
## CancrStgIII 0.139 -0.222 -0.239 -0.051 0.513
## CancerStgIV 0.148 -0.220 -0.194 -0.012 0.357 0.347
## Experience -0.835 -0.009 -0.004 -0.022 -0.002 -0.005 -0.012
## convergence code: 0
## Model failed to converge with max|grad| = 0.858202 (tol = 0.001, component 1)
We can do some analysis that is based on model object. In fact, you can extract even more if you need.
se <- sqrt(diag(vcov(model_lc))) #standard errors
# table of estimates with 95% CI using errors we obtained above
(CI_estimates <- cbind(Est = fixef(model_lc), LL = fixef(model_lc) - 1.96 * se, UL = fixef(model_lc) + 1.96 *
se))
## Est LL UL
## (Intercept) -1.96035276 -3.07451662 -0.846188910
## Age -0.01596794 -0.02753911 -0.004396764
## LengthofStay -0.04303048 -0.11261951 0.026558540
## FamilyHxyes -1.29570603 -1.47870949 -1.112702564
## CancerStageII -0.32192409 -0.47162068 -0.172227491
## CancerStageIII -0.85747504 -1.05352351 -0.661426574
## CancerStageIV -2.13765689 -2.45572684 -1.819586939
## Experience 0.12536019 0.07267364 0.178046738
We can look at the effect in terms of odds and associated CIs.
#To get to the odds, we can take exponential of beta estimates
exp(CI_estimates)
## Est LL UL
## (Intercept) 0.1408087 0.04621196 0.4290470
## Age 0.9841589 0.97283663 0.9956129
## LengthofStay 0.9578822 0.89349056 1.0269144
## FamilyHxyes 0.2737046 0.22793165 0.3286695
## CancerStageII 0.7247532 0.62399016 0.8417877
## CancerStageIII 0.4242319 0.34870691 0.5161145
## CancerStageIV 0.1179308 0.08580081 0.1620927
## Experience 1.1335567 1.07537951 1.1948812
To see the overall variation around the intercepts estimate we can use dotplot() from lattice:
#Try with dotplot from lattice
dotplot(ranef(model_lc, which = "DID", condVar = TRUE), #I specify the ID and whether I want conditional var.
scales = list(y = list(alternating = 0))) # by adjusting this I am less bother about exact ID
## $DID
#Shorter for ranef since we have less hospitals then individuals
dotplot(ranef(model_lc, which = "HID", condVar = TRUE))
## $HID
Lets also vary the slopes for each patient:
model_lc_slope<-glmer(remission ~ Age + LengthofStay + FamilyHx + CancerStage +
Experience + (1 + LengthofStay | DID) + (1 | HID), data = lung_cancer, family = binomial,
nAGQ = 1)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl =
## control$checkConv, : Model failed to converge with max|grad| = 0.077799
## (tol = 0.001, component 1)
summary(model_lc_slope)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula:
## remission ~ Age + LengthofStay + FamilyHx + CancerStage + Experience +
## (1 + LengthofStay | DID) + (1 | HID)
## Data: lung_cancer
##
## AIC BIC logLik deviance df.resid
## 7173.7 7258.4 -3574.9 7149.7 8513
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -4.5499 -0.4239 -0.1790 0.3642 8.1358
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## DID (Intercept) 0.2873 0.5360
## LengthofStay 0.1498 0.3870 -0.27
## HID (Intercept) 0.5164 0.7186
## Number of obs: 8525, groups: DID, 407; HID, 35
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.881739 0.540500 -1.631 0.102819
## Age -0.015834 0.005929 -2.671 0.007573 **
## LengthofStay -0.193961 0.045356 -4.276 1.90e-05 ***
## FamilyHxyes -1.343919 0.095171 -14.121 < 2e-16 ***
## CancerStageII -0.294438 0.076198 -3.864 0.000111 ***
## CancerStageIII -0.862329 0.101282 -8.514 < 2e-16 ***
## CancerStageIV -2.278712 0.169756 -13.423 < 2e-16 ***
## Experience 0.106943 0.023653 4.521 6.14e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) Age LngthS FmlyHx CncSII CnSIII CncSIV
## Age -0.418
## LengthofSty -0.247 -0.252
## FamilyHxyes -0.024 0.104 -0.072
## CancerStgII 0.105 -0.181 -0.163 -0.054
## CancrStgIII 0.146 -0.221 -0.181 -0.051 0.496
## CancerStgIV 0.119 -0.212 -0.115 0.002 0.335 0.336
## Experience -0.803 -0.002 0.068 -0.020 -0.013 -0.007 0.011
## convergence code: 0
## Model failed to converge with max|grad| = 0.077799 (tol = 0.001, component 1)
We can then plot both intercepts and slopes variability:
dotplot(ranef(model_lc_slope, which = "DID", condVar = TRUE),
scales = list(y = list(alternating = 0)))
## $DID
We can also do something which I think is magical, using package sjPlot
- we can without effort get plots of random effects and CI for predicted probabilities.
library(effects)
library(sjPlot)
#We can plot either random effects
plot_model(model_lc, type='re')
## [[1]]
##
## [[2]]
#Or we can check predicted probabilities
plot_model(model_lc, type='eff')
## $Age
##
## $LengthofStay
##
## $FamilyHx
##
## $CancerStage
##
## $Experience
Lastly, what we may want to do is to evaluate our predictions in terms of overall accuracy:
#Use the models to predict the outcome
predictions_lc <- as.numeric(predict(model_lc, type="response")>0.5)
predictions_lc_slopes<- as.numeric(predict(model_lc_slope, type="response")>0.5)
Now lets investigate the ratio of correctly predicted cases over observed via confusion matrix, you can either use table() or confusionMatrix() from the package ‘caret’.
#Simple table. Note: we ll have to claculate by hand the ratio of correctly predicted cases over the total:
table(predictions_lc, lung_cancer$remission)
##
## predictions_lc 0 1
## 0 5517 929
## 1 487 1592
#Sum the corrected cases and divide by total n=8525
ratio_correct<- (5529+1617)/8525
ratio_correct
## [1] 0.8382405
I personally would use caret
as results will look tidier and more complete.
#Load caret
library(caret)
#Confusion matrix between predicted and observed
confusionMatrix(as.factor(predictions_lc), as.factor(lung_cancer$remission), dnn = c("Prediction", "Reference"))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5517 929
## 1 487 1592
##
## Accuracy : 0.8339
## 95% CI : (0.8258, 0.8417)
## No Information Rate : 0.7043
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5799
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9189
## Specificity : 0.6315
## Pos Pred Value : 0.8559
## Neg Pred Value : 0.7658
## Prevalence : 0.7043
## Detection Rate : 0.6472
## Detection Prevalence : 0.7561
## Balanced Accuracy : 0.7752
##
## 'Positive' Class : 0
##
Well, clearly confusion matrix gives us a much nicer output :)
Let us see now whether our final model with varying slopes and varying intercepts is better in predicting remission cases.
#Create confusion matrix with varying slopes model
confusionMatrix(as.factor(predictions_lc_slopes), as.factor(lung_cancer$remission), #specify variables
dnn = c("Prediction", "Reference")) #give labels to the columns
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5501 889
## 1 503 1632
##
## Accuracy : 0.8367
## 95% CI : (0.8287, 0.8445)
## No Information Rate : 0.7043
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5898
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9162
## Specificity : 0.6474
## Pos Pred Value : 0.8609
## Neg Pred Value : 0.7644
## Prevalence : 0.7043
## Detection Rate : 0.6453
## Detection Prevalence : 0.7496
## Balanced Accuracy : 0.7818
##
## 'Positive' Class : 0
##
The improvement is rather tiny if judged by accuracy only. To be sure, lets also run a likelihood ratio test:
#Likelihood ratio test
lrtest(model_lc, model_lc_slope)
## Likelihood ratio test
##
## Model 1: remission ~ Age + LengthofStay + FamilyHx + CancerStage + CancerStage +
## Experience + (1 | DID) + (1 | HID)
## Model 2: remission ~ Age + LengthofStay + FamilyHx + CancerStage + Experience +
## (1 + LengthofStay | DID) + (1 | HID)
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 10 -3602.8
## 2 12 -3574.9 2 55.826 7.542e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The test suggests that the second model may offer an improvement in the fit regardless of marginally small change in the predictive power. After we have performed so many checks we may perhaps just go with the last model as, theoretically, it would also make sense.
Quick Note: The example was partially adapted from UCLA’s worked example of GGLM. You can check more here, where you can also find a bit more on visualisations and few other tricks.