9.7 Logistic regression after MI
The syntax for logistic regression after MI is similar to that for linear regression after MI, except that now glm() is used instead of lm().
Example 9.2: Using the NSDUH 2019 teaching dataset (see Appendix A.5), what is the association between lifetime marijuana use (mj_lifetime) and age at first use of alcohol (alc_agefirst), adjusted for age (demog_age_cat6), sex (demog_sex), and income (demog_income)? Use MI to handle missing data.
NOTE: This example uses the MAR dataset created for the illustration in Section 9.2 (nsduh_mar_rmph.RData). Those who never used alcohol were removed from the dataset, since we do not want to impute missing ages for them. Also, a random subset of the values for each variable was set to missing, with the probability of missingness for each variable depending on all the other variables but not on the variable itself.
First, load the data and view the extent of the missing data.
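A minimal sketch of this step follows; the data frame name nsduh_mar is an assumption inferred from the file name (it is not shown in the output), so adjust it to match the object actually loaded:

```r
# Load the MAR version of the NSDUH teaching dataset
# (path and object name are assumptions)
load("nsduh_mar_rmph.RData")

# View the extent of missing data in each variable
summary(nsduh_mar)
```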
## mj_lifetime alc_agefirst demog_age_cat6 demog_sex demog_income
## No :343 Min. : 3.0 18-25: 98 Male :380 Less than $20,000:117
## Yes :491 1st Qu.:15.0 26-34:120 Female:392 $20,000 - $49,999:230
## NA's: 9 Median :17.0 35-49:216 NA's : 71 $50,000 - $74,999:118
## Mean :17.6 50-64:195 $75,000 or more :345
## 3rd Qu.:20.0 65+ :173 NA's : 33
## Max. :45.0 NA's : 41
## NA's :49
Next, fit the imputation model.
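A sketch of the imputation step using mice(): m = 20 matches the number of imputations shown in the output, while the seed is an arbitrary choice and the data frame name nsduh_mar is an assumption.

```r
library(mice)

# Impute missing values in all variables; by default, mice() chooses
# a method for each variable based on its type (pmm for numeric,
# logreg for binary factors, polyreg for factors with > 2 levels)
imp.nsduh <- mice(nsduh_mar, m = 20, seed = 3, printFlag = FALSE)

# Print the mids object to see the methods and predictor matrix
imp.nsduh
```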
## Class: mids
## Number of multiple imputations: 20
## Imputation methods:
## mj_lifetime alc_agefirst demog_age_cat6 demog_sex demog_income
## "logreg" "pmm" "polyreg" "logreg" "polyreg"
## PredictorMatrix:
## mj_lifetime alc_agefirst demog_age_cat6 demog_sex demog_income
## mj_lifetime 0 1 1 1 1
## alc_agefirst 1 0 1 1 1
## demog_age_cat6 1 1 0 1 1
## demog_sex 1 1 1 0 1
## demog_income 1 1 1 1 0
Finally, fit the logistic regression on the imputed datasets and pool the results.
NOTE: Include in the regression model only variables that were in the dataset prior to multiple imputation (see Section 9.4.2).
fit.imp.glm <- with(imp.nsduh,
glm(mj_lifetime ~ alc_agefirst + demog_age_cat6 +
demog_sex + demog_income, family = binomial))
# summary(pool(fit.imp.glm), conf.int = T)
round.summary(fit.imp.glm, digits=3)
## estimate std.error statistic df p.value 2.5 % 97.5 %
## (Intercept) 6.022 0.588 10.239 683.1 0.000 4.867 7.177
## alc_agefirst -0.265 0.028 -9.361 594.4 0.000 -0.320 -0.209
## demog_age_cat626-34 -0.252 0.334 -0.754 716.0 0.451 -0.908 0.404
## demog_age_cat635-49 -0.824 0.298 -2.769 800.8 0.006 -1.408 -0.240
## demog_age_cat650-64 -0.743 0.301 -2.473 773.5 0.014 -1.334 -0.153
## demog_age_cat665+ -1.391 0.307 -4.532 788.9 0.000 -1.993 -0.789
## demog_sexFemale 0.012 0.172 0.067 478.7 0.947 -0.327 0.350
## demog_income$20,000 - $49,999 -0.457 0.271 -1.689 746.0 0.092 -0.989 0.074
## demog_income$50,000 - $74,999 -0.032 0.310 -0.103 760.3 0.918 -0.642 0.577
## demog_income$75,000 or more -0.325 0.259 -1.254 727.3 0.210 -0.834 0.184
Add exponentiate = T to summary() or round.summary() to compute odds ratios, but be careful interpreting the results: only the estimate and confidence interval are exponentiated; the standard error in the table is still the standard error of the un-exponentiated regression coefficient. To avoid confusion, extract just the estimate, confidence interval, and p-value when exponentiating (and drop the first row, since the exponentiated intercept is not of interest).
# summary(pool(fit.imp.glm), conf.int = T,
# exponentiate = T)[-1, c("term", "estimate", "2.5 %", "97.5 %", "p.value")]
round.summary(fit.imp.glm, digits = 3,
exponentiate = T)[-1, c("estimate", "2.5 %", "97.5 %", "p.value")]
## estimate 2.5 % 97.5 % p.value
## alc_agefirst 0.767 0.726 0.811 0.000
## demog_age_cat626-34 0.777 0.403 1.498 0.451
## demog_age_cat635-49 0.439 0.245 0.787 0.006
## demog_age_cat650-64 0.475 0.263 0.858 0.014
## demog_age_cat665+ 0.249 0.136 0.455 0.000
## demog_sexFemale 1.012 0.721 1.419 0.947
## demog_income$20,000 - $49,999 0.633 0.372 1.077 0.092
## demog_income$50,000 - $74,999 0.968 0.526 1.782 0.918
## demog_income$75,000 or more 0.722 0.434 1.202 0.210
9.7.1 Multiple degree of freedom tests
mi.anova() does not work for logistic regression models, but D1() and D3() do. Fit reduced models, each omitting one categorical variable that has more than two levels, and then use D1() to compare the full and reduced models using a Wald test, or D3() for a likelihood ratio test.
For example, to get a Type 3 Wald test and LR test for demog_age_cat6 in Example 9.2:
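A sketch of these tests; fit.imp.glm.reduced is an illustrative name for the reduced model that omits demog_age_cat6:

```r
# Reduced model: same as the full model but without demog_age_cat6
fit.imp.glm.reduced <- with(imp.nsduh,
    glm(mj_lifetime ~ alc_agefirst + demog_sex + demog_income,
        family = binomial))

# Type 3 multivariable Wald test for demog_age_cat6
D1(fit.imp.glm, fit.imp.glm.reduced)

# Likelihood ratio test for demog_age_cat6
D3(fit.imp.glm, fit.imp.glm.reduced)
```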
9.7.2 Predictions
Prediction proceeds in almost the same way as described in Section 9.6.2 – compute estimates within each imputed dataset and then pool the results using Rubin’s rules. The one change is that type = "response" is added to obtain estimates that are probabilities.
Example 9.2 (continued): What is the estimated probability (and standard error) of lifetime marijuana use for someone who first used alcohol at age 13 years, is currently age 30 years, male, and has an annual income of <$20,000?
# Prediction data.frame
NEWDAT <- data.frame(alc_agefirst = 13,
demog_age_cat6 = "26-34",
demog_sex = "Male",
demog_income = "Less than $20,000")
# Estimate for each analysis
PREDLIST <- lapply(fit.imp.glm$analyses, predict, newdata=NEWDAT,
se.fit=T, type = "response")
# Extract mean and variance from each analysis
MEAN <- VAR <- rep(NA, length(fit.imp.glm$analyses))
for(i in 1:length(MEAN)) {
MEAN[i] <- PREDLIST[[i]]$fit
VAR[i]  <- (PREDLIST[[i]]$se.fit)^2
}
# Apply Rubin's rules
PRED.POOLED <- pool.scalar(MEAN,
VAR,
nrow(imp.nsduh$data))
# Extract the pooled mean
PRED.POOLED$qbar
## [1] 0.911
# Extract the pooled standard error
# (the square root of the total variance)
sqrt(PRED.POOLED$t)
## [1] 0.02772