6.23 Exercises

1. True or false? Binary logistic regression is used for outcomes with three or more possible levels.
2. True or false? Binary logistic regression can be used to estimate an odds ratio adjusted for confounding due to other variables.
3. Write a sentence interpreting an odds of 2:1 for an event.
4. Write a sentence interpreting an odds ratio of 1.90.
5. Write a sentence interpreting an odds ratio of 2.10.
6. Write a sentence interpreting an odds ratio of 0.42.
7. What is the interpretation of the intercept ( $\beta_0$ ) in a binary logistic regression model?
8. How can you convert the intercept in a binary logistic regression model to a probability?
9. What is the interpretation of the regression coefficient ( $\beta$ ) for a continuous predictor $X$ in a binary logistic regression model?
10. What is the interpretation of the regression coefficients for a categorical predictor in a binary logistic regression model?
11. How can you convert a regression coefficient in a binary logistic regression model to an odds ratio?
12. What assumption(s) of a linear regression model are not assumption(s) of a binary logistic regression model? What assumption(s) do they have in common?
13. What statistical method can be used to estimate a risk ratio (or prevalence ratio)?
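As a reminder of the arithmetic behind several of the questions above, the following is a minimal base R sketch of moving between probabilities, odds, and log-odds. The numeric values (`p`, `b`) are made up for illustration and do not come from any model in the text.

```r
# Hypothetical probability of an event (illustrative value only)
p <- 0.25

# Odds = p / (1 - p)
odds <- p / (1 - p)

# Log-odds (logit); qlogis() is base R's logit function
log_odds <- qlogis(p)

# Inverse logit: plogis() maps a log-odds value back to a probability
p_back <- plogis(log_odds)

# A hypothetical coefficient on the log-odds scale and the odds ratio
# obtained by exponentiating it
b <- 0.5
or <- exp(b)

c(odds = odds, log_odds = log_odds, p_back = p_back, or = or)
```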

For Exercises 14 to 25, use the NHANES 2017-2018 examination subsample teaching dataset (nhanes1718_adult_exam_sub_rmph.Rdata, see Appendix A.1). Create a binary version of PHQ-9 representing mild depression (PHQ-9 $\ge$ 5) using the following code prior to answering the questions.

library(tidyverse)  # provides %>% and mutate()

load("Data/nhanes1718_adult_exam_sub_rmph.Rdata")

# Create dichotomized PHQ-9

# "PHQ-9 scores of 5, 10, 15, and 20 represented mild, moderate,
# moderately severe, and severe depression, respectively"
# (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1495268/)

nhanes <- nhanes_adult_exam_sub %>% 
  mutate(depression_mild = factor(phq9 >= 5,
                                  levels = c(FALSE, TRUE),
                                  labels = c("No", "Yes")))
# Check: each PHQ-9 score should map to exactly one level
table(nhanes$phq9, nhanes$depression_mild, useNA = "ifany")
# Check: range of PHQ-9 scores within each level
tapply(nhanes$phq9, nhanes$depression_mild, range, na.rm = TRUE)

14. Create a 2 $\times$ 2 table of mild depression (depression_mild) vs. walk or bicycle (PAQ635, “In a typical week do you walk or use a bicycle for at least 10 minutes continuously to get to and from places?”) and, using the table, compute the odds ratio comparing the odds of mild depression between those who do not and those who do answer “Yes” to the walk or bicycle question.
15. Create a 2 $\times$ 2 table of mild depression (depression_mild) vs. trouble sleeping (SLQ050, “Have you ever told a doctor or other health professional that you have trouble sleeping?”) and, using the table, compute the odds ratio comparing the odds of mild depression between those who do and those who do not answer “Yes” to the sleeping question.
16. Suppose you are going to fit a logistic regression where the outcome is mild depression. What probability is glm() modeling? If it is not already, modify the outcome variable so glm() will model P(mild depression = Yes).
17. Compute the odds ratio comparing the odds of mild depression (depression_mild) between those who do and those who do not answer “Yes” to the sleeping question (SLQ050) using logistic regression, as well as its 95% confidence interval. Assume depression_mild is the outcome.
18. Do the odds of mild depression (depression_mild) differ between individuals of different income levels (income)? Test the global significance of income using a Type III Wald test and compute the OR, 95% CI, and p-value comparing each possible pair of levels.
19. Is mild depression (depression_mild) associated with the number of days someone engages in vigorous recreational activities (PAQ655, “In a typical week, on how many days do you do vigorous-intensity sports, fitness or recreational activities?”)? What is the OR comparing the odds of mild depression between individuals who differ by 1 day in days of vigorous recreation? What about between those who differ by 5 days?
20. Is mild depression (depression_mild) significantly associated with trouble sleeping (SLQ050) after adjusting for age (RIDAGEYR), gender (RIAGENDR), income (income), and days someone engages in vigorous recreational activities (PAQ655)? Answer the question and report the AORs, 95% confidence intervals, and p-values. Also, interpret the AOR for trouble sleeping.
21. Create a forest plot to illustrate the AORs and their 95% CIs for the model from the previous exercise. For each continuous predictor, plot the AOR corresponding to a difference in the predictor equal to its interquartile range (IQR).
22. Using the model from Exercise 20, what is the predicted prevalence (and 95% CI) of mild depression among those with and without trouble sleeping who are age 40 years, male, earn $25,000 to <$55,000 per year, and do not engage in any days of vigorous recreational activities?
23. After adjusting for age (RIDAGEYR), gender (RIAGENDR), income (income), and days someone engages in vigorous recreational activities (PAQ655), does the association between mild depression (depression_mild) and trouble sleeping (SLQ050) depend on gender?
24. Using the model from the previous exercise, test the overall significance of trouble sleeping.
25. Using the model from Exercise 23, estimate the AOR for mild depression comparing those without and with trouble sleeping separately for males and females (along with their 95% CIs and p-values). Based on the answer to Exercise 23, are these two AORs significantly different from each other?
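The link between a 2 $\times$ 2 table and a single-predictor logistic regression, which several of the exercises above rely on, can be sketched on simulated data. Everything here (`exposure`, `outcome`, the probabilities 0.2 and 0.4) is hypothetical and unrelated to NHANES, and the Wald interval is only one of several ways to compute a confidence interval.

```r
# Simulated data (hypothetical; not from NHANES)
set.seed(1)
n <- 500
exposure <- factor(sample(c("No", "Yes"), n, replace = TRUE),
                   levels = c("No", "Yes"))
p <- ifelse(exposure == "Yes", 0.4, 0.2)
outcome <- factor(ifelse(runif(n) < p, "Yes", "No"),
                  levels = c("No", "Yes"))

# 2 x 2 table: rows = exposure, columns = outcome
tab <- table(exposure, outcome)

# Odds of the outcome among the exposed divided by odds among the unexposed
or_table <- (tab["Yes", "Yes"] / tab["Yes", "No"]) /
            (tab["No",  "Yes"] / tab["No",  "No"])

# With a factor outcome, glm() treats the first level as "failure",
# so here it models P(outcome = "Yes")
fit <- glm(outcome ~ exposure, family = binomial)
or_glm <- exp(coef(fit)[["exposureYes"]])

# Wald 95% CI, exponentiated to the odds ratio scale
ci_wald <- exp(confint.default(fit)["exposureYes", ])

c(or_table = or_table, or_glm = or_glm, ci_wald)
```

With a single binary predictor, the two estimates agree up to numerical tolerance; they will generally differ once other covariates are added to the model.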

For Exercises 26 to 27, use the NHANES 2017-2018 fasting subsample teaching dataset (nhanes1718_adult_fast_sub_rmph.Rdata, see Appendix A.1).

26. Fit a logistic regression model to test the association between the outcome “Ever told had congestive heart failure?” (MCQ160B) and “How often do you snort or stop breathing?” (SLQ040), adjusted for age (RIDAGEYR), gender (RIAGENDR), and income (income). Look at the table of regression coefficients. Do you see any indicators of a problem with quasi- or complete separation?
27. Check for quasi- or complete separation in the model from the previous exercise, resolve any issues you find, and re-fit the model.

For Exercises 28 to 30, use the 2019 National Survey on Drug Use and Health (NSDUH) teaching dataset (nsduh2019_adult_sub_rmph.RData, see Appendix A.5).

28. Fit a logistic regression for the outcome substance use treatment (tx_substance_lifetime) vs. the predictor age of first cigarette use (cig_agefirst) and check the linearity assumption. Is the relationship linear?
29. For the model you fit in the previous exercise, relax the linearity assumption in each of the following three ways (not all at once): (1) log transformation of the predictor, (2) square-root transformation of the predictor, and (3) add a quadratic term for the predictor. Re-check the linearity assumption for each. Which transformation would you choose and why?
30. Compare the models with log, square-root, and quadratic transformed predictors from the previous exercise based on influential observations and goodness-of-fit (Hosmer-Lemeshow test, calibration plot). Is one preferable to the others?
31. Suppose you have a dataset with $n = 1000$ observations. In order to avoid overfitting, what is the maximum number of predictors you should include in a logistic regression model if the proportion of observations with the outcome is 0.15? What if the proportion is 0.70? What can you say about the relationship between the sample proportion and the number of predictors for a given sample size?
32. You are designing a study for which you wish to fit a logistic regression model with seven predictors. Assuming the population prevalence is 0.80, what is the minimum sample size you need to avoid overfitting? What if the population prevalence is 0.35? What can you say about the relationship between the population prevalence and the minimum sample size for a given number of predictors?
33. The matched case-control teaching dataset nhanes_CC_rmph.Rdata was created from a subset of adults from the 2017-2018 NHANES teaching dataset (Appendix A.1) containing 91 individuals who were ever told they had cancer or malignancy and 799 individuals who were not, matched on gender (RIAGENDR) and income (income). Assess the association between having been told one has cancer or malignancy and smoking status (smoker = Never, Past, or Current), accounting for the matching in the analysis. Compare all three levels of smoking status and give a plausible explanation of the results. NOTE: As in the example in the text, you must first convert the outcome to a 0/1 variable.
34. Repeat Exercise 20, but this time instead of computing the AOR, compute the adjusted prevalence ratio (APR).
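For the two overfitting exercises above, the arithmetic can be wrapped in a pair of small helper functions. This is a sketch under stated assumptions, not the text's own method: the function names are hypothetical, the default of 10 events per predictor (`epp`) is a commonly quoted rule of thumb that should be replaced by whatever threshold the chapter specifies, and the limiting group is assumed to be the less common of the two outcome levels.

```r
# Maximum number of predictors under an events-per-predictor rule of thumb.
# ASSUMPTION: epp = 10 is a placeholder default; use the chapter's threshold.
max_predictors <- function(n, prop, epp = 10) {
  # Size of the less common outcome group (rounded to a whole count)
  limiting_n <- round(n * min(prop, 1 - prop))
  floor(limiting_n / epp)
}

# Minimum total sample size so the less common outcome group is expected
# to contain at least k * epp observations.
min_sample_size <- function(k, prev, epp = 10) {
  # round() guards against floating-point noise before taking the ceiling
  ceiling(round(k * epp / min(prev, 1 - prev), 6))
}

# Illustrative values (deliberately not those used in the exercises)
max_predictors(n = 2000, prop = 0.10)
min_sample_size(k = 5, prev = 0.25)
```

Because both functions use `min(p, 1 - p)`, a proportion and its complement (for example, 0.25 and 0.75) give the same answer.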

For Exercises 35 and 36, use the 2018 Natality subsample teaching dataset (natality2018_rmph.Rdata, see Appendix A.3).

35. Is cigarette smoking during pregnancy (CIG_REC) associated with the outcome birthweight category (BWTR4 = Normal, Low birthweight (LBW), or Very low birthweight (VLBW)), adjusted for maternal age (MAGER), maternal education (MEDUC), and the presence of risk factors (risks)?
36. Check the proportional odds (PO) assumption for the model you fit in the previous exercise.