9.11 Exercises

  1. Explain the distinction between unit and item non-response. Which type of non-response can be handled with multiple imputation (as discussed in this chapter)?

  2. What type of missing data method is used when carrying out a complete case analysis?

  3. What type of missing data method is the default for most regression software?

  4. When imputing missing values, what is wrong with imputing a missing value with the mean of the observed values? How does conditional mean imputation help, and why is it still inadequate?

  5. Suppose you have a dataset with \(N\) cases, of which \(n < N\) are complete (no missing values for any variable). After multiple imputation, you have \(M\) complete datasets, each of size \(N\). The “effective sample size” is the correct sample size to use when computing the precision of estimates after MI. Between what two values does the effective sample size lie?

  6. Suppose you have a dataset with two variables, \(Y\) = hospital readmission within one year of a stroke and \(X\) = annual income, where the readmission information is complete but income is missing for many individuals. For each of the following statements, is the missing data mechanism MCAR, MAR, or MNAR? NOTE: The notation \(P(\textrm{Income is missing} | \textrm{Readmission, Income})\) is read “the probability that income is missing for a patient given their readmission and income values”. It may seem strange to talk about the probability that income is missing given income. But think of it as “how likely is it that, given the actual but possibly unknown value of income, income is missing.” Perhaps those with higher or lower income values are more or less likely to not report their income in a survey.

    a. \(P(\textrm{Income is missing} | \textrm{Readmission, Income}) = P(\textrm{Income is missing} | \textrm{Readmission})\). Given the observed information (readmission status), the probability that income is missing is independent of its value. All patients with the same readmission status have the same chance of having not reported their income, regardless of their actual income.
    b. \(P(\textrm{Income is missing} | \textrm{Readmission, Income}) = P(\textrm{Income is missing})\). The probability that income is missing is independent of the observed information (readmission status) and the incomplete information (income). All patients have the same chance of having not reported their income.
    c. \(P(\textrm{Income is missing} | \textrm{Readmission, Income}) \ne P(\textrm{Income is missing} | \textrm{Readmission})\). Even after controlling for the observed information (readmission status), the probability that income is missing depends on the (possibly unknown) income value. Even among observations with the same readmission status, the chance that income is missing depends on its value.

  7. When using multiple imputation, how many imputations do you need?

  8. When using multiple imputation, how should you handle a variable that will be transformed in your analysis (e.g., a categorical variable that will be collapsed; a continuous variable that will be transformed using some function)?
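
The conditional-probability notation in the note to Exercise 6 can be made concrete with a small simulation. The sketch below is illustrative only: the data are simulated, and every name and coefficient value (`sim_missing`, `b_readmit`, `b_income`, the lognormal income distribution) is a hypothetical choice, not something taken from the chapter. Because the data are simulated, the true income is known even where it would be recorded as missing, so the missingness rate can be tabulated against the true value. With real data this table could not be computed.

```r
# Hypothetical simulation: all names and values are illustrative assumptions
set.seed(42)
n       <- 20000
readmit <- rbinom(n, 1, 0.3)                       # 1 = readmitted within one year
income  <- rlnorm(n, meanlog = 10.8, sdlog = 0.6)  # TRUE income (known only in a simulation)

# Generate a missingness indicator for income from a logistic model.
# b_readmit and b_income control how strongly the probability that income
# is missing depends on readmission and on the true income value.
sim_missing <- function(b_readmit, b_income) {
  lp <- -1 + b_readmit * readmit + b_income * (log(income) - 10.8)
  rbinom(n, 1, plogis(lp)) == 1
}

miss <- sim_missing(b_readmit = 0.5, b_income = 1)

# Estimate P(income is missing | readmission, income) by tabulating the
# proportion missing within readmission group and tertile of true income
income_grp <- cut(income,
                  breaks = quantile(income, c(0, 1/3, 2/3, 1)),
                  include.lowest = TRUE,
                  labels = c("Low", "Mid", "High"))
rates <- tapply(miss, list(readmit = readmit, income = income_grp), mean)
round(rates, 2)
```

Changing `b_readmit` and `b_income` (including setting either to zero) and re-running the table shows how the pattern of missingness rates changes across rows and columns.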

For Exercises 9 to 13, use the Natality teaching dataset (natality2018_rmph.Rdata, see Appendix A.3).

  9. Prior to imputation, compute descriptive statistics for father’s race/Hispanic origin (FRACEHISP), education (FEDUC), and age (FAGECOMB), including the number of missing values. Use all available data for each variable (rather than a complete case analysis).

  10. Next, we would like to compute descriptive statistics for father’s race/Hispanic origin (FRACEHISP), education (FEDUC), and age (FAGECOMB) after using multiple imputation. There are a number of auxiliary variables that can be included in the imputation model. The auxiliary variables are non-analysis variables in the dataset that may be correlated with the father’s characteristics and/or correlated with the chance that the father’s characteristics are missing. The following is the full list of analysis and auxiliary variables.

  • Father’s race/Hispanic origin (FRACEHISP)
  • Father’s education (FEDUC)
  • Father’s age (FAGECOMB)
  • Mother’s race/Hispanic origin (MRACEHISP)
  • Mother’s education (MEDUC)
  • Mother’s age (MAGER)
  • Marital Status (DMAR)
  • Prior births now living (PRIORLIVE)
  • Birthweight (g) (DBWT)
  • Month prenatal care began (PRECARE)
  • WIC (WIC)
  • Risk factors reported (risks)
  • Preterm birth (preterm)

In this exercise, visualize the pattern of missing data.

  11. Fit the imputation model with 5 imputations and examine the output. What method was used to impute each variable?

  12. Visualize the imputations for father’s age.

  13. Compute the descriptive statistics after MI. How do these compare to the descriptives before using MI (computed in Exercise 9) and what does that say about the nature of the missing data?

For Exercises 14 to 20, use the 2020 UN Human Development Data (unhdd2020.rmph.Rdata, see Appendix A.2).

  14. After handling missing data using multiple imputation, fit a regression model to test the association between the outcome “child under 5y mortality (2018, per 1,000 live births)” (mort_lt5) and the predictors “female population with at least some secondary education (2015-2019, % ages 25 and older)” (educ_f), “child malnutrition - stunting (moderate or severe) (2010-2019, % under age 5)” (stunt), and “infants exclusively breastfed (2010-2019, % ages 0-5 months)” (breast). Assume no changes need to be made to any of these variables; you will explore other aspects of this analysis in subsequent questions.

  15. Check the normality, linearity, and constant variance assumptions for the model you fit in the previous exercise. What assumptions are violated?

  16. Examine a histogram of the outcome. Use a Box-Cox outcome transformation (based on the original data, before imputation), re-fit the imputation model, re-fit the regression model, and re-check the normality, linearity, and constant variance assumptions. Do any problems remain?

  17. Starting with the model you fit in the previous exercise, relax the linearity assumption for stunt and breast using polynomial transformations (e.g., quadratic, cubic, or higher order). Remember to center each variable prior to transformation. Then re-check the normality, linearity, and constant variance assumptions. Do any problems remain?

  18. Redo the previous exercise, this time also including a quadratic for educ_f. Then re-check the normality, linearity, and constant variance assumptions. Do any problems remain?

  19. Using the final model from the previous exercise, predict child mortality for a nation with 40% child malnutrition, 25% infants exclusively breastfed, and 35% female population with at least some secondary education. Compare this to the prediction when these values are 10%, 70%, and 90%. Hint: In your prediction data.frame, enter a value for every term in the model. Recall that the terms in this model were centered and some were squared or cubed.

  20. Using the final model from Exercise 18, expand the imputation and regression models to assess if the quadratic association between female education and mortality depends on countries’ HDI group (hdi_group). Use the transform-then-impute method for including an interaction.
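
The hint in the prediction exercise above (enter a value for every term in the model, applying the same centering and powers used when fitting) can be sketched on toy data. This is a minimal illustration under stated assumptions, not the exercise solution: it uses a single simulated dataset and `lm()` rather than a model pooled over imputations, and the names `toy`, `x`, `cx`, `cx2`, and `x_mean` are hypothetical.

```r
# Toy data (simulated; not the UN Human Development data)
set.seed(1)
toy   <- data.frame(x = runif(200, 0, 60))
toy$y <- 5 + 0.8 * toy$x + 0.02 * (toy$x - 30)^2 + rnorm(200, sd = 3)

# Center the predictor, then create the squared term from the centered version
x_mean  <- mean(toy$x)
toy$cx  <- toy$x - x_mean
toy$cx2 <- toy$cx^2

fit <- lm(y ~ cx + cx2, data = toy)

# To predict at x = 10 and x = 40, apply the SAME centering constant used
# when fitting, and supply a value for every term in the model
new_x      <- c(10, 40)
newdat     <- data.frame(cx = new_x - x_mean)
newdat$cx2 <- newdat$cx^2

pred <- predict(fit, newdata = newdat)
pred
```

A common mistake is to enter the raw value (e.g., `cx = 10`) or to center using the mean of the new data instead of `x_mean`. Either one gives a prediction at a different point than intended.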

For Exercises 21 to 23, use the NHANES 2017-2018 fasting subsample teaching dataset (nhanes1718_adult_fast_sub_rmph.Rdata, see Appendix A.1). Create a dichotomous version of PHQ-9 representing “at least mild depression” (PHQ-9 \(\ge\) 5) using the following code.

library(dplyr)  # provides %>% and mutate()

load("Data/nhanes1718_adult_fast_sub_rmph.Rdata")

# Create dichotomized PHQ-9
# "PHQ-9 scores of 5, 10, 15, and 20 represented
#  mild, moderate, moderately severe, and
#  severe depression, respectively"
#  (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1495268/)

nhanes <- nhanes_adult_fast_sub %>% 
  mutate(depression = factor(phq9 >= 5,
                             levels = c(FALSE, TRUE),
                             labels = c("No", "Yes")))

# Check derivation: cross-tabulate (including missing values),
# then confirm the PHQ-9 range within each depression level
table(nhanes$phq9, nhanes$depression, exclude = NULL)
tapply(nhanes$phq9, nhanes$depression, range, na.rm = TRUE)

  21. After handling missing data using multiple imputation, fit a regression model to test whether the outcome “at least mild depression” is significantly associated with ever having told a doctor about trouble sleeping (SLQ050), after adjusting for age (RIDAGEYR), gender (RIAGENDR), income (income), and days someone engages in vigorous recreational activities (PAQ655). Answer the question and report the AOR, 95% confidence interval, and p-value. Also, which other predictors are significantly associated with “at least mild depression”?

  22. Expand the model you fit in the previous exercise to assess whether the association between “at least mild depression” and trouble sleeping depends on gender. Use the stratification method for imputing an interaction. Regardless of the statistical significance of the interaction, estimate the sleep effect at each level of gender.

  23. Assess the goodness of fit of the model from the previous exercise using the Hosmer-Lemeshow test, as well as calibration plots.

  24. For this exercise, use the teaching dataset based on the Framingham Heart Study (fram_time_invar_rmph.rData, see Appendix A.6). After handling missing data using multiple imputation, fit a regression model to test if time to angina differs between participants with different levels of education (EDUC), adjusted for age (AGE) and sex (SEX). The time variable is TIMEAP and the event indicator is ANGINA.