9.1 Introduction
Missing data are common in public health research. Whatever method is used to select individuals for a survey or other study, there may be unit non-response – individuals may decline to participate, or the researcher may not succeed in contacting some individuals. Additionally, among those who do participate, some may drop out of the study early, resulting in loss to follow-up. There may also be item non-response among participants – individuals may leave some survey items unanswered, whether due to refusal or lack of knowledge. Similarly, when physical measurements are involved, some individuals may decline to be measured.
For some survey designs, statistical methods can be used to account for unit non-response or loss to follow-up via survey weights. Regardless of the design, it is helpful for a research report to include a statement about the response rate (the proportion of selected individuals who participated in the research overall and at each study visit, if longitudinal), the characteristics of the non-responders (if known), and how characteristics differ between respondents with complete responses and those with item non-response (see Section 9.4.5).
This chapter addresses item non-response, which is referred to from this point forward as missing data. The default approach for handling missing data in regression methods is listwise deletion – remove any case (individual) from the dataset that has a missing value for the outcome or for any predictor. This complete case analysis approach is the method used in previous chapters. Listwise deletion can be done either explicitly by the data analyst or implicitly by the software. In the previous chapters of this text, cases with missing data were explicitly removed prior to computing descriptive statistics so that these statistics would be based on the same sample as the regression.
It would be preferable, however, to use a method that retains all cases in the dataset, making use of their non-missing information, and does so in a way that leads to consistent estimates of regression coefficients with the correct estimated precision. Consistency is similar to, but not the same as, unbiasedness: a consistent estimator is one whose estimates tend to get closer and closer to the true value as the sample size grows.
This chapter introduces a method that is almost always better than a complete case analysis. First introduced by Donald Rubin (Rubin 1987, 1996), multiple imputation has become widely used and recognized as an effective method for handling missing data that is applicable to a wide variety of analyses.
Imputation refers to replacing a missing value with a guess of its true value. When this is done for every missing value, an incomplete dataset becomes a complete dataset. Mean imputation replaces a missing value for a variable with the mean of that variable’s non-missing values. This, however, ignores relationships between variables and, as a result, distorts those relationships. Conditional mean imputation improves upon this by replacing a missing value for a variable with its predicted value using a regression model with that variable as the outcome and the other variables in the dataset as predictors.
Conditional mean imputation is intuitively appealing – it makes sense to replace a missing value with the best guess given the non-missing information. Unfortunately, doing so actually results in a complete dataset that does not accurately reflect reality because the variation in the data is understated. An improvement can be made by adding randomness to the guess – instead of replacing the missing value with the best guess, replace it with the best guess plus or minus some random noise. The correct way to do this is to take a random draw from the proper distribution, one that uses the relationships between all the observed (non-missing) data to predict the missing value, taking into account all sources of uncertainty in this prediction.
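To make these ideas concrete, the following minimal R sketch (with simulated data and hypothetical variable names) contrasts mean, conditional mean, and stochastic imputation. Note that adding only residual noise, as done here, still falls short of a fully “proper” imputation, which would also reflect uncertainty in the estimated regression coefficients:

```r
# Illustrative data: y is complete, x has missing values
set.seed(1)
dat <- data.frame(y = rnorm(100))
dat$x <- 2 * dat$y + rnorm(100)
dat$x[sample(100, 20)] <- NA   # make 20 values of x missing

# Mean imputation: ignores the relationship between x and y
x_mean <- dat$x
x_mean[is.na(x_mean)] <- mean(dat$x, na.rm = TRUE)

# Conditional mean imputation: best guess from a regression of x on y
fit  <- lm(x ~ y, data = dat)        # fit on the complete cases
pred <- predict(fit, newdata = dat)  # predicted values for all rows
x_cond <- ifelse(is.na(dat$x), pred, dat$x)

# Stochastic imputation: best guess plus random noise drawn from the
# residual distribution (still only a single imputation)
x_stoch <- ifelse(is.na(dat$x),
                  pred + rnorm(nrow(dat), 0, summary(fit)$sigma),
                  dat$x)
```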
Such a single imputation method, however, results in treating imputed values as if they were known, overstating the precision of regression results (standard errors that are too small, confidence intervals that are too narrow, and p-values that are too small). This limitation is overcome by using multiple imputation (MI) – whatever your single imputation method, do it multiple times. Multiple imputation consists of taking, for each missing value, \(M\) random draws from the proper distribution. The result is \(M\) complete datasets.
Next, given the \(M\) complete datasets, fit the regression model of interest \(M\) times, once with each dataset, to get \(M\) sets of estimated regression coefficients and their estimated variances. Finally, Rubin’s rules are used to (1) compute the final estimated regression coefficients by averaging the estimates over the \(M\) sets and (2) compute the estimated variance of these final estimates by averaging the estimated variances over the \(M\) sets and adding a bit more based on how variable the estimates were between imputations.
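In symbols, letting \(\hat{Q}_m\) denote the estimate of a regression coefficient from the \(m\)th imputed dataset and \(U_m\) its estimated variance, Rubin’s rules take the standard form

\[
\bar{Q} = \frac{1}{M} \sum_{m=1}^{M} \hat{Q}_m, \qquad
T = \bar{U} + \left(1 + \frac{1}{M}\right) B,
\]

where \(\bar{U} = \frac{1}{M} \sum_{m=1}^{M} U_m\) is the average within-imputation variance and \(B = \frac{1}{M-1} \sum_{m=1}^{M} \left(\hat{Q}_m - \bar{Q}\right)^2\) is the between-imputation variance. The \(\left(1 + 1/M\right)B\) term is the “bit more” that accounts for how variable the estimates were between imputations.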
The second Rubin’s rule corrects the overstatement of precision that results from single imputation. Consider a dataset with \(N\) observations, \(n\) of which have no missing data. A complete case analysis (via listwise deletion) would have a sample size of \(n\), and computation of standard errors would be based on that sample size. Since that ignores the non-missing information in the excluded cases, these standard errors will, in general, be larger than if we had made use of this information. Had we used imputation to complete the data but then pretended there never were any missing data (as in single imputation), standard errors would be based on a sample size of \(N\) and would be too small. MI plus Rubin’s rules leads to the correct compromise between the two extremes of excluding incomplete cases altogether and imputing values for them but pretending they were never missing.
The modest goal of this chapter is to present the basics of MI and how to apply them using the mice package (van Buuren and Groothuis-Oudshoorn 2011, 2023), which includes the mice() function to create imputed datasets and additional functions to visualize imputed values, carry out analyses on each imputed dataset, and pool results over imputations. Some additional functions will be used from the miceadds package (Robitzsch, Grund, and Henke 2023), as well.
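As a preview of the workflow, here is a minimal sketch using the small nhanes example dataset bundled with the mice package (not the NHANES teaching dataset used later in this chapter); the model is purely illustrative:

```r
library(mice)

# nhanes: a small built-in example dataset with missing values in bmi, hyp, and chl
imp <- mice(nhanes, m = 5, seed = 123, printFlag = FALSE)  # create M = 5 imputed datasets

# Fit the analysis model once per imputed dataset
fit <- with(imp, lm(bmi ~ age + chl))

# Pool the 5 sets of estimates and variances using Rubin's rules
summary(pool(fit))
```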
While the functions in mice have reasonable default settings, there are many nuances that go into doing MI well. This chapter covers the basics and is only an introduction. After becoming comfortable with the basics here, you are highly encouraged to learn more and/or seek help from an experienced statistician before carrying out analyses utilizing MI. See Flexible Imputation of Missing Data for a thorough and accessible treatment of MI (van Buuren 2018). An excellent reference for the topic of missing data in general is Little and Rubin (2019). For an overview of R methods for handling missing data, see CRAN Task View: Missing Data (accessed March 7, 2023). Finally, the examples in this chapter use the NHANES and NSDUH teaching datasets, but only to illustrate the use of multiple imputation to handle missing data. Multiple imputation for complex survey data is more complicated; see, for example, Liu et al. (2016) and Quartagno, Carpenter, and Goldstein (2020).