9.3 The imputation model

MI works by drawing values randomly from a distribution that uses the relationships among all the variables, estimated from the observed values, to predict the values for cases with missing data, taking into account all sources of uncertainty in this prediction. mice stands for multivariate imputation by chained equations (van Buuren and Groothuis-Oudshoorn 2011). The mice algorithm starts by predicting one variable’s missing values using a model in which that variable is the outcome and all the other variables are the predictors. Given those predictions, the algorithm moves on to the next variable and predicts its missing values based on all the others. These regression equations are strung together in a “chain”, hence the name.

Suppose you have 3 variables, \(X\), \(Y\), and \(Z\). Roughly, the mice algorithm is as follows:

  • Randomly impute missing \(X\) values using the regression model \(X \sim Y + Z + \textrm{randomness}\).
  • Given those predictions, impute \(Y\) using the model \(Y \sim X + Z + \textrm{randomness}\).
  • Given those predictions, impute \(Z\) using the model \(Z \sim X + Y + \textrm{randomness}\).
  • Iterate this “chain” until convergence.
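The steps above can be sketched in code. The function below is an illustrative Python sketch of the chained-equations idea for three continuous variables, using linear regression plus Gaussian noise at each step; it is not the mice package (which is written in R), and all names in it are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def impute_chained(data, n_iter=10):
    """data: 2D array with np.nan marking missing entries."""
    data = data.copy()
    miss = np.isnan(data)
    # Crude starting fill: each column's observed mean.
    col_means = np.nanmean(data, axis=0)
    for j in range(data.shape[1]):
        data[miss[:, j], j] = col_means[j]
    for _ in range(n_iter):                 # iterate the "chain"
        for j in range(data.shape[1]):      # each variable takes a turn
            others = np.delete(data, j, axis=1)
            X_obs = others[~miss[:, j]]
            y_obs = data[~miss[:, j], j]
            # Least-squares fit of variable j on all the others.
            A = np.column_stack([np.ones(len(X_obs)), X_obs])
            coef, *_ = np.linalg.lstsq(A, y_obs, rcond=None)
            resid_sd = np.std(y_obs - A @ coef)
            X_mis = others[miss[:, j]]
            A_mis = np.column_stack([np.ones(len(X_mis)), X_mis])
            # Prediction plus random noise, as in the bullet list above.
            data[miss[:, j], j] = A_mis @ coef + rng.normal(0, resid_sd, len(X_mis))
    return data
```

After a few sweeps through the chain, every missing cell holds a randomly drawn imputation that reflects the fitted relationships among the variables.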

For each variable, the regression model used depends on the type of that variable. For example, for a continuous variable, a linear regression might be used. In practice, however, a more flexible method that does not rely so heavily on normality is more generally applicable. The default in the mice() imputation function is to use predictive mean matching (pmm) for continuous variables, logistic regression for binary variables (logreg), multinomial (polytomous) logistic regression for unordered (nominal) categorical variables (polyreg), and ordinal (proportional odds) logistic regression for ordered categorical variables (polr).
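This mapping from variable type to default method can be written as a short lookup. The helper below is a hypothetical Python convenience of ours, not part of any package; the method names themselves are mice's documented defaults.

```python
def default_method(var_type):
    """Return mice's default imputation method for a variable type
    (the function itself is illustrative, not part of mice)."""
    defaults = {
        "continuous": "pmm",   # predictive mean matching
        "binary": "logreg",    # logistic regression
        "nominal": "polyreg",  # multinomial (polytomous) logistic regression
        "ordinal": "polr",     # proportional-odds logistic regression
    }
    return defaults[var_type]
```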

If we were to use linear regression to predict a missing value, we would fit a regression model, take the predicted mean value, and add some random noise to account for the uncertainty in the prediction (and in the model itself). Predictive mean matching differs in that, instead of replacing a missing value with the prediction from the model, it replaces it with a randomly chosen observed value, where the choice is among those with predicted values closest to the missing value’s predicted value. Thus, an imputation based on predictive mean matching is always equal to one of the observed values in the dataset.
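Predictive mean matching can be sketched as follows. This is a minimal Python illustration, with names and the donor-pool size k=5 chosen by us; mice's pmm also perturbs the regression coefficients to reflect model uncertainty, which is omitted here for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def pmm_impute(X, y, missing, k=5):
    """X: predictor matrix; y: target with np.nan where `missing` is True."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A[~missing], y[~missing], rcond=None)
    pred = A @ coef                       # predicted means for every case
    y_imp = y.copy()
    obs_idx = np.flatnonzero(~missing)
    for i in np.flatnonzero(missing):
        # The k observed cases whose predictions are closest to case i's.
        dist = np.abs(pred[obs_idx] - pred[i])
        donors = obs_idx[np.argsort(dist)[:k]]
        # Impute with one donor's *observed* value, chosen at random.
        y_imp[i] = y[rng.choice(donors)]
    return y_imp
```

Note that every imputed value is copied from an observed case, which is why a pmm imputation is always equal to one of the observed values in the dataset.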

NOTE: The terms “outcome” and “predictor” have different meanings between the regression model to be fit after MI and the imputation model used to fill in the missing values. In the regression model, there is one outcome variable and the remaining variables are the predictors. In the imputation model, each variable takes a turn at being the outcome in each iteration of the mice algorithm, and acts as a predictor in the models where a different variable is the outcome.

References

van Buuren, Stef, and Karin Groothuis-Oudshoorn. 2011. “mice: Multivariate Imputation by Chained Equations in R.” Journal of Statistical Software 45 (3): 1–67. https://doi.org/10.18637/jss.v045.i03.