9.3 The imputation model

MI works by randomly drawing values from a distribution that uses the relationships among all the variables, estimated from the observed data, to predict the values for cases with missing data, while accounting for all sources of uncertainty in this prediction. mice stands for multivariate imputation by chained equations (van Buuren and Groothuis-Oudshoorn 2011). The mice algorithm starts by predicting one variable’s missing values using a model in which that variable is the outcome and all the other variables are the predictors. Given those predictions, the algorithm moves on to the next variable and predicts its missing values from all the others. These prediction models are strung together in a “chain”, hence the name.

Suppose you have three variables, \(X\), \(Y\), and \(Z\). Roughly, the mice algorithm is as follows:

  • Randomly impute missing \(X\) values using the model \(X \sim Y + Z + \textrm{randomness}\).
  • Given those predictions, impute \(Y\) using the model \(Y \sim X + Z + \textrm{randomness}\).
  • Given those predictions, impute \(Z\) using the model \(Z \sim X + Y + \textrm{randomness}\).
  • Iterate this “chain” many times, ideally until the results converge (i.e., stop changing much with additional iterations); a hand-rolled sketch of this loop follows the list.
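
To make this concrete, here is a minimal hand-rolled sketch of the chain in R. It is purely illustrative: it assumes three continuous variables, uses plain linear regression with residual noise at every step, and skips the proper posterior draws that mice performs, so it is not a substitute for mice() itself. All names in it (impute_one, dat, and so on) are invented for the example.

```r
set.seed(1)

# Toy data: three related continuous variables with holes poked at random
n <- 200
dat <- data.frame(X = rnorm(n))
dat$Y <- 0.5 * dat$X + rnorm(n)
dat$Z <- 0.3 * dat$X - 0.4 * dat$Y + rnorm(n)
for (v in names(dat)) dat[[v]][sample(n, 30)] <- NA

# Remember which entries were originally missing
miss <- lapply(dat, is.na)

# Initial fill-in: column means (any rough starting values will do)
for (v in names(dat)) dat[[v]][miss[[v]]] <- mean(dat[[v]], na.rm = TRUE)

# One link of the chain: regress the target on the other variables using
# the originally observed rows, then re-impute the target's missing
# entries with the predicted mean plus residual noise
impute_one <- function(dat, target, predictors, miss) {
  fit  <- lm(reformulate(predictors, target), data = dat[!miss[[target]], ])
  pred <- predict(fit, newdata = dat[miss[[target]], , drop = FALSE])
  dat[[target]][miss[[target]]] <- pred + rnorm(length(pred), 0, sigma(fit))
  dat
}

# Iterate the chain; in practice you would monitor convergence
for (iter in 1:20) {
  dat <- impute_one(dat, "X", c("Y", "Z"), miss)
  dat <- impute_one(dat, "Y", c("X", "Z"), miss)
  dat <- impute_one(dat, "Z", c("X", "Y"), miss)
}
```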

The model used for each variable depends on its type. For a continuous variable, for example, a linear regression model could be used. In practice, however, a more flexible method that does not rely so heavily on normality is more generally applicable. The defaults in the mice() imputation function are predictive mean matching (pmm) for continuous variables, logistic regression (logreg) for binary variables, multinomial (polytomous) logistic regression (polyreg) for unordered (nominal) categorical variables, and ordinal (proportional odds) logistic regression (polr) for ordered categorical variables.
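
As a sketch of how these defaults show up in practice, the following uses the nhanes2 example dataset that ships with mice (a mix of continuous and factor variables); mice() picks a method per column, and make.method() lets you inspect or override those choices:

```r
library(mice)

# nhanes2 mixes types: age is a complete factor; bmi and chl are numeric
# and hyp is a binary factor, all three with missing values
imp <- mice(nhanes2, m = 5, seed = 123, printFlag = FALSE)

# Inspect which method mice chose per column: complete variables get "",
# numeric variables "pmm", binary factors "logreg"
imp$method

# To override a default, edit the method vector before imputing, e.g.
# Bayesian linear regression ("norm") instead of pmm for bmi
meth <- make.method(nhanes2)
meth["bmi"] <- "norm"
imp_norm <- mice(nhanes2, method = meth, m = 5, seed = 123, printFlag = FALSE)
```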

If we were to use linear regression to impute a missing value, we would fit the model, compute the predicted mean, add random noise to account for uncertainty in both the prediction and the model itself, and impute using that value. Predictive mean matching differs in that, instead of replacing a missing value with the model’s prediction, it replaces it with a randomly chosen observed value, selected from among the cases whose predicted values are closest to the missing case’s predicted value. Thus, an imputation based on predictive mean matching is always equal to one of the observed values in the dataset.
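
The donor-selection idea behind pmm can be sketched in a few lines of R. This is a simplification: mice’s pmm also draws the regression coefficients from their posterior before matching, and it includes refinements this toy function (pmm_impute, a made-up name) omits.

```r
# Toy predictive mean matching for a numeric vector y given a data frame
# of complete predictors X: match each missing case to its k nearest
# donors by predicted mean, then impute one donor's observed value
pmm_impute <- function(y, X, k = 5) {
  miss <- is.na(y)
  d    <- data.frame(y = y, X)
  fit  <- lm(y ~ ., data = d[!miss, ])
  yhat_obs  <- predict(fit)                                 # donors' predicted means
  yhat_miss <- predict(fit, newdata = d[miss, , drop = FALSE])
  y[miss] <- vapply(yhat_miss, function(p) {
    donors <- order(abs(yhat_obs - p))[1:k]   # k closest predicted means
    sample(y[!miss][donors], 1)               # draw one donor's observed value
  }, numeric(1))
  y   # every imputed value is an observed value from the data
}
```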

NOTE: The terms “outcome” and “predictor” play different roles in the regression model fit after MI and in the imputation model used to fill in the missing values. In the regression model, there is one outcome variable and the remaining variables are the predictors. In the imputation model, each variable takes a turn as the outcome within each iteration of the mice algorithm and serves as a predictor in the models where other variables are the outcome.
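
For instance, continuing the earlier nhanes2 sketch, after imputation the analysis model has a single fixed outcome (chl here, chosen purely for illustration), even though chl also took its turn as an imputation-model outcome:

```r
# Fit the substantive regression model in each imputed dataset and pool
# the results with Rubin's rules; chl is the outcome throughout
fits <- with(imp, lm(chl ~ bmi + hyp + age))
summary(pool(fits))
```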

References

van Buuren, Stef, and Karin Groothuis-Oudshoorn. 2011. “mice: Multivariate Imputation by Chained Equations in R.” Journal of Statistical Software 45 (3): 1–67. https://doi.org/10.18637/jss.v045.i03.