Model bias can arisen from various factors including:
- Imputation method
- Missing data mechanism (MCAR vs. MAR)
- Proportion of the missing data
- Information available in the data set
Since the imputed observations are themselves estimates, their values have corresponding random error. But when you put in that estimate as a data point, your software doesn’t know that. So it overlooks the extra source of error, resulting in too-small standard errors and too-small p-values. So multiple imputation comes up with multiple estimates.
Because multiple imputation have a random component, the multiple estimates are slightly different. This re-introduces some variation that your software can incorporate in order to give your model accurate estimates of standard error. Multiple imputation was a huge breakthrough in statistics about 20 years ago. It solves a lot of problems with missing data (though, unfortunately not all) and if done well, leads to unbiased parameter estimates and accurate standard errors. If your rate of missing data is very, very small (2-3%) it doesn’t matter what technique you use.
Remember that there are three goals of multiple imputation, or any missing data technique:
- Unbiased parameter estimates in the final analysis (regression coefficients, group means, odds ratios, etc.)
- accurate standard errors of those parameter estimates, and therefore, accurate p-values in the analysis
- adequate power to find meaningful parameter values significant.
Don’t round off imputations for dummy variables. Many common imputation techniques, like MCMC, require normally distributed variables. Suggestions for imputing categorical variables were to dummy code them, impute them, then round off imputed values to 0 or 1. Recent research, however, has found that rounding off imputed values actually leads to biased parameter estimates in the analysis model. You actually get better results by leaving the imputed values at impossible values, even though it’s counter-intuitive.
Don’t transform skewed variables. Likewise, when you transform a variable to meet normality assumptions before imputing, you not only are changing the distribution of that variable but the relationship between that variable and the others you use to impute. Doing so can lead to imputing outliers, creating more bias than just imputing the skewed variable.
Use more imputations. The advice for years has been that 5-10 imputations are adequate. And while this is true for unbiasedness, you can get inconsistent results if you run the multiple imputation more than once. (Bodner 2008) recommends having as many imputations as the percentage of missing data. Since running more imputations isn’t any more work for the data analyst, there’s no reason not to.
Create multiplicative terms before imputing. When the analysis model contains a multiplicative term, like an interaction term or a quadratic, create the multiplicative terms first, then impute. Imputing first, and then creating the multiplicative terms actually biases the regression parameters of the multiplicative term (Von Hippel 2009)