Chapter 8 Next steps
Preliminary results indicate that XGBoost performs well relative to donor-based methods in univariate imputation. The models were especially effective at predicting the different classes of multi-class and binary variables. The ability to program an end-to-end imputation process is an added advantage of XGBoost: it reduces the time taken to implement an imputation method and gives clients the option of automating the imputation process. Current donor-based methods utilise either closed or proprietary code, which cannot easily be integrated into open source platforms.
The intention is to build on this work, in two parts:
- Part 1: Complete the analysis carried out on the Census Teaching File, testing the efficacy of neural networks, and replicate the methodologies on social survey data
- Part 2: Produce a report on the efficacy of different model-based imputation methods
The advantages of model-based approaches are:
- Models can be tailored to a given imputable variable. That is, different models can be used for different imputable variables based on the distribution of the outcome variable, and the nature of its relationship to the auxiliary variables.
- Model-based approaches (specifically machine learning methods) can make use of the large volume of auxiliary variables often available in social surveys.
- Model-based approaches could be designed to produce estimates of imputation variance.
The review of model-based imputation methods will cover:
- When imputation is appropriate in a survey dataset, using existing literature to give analysts some guidance on when to impute and when imputation may not be appropriate.
- Which model-based approaches are appropriate given the intended use of the dataset.
The following models will be evaluated with respect to performance metrics and their impact on the distribution of the imputable variable:
- Two statistical approaches: One using a frequentist framework and one using a Bayesian framework
- Two supervised machine learning methods: k-nearest neighbours & XGBoost
- Two unsupervised machine learning methods: Autoencoders & Generative Adversarial Networks
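To make the idea of univariate, model-based imputation concrete, the following is a minimal sketch of k-nearest neighbours imputation for a single categorical variable, using only the Python standard library. It is an illustration under stated assumptions, not the method used in this work: the function name `knn_impute`, the toy records, and the variable names (`age`, `hours`, `status`) are all hypothetical, and the auxiliary variables are used unscaled for simplicity.

```python
import math
from collections import Counter


def knn_impute(records, target, aux, k=3):
    """Impute missing (None) values of `target` by majority vote
    among the k nearest records that have `target` observed.

    Hypothetical helper for illustration only. Distances are Euclidean
    over the auxiliary variables in `aux`, with no scaling applied.
    """
    observed = [r for r in records if r[target] is not None]
    imputed = []
    for r in records:
        filled = dict(r)  # copy so the input records are not modified
        if r[target] is None:
            point = [r[a] for a in aux]
            # Rank observed records by distance to the incomplete record
            nearest = sorted(
                observed,
                key=lambda d: math.dist(point, [d[a] for a in aux]),
            )[:k]
            votes = Counter(d[target] for d in nearest)
            filled[target] = votes.most_common(1)[0][0]
        imputed.append(filled)
    return imputed


# Toy data (invented for this sketch): two records have a missing status
records = [
    {"age": 25, "hours": 40, "status": "employed"},
    {"age": 30, "hours": 38, "status": "employed"},
    {"age": 28, "hours": 35, "status": "employed"},
    {"age": 70, "hours": 0, "status": "retired"},
    {"age": 68, "hours": 0, "status": "retired"},
    {"age": 72, "hours": 2, "status": "retired"},
    {"age": 27, "hours": 39, "status": None},
    {"age": 69, "hours": 1, "status": None},
]

result = knn_impute(records, target="status", aux=["age", "hours"], k=3)
for row in result[-2:]:
    print(row)
```

In practice the auxiliary variables would usually be standardised before computing distances, and performance would be assessed by masking known values and comparing the imputed classes against them.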