In the past decades, the attention for missing data has grown and so has the need for researchers to apply suitable methods to deal with missing data. Leading methodologists and statisticians have published papers in leading journals about the problems of missing data and recommended researchers to take missing data seriously. Examples can be found in the British Medical Journal (BMJ, Sterne et al. (2009)), New England Journal of Medicine, (NEJM, R.J. Little et al. (2012)) and the Journal of the American Medical Association (JAMA, Li P (2015)). From our experience, applied researchers still find it difficult to reserve extra time to evaluate the missing data in their study and to find a suitable solution to handle their missing data for their primary data analysis. This book is developed for those researchers who are looking for a solution for a missing data problem or who want to learn more about dealing with missing data. The book is initially developed for a missing data course for epidemiologist, but we feel that applied researchers from other disciplines may also find this book useful. Further, we are active in giving statistical advice in general and more specific about missing data. Because our time for consultations is mostly limited, this practical guide may be useful to help researchers get started with their missing data problem. With this book we provide an overview of the currently available methods to deal with missing data. The use of the methods are thoroughly explained and supported by practical examples. We hope you will enjoy this book and that you find it useful, at that as a result you will use recommended methods to solve your missing data problem.
In this book the software packages SPSS and R play a central role. SPSS is one of the most popular software package for applied researchers, worldwide, to use for statistical analyses. Currently, R is growing in popularity fast and will probably become the most popular software packages for statistical analysis, also for applied researchers.
Both SPSS and R have their own specific characteristics. One of the main differences between SPSS and R is that SPSS works with a click-menu that makes windows appear. In these windows you can drag variables into boxes. Subsequently, by clicking the OK button the statistical analysis procedure you specified will run and the output results are automatically presented in a new window. A limitation of working with SPSS is that you are overloaded with statistical output that may not all be needed to answer your research question. Also, newly developed methods are not immediately available, but have to be included in a new version update by the developers.
R is a software package that needs code to access statistical procedures and you need to be familiar with some R-language in order to use it. However, once you know the language, R can be used for any statistical procedure you can think of. Moreover, R is free and an open source program that anybody can download and use. It enables users to get insight into the statistical calculations that are programmed. Anyone can contribute by developing packages and functions for statistical procedures. Consequently, state-of-the-art statistical methods are very quickly available in R, and if they are not, you can develop them yourself.
In this book the handling of missing data is the main topic. We will show how to apply methods in both software packages SPSS and R. Of note is that multiple imputation methods run with random starting procedures. Both SPSS and R have their own internal random number generators. Therefore, results might slightly differ between the software packages, even when exactly the same procedures are applied. Our intention is not to compare the software packages, both packages are reliable, but we aim to provide readers with an overview of all possible missing data methods.
In this book we use SPSS version 24 and R software version 3.5.1. The R examples will be presented by using the output from RStudio version (Version 1.1.463 – © 2009-2018 RStudio, Inc). RStudio is an integrated development environment (IDE) for R. RStudio includes a wide range of productivity enhancing features and runs on all major platforms.
Notation in this book
Annotation in this book
The name of R packages and libraries are used as published under their original names. The name of R functions is in R code form, like the
Book examples in R code language can be found in grey text boxes, for example: to read in the dataset Backpain 50 missing.sav.
# Activate the foreign package and read in the SPSS dataset library(foreign) dataset <- read.spss(file="data/Backpain 50 missing.sav", to.data.frame=T)
## re-encoding from UTF-8
We first want to thank our students and colleagues in the Amsterdam UMC hospital for their interesting questions about missing data during lectures and statistical consultations.
Sterne, Jonathan AC, Ian R White, John B Carlin, Michael Spratt, Patrick Royston, Michael G Kenward, Angela M Wood, and James R Carpenter. 2009. “Multiple Imputation for Missing Data in Epidemiological and Clinical Research: Potential and Pitfalls.” BMJ: British Medical Journal 338.
Little, R.J., R. D’Agostino, M.L. Cohen, K. Dickersin, S.S. Emerson, J.T. Farrar, C. Frangakis, et al. 2012. “The Prevention and Treatment of Missing Data in Clinical Trials.” New England Journal of Medicine 367 (14): 1355–60.
Li P, Allison DB, Stuart EA. 2015. “Multiple Imputation: A Flexible Tool for Handling Missing Data.” JAMA 314 (18): 1966–7. doi:10.1001/jama.2015.15281.
Xie, Yihui. 2018. Bookdown: Authoring Books and Technical Documents with R Markdown. https://CRAN.R-project.org/package=bookdown.
Allaire, JJ, Yihui Xie, Jonathan McPherson, Javier Luraschi, Kevin Ushey, Aron Atkins, Hadley Wickham, Joe Cheng, and Winston Chang. 2018. Rmarkdown: Dynamic Documents for R. https://CRAN.R-project.org/package=rmarkdown.