8.3 Deletion of Data
When it is desirable to use models that are intolerant to missing data, the missing values must first be removed from the data. The simplest approach is to remove entire predictors and/or samples that contain missing values. However, several aspects of the data must be considered carefully before taking this approach. For example, missing values could be eliminated by removing all predictors that contain at least one missing value. Similarly, missing values could be eliminated by removing all samples with any missing values. Neither of these approaches will be appropriate for all data sets, as can be inferred from the “No Free Lunch” theorem. For some data sets, particular predictors may be far more problematic than others; by removing these predictors, the missing data issue is resolved. For other data sets, specific samples may have consistently missing values across the predictors; by removing these samples, the missing data issue is likewise resolved. In practice, however, the pattern is rarely this clean: a combination of specific predictors and specific samples usually contains the majority of the missing information.
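As an illustration of these two extremes, consider the following minimal sketch on a small hypothetical data frame `df` (the data and object names are invented for this example, not taken from the analyses in this text):

```r
# A small hypothetical data frame with scattered missing values.
df <- data.frame(
  x1 = c(1, 2, NA, 4),
  x2 = c(NA, 1, 3, 2),
  x3 = c(5, 6, 7, 8)
)

# Strategy 1: remove every predictor (column) that contains any missing value.
complete_predictors <- df[, colSums(is.na(df)) == 0, drop = FALSE]   # keeps only x3

# Strategy 2: remove every sample (row) that contains any missing value.
complete_samples <- df[complete.cases(df), ]                         # keeps rows 2 and 4
```

Note how aggressive both rules are: the first discards two of three predictors, and the second discards half of the samples, even though only three cells were actually missing.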
Another important consideration is the intrinsic value of samples relative to predictors. When samples are difficult to obtain, or when the data contain only a small number of samples (i.e., rows), it is undesirable to remove samples from the data. In general, samples are more critical than predictors, and a higher priority should be placed on keeping as many as possible. Given this priority, an initial strategy is to first identify and remove predictors that have a sufficiently high proportion of missing data. Of course, predictors that are known to be valuable and/or predictive of the outcome should not be removed. Once the problematic predictors have been removed, attention can then turn to samples that surpass a threshold of missingness.
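A minimal sketch of this predictors-first strategy is shown below, again on a small hypothetical data frame. The 50% thresholds and the `keep_anyway` set are illustrative placeholders; appropriate values depend on the data and the modeling context:

```r
# Hypothetical data: x2 is mostly missing, and the last row is mostly missing.
df <- data.frame(
  x1 = c(1, 2, 3, NA),
  x2 = c(NA, NA, 3, NA),
  x3 = c(5, 6, 7, NA)
)

col_threshold <- 0.5      # drop predictors missing in more than half the samples
row_threshold <- 0.5      # then drop samples missing more than half the predictors
keep_anyway   <- c("x1")  # predictors known to be valuable are never dropped

# Step 1: remove problematic predictors (columns), protecting the valuable ones.
col_missing <- colMeans(is.na(df))
drop_cols   <- setdiff(names(df)[col_missing > col_threshold], keep_anyway)
reduced     <- df[, setdiff(names(df), drop_cols), drop = FALSE]

# Step 2: remove samples (rows) whose remaining missingness exceeds the threshold.
row_missing <- rowMeans(is.na(reduced))
cleaned     <- reduced[row_missing <= row_threshold, , drop = FALSE]
```

Here `x2` (75% missing) is dropped first; once it is gone, only the final row exceeds the row threshold, so a single sample is sacrificed rather than two.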
Consider the Chicago train ridership data as an example. The investigation into these data earlier in the chapter revealed that several of the stations had long stretches of contiguous missing data (Figure 8.5). Every date (sample) had at least one missing station (predictor), but for most predictors the degree of missingness was minimal. A handful of stations, however, had excessive missingness: too much contiguous missing data to keep them in the analysis set. In this case it makes more sense to remove this handful of stations. Whether to keep the Red Line stations in the analysis sets is less clear. Given the high degree of correlation between stations, it was decided that excluding all stations with missing data from all of the analyses in this text would not be detrimental to building an effective model for predicting ridership at the Clark-Lake station. This example illustrates that several important aspects must be weighed when determining how to handle missing values.
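Run-length encoding offers one simple way to flag this kind of contiguous missingness. The sketch below uses invented data and an arbitrary threshold of three consecutive days; it illustrates the idea rather than reproducing the screening actually applied to the ridership data:

```r
# Longest run of consecutive missing values in a vector, via run-length encoding.
longest_na_run <- function(x) {
  r <- rle(is.na(x))
  runs <- r$lengths[r$values]
  if (length(runs) == 0) 0L else max(runs)
}

# Hypothetical date-by-station ridership data.
ridership <- data.frame(
  station_a = c(1, 2, 3, 4, 5, 6),
  station_b = c(1, NA, NA, NA, NA, 6),  # one long contiguous gap
  station_c = c(1, 2, NA, 4, 5, 6)      # an isolated missing value
)

# Drop stations whose longest contiguous gap exceeds the illustrative limit.
gap <- vapply(ridership, longest_na_run, integer(1))
analysis_set <- ridership[, gap <= 3, drop = FALSE]
```

Under this rule, `station_b` (a gap of four consecutive days) is removed while `station_c`, with the same total missingness spread as an isolated value, is retained.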
Besides throwing away data, the main concern with removing samples (rows) of the training set is that it might bias the model that relates the predictors to the outcome. A classic example stems from medical studies. In these studies, one subset of patients is randomly assigned to the current standard of care while another subset is assigned to a new treatment. The new treatment may induce an adverse effect in some patients, causing them to drop out of the study and thus inducing missing data for future clinical visits. This kind of missing data is clearly not missing at random, and eliminating these patients from the analysis would make the outcome appear better than it would have been had their unfavorable results been included. That said, Allison (2001) notes that if the data are missing completely at random, deletion may be a viable approach.