5.4 Data cleaning and imputation
Data cleaning means:
(i) correcting/addressing any mistakes in the data
(ii) organising the data in ways that help the downstream analysis, e.g., clearer variable names, factor levels, data transformations
If you’ve encountered data quality problems in your dataset we have some cleaning choices. These are essentially to:

- Ignore
- Replace the problem or missing value with `NA`
- Delete
    - row-wise (sometimes known as complete-case analysis)
    - column/variable-wise, e.g., if most of the problems are associated with a single variable, or a very small number of variables, so there is less impact (data loss) from removing the variable than from removing a large number of impacted rows
- Correct where possible, e.g., if you have access to those involved with the primary data collection or where triangulation is possible. Although ideal, this is often difficult, especially when dealing with secondary data
- Impute a new but ‘reasonable’ value. Data imputation is sufficiently important that it is a major research topic (see the next section)
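As a small illustration of the ‘replace with `NA`’ choice, the following sketch (the data frame and its column names are hypothetical) recodes impossible values as missing so they are handled explicitly downstream, with row-wise deletion shown as a separate, deliberate step:

```r
# Hypothetical dataset with two impossible ages (a negative value and a
# sentinel code of 999)
effort <- data.frame(id = 1:5, age = c(34, -1, 27, 999, 41))

# Replace out-of-range values with NA so later analysis treats them as missing
effort$age[effort$age < 0 | effort$age > 120] <- NA

# Row-wise deletion (complete-case analysis) is then an explicit choice
complete <- na.omit(effort)
```

Making the recoding explicit like this documents the cleaning decision in the script itself, rather than silently editing the raw data.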
Usually ignoring problem/missing data values is quite problematic. This is for at least three reasons:
1. Unchecked/corrected errors will propagate through your analysis
2. You need to understand the reason why data items are invalid otherwise the problem will perpetuate and an opportunity to learn and improve the data collection processes is lost
3. It is likely the mechanism that causes the problem data is non-random leading to bias in your sample
For more details I recommend the short book by (De Jonge and Van Der Loo 2013) based on R.
5.4.1 Data Imputation
Data imputation is the substitution of an estimated value, as realistic as possible, for a missing or problematic data item. The substituted value is intended to enable subsequent data analysis to proceed.
One of the more common data quality problems is missingness.
For a good overview and description of the different missingness mechanisms see (Scheffer 2002), from a pioneer of the field (Rubin 1976) and the definitive book (Little and Rubin 2002).
The potential benefits of imputation are:
- preventing loss of information which would otherwise result from data cleaning strategies such as complete-case analysis
- reducing possible bias when values are not missing at random (Landerman, Land, and Pieper 1997)
There are many different imputation methods and this is a challenging and ongoing research area, see for example (Little 1988) concerning surveys and (Song and Shepperd 2007) concerning software engineering predictive models. For a more technical overview of imputation methods see (Pigott 2001).
Three very simple imputation methods are:
1. Mean imputation, where the missing value is replaced with the sample mean. This preserves the estimate of central tendency but at the expense of deflating the estimate of variance. This is potentially problematic especially when many values are imputed.
2. Regression-based imputation, where the missing value is replaced by the predicted value from a regression model (often a simple linear regression model) over the complete cases of the sample. This may bias the variance less than mean imputation but still may not be satisfactory. It assumes the data are missing completely at random (MCAR). Typically this isn’t warranted.
3. Hot deck imputation, where the missing value is replaced by a randomly selected value from the sample. This has the advantage of not biasing the variance estimate but still assumes the data are MCAR.
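The variance-deflating effect of mean imputation (method 1 above) can be seen in a short sketch; the vector is illustrative:

```r
# A vector with one missing value
x <- c(0, 3, 2, 8, 9, 2, NA, 4, 4, 1, 5, 2)

# Mean imputation: replace each NA with the mean of the observed values
x.mean <- x
x.mean[is.na(x.mean)] <- mean(x, na.rm = TRUE)

# The mean (central tendency) is preserved ...
mean(x, na.rm = TRUE)
mean(x.mean)

# ... but the variance estimate shrinks, because the imputed point sits
# exactly at the mean while the sample size increases
var(x, na.rm = TRUE)
var(x.mean)
```

With many imputed values the shrinkage compounds, which is why mean imputation is only suitable when the proportion of missing data is small.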
5.4.1.1 Dealing with missingness in R
So minimally we want to make missingness visible. Note there are very important differences between `""`, `NA` and `NaN`.

NB Missing values cannot be compared, even to themselves, so you can’t use comparison operators to test for the presence of missing values, e.g., `numVector[4] == NA` is never `TRUE`. You must use missing value functions such as `is.na()`.
Beyond this, R provides a lot of support for imputation. See (Kabacoff 2015) for a useful chapter entitled “Advanced methods for missing data”.
I outline some basic approaches in R.
- For complete-case analysis, i.e., removing incomplete observations, there is a useful function `na.omit()`.
- For some of the simple methods described above Frank Harrell’s {Hmisc} package is extremely useful. The `impute()` function has a default of using the median.
```
## [1]  0  3  2  8  9  2 NA  4  4  1  5  2
##  1  2  3  4  5  6  7  8  9 10 11 12 
##  0  3  2  8  9  2  3* 4  4  1  5  2 
```
In the above example we see that initially `missing.vec` has an `NA` for the 7th element in the vector. After applying the `impute()` function with the default option of median-imputation we can see it has been replaced with the median = 3. This is signified by an asterisk. Using the function option `fun=` allows us to use other methods such as hot deck, e.g., `impute(missing.vec, "random")`.
References
De Jonge, Edwin, and Mark Van Der Loo. 2013. An Introduction to Data Cleaning with R. https://cran.r-project.org/doc/contrib/de_Jonge+van_der_Loo-Introduction_to_data_cleaning_with_R.pdf; Statistics Netherlands Heerlen.
Kabacoff, Robert. 2015. R in Action: Data Analysis and Graphics with R. 2nd ed. Manning.
Landerman, Lawrence R, Kenneth C Land, and Carl F Pieper. 1997. “An Empirical Evaluation of the Predictive Mean Matching Method for Imputing Missing Values.” Sociological Methods & Research 26 (1): 3–33.
Little, R. J. A. 1988. “Missing-Data Adjustments in Large Surveys.” Journal of Business & Economic Statistics 6 (3): 287–96.
Little, R. J. A., and D. B. Rubin. 2002. Statistical Analysis with Missing Data. 2nd ed. New York: John Wiley & Sons.
Pigott, T. 2001. “A Review of Methods for Missing Data.” Educational Research and Evaluation 7 (4): 353–83.
Rubin, Donald B. 1976. “Inference and Missing Data.” Biometrika 63 (3): 581–92.
Scheffer, Judi. 2002. “Dealing with Missing Data.” In Research Letters in the Information and Mathematical Sciences, 153–60.
Song, Q., and M. Shepperd. 2007. “Missing Data Imputation Techniques.” International Journal of Business Intelligence & Data Mining 2 (3): 261–91.