5.4 Data cleaning and imputation

Data cleaning means two things:
(i) correcting or otherwise addressing any mistakes in the data
(ii) organising the data in ways that help the downstream analysis, e.g., clearer variable names, factor levels, data transformations

If you have encountered data quality problems in your dataset, there are several cleaning choices. These are essentially to:

  1. Ignore
  2. Replace the problem or missing value with NA
  3. Delete
    • row-wise (sometimes known as complete-case analysis)
    • column/variable-wise, e.g., if most of the problems are associated with a single or very small number of variables, so there is less impact (data loss) from removing the variable(s) than from removing a large number of affected rows
  4. Correct where possible, e.g., if you have access to those involved with the primary data collection or where triangulation is possible. Although this is ideal, it is often difficult, especially when dealing with secondary data
  5. Impute a new but ‘reasonable’ value. Data imputation is sufficiently important that it is a major research topic in its own right (see the next section)

Usually, ignoring problem/missing data values is itself quite problematic, for at least three reasons:
1. Unchecked/uncorrected errors will propagate through your analysis
2. You need to understand why data items are invalid; otherwise the problem will recur, and an opportunity to learn from and improve the data collection process is lost
3. The mechanism that causes the problem data is likely to be non-random, leading to bias in your sample

For more details I recommend the short, R-based book by (De Jonge and Van Der Loo 2013).

5.4.1 Data Imputation

Data imputation is the substitution of an estimated value, as realistic as possible, for a missing or problematic data item. The substituted value is intended to enable subsequent data analysis to proceed.

One of the more common data quality problems is missingness.

For a good overview and description of the different missingness mechanisms see (Scheffer 2002); the field was pioneered by (Rubin 1976), and the definitive book is (Little and Rubin 2002).

The potential benefits of imputation are:

  • preventing loss of information which would otherwise result from data cleaning strategies such as complete-case analysis
  • reducing possible bias when values are not missing at random (Landerman, Land, and Pieper 1997)

There are many different imputation methods and this is a challenging and ongoing research area, see for example (Little 1988) concerning surveys and (Song and Shepperd 2007) concerning software engineering predictive models. For a more technical overview of imputation methods see (Pigott 2001).

Three very simple imputation methods are:
1. Mean, median or mode imputation, where the missing value is replaced with the sample mean (or median, or mode). This preserves the estimate of central tendency but at the expense of deflating the estimate of variance, which is potentially problematic, especially when many values are imputed. NB The median is more robust to outliers so is generally to be preferred. The mode is used when dealing with categorical data.
2. Regression-based imputation, where the missing value is replaced by the predicted value from a regression model (often a simple linear regression) fitted to the complete cases of the sample. This may bias the variance less than mean imputation but still may not be satisfactory. It assumes the data are missing completely at random (MCAR), which typically isn’t warranted.
3. Hot deck imputation, where the missing value is replaced by a randomly selected value from the sample. This has the advantage of not biasing the variance estimate but still assumes the data are MCAR.
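As a sketch of the regression-based and hot deck approaches, assuming a small illustrative data frame with missingness in var1 (the variable names and values are made up):

```r
# Illustrative data: var1 has missing values, var2 is complete
set.seed(123)  # make the hot deck draw reproducible
df <- data.frame(var1 = c(10, NA, 12, NA, 15, 11, 14),
                 var2 = c( 2,  3,  4,  5,  6,  3,  5))
miss <- is.na(df$var1)

# Regression-based imputation: fit on the complete cases, predict the rest
fit <- lm(var1 ~ var2, data = df)   # lm() drops the incomplete rows by default
df$var1_reg <- df$var1
df$var1_reg[miss] <- predict(fit, newdata = df[miss, ])

# Hot deck imputation: draw randomly from the observed values
df$var1_hot <- df$var1
df$var1_hot[miss] <- sample(df$var1[!miss], sum(miss), replace = TRUE)
df
```

Note that hot deck imputation only ever re-uses values actually observed in the sample, whereas regression-based imputation can produce values never observed.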

5.4.1.1 Dealing with missingness in R

So, minimally, we want to make missingness visible. Note there are very important differences in R between the empty string "", NA (not available, i.e., missing) and NaN (not a number).

NB Missing values cannot be compared, even to themselves, so you cannot use comparison operators to test for the presence of missing values: numVector[4] == NA never evaluates to TRUE. You must use missing-value functions such as is.na().
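For example (numVector and df here are purely illustrative):

```r
numVector <- c(1, 7, 3, NA, 5)

numVector[4] == NA       # NA, not TRUE: any comparison with NA is undefined
is.na(numVector)         # FALSE FALSE FALSE  TRUE FALSE
which(is.na(numVector))  # 4

# complete.cases() works row-wise on data frames
df <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))
complete.cases(df)       # TRUE FALSE FALSE
na.omit(df)              # keeps only row 1
```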

Beyond this, R provides a lot of support for imputation. See (Kabacoff 2015) for a useful chapter entitled “Advanced methods for missing data”.

I outline some basic approaches in R.

  1. For complete-case analysis, i.e., removing incomplete observations, there is a useful function na.omit(), or alternatively complete.cases().
  2. You can hand code simple approaches like mean or median imputation. Consider the following example in R code of a simple mean imputation algorithm.
# make a dataframe
df_mn <- data.frame(var1=c(100, NA, NA, 4, 5, 3, 5),
                    var2=c(6, 7, 8, 3, 2, 7, 4))
# mean imputation
df_mn$var1[is.na(df_mn$var1)] <- mean(df_mn$var1, na.rm=TRUE)
df_mn
##    var1 var2
## 1 100.0    6
## 2  23.4    7
## 3  23.4    8
## 4   4.0    3
## 5   5.0    2
## 6   3.0    7
## 7   5.0    4

We can see that the mean imputation method has introduced new values of 23.4, which are considerably influenced by the extreme outlier value of 100. It also has the disadvantage of reducing the variance of the variable (column).

A slightly more robust method (to outliers) is median imputation (see below). Here it could be argued that the imputed values (5) are far more typical.

# make a dataframe
df_md <- data.frame(var1=c(100, NA, NA, 4, 5, 3, 5),
                    var2=c(6, 7, 8, 3, 2, 7, 4))
# median imputation
df_md$var1[is.na(df_md$var1)] <- median(df_md$var1, na.rm=TRUE)
df_md
##   var1 var2
## 1  100    6
## 2    5    7
## 3    5    8
## 4    4    3
## 5    5    2
## 6    3    7
## 7    5    4

Incidentally there could be an issue if the variable represents a count and the median results in a .5 value. Why would this be problematic? How could you simply overcome this difficulty?

  3. For some of the simple methods described above, Frank Harrell’s {Hmisc} package is extremely useful. Its impute() function defaults to using the median.

There are many packages to support advanced imputation techniques e.g., {mice} and {VIM}.
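As a brief taster of {mice}, a minimal call might look like the following (the data frame and column names are invented; by default mice produces m multiple imputations using predictive mean matching for numeric variables):

```r
library(mice)

# Invented example data with missingness in 'effort'
df <- data.frame(effort = c(12, NA, 30, 25, NA, 18, 22, 27),
                 size   = c( 5, 10, 14, 11,  6,  8,  9, 13))

imp <- mice(df, m = 5, method = "pmm", seed = 42, printFlag = FALSE)
completed <- complete(imp, 1)  # extract the first of the five completed datasets
completed
```

Multiple imputation goes beyond the single-value methods above: analyses are run on each completed dataset and the results pooled, which better reflects the uncertainty introduced by imputing.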

However, we will only quickly look at some very simple methods, such as the median (for numeric data) and the mode (for categorical data), using the general-purpose {Hmisc} package and its simple impute() function.

# The first time, install the package by uncommenting the next statement
# install.packages("Hmisc")
library(Hmisc)

# Let's work with education.DF and make a copy to compare before and after
newDF <- education.DF
# The impute function (2nd argument) defaults to median if you don't specify
# Impute values for NAs in IncomeTax and create a new data frame
newDF$IncomeTax <- impute(education.DF$IncomeTax, median)
newDF
##    Education Sex Salary IncomeTax
## 1        BSc   M  50000     15000
## 2        BSc   M  45000     13500
## 3        BSc   M  25000      7500
## 4        BSC   F      0         0
## 5        BSc  F   25000      7500
## 6        MSc   F  55000     16500
## 7        MSc   M  35000     90000
## 8        MSc   F  90000     27000
## 9        PhD   F  80000     24000
## 10       PhD   M 145000    15000*

In the above example we see that education.DF initially has an NA for the 10th element of IncomeTax. After applying the impute() function with the default option of median imputation, we can see it has been replaced with the median (15000), signified by an asterisk. The second argument (or the fun= option) allows us to use other methods, such as hot deck, e.g., impute( . , "random"), or to force a particular value, e.g., impute( . , 42).
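For categorical data there is no built-in mode function in R, but hand-coding mode imputation is straightforward (the sex factor below is illustrative):

```r
# Hand-coded mode imputation for a categorical variable
sex <- factor(c("M", "F", "F", NA, "F", "M", NA))

# The mode is the most frequent observed level; table() ignores NA by default
mode_val <- names(which.max(table(sex)))
sex[is.na(sex)] <- mode_val
sex  # the two NAs are now "F"
```

As with mean imputation for numeric data, this deflates the apparent variability of the variable, here by inflating the frequency of the modal category.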

References

De Jonge, Edwin, and Mark Van Der Loo. 2013. An Introduction to Data Cleaning with R. Statistics Netherlands, Heerlen. https://cran.r-project.org/doc/contrib/de_Jonge+van_der_Loo-Introduction_to_data_cleaning_with_R.pdf
Kabacoff, Robert. 2015. R in Action: Data Analysis and Graphics with R. 2nd ed. Manning.
Landerman, Lawrence R, Kenneth C Land, and Carl F Pieper. 1997. “An Empirical Evaluation of the Predictive Mean Matching Method for Imputing Missing Values.” Sociological Methods & Research 26 (1): 3–33.
Little, R. 1988. “Missing-Data Adjustments in Large Surveys.” Journal of Business & Economic Statistics 6 (3): 287–96.
Little, R., and D. Rubin. 2002. Statistical Analysis with Missing Data. 2nd ed. New York: John Wiley & Sons.
Pigott, T. 2001. “A Review of Methods for Missing Data.” Educational Research and Evaluation 7 (4): 353–83.
Rubin, Donald B. 1976. “Inference and Missing Data.” Biometrika 63 (3): 581–92.
Scheffer, Judi. 2002. “Dealing with Missing Data.” In Research Letters in the Information and Mathematical Sciences, 153–60.
Song, Q., and M. Shepperd. 2007. “Missing Data Imputation Techniques.” International Journal of Business Intelligence & Data Mining 2 (3): 261–91.