## 17.6 Application

How many imputations?

Usually 5 (unless you have an extremely high proportion of missing data, in which case you should probably check your data again).

According to Rubin, the relative efficiency of an estimate based on $m$ imputations, relative to one based on an infinite number of imputations, is approximately

$(1+\frac{\lambda}{m})^{-1}$

where $\lambda$ is the rate of missing information (often approximated by the proportion of missing data).

For example, with 50% missing data, an estimate based on 5 imputations has a standard error only about 5% wider than an estimate based on an infinite number of imputations
($\sqrt{1+0.5/5}=1.049$).
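This arithmetic can be checked directly. A minimal R sketch (the `rel_eff` helper is our own, not from any package):

```r
# Relative efficiency of m imputations vs. infinitely many (Rubin's formula)
rel_eff <- function(m, lambda) 1 / (1 + lambda / m)

rel_eff(5, 0.5)    # ~0.91: 5 imputations retain ~91% of the full efficiency
sqrt(1 + 0.5 / 5)  # ~1.049: the standard error is ~5% wider than with m = Inf
```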

library(missForest)
## Loading required package: randomForest
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
##     margin
## Loading required package: foreach
## Loading required package: itertools
## Loading required package: iterators
##
## Attaching package: 'missForest'
## The following object is masked from 'package:VIM':
##
##     nrmse
#load data
data <- iris

# generate 10% missing values at random
set.seed(1)
iris.mis <- prodNA(iris, noNA = 0.1)

# keep a copy with the categorical variable (Species), then remove it
iris.mis.cat <- iris.mis
iris.mis <- subset(iris.mis, select = -c(Species))

### 17.6.1 Imputation with mean / median / mode

# whole data set
e1071::impute(iris.mis, what = "mean") # replace with mean
e1071::impute(iris.mis, what = "median") # replace with median

# by variables
Hmisc::impute(iris.mis$Sepal.Length, mean)    # mean
Hmisc::impute(iris.mis$Sepal.Length, median)  # median
Hmisc::impute(iris.mis$Sepal.Length, 0)       # replace with a specific number

# check accuracy
library(DMwR)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
##
## Attaching package: 'DMwR'
## The following object is masked from 'package:VIM':
##
##     kNN
actuals <- iris$Sepal.Width[is.na(iris.mis$Sepal.Width)]
predicteds <- rep(mean(iris$Sepal.Width, na.rm = T), length(actuals))
regr.eval(actuals, predicteds)
##       mae       mse      rmse      mape
## 0.2870303 0.1301598 0.3607767 0.1021485
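The section heading also mentions mode imputation, which suits categorical variables. Base R has no mode function, so a minimal helper can be defined; a self-contained sketch (`stat_mode` and the injected NA positions are ours, for illustration only):

```r
# Mode imputation for a categorical variable (sketch; stat_mode is our helper)
stat_mode <- function(x) names(which.max(table(x)))

sp <- iris$Species
sp[c(3, 57, 120)] <- NA        # inject a few missing values for illustration
sp[is.na(sp)] <- stat_mode(sp) # fill NAs with the most frequent level
anyNA(sp)  # FALSE
```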

### 17.6.2 KNN

library(DMwR)
# Ideally the data would be iris.mis[, !names(iris.mis) %in% c("Sepal.Length")],
# but since kNN can't work with 3 or fewer variables, we need to use at least 4 variables.

# knn is not appropriate for categorical variables
knnOutput <- knnImputation(data = iris.mis.cat,
                           # k = 10,
                           meth = "median"  # could use "median" or "weighAvg"
                           )  # should exclude the dependent variable: Sepal.Length
anyNA(knnOutput)
## [1] FALSE
library(DMwR)
actuals <- iris$Sepal.Width[is.na(iris.mis$Sepal.Width)]
predicteds <- knnOutput[is.na(iris.mis$Sepal.Width), "Sepal.Width"]
regr.eval(actuals, predicteds)
##       mae       mse      rmse      mape
## 0.2318182 0.1038636 0.3222788 0.0823571

Compared to the mape (mean absolute percentage error) of mean imputation, we almost always see improvements.

### 17.6.3 rpart

rpart can handle categorical (factor) variables.

library(rpart)
class_mod <- rpart(Species ~ . - Sepal.Length,
                   data = iris.mis.cat[!is.na(iris.mis.cat$Species), ],
                   method = "class",
                   na.action = na.omit)  # "class" since Species is a factor; exclude the dependent variable Sepal.Length

anova_mod <- rpart(Sepal.Width ~ . - Sepal.Length,
                   data = iris.mis[!is.na(iris.mis$Sepal.Width), ],
                   method = "anova",
                   na.action = na.omit)  # "anova" since Sepal.Width is numeric

species_pred <- predict(class_mod, iris.mis.cat[is.na(iris.mis.cat$Species), ])
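The predictions above are not yet written back into the data. A self-contained sketch of the write-back pattern for the numeric case (`df` and `fit` are our illustrative names; the injected NA positions are arbitrary):

```r
library(rpart)
# Sketch: impute a numeric column by writing rpart predictions back
df <- iris[, 1:4]
df$Sepal.Width[c(5, 60, 110)] <- NA   # inject missing values for illustration
fit <- rpart(Sepal.Width ~ . - Sepal.Length,
             data = df[!is.na(df$Sepal.Width), ], method = "anova")
df$Sepal.Width[is.na(df$Sepal.Width)] <-
  predict(fit, df[is.na(df$Sepal.Width), ])
anyNA(df$Sepal.Width)  # FALSE
```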

### 17.6.6 missForest

- An implementation of the random forest algorithm: a non-parametric imputation method applicable to various variable types. Hence, it makes no assumption about the functional form of f; instead, it estimates f so that it is as close to the data points as possible.
- It builds a random forest model for each variable, then uses that model to predict the missing values in that variable with the help of the observed values.
- It yields an out-of-bag (OOB) imputation error estimate and provides a high level of control over the imputation process.
- Since bagging works well on categorical variables too, we don't need to remove them here.
library(missForest)
#impute missing values, using all parameters as default values
iris.imp <- missForest(iris.mis)
##   missForest iteration 1 in progress...done!
##   missForest iteration 2 in progress...done!
##   missForest iteration 3 in progress...done!
##   missForest iteration 4 in progress...done!
# check imputed values
# iris.imp$ximp

# check imputation error
# NRMSE (normalized root mean squared error) represents the error from imputing continuous values.
# PFC (proportion of falsely classified) represents the error from imputing categorical values.
iris.imp$OOBerror
##      NRMSE        PFC
## 0.13631893 0.04477612
#comparing actual data accuracy
iris.err <- mixError(iris.imp$ximp, iris.mis, iris)
iris.err
##     NRMSE       PFC
## 0.1501524 0.0625000

This means categorical variables are imputed with a 5% error and continuous variables with a 14% error. This can be improved by tuning the values of the mtry and ntree parameters.

- mtry refers to the number of variables randomly sampled at each split.
- ntree refers to the number of trees to grow in the forest.

### 17.6.7 Hmisc

- impute() imputes missing values using a user-defined statistical method (mean, max, median, etc.); its default is median.
- aregImpute() allows mean imputation using additive regression, bootstrapping, and predictive mean matching.
  1. In bootstrapping, a different bootstrap resample is used for each of the multiple imputations. A flexible additive model (a non-parametric regression method) is then fitted on samples taken with replacement from the original data, and missing values (acting as the dependent variable) are predicted using the non-missing values (the independent variables).
  2. It uses predictive mean matching (the default) to impute missing values. Predictive mean matching works well for continuous and categorical (binary and multi-level) variables without the need for computing residuals or a maximum likelihood fit.

Note

- For predicting categorical variables, Fisher's optimum scoring method is used.
- Hmisc automatically recognizes the variable types and uses bootstrap samples and predictive mean matching to impute missing values.
- missForest can outperform Hmisc if the observed variables have sufficient information.

Assumption

- Linearity in the variables being predicted.

library(Hmisc)
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
##     format.pval, units
# impute with mean value
iris.mis$imputed_age <- with(iris.mis, impute(Sepal.Length, mean))

# impute with random value
iris.mis$imputed_age2 <- with(iris.mis, impute(Sepal.Length, 'random'))
# could also use min, max, median to impute missing values

# using aregImpute
impute_arg <- aregImpute(~ Sepal.Length + Sepal.Width + Petal.Length +
                           Petal.Width + Species,
                         data = iris.mis, n.impute = 5)
# aregImpute() automatically identifies the variable type and treats each accordingly.
## Iteration 1 Iteration 2 Iteration 3 Iteration 4 Iteration 5 Iteration 6 Iteration 7 Iteration 8
impute_arg
# R-squares are for predicted missing values.
##
## Multiple Imputation using Bootstrap and PMM
##
## aregImpute(formula = ~Sepal.Length + Sepal.Width + Petal.Length +
##     Petal.Width + Species, data = iris.mis, n.impute = 5)
##
## n: 150  p: 5  Imputations: 5  nk: 3
##
## Number of NAs:
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species
##           11           11           13           24           16
##
##              type d.f.
## Sepal.Length    s    2
## Sepal.Width     s    2
## Petal.Length    s    2
## Petal.Width     s    2
## Species         c    2
##
## Transformation of Target Variables Forced to be Linear
##
## R-squares for Predicting Non-Missing Values for Each Variable
## Using Last Imputations of Predictors
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species
##        0.907        0.660        0.978        0.963        0.993

# check imputed variable Sepal.Length
impute_arg$imputed$Sepal.Length
##     [,1] [,2] [,3] [,4] [,5]
## 19   5.2  5.2  5.2  5.8  5.7
## 21   5.1  5.0  5.1  5.7  5.4
## 31   4.8  5.0  5.2  5.0  4.8
## 35   4.6  4.9  4.9  4.9  4.8
## 49   5.0  5.1  5.1  5.1  5.1
## 62   6.2  5.7  6.0  6.4  5.6
## 65   5.5  5.5  5.2  5.8  5.5
## 67   6.5  5.8  5.8  6.3  6.5
## 82   5.2  5.1  5.7  5.8  5.5
## 113  6.4  6.5  7.4  7.2  6.3
## 122  6.2  5.8  5.5  5.8  6.7

### 17.6.8 mi

1. allows graphical diagnostics of the imputation models and of the convergence of the imputation process.
2. uses a Bayesian version of regression models to handle the issue of separation.
3. automatically detects irregularities in the data (e.g., high collinearity among variables).
4. adds noise to the imputation process to solve the problem of additive constraints.
library(mi)
# default values of parameters
# 1. rand.imp.method as “bootstrap”
# 2. n.imp (number of multiple imputations) as 3
# 3. n.iter (number of iterations) as 30
mi_data <- mi(iris.mis, seed = 335)
summary(mi_data)