11.9 Application of Imputation in R

This section demonstrates how to visualize missing data and handle it using different imputation techniques.

| Package | Algorithm | Continuous Variables | Categorical Variables | Diagnostics | Complexity Handling | Best Use Case | Limitations |
|---|---|---|---|---|---|---|---|
| missForest | Random Forest | Yes | Yes | Out-of-bag error (NRMSE, PFC) | Handles complex interactions | Mixed data types with complex interactions | May overfit with small datasets |
| Hmisc | Additive regression, bootstrapping, predictive mean matching | Yes | Yes | \(R^2\) for imputed values | Basic to intermediate complexity | Simple datasets with low complexity | Limited to simple imputation methods |
| mi | Bayesian regression | Yes | Yes | Graphical diagnostics, convergence | Detects issues like collinearity | Datasets with irregularities | Computationally intensive for large data |
| MICE | Multivariate Imputation via Chained Equations | Yes | Yes | Density plots, pooling of results | Handles variable interactions | General-purpose imputation for MAR data | Requires proper method selection for variable types |
| Amelia | Expectation-Maximization with Bootstrapping (EMB) | Yes | Limited (requires normality) | Diagnostics supported | Works well with large/time-series data | Time-series or datasets approximating MVN | Assumes MVN; requires transformations for non-MVN data |

11.9.1 Visualizing Missing Data

Visualizing missing data is an essential first step in understanding the patterns and extent of missingness in your dataset.

library(visdat)
library(naniar)
library(ggplot2)

# Visualizing missing data
vis_miss(airquality)


# Missingness patterns using an upset plot
gg_miss_upset(airquality)

# Missingness scatter plot, faceted by month
ggplot(airquality, aes(x = Solar.R, y = Ozone)) +
  geom_miss_point() +
  facet_wrap(~ Month)

# Missing values by variable, faceted by month
gg_miss_var(airquality, facet = Month)

# Missingness in relation to a grouping variable
gg_miss_fct(x = airquality, fct = Month)
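
Beyond plots, naniar also provides numeric summaries of missingness; a minimal sketch (all three functions come from naniar):

# Numeric summaries of missingness (complements the plots above)
n_miss(airquality)             # total number of missing cells
miss_var_summary(airquality)   # missings per variable, with percentages
miss_case_table(airquality)    # how many rows have 0, 1, 2, ... missing values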

For more details, read The Missing Book by Nicholas Tierney & Allison Horst.

11.9.2 How Many Imputations?

Usually, 5 imputations are sufficient unless the proportion of missing data is extremely high; in that case, it is better to revisit the data collection process than to rely on imputation.

Rubin’s Rule for Relative Efficiency

According to Rubin, the relative efficiency of an estimate based on \(m\) imputations (relative to infinite imputations) is given by:

\[ \text{Relative Efficiency} = \left( 1 + \frac{\lambda}{m} \right)^{-1} \]

where \(\lambda\) is the rate of missing data.

For example, with 50% missing data (\(\lambda = 0.5\)), the standard deviation of an estimate based on 5 imputations is only about 5% wider than that from infinite imputations:

\[ \sqrt{1 + \frac{0.5}{5}} = 1.049 \]
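
The same quantities can be checked directly in R; a minimal sketch of Rubin's formula:

# Rubin's relative efficiency for m imputations and missing-data rate lambda
rel_eff <- function(lambda, m) (1 + lambda / m)^(-1)

rel_eff(lambda = 0.5, m = 5)   # ~0.91 relative efficiency
sqrt(1 + 0.5 / 5)              # ~1.05: SD about 5% wider than with infinite imputations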

11.9.3 Generating Missing Data for Demonstration

library(missForest)

# Load the data
data <- iris

# Generate 10% missing values at random
set.seed(1)
iris.mis <- prodNA(iris, noNA = 0.1)

# Remove categorical variables for numeric imputation
iris.mis.cat <- iris.mis
iris.mis <- subset(iris.mis, select = -c(Species))

11.9.4 Imputation with Mean, Median, and Mode

Mean, median, or mode imputation is a simple yet commonly used technique.

# Imputation for the entire dataset
e1071::impute(iris.mis, what = "mean")        # Replace with mean
e1071::impute(iris.mis, what = "median")      # Replace with median

# Imputation by variable
Hmisc::impute(iris.mis$Sepal.Length, mean)    # Replace with mean
Hmisc::impute(iris.mis$Sepal.Length, median)  # Replace with median
Hmisc::impute(iris.mis$Sepal.Length, 0)       # Replace with a specific value
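
Neither call above covers the "mode" part for categorical variables; a minimal base-R sketch, applied to the Species column retained in iris.mis.cat (the objects species_mode and species_imputed are introduced here purely for illustration):

# Mode imputation for a categorical variable (illustrative sketch)
species_mode <- names(which.max(table(iris.mis.cat$Species)))  # most frequent level
species_imputed <- iris.mis.cat$Species
species_imputed[is.na(species_imputed)] <- species_mode        # replace NAs with the mode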

Checking Accuracy

Accuracy can be checked by comparing predictions with actual values.

# Example data
actuals <- iris$Sepal.Width[is.na(iris.mis$Sepal.Width)]
predicteds <- rep(mean(iris$Sepal.Width, na.rm = TRUE), length(actuals))

# Using MLmetrics package
library(MLmetrics)

MAE(predicteds, actuals)
#> [1] 0.2870303
MSE(predicteds, actuals)
#> [1] 0.1301598
RMSE(predicteds, actuals)
#> [1] 0.3607767

11.9.5 K-Nearest Neighbors (KNN) Imputation

KNN is a more sophisticated method, leveraging similar observations to fill in missing values.

library(DMwR2)
knnOutput <- knnImputation(data = iris.mis.cat, meth = "median")
anyNA(knnOutput)  # Check for remaining missing values
#> [1] FALSE
actuals <- iris$Sepal.Width[is.na(iris.mis$Sepal.Width)]
predicteds <- knnOutput[is.na(iris.mis$Sepal.Width), "Sepal.Width"]
# Using MLmetrics package
library(MLmetrics)

MAE(predicteds, actuals)
#> [1] 0.2318182
MSE(predicteds, actuals)
#> [1] 0.1038636
RMSE(predicteds, actuals)
#> [1] 0.3222788

KNN typically improves upon mean or median imputation in terms of predictive accuracy.

11.9.6 Imputation with Decision Trees (rpart)

Decision trees, such as those implemented in rpart, are effective for both numeric and categorical variables.

library(rpart)

# Imputation for a categorical variable
class_mod <- rpart(
  Species ~ . - Sepal.Length,
  data = iris.mis.cat[!is.na(iris.mis.cat$Species), ],
  method = "class",
  na.action = na.omit
)

# Imputation for a numeric variable
anova_mod <- rpart(
  Sepal.Width ~ . - Sepal.Length,
  data = iris.mis[!is.na(iris.mis$Sepal.Width), ],
  method = "anova",
  na.action = na.omit
)

# Predictions for the rows with missing values
species_pred <- predict(class_mod, iris.mis.cat[is.na(iris.mis.cat$Species), ], type = "class")
width_pred <- predict(anova_mod, iris.mis[is.na(iris.mis$Sepal.Width), ])
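
The predictions can then be written back into copies of the data to complete the imputation; a short sketch (iris.imp.cat and iris.imp.num are new names introduced here for illustration):

# Fill the predicted values back into copies of the datasets
iris.imp.cat <- iris.mis.cat
iris.imp.cat$Species[is.na(iris.mis.cat$Species)] <- species_pred

iris.imp.num <- iris.mis
iris.imp.num$Sepal.Width[is.na(iris.mis$Sepal.Width)] <- width_pred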

11.9.7 MICE (Multivariate Imputation via Chained Equations)

MICE assumes that the data are Missing at Random (MAR). It imputes data for each variable by specifying an imputation model tailored to the variable type.

11.9.7.1 How MICE Works

For a dataset with variables \(X_1, X_2, \dots, X_k\):

  • If \(X_1\) has missing data, it is regressed on the other variables, and its missing entries are replaced with the predicted values.

  • This process is repeated for all variables with missing data, using the previously predicted values as needed.

In a typical chained-equations setup:

  • Continuous variables are modeled with linear regression.

  • Categorical variables are modeled with logistic regression.

11.9.7.2 Methods Available in MICE

  • pmm (Predictive Mean Matching): for numeric variables.
  • logreg (Logistic Regression): for binary variables (2 levels).
  • polyreg (Polytomous Logistic Regression): for unordered factor variables (>2 levels).
  • polr (Proportional Odds Model): for ordered factor variables (≥2 levels).

# Load packages
library(mice)
library(VIM)

# Check missing values pattern
md.pattern(iris.mis)

#>     Sepal.Width Sepal.Length Petal.Length Petal.Width   
#> 100           1            1            1           1  0
#> 15            1            1            1           0  1
#> 8             1            1            0           1  1
#> 2             1            1            0           0  2
#> 11            1            0            1           1  1
#> 1             1            0            1           0  2
#> 1             1            0            0           1  2
#> 1             1            0            0           0  3
#> 7             0            1            1           1  1
#> 3             0            1            0           1  2
#> 1             0            0            1           1  2
#>              11           15           15          19 60

# Plot missing values
aggr(
  iris.mis,
  col = c('navyblue', 'yellow'),
  numbers = TRUE,
  sortVars = TRUE,
  labels = names(iris.mis),
  cex.axis = 0.7,
  gap = 3,
  ylab = c("Missing data", "Pattern")
)

#> 
#>  Variables sorted by number of missings: 
#>      Variable      Count
#>   Petal.Width 0.12666667
#>  Sepal.Length 0.10000000
#>  Petal.Length 0.10000000
#>   Sepal.Width 0.07333333

Imputing Data

# Perform multiple imputation using MICE
imputed_Data <- mice(
  iris.mis,
  m = 5,             # Number of imputed datasets
  maxit = 10,        # Number of iterations
  method = 'pmm',    # Imputation method
  seed = 500         # Random seed for reproducibility
)
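
For mixed data, method can also be a vector with one entry per column; a hedged sketch using the earlier iris.mis.cat, which still contains the Species factor:

# Sketch: per-variable imputation methods for a mixed dataset
meth <- c(Sepal.Length = "pmm", Sepal.Width = "pmm",
          Petal.Length = "pmm", Petal.Width = "pmm",
          Species      = "polyreg")
imputed_mixed <- mice(iris.mis.cat, m = 5, maxit = 10, method = meth, seed = 500)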

Evaluating Imputed Data

# Summary of imputed data
summary(imputed_Data)
#> Class: mids
#> Number of multiple imputations:  5 
#> Imputation methods:
#> Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
#>        "pmm"        "pmm"        "pmm"        "pmm" 
#> PredictorMatrix:
#>              Sepal.Length Sepal.Width Petal.Length Petal.Width
#> Sepal.Length            0           1            1           1
#> Sepal.Width             1           0            1           1
#> Petal.Length            1           1            0           1
#> Petal.Width             1           1            1           0

# Density plot: compare imputed values (red) with observed values (blue)
densityplot(imputed_Data)

Accessing and Using Imputed Data

# Access the complete datasets
completeData1 <- complete(imputed_Data, 1)  # First imputed dataset
completeData2 <- complete(imputed_Data, 2)  # Second imputed dataset

Regression Model with Imputed Dataset

# Fit a regression model using imputed datasets
fit <- with(data = imputed_Data, expr = lm(Sepal.Width ~ Sepal.Length + Petal.Width))

# Combine results of all 5 models
combine <- pool(fit)
summary(combine)
#>           term   estimate  std.error statistic        df      p.value
#> 1  (Intercept)  1.9054698 0.33454626  5.695684 105.12438 1.127064e-07
#> 2 Sepal.Length  0.2936285 0.07011405  4.187870  88.69066 6.625536e-05
#> 3  Petal.Width -0.4742921 0.08138313 -5.827892  46.94941 4.915270e-07

11.9.8 Amelia

Amelia uses the Expectation-Maximization with Bootstrapping (EMB) algorithm for imputation, which makes it fast and suitable for both cross-sectional and time-series data.

11.9.8.1 Assumptions

  • All variables must follow a Multivariate Normal Distribution (MVN). Transformations may be required for non-normal data.

  • Data must be Missing at Random (MAR).

11.9.8.2 Comparison: Amelia vs. MICE

  • MICE imputes on a variable-by-variable basis using separate models.

  • Amelia uses a joint modeling approach based on MVN.

  • MICE handles multiple data types, while Amelia requires variables to approximate normality.

11.9.8.3 Imputation with Amelia

library(Amelia)
data("iris")

# Seed 10% missing values
set.seed(123)
iris.mis <- prodNA(iris, noNA = 0.1)

# Specify columns and run Amelia
amelia_fit <- amelia(
  iris.mis,
  m = 5,                      # Number of imputations
  parallel = "multicore",     # Use multicore processing
  noms = "Species"            # Nominal variables
)
#> -- Imputation 1 --
#> 
#>   1  2  3  4  5  6  7
#> 
#> -- Imputation 2 --
#> 
#>   1  2  3  4  5
#> 
#> -- Imputation 3 --
#> 
#>   1  2  3  4  5
#> 
#> -- Imputation 4 --
#> 
#>   1  2  3  4  5  6
#> 
#> -- Imputation 5 --
#> 
#>   1  2  3  4  5  6  7  8  9 10

# Access imputed outputs
# amelia_fit$imputations[[1]]

Amelia draws bootstrap samples of the data and runs the EM algorithm on each to estimate the means and covariances used to generate the imputations. This bootstrap-based approach keeps the procedure fast, even for large datasets.
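
Amelia also ships with diagnostic and export helpers; a brief, hedged sketch:

# Sketch: Amelia diagnostics and export
plot(amelia_fit)   # observed vs. imputed density comparisons for each variable
# compare.density(amelia_fit, var = "Sepal.Length")           # single-variable comparison
# write.amelia(obj = amelia_fit, file.stem = "imputed_iris")  # save each imputation as CSV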

11.9.9 missForest

The missForest package provides a robust non-parametric imputation method using the Random Forest algorithm. It is versatile, handling both continuous and categorical variables without requiring assumptions about the underlying functional forms.

Key Features of missForest

  1. Non-Parametric: No assumptions about the functional form.
  2. Variable-Specific Models: Builds a random forest model for each variable to impute missing values.
  3. Error Estimates: Provides out-of-bag (OOB) imputation error estimates.
    • NRMSE (Normalized Root Mean Squared Error) for continuous variables.
    • PFC (Proportion of Falsely Classified) for categorical variables.
  4. High Control: Offers customizable parameters such as mtry and ntree (see the tuning sketch at the end of this subsection).

library(missForest)

# Impute missing values using default parameters
iris.imp <- missForest(iris.mis)

# Check imputed values
# View the imputed dataset
# iris.imp$ximp
# Out-of-bag error estimates
iris.imp$OOBerror
#>      NRMSE        PFC 
#> 0.14004144 0.02877698

# Compare imputed data with original data to calculate error
iris.err <- mixError(iris.imp$ximp, iris.mis, iris)
iris.err
#>      NRMSE        PFC 
#> 0.14420833 0.09090909
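
The forests themselves can be tuned through missForest's arguments; a short sketch (parameter values are illustrative, not recommendations):

# Sketch: tuning the random forests used for imputation
iris.imp.tuned <- missForest(
  iris.mis,
  ntree   = 200,   # trees per forest (default is 100)
  mtry    = 2,     # variables tried at each split
  verbose = TRUE   # print error estimates after each iteration
)
iris.imp.tuned$OOBerror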

11.9.10 Hmisc

The Hmisc package provides a suite of tools for imputing missing data, offering both simple methods (like mean or median imputation) and more advanced approaches like aregImpute.

Features of Hmisc

  1. impute(): Simple imputation using user-defined methods like mean, median, or a random value.

  2. aregImpute():

    • Combines additive regression, bootstrapping, and predictive mean matching.

    • Handles continuous and categorical variables.

    • Automatically recognizes variable types and applies appropriate methods.

Assumptions

  • Linearity in the variables being predicted.

  • Fisher’s optimum scoring is used for categorical variable prediction.

library(Hmisc)

# Impute using mean
iris.mis$imputed_SepalLength <- with(iris.mis, impute(Sepal.Length, mean))

# Impute using random value
iris.mis$imputed_SepalLength2 <- with(iris.mis, impute(Sepal.Length, 'random'))

# Advanced imputation using aregImpute
impute_arg <- aregImpute(
  ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width + Species,
  data = iris.mis,
  n.impute = 5
)
#> Iteration 1 
#> Iteration 2 
#> Iteration 3 
#> Iteration 4 
#> Iteration 5 
#> Iteration 6 
#> Iteration 7 
#> Iteration 8 

# Check R-squared values for predicted missing values
impute_arg
#> 
#> Multiple Imputation using Bootstrap and PMM
#> 
#> aregImpute(formula = ~Sepal.Length + Sepal.Width + Petal.Length + 
#>     Petal.Width + Species, data = iris.mis, n.impute = 5)
#> 
#> n: 150   p: 5    Imputations: 5      nk: 3 
#> 
#> Number of NAs:
#> Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
#>           17           19           12           16           11 
#> 
#>              type d.f.
#> Sepal.Length    s    2
#> Sepal.Width     s    2
#> Petal.Length    s    2
#> Petal.Width     s    2
#> Species         c    2
#> 
#> Transformation of Target Variables Forced to be Linear
#> 
#> R-squares for Predicting Non-Missing Values for Each Variable
#> Using Last Imputations of Predictors
#> Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
#>        0.895        0.536        0.987        0.967        0.984

# Access imputed values for Sepal.Length
impute_arg$imputed$Sepal.Length
#>     [,1] [,2] [,3] [,4] [,5]
#> 13   4.4  4.9  4.9  5.0  4.9
#> 14   4.8  4.4  5.0  4.5  4.5
#> 23   4.8  5.1  5.1  5.1  4.8
#> 26   5.0  4.8  4.9  4.9  5.0
#> 34   5.0  5.8  6.0  5.7  5.8
#> 39   4.4  4.9  5.0  4.5  4.6
#> 41   5.2  5.1  4.8  5.0  4.8
#> 69   5.8  6.0  6.3  6.0  6.1
#> 72   5.6  5.7  5.7  5.8  6.1
#> 89   6.1  5.7  5.7  5.6  6.9
#> 90   5.5  6.2  5.2  6.0  5.8
#> 91   5.7  6.9  6.0  6.4  6.4
#> 116  5.9  6.8  6.4  6.6  6.9
#> 118  7.9  7.9  7.9  7.9  7.9
#> 135  6.7  6.7  6.7  6.9  6.7
#> 141  7.0  6.3  5.9  6.7  7.0
#> 143  5.7  6.7  5.8  6.3  5.4
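
To turn the aregImpute fit into a completed dataset, Hmisc's impute.transcan() can substitute a chosen imputation into each variable; a hedged sketch:

# Sketch: build one completed dataset from the aregImpute object (first imputation)
completed_vars <- impute.transcan(impute_arg, imputation = 1, data = iris.mis,
                                  list.out = TRUE, pr = FALSE, check = FALSE)
completed_iris <- as.data.frame(completed_vars)
head(completed_iris)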

Note: While missForest often outperforms Hmisc in imputation accuracy, Hmisc remains useful for simpler datasets and quick single imputations.

11.9.11 mi

The mi package is a powerful tool for imputation, using Bayesian methods and providing rich diagnostics for model evaluation and convergence.

Features of mi

  1. Graphical Diagnostics: Visualize imputation models and convergence.

  2. Bayesian Regression: Handles separation and other issues in data.

  3. Irregularity Detection: Automatically detects issues like high collinearity.

  4. Noise Addition: Adds noise to address additive constraints.

library(mi)

# Perform imputation using mi
mi_data <- mi(iris.mis, seed = 1)

# Summary of the imputation process
summary(mi_data)
#> $Sepal.Length
#> $Sepal.Length$is_missing
#> missing
#> FALSE  TRUE 
#>   133    17 
#> 
#> $Sepal.Length$imputed
#>       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
#> -0.6355172 -0.0703238 -0.0005039 -0.0052716  0.0765631  0.3731257 
#> 
#> $Sepal.Length$observed
#>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
#> -0.90110 -0.47329 -0.04549  0.00000  0.32120  1.23792 
#> 
#> 
#> $Sepal.Width
#> $Sepal.Width$is_missing
#> missing
#> FALSE  TRUE 
#>   131    19 
#> 
#> $Sepal.Width$imputed
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#> -2.1083 -0.4216 -0.1925 -0.1940  0.1589  0.7330 
#> 
#> $Sepal.Width$observed
#>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
#> -1.01272 -0.30642 -0.07099  0.00000  0.39988  1.34161 
#> 
#> 
#> $Petal.Length
#> $Petal.Length$is_missing
#> missing
#> FALSE  TRUE 
#>   138    12 
#> 
#> $Petal.Length$imputed
#>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
#> -0.86312 -0.58453  0.23556  0.04176  0.48870  0.77055 
#> 
#> $Petal.Length$observed
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#> -0.7797 -0.6088  0.1459  0.0000  0.3880  0.9006 
#> 
#> 
#> $Petal.Width
#> $Petal.Width$is_missing
#> missing
#> FALSE  TRUE 
#>   134    16 
#> 
#> $Petal.Width$imputed
#>       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
#> -0.9116177 -0.0000042  0.2520468  0.1734543  0.5147010  0.8411324 
#> 
#> $Petal.Width$observed
#>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
#> -0.69624 -0.56602  0.08503  0.00000  0.41055  0.86629 
#> 
#> 
#> $Species
#> $Species$crosstab
#>             
#>              observed imputed
#>   setosa          180      16
#>   versicolor      192      11
#>   virginica       184      17
#> 
#> 
#> $imputed_SepalLength
#> $imputed_SepalLength$is_missing
#> [1] "all values observed"
#> 
#> $imputed_SepalLength$observed
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#> -0.9574 -0.4379  0.0000  0.0000  0.3413  1.3152 
#> 
#> 
#> $imputed_SepalLength2
#> $imputed_SepalLength2$is_missing
#> [1] "all values observed"
#> 
#> $imputed_SepalLength2$observed
#>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
#> -0.90570 -0.48398 -0.06225  0.00000  0.35947  1.20292
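
After inspecting the summary, the fit can be checked for convergence and turned into completed datasets; a minimal sketch using mi's own helpers:

# Sketch: convergence diagnostics and completed datasets from the mi fit
Rhats(mi_data)                                # R-hat statistics; values near 1 suggest convergence
mi_completed <- mi::complete(mi_data, m = 2)  # list of 2 completed data frames
head(mi_completed[[1]])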