Chapter 1 Exploring and Preprocessing

Before a machine learning model can be built, an exploratory data analysis (EDA) must be performed. Based on this EDA, the data are cleaned and prepared for further analysis.
Data cleaning and preparation deal with, among other things, missing values, outliers and data formatting.

After importing the data into R, first check whether the data type of each variable is correct.
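A minimal sketch of such a check, assuming a small made-up data frame (the column names and values are illustrative only) in which a numeric column was imported as text:

```r
# Toy data frame standing in for freshly imported data (hypothetical columns)
df <- data.frame(
  id      = 1:4,
  student = c("No", "Yes", "No", "Yes"),
  balance = c("729.5", "817.2", "1073.5", "529.3"),  # numbers read in as text
  stringsAsFactors = FALSE
)

# Inspect the type of every column
str(df)

# Correct the types that were imported wrongly
df$student <- factor(df$student)
df$balance <- as.numeric(df$balance)

# Confirm the result
sapply(df, class)
```
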

1.1 Missing values

Missing values are quite common in real-world data sets. How to deal with them depends on things like (1) the number of observations with missing values and (2) whether there is a pattern in the missing values.
Different options to deal with missing values:

  • remove observations
  • numerical variables: impute median (or mean)
  • impute a value using a k-nearest neighbours (KNN) technique
  • categorical variables: impute “NA” value, in other words add a category “NA”

For a more comprehensive treatment of this topic see Van Buuren (2018).
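Three of the options listed above can be sketched in base R. The sketch assumes the built-in airquality data set as an example of numeric data with missing values, and a made-up factor named colour for the categorical case; KNN imputation is not shown.

```r
# airquality ships with base R and has missing values in Ozone and Solar.R
df <- airquality

# Count the missing values per variable
colSums(is.na(df))

# Option 1: remove every observation that has at least one missing value
df_complete <- na.omit(df)

# Option 2: impute the median of the observed values (numerical variable)
df_imputed <- df
ozone_median <- median(df$Ozone, na.rm = TRUE)
df_imputed$Ozone[is.na(df_imputed$Ozone)] <- ozone_median

# Option 3: treat missing as its own category (categorical variable)
colour <- factor(c("red", NA, "blue", "red", NA))  # made-up example values
colour <- addNA(colour)                            # adds NA as a level
levels(colour)[is.na(levels(colour))] <- "NA"      # give it an explicit label
table(colour)
```

Median imputation keeps all observations but shrinks the variance of the imputed variable, so it is best suited to variables with only a small share of missing values.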

1.1.1 Variables with no or very little variation

Sometimes variables in a data set have no variation at all, in other words they are constant. These variables carry no information and have to be removed before building a model.
Variables with very little variation can also cause trouble when building a model. This is especially the case when the data are split into a training and a test set, because it is possible that all observations in the training set or in the test set have the same value for this variable.
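A minimal base R sketch for finding such variables, assuming a made-up data frame and an arbitrary cut-off of 80% for "very little variation". The caret package offers nearZeroVar() for the same purpose.

```r
# Toy data: 'const' never varies, 'rare' is almost constant (made-up values)
df <- data.frame(
  x     = c(2.1, 3.4, 1.8, 5.0, 4.2, 3.3, 2.7, 4.9, 3.8, 2.2),
  const = rep(1, 10),
  rare  = factor(c(rep("a", 9), "b"))
)

# Number of distinct values per variable
n_unique <- sapply(df, function(col) length(unique(col)))

# Share of observations taken by the most frequent value
top_share <- sapply(df, function(col) max(table(col)) / length(col))

# Constant variables: drop before modelling
constant_vars <- names(df)[n_unique == 1]

# Nearly constant variables: flag for inspection (0.8 is an arbitrary cut-off)
near_constant_vars <- names(df)[n_unique > 1 & top_share > 0.8]

df_clean <- df[, !(names(df) %in% constant_vars), drop = FALSE]
```
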

1.2 Exploratory analysis (1): graphs

Some commonly used graphs:

  • numerical variables: histograms; density plots
  • categorical variables: bar plots
  • relations between two variables
    • two numericals: scatter plot
    • one numerical, one categorical: box plots
    • two categoricals: side-by-side bar plots
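The graphs listed above can all be drawn with base R graphics. The sketch below assumes the built-in mtcars data set, with cyl and am converted to factors so that they act as categorical variables.

```r
# mtcars ships with base R; treat cyl and am as categorical variables
df <- mtcars
df$cyl <- factor(df$cyl)
df$am  <- factor(df$am, labels = c("automatic", "manual"))

# One numerical variable: histogram and density plot
hist(df$mpg, main = "Histogram of mpg", xlab = "mpg")
plot(density(df$mpg), main = "Density of mpg")

# One categorical variable: bar plot of the counts
barplot(table(df$cyl), main = "Cars per number of cylinders")

# Two numericals: scatter plot
plot(df$wt, df$mpg, xlab = "wt", ylab = "mpg")

# One numerical, one categorical: box plots per category
boxplot(mpg ~ cyl, data = df)

# Two categoricals: side-by-side bar plot
barplot(table(df$am, df$cyl), beside = TRUE, legend.text = TRUE)
```
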

1.3 Exploratory analysis (2): Summary statistics

  • use summary() function in R
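As a short illustration, assuming the built-in iris data set: summary() reports quartiles and the mean for numeric variables and counts for factors, and aggregate() can add the same kind of statistic per group.

```r
# iris ships with base R: four numeric variables and one factor
df <- iris

# Five-number summary plus mean for numerics, counts for factors
summary(df)

# Mean of one numeric variable per category
group_means <- aggregate(Sepal.Length ~ Species, data = df, FUN = mean)
group_means
```
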

1.4 Outlier analysis

  • detect outliers
  • analyse these outliers
  • decide whether to include them in the further analysis
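One common way to carry out the detection step is the 1.5 × IQR rule that box plots also use. The sketch below assumes a made-up numeric vector; the rule is a convention, not the only possible definition of an outlier.

```r
# Made-up measurements; the last value looks suspicious
x <- c(12, 14, 15, 15, 16, 17, 18, 19, 21, 58)

# First and third quartile and the interquartile range
q   <- unname(quantile(x, probs = c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Fences at 1.5 * IQR below Q1 and above Q3
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

# Flag and list the values outside the fences
is_outlier <- x < lower | x > upper
x[is_outlier]
```

Flagged values still have to be analysed: a data entry error can be corrected or removed, whereas a genuine extreme observation may be exactly what the model needs to see.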

1.5 Normalization and standardization

For some modelling methods, numeric features must be normalized or standardized before training the model. This is especially the case when the units in which the features are measured affect the training process, e.g. in models where the distance between observations plays an important role, such as KNN models.

Normalization transforms the X-values into the corresponding Z-scores: \(z = \frac{x-mean(X)}{sd(X)}\).

Standardization maps the X-values onto the interval [0, 1]: \(x_{standardized} = \frac{x-min(X)}{max(X)-min(X)}\). Note that many other texts use these two terms the other way round.
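Both formulas translate directly into one-line R functions. The function names normalize() and standardize() are chosen here to follow the terminology of this section and are not part of any package.

```r
# Z-scores: subtract the mean, divide by the standard deviation
normalize <- function(x) (x - mean(x)) / sd(x)

# Min-max scaling: subtract the minimum, divide by the range
standardize <- function(x) (x - min(x)) / (max(x) - min(x))

x <- c(2, 4, 6, 8, 10)  # made-up values

z <- normalize(x)       # mean 0, standard deviation 1
s <- standardize(x)     # smallest value 0, largest value 1

z
s
```
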

Before building a model, in most cases the data are split into a training and a test set, see next chapter. The model is generated using the data in the training set and the performance of the model is evaluated based on the data in the test set.

The data in the test set are held back to evaluate how the model performs on data not seen before. Normalization and standardization must therefore be performed after splitting the data into training and test data, so that no information from the test set is used.
The training data are normalized or standardized using the formulas above.
Before a model can be applied to test data, these data should be normalized (standardized) as well. Because test data are not used to train a model, the mean and standard deviation (minimum and maximum) of the training data are used to normalize (standardize) the test data.
In R, one option for normalizing (standardizing) the features is to write a function and apply it to the numeric features. A good alternative is to use the preProcess() function in the caret package. See code below.

library(caret)
#example normalizing data, using ISLR::Default data set
df <- ISLR::Default

#split in training and test set; 70% in training set
#use caret::createDataPartition()
set.seed(20210416)
train <- createDataPartition(df$default, p=0.7, list=FALSE)

df_train <- df[train,]
df_test <- df[-train,]

#normalize the data in the train set
#for standardizing use method="range" in the preProcess() function
#estimate mean and sd on the train set only, so the test set stays unseen
preproc <- preProcess(df_train, method=c("center", "scale"))
#outcome is a preProcess model
#apply this model on the train data using predict() function

df_train_preproc <- predict(preproc, newdata=df_train)

#normalize the data in the test set, based on mean and sd in train data
#by applying the preproc preprocessing model on the test set
df_test_preproc <- predict(preproc, newdata=df_test)
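The other option mentioned above, writing the function by hand, could look like the following sketch. It assumes simulated data and two hypothetical helper functions, fit_normalizer() and apply_normalizer(); the point is that the mean and standard deviation are estimated once on the training data and then reused for the test data.

```r
# Simulated numeric feature, split 70/30 (made-up data for illustration)
set.seed(1)
x   <- rnorm(100, mean = 50, sd = 10)
idx <- sample(seq_along(x), size = 70)
x_train <- x[idx]
x_test  <- x[-idx]

# Hypothetical helpers: estimate the parameters, then apply them
fit_normalizer   <- function(x) list(mean = mean(x), sd = sd(x))
apply_normalizer <- function(x, params) (x - params$mean) / params$sd

# Estimate mean and sd on the training data only
params <- fit_normalizer(x_train)

# Apply the same parameters to both sets
x_train_norm <- apply_normalizer(x_train, params)
x_test_norm  <- apply_normalizer(x_test, params)

# Training data have mean 0 and sd 1; test data only approximately
c(mean(x_train_norm), sd(x_train_norm))
c(mean(x_test_norm), sd(x_test_norm))
```
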

1.6 Unbalanced data set

In some cases, the data set is unbalanced as far as the target variable is concerned. For instance, in the ISLR::Default data set only about 3.3% of the observations fall in the category default==“Yes”.

Table 1
Unbalanced Data in ISLR::Default Data Set

df <- ISLR::Default
table(df$default)

  No  Yes 
9667  333 

There are different ways to deal with this. Upsampling and downsampling are the easiest ones.
As with normalization (standardization), upsampling and downsampling are performed on the training data, not on the test data.
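Downsampling reduces every class to the size of the smallest class by drawing a random sample without replacement. A minimal base R sketch, assuming a made-up training set with a 90/10 class split; the caret package provides downSample() for the same task.

```r
# Made-up unbalanced training data: 90 "No" and 10 "Yes"
set.seed(42)
df_train <- data.frame(
  x = rnorm(100),
  y = factor(c(rep("No", 90), rep("Yes", 10)))
)

# Size of the smallest class
n_min <- min(table(df_train$y))

# Row numbers per class
rows_by_class <- split(seq_len(nrow(df_train)), df_train$y)

# Draw n_min rows from each class, without replacement
keep <- unlist(lapply(rows_by_class, function(rows) {
  rows[sample.int(length(rows), n_min)]
}), use.names = FALSE)

df_train_down <- df_train[keep, ]
table(df_train_down$y)
```

Downsampling discards observations from the larger class, so it is mainly attractive when the training set is large.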

1.6.1 Up Sampling

Up sampling is done by adding samples to the training data by bootstrap sampling (sampling with replacement) from the minority class, until the number of samples in the two classes is equal.
The caret package has a function upSample() to perform up sampling.
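To make the idea concrete, the following base R sketch does the up sampling by hand on a made-up training set with a 90/10 class split. It is one possible way to implement the description above, not a copy of what upSample() does internally.

```r
# Made-up unbalanced training data: 90 "No" and 10 "Yes"
set.seed(42)
df_train <- data.frame(
  x = rnorm(100),
  y = factor(c(rep("No", 90), rep("Yes", 10)))
)

# Size of the largest class
n_max <- max(table(df_train$y))

# Row numbers per class
rows_by_class <- split(seq_len(nrow(df_train)), df_train$y)

# For each class, draw the missing number of rows with replacement
extra <- unlist(lapply(rows_by_class, function(rows) {
  n_extra <- n_max - length(rows)
  rows[sample.int(length(rows), n_extra, replace = TRUE)]
}), use.names = FALSE)

# Keep all original rows and append the resampled ones
df_train_up <- df_train[c(seq_len(nrow(df_train)), extra), ]
table(df_train_up$y)
```
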

Table 2

df <- ISLR::Default

#create training and test set
set.seed(20210419)
train <- createDataPartition(df$default, p=0.7, list=FALSE)

df_train <- df[train,]
df_test <- df[-train,]

#up sampling df_train
set.seed(20210419)
df_train_up <- upSample(df_train, df_train$default)

#standardize df_train_up
preproc <- preProcess(df_train_up, method=c("range"))
df_train_up_prep <- predict(preproc, df_train_up)

#standardize df_test using the min and max from df_train_up
df_test_prep <- predict(preproc, df_test)  

Table 3
Data df_train

summary(df_train)
##  default    student       balance           income     
##  No :6767   No :4905   Min.   :   0.0   Min.   :  772  
##  Yes: 234   Yes:2096   1st Qu.: 481.8   1st Qu.:21166  
##                        Median : 823.6   Median :34278  
##                        Mean   : 837.6   Mean   :33353  
##                        3rd Qu.:1168.9   3rd Qu.:43715  
##                        Max.   :2654.3   Max.   :73554

Note. The training set is unbalanced; the proportion of default=“Yes” is the same as in the whole data set, due to the use of the caret::createDataPartition() function.


Table 4
Data df_train_up

summary(df_train_up)
##  default    student       balance           income      Class     
##  No :6767   No :8975   Min.   :   0.0   Min.   :  772   No :6767  
##  Yes:6767   Yes:4559   1st Qu.: 802.4   1st Qu.:20058   Yes:6767  
##                        Median :1365.3   Median :32761             
##                        Mean   :1282.9   Mean   :32534             
##                        3rd Qu.:1789.1   3rd Qu.:43243             
##                        Max.   :2654.3   Max.   :73554

Note. After up sampling the data set is balanced; the summary statistics for the predictors changed, especially for the ‘balance’ variable, which suggests that this predictor can distinguish between the two classes; an extra variable Class is added by the caret::upSample() function. Because Class duplicates default here, one of the two must be dropped before training a model.

Table 5
Data df_train_up_prep

summary(df_train_up_prep)
##  default    student       balance           income       Class     
##  No :6767   No :8975   Min.   :0.0000   Min.   :0.0000   No :6767  
##  Yes:6767   Yes:4559   1st Qu.:0.3023   1st Qu.:0.2650   Yes:6767  
##                        Median :0.5144   Median :0.4395             
##                        Mean   :0.4833   Mean   :0.4364             
##                        3rd Qu.:0.6740   3rd Qu.:0.5835             
##                        Max.   :1.0000   Max.   :1.0000

Note. The summary statistics for the numeric predictors changed because the values of these predictors are standardised.

Table 6
Data df_test_prep

summary(df_test_prep)
##  default    student       balance           income       
##  No :2900   No :2151   Min.   :0.0000   Min.   :0.03036  
##  Yes:  99   Yes: 848   1st Qu.:0.1815   1st Qu.:0.28835  
##                        Median :0.3103   Median :0.47483  
##                        Mean   :0.3128   Mean   :0.45515  
##                        3rd Qu.:0.4362   3rd Qu.:0.59307  
##                        Max.   :0.9100   Max.   :0.93207

Note. The test set is not balanced, because the model should be evaluated on real data, not on artificially added data; standardisation makes use of the maximum and minimum values in the training data set, which is why the minimum and maximum in this case can differ from 0 and 1 respectively.