Chapter 1 Exploring and Preprocessing
Before a Machine Learning model can be generated, an exploratory data analysis (EDA) must be performed. Based on this EDA, data is cleaned and prepared for further analysis.
Data cleaning and preparation deals with, among other issues, missing values, outliers and data formatting.
After importing the data into R, first check whether the data type of each variable is correct.
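The type check described above can be sketched as follows. This is a minimal illustration on the built-in mtcars data, chosen here as an assumption; the variable cyl is stored as a number but is treated as a category.

```r
# Built-in example data (an assumption for illustration)
df <- mtcars

# Inspect the type of every variable
str(df)
var_types <- sapply(df, class)
var_types

# Convert a variable that was imported with an unsuitable type
df$cyl <- factor(df$cyl)
class(df$cyl)
```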
1.1 Missing values
Missing values are quite common in real-world data sets. How to deal with them depends on things like (1) the number of observations with missing values and (2) whether there is a pattern in the missing values.
Different options to deal with missing values:
- remove observations
- numerical variables: impute median (or mean)
- impute a value based on the k-nearest neighbours (KNN) technique
- categorical variables: impute “NA” value, in other words add a category “NA”
For a more comprehensive treatment of this topic see Van Buuren (2018).
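Three of the options listed above can be sketched in base R. The example assumes the built-in airquality data, which has missing values in Ozone and Solar.R; the factor colour is a hypothetical toy variable. KNN imputation is not shown.

```r
df <- airquality  # built-in data, used here as an assumption

# Number of missing values per variable
colSums(is.na(df))

# Option 1: remove observations with at least one missing value
df_complete <- df[complete.cases(df), ]

# Option 2: impute the median for a numerical variable
df_imputed <- df
df_imputed$Ozone[is.na(df_imputed$Ozone)] <- median(df$Ozone, na.rm = TRUE)

# Option 3: add a category "NA" for a categorical variable
# (hypothetical toy factor, not part of airquality)
colour <- factor(c("red", NA, "blue", "red", NA))
colour_chr <- as.character(colour)
colour_chr[is.na(colour_chr)] <- "NA"
colour_na <- factor(colour_chr)
table(colour_na)
```

Whether removing or imputing is appropriate depends on the number and the pattern of the missing values, as noted above.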
1.1.1 Variables with no or very little variation
Sometimes variables in a data set have no variation at all, in other words are constant. These variables have to be removed before generating a data model.
Variables with very little variation can also cause trouble when generating a model. This is especially the case when the data are split into a training and a test set, because it is possible that all observations in the training set or in the test set have the same value for this variable.
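A minimal base R sketch for finding such variables is given below. The data frame is a hypothetical toy example, and the share of the most frequent value is just one possible indicator of low variation. The caret package also offers the nearZeroVar() function for this purpose.

```r
# Hypothetical toy data: 'const' never varies, 'rare' differs in 1 of 100 rows
df <- data.frame(
  x     = seq_len(100),
  const = rep(1, 100),
  rare  = c(rep("a", 99), "b")
)

# Number of distinct values per variable
n_distinct <- sapply(df, function(v) length(unique(v)))

# Drop variables that are constant
df_reduced <- df[, n_distinct > 1, drop = FALSE]

# Share of the most frequent value; a share close to 1 flags little variation
top_share <- sapply(df, function(v) max(table(v)) / length(v))
top_share
```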
1.2 Exploratory analysis (1): graphs
Some commonly used graphs:
- numerical variables: histograms; density plots
- categorical variables: bar plots
- relations between two variables:
  - two numericals: scatter plot
  - one numerical, one categorical: box plots
  - two categoricals: side-by-side bar plots
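The graphs listed above can be produced with base R graphics, for example as in the sketch below. It assumes the built-in mtcars data, with cyl and am converted to categorical variables for the purpose of illustration.

```r
df <- mtcars
df$cyl <- factor(df$cyl)  # number of cylinders as a categorical variable
df$am  <- factor(df$am, labels = c("automatic", "manual"))

# One numerical variable: histogram and density plot
hist(df$mpg, main = "Histogram of mpg", xlab = "mpg")
plot(density(df$mpg), main = "Density of mpg")

# One categorical variable: bar plot
barplot(table(df$cyl), main = "Cars per number of cylinders")

# Two numerical variables: scatter plot
plot(df$wt, df$mpg, xlab = "wt", ylab = "mpg")

# One numerical, one categorical variable: box plots
boxplot(mpg ~ cyl, data = df)

# Two categorical variables: side-by-side bar plot
counts <- table(df$am, df$cyl)
barplot(counts, beside = TRUE, legend.text = TRUE)
```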
1.3 Exploratory analysis (2): Summary statistics
- use the summary() function in R
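A short sketch of summary() on the built-in iris data (an assumption for illustration), for a whole data frame, for one variable, and per category:

```r
# Summary statistics for every variable in a data frame
summary(iris)

# Summary statistics for one numerical variable
s <- summary(iris$Sepal.Length)
s

# Summary statistics of a numerical variable per category
tapply(iris$Sepal.Length, iris$Species, summary)
```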
1.4 Outlier analysis
- detect outliers
- analyse these outliers
- decide whether to include them in the further analysis
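One common way to detect outliers in a numerical variable is the 1.5 x IQR rule that is also used in box plots. The sketch below applies this rule, which is an assumption here and only one of several possible rules, to the Wind variable in the built-in airquality data.

```r
x <- airquality$Wind

# Quartiles and interquartile range
q1  <- unname(quantile(x, 0.25))
q3  <- unname(quantile(x, 0.75))
iqr <- q3 - q1

# Observations beyond 1.5 * IQR from the quartiles are flagged
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr
is_outlier <- x < lower | x > upper

# Inspect the flagged observations before deciding what to do with them
airquality[is_outlier, ]
```

Flagging an observation does not mean it must be removed; the analysis and the decision in the steps above remain a judgement call.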
1.5 Normalization and standardization
For some modelling methods, numeric features must be normalized or standardized before training the model. This is especially the case when the units used for the features have an impact on the model training process, e.g. in models in which the distance between observations plays an important role, such as KNN models.
Normalization transforms the X-values into the corresponding Z-scores: \(z = \frac{x-\mathrm{mean}(X)}{\mathrm{sd}(X)}\).
Standardization maps the X-values onto the interval [0, 1]: \(x_{standardized} = \frac{x-\min(X)}{\max(X)-\min(X)}\). Note that many other texts use these two terms the other way around.
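The two formulas can be written as small R functions. The function names normalize() and standardize() are hypothetical and follow the terminology of this chapter; the vector x is a toy example.

```r
# Z-scores: subtract the mean, divide by the standard deviation
normalize <- function(x) (x - mean(x)) / sd(x)

# Range scaling: map the minimum to 0 and the maximum to 1
standardize <- function(x) (x - min(x)) / (max(x) - min(x))

x <- c(2, 4, 6, 8, 10)

z <- normalize(x)
mean(z)  # zero up to rounding error
sd(z)    # one

s <- standardize(x)
s        # 0.00 0.25 0.50 0.75 1.00
```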
Before building a model, in most cases the data are split into a training and a test set, see next chapter. The model is generated using the data in the training set and the performance of the model is evaluated based on the data in the test set.
The data in the test set are held back to evaluate how the model performs on data not seen before. Normalization and standardization must therefore be performed after splitting the data into training and test data.
The training data are normalized or standardized using the formulas above.
Before a model can be applied to test data, these data should be normalized (standardized) as well. Because test data are not used to train a model, the mean and standard deviation (minimum and maximum) of the training data are used to normalize (standardize) the test data.
In R, one option for normalizing (standardizing) the features is to write a function and apply it to the numeric features. A good alternative is to use the preProcess() function in the caret package. See code below.
library(caret)
#example normalizing data, using ISLR::Default data set
df <- ISLR::Default
#split in training and test set; 70% in training set
#use caret::createDataPartition()
set.seed(20210416)
train <- createDataPartition(df$default, p=.7, list=FALSE)
df_train <- df[train,]
df_test <- df[-train,]
#normalize the data in the train set
#for standardizing use method="range" in the preProcess() function
preproc <- preProcess(df_train, method=c("center", "scale"))
#outcome is a preProcess model
#apply this model on the train data using predict() function
df_train_preproc <- predict(preproc, newdata=df_train)
#normalize the data in the test set, based on mean and sd in train data
#by applying the preproc preprocessing model on the test set
df_test_preproc <- predict(preproc, newdata=df_test)
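The other option mentioned above, doing the normalization by hand, can be sketched in base R as follows. The example assumes the four numeric columns of the built-in iris data and a simple random 70/30 split with sample(); it is an illustration, not a replacement for the caret code.

```r
set.seed(1)
df <- iris[, 1:4]  # numeric features only

# Simple random split: 70% training, 30% test
train_idx <- sample(nrow(df), size = round(0.7 * nrow(df)))
df_train <- df[train_idx, ]
df_test  <- df[-train_idx, ]

# Mean and standard deviation are computed on the training data only
train_mean <- sapply(df_train, mean)
train_sd   <- sapply(df_train, sd)

# Apply the training parameters to both the training and the test data
df_train_norm <- as.data.frame(scale(df_train, center = train_mean, scale = train_sd))
df_test_norm  <- as.data.frame(scale(df_test,  center = train_mean, scale = train_sd))

# Training means are 0; test means are usually close to, but not exactly, 0
colMeans(df_train_norm)
colMeans(df_test_norm)
```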
1.6 Unbalanced data set
In some cases, the data set is unbalanced as far as the target variable is concerned. For instance, in the ISLR::Default data set only about 3% of the observations fall in the category default==“Yes”.
Table 1
Unbalanced Data in ISLR::Default Data Set
No Yes
9667 333
There are different solutions to deal with this. Upsampling and downsampling are the easiest ones.
As with normalization (standardization), upsampling and downsampling are performed on the training data, not on the test data.
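Down sampling removes observations from the major class until both classes have the same size. A minimal base R sketch on hypothetical toy training data is shown below; the caret package provides the downSample() function for the same purpose.

```r
set.seed(1)

# Hypothetical unbalanced training data: 95 "No" and 5 "Yes"
df_train <- data.frame(
  x = rnorm(100),
  y = factor(c(rep("No", 95), rep("Yes", 5)))
)

# Size of the minor class
n_minor <- min(table(df_train$y))

# Draw n_minor row numbers from each class, without replacement
rows_by_class <- split(seq_len(nrow(df_train)), df_train$y)
keep <- unlist(
  lapply(rows_by_class, function(idx) idx[sample.int(length(idx), n_minor)]),
  use.names = FALSE
)

df_train_down <- df_train[keep, ]
table(df_train_down$y)
```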
1.6.1 Up Sampling
Up sampling is done by adding samples to the training data by bootstrap sampling from the minor class, until the numbers of samples in the two classes are equal.
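The idea can be sketched in base R on hypothetical toy training data. This is an illustration of the principle under stated assumptions, not necessarily how any particular package implements it.

```r
set.seed(1)

# Hypothetical unbalanced training data: 95 "No" and 5 "Yes"
df_train <- data.frame(
  x = rnorm(100),
  y = factor(c(rep("No", 95), rep("Yes", 5)))
)

class_counts <- table(df_train$y)
n_major      <- max(class_counts)
minor_class  <- names(which.min(class_counts))
minor_rows   <- which(df_train$y == minor_class)

# Bootstrap: draw extra minor class rows with replacement
extra <- sample(minor_rows, size = n_major - length(minor_rows), replace = TRUE)

# Original rows plus the extra minor class rows
df_train_up <- df_train[c(seq_len(nrow(df_train)), extra), ]
table(df_train_up$y)
```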
The caret package has a function upSample() to perform up sampling.
Table 2
df <- ISLR::Default
#create training and test set
set.seed(20210419)
train <- createDataPartition(df$default, p=0.7, list=FALSE)
df_train <- df[train,]
df_test <- df[-train,]
#up sampling df_train
set.seed(20210419)
df_train_up <- upSample(df_train, df_train$default)
#standardize df_train_up
preproc <- preProcess(df_train_up, method=c("range"))
df_train_up_prep <- predict(preproc, df_train_up)
#standardize df_test using the minimum and maximum from df_train_up
df_test_prep <- predict(preproc, df_test)
Table 3
Data df_train
##  default    student       balance           income
##  No :6767   No :4905   Min.   :   0.0   Min.   :  772
##  Yes: 234   Yes:2096   1st Qu.: 481.8   1st Qu.:21166
##                        Median : 823.6   Median :34278
##                        Mean   : 837.6   Mean   :33353
##                        3rd Qu.:1168.9   3rd Qu.:43715
##                        Max.   :2654.3   Max.   :73554
Note. The training set is unbalanced; the proportion of default=“Yes” is the same as in the whole data set, due to the use of the caret::createDataPartition() function.
Table 4
Data df_train_up
##  default    student       balance           income       Class
##  No :6767   No :8975   Min.   :   0.0   Min.   :  772   No :6767
##  Yes:6767   Yes:4559   1st Qu.: 802.4   1st Qu.:20058   Yes:6767
##                        Median :1365.3   Median :32761
##                        Mean   :1282.9   Mean   :32534
##                        3rd Qu.:1789.1   3rd Qu.:43243
##                        Max.   :2654.3   Max.   :73554
Note. After up sampling the data set is balanced; the summary statistics for the predictors changed, especially for the ‘balance’ variable, which suggests that this predictor can distinguish between the two classes; the caret::upSample() function adds an extra variable Class.
Table 5
Data df_train_up_prep
##  default    student       balance           income        Class
##  No :6767   No :8975   Min.   :0.0000   Min.   :0.0000   No :6767
##  Yes:6767   Yes:4559   1st Qu.:0.3023   1st Qu.:0.2650   Yes:6767
##                        Median :0.5144   Median :0.4395
##                        Mean   :0.4833   Mean   :0.4364
##                        3rd Qu.:0.6740   3rd Qu.:0.5835
##                        Max.   :1.0000   Max.   :1.0000
Note. The summary statistics for the predictors changed, because the values of these predictors are standardised.
Table 6
Data df_test_prep
##  default    student       balance           income
##  No :2900   No :2151   Min.   :0.0000   Min.   :0.03036
##  Yes:  99   Yes: 848   1st Qu.:0.1815   1st Qu.:0.28835
##                        Median :0.3103   Median :0.47483
##                        Mean   :0.3128   Mean   :0.45515
##                        3rd Qu.:0.4362   3rd Qu.:0.59307
##                        Max.   :0.9100   Max.   :0.93207
Note. The test set is not balanced, because the model should be evaluated on real data, not on artificially added data; standardisation makes use of the minimum and maximum values in the training data set, which is why the minimum and maximum in this case can differ from 0 and 1 respectively.