11.2 Tuning Training Data Samples

Once we have selected predictors from the available attributes, the second factor that affects a model’s performance is the data samples themselves. We should try to make the most of the data samples available for training a model. In practice, we often have only a small data sample and must use other techniques to enlarge it. However, over-using the data sample has a drawback: noise and outliers may be fitted into the model, decreasing its prediction accuracy when used in production. In this section, we use CV to demonstrate the process of setting the right proportion of the data sample.

We will use CV to illustrate the technique, that is, to set the proper proportion (ratio) between the splits of the train dataset and the reuse of the samples in the model’s fitting process.

We will continue to use the RF model as our example. We know that the RF model has an overfitting issue; one possible cause is that we have used an improper portion of the training dataset. Following the predictor selection in the previous section, we know that the best RF model is FT_rf.8, so we will use it to demonstrate how to find the best CV settings.

Set Prediction Accuracy Benchmark

The benchmark is the model’s prediction accuracy on the test dataset (unseen data).

# Let's start with a submission of FT_rf.8 to Kaggle to find
# the difference between the model's OOB estimate and the test accuracy

# Subset our test records and features
test.submit.df <- test[, c("Sex", "Title", "Fare_pp", "Ticket_class", "Pclass", "Ticket", "Age", "Friend_size", "Deck")]
test.submit.df$Ticket <- as.numeric(test.submit.df$Ticket)

# Make predictions
FT_rf.8.preds <- predict(FT_rf.8, test.submit.df)
table(FT_rf.8.preds)
## FT_rf.8.preds
##   0   1 
## 260 158
# Write out a CSV file for submission to Kaggle
submit.df <- data.frame(PassengerId = test$PassengerId, Survived = FT_rf.8.preds)

write.csv(submit.df, file = "./data/FT_rf.8.csv", row.names = FALSE)

After submission, we received a score of 0.75598 from Kaggle, but the OOB estimate predicted that we should score 0.8642. There is a big gap between the two. The idea is to reduce this gap by adjusting the CV sampling controls.
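
Before tuning, it is worth seeing where the OOB figure comes from. A minimal sketch, assuming FT_rf.8 was fitted directly with randomForest() (a caret-trained model keeps the forest under $finalModel instead):

# Read the OOB estimate from the fitted forest
oob.err <- FT_rf.8$err.rate[FT_rf.8$ntree, "OOB"] # OOB error after all trees
1 - oob.err        # OOB accuracy, about 0.8642 for our model
0.8642 - 0.75598   # the gap to the Kaggle score that we want to shrink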

10-Fold CV Repeated 10 Times

Let’s look into CV using the caret package to see whether we can get more realistic estimates by adjusting CV’s sampling parameters. Research has shown that 10-fold CV repeated 10 times is a good place to start. One important idea is to ensure that the ratio of the response variable’s values (Survived) in each fold matches the ratio in the overall training set. This is known as stratified CV.

Firstly, we randomly create 10 folds, repeated 10 times, from our train dataset with the caret function createMultiFolds. This effectively enlarges our train data sample 100 times, as can be seen from the length of the list cv.10.folds (100 index sets). We will use these settings to train our RF model rf.8. Before we do that, we can also verify the Survived ratio in the samples created, to confirm it is the same as, or close to, the ratio in the full training set.

library(caret)
library(doSNOW)

set.seed(2348)
# rf.label is the Survived column of the train dataset (891 records).
# See ?createMultiFolds for details.
cv.10.folds <- createMultiFolds(rf.label, k = 10, times = 10)

# Check stratification
table(rf.label)
## rf.label
##   0   1 
## 549 342
342 / 549
## [1] 0.6229508
table(rf.label[cv.10.folds[[34]]])
## 
##   0   1 
## 494 308
308 / 494
## [1] 0.6234818

We can see that we have produced 100 sample sets, each of size \(891 \times 9/10 \approx 802\), and the stratification is kept (both have a similar ratio, around 62.3%). Let us use “repeatedcv” to train our rf.8 model and see the impact of the sampling on the model’s prediction accuracy.
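
Rather than inspecting a single resample, we can verify the stratification across all 100 index sets at once; a minimal sketch using the rf.label and cv.10.folds objects created above:

# Compute the survived:perished ratio inside every one of the 100 index sets
fold.ratios <- sapply(cv.10.folds, function(idx) {
  counts <- table(rf.label[idx])
  counts["1"] / counts["0"]
})
summary(fold.ratios)  # every value should sit close to 342/549 = 0.623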

# Set up caret's trainControl object using 10-folds repeated CV
ctrl.1 <- trainControl(method = "repeatedcv", number = 10, repeats = 10, index = cv.10.folds)

Model construction with “10-fold repeated CV” is computationally very expensive. Fortunately, R has a package called “doSNOW” that facilitates the use of multi-core processors and permits parallel computing in a pseudo-cluster mode.
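
The cluster size should suit your own hardware; the 6 cores used below reflect this machine rather than a rule. A quick way to check what is available (a sketch using base R’s parallel package):

library(parallel)
detectCores()  # logical cores available on this machine
# leave one core free for the operating system:
# cl <- makeCluster(max(1, detectCores() - 1), type = "SOCK")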

## Set up doSNOW package for multi-core training. 
# cl <- makeCluster(6, type = "SOCK")
# registerDoSNOW(cl)
# # Set seed for reproducibility and train
# set.seed(34324)
# 
# FT_rf.8.cv.1 <- train(x = rf.train.8, y = rf.label, method = "rf", tuneLength = 3, ntree = 500, trControl = ctrl.1)
# 
# #Shutdown cluster
# stopCluster(cl)
# save(FT_rf.8.cv.1, file = "./data/FT_rf.8.cv.1.rda")
# Load the saved model and check out the results
load("./data/FT_rf.8.cv.1.rda")
FT_rf.8.cv.1 
## Random Forest 
## 
## 891 samples
##   9 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 802, 802, 802, 802, 802, 802, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.8495937  0.6773587
##   5     0.8511480  0.6818755
##   9     0.8467582  0.6727850
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 5.

The RF model FT_rf.8.cv.1 trained on the new data samples above (100 resamples, each with roughly 802 records) is only slightly more pessimistic than the rf.8 OOB prediction: the accuracy estimate dropped from 0.8642 to 0.8511. It is still not pessimistic enough to match the test accuracy of 0.75598. However, it clearly demonstrates the impact of the data samples on the model’s estimated performance.
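
Note that the 0.8511 figure is an average over 100 resamples, so the spread behind it is also informative; a small sketch reading the per-resample accuracies that caret stores in the fitted object:

# Distribution of accuracy across the 100 resamples of the best model
summary(FT_rf.8.cv.1$resample$Accuracy)
sd(FT_rf.8.cv.1$resample$Accuracy)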

5-Fold CV Repeated 10 Times

Let’s try new data samples with 5-fold CV repeated 10 times.

set.seed(5983)
# cv.5.folds <- createMultiFolds(rf.label, k = 5, times = 10)
# 
# ctrl.2 <- trainControl(method = "repeatedcv", number = 5, repeats = 10, index = cv.5.folds)
# 
# cl <- makeCluster(6, type = "SOCK")
# registerDoSNOW(cl)
# 
# set.seed(89472)
# FT_rf.8.cv.2 <- train(x = rf.train.8, y = rf.label, method = "rf", tuneLength = 3, ntree = 500, trControl = ctrl.2)
# 
# #Shutdown cluster
# stopCluster(cl)
# save(FT_rf.8.cv.2, file = "./data/FT_rf.8.cv.2.rda")
# Load the saved model and check out the results
load("./data/FT_rf.8.cv.2.rda")
FT_rf.8.cv.2 
## Random Forest 
## 
## 891 samples
##   9 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 10 times) 
## Summary of sample sizes: 714, 713, 713, 712, 712, 712, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.8491649  0.6768403
##   5     0.8444515  0.6675058
##   9     0.8415270  0.6618163
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

We can see that 5-fold CV is a little better: the accuracy estimate has moved under 85% (0.8491649). Each resample now uses 4/5 of the train dataset rather than 9/10, which is around 713 records.
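
We can confirm these resample sizes from the fitted object itself; a quick sketch reading the index list that caret kept from ctrl.2:

# Sizes of the 50 training resamples (5 folds x 10 repeats), centred on 713
summary(sapply(FT_rf.8.cv.2$control$index, length))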

3-Fold CV Repeated 10 Times

Let us go further and try 3-fold CV repeated 10 times.

set.seed(37596)
# cv.3.folds <- createMultiFolds(rf.label, k = 3, times = 10)
# 
# ctrl.3 <- trainControl(method = "repeatedcv", number = 3, repeats = 10, index = cv.3.folds)
# 
# cl <- makeCluster(6, type = "SOCK")
# registerDoSNOW(cl)
# 
# set.seed(94622)
# FT_rf.8.cv.3 <- train(x = rf.train.8, y = rf.label, method = "rf", tuneLength = 3, ntree = 500, trControl = ctrl.3)
# 
# #Shutdown cluster
# stopCluster(cl)
# 
# save(FT_rf.8.cv.3, file = "./data/FT_rf.8.cv.3.rda")
# Load the saved model and check out the results
load("./data/FT_rf.8.cv.3.rda")
FT_rf.8.cv.3
## Random Forest 
## 
## 891 samples
##   9 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold, repeated 10 times) 
## Summary of sample sizes: 594, 594, 594, 594, 594, 594, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.8387579  0.6529203
##   5     0.8376356  0.6522826
##   9     0.8357651  0.6503612
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

We can see the accuracy estimate has decreased further (0.8387579). Let us also reduce the number of times the samples are repeatedly used in training (the repeat times), and see whether, with the repeat times reduced to 3, the accuracy estimate can be reduced further.

set.seed(396)
# cv.3.folds <- createMultiFolds(rf.label, k = 3, times = 3)
# 
# ctrl.4 <- trainControl(method = "repeatedcv", number = 3, repeats = 3, index = cv.3.folds)
# 
# cl <- makeCluster(6, type = "SOCK")
# registerDoSNOW(cl)
# 
# set.seed(9622)
# FT_rf.8.cv.4 <- train(x = rf.train.8, y = rf.label, method = "rf", tuneLength = 3, ntree = 50, trControl = ctrl.4)
# 
# #Shutdown cluster
# stopCluster(cl)
#save(FT_rf.8.cv.4, file = "./data/FT_rf.8.cv.4.rda")
# Load the saved model and check out the results
load("./data/FT_rf.8.cv.4.rda")
FT_rf.8.cv.4
## Random Forest 
## 
## 891 samples
##   9 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold, repeated 3 times) 
## Summary of sample sizes: 594, 594, 594, 594, 594, 594, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.8443696  0.6649206
##   5     0.8436214  0.6659881
##   9     0.8368874  0.6531629
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

We can see the impact of the training samples (fold numbers and repeat times) on the RF model’s accuracy estimate. Among our 4 trials, the settings in ctrl.3, which is 3 folds repeated 10 times, produced the estimate closest to the test accuracy. We could continue the trials until we are satisfied, but we will stop here for computational reasons.
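
To make the comparison easier, the estimates can be gathered side by side; a minimal sketch using the four fitted models from above:

# Collect the best CV accuracy of each trial next to the OOB estimate
# and the Kaggle score, to see which setting is the most realistic
data.frame(
  setting  = c("OOB", "10-fold x 10", "5-fold x 10",
               "3-fold x 10", "3-fold x 3", "Kaggle test"),
  accuracy = c(0.8642,
               max(FT_rf.8.cv.1$results$Accuracy),
               max(FT_rf.8.cv.2$results$Accuracy),
               max(FT_rf.8.cv.3$results$Accuracy),
               max(FT_rf.8.cv.4$results$Accuracy),
               0.75598))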

One of the conclusions we may draw from these exercises is that the best sampling should mimic the proportion between the training dataset and the testing dataset. Our Titanic datasets have a proportion of 891:418, which is roughly 2:1, so our sampling should match this proportion. The 3-fold CV partition, using 2 folds to train and 1 fold to validate, matches this ratio. So it is reasonable to believe that the best data sampling for our Titanic problem is 3 folds repeated 10 times.
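
The arithmetic behind this claim is simple; a quick check, assuming the train and test data frames from the earlier chapters are still loaded:

nrow(train) / nrow(test)  # 891 / 418 is roughly 2.13, close to 2:1
# 3-fold CV trains on 2 folds and validates on 1: the same 2:1 proportion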

Some of you may have noticed that we set different ntree values. That is one of the parameters the RF model uses for generating predictions. We are going to discuss these parameters next.