12.4 Further Analysis

The previous section reported on the constructed model (e.g. RF_model2), covering how it came about and what its limitations were:

  1. The model RF_model2 is not the best one and it is seriously overfitting. Its performance on the test dataset should be improved.

  2. If any further work is planned, it should start by considering re-engineering Title and Sex, since they are the most important predictors in model RF_model2.

This section demonstrates how to improve a constructed model’s performance. We still use RF_model2 as an example. A good place to start is where it gets things wrong! Spotting where things went wrong is difficult from numbers alone. A good technique is to use a graph. However, model RF_model2 has 500 decision trees, and it is difficult to visualise 500 trees.

Recall that we have a decision tree model, model3, which has the same predictors as RF_model2. We can use this decision tree (see Figure 12.4) to find the place where things may go wrong.

Figure 12.4: The simple decision tree of RF_model2

From Figure 12.4, we can see that the main place where we got things wrong is the left branch of the first test condition, where 81 of the adult male passengers (“Title = Mr”) are misclassified. This agrees with the random forest model, where the error rate for predicting survival is higher than the error rate for predicting death. So re-engineering the attribute “Title” is a good place to start. This also coincides with the suggestion from the previous section, which was based on the importance order of the predictors used in RF_model2.

We now demonstrate how to re-engineer the Title attribute further. The values of Title in the dataset are distributed as follows:

## 
## Master   Miss     Mr    Mrs  Other 
##     61    260    757    197     34

We can see that 34 records in the dataset have the value Other in the Title attribute. This is a good place for further refinement.

Let us go back to the raw dataset and extract the title from the name attribute.

## 
##         Capt          Col          Don         Dona           Dr     Jonkheer 
##            1            4            1            1            8            1 
##         Lady        Major       Master         Miss         Mlle          Mme 
##            1            2           61          260            2            1 
##           Mr          Mrs           Ms          Rev          Sir the Countess 
##          757          197            2            8            1            1
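The extraction step can be sketched in base R as follows. This is a minimal sketch, not necessarily the code used to produce the table above: it assumes each name follows the layout "Surname, Title. Given names", and both the sample names and the helper `extract_title` are illustrative.

```r
# A few illustrative names, assumed to follow "Surname, Title. Given names"
passenger_names <- c(
  "Braund, Mr. Owen Harris",
  "Heikkinen, Miss. Laina",
  "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
  "Uruchurtu, Don. Manuel E"
)

# Hypothetical helper: keep the text between the first comma
# and the first full stop that follows it
extract_title <- function(name) {
  sub("^[^,]*, *([^.]+)\\..*$", "\\1", name)
}

titles <- extract_title(passenger_names)
print(table(titles))
```

Applied to the full name column, `table()` on the result would give one count per raw title, which is the kind of summary shown above.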

It becomes obvious that categorising all of these Title values as Other is an oversimplification. We can extract more information from them, such as gender and age, and that information is useful for prediction. It is also inappropriate to keep them as separate categories: some of them have only a small number of instances, and using them could lead to overfitting.

These titles need to be binned (bucketed) into more appropriate categories. We can do so using knowledge of nobility, locality (country of origin), and the period (the beginning of the 20th century). For example, “Dona” and “the Countess” are female noble titles equivalent to “Lady”; “Ms” and “Mlle” are essentially the same as “Miss”; “Mme” is the French abbreviation of “Madame”, so it can be categorised as “Mrs”; “Jonkheer” is a Dutch honorific of nobility and “Don” is a Spanish honorific for men, so both can be categorised as “Sir”; “Col”, “Capt”, and “Major” are military ranks and can be replaced with the more general title “Officer”. With these changes, we reduce the number of title categories.

## 
##      Dr    Lady  Master    Miss      Mr     Mrs Officer     Rev     Sir 
##       8       3      61     264     757     198       7       8       3
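The regrouping described above can be sketched with a named lookup vector. This is a sketch under stated assumptions: the title counts are copied from the raw title table earlier in this section so that the result can be checked against the regrouped table, while the variable names (`raw_counts`, `regroup`, `new_title`) are illustrative and not taken from the original code.

```r
# Raw title counts, copied from the table of extracted titles
raw_counts <- c(Capt = 1, Col = 4, Don = 1, Dona = 1, Dr = 8, Jonkheer = 1,
                Lady = 1, Major = 2, Master = 61, Miss = 260, Mlle = 2,
                Mme = 1, Mr = 757, Mrs = 197, Ms = 2, Rev = 8, Sir = 1,
                "the Countess" = 1)

# Expand the counts into one title per passenger
titles <- rep(names(raw_counts), times = raw_counts)

# Lookup: raw title -> broader category (titles not listed stay unchanged)
regroup <- c(Dona = "Lady", "the Countess" = "Lady",
             Ms = "Miss", Mlle = "Miss", Mme = "Mrs",
             Jonkheer = "Sir", Don = "Sir",
             Col = "Officer", Capt = "Officer", Major = "Officer")

new_title <- unname(ifelse(titles %in% names(regroup), regroup[titles], titles))
title_counts <- table(new_title)
print(title_counts)
```

The printed counts should agree with the regrouped table above (for example, 264 for Miss and 7 for Officer).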

We can convert Title into a factor and plot its relationship with the value of Survived.

Figure 12.5: Survival Rates for new.Title
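A survival rate per title, of the kind plotted in the figure, can be computed along the following lines. The data frame here is invented purely for illustration (it is not the Titanic data), and the column names `Survived` and `new_title` are assumptions.

```r
# Invented toy data, for illustration only
toy <- data.frame(
  Survived  = c(0, 0, 1, 1, 1, 0, 1, 0),
  new_title = c("Mr", "Mr", "Mrs", "Miss", "Miss", "Mr", "Master", "Master")
)

# Convert the title to a factor, then average Survived within each level
toy$new_title <- factor(toy$new_title)
survival_rate <- tapply(toy$Survived, toy$new_title, mean)
print(survival_rate)
```

Passing `survival_rate` to `barplot()` would draw a simple bar chart of these rates.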

We could stop here, since we have replaced the Title value “Other” with more precise categories in terms of semantic meaning. However, some values still have very few instances. We should merge the small categories, such as “Lady” and “Sir”, into larger categories while keeping the survival ratio as close as possible. We can categorise “Lady” as “Mrs”, and “Sir” and “Rev” as “Mr”. Gender-neutral titles such as “Dr” and “Officer” can be categorised as “Mr” or “Mrs” according to sex.

## 
##      Dr    Lady  Master    Miss      Mr     Mrs Officer     Rev     Sir 
##       0       0      61     264     782     202       0       0       0
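The merging rules above, including the sex-dependent handling of the neutral titles, can be sketched as a small function. The toy passengers, the column names, and the helper `merge_small_titles` are all hypothetical. The final cross-tabulation is one simple way to check titles against sex.

```r
# Hypothetical toy passengers; column names are assumptions
toy <- data.frame(
  new_title = c("Lady", "Sir", "Rev", "Dr", "Dr", "Officer", "Miss", "Master"),
  Sex       = c("female", "male", "male", "male", "female", "male", "female", "male"),
  stringsAsFactors = FALSE
)

# Hypothetical helper implementing the merging rules described in the text
merge_small_titles <- function(title, sex) {
  out <- title
  out[title == "Lady"] <- "Mrs"
  out[title %in% c("Sir", "Rev")] <- "Mr"
  # Neutral titles follow the passenger's sex
  neutral <- title %in% c("Dr", "Officer")
  out[neutral] <- ifelse(sex[neutral] == "female", "Mrs", "Mr")
  out
}

toy$merged_title <- merge_small_titles(toy$new_title, toy$Sex)

# Cross-tabulate title against sex to spot any mismatches
print(table(toy$merged_title, toy$Sex))
```

If the merge is correct, “Mr” should appear only in the male column and “Mrs” and “Miss” only in the female column.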

We can check the titles against gender to see whether any mistakes have been made.

Figure 12.6: Survival Rates for re-categorised new.Title

After re-categorising the titles with small counts, we have only four title categories. From the plot, we can see that their survival ratios match the survival ratios of the attribute Sex.

We can use this re-engineered title attribute, “New_Title”, to rebuild the RF models. The overall accuracy of the new models should increase. The following output is an example showing that. The new model has indeed increased the overall prediction accuracy, by 0.45%. It is not a lot, but it proves the point that feature re-engineering is a place where a model’s performance can be improved.

## 
## Call:
##  randomForest(formula = as.factor(Survived) ~ Sex + Fare_pp +      Pclass + New_Title + Age_group + Group_size + Ticket_class +      Embarked, data = RE_data[1:891, ], importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 16.05%
## Confusion matrix:
##     0   1 class.error
## 0 504  45  0.08196721
## 1  98 244  0.28654971
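The error figures in this output can be reproduced by hand from the confusion matrix. The sketch below uses only the four counts printed above, with rows taken as the actual class and columns as the predicted class. The variable names are illustrative.

```r
# Confusion matrix from the output above: rows = actual, columns = predicted
conf <- matrix(c(504,  45,
                  98, 244),
               nrow = 2, byrow = TRUE,
               dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

# Per-class error: share of each actual class that was misclassified
class_error <- 1 - diag(conf) / rowSums(conf)

# Overall OOB error: share of all passengers that was misclassified
oob_error <- 1 - sum(diag(conf)) / sum(conf)

print(round(class_error, 4))
print(round(100 * oob_error, 2))
```

The overall error is (45 + 98) / 891, about 16.05%. The error for class 1 (survived) is 98 / 342, about 28.7%, far higher than the 45 / 549 (about 8.2%) for class 0.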

We can apply the same process to many other attributes, or to combinations of multiple attributes.