12.4 Further Analysis
The previous section reported on the constructed model (e.g. RF_model2), covering how it came about and what its limitations were:
The model RF_model2 is not the best one and it is seriously overfitting. Its performance on the test dataset should be improved.
If any further work is planned, it should start by considering re-engineering Title and Sex, since they are the most important predictors in model RF_model2.
This section demonstrates how to improve a constructed model’s performance. We still use RF_model2 as an example. A good place to start is where it gets things wrong. Spotting where things went wrong is difficult from numbers alone; a graph is usually a better tool. However, model RF_model2 has 500 decision trees, and it is difficult to visualise 500 trees.
Recall that we have a decision tree model, model3. It has the same predictors as RF_model2. We can use this decision tree (see Figure 12.4) to find the place where things may go wrong.
From Figure 12.4, we can see that the single place where we got things wrong is the left branch of the first test condition, where the adult male passengers (“Title = Mr”) include 81 passengers who survived but are wrongly predicted as perished. This is also confirmed by our model: the error rate for predicting survival is higher than the error rate for predicting perishing. So re-engineering the attribute “Title” is a good place to start. This also coincides with the suggestion from the previous section, which was based on the importance order of the predictors used in RF_model2.
Now we will demonstrate how to further re-engineer the Title attribute. The values of Title in the train dataset are as follows:
##
## Master Miss Mr Mrs Other
## 61 260 757 197 34
We can see that 34 records in the train dataset have the value Other in the Title attribute. This is a good place for further refinement.
Let us go back to the raw dataset and extract the title from the name attribute.
##
## Capt Col Don Dona Dr Jonkheer
## 1 4 1 1 8 1
## Lady Major Master Miss Mlle Mme
## 1 2 61 260 2 1
## Mr Mrs Ms Rev Sir the Countess
## 757 197 2 8 1 1
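The extraction step can be sketched as follows. This is a minimal illustration, not necessarily the code used to produce the table above: it assumes the raw names follow the layout "Surname, Title. Given names", and the names in the vector are hypothetical.

```r
# Hypothetical names in the assumed "Surname, Title. Given names" layout.
name <- c("Smith, Mr. John",
          "Doe, Mrs. Jane (Mary Roe)",
          "Roe, the Countess. of (Ann Poe)",
          "Poe, Master. Tom")

# Step 1: drop the surname, i.e. everything up to and including ", ".
title <- sub("^[^,]*, *", "", name)
# Step 2: drop everything from the first "." to the end of the string.
title <- sub("\\..*$", "", title)

# Count how often each extracted title occurs.
print(table(title))
```

Two anchored substitutions are used rather than one larger pattern so that multi-word titles such as "the Countess" survive intact.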
It becomes obvious that the Title values that have been categorised as Other are oversimplified. We can extract more information, such as gender and age, from them, and this information is useful for prediction. It is also inappropriate to keep them as separate categories: some of them have only a few instances, and using them could lead to overfitting of the model.
Further binning them into more appropriate categories is required. We can do so with knowledge of nobility, locality (country of origin) and time (the beginning of the 20th century). For example, “Dona” and “the Countess” are female nobility titles equivalent to “Lady”; “Ms” and “Mlle” are essentially the same as “Miss”; “Mme” is the French abbreviation of “Madame”, so it can be categorised as “Mrs”; “Jonkheer” is an honorific for nobility in the Netherlands and “Don” is a Spanish honorific for a gentleman, so they can be categorised as “Sir”; “Col”, “Capt” and “Major” are military ranks and can be replaced with a more general title, “Officer”. With all of these, we can reduce the number of title categories.
##
## Dr Lady Master Miss Mr Mrs Officer Rev Sir
## 8 3 61 264 757 198 7 8 3
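One way to express this merging is a lookup vector from old title to new title. The sketch below rebuilds a title vector from the counts printed in the raw-title table, so it runs on its own; the names `raw_counts` and `merge_map` are illustrative and not taken from the original code.

```r
# Counts copied from the table of raw titles shown earlier in this section.
raw_counts <- c(Capt = 1, Col = 4, Don = 1, Dona = 1, Dr = 8, Jonkheer = 1,
                Lady = 1, Major = 2, Master = 61, Miss = 260, Mlle = 2,
                Mme = 1, Mr = 757, Mrs = 197, Ms = 2, Rev = 8, Sir = 1,
                "the Countess" = 1)
title <- rep(names(raw_counts), times = raw_counts)

# Lookup vector: old title -> new title. Titles not listed stay unchanged.
merge_map <- c(Dona = "Lady", "the Countess" = "Lady",
               Ms = "Miss", Mlle = "Miss",
               Mme = "Mrs",
               Jonkheer = "Sir", Don = "Sir",
               Col = "Officer", Capt = "Officer", Major = "Officer")

# Replace only the titles that have an entry in the lookup vector.
hit <- title %in% names(merge_map)
title[hit] <- merge_map[title[hit]]

print(table(title))
```

With these inputs the resulting counts agree with the reduced table above (for example Lady 3, Miss 264, Mrs 198, Officer 7, Sir 3). Keeping the mapping in a single named vector makes it easy to review or adjust the grouping decisions in one place.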
We can convert Title into a factor to plot its relation with the value of Survived.
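As a rough sketch of what such a plot summarises, the survival ratio per title can be computed with `table()` and `prop.table()`. The data frame `toy` below contains made-up records purely for illustration; it is not the Titanic data.

```r
# Toy records (hypothetical values, not taken from the Titanic data).
toy <- data.frame(
  Title    = c("Mr", "Mr", "Mr", "Mr", "Mrs", "Mrs",
               "Miss", "Miss", "Master", "Master"),
  Survived = c(0, 0, 0, 1, 1, 1, 1, 0, 1, 0)
)
toy$Title <- as.factor(toy$Title)

# Cross-tabulate title against outcome, then turn each row into proportions.
counts <- table(toy$Title, toy$Survived)
ratio  <- prop.table(counts, margin = 1)
print(round(ratio, 2))

# barplot(t(ratio), legend.text = TRUE) would draw the same ratios as stacked bars.
```

Normalising by row (`margin = 1`) makes titles with very different counts directly comparable, which is what matters when deciding which small categories can be merged.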
We could stop here, since we have replaced the Title value Other with more precise categories in terms of semantic meaning. However, we notice that some values still have very small counts. We should merge small categories like “Lady” and “Sir” into larger categories while keeping the survival ratio as close as possible. We can categorise “Lady” into “Mrs”, and “Sir” and “Rev” into “Mr”. For neutral titles like “Dr” and “Officer”, we can categorise them into “Mr” or “Mrs” according to sex.
##
## Dr Lady Master Miss Mr Mrs Officer Rev Sir
## 0 0 61 264 782 202 0 0 0
We can check the title against gender to see if any mistakes were made.
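The regrouping by sex and the subsequent check can be sketched as below. The data frame `toy` and its rows are hypothetical; only the grouping rules (Lady to Mrs, Sir and Rev to Mr, Dr and Officer by sex) come from the text.

```r
# Toy records (hypothetical); Title holds the already-merged titles.
toy <- data.frame(
  Title = c("Dr", "Dr", "Officer", "Lady", "Sir",
            "Rev", "Miss", "Master", "Mr", "Mrs"),
  Sex   = c("male", "female", "male", "female", "male",
            "male", "female", "male", "male", "female"),
  stringsAsFactors = FALSE
)

new_title <- toy$Title
new_title[toy$Title == "Lady"] <- "Mrs"
new_title[toy$Title %in% c("Sir", "Rev")] <- "Mr"

# Neutral titles are assigned according to the passenger's sex.
neutral <- toy$Title %in% c("Dr", "Officer")
new_title[neutral] <- ifelse(toy$Sex[neutral] == "male", "Mr", "Mrs")
toy$New_Title <- new_title

# Cross-check: "Mr" and "Master" should appear only in the male column,
# "Mrs" and "Miss" only in the female column.
print(table(toy$New_Title, toy$Sex))
```

Any non-zero count in an unexpected cell of the cross-table would point to a record whose title and sex disagree and that needs a closer look.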
After re-categorising the titles with small counts, we have only 4 categories of title. From the plot, we can see that their survival ratio matches the survival ratio of the attribute Sex.
We could use this re-engineered title attribute, “New_Title”, to re-build RF models. The overall accuracy of the new models should increase. The following is an example showing that. The new model has indeed increased the overall prediction accuracy by 0.45%. It is not a lot, but it proves the point that feature re-engineering is one place to improve a model’s performance.
##
## Call:
## randomForest(formula = as.factor(Survived) ~ Sex + Fare_pp + Pclass + New_Title + Age_group + Group_size + Ticket_class + Embarked, data = RE_data[1:891, ], importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 16.05%
## Confusion matrix:
## 0 1 class.error
## 0 504 45 0.08196721
## 1 98 244 0.28654971
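The error rates in this printout can be re-derived from the confusion matrix by hand, which is a useful sanity check when comparing models. The sketch below copies the four counts from the output above; the variable names are illustrative.

```r
# Confusion matrix copied from the printed model above
# (rows: actual class, columns: predicted class).
cm <- matrix(c(504,  45,
                98, 244),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

# Overall OOB error: share of all cases that lie off the diagonal.
oob_error <- 1 - sum(diag(cm)) / sum(cm)

# Per-class error: share of each actual class that was misclassified.
class_error <- 1 - diag(cm) / rowSums(cm)

print(round(100 * oob_error, 2))
print(round(class_error, 4))
```

The overall figure is (45 + 98) / 891, which rounds to the 16.05% reported. The per-class values show that most of the remaining error comes from survivors (class 1) being predicted as perished.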
We can further do the same with many other attributes or combinations of multiple attributes.