12.2 Result Explanation

In this section, we report on the effort we put into solving the “Titanic prediction problem” through a process of data understanding, data preprocessing, predictor selection, model construction, model cross validation and model fine-tuning. The report summarises the work done at each step of the data analysis process.

  • At the “Data understanding” step, we went through the individual attributes of both the train and test datasets and examined their quality and quantity. We identified the attributes that have missing values or other problems. These findings set the goals for the data preprocessing step.

  • In particular, we examined the values of the response variable Survived and its distribution. We saw from the train dataset that about 2/3 of the passengers perished. We also examined the relations between the individual attributes and the response variable. We found that some attributes, such as Name and Ticket, have no direct connection with or impact on survival: there is no evidence that anyone perished because of their name or ticket number. However, we found that the name and the ticket number contain information that can have an impact on survival, such as Title and Deck (the deck number, which reflects the location on the ship). This information needs to be extracted by a technique called “Feature Re-engineering”.

  • At the “Data preprocess” step, we filled the missing values using different techniques. For attributes with a small proportion of missing values, we filled them with the average value, with random values drawn from artificially generated samples that have the same statistical characteristics as the attribute, or with values predicted by a machine learning model. For attributes with a large proportion of missing values, we created a new attribute that reflects the presence (or absence) of a value. We also re-engineered (created) many new attributes so that they reflect the relations between the attributes and the response variable in a more meaningful and more accurate way.

  • At the “Predictors selection” step, we carried out correlation analyses between the individual attributes and the response variable, and among the attributes themselves. PCA was used to identify the most influential attributes, even though the method is mostly used for dimension reduction.

  • “Model building” is a key task in any data science project. Titanic prediction is a binary classification problem. It can be addressed with many models, including regression models, which are not ideal for a binary classification problem. We tried two of the most commonly used models, “Decision tree” and “Random Forest”, and saw that each model has a different prediction performance. During model construction, the goal was to pursue a higher estimated prediction accuracy, since we do not have access to the model’s prediction accuracy at the production stage. This is problematic because a model can have a high estimated accuracy during construction but a much lower prediction accuracy in real use or in production. At the model construction stage it is difficult to tell whether a model is overfitting or underfitting.

  • “Cross Validation” (CV) is a commonly used technique to detect and reduce a model’s overfitting problem. CV uses only a portion of the training data to construct the model and uses the remaining portion to test it. This works because the held-out training data still has values of the response variable¹, so the predictions can be compared with the true values and the prediction accuracy can be calculated. At the CV stage, we validated not only the models we had constructed but also several other types of models, to compare their performance.

  • Most models are not in their best state when first built. A model’s performance can be improved by “Fine Tune”, that is, by adjusting the model’s parameters and the way the training dataset is used. We fine-tuned a random forest model. Through different trials on the train dataset provided, we found a suitable number of predictors and the actual combination of predictors to use. We also investigated what proportion of the training data to use.
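The missing-value handling summarised in the “Data preprocess” step can be sketched as follows. This is a minimal illustration in plain Python and not the code used in this project: the toy records, the use of Age and Cabin as example columns, the flag name HasCabin and both helper functions are assumptions made for the example. It shows only two of the techniques mentioned, filling with the average and adding a presence indicator.

```python
from statistics import mean

# Toy records standing in for a few passenger rows (illustrative values only).
passengers = [
    {"Age": 22.0, "Cabin": None},
    {"Age": None, "Cabin": "C85"},
    {"Age": 26.0, "Cabin": None},
    {"Age": 36.0, "Cabin": "C123"},
]

def fill_with_mean(rows, column):
    """Replace missing values in `column` with the mean of the observed values."""
    observed = [r[column] for r in rows if r[column] is not None]
    average = mean(observed)
    return [{**r, column: average if r[column] is None else r[column]} for r in rows]

def add_presence_flag(rows, column, flag_name):
    """Add a 0/1 attribute recording whether `column` has a value."""
    return [{**r, flag_name: int(r[column] is not None)} for r in rows]

# Few values missing: fill with the average. Many missing: keep only a flag.
cleaned = add_presence_flag(fill_with_mean(passengers, "Age"), "Cabin", "HasCabin")
```

The average is appropriate only when few values are missing; when most values are absent, as assumed here for Cabin, the presence flag keeps the one piece of information that is reliably available.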

Having gone through the entire data analysis process, we produced our prediction model, which is a random forest. The best version is RF_model2.
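For readers who want to see the mechanics of the cross validation step summarised above, the following is a minimal sketch in plain Python. It is not the code used in this project: the function names, the toy labels and the choice of five folds are assumptions for illustration, and a trivial majority-class predictor stands in for the decision tree and random forest models.

```python
import random

def k_fold_indices(n_rows, k, seed=0):
    """Shuffle row indices and split them into k nearly equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def majority_class(labels):
    """A stand-in 'model': always predict the most common training label."""
    return max(set(labels), key=labels.count)

def cross_validate(labels, k=5, seed=0):
    """Return the held-out accuracy of the stand-in model on each fold."""
    folds = k_fold_indices(len(labels), k, seed)
    scores = []
    for held_out in folds:
        held = set(held_out)
        # Build the model on everything except the held-out fold.
        train_labels = [labels[i] for i in range(len(labels)) if i not in held]
        prediction = majority_class(train_labels)
        # The held-out rows still have labels, so accuracy can be computed.
        correct = sum(labels[i] == prediction for i in held_out)
        scores.append(correct / len(held_out))
    return scores

# Toy labels: 0 = perished, 1 = survived (illustrative, not the real data).
labels = [0] * 13 + [1] * 7
scores = cross_validate(labels, k=5)
mean_accuracy = sum(scores) / len(scores)
```

Each row is held out exactly once, so the mean of the fold scores estimates how the model would perform on data it has not seen. A large gap between training accuracy and this estimate is the usual sign of overfitting.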

  1. The response variable is called the label in machine learning.