3.3 The Titanic Problem

Kaggle states the objective of the Titanic problem as follows:

"The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e. name, age, gender, socio-economic class, etc.)."

The Challenge

The competition is simple: we want you to use the Titanic passenger data (name, age, price of ticket, etc.) to try to predict who will survive and who will die.

The requirement is to predict whether each passenger survived. As in many real data science problems, prediction means building a model that takes input data and produces an output. A prediction model is a mapping that takes input drawn from historical facts about past events and produces an output that is used to make predictions about future or otherwise unknown events. A simple way to understand models is to think of them as taking one of the following three forms:

The relationship between input and output can be expressed by a mathematical formula. This is generally called a definable model. The formula can be as simple as a polynomial function or as complicated as a regression model or another statistical model.
Some models cannot be expressed explicitly as a mathematical formula; instead they are expressed as rules. These are rule-based models.
Other models can be expressed neither as a mathematical formula nor as rules. The solution is to build a neural network to do the prediction. A neural network can be regarded as a “black box” that takes input and produces output; its internal connections are hidden from users. Much of modern machine learning focuses on models rooted in neural networks.

Any model fundamentally expresses relationships between inputs and outputs. So, as part of understanding the problem, we can interpret the Kaggle Titanic challenge as finding credible relationships between the input data and the output data (whether a passenger survived or not). Once a relationship is found, we can express it as a mathematical formula, a set of rules, or a neural network model.
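To make the distinction concrete, here is a minimal sketch of the first two forms, written in Python. It is only an illustration: the dictionary keys, the coefficients, and the rules are all made up for this example and are not fitted to the Titanic data. A neural network would present the same interface (a passenger in, a prediction out), but its internals would be learned weights rather than anything readable.

```python
import math

# A passenger is represented by a plain dict; the keys are illustrative.
passenger = {"sex": "female", "pclass": 1, "age": 29.0}

def formula_model(p):
    """Formula-based model: a logistic function of the inputs.

    The coefficients are hypothetical, chosen by hand for illustration.
    """
    is_female = 1.0 if p["sex"] == "female" else 0.0
    score = -1.0 + 2.5 * is_female - 0.5 * (p["pclass"] - 1) - 0.01 * p["age"]
    probability = 1.0 / (1.0 + math.exp(-score))
    return 1 if probability >= 0.5 else 0

def rule_model(p):
    """Rule-based model: explicit if/then rules (also hypothetical)."""
    if p["sex"] == "female":
        return 1
    if p["age"] < 10 and p["pclass"] < 3:
        return 1
    return 0

# Both models map the same input to a 0/1 survival prediction.
print(formula_model(passenger), rule_model(passenger))
```

Both functions have the same shape, which is the point: whatever form the relationship takes internally, the model is used the same way.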

The Data

Kaggle competitions usually provide the competition data. Every competition site has a “Data” tab. Click the Data tab at the top of the competition page and you will find the raw data provided and, most of the time, a brief explanation of the data attributes¹ too.

There are three files in the Titanic Challenge:

train.csv,
test.csv, and
gender_submission.csv.

The training set is used to build your models. It provides the outcome (also known as the “ground truth”) for each passenger. Your model will be based on attributes such as passengers’ gender and class. You can also use feature engineering to create new features.

The test set should be used to see how well your model performs on unseen data. No ground truth is provided for the passengers in the test set; it is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.
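The difference between the two files can be checked in a few lines of Python. The sketch below uses two tiny in-memory excerpts in place of the real files so that it runs anywhere; the rows are made up and the column set is reduced, so treat them as stand-ins rather than actual passenger records. The helper names read_rows and columns_only_in_first are hypothetical.

```python
import csv
import io

# Toy excerpts standing in for train.csv and test.csv
# (made-up rows, reduced columns; not real passenger data).
TRAIN_SAMPLE = """PassengerId,Survived,Pclass,Sex,Age
1,0,3,male,30
2,1,1,female,25
"""
TEST_SAMPLE = """PassengerId,Pclass,Sex,Age
101,3,male,40
102,2,female,19
"""

def read_rows(handle):
    """Read a CSV file object into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle))

def columns_only_in_first(first_rows, second_rows):
    """Return the column names present in the first table but not the second."""
    return sorted(set(first_rows[0]) - set(second_rows[0]))

train_rows = read_rows(io.StringIO(TRAIN_SAMPLE))
test_rows = read_rows(io.StringIO(TEST_SAMPLE))

# The ground-truth column is the only one missing from the test set.
print(columns_only_in_first(train_rows, test_rows))
```

With the downloaded files in the working directory, the same helper can be applied to an opened file instead, for example `with open("train.csv", newline="") as f: train_rows = read_rows(f)`.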

The data set also includes gender_submission.csv, a set of predictions that assumes all and only female passengers survived, as an example of what a submission file should look like.
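The rule behind gender_submission.csv is simple enough to write down directly. The following Python sketch shows one way to express it; the function name gender_baseline is hypothetical, the sample rows are made up, and the code assumes the gender column is named “Sex”.

```python
def gender_baseline(rows):
    """Predict 1 (survived) for female passengers and 0 for everyone else."""
    return [
        {
            "PassengerId": row["PassengerId"],
            "Survived": 1 if row["Sex"] == "female" else 0,
        }
        for row in rows
    ]

# Made-up rows shaped like test.csv records (only the columns used here).
sample_rows = [
    {"PassengerId": "101", "Sex": "male"},
    {"PassengerId": "102", "Sex": "female"},
]

predictions = gender_baseline(sample_rows)
print(predictions)
```

This is a rule-based model with a single rule. It makes a useful baseline: any model you build later should at least do better than this one.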

The Submission

The submission in the Titanic competition plays the same role as the final report in any data science project. What it requires is one of the questions you need to understand at the beginning of the project.

The Titanic competition requires the results to be submitted in a file. The file structure is demonstrated in “gender_submission.csv”, which is provided as an example of how you should structure your results, that is, your predictions.

The example submission in “gender_submission.csv” predicts that all female passengers survived and all male passengers died. It is clearly biased. Your hypotheses regarding survival will probably be different, which will lead to a different submission file. It is probably a good idea to rename the “gender_submission.csv” file to “my_submission.csv” now, so that you know you have to submit “my_submission.csv” as the final report of your project. The submission indicates the completion of your project.

Do it yourself:

Download the data files from the Kaggle website (https://www.kaggle.com/c/titanic/data).
Unzip them into your working directory.
Rename the “gender_submission.csv” file to “my_submission.csv”.
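If you prefer to script the last two steps, they might look like the Python sketch below. The function name prepare_workdir is hypothetical, and the archive name and directory in the usage note are assumptions; use whatever name your download actually has. The file names inside the archive follow the list of files given earlier.

```python
import os
import zipfile

def prepare_workdir(zip_path, workdir):
    """Extract the archive into workdir and rename the example submission.

    Returns the sorted list of file names found in workdir afterwards.
    """
    os.makedirs(workdir, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(workdir)
    source = os.path.join(workdir, "gender_submission.csv")
    target = os.path.join(workdir, "my_submission.csv")
    # os.replace renames the file and overwrites an existing target.
    os.replace(source, target)
    return sorted(os.listdir(workdir))
```

A call such as `prepare_workdir("titanic.zip", "titanic")` would then leave train.csv, test.csv, and my_submission.csv in the “titanic” directory, assuming the downloaded archive is called titanic.zip.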

Make sure your submission has:

A “PassengerId” column containing the ID of each passenger from test.csv.
A “Survived” column (that you will create!) with a “1” for the rows where you predict that the passenger survived and a “0” where you predict that the passenger died.
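The two requirements can be checked before you upload anything. The Python sketch below writes a submission and then validates it; the function names write_submission and check_submission are hypothetical, the sample predictions are made up, and the checks encode only the two requirements listed above rather than Kaggle's full validation.

```python
import csv
import io

def write_submission(predictions, handle):
    """Write predictions as a two-column CSV: PassengerId, Survived."""
    writer = csv.DictWriter(handle, fieldnames=["PassengerId", "Survived"])
    writer.writeheader()
    writer.writerows(predictions)

def check_submission(handle, expected_ids):
    """Return a list of problems found; an empty list means none were found."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != ["PassengerId", "Survived"]:
        return ["unexpected header: %r" % (reader.fieldnames,)]
    rows = list(reader)
    problems = []
    # Every passenger from test.csv must appear exactly once.
    if sorted(r["PassengerId"] for r in rows) != sorted(expected_ids):
        problems.append("PassengerId values do not match test.csv")
    if any(r["Survived"] not in ("0", "1") for r in rows):
        problems.append("Survived must be 0 or 1")
    return problems

# Made-up predictions for two passengers, written to an in-memory file.
predictions = [
    {"PassengerId": "101", "Survived": 0},
    {"PassengerId": "102", "Survived": 1},
]
buffer = io.StringIO()
write_submission(predictions, buffer)
buffer.seek(0)
print(check_submission(buffer, ["101", "102"]))
```

To write a real file instead of an in-memory buffer, pass a handle opened with `open("my_submission.csv", "w", newline="")`.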

¹ We have used data science terminology here. Data represent objects in the natural world, and an object’s properties are represented by attributes. That is, a data record has a number of attributes, representing a natural object with a number of properties. Records are also called observations or samples in statistics; properties are also called variables, parameters, or dimensions.