Summary

The purpose of this book is to provide hands-on, practical experience in doing a data science project. Clearly, we cannot cover every available method, model, and algorithm for a data science project. The most important thing is to understand the process of doing such a project. The first step, as indicated by the 6-step process in section 1.3, is “understand the problem”.

We chose the Titanic problem to demonstrate the whole data analysis process. However, real-world problems are far more complicated than this well-defined one. Most business organizations either do not know the exact problem (which is part of the reason they want data analysis or business analysis in the first place), or they know the problem in general terms but cannot express it explicitly.

I once met a situation in which a business organization had created a data center and collected all of its business operational data. The boss asked for the data to be analyzed to find out:

  1. Are there any problems?
  2. If yes, how can these problems be overcome?
  3. If not, how can the business operations be improved?

You see, the problem here is how to define the problem: how to convert a business problem into a data science problem.

For example, the first question in the above list requires knowing what the normal or expected performance is. How should performance be evaluated? In terms of turnover or profit? Over what time scale? There may be a shortfall in profit at the moment, but it causes no alarm because it results from a recent investment in developing a new market. In the long run, that investment will yield a great ROI (Return on Investment). The second question demands identifying the cause of the problem, and the third demands identifying the KPIs (Key Performance Indicators). Both amount to identifying relationships between predictor and dependent variables, but the two may involve completely different sets of variables.

Understanding the problem is actually more complicated in the real world. Only when you have completely understood it and turned it into a list of analytical problems can you move to the next step.

With the Titanic problem, combining the story and the requirements on the Kaggle website, I would consider these:

  • On April 14 and 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. The overall survival rate is 32%.
  • One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew.
  • Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
  • The story tells us that when people were boarding the lifeboats, a policy of “women and children first” and “the ship's crew last” was applied.
  • Sometimes a family boarded a lifeboat together, and sometimes family members ended up swimming together too.

These thoughts form a set of working assumptions. They will guide the more detailed data exploration later.
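
The figures and assumptions above can be made concrete with a short sketch. The Python code below is only an illustration, not the method used elsewhere in this book: it first re-derives the overall survival rate from the two numbers quoted in the story, then shows one possible way to check an assumption such as "women were more likely to survive" by computing a survival rate per group. The helper names (`overall_survival_rate`, `survival_rate_by`), the field names (`sex`, `pclass`, `survived`), and the five toy rows are all hypothetical; the rows are made up to show the shape of the call and are not real passenger records.

```python
from collections import defaultdict

# Figures quoted in the story above.
TOTAL_ON_BOARD = 2224
TOTAL_DEATHS = 1502


def overall_survival_rate(total, deaths):
    """Share of the people on board who survived."""
    return (total - deaths) / total


def survival_rate_by(records, key):
    """Survival rate for each distinct value of `key` in a list of dict records."""
    counts = defaultdict(lambda: [0, 0])  # group -> [survivors, total]
    for row in records:
        counts[row[key]][0] += row["survived"]
        counts[row[key]][1] += 1
    return {group: s / n for group, (s, n) in counts.items()}


# Hypothetical toy rows, NOT real passenger data.
toy = [
    {"sex": "female", "pclass": 1, "survived": 1},
    {"sex": "female", "pclass": 3, "survived": 1},
    {"sex": "male", "pclass": 1, "survived": 0},
    {"sex": "male", "pclass": 3, "survived": 0},
    {"sex": "male", "pclass": 3, "survived": 1},
]

# 722 survivors out of 2224, which rounds to the 32% quoted above.
print(f"{overall_survival_rate(TOTAL_ON_BOARD, TOTAL_DEATHS):.1%}")
print(survival_rate_by(toy, "sex"))
print(survival_rate_by(toy, "pclass"))
```

Each assumption in the list maps to one grouping key, so the same helper could be reused for sex, passenger class, or an age band once the real data has been loaded.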