Do A Data Science Project in 10 Days

1.6 Data Science Related Terms

A slew of terms are closely related to data science; in this section we hope to add some clarity around them.

Data Analyst and Data Scientist

Analytics has risen quickly in popular business lingo over the past several years; the term is used loosely, but generally meant to describe critical thinking that is quantitative in nature. Technically, analytics is the “science of analysis” — put another way, the practice of analyzing information to make decisions.

Is “analytics” the same thing as data science? It depends on the context. Sometimes it is synonymous with the definition of data science that we have described, and sometimes it represents something else. A data scientist using raw data to build a predictive algorithm falls into the scope of analytics. At the same time, a non-technical business user interpreting pre-built dashboard reports (e.g. Google Analytics) is also in the realm of analytics, but does not cross into the skill set needed for data science. Analytics has come to have a fairly broad meaning. At the end of the day, as long as you understand it beyond the buzzword level, the exact semantics don’t matter much.

“Analyst” is somewhat of an ambiguous job title that can represent many different types of roles (data analyst, marketing analyst, operations analyst, financial analyst, etc). What does this mean in comparison to data scientist?

Data Scientist: A specialty role requiring abilities in math, technology, and business acumen. Data scientists work at the raw database level to derive insights and build data products.

Analyst: This title can mean a lot of things. The common thread is that analysts look at data to try to gain insights. Analysts may interact with data at either the database level or the summarized report level.

Thus, “analyst” and “data scientist” are not exactly synonymous, but they are also not mutually exclusive; in our interpretation, the two job titles map to overlapping skills and scopes of responsibility.

Machine Learning and Data Science

Machine learning is a term closely associated with data science. It refers to a broad class of methods that revolve around data modeling to (1) algorithmically make predictions, and (2) algorithmically decipher patterns in data.

Machine learning for making predictions — The core concept is to use tagged data to train predictive models. Tagged data means observations where the ground truth is already known. Training a model means automatically characterizing the tagged data in ways that predict the tags of unknown data points. For example, a credit card fraud detection model can be trained on a historical record of purchases tagged as fraudulent or legitimate; the resulting model estimates the likelihood that any new purchase is fraudulent. Common methods for training models range from basic regressions to complex neural networks, and all follow the same paradigm, known as supervised learning.
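The supervised-learning workflow just described can be sketched in a few lines of base R, the language used throughout this book. The purchase data and variable names below are invented purely for illustration, not taken from any real fraud data set:

```r
# Toy tagged data: each purchase has an amount, an hour of day,
# and a known fraud label (the "tag", i.e. the ground truth).
purchases <- data.frame(
  amount = c(12, 900, 200, 1500, 30, 80, 25, 1100),
  hour   = c(14, 3,   3,   2,   16, 4,  11, 1),
  fraud  = c(0,  1,   0,   1,   0,  1,  0,  1)
)

# Train a predictive model on the tagged data (supervised learning);
# logistic regression is one of the basic methods mentioned above.
model <- glm(fraud ~ amount + hour, data = purchases, family = binomial)

# Estimate the likelihood that new, untagged purchases are fraudulent.
new_purchases <- data.frame(amount = c(20, 1300), hour = c(13, 3))
probs <- predict(model, newdata = new_purchases, type = "response")
probs  # estimated fraud probabilities, each between 0 and 1
```

The second purchase (a large amount in the early morning) resembles the tagged fraud cases, so the model scores it higher than the first.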

Machine learning for pattern discovery — Another modeling paradigm, known as unsupervised learning, tries to surface underlying patterns and associations in data when no ground truth is known (i.e. no observations are tagged). Within this broad category of methods, the most commonly used are clustering techniques, which algorithmically detect the natural groupings that exist in a data set. For example, clustering can be used to programmatically learn the natural customer segments in a company’s user base. Other unsupervised methods for mining underlying characteristics include principal component analysis, hidden Markov models, topic models, and more.
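As a minimal sketch of unsupervised pattern discovery in base R, consider an invented two-segment customer data set: k-means receives no labels at all, yet detects the natural groupings on its own.

```r
# Toy untagged data: two spending-behaviour measurements per customer.
# Two distinct segments are simulated, but no labels are kept.
set.seed(7)
customers <- data.frame(
  annual_spend = c(rnorm(20, mean = 100, sd = 10),   # casual buyers
                   rnorm(20, mean = 500, sd = 30)),  # heavy buyers
  visits       = c(rnorm(20, mean = 4,  sd = 1),
                   rnorm(20, mean = 20, sd = 3))
)

# Unsupervised learning: no ground-truth labels are supplied.
# k-means detects k natural groupings purely from the data itself.
segments <- kmeans(scale(customers), centers = 2, nstart = 10)

table(segments$cluster)  # sizes of the discovered customer segments
```

Scaling the columns first keeps the distance computation from being dominated by the attribute with the larger numeric range.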

Not all machine learning methods fit neatly into the above two categories. For example, collaborative filtering is a type of recommendations algorithm with elements related to both supervised and unsupervised learning. Contextual bandits are a twist on supervised learning where predictions get adaptively modified on-the-fly using live feedback.

This wide-ranging breadth of machine learning techniques comprises an important part of the data science toolbox. It is up to the data scientist to figure out which tool to use in different circumstances (as well as how to use the tool correctly) in order to solve analytically open-ended problems.

Data Mining and Data Science

Raw data can be unstructured and messy, with information coming from disparate data sources, mismatched or missing records, and a slew of other tricky issues. Data munging describes both the wrangling that brings data together into cohesive views and the janitorial work of cleaning data so that it is polished and ready for downstream use. It requires good pattern-recognition sense and clever hacking skills to merge and transform masses of database-level information. If this is not done properly, dirty data can obfuscate the ‘truth’ hidden in the data set and completely mislead results. Thus, any data scientist must be skillful and nimble at data munging in order to have accurate, usable data before applying more sophisticated analytical tactics.
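A small illustration of these munging tasks in base R, with toy tables and column names invented for the example: merging mismatched sources into one cohesive view, normalising inconsistent values, and imputing missing records.

```r
# Two disparate sources describing the same customers.
orders   <- data.frame(id = c(1, 2, 3, 4),
                       amount = c(25, NA, 40, 10))
profiles <- data.frame(id = c(1, 2, 4, 5),
                       region = c("north", "South ", "north", "east"),
                       stringsAsFactors = FALSE)

# Merge into one cohesive view; keep every order even when its
# profile record is missing (the sources are mismatched).
combined <- merge(orders, profiles, by = "id", all.x = TRUE)

# Janitorial work: normalise inconsistent text values...
combined$region <- tolower(trimws(combined$region))

# ...and impute missing amounts with the median of the known ones.
combined$amount[is.na(combined$amount)] <-
  median(combined$amount, na.rm = TRUE)

combined  # polished view, ready for downstream analysis
```

Customer 3 has no profile record, so its region stays missing after the merge; deciding whether to drop, flag, or fill such rows is itself a munging judgment call.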