8 Introduction to Data Science at Scale

In the last Part, we discussed statistical tests, including the t-test, chi-square test, KS test, and ANOVA, on a couple of features in our analytic data set. We compared and contrasted the resulting p-values and traced the downstream effect of each feature in a simple logistic regression. This illustrated the importance of Exploratory Data Analysis. However, today’s data scientist has to contend with data sets consisting of hundreds or thousands of features. In this Part we will consider the scalability of our analytics: how do we perform Exploratory Data Analysis on hundreds of features in a data set?

We define three new classes of features arising from the Demographics (Section 12.1), Labs (Section 12.2), and Examination (Section 12.3) domains. Most of this is a review of the use of case_when; however, the Labs do present some challenges, which we examine in Section 12.2.1.

Our primary discussion will center on learning how to functionalize some of our processes in R. We have over 100 columns to analyze, and copying and pasting code for each one is ineffective.

In Chapter 9 we will

  • introduce several concepts, including enquo and variable resolution with !!.
  • showcase the comparedf and tableby functions in arsenal.
  • use purrr and furrr to iterate and speed up functions.
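
As a preview of the enquo and !! pattern named in the first bullet, here is a minimal sketch. The data frame toy and the helper mean_by() are hypothetical names used only for illustration; they are not part of the analytic data set. The example assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical toy data standing in for the analytic data set
toy <- tibble(
  group = c("a", "a", "b", "b"),
  x     = c(1, 3, 10, 20)
)

# Hypothetical helper: mean of one column within the levels of another.
# enquo() captures the bare column names the caller typed;
# !! unquotes them again inside the dplyr verbs.
mean_by <- function(data, group_col, value_col) {
  group_col <- enquo(group_col)
  value_col <- enquo(value_col)

  data %>%
    group_by(!!group_col) %>%
    summarise(mean_value = mean(!!value_col, na.rm = TRUE))
}

# Column names are passed unquoted, just as in ordinary dplyr code
result <- mean_by(toy, group, x)
```

Because the column names are passed unquoted, the same function can be reused for any pair of columns instead of copying and editing the pipeline for each one.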

In Chapter 10, as we continue to analyze the data, we will

  • discuss the _at variants of standard dplyr functions: mutate_at, summarise_at, filter_at, and others.
  • discuss missing data analytics and mean value imputation.
  • showcase a few different packages that look for correlated features and discuss why we want to look for correlated features.
  • review Principal Component Analysis and k-means clustering as means of data reduction.
  • give many examples of how to easily define useful functions to aid in your analysis.
  • showcase DataExplorer, skimr, and ggplot2 as packages that assist with automating EDA tasks.
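
The _at variants and mean value imputation from the list above can be combined in a few lines. This is a minimal sketch on hypothetical data (toy and impute_mean() are illustrative names, not objects from the analytic data set), assuming dplyr is installed. Note that recent dplyr releases mark the _at verbs as superseded by across(), although they still work.

```r
library(dplyr)

# Hypothetical toy data with missing values in two columns
toy <- tibble(
  id = 1:4,
  a  = c(1, NA, 3, 5),
  b  = c(NA, 2, 2, 2)
)

# Hypothetical helper: replace each NA with the mean of the observed values
impute_mean <- function(x) {
  ifelse(is.na(x), mean(x, na.rm = TRUE), x)
}

# mutate_at() applies the same function to every column selected in vars();
# the id column is left untouched
imputed <- toy %>%
  mutate_at(vars(a, b), impute_mean)
```

With over 100 columns, vars() can also take selection helpers such as starts_with() so that the columns do not have to be listed by hand.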