8 Introduction to Data Science at Scale
In the last Part, we discussed statistical tests, including the t-test, chi-square test, KS test, and ANOVA, on a couple of features in our analytic data set. We compared and contrasted the resulting p-values and traced the downstream effect of each feature in a simple logistic regression. This illustrates the importance of Exploratory Data Analysis. However, today's data scientist has to contend with data sets consisting of hundreds or thousands of features. In this Part we consider the scalability of our analytics: how do we perform Exploratory Data Analysis on hundreds of features in a data set?
We define three new classes of features arising from the Demographics (Section 12.1), Labs (Section 12.2), and Examination (Section 12.3) domains. Most of this is a review of the use of case_when; however, the Labs do present some challenges worth examining in Section 12.2.1.
Our primary discussion will be around learning how to functionalize some of our processes in R: we have over 100 columns to analyze, and copying and pasting code for each one is ineffective.
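As a minimal sketch of what functionalizing means here, the base R example below writes a summary once and applies it to every numeric column. The helper name summarise_feature and the toy data frame are hypothetical and stand in for the real analytic data set; they are not taken from later chapters.

```r
# Hypothetical helper: summarise a single numeric feature.
summarise_feature <- function(x) {
  c(
    n_missing = sum(is.na(x)),
    mean      = mean(x, na.rm = TRUE),
    sd        = sd(x, na.rm = TRUE)
  )
}

# Toy data standing in for a wide analytic data set (illustrative values only).
toy <- data.frame(
  age    = c(34, 51, NA, 62),
  weight = c(70.5, 82.1, 90.3, NA),
  height = c(165, 180, 175, 170)
)

# Select the numeric columns, then apply the same function to each of them
# instead of copying and pasting the summary code per column.
numeric_cols <- names(toy)[vapply(toy, is.numeric, logical(1))]
feature_summary <- t(vapply(toy[numeric_cols], summarise_feature, numeric(3)))

print(feature_summary)
```

The same call works unchanged whether the data set has three numeric columns or three hundred, which is the point of writing the function once.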
In Chapter 9 we will
- introduce several concepts, including enquo and variable resolution
- showcase furrr to iterate and speed up functions
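To give a flavor of the first item, here is a minimal sketch of a function that takes bare column names, captures them with enquo, and resolves them inside dplyr verbs. It assumes that "variable resolution" refers to unquoting with the !! operator; the function name mean_by_group and the toy tibble are hypothetical.

```r
library(dplyr)

# Hypothetical helper: mean of `value_col` within each level of `group_col`.
mean_by_group <- function(data, group_col, value_col) {
  # enquo() captures the column name the caller typed, without evaluating it.
  group_col <- enquo(group_col)
  value_col <- enquo(value_col)

  data %>%
    # !! unquotes the captured name so dplyr resolves it against `data`.
    group_by(!!group_col) %>%
    summarise(mean_value = mean(!!value_col, na.rm = TRUE), .groups = "drop")
}

# Toy data (illustrative values only).
toy <- tibble(
  sex = c("F", "M", "F", "M"),
  bmi = c(22, 27, 24, NA)
)

result <- mean_by_group(toy, sex, bmi)
print(result)
```

Because the column names are passed as arguments, the same function can be reused for any grouping and value column rather than being rewritten per feature.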
In Chapter 10, as we continue to analyze the data, we will
- discuss variants of the usual dplyr functions, such as filter_at and others.
- discuss missing data analytics and mean value imputation.
- showcase a few different packages that look for correlated features and discuss why we want to look for correlated features.
- review Principal Component Analysis and k-means clustering as means of data reduction.
- give many examples of how to easily define useful functions to aid in your analysis.
- introduce GGplot, among other packages, to assist with automation of EDA tasks.
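As a small preview of the missing data item, the sketch below measures the share of missing values per column and then applies mean value imputation to the numeric columns in base R. The helper impute_mean and the toy data are hypothetical, and this is only one simple way to impute, not necessarily the approach developed in Chapter 10.

```r
# Hypothetical helper: replace missing values with the mean of the observed ones.
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Toy data (illustrative values only).
toy <- data.frame(
  id      = c("a", "b", "c", "d"),
  glucose = c(90, NA, 110, 100),
  hdl     = c(NA, 50, 60, 55)
)

# Share of missing values per column, a basic missing data diagnostic.
missing_share <- colMeans(is.na(toy))
print(missing_share)

# Impute only the numeric columns; non-numeric columns are left untouched.
numeric_cols <- vapply(toy, is.numeric, logical(1))
imputed <- toy
imputed[numeric_cols] <- lapply(toy[numeric_cols], impute_mean)
print(imputed)
```

Mean imputation keeps each column's mean unchanged but shrinks its variance, which is one reason to inspect the share of missing values before relying on it.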