8 Introduction to Data Science at Scale
In the last Part, we discussed statistical tests, including the t-test, chi-square test, KS test, and ANOVA, on a couple of features in our analytic data set. We compared and contrasted their p-values and traced the downstream effect of each feature in a simple logistic regression. This illustrates the importance of Exploratory Data Analysis. However, today’s data scientist has to contend with data sets consisting of hundreds or thousands of features. In this Part we will consider the scalability of our analytics: how do we perform Exploratory Data Analysis on hundreds of features in a dataset?
We first define three new classes of features arising from the Demographics (Section 12.1), Labs (Section 12.2), and Examination (Section 12.3) domains. Most of this is a review of the use of `case_when`; however, the Labs do present some challenges, which we examine in Section 12.2.1.
Our primary focus will be learning how to functionalize some of our processes in R: we have over 100 columns to analyze, and copying and pasting code is ineffective.
In Chapter 9 we will
- introduce several concepts, including `enquo` and variable resolution with `!!`.
- showcase the `comparedf` and `tableby` functions in `arsenal`.
- use `purrr` and `furrr` to iterate and speed up functions.
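As a preview of the first item, here is a minimal sketch of how `enquo` and `!!` let a function accept bare column names. The helper `mean_by` and the use of the built-in `mtcars` data are assumptions made for illustration; they are not taken from the analysis in this book.

```r
library(dplyr)

# Hypothetical helper: mean of one column within groups of another.
mean_by <- function(data, group_col, value_col) {
  group_col <- enquo(group_col)  # capture the bare column name, unevaluated
  value_col <- enquo(value_col)
  data %>%
    group_by(!!group_col) %>%    # !! unquotes the captured name inside the verb
    summarise(mean_value = mean(!!value_col, na.rm = TRUE))
}

# Illustrative call on a built-in data set: mean mpg per cylinder count.
result <- mean_by(mtcars, cyl, mpg)
print(result)
```

Writing the helper once means the same summary can be reused for any pair of columns, which is the kind of reuse we want when there are over 100 columns to analyze.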
In Chapter 10, as we continue to analyze the data, we will
- discuss variants of normal `dplyr` functions with their `_at` brethren: `mutate_at`, `summarise_at`, `filter_at`, and others.
- discuss missing data analytics and mean value imputation.
- showcase a few different packages that look for correlated features and discuss why we want to look for correlated features.
- review Principal Component Analysis and k-means clustering as means of data reduction.
- give many examples of how to easily define useful functions to aid in your analysis.
- showcase `DataExplorer`, `skimr`, and `ggplot2` as packages to assist with automation of EDA tasks.
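To give a flavor of the first two items, the sketch below applies mean value imputation to several columns at once with `mutate_at`. The toy data frame and the helper `impute_mean` are made up for illustration and are not part of the analytic data set.

```r
library(dplyr)

# Toy data (invented for illustration) with missing values in two columns.
toy <- data.frame(
  id  = 1:4,
  age = c(30, NA, 50, 40),
  bmi = c(22, 28, NA, 25)
)

# Hypothetical helper: replace each NA with the mean of the observed values.
impute_mean <- function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x)

# Apply the helper to the selected columns only; id is left untouched.
imputed <- toy %>%
  mutate_at(vars(age, bmi), impute_mean)

print(imputed)
```

In dplyr 1.0.0 and later, the `_at` variants are superseded by `across()`, although they remain available.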