Overview

In this course we will review some of the tools of the trade, namely R’s tidyverse (Wickham and Grolemund 2017; Winter 2019), a collection of R packages designed around a common framework to aid in common data wrangling and data management tasks.

Data wrangling is one subset of skills within the data science process. We will carefully investigate how decisions made while collecting and preparing the data have downstream effects on model performance.

Analysis is worthless if it goes uncommunicated. Stakeholders need regular, up-to-date information to act upon. Luckily, RMarkdown (Xie, Dervieux, and Riederer 2020; Xie, Allaire, and Grolemund 2018), together with the knitr (Xie 2022, 2015) and shiny packages, lets R integrate seamlessly into reporting tasks.

RStudio v1.4 features a Visual Markdown Editor, which is very nice if you want to edit reports or documents and prefer a “what-you-see-is-what-you-get” interface over code. I find flipping between the two handy when I’m wondering what a document might look like before rendering it; it has been a time saver in developing this course!

Finally, combined with R’s modeling capabilities, the entire data science process, from data ingestion to modeling to package development and version control, can be managed nicely from within RStudio.

We will go through what some might call a boilerplate pass, walking through how to get started with these various tools to solve common data questions.

The primary resource is an SQLite database built from various files downloaded from the NHANES data source.

The goal here is to give you the experience of connecting to and working with a database from R: collecting data from a database, running statistical analyses, and making inferences as to which features would perform well when fitting predictive models.
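As a minimal sketch of that workflow, the following connects to a SQLite database from R, runs an aggregate query, and pulls the result into a data frame. It assumes the DBI and RSQLite packages are installed. The in-memory database, the `demo` table, and its columns are hypothetical stand-ins for illustration, not the course's actual NHANES database.

```r
library(DBI)

# Assumption: an in-memory SQLite database stands in for the course's NHANES file.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical toy table; the real database has its own tables and columns.
demo <- data.frame(
  id = 1:6,
  age = c(34, 51, 47, 62, 29, 58),
  diabetic = c(0, 1, 0, 1, 0, 1)
)
dbWriteTable(con, "demo", demo)

# List the tables that are available on the connection.
tables <- dbListTables(con)

# Push the aggregation to the database and collect only the summary.
age_by_status <- dbGetQuery(
  con,
  "SELECT diabetic, COUNT(*) AS n, AVG(age) AS mean_age
   FROM demo
   GROUP BY diabetic
   ORDER BY diabetic"
)
print(age_by_status)

dbDisconnect(con)
```

With a database stored on disk, the path to the file would take the place of `":memory:"`. Letting the database do the grouping means only the small summary table travels back into R.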

PART I - Welcome to R

In the first part of the text, we will cover getting started with R, install the packages that we will use throughout the rest of the book, and introduce the tidyverse.

  • 1 - Welcome to R
  • 2 - The Tidyverse

PART II - Feature Engineering

In this part we will define a few features, targets, and other data-points of interest, including Gender, Age, Diabetic status, and Age at Diabetes. We will briefly use these features to discuss plotting with ggplot2 in R.

  • 3 - Feature Engineering
  • 4 - The Anatomy of ggplot

PART III - Exploratory Data Analysis

We use our data-set with the few features we defined in the last part to review statistical tests such as the t-test, the KS test, the chi-square test, and ANOVA. We will showcase the relationship between the p-values of statistical tests and the corresponding model accuracy. We discuss two-factor classification with:

  • 6 - A single continuous feature
  • 7 - Categorical features and interactions
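To make the link between p-values and model accuracy concrete, here is a small sketch using only base R. The data are simulated, and the variable names (`age`, `gender`, `diabetic`) and effect sizes are assumptions chosen for illustration. The single-feature logistic regression and its in-sample accuracy at a 0.5 cutoff are one possible way to measure "model accuracy", not necessarily the approach taken in the chapters.

```r
set.seed(42)  # reproducible simulated data

# Hypothetical data: age is shifted upward for the diabetic group.
n <- 200
diabetic <- rep(c(0, 1), each = n)
age <- c(rnorm(n, mean = 45, sd = 12), rnorm(n, mean = 58, sd = 12))
gender <- sample(c("F", "M"), size = 2 * n, replace = TRUE)

# Two-sample tests for a continuous feature against a binary target.
t_res  <- t.test(age ~ diabetic)
ks_res <- ks.test(age[diabetic == 0], age[diabetic == 1])

# Chi-square test of independence for a categorical feature.
chi_res <- chisq.test(table(gender, diabetic))

# Single-feature logistic regression; in-sample accuracy at a 0.5 cutoff.
fit <- glm(diabetic ~ age, family = binomial)
pred <- as.integer(fitted(fit) > 0.5)
accuracy <- mean(pred == diabetic)

c(t_p = t_res$p.value, ks_p = ks_res$p.value,
  chi_p = chi_res$p.value, accuracy = accuracy)
```

Because `age` was simulated with a real difference between the groups, its tests return small p-values and the model does better than chance. `gender` was drawn independently of the target, so its chi-square p-value carries no such signal.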

PART IV - Data Analytics at Scale

It’s unrealistic to expect that we will have only 3 or 4 features to review; we need to understand how to make R work for us.

We have provided features across three domains of interest:

  • 12.1 - Demographic - Feature Engineering
  • 12.2 - Labs - Feature Engineering
  • 12.3 - Examination - Feature Engineering

For the most part, these features are mapped in with methods similar to those we used in Part II; however, there are some issues when dealing with the Lab data.

We will use these domains to create a new analytic data-set with hundreds of columns and then discuss:

  • 9 - Functional dbplyr, purrr, and furrr
  • 10 - Exploratory Data Analysis at Scale
  • 11 - Packages for Automated Exploratory Data Analysis
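The idea of making R do the repetitive work can be sketched as follows: write the test once and apply it to every column. This example uses base R's `vapply()` on simulated data. The `features` data frame, the `target` vector, and the column names are hypothetical, and the chapters themselves may rely on purrr or furrr instead.

```r
set.seed(1)

# Hypothetical wide data-set: one binary target and many numeric features.
n <- 300
target <- rep(c(0, 1), each = n / 2)
features <- as.data.frame(replicate(20, rnorm(n)))
names(features) <- paste0("feature_", seq_len(20))

# Make the first feature genuinely informative so something stands out.
features$feature_1 <- features$feature_1 + 1.5 * target

# Apply the same t-test to every column and keep only the p-value.
p_values <- vapply(
  features,
  function(x) t.test(x ~ target)$p.value,
  numeric(1)
)

# Rank features from most to least separated between the two groups.
ranked <- sort(p_values)
head(ranked, 5)
```

In the tidyverse, `purrr::map_dbl()` plays the same role as `vapply(..., numeric(1))`. With hundreds of columns, some small p-values will appear by chance alone, so a ranking like this is a screening step rather than a final answer.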

PART V - Factors, Time, & Text

  • 13 - Diabetic monitoring; time-series classification

PART VI - Communicate Results

A large aspect of Data Science is communicating results to stakeholders. In this part we will introduce Shiny and flex_dashboards, and discuss the options available when we knit an R Markdown file.

  • 14 - Shiny
  • flex_dashboard

PART VII - Package it up

  • Make a package