Preface

This textbook is currently in DRAFT form and will be updated frequently.

The objective of this textbook is to provide an approachable introduction to the knowledge, skills, and abilities of modern data scientists. Data-driven problem solving need not be restricted to the realm of advanced mathematics or expert computer programming. Data science can and should be practiced by all. In this text, we unveil the methods and tools applied by data scientists to solve real-world problems in a variety of domains. The content is accessible to anyone with the intellectual curiosity to ask and answer data-driven questions. This book is for you!

Author

Kristopher A. Pruitt is an associate teaching professor of applied mathematics at the University of Colorado in Boulder. His practical experience spans government and industry as a scientific analyst for the United States Air Force (USAF) and a data scientist for NBC Sports. Kris has been an undergraduate educator in operations research, statistics, and data science since 2005. In 2018, he was the first mathematician in the history of the USAF Academy to be selected by the graduating class as the winner of the William H. Heiser Award for educational excellence. Kris is passionate about making data science accessible and inclusive for all.

Usage

Though well-suited for self-study, this text is designed to support a one-semester (3 credit-hours) introductory course in the fundamentals of data science. The delivery assumes no significant prerequisite coursework beyond high school-level algebra and geometry. However, a modest amount of data literacy and computer proficiency will ease the transition into the content. The R programming language is employed for the technical aspects of the material, and the associated code is included in the text where appropriate. The original course for which this text was designed included a separate weekly lab (1 credit-hour) that focused specifically on teaching R and RStudio. This format permits roughly 40 lectures to focus on key concepts and problem solving, complemented by 15 recitations to focus on computing skills. If the separate lab is not an option, some prior self-study in R, RStudio, and the tidyverse packages is highly recommended.

Overview

This text is organized into the five chapters described below. At the end of each chapter, we list learning objectives, provide practical exercises, and recommend resources for further study. Data sets are drawn from built-in R packages, companion files, and online resources. Examples and case studies assume the use of RStudio on either a local machine or a cloud service.

Chapter 1 introduces the unique discipline of data science. Beginning with a brief history, the chapter proceeds with a discussion of the common knowledge, skills, and abilities expected of a modern data scientist. In order to apply these competencies, we then introduce the 5A Method for solving data-driven problems. This problem-solving framework is demonstrated repeatedly throughout the text using real-world data.

Chapter 2 describes the process of data acquisition. We begin with an introduction to the terminology of data structures with a focus on defining variable and data types. Then we describe the various sources and formats from which data is imported. Here the primary focus is existing data from built-in tables, local files, or online applications. However, we also include a section on simulated data. After acquisition, we present a suite of tools available for organizing, summarizing, aggregating, and cleaning data. Collectively known as data wrangling, these tools prepare raw data for analysis.

Chapter 3 is the first of three chapters introducing the types of problems data scientists solve: exploratory, inferential, and predictive. For exploratory analyses, we begin with a discussion of data summaries, including descriptive statistics and visual graphics. We then apply these competencies to visualize the distributions and associations among variables. The presentation of distributions includes single and multivariate cases while highlighting bar charts, histograms, box plots, heat maps, contour plots, word clouds, and choropleth maps. The sections regarding associations include linear and nonlinear relationships with a concentration on contingency tables, scatter plots (linear, nonlinear, and clustered), line graphs, and biplots.

Chapter 4 focuses on inferential analyses. We begin with the requisite discussions of sampling, bias, and point estimation. Then the chapter proceeds with an introduction to the concepts of confidence intervals and hypothesis testing. For confidence intervals, we start with a description of sampling distributions and then demonstrate their use by estimating individual parameters and the differences between parameters. For hypothesis tests, we introduce null distributions and then apply them to tests of one, two, and many parameters. We illustrate both inferential methods for three parameter types: proportions, means, and slopes.

Chapter 5 completes the text with an introduction to predictive analyses. We start the chapter by distinguishing the various types of predictive models and by emphasizing the importance of model validation. The remaining sections are split between regression and classification models. For regression, we discuss the prediction accuracy of simple and multiple linear regression models based on root mean squared error. For classification, we present the accuracy of simple and multiple logistic regression models based on classification error rate. In both cases, we present key modeling enhancements such as categorical predictors and interactions. Finally, we introduce two common non-parametric approaches: decision trees and $K$-nearest neighbors.

Chapters 3 through 5 are largely driven by case studies using the 5A Method for problem solving. Each new concept is introduced in the context of a research question and demonstrated with data using RStudio. The analysis methods are primarily computational, with limited focus on theoretical foundations. The intent with this approach is two-fold. Firstly, we wish to introduce a vast and exciting range of data science applications to the widest possible audience without the often intimidating presence of dense mathematical notation and complex formulas. Secondly, we emphasize the careful formulation of interesting problems and the clear communication of insightful solutions while leveraging modern technology for the analysis in between. Ideally, this text inspires students to pursue a deeper understanding of the theory that supports the computational methods.

Acknowledgements

TBD

Data Science in Action