Preface

This textbook is currently in DRAFT form and will be updated frequently.

The objective of this textbook is to provide an approachable introduction to the knowledge, skills, and abilities of modern data scientists. Data-driven problem solving need not be restricted to the realm of advanced mathematics or expert computer programming. Data science can and should be practiced by all. In this text, we unveil the methods and tools that data scientists apply to solve real-world problems in a variety of domains. The content is nonetheless accessible to anyone with the intellectual curiosity to ask and answer data-driven questions. This book is for you!

Authors

Kristopher A. Pruitt is an associate teaching professor of applied mathematics at the University of Colorado in Boulder. His practical experience spans government and industry as a scientific analyst for the United States Air Force (USAF) and a data scientist for NBC Sports. Kris has been an undergraduate educator in operations research, statistics, and data science since 2005. In 2018, he was the first mathematician in the history of the USAF Academy to be selected by the graduating class as the winner of the William H. Heiser Award for educational excellence. Kris is passionate about making data science accessible and inclusive for all.

Usage

Though well suited to self-study, this text is designed to support a one-semester (3 credit-hours) introductory course in the fundamentals of data science. The delivery assumes no significant prerequisite coursework beyond high school-level algebra. However, a modest amount of data literacy and computer proficiency will ease the transition into the content. The R programming language is employed for the technical aspects of the material, and the associated code is included in the text where appropriate. The original course for which this text was designed included a separate weekly lab (1 credit-hour) that focused specifically on teaching R and RStudio. This format permitted 43 lectures to focus on key concepts and problem solving, complemented by 15 recitations to focus on computing skills. If the separate lab is not an option, some prior self-study in R, RStudio, and the tidyverse collection of packages is highly recommended.

Overview

This text is organized into the five chapters described below. At the end of each chapter, we list learning objectives, provide practical exercises, and recommend resources for further study. Data sets are drawn from built-in R packages, companion files, and online resources. Examples and case studies assume the use of RStudio, either on a local machine or through a cloud service.

Chapter 1 introduces the unique discipline of data science. Beginning with a brief history, the chapter proceeds with a discussion of the common knowledge, skills, and abilities expected of a modern data scientist. To apply these competencies, we then introduce the 5A Method for solving data-driven problems. Finally, the chapter concludes with an introduction to the terminology of data structures, with a focus on defining variable types and data types.

Chapter 2 describes the process of data acquisition. First, we describe the various sources and formats from which data is imported. Here, the primary focus is the import of existing data from built-in tables, local files, or online applications. However, we also include a section on simulated data. After acquisition, we present a suite of tools available for organizing, summarizing, aggregating, and cleaning data. Collectively known as data wrangling, these tools prepare raw data for analysis.

Chapter 3 is the first of three chapters introducing the types of problems data scientists solve: exploratory, inferential, and predictive. For exploratory analyses, we begin with a discussion of best practices in data visualization. We then apply these practices to visualize the distributions of, and associations among, variables. The presentation of distributions includes univariate and multivariate cases while highlighting bar charts, histograms, box plots, and heat maps. The sections regarding associations include linear and nonlinear relationships, with a concentration on scatter plots and line graphs.

Chapter 4 focuses on inferential statistics. We begin with the requisite discussions of sampling, bias, and estimation. Then the chapter proceeds with an introduction to the concepts of confidence intervals and hypothesis testing. We demonstrate both inferential methods for three parameter types: proportion, mean, and slope. For proportions and means, we investigate individual parameters, the difference between two parameters, and the differences among many parameters. For the slope parameter, we limit the presentation to a single slope but discuss both linear and logistic regression.

Chapter 5 completes the text with an introduction to predictive modeling. We start the chapter by distinguishing the various types of predictive models and by emphasizing the importance of training and testing sets. The remaining sections are split between regression and classification models. For regression, we compare multiple linear regression and regression trees based on root mean squared error. For classification, we compare multiple logistic regression and classification trees based on classification error rate.

Chapters 3 through 5 are largely driven by case studies using the 5A Method for problem solving. Each new concept is introduced in the context of a research question and demonstrated with data using RStudio. The analysis methods are primarily computational, with limited focus on theoretical foundations. The intent of this approach is twofold. First, we wish to introduce a vast and exciting range of data science applications to the widest possible audience without the often-intimidating presence of dizzying mathematical notation and complex formulas. Second, we emphasize the careful formulation of interesting problems and the clear communication of insightful solutions, while leveraging modern technology for the analysis in between. Ideally, this text inspires students to pursue a deeper understanding of the theory that supports the computational methods.

Acknowledgements

TBD

Data Science for All