Data Science in Action
An application-focused introduction to data-driven problem solving
2024-08-15
Preface
This textbook is currently in DRAFT form and will be updated frequently.
The objective of this textbook is to provide an approachable introduction to the knowledge, skills, and abilities of modern data scientists. Data-driven problem solving need not be restricted to the realm of advanced mathematics or expert computer programming. Data science can and should be practiced by all. In this text, we unveil the methods and tools applied by data scientists to solve real-world problems in a variety of domains, and we keep the content accessible to anyone with the intellectual curiosity to ask and answer data-driven questions. This book is for you!
Usage
Though well suited for self-study, this text is designed to support a one-semester (3 credit-hours) introductory course in the fundamentals of data science. The delivery assumes no significant prerequisite coursework beyond high school-level algebra and geometry. However, a modest amount of data literacy and computer proficiency will ease the transition into the content. The R programming language is employed for the technical aspects of the material, and the associated code is included in the text where appropriate. The original course for which this text was designed included a separate weekly lab (1 credit-hour) that focused specifically on teaching R and RStudio. This format permits roughly 40 lectures to focus on key concepts and problem solving, complemented by 15 recitations devoted to computing skills. If the separate lab is not an option, some prior self-study in R, RStudio, and the tidyverse collection of packages is highly recommended.
Overview
This text is organized into the five chapters described below. At the end of each chapter, we list learning objectives, provide practical exercises, and recommend resources for further study. Data sets are drawn from built-in R packages, companion files, and online resources. Examples and case studies assume the use of RStudio, either on a local machine or through a cloud service.
Chapter 1 introduces the unique discipline of data science. Beginning with a brief history, the chapter proceeds with a discussion of the common knowledge, skills, and abilities expected of a modern data scientist. In order to apply these competencies, we then introduce the 5A Method for solving data-driven problems. This problem-solving framework is demonstrated repeatedly throughout the text using real-world data.
Chapter 2 describes the process of data acquisition. We begin with an introduction to the terminology of data structures with a focus on defining variable and data types. Then we describe the various sources and formats from which data is imported. Here the primary focus is existing data from built-in tables, local files, or online applications. However, we also include a section on simulated data. After acquisition, we present a suite of tools available for organizing, summarizing, aggregating, and cleaning data. Collectively known as data wrangling, these tools prepare raw data for analysis.
Chapter 3 is the first of three chapters introducing the types of problems data scientists solve: exploratory, inferential, and predictive. For exploratory analyses, we begin with a discussion of best practices in data visualization. We then apply these competencies to visualize the distributions of variables and the associations among them. The presentation of distributions includes univariate and multivariate cases while highlighting bar charts, histograms, box plots, word clouds, heat maps, and contour plots. The sections regarding associations include linear and nonlinear relationships with a concentration on contingency tables, scatter plots, line graphs, and biplots.
Chapter 4 focuses on inferential statistics. We begin with the requisite discussions of sampling, bias, and estimation. Then the chapter proceeds with an introduction to the concepts of confidence intervals and hypothesis testing. For confidence intervals, we start with a description of sampling distributions and then demonstrate their use by estimating individual parameters and the differences between parameters. For hypothesis tests, we introduce null distributions and then apply them to tests of one, two, and many parameters. We illustrate both inferential methods for three parameter types: proportions, means, and slopes.
Chapter 5 completes the text with an introduction to predictive modeling. We start the chapter by distinguishing the various types of predictive models and by emphasizing the importance of model validation. The remaining sections are split between regression and classification models. For regression, we compare the prediction accuracy of multiple linear regression and decision trees based on root mean squared error. For classification, we compare the accuracy of multiple logistic regression and \(k\)-nearest neighbors based on classification error rate. In both cases, we present key modeling enhancements such as categorical predictors and interactions.
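As a minimal sketch of the two accuracy measures named above, the following base R code computes root mean squared error and classification error rate from small made-up vectors. The object names and values are illustrative assumptions and are not drawn from the case studies in the text.

```r
# Hypothetical observed and predicted values for a regression model
actual    <- c(3.0, 5.0, 2.5, 7.0)
predicted <- c(2.5, 5.0, 4.0, 8.0)

# Root mean squared error: square root of the average squared residual
rmse <- sqrt(mean((actual - predicted)^2))

# Hypothetical observed and predicted classes for a classification model
observed_class  <- c("yes", "no", "yes", "yes", "no")
predicted_class <- c("yes", "no", "no", "yes", "yes")

# Classification error rate: proportion of predictions that miss the observed class
error_rate <- mean(observed_class != predicted_class)

print(rmse)
print(error_rate)
```

In both cases a smaller value indicates more accurate predictions, which is what allows two competing models to be compared on the same data.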
Chapters 3 through 5 are largely driven by case studies using the 5A Method for problem solving. Each new concept is introduced in the context of a research question and demonstrated with data using RStudio. The analysis methods are primarily computational, with limited focus on theoretical foundations. The intent of this approach is twofold. First, we wish to introduce a vast and exciting range of data science applications to the widest possible audience without the often intimidating presence of dense mathematical notation and complex formulas. Second, we emphasize the careful formulation of interesting problems and the clear communication of insightful solutions while leveraging modern technology for the analysis in between. Ideally, this text inspires students to pursue a deeper understanding of the theory that supports the computational methods.