Data Science for All
2023-09-24
Preface
This textbook is currently in DRAFT form and will be updated frequently.
The objective of this textbook is to provide an approachable introduction to the knowledge, skills, and abilities of modern data scientists. Data-driven problem solving need not be restricted to the realm of advanced mathematics or expert computer programming. Data science can and should be practiced by all. In this text, we unveil the methods and tools applied by data scientists to solve real-world problems in a variety of domains, and we present them so that the content is accessible to anyone with the intellectual curiosity to ask and answer data-driven questions. This book is for you!
Usage
Though well suited for self-study, this text is designed to support a one-semester (3 credit-hours) introductory course in the fundamentals of data science. The delivery assumes no significant prerequisite coursework beyond high school-level algebra. However, a modest amount of data literacy and computer proficiency will ease the transition into the content. The R programming language is employed for the technical aspects of the material, and the associated code is included in the text where appropriate. The original course for which this text was designed included a separate weekly lab (1 credit-hour) that focused specifically on teaching R and RStudio. This format permitted 43 lectures to focus on key concepts and problem solving, complemented by 15 recitations to focus on computing skills. If the separate lab is not an option, some prior self-study in R, RStudio, and the tidyverse packages is highly recommended.
Overview
This text is organized into the five chapters described below. At the end of each chapter, we list learning objectives, recommend resources for further study, and provide practical exercises. Data sets are drawn from built-in R packages, companion files, and online resources. Examples and case studies assume the use of RStudio, either on a local machine or through a cloud service.
Chapter 1 introduces the unique discipline of data science. Beginning with a brief history, the chapter proceeds with a discussion of the common knowledge, skills, and abilities expected of a modern data scientist. To apply these competencies, we then introduce the 5A Method for solving data-driven problems. Finally, the chapter concludes with a description of the types of statistical models data scientists might be expected to develop as part of the problem-solving process.
Chapter 2 describes how data is structured, imported, wrangled, and visualized. First, we introduce the characteristics of tidy data and distinguish between types of variables and data. Next, we describe the various sources and formats from which data is imported or simulated. After acquisition, we present a suite of tools available for organizing, summarizing, aggregating, and cleaning data. Lastly, we introduce the key characteristics of data visualizations.
Chapter 3 is the first of three chapters introducing the types of problems data scientists solve: exploratory, inferential, and predictive. For exploratory analyses, we begin with a discussion of the distributions and associations among categorical variables. Relevant topics include contingency tables for grouped factors and sentiment analysis for free text. We then proceed with the distributions and associations among numerical variables. For these sections, topics include geospatial and time series analyses, as well as linear and logistic regression. The chapter concludes with a brief introduction to unsupervised learning, including clustering and dimension reduction.
Chapter 4 focuses on inferential statistics. We begin with the requisite discussions of sampling, bias, and distributions. Then the chapter proceeds with an introduction to the concepts of confidence intervals and hypothesis testing. The remainder of the chapter demonstrates confidence intervals and hypothesis testing for three parameter types: proportion, mean, and slope. For proportions and means, we investigate individual parameters, the difference between two parameters, and the differences among many parameters. For the slope of a regression line, we limit the presentation to a single slope and provide additional discussion of technical conditions.
Chapter 5 completes the text with an introduction to predictive modeling. We start the chapter by distinguishing between training and testing data, including a careful examination of the bias-variance trade-off. We then explore the related concept of cross-validation before delving into model development. The sections on predictive modeling are split into regression and classification. For regression, we employ multiple linear regression and decision trees. For classification, we apply multiple logistic regression and \(K\)-Nearest Neighbors.
Chapters 3 through 5 are largely driven by case studies using the 5A Method for problem solving. Each new concept is introduced in the context of a research question and demonstrated with data using RStudio. The analysis methods are primarily computational, with limited focus on theoretical foundations. The intent of this approach is twofold. First, we wish to introduce a vast and exciting range of data science applications to the widest possible audience without the often intimidating presence of dizzying mathematical notation and complex formulas. Second, we emphasize the careful formulation of interesting problems and the clear communication of insightful solutions, while leveraging modern technology for the analysis in between. Ideally, this text inspires students to pursue a deeper understanding of the theory that supports the computational methods.