This book provides a gentle introduction to data science for students of any discipline with little or no background in data analysis or computer programming. Based on notions of representation, measurement, and modeling, we examine key data types and data structures, and then learn to clean, summarize, transform, and visualize data to communicate our conclusions.

The main limitation of this book is that it focuses on rectangular data: data that can be represented in the rows and columns of tables. What may seem a second limitation, the absence of sophisticated modeling and fancy machine learning techniques, may actually be a feature. Rather than focusing on fancy methods and the latest hype, we emphasize basic concepts, solid skills, and simple tools.

In today’s educational environment, students of statistics and elementary computer science are often overwhelmed by complicated techniques and flashy tools. Given the availability of statistical software and machine learning platforms, it is tempting to get carried away by academic techniques with awe-inspiring acronyms and pedigrees. But whenever these methods and tools become opaque to us, we can neither check the validity of our results nor detect and fix potential errors. Thus, our main goals here are to promote a deeper understanding of data-related tasks, introduce some tools for tackling them, and suggest possible ways of communicating our solutions in a transparent fashion. By reflecting on the relations between representations, tasks, and tools, this book and course promote data literacy and cultivate reproducible research practices that precede and enable any practical uses of programming or statistics.

This book is still being written and revised. It currently serves as a scaffold for a curriculum that will be filled with content as we go along.

Disclaimer: This course contains ADILT content.