Fisgard Lighthouse


It is routinely noted that the Pareto principle applies to data science—80% of one’s time is spent on data collection and preparation, and the remaining 20% on the “fun stuff” like modeling, data visualization, and communication.

There is no shortage of material (textbooks, journal articles, blog posts, online courses, podcasts, etc.) about the 20%. This modest book hopes to serve as an introduction and wayfinder for readers seeking to understand core elements of the other 80%.

0.1 You, the reader

It is assumed that the reader of this book will have a working knowledge of the fundamental data manipulation functions in R (whether base or tidyverse or packages beyond those), or another programming language. If you can filter for specific values in the variables, select the columns you want, and tell the difference between a character string and a numeric value ("1" or 1), then we’re on our way.
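As a quick self-check, here is a minimal base R sketch of those three skills. The data frame `toy` and its column names are made up purely for illustration; they do not come from any dataset used in this book.

```r
# A small, hypothetical data frame (values invented for this example)
toy <- data.frame(
  group = c("a", "b", "a"),
  value = c(10, 20, 30),
  year  = c("2019", "2020", "2021"),  # stored as character strings on purpose
  stringsAsFactors = FALSE
)

# Filter for a specific value in one variable, then select two columns
group_a <- toy[toy$group == "a", c("group", "value")]

# "1" is a character string; 1 is a numeric value
is.character("1")
is.numeric(1)

# Character strings that hold numbers must be converted before doing arithmetic
year_num <- as.numeric(toy$year)
year_num + 1
```

If each line does what you expected, you have the background this book assumes. The tidyverse equivalents of the filtering and selecting step would be `dplyr::filter()` and `dplyr::select()`.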

If you don’t possess that knowledge yet, I would recommend that you work through R for Data Science by Hadley Wickham and Garrett Grolemund (Wickham and Grolemund 2016). This book, freely available online, will give you a running start.


I would like to acknowledge everyone who has contributed to the books, articles, blog posts, and R packages cited within.

Some important details

0.1.2 Source code

The source code for this ebook can be found at this GitHub repository:

This book is written in Markdown, using the {bookdown} package (Xie 2020), and published to the web.

0.1.3 Cover image

The cover image is a wayfinder close to my home: Fisgard Lighthouse, marking the entrance to Esquimalt Harbour in Victoria, British Columbia, Canada.

The photo was taken by Jeff Hitchcock. The site it was downloaded from notes that the image is licensed under the Creative Commons Attribution 2.0 Generic license (CC BY 2.0).


Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science. O’Reilly Media.

Xie, Yihui. 2020. Bookdown: Authoring Books and Technical Documents with R Markdown.