Data Preparation: Essential Steps Before & After Analysis
Finding Your Way With R and Other Tools
It is routinely noted that the Pareto principle applies to data science—80% of one’s time is spent on data collection and preparation, and the remaining 20% on the “fun stuff” like modeling, data visualization, and communication.
There is no shortage of material (textbooks, journal articles, blog posts, online courses, podcasts, and so on) about the 20%. This modest book hopes to serve as an introduction and wayfinder for readers seeking to understand core elements of the other 80%.
0.1 You, the reader
It is assumed that the reader of this book has a working knowledge of the fundamental data manipulation functions in R (whether base R, the tidyverse, or other packages), or in another programming language. If you can filter for specific values in the variables, select the columns you want, and tell the difference between a character string and a numeric value ("1" versus 1), then we’re on our way.
If you don’t possess that knowledge yet, I recommend that you work through R for Data Science by Hadley Wickham and Garrett Grolemund (Wickham and Grolemund 2016). This book, freely available at r4ds.had.co.nz, will give you a running start.
I would like to acknowledge everyone who has contributed to the books, articles, blog posts, and R packages cited within.
Some important details
This work by Martin Monkman is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 Canada License.
0.1.2 Source code
The source code for this ebook can be found in this GitHub repository: https://github.com/MonkmanMH/data_preparation_with_r
0.1.3 Cover image
The cover image is a wayfinder close to my home: Fisgard Lighthouse, marking the entrance to Esquimalt Harbour in Victoria, British Columbia, Canada. (Location: https://www.openstreetmap.org/#map=16/48.4307/-123.4477)
The photo was taken by Jeff Hitchcock and downloaded from flickr.com; that site notes that the image is licensed under the Creative Commons Attribution 2.0 Generic license (CC BY 2.0).