The Data Preparation Journey
Finding Your Way With R
2024-02-25
Preface
Welcome to The Data Preparation Journey: Finding Your Way With R, a book published with CRC Press as part of The Data Science Series.
This is a work-in-progress; the most recent update is 2024-02-25.
It is routinely noted that the Pareto principle applies to data science—80% of one’s time is spent on data collection and preparation, and the remaining 20% on the “fun stuff” like modelling, data visualization, and communication.
There is no shortage of material—textbooks, journal articles, blog posts, online courses, podcasts, etc.—about the 20%. That’s not to say that there is no material for the other 80%. But it is scattered, found across technique-specific articles and domain-specific books, along with Stack Overflow questions and miscellaneous blog posts. This book serves as a travel guide: an introduction and wayfinder through some of the scattered resources for readers seeking to understand the core elements of data preparation. It is hoped that, like a lighthouse, it will both guide you in the right direction and keep you from running aground.
The book will introduce the principles of data preparation, framed in a systematic approach that follows a typical data science or statistical workflow. With that context, readers will then work through practical solutions to resolving problems in data using the statistical and data science programming language R. These solutions will include examples of complex real-world data.
In Exploratory Data Analysis, Tukey writes, “the analyst of data needs both tools and understanding. The purpose of this book is to provide some of each” (Tukey 1977, 1). It is my modest hope that this book also provides you with both tools and understanding.
You, the reader
You might be an academic, working in the physical sciences, social sciences, or humanities, who is (or will be) analyzing data as part of your research. You might be working in a business setting, where important decisions are being made based on the insights you draw from the data collected as part of interactions with customers. As a public servant, you might be creating the evidence a government or other public agency is using to inform policy and program decisions. The principles and practices described in this book will apply no matter the context.
It is assumed that the reader of this book will have a working knowledge of the fundamental data manipulation functions in R (whether base or tidyverse or packages beyond those) or another programming language that supports that work. If you can filter for specific values in the variables and select the columns you want, know the difference between a character string and a numeric value ("1" or 1), and can create a new variable as the result of a manipulation of others, then we’re on our way.
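As a quick self-check, the skills listed above can be sketched in a few lines of base R. The data frame and its values are made up purely for illustration, and the object names are hypothetical; the same steps could just as well be written with the tidyverse.

```r
# A small, made-up data frame (hypothetical values, for illustration only)
penguins_demo <- data.frame(
  species = c("Adelie", "Gentoo", "Adelie", "Chinstrap"),
  bill_length_mm = c(39.1, 46.1, 38.9, 49.0),
  bill_depth_mm = c(18.7, 13.2, 17.8, 19.5),
  island = c("Torgersen", "Biscoe", "Dream", "Dream")
)

# Filter for specific values in a variable
adelie <- penguins_demo[penguins_demo$species == "Adelie", ]

# Select the columns you want (here, dropping "island")
adelie <- adelie[, c("species", "bill_length_mm", "bill_depth_mm")]

# Create a new variable as the result of a manipulation of others
adelie$bill_ratio <- adelie$bill_length_mm / adelie$bill_depth_mm

# A character string and a numeric value are different types
class("1")           # "character"
class(1)             # "numeric"
identical("1", 1)    # FALSE
as.numeric("1") + 1  # 2
```

If each of these lines makes sense to you, you have the background this book assumes.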
This book leans heavily on R Markdown, particularly when it comes to describing documentation, and on the packages of the tidyverse. Familiarity with both will be very helpful.
If you don’t possess that knowledge yet, I would recommend that you work through R for Data Science (2nd edition) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund (Wickham, Çetinkaya-Rundel, and Grolemund 2023). This book, freely available at r4ds.hadley.nz, will give you a running start.
While some of the topics covered here may be similar to what you’ll find in R for Data Science and other introductory books and similar resources, it is hoped that the examples in this book add more context and expose you to greater technical challenges.
Outline
The first three chapters of this book provide some foundations, elements of the data preparation process that will help guide our thinking and our work, including data documentation (or recordkeeping).
Chapters 4 through 10 cover importing data from a variety of commonly encountered sources, including plain-text files, Excel spreadsheets, statistical software formats, PDF files, internet sources, and databases.
Chapters 11 and 12 tackle finding problems in our data, and then dealing with those problems.
Finally, Chapter 13 presents a short summary and poses the question, “Where to from here?”
Acknowledgements
I would like to acknowledge everyone who has contributed to the books, articles, blog posts, and R packages cited within. As well, thanks to my current colleagues at MNP, my former colleagues at BC Stats, and my colleagues and students at the University of Victoria’s Business Intelligence & Data Analytics program. The enthusiasm of this community of people—some I know well and others around the world I have never met—has helped sustain my own interest, and without that interest I wouldn’t have written this book.
Particular thanks to Julie Hawkins and Emily Riederer, both of whom provided valuable feedback on early drafts, and through their critiques made this book better than it would have otherwise been.
Some important details
License
This work by Martin Monkman is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license.
Data in this book
This book draws on three sources of data:
R packages, such as {palmerpenguins} and {Lahman}, which bundle data ready for use.
Open source data that is freely available on the web.
Mock or synthetic data that was created specifically for this book.
The various data files in the second and third groups are bundled in the R package {dpjr} (Monkman 2023).
To download and install the {dpjr} package, you will need the {remotes} package:
# download and install "remotes"
install.packages("remotes")
# download and install "dpjr"
remotes::install_github("monkmanmh/dpjr")
Once the package is downloaded, the function dpjr::dpjr_data(<filename>) can be used to generate the path to the data file, independent of the specific location on the user’s computer.
For example, to read the CSV file “mpg.csv”:
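A minimal sketch of that call follows. It assumes {dpjr} has been installed as shown above. The object names mpg_path and mpg are chosen here for illustration, and base R's read.csv() is used so that nothing beyond {dpjr} is required; readr::read_csv() could be used instead.

```r
# Minimal sketch: assumes {dpjr} has been installed as shown above.
# The object names are illustrative only.
mpg <- NULL

if (requireNamespace("dpjr", quietly = TRUE)) {
  # dpjr_data() generates the path to the file, wherever the package lives
  mpg_path <- dpjr::dpjr_data("mpg.csv")

  # Read the CSV file at that path into a data frame
  mpg <- read.csv(mpg_path)

  # Look at the first few rows
  print(head(mpg))
}
```

Because the path is generated by the package, the same code should work unchanged on any computer where {dpjr} is installed.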
The {dpjr} package website is here: https://monkmanmh.github.io/dpjr/
The data files used in the {dpjr} package are covered by various open licenses; details can be found on the “Data licenses” page of the package website.
Source code
The source code for this ebook can be found in this GitHub repository: https://github.com/MonkmanMH/data_preparation_journey
This book is written in Markdown, using the {bookdown} package (Xie 2021), and published to the web at bookdown.org.
Cover image
The cover image is a wayfinder close to my home: Fisgard Lighthouse, marking the entrance to Esquimalt Harbour in Victoria, British Columbia, Canada. (Location: 48.4307, -123.4477)
The cover photo is by jennyt, and used with permission.