Data Wrangling Recipes in R
Preface
WHY? Data Wrangling Recipes in R
To save time. Many students and staff spend days and weeks wrestling with R code to prepare data for analysis. This resource is a valuable addition to conventional resources that are structured around function types. It illustrates the workflow, highlighting which functions are important to know and how to piece them together. Example coding is given to overcome many ways in which data can be messy, making practical tasks far easier to deliver.
One PhD supervisor spent ages before they realised that the lack of an appropriate ID variable was the reason why their student couldn’t merge their datasets; too obvious for them to consider. Yet this oversight is understandable from the student perspective, with the overwhelming need to pay attention to so many things when learning R. Hence the value of the list of preparation required prior to merging data for analysis, with instructions on how to undertake each step.
A passionate enthusiasm and many days of effort went into providing easily modifiable ways to pivot data, applicable to a wide variety of different examples. Clustered datasets can thus be changed between having many rows per cluster, and one row per cluster with different versions of each variable for each cluster member or repeated measurement. Both formats are useful for different analyses, tables and graphs; being able to readily switch between formats is a massive advantage.
Yet BUGS are a part of life. Frustration and error messages are an inevitable part of coding. There is no escape from this. Learning to code in R implies learning strategies to avoid and overcome errors.
Feedback
“Exceptional and very useful. some of the commands and tricks I learned today could save me a few days of work each” Abrar Alturkistani, PhD student, School of Public Health.
“Really helpful practical tips for those annoying things that often aren’t covered in other courses I have attended” Dr David Salmon, post-doc, School of Public Health.
“Fabulous . really practical guidance. makes the hard work of restructuring and preparing data significantly easier … I haven’t seen this focused content presented in one place in such an easy to access form before.” Dr Thomas Woodcock, senior research fellow, School of Public Health, module lead for quantitative improvements in health.
“Congratulations on the handbook, it looks very thorough, and I am sure it will be a fantastic resource” Bethan Cracknell-Daniels, PhD student who was part of the team helping G/MPH/MSc students deliver their dissertations in R.
Acknowledgements
I am grateful to Tristan Naidoo for turning this handbook from Word into bookdown format, and for checking some of the code. Any remaining errors are my own. Chapters 9, 11, and 12 are co-authored with Tristan Naidoo. I am grateful to Prof Victoria Cornelius of the ISTDA module for inspiring the anaemia dataset that is used in several chapters of this book; the actual dataset used here is my own creation. I am grateful to many students in the department of public health, Imperial College London. Teaching them data management, in both R and Stata, I learnt many lessons that enabled me to improve my teaching approach.
Introduction
Data Wrangling Recipes in R: Practical demonstrations of easily-modifiable code sequences that prepare messy data for analysis.
This book is designed to demonstrate what is required when preparing data for analysis, with modifiable code so that you can readily achieve each step. For any project, work through from Chapters 1 to 9, with the index and headings intended to make it easy to find what is relevant for your project. If you are lucky, only a few subsections of each chapter may be relevant. Chapters demonstrate how to overcome many different ways in which data can be messy.
If you need to combine datasets, there is a section on merging in R. Unlike other resources, it sets out the important considerations and choices when merging datasets. A checklist of the preparation required before merging is provided, with instructions for each step, so that you can readily achieve results. There are also checks that guard against the risk in R of accidentally duplicating observations.
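As a rough illustration of the kind of duplication check described above, the following is a minimal sketch using dplyr. It is not the book's own checklist code; the datasets `patients` and `bloods` and their columns are hypothetical, invented only for this example.

```r
library(dplyr)

# Hypothetical example data (invented for illustration)
patients <- data.frame(id = c(1, 2, 3), age = c(34, 51, 47))
bloods   <- data.frame(id = c(1, 2, 2, 4), hb = c(120, 135, 128, 110))

# Check 1: is the ID variable unique within each dataset?
dup_patients <- patients %>% count(id) %>% filter(n > 1)  # no rows expected
dup_bloods   <- bloods %>% count(id) %>% filter(n > 1)    # id 2 appears twice

# Check 2: which IDs in one dataset have no match in the other?
unmatched_bloods <- anti_join(bloods, patients, by = "id")

# Merge, then compare row counts: extra rows signal duplicated observations
merged <- left_join(patients, bloods, by = "id")
rows_added <- nrow(merged) - nrow(patients)
```

Here `rows_added` is greater than zero because the duplicated ID in `bloods` makes one patient appear twice after the merge. In dplyr 1.1.0 and later, passing `relationship = "one-to-one"` to `left_join()` turns this situation into an error rather than a silent duplication.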
The section on restructuring/pivoting datasets (and creating summary statistics) is useful for clustered data, especially when there are repeated measurements over time for each person, or measurements taken on each eye or tooth (clustered within person). Other types of clustered data arise where people are clustered within families, within regions, by school attended, or by the doctor’s list they are registered on. Even seasoned R users are impressed by the simplicity of the approach that I teach, based on days of effort to co-produce easy-to-modify code.
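To make the two formats concrete, here is a minimal sketch using tidyr's `pivot_wider()` and `pivot_longer()`. It is not necessarily the method taught in the pivoting chapter; the data frame `long` and the column names `id`, `visit` and `hb` are hypothetical.

```r
library(tidyr)

# Hypothetical long-format data: many rows per person (one per visit)
long <- data.frame(
  id    = c(1, 1, 2, 2),
  visit = c(1, 2, 1, 2),
  hb    = c(120, 125, 110, 118)
)

# Long to wide: one row per person, one hb column per visit
wide <- pivot_wider(
  long,
  names_from   = visit,
  values_from  = hb,
  names_prefix = "hb_visit"
)

# Wide back to long: one row per person per visit
# (visit comes back as text, because it is rebuilt from column names)
long_again <- pivot_longer(
  wide,
  cols         = starts_with("hb_visit"),
  names_to     = "visit",
  names_prefix = "hb_visit",
  values_to    = "hb"
)
```

The wide version has the columns `id`, `hb_visit1` and `hb_visit2`. The long version is usually what mixed models and ggplot2 expect, while the wide version suits paired comparisons and tables with one row per person.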
This book is not intended to teach absolute beginners. Other online resources are available for that purpose. However, if you have learnt a little but aren’t confident you’ve got some core principles of how R works, the following may be useful: Teaching R “assign” fundamentals: YouTube. The Appendix also gives some core principles of R, useful for beginners.
If you struggle to write directory names (and hence to read in datasets), then Chapter @ref(ch3_4) gives a method to read in datasets that avoids typing directory names, and instead allows you to copy and paste the directory name.
The tidyverse suite of packages for data management is of consistently high quality; the packages make preparing data easier and work well together. The tidyverse is also valuable for tabulating data and for producing flexible, publication-quality graphs. You need to install it to use many of the commands in this book, as follows:
install.packages("tidyverse") # downloads tidyverse onto computer – do this only once
library(tidyverse) # makes the relevant packages available – required each time R/RStudio is opened
When you load the tidyverse, the following packages are loaded, among others: ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats. For more information on the packages that tidyverse loads, see https://www.tidyverse.org/packages/.
For any internet searches for data management in R, it is worthwhile including “tidyverse” as a search term.
Experienced R coders, who can already prepare data for analysis, may find the following useful:
Accomplished R users may find that the chapter on reshaping/restructuring clustered datasets between long and wide format makes this task far easier. Substantial effort went into developing a method that suits recipe-book style instructions.
Within each section, there are tips and hints on easy ways of doing things, such as using lubridate for dates.
There is information on overcoming many different ways in which data can be messy.
The section on merging datasets highlights key considerations and essential checks in the process.
Creating data sets of summary statistics is additionally covered.
Pointing those you supervise (students and staff) towards these resources may save substantial supervisor time too.
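As a small taste of two items in the list above (lubridate for dates, and datasets of summary statistics), the following sketch parses day/month/year text dates and then builds a one-row-per-person summary dataset with dplyr. The data frame `visits` and its columns are hypothetical, and this is an illustration rather than the code used in the chapters.

```r
library(dplyr)
library(lubridate)

# Hypothetical repeated-measures data with dates stored as text
visits <- data.frame(
  id         = c(1, 1, 2, 2, 2),
  visit_date = c("03/01/2023", "17/02/2023", "05/01/2023",
                 "20/02/2023", "01/04/2023"),
  hb         = c(120, 125, 110, 118, 121)
)

per_person <- visits %>%
  mutate(visit_date = dmy(visit_date)) %>%  # dmy() reads day/month/year text
  group_by(id) %>%
  summarise(
    n_visits    = n(),              # number of rows (visits) per person
    mean_hb     = mean(hb),         # mean haemoglobin per person
    first_visit = min(visit_date),  # earliest visit date per person
    .groups     = "drop"
  )
```

The result, `per_person`, has one row per `id` and can be merged back onto the original data or analysed in its own right.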
Data Wrangling Recipes in R: Hilary Watt. PCPH, Imperial College London. Thanks to Tristan Naidoo for converting to bookdown format.
For corrections, suggestions for improvements: https://imperial.eu.qualtrics.com/jfe/form/SV_1TfB4mJvYNlj8x0