Preface

WHY? Data Wrangling Recipes in R

To save time. Many students and staff spend days and weeks wrestling with R code to prepare data for analysis. This resource is a valuable addition to conventional resources that are structured around function types. It illustrates the workflow, highlighting which functions are important to know and how to piece them together. Example coding is given to overcome many ways in which data can be messy, making practical tasks far easier to deliver.

One PhD supervisor spent ages before they realised that the lack of an appropriate ID variable was the reason why their student couldn’t merge their datasets; too obvious for them to consider. Yet this oversight is understandable from the student perspective, with the overwhelming need to pay attention to so many things when learning R. Hence the value of the list of preparation required prior to merging data for analysis, with instructions on how to undertake each step.

A passionate enthusiasm and many days of effort went into providing easily modifiable ways to pivot data, applicable to a wide variety of different examples. Clustered datasets can thus be changed between having many rows per cluster, and one row per cluster with different versions of each variable for each cluster member or repeated measurement. Both formats are useful for different analyses, tables and graphs; being able to readily switch between formats is a massive advantage.

Yet BUGS are a part of life. Frustration and error messages are an inevitable part of coding. There is no escape from this. Learning to code in R implies learning strategies to avoid and overcome errors.

Feedback

“Exceptional and very useful. some of the commands and tricks I learned today could save me a few days of work each” Abrar Alturkistani, PhD student, School of Public Health.

“Really helpful practical tips for those annoying things that often aren’t covered in other courses I have attended” Dr David Salmon, post-doc, School of Public Health.

“Fabulous . really practical guidance. makes the hard work of restructuring and preparing data significantly easier … I haven’t seen this focused content presented in one place in such an easy to access form before.” Dr Thomas Woodcock, senior research fellow, School of Public Health, module lead for quantitative improvements in health.

“Congratulations on the handbook, it looks very thorough, and I am sure it will be a fantastic resource” Bethan Cracknell-Daniels, PhD student who was part of team helping G/MPH/MSc students deliver their dissertations in R.

Acknowledgements

I am grateful to Tristan Naidoo for converting this handbook from Word into bookdown format, and for checking some of the code. Any remaining errors are my own. Chapters 9, 11 and 12 are co-authored with Tristan Naidoo. I am grateful to Prof Victoria Cornelius of the ISTDA module for inspiring the anaemia dataset that is used in several chapters of this book; the actual dataset used here is my own creation. I am grateful to many students in the department of public health, Imperial College London. Teaching them data management, in both R and Stata, I learnt many lessons that enabled me to improve my teaching approach.

Authorship

Hilary Watt has a passion for finding innovative language and images to teach conceptual understanding of statistics. She ties these in with the agenda to steer away from common poor standards of statistical interpretation in research literature. Along with Dr Renata Medeiros Mirra, she leads research into language that medical statistics educators use in practice, with fascinating insights into the merits of the spectrum of different ways that educators word these interpretations. She leads statistics masterclasses (teaching R and Stata software and statistical interpretation), open to all staff and post-graduate (taught and doctoral) students in the School of Public Health, Imperial College London. She co-supervises four PhD/fellowship students. She is statistical co-applicant on a randomised controlled drug trial in Alzheimer's disease, assessing whether attention-enhancing drugs are a useful supplement to current memory-enhancing drugs for symptomatic improvement. As NIHR RDS adviser, she has successfully supported several teams to gain research funding and fellowships, including two fellowships at professorial level. Hilary is a member of PCPH, School of Public Health, Imperial College London.

For more details, see: https://www.imperial.ac.uk/people/h.watt

For teaching videos and Science comedy videos, see: https://www.youtube.com/channel/UCzFWQHuAzQiFtAl5cNKCqQA

For her teaching publication in Int J Epidemiology, see: https://pubmed.ncbi.nlm.nih.gov/32710113/

Tristan Naidoo is a Statistician who completed his undergraduate degree and his MSc in South Africa before moving to Imperial College London to pursue his PhD. He is interested in the intersection of Natural Language Processing and Public Health.

The authors assert their moral right to be identified as the authors of this book. All rights reserved.

Introduction

Data Wrangling Recipes in R: Practical demonstrations of easily-modifiable code sequences that prepare messy data for analysis.

This book is designed to demonstrate what is required when preparing data for analysis, with modifiable code so that you can readily achieve each step. For any project, work through from Chapters 1 to 9, with the index and headings intended to make it easy to find what is relevant for your project. If you are lucky, only a few subsections of each chapter may be relevant. Chapters demonstrate how to overcome many different ways in which data can be messy.

If you need to combine datasets, there is a section on merging in R. Unlike many other resources, it sets out the important considerations and choices when merging datasets. A checklist of the preparation your data needs before merging is given, with instructions for each step, so that you can readily achieve results. There are also checks to make sure that merging in R does not accidentally duplicate observations.
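As a rough illustration of the kind of pre-merge checks described above, here is a minimal sketch using dplyr. The datasets `patients` and `bloods`, and the ID variable `id`, are hypothetical and invented for this example; the checklist in the merging section itself may differ in detail.

```r
library(dplyr)

# Hypothetical example data (not from the book): patient details and blood results
patients <- data.frame(id = c(1, 2, 3), age = c(34, 51, 47))
bloods   <- data.frame(id = c(1, 2, 2, 4), hb = c(11.2, 13.5, 13.1, 12.0))

# Check 1: does any ID appear more than once in either dataset?
patients_dups <- patients %>% count(id) %>% filter(n > 1)
bloods_dups   <- bloods %>% count(id) %>% filter(n > 1)
patients_dups   # no rows: id is unique in patients
bloods_dups     # one row: id 2 appears twice in bloods

# Check 2: which rows have no match in the other dataset?
unmatched_patients <- anti_join(patients, bloods, by = "id")   # id 3
unmatched_bloods   <- anti_join(bloods, patients, by = "id")   # id 4

# Merge, then compare row counts to spot accidental duplication
merged <- left_join(patients, bloods, by = "id")
nrow(patients)   # 3
nrow(merged)     # 4: patient 2 now has two rows because of the repeated id in bloods
```

If the merged dataset has more rows than expected, the duplicated IDs found in the first check are the usual place to look.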

The section on restructuring/pivoting datasets (and creating summary statistics) is useful for clustered data, especially when there are repeated measurements over time for each person or measurements taken on each eye or tooth (clustered within person). Other types of clustered data are where people are clustered within families, within regions or according to school attended or whose doctor’s list people are on. Even seasoned R users are impressed by the simplicity of the approach that I teach, based on days of effort to co-produce easy-to-modify code.
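To give a flavour of switching between formats, the following is a minimal sketch using tidyr and dplyr. The dataset `long` and its variables (`id`, `visit`, `hb`) are hypothetical, and this is not necessarily the exact approach taught in the pivoting chapter.

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format data: one row per person per visit
long <- data.frame(
  id    = c(1, 1, 2, 2),
  visit = c(1, 2, 1, 2),
  hb    = c(11.2, 11.8, 13.5, 13.1)
)

# Long to wide: one row per person, one hb column per visit (hb_1, hb_2)
wide <- long %>%
  pivot_wider(names_from = visit, values_from = hb, names_prefix = "hb_")

# Wide back to long: one row per person per visit
long_again <- wide %>%
  pivot_longer(
    cols         = starts_with("hb_"),
    names_to     = "visit",
    names_prefix = "hb_",
    values_to    = "hb"
  ) %>%
  mutate(visit = as.numeric(visit))   # visit comes back as text, so convert it

# A dataset of summary statistics: one row per person
summary_by_id <- long %>%
  group_by(id) %>%
  summarise(mean_hb = mean(hb), n_visits = n())
```

The wide format suits comparisons between visits within a person; the long format suits most graphs and models for repeated measurements.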

This book is not intended to teach absolute beginners. Other online resources are available for that purpose. However, if you have learnt a little but aren’t confident you’ve got some core principles of how R works, the following may be useful: Teaching R “assign” fundamentals: YouTube. The Appendix also gives some core principles of R, useful for beginners.

If you struggle to write directory names (and hence to read in datasets), then Chapter @ref(ch3_4) gives a method to read in datasets that avoids typing directory names, and instead allows you to copy and paste the directory name.
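One possible way to paste a directory name without retyping it is sketched below, using only base R. This is an assumption for illustration and may not be the method given in that chapter. The folder and file names are hypothetical, and raw strings of the form r"(...)" require R version 4.0.0 or later.

```r
# Paste a Windows folder path exactly as copied, inside r"( ... )".
# The raw string means the backslashes do not need doubling or changing by hand.
data_dir <- r"(C:\Users\me\Documents\project)"   # hypothetical folder

# Optionally convert backslashes to forward slashes, which R accepts on all systems
data_dir <- chartr("\\", "/", data_dir)

# Build the full file name from the folder and a (hypothetical) file name
data_file <- file.path(data_dir, "anaemia.csv")
data_file

# The dataset could then be read in with, for example:
# anaemia <- read.csv(data_file)
```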

The tidyverse suite of packages for data management is of consistently high quality; the packages make preparing data easier and work well together. The tidyverse is also valuable for tabulating data and for producing highly flexible, publication-quality graphs. You need to install it to use many of the commands in this book, as follows:

install.packages("tidyverse")     # downloads tidyverse onto computer (do this only once)

library(tidyverse)      # makes the relevant packages available – required each time R/ RStudio is opened

When you load the tidyverse, its core packages are loaded, including ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats. Many more packages are installed alongside them, ready to be loaded individually when needed. For more information on the packages in the tidyverse, see https://www.tidyverse.org/packages/.

For any internet searches for data management in R, it is worthwhile including “tidyverse” as a search term.

Experienced R coders, who can already prepare data for analysis, may find the following useful:

  • Accomplished R users may find that the chapter on reshaping/ restructuring clustered data sets between long and wide format makes this task far easier. Substantial effort went into creating/ focusing on a method that aligns with recipe-book style instructions.

  • Within each section, there are tips and hints on easy ways of doing things, such as using lubridate for dates.

  • There is information on overcoming many different ways in which data can be messy.

  • The section on merging datasets highlights key considerations and essential checks in the process.

  • Creating data sets of summary statistics is additionally covered.

  • Pointing those you supervise (students and staff) towards these resources may save substantial supervisor time too.
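As an example of the kind of tip mentioned in the list above, here is a minimal sketch of using lubridate for dates. The date values and variable names are hypothetical, invented for this illustration.

```r
library(lubridate)

# Hypothetical dates stored as text in day/month/year order
raw_dates <- c("03/02/2021", "15/11/2020")

# dmy() converts day-month-year text into proper dates
visit_date <- dmy(raw_dates)
visit_date          # "2021-02-03" "2020-11-15"

# Extract components of each date
year(visit_date)    # 2021 2020
month(visit_date)   # 2 11

# Number of days between the two visits
days_between <- as.numeric(difftime(visit_date[1], visit_date[2], units = "days"))
days_between        # 80
```

lubridate also provides ymd() and mdy() for text dates written in other orders.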


Data Wrangling Recipes in R: Hilary Watt. PCPH, Imperial College London. Thanks to Tristan Naidoo for converting to bookdown format.

For corrections, suggestions for improvements: https://imperial.eu.qualtrics.com/jfe/form/SV_1TfB4mJvYNlj8x0