Chapter 11 Data Wrangling (emphasis on tidy data)

If 80% of the data scientist’s job is data cleaning, perhaps that is the job. (Source: Anon.)

Cleaning data 🧼🧽 pic.twitter.com/MMCJkTYmgL
— Chelsea Parlett-Pelleriti (@ChelseaParlett) January 26, 2020

https://twitter.com/ChelseaParlett/status/1221251025983565824?s=20

11.1 Introduction

Data is rarely in a condition to be used as is; there’s invariably something amiss. Data wrangling (a.k.a. data carpentry) is the process of getting it ready for analysis.

All too often, data is messy: there are rows with no contents, colour-coded cells, and inconsistent values.

One important way that data can be cleaned is to ensure that the structure is tidy. What do we mean by tidy data?

There are three interrelated rules which make a dataset tidy:

* Each variable must have its own column.
* Each observation must have its own row.
* Each value must have its own cell.
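As a minimal sketch of these rules, assuming the {tidyr} package is installed: the hypothetical `untidy` table below hides one variable (the year) in its column names, so each row holds two observations. `pivot_longer()` reshapes it so that each variable has a column and each observation has a row. The table name, column names, and values are invented for illustration.

```r
library(tidyr)

# Hypothetical data: the year variable is stored in the column names,
# so each row holds two observations.
untidy <- data.frame(
  country = c("A", "B"),
  `2019` = c(10, 20),
  `2020` = c(11, 22),
  check.names = FALSE
)

# Reshape to one row per country-year observation,
# with one column per variable (country, year, cases).
tidy <- pivot_longer(
  untidy,
  cols = c(`2019`, `2020`),
  names_to = "year",
  values_to = "cases"
)

print(tidy)
```

Note that `year` comes back as a character column here, because it was built from column names; converting it to a number would be a separate step.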

Why ensure that your data is tidy? There are two main advantages:

There’s a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity.

There’s a specific advantage to placing variables in columns because it allows R’s vectorised nature to shine. As you learned in mutate and summary functions, most built-in R functions work with vectors of values. That makes transforming tidy data feel particularly natural.

(From Wickham and Grolemund 2016.)
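The second advantage can be illustrated with a short sketch, assuming {dplyr} (version 1.0.0 or later, for the `.groups` argument) is installed. The `heights` data frame and its values are hypothetical; the point is only that with one variable per column, `mutate()` and `summarise()` can operate on whole column vectors.

```r
library(dplyr)

# Hypothetical tidy data: one row per observation, one column per variable.
heights <- data.frame(
  person = c("a", "b", "c", "d"),
  group = c("x", "x", "y", "y"),
  height_cm = c(150, 160, 170, 180)
)

# mutate() applies a vectorised operation to a whole column at once.
heights <- mutate(heights, height_m = height_cm / 100)

# summarise() collapses each group's column vector to a single value.
by_group <- heights %>%
  group_by(group) %>%
  summarise(mean_height_m = mean(height_m), .groups = "drop")

print(by_group)
```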

Tidying the structure won’t fix problems like inconsistent values and colour-coded cells, but it will resolve some other kinds of messiness.

For more about the principles of tidy data, see Hadley Wickham’s article “Tidy data” in the Journal of Statistical Software (Wickham 2014).

11.1.1 Other tidyverse references

Karl Broman and Kara Woo, “Data organization in spreadsheets” (github page with source manuscript) – application of tidy principles to spreadsheets.

See also Karl Broman’s tutorial, [“Data organization: organizing data in spreadsheets”](http://kbroman.org/dataorg/)

Bruno Rodrigues, Modern R with the tidyverse

Jesse Sadler, [Excel vs R: A Brief Introduction to R (With examples using dplyr and ggplot)](https://www.jessesadler.com/post/excel-vs-r/) (2017-10-02)

11.2 Theory and methods

Stat 545: Data wrangling, exploration, and analysis with R – course materials associated with the University of British Columbia’s Statistics 545 course. Prepared in large part by Dr. Jenny Bryan.

11.2.1 Joins

the steps in a left join pic.twitter.com/YSFAdxFl1D
— 🔎Julia Evans🔍 (@b0rk) October 16, 2019
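A left join can be sketched in a few lines, assuming {dplyr} is installed. The `orders` and `customers` tables are hypothetical: every row of the left table is kept, matching columns from the right table are attached by key, and rows with no match get `NA`.

```r
library(dplyr)

# Hypothetical left table: every one of these rows is kept.
orders <- data.frame(
  order_id = 1:3,
  customer_id = c(10, 11, 12)
)

# Hypothetical right table: there is no row for customer 12.
customers <- data.frame(
  customer_id = c(10, 11),
  name = c("Ada", "Grace")
)

# Match on customer_id; unmatched left rows get NA in the new columns.
joined <- left_join(orders, customers, by = "customer_id")

print(joined)
```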

11.3 Tools

11.3.1 `{datapasta}`

Vignette: How to Datapasta

11.3.2 `{janitor}`

11.3.3 `{vtable}`

CRAN page: vtable: Variable Table – Automatically generates HTML variable documentation including variable names, labels, classes, value labels (if applicable), value ranges, and summary statistics.

Reference page

Twitter thread by Nick Huntington-Klein, 2019-03-24

11.4 The tidyverse

The tidyverse:

The tidyverse R packages on github

11.4.1 `{dplyr}`

package

CRAN: dplyr: A Grammar of Data Manipulation

github: hadley/dplyr

articles

Introduction to dplyr, part of the UBC STAT545: Data wrangling, exploration, and analysis with R course materials
Gary Hutson, 2018-05-24, DPLYR: A Beginners Guide

Isabella R. Ghement, 2019-07-18, group_split() function – twitter thread with a short example

Romain François, 2019-07-10, n() cool #dplyr things, presentation at useR! 2019, Toulouse

Garrick Aden-Buie, Tidy Animated Verbs – “Animations of tidyverse verbs using R, the tidyverse, and gganimate.” Good visual demonstrations of the various types of joins.

GitHub repo for tidyexplain

SQL Joins Explained

11.4.2 `{forcats}`

reference page

Working with factors

Be the boss of your factors

Emily Robinson, Categorical data in the tidyverse {link to DataCamp course removed}

11.4.3 `{purrr}`

reference page

CRAN: purrr: Functional Programming Tools

tutorials

Jenny Bryan, purrr tutorial

- Includes a section on the remarkable list columns: a data frame column that is a list (a general vector) rather than an atomic vector, so each cell can hold an arbitrary R object.
- See also the related “Putting square pegs in round holes: using list-cols in your dataframe”.

“Iteration” in R for Data Science (Wickham and Grolemund 2016)

Emorie D Beck, Intro to purrr

Sharon Machlis, R Tip: Access nested list items with purrr {video}

A purrr tutorial – Cascadia-R, 2017-06-03

Charlotte Wickham, purrr tutorial – GitHub
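The list columns described in Jenny Bryan’s tutorial above can be sketched as follows, assuming {tibble} and {purrr} are installed. The `df` table and its values are hypothetical; `map_dbl()` and `map_int()` iterate over the list column and return ordinary atomic vectors.

```r
library(tibble)
library(purrr)

# A list column: each cell of `scores` holds a whole numeric vector,
# and the vectors may have different lengths.
df <- tibble(
  student = c("a", "b"),
  scores = list(c(90, 80), c(70, 75, 80))
)

# Iterate over the list column; each call returns one value per row.
df$mean_score <- map_dbl(df$scores, mean)
df$n_scores <- map_int(df$scores, length)

print(df)
```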

11.4.4 More about tidy data

Hadley Wickham & Garrett Grolemund, R for Data Science
Hadley Wickham
- Tidy data and tidy tools (video of presentation, December 2011)
Garrett Grolemund
- Data Tidying (part of Data Science with R)
Chester Ismay and Ted Laderas, A gRadual-intRoduction to the tidyverse

11.5 Working with dates

Updated Turing Test concept:
A spreadsheet of dates, hand-entered by interns more than a decade ago, featuring such well-known time formats as "1996ish", "1941/xd01944", "1955?" and "WWII."
I'm not worried about AI until someone shows me the algorithm that can make sense of this. pic.twitter.com/IhzofigX2b
— Brooke Watson (@brookLYNevery1) January 19, 2018

11.5.1 `{lubridate}`

Do more with dates and times in R

tidyverse page
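A minimal sketch of date parsing, assuming {lubridate} is installed. The helper functions are named for the order of the components in the input, and the input strings here are chosen for illustration. Free-text entries such as the “WWII” in the tweet above are not guessed at; they come back as `NA` and need to be handled by hand.

```r
library(lubridate)

# Each helper names the order of year, month, and day in the input.
d1 <- ymd("2020-01-26")
d2 <- dmy("26/01/2020")
d3 <- mdy("01/26/2020")

# All three strings describe the same day.
print(c(d1, d2, d3))

# A free-text entry cannot be parsed and yields NA;
# quiet = TRUE suppresses the parse-failure warning.
d4 <- ymd("WWII", quiet = TRUE)
print(d4)
```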

11.5.2 `{anytime}`

{anytime}: Convert Any Input to Parsed Date or Datetime – vignette

11.5.3 Tidy evaluation

programming with {dplyr}

Edwin Thoen, 2017-08-25, Tidy evaluation, most common actions
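One common tidy evaluation pattern, sketched under the assumption that {dplyr} 1.0.0 or later is installed: the embrace operator `{{ }}` lets a function pass unquoted column names from its caller through to dplyr verbs. The helper `mean_by()` and the `sales` data are hypothetical, written only for this illustration.

```r
library(dplyr)

# Hypothetical helper: mean of one column within the groups of another.
# {{ }} forwards the caller's unquoted column names into the dplyr verbs.
mean_by <- function(data, group_col, value_col) {
  data %>%
    group_by({{ group_col }}) %>%
    summarise(mean_value = mean({{ value_col }}), .groups = "drop")
}

# Hypothetical data.
sales <- data.frame(
  region = c("east", "east", "west"),
  amount = c(10, 20, 40)
)

# Column names are passed bare, just as in a direct dplyr call.
result <- mean_by(sales, region, amount)

print(result)
```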

11.5.4 Tidy text

If you’re going to undertake text mining and natural language processing, your text (i.e. your data) needs to be tidy. Fortunately, there’s an R package for that: tidytext.
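As a minimal sketch of what “tidy” means for text, assuming {tidytext} and {tibble} are installed: `unnest_tokens()` reshapes a column of text into one token per row. The `docs` table and its sentences are hypothetical.

```r
library(tibble)
library(tidytext)

# Hypothetical documents: one row per document.
docs <- tibble(
  doc_id = c(1, 2),
  text = c("Tidy data is happy data.", "Messy data is sad.")
)

# One row per word; by default tokens are lower-cased
# and punctuation is dropped.
tokens <- unnest_tokens(docs, word, text)

print(tokens)
```

The result keeps `doc_id` alongside each word, so the usual grouping and counting tools apply to it directly.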

See the companion chapter on the topics of Text Analysis and Text Mining.

-30-

References

Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10): 1–23. https://doi.org/10.18637/jss.v059.i10.

Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science. O’Reilly Media. https://r4ds.had.co.nz/.