Chapter 11 Data Wrangling (emphasis on tidy data)

If 80% of the data scientist’s job is data cleaning, perhaps that is the job. (Source: Anon.)

https://twitter.com/ChelseaParlett/status/1221251025983565824?s=20

11.1 Introduction

Data is rarely in a condition to be used as-is; there’s invariably something amiss. Data wrangling (a.k.a. data carpentry) is the process of getting the data ready for analysis.

And all too often, data are messy. There are rows with no contents, colour-coded cells, and inconsistent values.

One important way that data can be cleaned is to ensure that the structure is tidy. What do we mean by tidy data?

There are three interrelated rules which make a dataset tidy:

  • Each variable must have its own column.

  • Each observation must have its own row.

  • Each value must have its own cell.
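As a minimal sketch of these rules in practice, the example below reshapes a small made-up table in which one variable (the year) is stored in the column names rather than in a column of its own. The values and the names `country`, `year` and `cases` are invented for illustration; `pivot_longer()` comes from the {tidyr} package.

```r
library(tidyr)

# Hypothetical untidy table: the years are stored in the column names,
# so the variable "year" does not have its own column.
untidy <- data.frame(
  country = c("A", "B"),
  `2019` = c(10, 20),
  `2020` = c(15, 25),
  check.names = FALSE
)

# Gather the year columns so that each row is one country-year observation.
tidy <- pivot_longer(
  untidy,
  cols = c(`2019`, `2020`),
  names_to = "year",
  values_to = "cases"
)

tidy
```

After reshaping, each of the three variables has its own column and each country-year combination has its own row.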

Why ensure that your data is tidy? There are two main advantages:

  1. There’s a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity.

  2. There’s a specific advantage to placing variables in columns because it allows R’s vectorised nature to shine. As you learned in mutate and summary functions, most built-in R functions work with vectors of values. That makes transforming tidy data feel particularly natural.

(From Wickham and Grolemund 2016.)
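The second advantage can be illustrated with a short sketch. It assumes a made-up tidy table (the column names and values are hypothetical) and uses `mutate()`, `group_by()` and `summarise()` from {dplyr}; because each variable is a column, each operation works on a whole vector at once.

```r
library(dplyr)

# Hypothetical tidy table: one row per country-year observation.
tidy <- data.frame(
  country    = c("A", "A", "B", "B"),
  year       = c(2019, 2020, 2019, 2020),
  cases      = c(10, 15, 20, 25),
  population = c(1000, 1000, 4000, 4000)
)

# mutate() applies the division to whole column vectors at once.
rates <- mutate(tidy, rate = cases / population)

# summarise() reduces each group's column vector to a single value.
totals <- rates %>%
  group_by(country) %>%
  summarise(total_cases = sum(cases))
```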

Tidying the structure won’t fix problems such as inconsistent values and colour-coded cells, but it will resolve some other messiness.

For more about the principles of tidy data, see Hadley Wickham’s article “Tidy data”, in the Journal of Statistical Software (Wickham 2014).

11.1.1 Other tidyverse references

Karl Broman and Kara Woo, “Data organization in spreadsheets” (github page with source manuscript) – application of tidy principles to spreadsheets.

  • see also Karl Broman’s tutorial, “Data organization: organizing data in spreadsheets” (http://kbroman.org/dataorg/)

Bruno Rodrigues, Modern R with the tidyverse

Jesse Sadler, “Excel vs R: A Brief Introduction to R (With examples using dplyr and ggplot)” (https://www.jessesadler.com/post/excel-vs-r/), 2017-10-02

11.2 Theory and methods

Stat 545: Data wrangling, exploration, and analysis with R – course materials associated with the University of British Columbia’s Statistics 545 course. Prepared in large part by Dr. Jenny Bryan.

11.2.1 Joins


11.3 Tools

11.3.1 {datapasta}

Vignette: How to Datapasta

11.3.2 {janitor}

11.3.3 {vtable}

CRAN page: vtable: Variable Table – Automatically generates HTML variable documentation including variable names, labels, classes, value labels (if applicable), value ranges, and summary statistics.

Reference page

Twitter thread by Nick Huntington-Klein, 2019-03-24


11.4 The tidyverse

The tidyverse:

The tidyverse R packages on github

11.4.1 {dplyr}

package

CRAN: dplyr: A Grammar of Data Manipulation

github: hadley/dplyr

articles

Isabella R. Ghement, 2019-07-18, group_split() function – twitter thread with a short example

Romain François, 2019-07-10, n() cool #dplyr things, presentation at useR! 2019, Toulouse

Garrick Aden-Buie, Tidy Animated Verbs – “Animations of tidyverse verbs using R, the tidyverse, and gganimate.” Good visual demonstrations of the various types of joins.

SQL Joins Explained
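To complement the join resources above, here is a minimal sketch of three common {dplyr} joins. The two tables and their columns (`id`, `name`, `score`) are made up for illustration.

```r
library(dplyr)

# Two hypothetical tables that share the key column "id".
people <- data.frame(id = c(1, 2, 3), name = c("Ana", "Ben", "Cy"))
scores <- data.frame(id = c(2, 3, 4), score = c(80, 90, 70))

# inner_join(): only ids present in both tables (2 and 3).
both <- inner_join(people, scores, by = "id")

# left_join(): every row of people; score is NA where there is no match.
all_people <- left_join(people, scores, by = "id")

# anti_join(): rows of people with no match in scores (id 1).
unmatched <- anti_join(people, scores, by = "id")
```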

11.4.2 {forcats}

reference page

Working with factors

Be the boss of your factors

Emily Robinson, Categorical data in the tidyverse {link to DataCamp course removed}

11.4.3 {purrr}

reference page

CRAN: purrr: Functional Programming Tools

tutorials

Jenny Bryan, purrr tutorial

“Iteration” in R for Data Science (Wickham and Grolemund 2016)

Emorie D Beck, Intro to purrr

Sharon Machlis, R Tip: Access nested list items with purrr {video}

A purrr tutorial – Cascadia-R, 2017-06-03

Charlotte Wickham, purrr tutorial – github
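For a flavour of what these tutorials cover, the sketch below uses {purrr} to iterate over a small nested list and to reach into it. The list contents are hypothetical; `map_chr()`, `map_int()` and `pluck()` are {purrr} functions.

```r
library(purrr)

# Hypothetical nested list, e.g. as parsed from JSON.
people <- list(
  list(name = "Ana", langs = list("R", "SQL")),
  list(name = "Ben", langs = list("Python"))
)

# map_chr() extracts the "name" element from each item as a character vector.
names_vec <- map_chr(people, "name")

# map_int() applies a function to each item and returns an integer vector:
# here, the number of languages listed for each person.
n_langs <- map_int(people, function(p) length(p$langs))

# pluck() reaches into a nested position: first person's second language.
second_lang <- pluck(people, 1, "langs", 2)
```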


11.4.4 More about tidy data

11.5 Working with dates

11.5.3 Tidy evaluation

  • programming with {dplyr}

Edwin Thoen, 2017-08-25, Tidy evaluation, most common actions
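As a minimal sketch of programming with {dplyr}, the function below accepts bare column names and forwards them to dplyr verbs with the embrace operator `{{ }}`. The helper name `mean_by` and its arguments are hypothetical, and the embrace operator is a later shorthand than the `enquo()` and `!!` pattern that older write-ups describe. `mtcars` is a dataset that ships with R.

```r
library(dplyr)

# Hypothetical helper: the caller passes bare column names, and {{ }}
# forwards them into the dplyr verbs.
mean_by <- function(data, group_col, value_col) {
  data %>%
    group_by({{ group_col }}) %>%
    summarise(mean_value = mean({{ value_col }}), .groups = "drop")
}

# Mean miles per gallon for each number of cylinders.
res <- mean_by(mtcars, cyl, mpg)
```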

11.5.4 Tidy text

If you’re going to undertake text mining and natural language processing, your text (i.e. your data) needs to be tidy. Fortunately, there’s an R package for that: tidytext.
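A minimal sketch of what tidy text looks like, assuming a made-up two-line corpus: `unnest_tokens()` from {tidytext} turns a column of text into one token (here, one word) per row.

```r
library(tidytext)

# Hypothetical two-line corpus; one row per line of text.
lines <- data.frame(
  line = 1:2,
  text = c("Tidy data is easy to work with.", "Messy data is not."),
  stringsAsFactors = FALSE
)

# unnest_tokens() splits the text column into one lowercase word per row,
# dropping punctuation and keeping the other columns (here, line).
words <- unnest_tokens(lines, output = word, input = text)
```

The result has one row per word, so the usual {dplyr} verbs can be applied to it directly.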

See the companion chapter on the topics of Text Analysis and Text Mining.

-30-

References

Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10): 1–23. https://doi.org/10.18637/jss.v059.i10.
Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science. O’Reilly Media. https://r4ds.had.co.nz/.