Chapter 11 Data Wrangling (emphasis on tidy data)
If 80% of the data scientist’s job is data cleaning, perhaps that is the job. (Source: Anon.)
Cleaning data 🧼🧽 pic.twitter.com/MMCJkTYmgL
— Chelsea Parlett-Pelleriti (@ChelseaParlett) January 26, 2020
https://twitter.com/ChelseaParlett/status/1221251025983565824?s=20
11.1 Introduction
Data is rarely in a condition to be used as it arrives; there’s invariably something amiss. Data wrangling (a.k.a. data carpentry) is the process of getting it ready for analysis.
And all too often, data are messy. There are rows with no contents, colour-coded cells, and inconsistent values.
One important way that data can be cleaned is to ensure that the structure is tidy. What do we mean by tidy data?
There are three interrelated rules which make a dataset tidy:
- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.
Why ensure that your data is tidy? There are two main advantages:
There’s a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity.
There’s a specific advantage to placing variables in columns because it allows R’s vectorised nature to shine. As you learned in mutate and summary functions, most built-in R functions work with vectors of values. That makes transforming tidy data feel particularly natural.
(from (Wickham and Grolemund 2016))
This won’t solve things like inconsistent values and colour-coded cells, but it will solve some other messiness.
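As a minimal sketch of the three rules, the example below reshapes a small "wide" table into tidy form with `pivot_longer()` from {tidyr}. The data frame `untidy` and its column names are made up for illustration; they are not from the sources cited above.

```r
library(tidyr)

# Hypothetical untidy table: the year is a variable,
# but its values are spread across the column headers.
untidy <- data.frame(
  country = c("A", "B"),
  `2019` = c(10, 20),
  `2020` = c(11, 22),
  check.names = FALSE
)

# Gather the year columns so that each variable (country, year, cases)
# has its own column and each country-year observation has its own row.
tidy <- pivot_longer(
  untidy,
  cols = -country,
  names_to = "year",
  values_to = "cases"
)

# The former column headers arrive as text; store them as integers.
tidy$year <- as.integer(tidy$year)

tidy
```

Once the table is in this shape, `cases` is an ordinary vector, so vectorised functions such as `mean(tidy$cases)` apply directly.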
For more about the principles of tidy data, see Hadley Wickham’s article “Tidy data”, in The Journal of Statistical Software (Wickham 2014)
11.1.1 Other tidyverse references
Karl Broman and Kara Woo, “Data organization in spreadsheets” (github page with source manuscript) – application of tidy principles to spreadsheets.
- see also Karl Broman’s tutorial, “Data organization: organizing data in spreadsheets” (http://kbroman.org/dataorg/)
Bruno Rodrigues, Modern R with the tidyverse
Jesse Sadler, Excel vs R: A Brief Introduction to R (With examples using dplyr and ggplot), https://www.jessesadler.com/post/excel-vs-r/ (2017-10-02)
11.2 Theory and methods
Stat 545: Data wrangling, exploration, and analysis with R – course materials associated with the University of British Columbia’s Statistics 545 course. Prepared in large part by Dr. Jenny Bryan.
11.2.1 Joins
the steps in a left join pic.twitter.com/YSFAdxFl1D
— 🔎Julia Evans🔍 (@b0rk) October 16, 2019
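A left join keeps every row of the left-hand table and fills in columns from the right-hand table wherever the key matches. The sketch below uses `left_join()` from {dplyr}; the tables `orders` and `customers` are hypothetical and exist only to show the behaviour.

```r
library(dplyr)

# Hypothetical tables sharing the key column customer_id
orders <- data.frame(
  order_id = 1:3,
  customer_id = c(10, 20, 30)
)
customers <- data.frame(
  customer_id = c(10, 20),
  name = c("Ada", "Grace")
)

# Every row of orders is kept; customer 30 has no match
# in customers, so its name is filled with NA.
joined <- left_join(orders, customers, by = "customer_id")

joined
```

Naming the key explicitly with `by` avoids relying on the default of joining on all shared column names.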
11.3 Tools
11.3.1 {datapasta}
Vignette: How to Datapasta
11.3.3 {vtable}
CRAN page: vtable: Variable Table – Automatically generates HTML variable documentation including variable names, labels, classes, value labels (if applicable), value ranges, and summary statistics.
Twitter thread by Nick Huntington-Klein, 2019-03-24
11.4 The tidyverse
The tidyverse R packages on github
11.4.1 {dplyr}
package
CRAN: dplyr: A Grammar of Data Manipulation
github: hadley/dplyr
articles
Introduction to dplyr, part of the UBC STAT545: Data wrangling, exploration, and analysis with R course materials
Gary Hutson, 2018-05-24, DPLYR: A Beginners Guide
Isabella R. Ghement, 2019-07-18, group_split() function – twitter thread with a short example
Romain François, 2019-07-10, n() cool #dplyr things, presentation at UseR2019, Toulouse
Garrick Aden-Buie, Tidy Animated Verbs – “Animations of tidyverse verbs using R, the tidyverse, and gganimate.” Good visual demonstrations of the various types of joins.
- GitHub repo for tidyexplain
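For readers who want to try the `group_split()` function mentioned in the list above, here is a minimal sketch. The data frame `scores` is invented for illustration and is not taken from the linked thread.

```r
library(dplyr)

# Hypothetical data: points scored by two teams
scores <- data.frame(
  team = c("a", "b", "a", "b"),
  points = c(1, 2, 3, 4)
)

# group_split() returns a list with one data frame per group,
# ordered by the grouping key.
pieces <- group_split(scores, team)

length(pieces)
pieces[[1]]
```

The resulting list can then be passed to functions such as `lapply()` or the {purrr} `map()` family to process each group separately.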
11.4.2 {forcats}
Working with factors
Emily Robinson, Categorical data in the tidyverse {link to DataCamp course removed}
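As a small, hedged illustration of what {forcats} does with factors, the sketch below reorders factor levels by how often they occur, using `fct_infreq()`. The vector `colours` is a made-up example.

```r
library(forcats)

# Hypothetical categorical variable
colours <- factor(c("red", "blue", "red", "green", "red", "blue"))

# By default, factor levels are sorted alphabetically
levels(colours)

# fct_infreq() reorders the levels from most to least frequent,
# which is often what you want for bar charts and tables.
by_freq <- fct_infreq(colours)
levels(by_freq)
```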
11.4.3 {purrr}
CRAN: purrr: Functional Programming Tools
tutorials
Jenny Bryan, purrr tutorial
including a section on the remarkable list columns, i.e. a column within a dataframe that contains a general vector, that is a list of values, as opposed to an atomic vector.
and here’s the related “Putting square pegs in round holes: using list-cols in your dataframe”
“Iteration” in R for Data Science (Wickham and Grolemund 2016)
Emorie D Beck, Intro to purrr
Sharon Machlis, R Tip: Access nested list items with purrr {video}
A purrr tutorial – Cascadia-R, 2017-06-03
Charlotte Wickham, purrr tutorial – github
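The list columns described in the tutorial list above can be sketched as follows. This is an illustrative example only: the table `people` and its contents are hypothetical, and it uses `tibble()` from {tibble} because tibbles accept list columns directly.

```r
library(tibble)
library(purrr)

# Hypothetical table in which the scores column is a list:
# each cell holds a numeric vector of a different length.
people <- tibble(
  name = c("Ana", "Ben"),
  scores = list(c(90, 85, 77), c(60, 72))
)

# map_int() and map_dbl() apply a function to each list element
# and return an atomic vector, which can be stored as a regular column.
people$n_scores <- map_int(people$scores, length)
people$mean_score <- map_dbl(people$scores, mean)

people
```

The typed variants (`map_int()`, `map_dbl()`) fail loudly if a result is not of the expected type, which is one reason to prefer them over `sapply()`.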
11.4.4 more about tidy data
Hadley Wickham & Garrett Grolemund, R for Data Science
Hadley Wickham
Garrett Grolemund
- Data Tidying (part of Data Science with R)
Chester Ismay and Ted Laderas, A gRadual-intRoduction to the tidyverse
11.5 Working with dates
Updated Turing Test concept:
A spreadsheet of dates, hand-entered by interns more than a decade ago, featuring such well-known time formats as "1996ish", "1941/xd01944", "1955?" and "WWII."
I'm not worried about AI until someone shows me the algorithm that can make sense of this. pic.twitter.com/IhzofigX2b
— Brooke Watson (@brookLYNevery1) January 19, 2018
11.5.2 {anytime}
{anytime}: Convert Any Input to Parsed Date or Datetime – vignette
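A minimal sketch of what {anytime} is for: `anydate()` takes dates written in several common layouts and returns a proper `Date` vector without a format string. The input vector `raw_dates` is a hypothetical example; the formats actually recognised are listed in the package documentation, and entries like those in the tweet above are well beyond its scope.

```r
library(anytime)

# Hypothetical input: the same day written three different ways
raw_dates <- c("2020-01-26", "2020/01/26", "20200126")

# anydate() tries a set of known formats for each element
parsed <- anydate(raw_dates)

class(parsed)
parsed
```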
11.5.3 Tidy evaluation
- programming with {dplyr}
Edwin Thoen, 2017-08-25 Tidy evaluation, most common actions
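As a hedged sketch of programming with {dplyr}, the function below uses the embrace operator `{{ }}`, a more recent idiom than the `quo()` and `!!` style found in older write-ups. It assumes dplyr 1.0.0 or later. The function name `mean_by` and its arguments are hypothetical; `mtcars` is a dataset that ships with R.

```r
library(dplyr)

# Hypothetical helper: group by one column and average another.
# {{ }} passes the bare column names supplied by the caller
# through to the dplyr verbs.
mean_by <- function(data, group_var, value_var) {
  data %>%
    group_by({{ group_var }}) %>%
    summarise(mean_value = mean({{ value_var }}), .groups = "drop")
}

# Average miles per gallon for each number of cylinders
result <- mean_by(mtcars, cyl, mpg)

result
```

Without the embrace operator, `group_by(group_var)` would look for a column literally named `group_var` and fail.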
11.5.4 Tidy text
If you’re going to undertake text mining and natural language processing, your text (i.e. your data) needs to be tidy. Fortunately, there’s an R package for that: tidytext.
See the companion chapter on the topics of Text Analysis and Text Mining.
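To make "tidy text" concrete, here is a minimal sketch using `unnest_tokens()` from {tidytext}, which reshapes a table of documents into one row per word. The data frame `docs` and its sentences are invented for illustration.

```r
library(tidytext)

# Hypothetical corpus: one row per document
docs <- data.frame(
  doc_id = c(1, 2),
  text = c("Tidy data is easy to use.", "Messy data is not."),
  stringsAsFactors = FALSE
)

# unnest_tokens() splits the text column into words, lower-cases them,
# strips punctuation, and returns one row per word per document.
tokens <- unnest_tokens(docs, word, text)

tokens
```

In this one-token-per-row form, word counts and joins against stop-word or sentiment lists become ordinary {dplyr} operations.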
-30-