BSDS - Data Wrangling Process

library(tidyverse)

Introduction

The process of data wrangling often seems very situation dependent and there isn’t a unifying process. This isn’t completely true. The process can be thought of as having four distinct steps.

Step Result Name Description
Import data_raw Get the data into R somehow. The structure of the data is just however it came in.
Tidy data_tidy Restructure the data so that each row is an observation, and each column is a variable.
Clean data Correct variable types, consistent and useful labels, validation and correction.
Use data_small Sub-setting the full data into a smaller set that addresses a particular question.

Data cleaning scripts can be broken into three chunks with a script such as

# Script for reading in Flagstaff Weather Data
Weather_raw <- read_csv('data-raw/Flagstaff_Temp.csv')

Weather_tidy <- Weather_raw %>% ...

Weather <- Weather_tidy %>% ...

In the above script the ... represent a a pipeline of commands to process the data. The clean data is the one that I want to save and then subsequently use in any analysis. A good method for saving data sets in R is creating an image of the cleaned data, which can be done using the save.image() command. To store a cleaned data set, ensure only that data is loaded into the local environment and save the image of the R session. This allows data to be quickly imported with the structure you saved for the analysis.

In many real world examples, the data wrangling work is concentrated in only one or two of the first three steps. Typically you might be able to skip one or more steps, but it can be useful to develop a consistent naming convention such as above so that it is clear where the skipped steps are and what step you are on.

When working in large research groups, it is inevitably only a few people understand the data import and cleaning and all of the other researchers are interested in building models. If the data owner handles the cleaning process and saves it in a format that make it easy to work with, plenty of frustration will be saved.

Import

Obviously getting the data into R is an obvious first step. Often this is a simple step of reading a .csv file, but more complicated scenarios such as messy excel files, importing data tables from a database, or pulling data out of web pages or pdf documents are common. Another case is when the data is spread across a bunch of files (e.g. one excel file per month of data) and we need to read multiple files and join them together before proceeding to the next step.

Tidying

The terminology of “tidy data” popularized by Hadley Wickham and his introduction to the concept lives in a vignette in the tidyr package.

Data is usually described as “tidy” if it satisfies the following:

  1. Each row represents an observation.
  2. Each column represents a variable.
  3. Each table of data represents a different type of observational unit.

The difficult part is recognizing what constitutes an observation and what constitutes a variable. In particular is considering what attributes should be applicable to every single observation. If your data has a large number of NA values, that is a symptom of storing the data in a messy (non-tidy) format.

Consider an address book with email addresses, phone numbers, and other contact information. Different people have several different types of contact information, it would be a bad idea to have one row per person because then we’d need a column for work email, personal email, home phone, work phone, cell phone, twitter handle, reddit user name, etc. Instead, store the information with a single row representing a particular contact.

## # A tibble: 5 × 3
##   Person Type       Value           
##   <chr>  <chr>      <chr>           
## 1 Robert Work Phone 555-555-0001    
## 2 Roy    Cell Phone 555-555-0002    
## 3 Bob    Twitter    @Bob            
## 4 Bianca Work Email email@email.com 
## 5 Derek  Github     dereksonderegger

For a more challenging example, suppose we have grade book where we’ve stored students scores for four different homework assignments.

##      name HW.1 HW.2 HW.3 HW.4
## 1  Alison    8    5    8    4
## 2 Brandon    5    3    6    9
## 3 Charles    9    7    9   10

In this case we are considering each row to represent a student and each variable represents homework score. An alternative representation would be for each row to represent a single score.

## # A tibble: 12 × 3
##    name    Assignment Score
##    <chr>   <chr>      <dbl>
##  1 Alison  HW.1           8
##  2 Alison  HW.2           5
##  3 Alison  HW.3           8
##  4 Alison  HW.4           4
##  5 Brandon HW.1           5
##  6 Brandon HW.2           3
##  7 Brandon HW.3           6
##  8 Brandon HW.4           9
##  9 Charles HW.1           9
## 10 Charles HW.2           7
## 11 Charles HW.3           9
## 12 Charles HW.4          10

Either representation is fine in this case, because each student should have the same number of assignments. If combining grade books from multiple semesters of teaching, the first option won’t work because sometimes the number of assign projects are not the same. So the tidy version of the data would be to have a table scores where each row represents a single assignment from a particular student.

Cleaning

The cleaning step is usually highly dependent on the data set content. This step involves

  1. Making sure every variable has the right type. Dates as dates, numerical instead of a string.
  2. Fix factor labels and sort order.
  3. Verify numeric values are reasonable. Examining summary statistics and/or histograms for each column.
  4. Create any calculated variables we need.

These steps are interchangeable and should be continued until all data has been investigated with properly assigned levels and units.

Use

In the previous three steps, we tried to keep all of the data present and not filter anything out. In this step we transition to using data to build a much deeper understanding. In the simplest case, we just take the full data set and pass it into a graph or statistical model. But in a more complicated situation, we might want to filter data and focus on a particular subset. For example, we might make a graph for a subgroup comparing two covariates.

In this step, the data manipulation is often to just filter the final cleaned up data. You may want to consider many different small filtered sets and can be convenient to pipe in changes to a graphing or modeling function.

tidy.scores %>%
  filter %>%
  ggplot(...) + geom()

Thinking

Why is it good to separate the steps in our thinking?

  1. Organizing data wrangling code in a systematic manner encourages proper documentation of key decisions made during the process.

  2. By separating data wrangling steps from the analysis, it becomes easier to focus on verification issues and initial exploration. This separation also aids in understanding the data more thoroughly by emphasizing how best to store and structure it.

  3. Ensuring data is cleaned correctly in a single step reduces the risk of errors in subsequent analyses. Relying on a single, tidy data set prevents the common mistake of omitting crucial cleaning steps across multiple analyses. If additional issues are discovered, only one script needs modification to include the fix.

  4. Utilizing packages makes it simple to share data analysis projects globally. By posting the project on platforms like GitHub, anyone can access, download, and understand it thanks to the standardized R package structure. Separating data import, tidying, and cleaning from the analysis within the package also improves collaboration by keeping the analysis organized and accessible for collaborators.