Chapter 1 Data Wrangling

1.1 Introduction

The process of data wrangling often seems very situation dependent and there isn’t a unifying process. However this isn’t completely true. The process can be thought of as having four distinct steps.

Step Description
Import Get the data into software somehow. The structure of the data is just however it came in.
Tidy Restructure the data so that each row is an observation, and each column is a variable.
Clean Correct variable types, consistent and useful labels, validation and correction.
Use Sub-setting the full data into a smaller set that addresses a particular question.

In many real world examples, the data wrangling work is concentrated in only one first three steps. Typically you might be able to skip one or more steps, but pedagogically it is helpful to

1.1.1 Tidying

The terminology of “tidy data” popularized by Hadley Wickham and his introduction to the concept lives in a vignette in the R tidyr package. I highly recommend reading Hadley Wickhams’s introduction as well as what I present here.

Data is usually described as “tidy” if it satisfies the following:

  1. Each row represents an observation.
  2. Each column represents a variable.
  3. Each table of data represents a different type of observational unit.

The difficult part is recognizing what constitutes an observation and what constitutes a variable. Often I like to think that the observations represent a noun and each now has multiple variables that adjectives that describe the noun. In particular I think that the attributes should be applicable to every single observation. If your data has a large number of NA values, that is a symptom of storing the data in a messy (non-tidy) format.

Suppose I have an address book where I keep email addresses, phone numbers, and other contact information. However, because different people have several different types of contact information, it would be a bad idea to have one row per person because then we’d need a column for work email, personal email, home phone, work phone, cell phone, twitter handle, reddit user name, etc. Instead, store the information with a single row representing a particular contact.

Person Type Value
Derek Work Email
Derek Cell Phone 970-867-5309
Derek Twitter @D_Sonderegger
Derek Github dereksonderegger
Mom Home Phone 555-867-5309

For a more challenging example, suppose we have grade book where we’ve stored students scores for four different homework assignments.

name HW.1 HW.2 HW.3 HW.4
Alison 8 5 8 4
Brandon 5 3 6 9
Charles 9 7 9 10

In this case we are considering each row to represent a student and each variable represents homework score. An alternative representation would be for each row to represent a single score.

name Assignment Score
Alison HW.1 8
Brandon HW.1 5
Charles HW.1 9
Alison HW.2 5
Brandon HW.2 3
Charles HW.2 7
Alison HW.3 8
Brandon HW.3 6
Charles HW.3 9
Alison HW.4 4
Brandon HW.4 9
Charles HW.4 10

Either representation is fine in this case, because each student should have the same number of assignments. However, if I was combining grade books from multiple times I’ve taught the course, the first option won’t work because sometimes I assign projects and sometimes not. So the tidy version of the data would be to have a table scores where each row represents a single assignment from a particular student.

1.1.2 Cleaning

The cleaning step is usually highly dependent on the data set content. This step involves

  1. Making sure every variable has the right type. For example, make sure that dates are dates, not character strings.
  2. Fix factor labels and sort order.
  3. Verify numeric values are reasonable.
  4. Create (some) calculated variables

Most of our data frame manipulation tools are designed to work with tidy data. As a result, cleaning is most easily done after the data set structure has been tidied. Therefore,I recommend first performing the reshaping tidying step and then perform the cleaning.

1.1.3 Use

In this step, we concentrate on the research question of interest. Often this involves calculating summary statistics, calculating selecting only particular rows of data, and finally creating graphs and models.

1.2 Tableau

Tableau comes with two pieces of software, Tableau Prep and Tableau. The prep software is intended to support the data tidying, cleaning, and summarization steps. Then we save the result to a file and open the cleaned up data in Tableau to do the visualization.

This course has access to both Tableau and Tableau Prep through an academic license.

  1. Go to https://www.tableau.com/products/prep/download to download both products.
  2. Select each product download link to get started. When prompted, enter your school email address for Business E-mail and enter the name of your school for Organization.
  3. Activate with your product key which will be distributed via email or on Bblearn.

This license will work through the semester. You may continue using Tableau after the class is over by individually requesting their own one-year license through the Tableau for Students program.

1.3 Importing data

Often data will be stored in a Comma Separated Values file that has a .csv file suffix. These types of files store data tables, and use a comma to separate the columns. Spreadsheet programs will recognize these files and show columns for Student Name, Exam 1, Exam 2 and the Final Exam. However, if you open these files using a text reader program, they’ll look something like this:

Student Name,Exam 1,Exam 2,Final Exam
Alice,87,87,81
Bob,91,88,85
Charlie,88,79,92

Below is a video showing how to open up this same data in Tableau Prep.

The import step gives a rotated view of the data and generally that doesn’t bother me, but for purposes of show what the imported data looks like, we’ll add a cleaning step that doesn’t do anything but will show the data. In general I won’t keep this step in my cleaning workflow, but it is handy to know how to do this.

1.4 Tidying or Reshaping

Often source data is stored in a structure that isn’t useful for subsequent analysis. When the data isn’t a format of one observation per row and one variable per column, we need to fix those issues before proceeding.

1.4.1 Gather

When data is in the “wide” format, we might need to convert it into “long” format. For an example, suppose we have a gradebook with a few students.

Wide Format
Name Exam 1 Exam 2 Final Exam
Alison 87 87 81
Bob 91 88 85
Charlie 88 79 92

What we want to do is turn this data frame from a wide data frame into a long data frame. In MS Excel and Tableau, this is called pivoting. Essentially I’d like to create a data frame with three columns: Name, Assessment, and Score. That is to say that each homework datum really has three pieces of information: who it came from, which homework it was, and what the score was. It doesn’t conceptually matter if I store it as 3 rows of 4 columns or 12 rows so long as there is a way to identify how a student scored on a particular homework. So we want to reshape the three Exam columns into just two columns (Assessment and Score).

Long Format
Name Assessment Score
Alison Exam 1 87
Bob Exam 1 91
Charlie Exam 1 88
Alison Exam 2 87
Bob Exam 2 88
Charlie Exam 2 79
Alison Final Exam 81
Bob Final Exam 85
Charlie Final Exam 92

1.4.2 Spread

There are times that we need to convert a “long” dataset into “wide” format. This is exactly the reverse function of the gather. Our example will be to undo the gather step from the previous. Again in Tableau, this is called pivoting, but we’ll make sure that we are turning rows into columns.

The tricky part is what variable goes into which pivot field. The top box specifies what the new column headers will be and the bottom box specifies what the cell values will be.

1.4.3 Column Splitting

Sometimes we have columns where we actually have two pieces of information stored and we have to split the single column into multiple columns.

For example, we might have data about across the country, where the city name obnoxiously includes the state. It would be good to split the Name column into City and State columns. For example,

Name Population
Phoenix, AZ 1660272
Flagstaff, AZ 73964
Tucson, AZ 545975

1.4.4 New Column via Calculation

Often we need to take a column and create some sort of calculation. For example, if we are given some information (say student height) in inches, we might want to calculate their height in centimeters by multiplying each height value by \(2.54\) and naming the result Height (cm).

Student Heights
Grade Gender Height Height (cm)
1st Grade Male 44 111.8
1st Grade Male 42 106.7
1st Grade Female 42 106.7
1st Grade Female 40 101.6
12th Grade Male 70 177.8
12th Grade Male 68 172.7
12th Grade Male 69 175.3
12th Grade Female 64 162.6
12th Grade Female 65 165.1

The issue in Tableau is to make sure you remember how to specify a column in a formula. The syntax is [column_name] and then you can use that quantity in whatever formula you want.

1.4.5 Aggregation

Given a data set we also want to be able to calculate summary for various columns. For example, we might want to calculate the mean height for each grade level, or perhaps each gender within each grade level. Notice that we end up with a data set with fewer rows.

summarise() regrouping output by ‘Grade’ (override with .groups argument)

Student Heights: Summarized
Grade Gender Avg Height
12th Grade Female 64.5
12th Grade Male 69
1st Grade Female 41
1st Grade Male 43

1.4.6 New Column with Aggregation

There are many cases where we need to create a calculated column, but the calculation involves some aggregated information. For, if have the number of males and females in each grade and we want to calculate the percentage of males and females in each grade. To do this we need to aggregate data to count the number of students in each grade and then divide the number of males or females in each grade by the total number of students in the grade.

Student Heights: Summarized
Grade Gender Grade_Gender_n
12th Grade Female 2
12th Grade Male 3
1st Grade Female 2
1st Grade Male 2
Grade Gender Grade_Gender_n Grade_n Proportion
12th Grade Female 2 5 0.4
12th Grade Male 3 5 0.6
1st Grade Female 2 4 0.5
1st Grade Male 2 4 0.5

Anytime you need to include an aggregation value in formula, we need to specify any grouping information for calculating the aggregate. The code for this is to specify the grouping variable and the aggregation function and column. The code is: {FIXED [Group]: SUM[Value]}

1.4.7 Unions

One way that we could “squish” two data sets together is if they have the same columns and we essentially just want to add new rows to a data set.

In this example we have two tables with identical column names and types and we just want to stack the two tables to create a single table that contains all the rows of data.

Parents
PersonID Type Handle
P0001 email
P0001 Twitter @D_Sonderegger
P0002 email
Kids
PersonID Type Handle
P0003 email
P0003 Twitter @UnicornsRule

Taking a union of these two data sets just squishes them together vertically.

Parents & Kid Unioned Together
PersonID Type Handle
P0001 email
P0001 Twitter @D_Sonderegger
P0002 email
P0003 email
P0003 Twitter @UnicornsRule

1.4.8 Joins

There are many cases where we have two or more data sets that are related and we need to squish them together to make a table with more columns. For example we might have a data set People that contains a unique numeric code for each person, first and last names, and birth dates. We might also have a data set of Contacts which has the person’s unique numeric code and the contact type and value.

People
PersonID First.Name Birthday
P0001 Derek 01/10/1976
P0002 Aubrey 02/14/1980
P0004 Casey 02/14/2015
Contact Information
PersonID Type Handle
P0001 email
P0001 Twitter @D_Sonderegger
P0002 email

The PersonID column links the two tables. But when we now have to ask what we should do about Casey. There are really two options:

Join Type Description
Inner Join Only include rows for individuals that have information in both tables.
Full Join Include a row for for every row in either of the tables being joined. In this case, it results in a row Casey being included, but the contact information being left empty.

A join squishes the two data sets together horizontally while making sure the rows match up because of one matching column that occurs in each data set. The inner join only returns rows that have data in both data sets. Joining, by = “PersonID”

Inner Join
PersonID First.Name Birthday Type Handle
P0001 Derek 01/10/1976 email
P0001 Derek 01/10/1976 Twitter @D_Sonderegger
P0002 Aubrey 02/14/1980 email

The outer join returns all possible rows that occur in either data set. Notice that Casey has a row in the People data set but doesn’t have a corresponding row in the Contacts table. So his row in the outer join has missing data for the contact information. Joining, by = “PersonID”

Outer Join
PersonID First.Name Birthday Type Handle
P0001 Derek 01/10/1976 email
P0001 Derek 01/10/1976 Twitter @D_Sonderegger
P0002 Aubrey 02/14/1980 email
P0004 Casey 02/14/2015 NA NA

1.5 Exercises

  1. We are given information about the maximum daily temperature from a weather station in Flagstaff, AZ. The file is available at the GitHub site that this book is hosted on. You can find it https://github.com/dereksonderegger/141/raw/master/data-raw/FlagMaxTemp.csv. This file is in a wide format, where each row represents a month and the columns 1, 2, …, 31 represent the day of the month the observation was made.
    1. Convert data set to the long format where the data has only four columns: Year, Month, Day, Tmax.
    2. Calculate the average monthly maximum temperature for each Month in the dataset (So there will be 365 mean maximum temperatures). There are several days that have missing data, represented by the value NA. If you have a problem taking the mean with them in there, you likely haven’t converted the temperature from a character string to a numeric value.
    3. Convert the average month maximums back to a wide data format where each line represents a year and there are 12 columns of temperature data (one for each month) along with a column for the year. There will be a couple of months that still have missing data because the weather station was out of commission for those months and there was NO data for the entire month.
    4. Save this table as a .csv file.
  2. A common task is to take a set of data that has multiple categorical variables and create a table of the number of cases for each combination. An introductory statistics textbook contains a dataset summarizing student surveys from several sections of an intro class. The two variables of interest for us are Gender and Year which are the students gender and year in college.
    1. Download the dataset as a .csv file from the following website: http://www.lock5stat.com/datasets/StudentSurvey.csv

    2. Notice that there are two survey responses that did not give their year in school. Perform a filter step to remove any row with the value of null in the Year column.

    3. Produce a data set with eight rows that contains the number of responses for each gender:year combination.

    4. Using a pivot command, produce a table of the number of responses in the following form:

      Gender First Year Sophmore Junior Senior
      Female
      Male
    5. Save this table as a .csv file.

  3. Read Chapter 2 in The Truthful Art.
    1. On page 46, figure 2.2 asks you to list aspects of the figure that “seem fishy”. Before reading the rest of the chapter, list the things that seem fishy. After you’ve read the rest of the chapter, go back and examine the graph and reflect on why you didn’t notice certain deceptive issues on the first reading of the graph.
    2. On page 55, figure 2.11 asks which chart is more aesthetically pleasing? Which do you think is more attractive and why?