Chapter 1 Data Wrangling
1.1 Introduction
The process of data wrangling often seems very situation dependent and there isn’t a unifying process. However this isn’t completely true. The process can be thought of as having four distinct steps.
Step | Description |
---|---|
Import | Get the data into software somehow. The structure of the data is just however it came in. |
Tidy | Restructure the data so that each row is an observation, and each column is a variable. |
Clean | Correct variable types, consistent and useful labels, validation and correction. |
Use | Sub-setting the full data into a smaller set that addresses a particular question. |
In many real world examples, the data wrangling work is concentrated in only one first three steps. Typically you might be able to skip one or more steps, but pedagogically it is helpful to
1.1.1 Tidying
The terminology of “tidy data” popularized by Hadley Wickham and his introduction to the concept lives in a vignette in the R tidyr
package. I highly recommend reading Hadley Wickhams’s introduction as well as what I present here.
Data is usually described as “tidy” if it satisfies the following:
- Each row represents an observation.
- Each column represents a variable.
- Each table of data represents a different type of observational unit.
The difficult part is recognizing what constitutes an observation and what constitutes a variable. Often I like to think that the observations represent a noun and each now has multiple variables that adjectives that describe the noun. In particular I think that the attributes should be applicable to every single observation. If your data has a large number of NA
values, that is a symptom of storing the data in a messy (non-tidy) format.
Suppose I have an address book where I keep email addresses, phone numbers, and other contact information. However, because different people have several different types of contact information, it would be a bad idea to have one row per person because then we’d need a column for work email, personal email, home phone, work phone, cell phone, twitter handle, reddit user name, etc. Instead, store the information with a single row representing a particular contact.
Person | Type | Value |
---|---|---|
Derek | Work Email | derek.sonderegger@nau.edu |
Derek | Cell Phone | 970-867-5309 |
Derek | @D_Sonderegger | |
Derek | Github | dereksonderegger |
Mom | Home Phone | 555-867-5309 |
For a more challenging example, suppose we have grade book where we’ve stored students scores for four different homework assignments.
name | HW.1 | HW.2 | HW.3 | HW.4 |
---|---|---|---|---|
Alison | 8 | 5 | 8 | 4 |
Brandon | 5 | 3 | 6 | 9 |
Charles | 9 | 7 | 9 | 10 |
In this case we are considering each row to represent a student and each variable represents homework score. An alternative representation would be for each row to represent a single score.
name | Assignment | Score |
---|---|---|
Alison | HW.1 | 8 |
Brandon | HW.1 | 5 |
Charles | HW.1 | 9 |
Alison | HW.2 | 5 |
Brandon | HW.2 | 3 |
Charles | HW.2 | 7 |
Alison | HW.3 | 8 |
Brandon | HW.3 | 6 |
Charles | HW.3 | 9 |
Alison | HW.4 | 4 |
Brandon | HW.4 | 9 |
Charles | HW.4 | 10 |
Either representation is fine in this case, because each student should have the same number of assignments. However, if I was combining grade books from multiple times I’ve taught the course, the first option won’t work because sometimes I assign projects and sometimes not. So the tidy version of the data would be to have a table scores
where each row represents a single assignment from a particular student.
1.1.2 Cleaning
The cleaning step is usually highly dependent on the data set content. This step involves
- Making sure every variable has the right type. For example, make sure that dates are dates, not character strings.
- Fix factor labels and sort order.
- Verify numeric values are reasonable.
- Create (some) calculated variables
Most of our data frame manipulation tools are designed to work with tidy data. As a result, cleaning is most easily done after the data set structure has been tidied. Therefore,I recommend first performing the reshaping tidying step and then perform the cleaning.
1.1.3 Use
In this step, we concentrate on the research question of interest. Often this involves calculating summary statistics, calculating selecting only particular rows of data, and finally creating graphs and models.
1.2 Tableau
Tableau comes with two pieces of software, Tableau Prep
and Tableau
. The prep software is intended to support the data tidying, cleaning, and summarization steps. Then we save the result to a file and open the cleaned up data in Tableau to do the visualization.
This course has access to both Tableau and Tableau Prep through an academic license.
- Go to https://www.tableau.com/products/prep/download to download both products.
- Select each product download link to get started. When prompted, enter your school email address for Business E-mail and enter the name of your school for Organization.
- Activate with your product key which will be distributed via email or on Bblearn.
This license will work through the semester. You may continue using Tableau after the class is over by individually requesting their own one-year license through the Tableau for Students program.
1.3 Importing data
Often data will be stored in a Comma Separated Values file that has a .csv
file suffix. These types of files store data tables, and use a comma to separate the columns. Spreadsheet programs will recognize these files and show columns for Student Name
, Exam 1
, Exam 2
and the Final Exam
. However, if you open these files using a text reader program, they’ll look something like this:
1,Exam 2,Final Exam
Student Name,Exam 87,87,81
Alice,91,88,85
Bob,88,79,92 Charlie,
Below is a video showing how to open up this same data in Tableau Prep.
The import step gives a rotated view of the data and generally that doesn’t bother me, but for purposes of show what the imported data looks like, we’ll add a cleaning step that doesn’t do anything but will show the data. In general I won’t keep this step in my cleaning workflow, but it is handy to know how to do this.
1.4 Tidying or Reshaping
Often source data is stored in a structure that isn’t useful for subsequent analysis. When the data isn’t a format of one observation per row and one variable per column, we need to fix those issues before proceeding.
1.4.1 Gather
When data is in the “wide” format, we might need to convert it into “long” format. For an example, suppose we have a gradebook with a few students.
Name | Exam 1 | Exam 2 | Final Exam |
---|---|---|---|
Alison | 87 | 87 | 81 |
Bob | 91 | 88 | 85 |
Charlie | 88 | 79 | 92 |
What we want to do is turn this data frame from a wide data frame into a long data frame. In MS Excel and Tableau, this is called pivoting. Essentially I’d like to create a data frame with three columns: Name
, Assessment
, and Score
. That is to say that each homework datum really has three pieces of information: who it came from, which homework it was, and what the score was. It doesn’t conceptually matter if I store it as 3 rows of 4 columns or 12 rows so long as there is a way to identify how a student scored on a particular homework. So we want to reshape the three Exam columns into just two columns (Assessment and Score).
Name | Assessment | Score |
---|---|---|
Alison | Exam 1 | 87 |
Bob | Exam 1 | 91 |
Charlie | Exam 1 | 88 |
Alison | Exam 2 | 87 |
Bob | Exam 2 | 88 |
Charlie | Exam 2 | 79 |
Alison | Final Exam | 81 |
Bob | Final Exam | 85 |
Charlie | Final Exam | 92 |
1.4.2 Spread
There are times that we need to convert a “long” dataset into “wide” format. This is exactly the reverse function of the gather. Our example will be to undo the gather step from the previous. Again in Tableau, this is called pivoting, but we’ll make sure that we are turning rows into columns.
The tricky part is what variable goes into which pivot field. The top box specifies what the new column headers will be and the bottom box specifies what the cell values will be.
1.4.3 Column Splitting
Sometimes we have columns where we actually have two pieces of information stored and we have to split the single column into multiple columns.
For example, we might have data about across the country, where the city name obnoxiously includes the state. It would be good to split the Name
column into City
and State
columns. For example,
Name | Population |
---|---|
Phoenix, AZ | 1660272 |
Flagstaff, AZ | 73964 |
Tucson, AZ | 545975 |
1.4.4 New Column via Calculation
Often we need to take a column and create some sort of calculation. For example, if we are given some information (say student height) in inches, we might want to calculate their height in centimeters by multiplying each height value by \(2.54\) and naming the result Height (cm)
.
Grade | Gender | Height | Height (cm) |
---|---|---|---|
1st Grade | Male | 44 | 111.8 |
1st Grade | Male | 42 | 106.7 |
1st Grade | Female | 42 | 106.7 |
1st Grade | Female | 40 | 101.6 |
12th Grade | Male | 70 | 177.8 |
12th Grade | Male | 68 | 172.7 |
12th Grade | Male | 69 | 175.3 |
12th Grade | Female | 64 | 162.6 |
12th Grade | Female | 65 | 165.1 |
The issue in Tableau is to make sure you remember how to specify a column in a formula. The syntax is [column_name]
and then you can use that quantity in whatever formula you want.
1.4.5 Aggregation
Given a data set we also want to be able to calculate summary for various columns. For example, we might want to calculate the mean height for each grade level, or perhaps each gender within each grade level. Notice that we end up with a data set with fewer rows.
summarise()
regrouping output by ‘Grade’ (override with .groups
argument)
Grade | Gender | Avg Height |
---|---|---|
12th Grade | Female | 64.5 |
12th Grade | Male | 69 |
1st Grade | Female | 41 |
1st Grade | Male | 43 |
1.4.6 New Column with Aggregation
There are many cases where we need to create a calculated column, but the calculation involves some aggregated information. For, if have the number of males and females in each grade and we want to calculate the percentage of males and females in each grade. To do this we need to aggregate data to count the number of students in each grade and then divide the number of males or females in each grade by the total number of students in the grade.
Grade | Gender | Grade_Gender_n |
---|---|---|
12th Grade | Female | 2 |
12th Grade | Male | 3 |
1st Grade | Female | 2 |
1st Grade | Male | 2 |
Grade | Gender | Grade_Gender_n | Grade_n | Proportion |
---|---|---|---|---|
12th Grade | Female | 2 | 5 | 0.4 |
12th Grade | Male | 3 | 5 | 0.6 |
1st Grade | Female | 2 | 4 | 0.5 |
1st Grade | Male | 2 | 4 | 0.5 |
Anytime you need to include an aggregation value in formula, we need to specify any grouping information for calculating the aggregate. The code for this is to specify the grouping variable and the aggregation function and column. The code is: {FIXED [Group]: SUM[Value]}
1.4.7 Unions
One way that we could “squish” two data sets together is if they have the same columns and we essentially just want to add new rows to a data set.
In this example we have two tables with identical column names and types and we just want to stack the two tables to create a single table that contains all the rows of data.
PersonID | Type | Handle |
---|---|---|
P0001 | derek.sonderegger@nau.edu | |
P0001 | @D_Sonderegger | |
P0002 | aubrey.sonderegger@yahoo.com |
PersonID | Type | Handle |
---|---|---|
P0003 | Elise.Sonderegger@gmail.com | |
P0003 | @UnicornsRule |
Taking a union of these two data sets just squishes them together vertically.
PersonID | Type | Handle |
---|---|---|
P0001 | derek.sonderegger@nau.edu | |
P0001 | @D_Sonderegger | |
P0002 | aubrey.sonderegger@yahoo.com | |
P0003 | Elise.Sonderegger@gmail.com | |
P0003 | @UnicornsRule |
1.4.8 Joins
There are many cases where we have two or more data sets that are related and we need to squish them together to make a table with more columns. For example we might have a data set People
that contains a unique numeric code for each person, first and last names, and birth dates. We might also have a data set of Contacts
which has the person’s unique numeric code and the contact type and value.
PersonID | First.Name | Birthday |
---|---|---|
P0001 | Derek | 01/10/1976 |
P0002 | Aubrey | 02/14/1980 |
P0004 | Casey | 02/14/2015 |
PersonID | Type | Handle |
---|---|---|
P0001 | derek.sonderegger@nau.edu | |
P0001 | @D_Sonderegger | |
P0002 | aubrey.sonderegger@yahoo.com |
The PersonID
column links the two tables. But when we now have to ask what we should do about Casey. There are really two options:
Join Type | Description |
---|---|
Inner Join | Only include rows for individuals that have information in both tables. |
Full Join | Include a row for for every row in either of the tables being joined. In this case, it results in a row Casey being included, but the contact information being left empty. |
A join squishes the two data sets together horizontally while making sure the rows match up because of one matching column that occurs in each data set. The inner join only returns rows that have data in both data sets. Joining, by = “PersonID”
PersonID | First.Name | Birthday | Type | Handle |
---|---|---|---|---|
P0001 | Derek | 01/10/1976 | derek.sonderegger@nau.edu | |
P0001 | Derek | 01/10/1976 | @D_Sonderegger | |
P0002 | Aubrey | 02/14/1980 | aubrey.sonderegger@yahoo.com |
The outer join returns all possible rows that occur in either data set. Notice that Casey has a row in the People
data set but doesn’t have a corresponding row in the Contacts
table. So his row in the outer join has missing data for the contact information.
Joining, by = “PersonID”
PersonID | First.Name | Birthday | Type | Handle |
---|---|---|---|---|
P0001 | Derek | 01/10/1976 | derek.sonderegger@nau.edu | |
P0001 | Derek | 01/10/1976 | @D_Sonderegger | |
P0002 | Aubrey | 02/14/1980 | aubrey.sonderegger@yahoo.com | |
P0004 | Casey | 02/14/2015 | NA | NA |
1.5 Exercises
- We are given information about the maximum daily temperature from a weather station in Flagstaff, AZ. The file is available at the GitHub site that this book is hosted on. You can find it https://github.com/dereksonderegger/141/raw/master/data-raw/FlagMaxTemp.csv. This file is in a wide format, where each row represents a month and the columns 1, 2, …, 31 represent the day of the month the observation was made.
- Convert data set to the long format where the data has only four columns:
Year
,Month
,Day
,Tmax
. - Calculate the average monthly maximum temperature for each Month in the dataset (So there will be 365 mean maximum temperatures). There are several days that have missing data, represented by the value
NA
. If you have a problem taking the mean with them in there, you likely haven’t converted the temperature from a character string to a numeric value. - Convert the average month maximums back to a wide data format where each line represents a year and there are 12 columns of temperature data (one for each month) along with a column for the year. There will be a couple of months that still have missing data because the weather station was out of commission for those months and there was NO data for the entire month.
- Save this table as a .csv file.
- Convert data set to the long format where the data has only four columns:
- A common task is to take a set of data that has multiple categorical variables and create a table of the number of cases for each combination. An introductory statistics textbook contains a dataset summarizing student surveys from several sections of an intro class. The two variables of interest for us are
Gender
andYear
which are the students gender and year in college.Download the dataset as a
.csv
file from the following website: http://www.lock5stat.com/datasets/StudentSurvey.csvNotice that there are two survey responses that did not give their year in school. Perform a filter step to remove any row with the value of
null
in theYear
column.Produce a data set with eight rows that contains the number of responses for each gender:year combination.
Using a
pivot
command, produce a table of the number of responses in the following form:Gender First Year Sophmore Junior Senior Female Male Save this table as a .csv file.
- Read Chapter 2 in The Truthful Art.
- On page 46, figure 2.2 asks you to list aspects of the figure that “seem fishy”. Before reading the rest of the chapter, list the things that seem fishy. After you’ve read the rest of the chapter, go back and examine the graph and reflect on why you didn’t notice certain deceptive issues on the first reading of the graph.
- On page 55, figure 2.11 asks which chart is more aesthetically pleasing? Which do you think is more attractive and why?