Chapter 4 Data Manipulation
4.1 Introduction
Data wrangling often seems highly situation dependent, as if there were no unifying process. This isn’t completely true, however: the process can be thought of as having four distinct steps.
Step | Description |
---|---|
Import | Get the data into software somehow. The structure of the data is just however it came in. |
Tidy | Restructure the data so that each row is an observation, and each column is a variable. |
Clean | Correct variable types, consistent and useful labels, validation and correction. |
Use | Sub-setting the full data into a smaller set that addresses a particular question. |
In many real-world examples, the data wrangling work is concentrated in just one of the first three steps, and you can often skip one or more of the others; pedagogically, though, it is helpful to be aware of the full process.
4.1.1 Import
Importing your data into your software is usually easy, and usually falls into one of the following scenarios.
- The data is stored in a file with a known format, such as a comma separated values file or an Excel document.
- The data is accessible on-line and the user has to do some work to figure out how to request the data, ultimately downloading it as a file with a nice file format, as in the first scenario.
- The data lives in a database and you need to sign in to the database and query it. This often happens in professional settings, but for this course we’ll not worry about this situation.
One of the most common file formats we will encounter is the comma separated values file, which has the file extension `.csv`. These files look like this:
Student, HW1, HW2, HW3
Sarah Anderson, 10, 9.5, 10
Bob Barker, 9, 10, 8
In this case the data is organized as you might see in an Excel file: the first row contains the column names, with the columns separated by commas. The second row of the file is actually the first row of data, again with commas separating the columns.
Typically your computer will show a preview of the file as if it were a spreadsheet and hide the commas from you. If you open the file, the operating system will open it with a spreadsheet program and, again, not show the commas. If you want to see the raw file, you should specifically open it with a simple text editor. For day-to-day viewing and editing of the data I’m happy to use my spreadsheet program, but it is useful to know the underlying file format and how to view it if there is a problem with the import.
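Since this course’s tools are spreadsheets and Tableau, any code here is purely illustrative, but it can help to see what a program actually receives from a `.csv` file. Here is a minimal Python sketch using the standard `csv` module; note that every value arrives as a character string, which is why the cleaning step below matters:

```python
import csv
import io

# The raw .csv text exactly as it would appear in a plain text editor.
raw = """Student, HW1, HW2, HW3
Sarah Anderson, 10, 9.5, 10
Bob Barker, 9, 10, 8
"""

# csv.reader splits each line on commas; skipinitialspace=True drops
# the blank that follows each comma in this particular file.
rows = list(csv.reader(io.StringIO(raw), skipinitialspace=True))
header, data = rows[0], rows[1:]

print(header)   # the first line holds the column names
print(data[0])  # the second line is the first row of data (still strings!)
```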
4.1.2 Tidying
The terminology of “tidy data” was popularized by Hadley Wickham, and his introduction to the concept lives in a vignette in the R `tidyr` package. I highly recommend reading Hadley Wickham’s introduction as well as what I present here.
The above graphs are courtesy of @Alison_Horst.1
Data is usually described as “tidy” if it satisfies the following:
- Each row represents an observation.
- Each column represents a variable.
- Each table of data represents a different type of observational unit.
The difficult part is recognizing what constitutes an observation and what constitutes a variable. Often I like to think of the observations as nouns, with each noun having multiple variables that act as adjectives describing it. In particular, the attributes should be applicable to every single observation. If your data has a large number of missing values, that is a symptom of storing the data in a messy (non-tidy) format.
Suppose I have an address book where I keep email addresses, phone numbers, and other contact information. However, because different people have several different types of contact information, it would be a bad idea to have one row per person, because then we’d need a column for work email, personal email, home phone, work phone, cell phone, twitter handle, reddit user name, etc. Instead, we store the information with a single row representing a particular piece of contact information.
Person | Type | Value |
---|---|---|
Derek | Work Email | derek.sonderegger@nau.edu |
Derek | Cell Phone | 970-867-5309 |
Derek | Twitter | @D_Sonderegger |
Derek | Github | dereksonderegger |
Mom | Home Phone | 555-867-5309 |
Dad | Home Phone | 555-867-5309 |
Dad | Email | json@aol.com |
Dad | TikTok | olddesertdude |
For a more challenging example, suppose we have a grade book where we’ve stored students’ scores for four different homework assignments. In this representation, each row contains the information about a single student.
name | HW.1 | HW.2 | HW.3 | HW.4 |
---|---|---|---|---|
Alison | 8 | 5 | 8 | 4 |
Brandon | 5 | 3 | 6 | 9 |
Charles | 9 | 7 | 9 | 10 |
In this case each row represents a student and each variable represents a homework score. An alternative representation would be for each row to represent a single score.
name | Assignment | Score |
---|---|---|
Alison | HW.1 | 8 |
Brandon | HW.1 | 5 |
Charles | HW.1 | 9 |
Alison | HW.2 | 5 |
Brandon | HW.2 | 3 |
Charles | HW.2 | 7 |
Alison | HW.3 | 8 |
Brandon | HW.3 | 6 |
Charles | HW.3 | 9 |
Alison | HW.4 | 4 |
Brandon | HW.4 | 9 |
Charles | HW.4 | 10 |
Either representation is fine in this case, because each student should have the same number of assignments. However, if I were combining grade books from multiple times I’ve taught the course, the first option wouldn’t work because sometimes I assign projects and sometimes not. So the tidy version of the data would be a table `Scores` where each row represents a single assignment from a particular student.
4.1.3 Cleaning
The cleaning step is usually highly dependent on the data set content. This step involves:
- Making sure every variable has the right type. For example, make sure that dates are dates, not character strings.
- Fix factor labels and sort order.
- Verify numeric values are reasonable.
- Create (some) calculated variables
Most of our data frame manipulation tools are designed to work with tidy data. As a result, cleaning is most easily done after the data set structure has been tidied. Therefore, I recommend first performing the reshaping/tidying step and then performing the cleaning.
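As a small illustration of the cleaning step, here is a Python sketch (the field names are hypothetical) that converts freshly imported text fields into proper types and validates a numeric value:

```python
from datetime import date

# After import, every field typically arrives as a character string.
raw = {"name": " Sarah ", "enrolled": "2023-08-28", "score": "9.5"}

clean = {
    "name": raw["name"].strip(),                      # tidy up the label
    "enrolled": date.fromisoformat(raw["enrolled"]),  # string -> real date
    "score": float(raw["score"]),                     # string -> numeric
}

# Validation: a homework score outside 0-10 would signal a data entry error.
assert 0 <= clean["score"] <= 10
```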
4.1.4 Use
In this step, we concentrate on the research question of interest. Often this involves calculating summary statistics, selecting only particular rows of data, and finally creating graphs and models.
4.2 Fundamental Actions
Most actions we need to perform can be done with a sequence of the following “fundamental actions.” These actions allow us to create or modify columns, calculate summary statistics for one or more columns, reshape the table, and combine multiple tables into a single table.
4.2.1 Sorting
Name | Grade | Gender | Height |
---|---|---|---|
Adam | 1st Grade | Male | 42 |
Bob | 2nd Grade | Male | 45 |
Carrie | 2nd Grade | Female | 43 |
Darlene | 1st Grade | Female | 39 |
Given the data set above, we might want to sort the information by any of the `Name`, `Grade`, or `Height` columns.
4.2.2 Subsetting
Sub-setting is removing some set of rows or columns. For example, we might want to create a data set that contains only the `Height` and `Name` columns for just the 1st Grade children, sorted by `Height`:
Height | Name |
---|---|
39 | Darlene |
42 | Adam |
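The sub-setting above combines all three ideas: keep some rows, keep some columns, then sort. A minimal Python sketch:

```python
kids = [
    {"Name": "Adam",    "Grade": "1st Grade", "Height": 42},
    {"Name": "Bob",     "Grade": "2nd Grade", "Height": 45},
    {"Name": "Carrie",  "Grade": "2nd Grade", "Height": 43},
    {"Name": "Darlene", "Grade": "1st Grade", "Height": 39},
]

# Keep only the 1st Grade rows and only the Height and Name columns...
subset = [
    {"Height": row["Height"], "Name": row["Name"]}
    for row in kids
    if row["Grade"] == "1st Grade"
]
subset.sort(key=lambda row: row["Height"])  # ...then sort by Height
```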
4.2.3 New Column from Existing Column(s)
Often we need to take a column and create some sort of calculation from it. For example, if we are given some information (say student height) in inches, we might want to calculate their height in centimeters by multiplying each height value by \(2.54\) and naming the result `Height (cm)`.
Name | Grade | Gender | Height | Height (cm) |
---|---|---|---|---|
Adam | 1st Grade | Male | 42 | 106.7 |
Bob | 2nd Grade | Male | 45 | 114.3 |
Carrie | 2nd Grade | Female | 43 | 109.2 |
Darlene | 1st Grade | Female | 39 | 99.1 |
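The calculated-column step is a simple transformation applied to every row. A Python sketch, rounding to one decimal as in the table above:

```python
kids = [
    {"Name": "Adam",    "Height": 42},
    {"Name": "Bob",     "Height": 45},
    {"Name": "Carrie",  "Height": 43},
    {"Name": "Darlene", "Height": 39},
]

# Add a calculated column: inches -> centimeters, rounded to one decimal.
for row in kids:
    row["Height (cm)"] = round(row["Height"] * 2.54, 1)
```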
4.2.4 Aggregation
We need to be able to calculate summary statistics like counting the number of observations in a group or calculating the mean of some column. For example we might want to calculate the number of students in each grade as well as the average height of students in each of the grades.
Grade | Number | Avg_Height |
---|---|---|
1st Grade | 2 | 40.5 |
2nd Grade | 2 | 44 |
We call this action an aggregation step because we are combining (aka aggregating) information from multiple rows. Notice that in the above example, we actually did an aggregation for each grade level, and then “squished” together those individual grade level results.
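The group-then-squish idea in the aggregation step can be sketched in Python: first collect the heights by grade, then compute one summary row per group:

```python
from collections import defaultdict

kids = [
    {"Grade": "1st Grade", "Height": 42},
    {"Grade": "2nd Grade", "Height": 45},
    {"Grade": "2nd Grade", "Height": 43},
    {"Grade": "1st Grade", "Height": 39},
]

# Group the heights by grade level...
groups = defaultdict(list)
for row in kids:
    groups[row["Grade"]].append(row["Height"])

# ...then compute one summary row per grade and squish them together.
summary = [
    {"Grade": g, "Number": len(h), "Avg_Height": sum(h) / len(h)}
    for g, h in sorted(groups.items())
]
```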
4.2.5 Unions
One way that we could “squish” two data sets together is if they have the same columns and we essentially just want to add new rows to a data set.
In this example we have two tables with identical column names and types and we just want to stack the two tables to create a single table that contains all the rows of data.
PersonID | Type | Handle |
---|---|---|
P0001 | Email | derek.sonderegger@nau.edu |
P0001 | Twitter | @D_Sonderegger |
P0002 | Email | aubrey.sonderegger@yahoo.com |
PersonID | Type | Handle |
---|---|---|
P0003 | Email | Elise.Sonderegger@gmail.com |
P0003 | Twitter | @UnicornsRule |
Taking a union of these two data sets just squishes them together vertically.
PersonID | Type | Handle |
---|---|---|
P0001 | Email | derek.sonderegger@nau.edu |
P0001 | Twitter | @D_Sonderegger |
P0002 | Email | aubrey.sonderegger@yahoo.com |
P0003 | Email | Elise.Sonderegger@gmail.com |
P0003 | Twitter | @UnicornsRule |
4.2.6 Joins
There are many cases where we have two or more data sets that are related and we
need to squish them together to make a table with more columns. For example we
might have a data set People
that contains a unique numeric code for each person,
first and last names, and birth dates. We might also have a data set of Contacts
which has the person’s unique numeric code and the contact type and value.
PersonID | First.Name | Birthday |
---|---|---|
P0001 | Derek | 01/10/1976 |
P0002 | Aubrey | 02/14/1980 |
P0004 | Casey | 02/14/2015 |
PersonID | Type | Handle |
---|---|---|
P0001 | Email | derek.sonderegger@nau.edu |
P0001 | Twitter | @D_Sonderegger |
P0002 | Email | aubrey.sonderegger@yahoo.com |
The `PersonID` column links the two tables. But we now have to ask what we should do about Casey. There are really two options:
Join Type | Description |
---|---|
Inner Join | Only include rows for individuals that have information in both tables. |
Full Join | Include a row for every row in either of the tables being joined. In this case, this results in a row for Casey being included, but with the contact information left empty. |
A join squishes the two data sets together horizontally while making sure the rows match up because of one matching column that occurs in each data set. The inner join only returns rows that have data in both data sets.
PersonID | First.Name | Birthday | Type | Handle |
---|---|---|---|---|
P0001 | Derek | 01/10/1976 | Email | derek.sonderegger@nau.edu |
P0001 | Derek | 01/10/1976 | Twitter | @D_Sonderegger |
P0002 | Aubrey | 02/14/1980 | Email | aubrey.sonderegger@yahoo.com |
The outer join returns all possible rows that occur in either data set. Notice that Casey has a row in the `People` data set but doesn’t have a corresponding row in the `Contacts` table. So his row in the outer join has missing data for the contact information.
PersonID | First.Name | Birthday | Type | Handle |
---|---|---|---|---|
P0001 | Derek | 01/10/1976 | Email | derek.sonderegger@nau.edu |
P0001 | Derek | 01/10/1976 | Twitter | @D_Sonderegger |
P0002 | Aubrey | 02/14/1980 | Email | aubrey.sonderegger@yahoo.com |
P0004 | Casey | 02/14/2015 | NA | NA |
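The join logic can be sketched in Python. The inner join keeps only `PersonID` values that appear in both tables; the full join also keeps unmatched rows, filling the missing side with empty values. (In this example every contact has a matching person, so only the `People` side contributes an unmatched row: Casey.)

```python
people = [
    {"PersonID": "P0001", "First.Name": "Derek",  "Birthday": "01/10/1976"},
    {"PersonID": "P0002", "First.Name": "Aubrey", "Birthday": "02/14/1980"},
    {"PersonID": "P0004", "First.Name": "Casey",  "Birthday": "02/14/2015"},
]
contacts = [
    {"PersonID": "P0001", "Type": "Email",   "Handle": "derek.sonderegger@nau.edu"},
    {"PersonID": "P0001", "Type": "Twitter", "Handle": "@D_Sonderegger"},
    {"PersonID": "P0002", "Type": "Email",   "Handle": "aubrey.sonderegger@yahoo.com"},
]

# Inner join: merge each pair of rows whose PersonID values match.
inner = [
    {**p, **c}
    for p in people
    for c in contacts
    if p["PersonID"] == c["PersonID"]
]

# Full join: additionally keep people with no contact rows, with the
# contact columns left as missing (None here, shown as NA in the table).
matched = {c["PersonID"] for c in contacts}
full = inner + [
    {**p, "Type": None, "Handle": None}
    for p in people
    if p["PersonID"] not in matched
]
```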
4.2.7 Pivots
Often source data is stored in a structure that isn’t useful for subsequent analysis. When the data isn’t a format of one observation per row and one variable per column, we need to fix those issues before proceeding.
4.2.7.1 Pivot Longer
When data is in the “wide” format, we might need to convert it into “long” format. For an example, suppose we have a grade book with a few students.
Name | Exam 1 | Exam 2 | Final Exam |
---|---|---|---|
Alison | 87 | 87 | 81 |
Bob | 91 | 88 | 85 |
Charlie | 88 | 79 | 92 |
What we want to do is turn this data frame from a wide data frame into a long data frame, and this action is called pivoting. Essentially I’d like to create a data frame with three columns: `Name`, `Assessment`, and `Score`. That is to say that each exam datum really has three pieces of information: who it came from, which assessment it was, and what the score was. It doesn’t conceptually matter if I store it as 3 rows of 4 columns or as 9 rows of 3 columns, so long as there is a way to identify how a student scored on a particular assessment.
Fundamentally we’ll need to indicate the set of columns we want reshaped into two columns, as well as what to name those two columns.
Name | Assessment | Score |
---|---|---|
Alison | Exam 1 | 87 |
Bob | Exam 1 | 91 |
Charlie | Exam 1 | 88 |
Alison | Exam 2 | 87 |
Bob | Exam 2 | 88 |
Charlie | Exam 2 | 79 |
Alison | Final Exam | 81 |
Bob | Final Exam | 85 |
Charlie | Final Exam | 92 |
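The pivot longer action above can be sketched in Python: each (student, assessment) score in the wide table becomes its own row in the long table.

```python
wide = [
    {"Name": "Alison",  "Exam 1": 87, "Exam 2": 87, "Final Exam": 81},
    {"Name": "Bob",     "Exam 1": 91, "Exam 2": 88, "Final Exam": 85},
    {"Name": "Charlie", "Exam 1": 88, "Exam 2": 79, "Final Exam": 92},
]
assessments = ["Exam 1", "Exam 2", "Final Exam"]  # the columns being reshaped

# One long row per (student, assessment) combination; looping over the
# assessments first matches the row order of the table above.
long = [
    {"Name": row["Name"], "Assessment": a, "Score": row[a]}
    for a in assessments
    for row in wide
]
```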
4.2.7.2 Pivot Wider
There are times that we need to convert a “long” data set into “wide” format. This is exactly the reverse of the pivot longer action, and our example will be to undo the pivot from the previous section. Again this is called pivoting, but now we are turning rows into columns. In reverse of the pivot longer action, you’ll specify which two columns to pivot, distinguishing which one will become the column headers and which one will become the cell values.
4.2.7.3 Why choose long vs wide?
A plant ecologist is collecting data at many different sites across Arizona. At each site she will record how many of each species she observes. Most species will only be observed in one or two sites. Should she store her data in a “wide” or “long” format? Justify your answer. What would the table structure look like?
If she stored the data “wide,” then each column would be a species and most of her table would be 0s, because most species occur at only one or two sites. A long format would have columns for `Site`, `Species`, and `Count`, and each row would contain the count for a non-zero Site/Species combination. The resulting table might look something like this:
Site | Species | Count |
---|---|---|
S101 | Gila Monster | 1 |
S101 | Desert Tortoise | 3 |
S200 | Quail | 7 |
\(\vdots\) | \(\vdots\) | \(\vdots\) |
However, if we had the same species at each site, then the wide format would actually make the data input substantially easier because she would not have to repeatedly type (and potentially misspell) each species name.
4.3 Combining Actions
4.3.1 Example: Column Splitting
Sometimes we have columns where we actually have two pieces of information stored and we have to split the single column into multiple columns.
For example, we might have data about cities across the country, where the city name obnoxiously includes the state. It would be good to split the `Name` column into `City` and `State` columns. For example,
Name | Population |
---|---|
Phoenix, AZ | 1660272 |
Flagstaff, AZ | 73964 |
Tucson, AZ | 545975 |
and what we want is:
City | State | Population |
---|---|---|
Phoenix | AZ | 1660272 |
Flagstaff | AZ | 73964 |
Tucson | AZ | 545975 |
We could think of this splitting as a sequence of steps:

- Create a new column, called `City`, that takes the `Name` column and grabs all the characters before the comma.
- Create a new column, called `State`, that takes the `Name` column and grabs all the characters after the comma.
- Remove the now redundant `Name` column and arrange the columns in the order you want.
4.3.2 Example: Calculating Percentages
We might have a data set where we need to calculate the percentage of the data contributed by each group. For example, given a data set of high school student survey responses, we might want to know what percentage of the responses come from First Years, Sophomores, Juniors, and Seniors.
Year | Gender | Smoke | Height | SAT |
---|---|---|---|---|
Senior | M | No | 71 | 1210 |
Sophomore | F | Yes | 66 | 1150 |
FirstYear | M | No | 72 | 1110 |
Junior | M | No | 63 | 1120 |
Sophomore | F | No | 65 | 1170 |
Sophomore | F | No | 65 | 1150 |
We could think of this calculation as a sequence of steps:

1. Summarize the data to calculate the number of survey responses for each year. We’ll call this column `n`.
Year | n |
---|---|
FirstYear | 94 |
Junior | 35 |
Senior | 36 |
Sophomore | 195 |
2. Calculate the total number of student responses by summing across the years, creating a new column `Total_Num`.
Year | n | Total_Num |
---|---|---|
FirstYear | 94 | 360 |
Junior | 35 | 360 |
Senior | 36 | 360 |
Sophomore | 195 | 360 |
3. Finally, calculate a new column `Percent` = `n` / `Total_Num`.
Year | n | Total_Num | Percent |
---|---|---|---|
FirstYear | 94 | 360 | 0.261 |
Junior | 35 | 360 | 0.097 |
Senior | 36 | 360 | 0.1 |
Sophomore | 195 | 360 | 0.542 |
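The three steps above can be sketched in Python, starting from the per-year counts:

```python
# Step 1's result: the number of survey responses per year.
counts = {"FirstYear": 94, "Junior": 35, "Senior": 36, "Sophomore": 195}

# Step 2: the total number of responses across all years.
total = sum(counts.values())

# Step 3: each count divided by the total, rounded as in the table above.
percent = {year: round(n / total, 3) for year, n in counts.items()}
```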
4.4 Using Software
4.4.1 Excel
Microsoft Excel is a wonderful program that is used widely across many industries, and nearly everyone who works with a computer should have at least some familiarity with it. There are a number of spreadsheet programs, and Excel is just the most ubiquitous. However, there are some serious downsides to using spreadsheets.
This being said, knowing how to do simple formulas in Excel is surprisingly helpful.
- Excel cells are identified by the column/row combination. Columns are identified using the alphabet (the first column is `A`, the second is `B`). A range of cells is specified as UpperLeft:LowerRight. For example, `A2:B10` specifies the first two columns and rows 2 through 10.
- Formula cells have their contents starting with an `=`. If I were interested in calculating the average value of cells A1:A20, I would select the cell I want to store the result in and input the formula `= AVERAGE( A1:A20 )`.
- If you copy a formula and paste it into a new cell, the range gets shifted by however much you’ve moved the copied formula.
To prevent range shifting when copying formulas, specify the range while preceding each part that you don’t want to shift with a `$`. For example, `= AVERAGE( $A$1:$A$20 )` will always refer to this same range no matter where you paste the formula.
Finally, you can specify ranges across worksheets by prepending the sheet name and an exclamation point (!). For example, to specify cell `G10` in my `RawData` worksheet, we’d use the notation `RawData!G10`.
It is also incredibly useful to be comfortable with sorting and filtering in Microsoft Excel as well as Google Sheets. One extremely dangerous thing that Excel allows is sorting a single column while keeping the original order of the remaining columns. I’ve personally encountered data where this mistake was made. Be careful!
4.4.2 Tableau
Tableau comes with two pieces of software, `Tableau Prep Builder` and `Tableau Desktop`. The Prep software is intended to support the data tidying, cleaning, and summarization steps. We then save the result to a file and open the cleaned-up data in Tableau Desktop to do the visualization.
This course has access to both Tableau and Tableau Prep through an academic license.
- Go to https://www.tableau.com/tft/activation to download both products.
- Select each product download link to get started. When prompted, enter your school email address for Business E-mail.
- Activate with your product key which will be distributed via email and on Bblearn.
This license will work through the semester. You may continue using Tableau after the class is over by requesting your own one-year license through the Tableau for Students program.
4.4.2.1 Importing data
Often data will be stored in a comma separated values file that has a `.csv` file suffix. These types of files store data tables and use a comma to separate the columns. Spreadsheet programs will recognize these files and show columns for `Student Name`, `Exam 1`, `Exam 2`, and the `Final Exam`. However, if you open these files using a text editor, they’ll look something like this:
Student Name,Exam 1,Exam 2,Final Exam
Alice,87,87,81
Bob,91,88,85
Charlie,88,79,92
Below is a video showing how to open up this same data in Tableau Prep.
The import step gives a rotated view of the data and generally that doesn’t bother me, but for purposes of showing what the imported data looks like, we’ll add a cleaning step that doesn’t do anything except show the data. In general I won’t keep this step in my cleaning workflow, but it is handy to know how to do.
4.4.2.2 Tidying or Reshaping
Often source data is stored in a structure that isn’t useful for subsequent analysis. When the data isn’t a format of one observation per row and one variable per column, we need to fix those issues before proceeding.
4.4.2.2.1 Pivot Longer
When data is in the “wide” format, we might need to convert it into “long” format. For an example, suppose we have a gradebook with a few students.
Name | Exam 1 | Exam 2 | Final Exam |
---|---|---|---|
Alison | 87 | 87 | 81 |
Bob | 91 | 88 | 85 |
Charlie | 88 | 79 | 92 |
What we want to do is turn this data frame from a wide data frame into a long data frame. In MS Excel and Tableau, this is called pivoting. Essentially I’d like to create a data frame with three columns: `Name`, `Assessment`, and `Score`. That is to say that each exam datum really has three pieces of information: who it came from, which assessment it was, and what the score was. It doesn’t conceptually matter if I store it as 3 rows of 4 columns or as 9 rows of 3 columns, so long as there is a way to identify how a student scored on a particular assessment. So we want to reshape the three Exam columns into just two columns (Assessment and Score).
Name | Assessment | Score |
---|---|---|
Alison | Exam 1 | 87 |
Bob | Exam 1 | 91 |
Charlie | Exam 1 | 88 |
Alison | Exam 2 | 87 |
Bob | Exam 2 | 88 |
Charlie | Exam 2 | 79 |
Alison | Final Exam | 81 |
Bob | Final Exam | 85 |
Charlie | Final Exam | 92 |
4.4.2.2.2 Pivot Wider
There are times that we need to convert a “long” data set into “wide” format. This is exactly the reverse of the pivot longer action, and our example will be to undo the pivot from the previous section. Again in Tableau, this is called pivoting, but now we are turning rows into columns.
The tricky part is what variable goes into which pivot field. The top box specifies what the new column headers will be and the bottom box specifies what the cell values will be.
4.4.2.3 Column Splitting
Sometimes we have columns where we actually have two pieces of information stored and we have to split the single column into multiple columns.
For example, we might have data about cities across the country, where the city name obnoxiously includes the state. It would be good to split the `Name` column into `City` and `State` columns. For example,
Name | Population |
---|---|
Phoenix, AZ | 1660272 |
Flagstaff, AZ | 73964 |
Tucson, AZ | 545975 |
4.4.2.4 New Column via Calculation
Often we need to take a column and create some sort of calculation from it. For example, if we are given some information (say student height) in inches, we might want to calculate their height in centimeters by multiplying each height value by \(2.54\) and naming the result `Height (cm)`.
Grade | Gender | Height | Height (cm) |
---|---|---|---|
1st Grade | Male | 44 | 111.8 |
1st Grade | Male | 42 | 106.7 |
1st Grade | Female | 42 | 106.7 |
1st Grade | Female | 40 | 101.6 |
12th Grade | Male | 70 | 177.8 |
12th Grade | Male | 68 | 172.7 |
12th Grade | Male | 69 | 175.3 |
12th Grade | Female | 64 | 162.6 |
12th Grade | Female | 65 | 165.1 |
The issue in Tableau is to remember how to specify a column in a formula. The syntax is `[column_name]`, and then you can use that quantity in whatever formula you want.
4.4.2.5 Aggregation
Given a data set, we also want to be able to calculate summary statistics for various columns. For example, we might want to calculate the mean height for each grade level, or perhaps for each gender within each grade level. Notice that we end up with a data set with fewer rows.
Grade | Gender | Avg Height |
---|---|---|
12th Grade | Female | 64.5 |
12th Grade | Male | 69 |
1st Grade | Female | 41 |
1st Grade | Male | 43 |
4.4.2.6 New Column with Aggregation
There are many cases where we need to create a calculated column where the calculation involves some aggregated information. For example, if we have the number of males and females in each grade, we might want to calculate the percentage of males and females in each grade. To do this we need to aggregate the data to count the number of students in each grade, and then divide the number of males or females in each grade by the total number of students in the grade.
Grade | Gender | Grade_Gender_n |
---|---|---|
12th Grade | Female | 2 |
12th Grade | Male | 3 |
1st Grade | Female | 2 |
1st Grade | Male | 2 |
Grade | Gender | Grade_Gender_n | Grade_n | Proportion |
---|---|---|---|---|
12th Grade | Female | 2 | 5 | 0.4 |
12th Grade | Male | 3 | 5 | 0.6 |
1st Grade | Female | 2 | 4 | 0.5 |
1st Grade | Male | 2 | 4 | 0.5 |
Anytime you need to include an aggregated value in a formula, we need to specify the grouping information for calculating the aggregate. In Tableau this is done with a Level of Detail expression that names the grouping variable, the aggregation function, and the column.
The code is: `{FIXED [Group] : SUM([Value])}`
4.4.2.7 Unions
One way that we could “squish” two data sets together is if they have the same columns and we essentially just want to add new rows to a data set.
In this example we have two tables with identical column names and types and we just want to stack the two tables to create a single table that contains all the rows of data.
PersonID | Type | Handle |
---|---|---|
P0001 | Email | derek.sonderegger@nau.edu |
P0001 | Twitter | @D_Sonderegger |
P0002 | Email | aubrey.sonderegger@yahoo.com |
PersonID | Type | Handle |
---|---|---|
P0003 | Email | Elise.Sonderegger@gmail.com |
P0003 | Twitter | @UnicornsRule |
Taking a union of these two data sets just squishes them together vertically.
PersonID | Type | Handle |
---|---|---|
P0001 | Email | derek.sonderegger@nau.edu |
P0001 | Twitter | @D_Sonderegger |
P0002 | Email | aubrey.sonderegger@yahoo.com |
P0003 | Email | Elise.Sonderegger@gmail.com |
P0003 | Twitter | @UnicornsRule |
4.4.2.8 Joins
There are many cases where we have two or more data sets that are related and we
need to squish them together to make a table with more columns. For example we
might have a data set People
that contains a unique numeric code for each person,
first and last names, and birth dates. We might also have a data set of Contacts
which has the person’s unique numeric code and the contact type and value.
PersonID | First.Name | Birthday |
---|---|---|
P0001 | Derek | 01/10/1976 |
P0002 | Aubrey | 02/14/1980 |
P0004 | Casey | 02/14/2015 |
PersonID | Type | Handle |
---|---|---|
P0001 | Email | derek.sonderegger@nau.edu |
P0001 | Twitter | @D_Sonderegger |
P0002 | Email | aubrey.sonderegger@yahoo.com |
The `PersonID` column links the two tables. But we now have to ask what we should do about Casey. There are really two options:
Join Type | Description |
---|---|
Inner Join | Only include rows for individuals that have information in both tables. |
Full Join | Include a row for every row in either of the tables being joined. In this case, this results in a row for Casey being included, but with the contact information left empty. |
A join squishes the two data sets together horizontally while making sure the rows match up via a matching column that occurs in each data set. The inner join only returns rows that have data in both data sets.
PersonID | First.Name | Birthday | Type | Handle |
---|---|---|---|---|
P0001 | Derek | 01/10/1976 | Email | derek.sonderegger@nau.edu |
P0001 | Derek | 01/10/1976 | Twitter | @D_Sonderegger |
P0002 | Aubrey | 02/14/1980 | Email | aubrey.sonderegger@yahoo.com |
The outer join returns all possible rows that occur in either data set. Notice that Casey has a row in the `People` data set but doesn’t have a corresponding row in the `Contacts` table. So his row in the outer join has missing data for the contact information.
PersonID | First.Name | Birthday | Type | Handle |
---|---|---|---|---|
P0001 | Derek | 01/10/1976 | Email | derek.sonderegger@nau.edu |
P0001 | Derek | 01/10/1976 | Twitter | @D_Sonderegger |
P0002 | Aubrey | 02/14/1980 | Email | aubrey.sonderegger@yahoo.com |
P0004 | Casey | 02/14/2015 | NA | NA |
4.5 Exercises
- We are given information about the maximum daily temperature from a weather station in Flagstaff, AZ. The file is available at the GitHub site that this book is hosted on. You can find it here: https://github.com/dereksonderegger/141/raw/master/data-raw/FlagMaxTemp.csv. Within your browser, you should be able to save the linked file using the “Save As…” option.
This file is in a wide format, where each row represents a month and the
columns 1, 2, …, 31 represent the day of the month the observation was made.
    - Convert the data set to the long format where the data has only four columns: `Year`, `Month`, `Day`, `Tmax`.
    - Calculate the average monthly maximum temperature for each Month in the data set (so there will be 365 mean maximum temperatures). There are several days that have missing data, represented by the value `null`. If you have a problem taking the mean with them included, you likely haven’t converted the temperature from a character string to a numeric value.
    - Convert the average monthly maximums back to a wide data format where each line represents a year and there are 12 columns of temperature data (one for each month) along with a column for the year. There will be a couple of months that still have missing data because the weather station was out of commission for those months and there was NO data for the entire month.
    - Save this table as a .csv file.
- A common task is to take a set of data that has multiple categorical variables and create a table of the number of cases for each combination. An introductory statistics textbook contains a data set summarizing student surveys from several sections of an intro class. The two variables of interest for us are `Gender` and `Year`, which are the student’s gender and year in college.

    Download the data set as a `.csv` file from the following website: http://www.lock5stat.com/datasets/StudentSurvey.csv

    Notice that there are two survey responses that did not give their year in school. Perform a filter step to remove any row with the value `null` in the `Year` column.

    Produce a data set with eight rows that contains the number of responses for each gender:year combination.

    Using a pivot command, produce a table of the number of responses in the following form:

    Gender | First Year | Sophomore | Junior | Senior |
    ---|---|---|---|---|
    Female | | | | |
    Male | | | | |

    Save this table as a .csv file.
- In the `data-raw` directory on this book’s GitHub site, I keep a bunch of raw data: https://github.com/dereksonderegger/141/tree/master/data-raw. You can download any of those files by clicking on the file. For some files you’ll be offered the option to download the file directly. For files that can be viewed in the browser, it will show a preview, and the `Raw` option lets you see (and subsequently “Save As” to download) the file. For this problem we will consider the `Murders_State.csv` file along with the `State_Pop.csv` file.
    - Download both data sets.
    - For the `Murders` data, select only the 2018 year.
    - Join this data to the population data, again for the 2018 year.
    - Calculate the murder rate as the number of murders per 100,000 people.
    - Save this table as a .csv file.
- Read chapter 3 in How Charts Lie by Alberto Cairo. This chapter is about charts that lie by using dubious data. In this chapter, Cairo’s examples are of people accepting data and graphs where the data was either ad-hoc and not well thought out, or maliciously misrepresented.
- What mistake did Dr Cairo make before he shared the map of metal bands? Why was he not initially suspicious?
- Explain how the CCC percentages on violent crime and race of the victim and offender were misleading due to a basic demographic issue.
- Explain why Kansas is disproportionately represented in graphs involving on-line activities.
“Illustrations from the Openscapes blog Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst”↩︎