Chapter 3 Importing and Cleaning Data
In the previous two chapters, we used mostly built-in data sets to practice visualizing and transforming data. In practice, of course, you’ll have to import a data set into R before starting your analysis. Once a data set is imported, it’s often necessary to spruce it up by fixing broken formatting, changing variable names, deciding what to do with missing values, etc. We address these important steps in this chapter.
We’ll need tidyverse:
library(tidyverse)
We’ll also need two new libraries used for importing data sets. They’re installed as part of the tidyverse, but you’ll have to load them yourself.
library(readxl) # For importing Excel files
library(readr) # For importing csv (comma separated values) files
3.1 Importing Tabular Data
All of our data sets in this course are tabular, meaning that they consist of several rows of observations and several columns of variables. Tabular data is usually entered into spreadsheet software (such as Excel or Google Sheets), where it can be manipulated using the built-in functionality of the software. Importing the data into R, however, allows for greater flexibility and versatility when performing your analysis.
RStudio makes it very easy to import data sets, although the first step is placing your spreadsheet file in the right directory.
Click here to view a very small sample csv file. Our goal is to import this into R. We can do so as follows:
Download the file by clicking the download button in the upper-right corner of the screen.
The next step is to get this file into R.
- If you’re using the RStudio desktop version, move this downloaded file to your working directory in R. If you’re not sure what your working directory is, you can enter getwd() to find out.
- If you’re using Posit Cloud, click the “Files” tab in the lower-right pane, and then click the “Upload” button. Then click “Choose File” and browse to the location of the downloaded csv file.
In either case (desktop or cloud), the file should now appear in your list of files.
Click on the csv file in your file list, and then choose “Import Dataset.” You’ll see a preview of what the data set will look like in R, a window of importing options, and the R code used to do the importing.
In the “Import Options” window, choose the name you want to use for the data set once it’s in R. Enter this in the “Name” field. Also, decide whether you want to use the values in the first row in your csv file as column names. If so, make sure the “First Row as Names” box is checked. There are other options in this window as well that you’ll sometimes need to adjust, but these two will get you started.
Now you’re ready to import. You can do so by either clicking the “Import” button at the bottom right or by copying and pasting the code into your document. I’d recommend copying and pasting the code, as it allows you to quickly re-import your csv file in case you have to start your analysis over. The code generated by the import feature is:
library(readr)
sample_csv <- read_csv("sample_csv.csv")
View(sample_csv)
readr is the package needed to access the read_csv function. If you’ve already loaded it, it’s unnecessary to do so again. Also, the View(sample_csv) command is, of course, optional. All you really need to import the file is:
sample_csv <- read_csv("sample_csv.csv")
sample_csv
## # A tibble: 3 x 2
## state capital
## <chr> <chr>
## 1 Michigan Lansing
## 2 California Sacramento
## 3 New Jersey Trenton
Once you import it, you should see sample_csv listed in the Global Environment pane, and you can now start to analyze it like any other data set.
Data entry is usually done in Excel (or other spreadsheet software) and then imported into R for analysis. However, you sometimes might want to enter tabular data directly into R, especially when the data set is small. This is often the case when you need a sample data set to test a piece of code.
You can do so in R using the tibble or tribble (= transposed tibble) functions from tidyverse.
small_data_set_tibble <- tibble(
  x = c(3, 4, 2, -5, 7),
y = c(5, 6, -1, 0, 5),
z = c(-3, -2, 0, 1, 6)
)
small_data_set_tibble
## # A tibble: 5 x 3
## x y z
## <dbl> <dbl> <dbl>
## 1 3 5 -3
## 2 4 6 -2
## 3 2 -1 0
## 4 -5 0 1
## 5 7 5 6
small_data_set_tribble <- tribble(
  ~name, ~points, ~year,
"Wilt Chamberlain", 100, 1962,
"Kobe Bryant", 81, 2006,
"Wilt Chamberlain", 78, 1961,
"Wilt Chamberlain", 73, 1962,
"Wilt Chamberlain", 73, 1962,
"David Thompson", 73, 1978
)
small_data_set_tribble
## # A tibble: 6 x 3
## name points year
## <chr> <dbl> <dbl>
## 1 Wilt Chamberlain 100 1962
## 2 Kobe Bryant 81 2006
## 3 Wilt Chamberlain 78 1961
## 4 Wilt Chamberlain 73 1962
## 5 Wilt Chamberlain 73 1962
## 6 David Thompson 73 1978
Notice that the column names do not require quotation marks, but any non-numeric values in the data set do.
3.1.1 Exercises
1. Click here to access another sample spreadsheet file. Download it as a .xlsx file and then import it into R. (Copy and paste the code generated for the import.) Why is the code a little different from that used to import the spreadsheet from above?
2. Click here for yet another sample spreadsheet. Download it and import it into R, but when you do, give the imported data set a more meaningful name than the default name. What else will you have to do to make the data set import correctly? (Copy and paste the generated import code.)
3. When you chose not to use the first row as the column names in the previous exercise, what default names did read_excel assign to the columns? By either checking the read_excel documentation in R (?read_excel) or by searching the internet, find out how to override the default column names. Then re-import the spreadsheet from the previous exercise and assign more meaningful column names.
4. Use the tribble function to enter your current course schedule directly into R. Include a column for “course” (e.g., MAT 210), “title” (e.g., Data Analysis with R), and “credits” (e.g., 4).
3.2 Data Types
You may have noticed that the read_csv, read_excel, tibble, and tribble functions all produce data sets in the structure of a “tibble.” Recall that tibbles are data structures in tidyverse that store tabular data sets in readable, convenient formats.
The built-in iris data set, available in base R, is a data frame, not a tibble. Load iris and note its appearance:
iris
To compare this data frame to a tibble, we can coerce iris into a tibble and then reload it:
iris_tibble <- as_tibble(iris)
iris_tibble
## # A tibble: 150 x 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # ... with 140 more rows
## # i Use `print(n = ...)` to see more rows
We can’t see the entire data set in the console anymore, just a preview. However, we’re told how many rows and columns the data set has without having to scroll, and, more importantly, we’re told the data type of each variable in the data set.
Knowing the data type of a variable is important, as different data types can behave very differently. The three most commonly occurring data types we’ll encounter are summarized below.
type | abbreviation in R | objects
---|---|---
integer | int | whole numbers
double-precision floating point number | dbl | real numbers, possibly containing decimal digits
character | chr | characters or strings of characters
You can check the types of the variables in a tibble by displaying it in the console, or you can use the type_sum function. Suppose we want to know the type of the variable hwy in the mpg data set. (Recall that mpg$hwy extracts just the vector of entries in the hwy column from mpg.) If we don’t want to display the entire tibble, we can do this instead:
type_sum(mpg$hwy)
## [1] "int"
When you import a data set or create one with tibble or tribble, R can usually determine what the variable types are. However, look again at small_data_set_tribble, which we created above. The type of points and year is “double.” This is because R converts all imported or input numeric data, whether there are decimal places or not, to the double type since it’s more inclusive and is handled the same way mathematically. If we really want our numeric data to be treated like integers, there are two options:
First, if we’re entering the data directly into R using tibble or tribble, we can specify that the numbers are integers by including the suffix “L.” For example:
small_data_set2 <- tribble(
  ~name, ~points, ~year,
"Wilt Chamberlain", 100L, 1962L,
"Kobe Bryant", 81L, 2006L,
"Wilt Chamberlain", 78L, 1961L,
"Wilt Chamberlain", 73L, 1962L,
"Wilt Chamberlain", 73L, 1962L,
"David Thompson", 73L, 1978L
)
small_data_set2
## # A tibble: 6 x 3
## name points year
## <chr> <int> <int>
## 1 Wilt Chamberlain 100 1962
## 2 Kobe Bryant 81 2006
## 3 Wilt Chamberlain 78 1961
## 4 Wilt Chamberlain 73 1962
## 5 Wilt Chamberlain 73 1962
## 6 David Thompson 73 1978
Notice that points and year are now integers.
Second, we can coerce a variable of one type into a variable of another type using the as.<TYPE> function. For example:
small_data_set3 <- small_data_set_tribble %>%
  mutate(points = as.integer(points),
         year = as.integer(year))
small_data_set3
## # A tibble: 6 x 3
## name points year
## <chr> <int> <int>
## 1 Wilt Chamberlain 100 1962
## 2 Kobe Bryant 81 2006
## 3 Wilt Chamberlain 78 1961
## 4 Wilt Chamberlain 73 1962
## 5 Wilt Chamberlain 73 1962
## 6 David Thompson 73 1978
One other data type worth mentioning here is factor. Notice that when we converted iris to the tibble iris_tibble above, the type of Species was abbreviated fct for “factor.”
type_sum(iris_tibble$Species)
## [1] "fct"
To begin to see what factors are, let’s create a small data set:
schedule <- tribble(
  ~day, ~task,
"Tuesday", "wash car",
"Friday", "doctor's appointment",
"Monday", "haircut",
"Saturday", "laundry",
"Wednesday", "oil change"
)
schedule
## # A tibble: 5 x 2
## day task
## <chr> <chr>
## 1 Tuesday wash car
## 2 Friday doctor's appointment
## 3 Monday haircut
## 4 Saturday laundry
## 5 Wednesday oil change
Suppose we wanted to organize this list of tasks by the day. We could try:
arrange(schedule, day)
## # A tibble: 5 x 2
## day task
## <chr> <chr>
## 1 Friday doctor's appointment
## 2 Monday haircut
## 3 Saturday laundry
## 4 Tuesday wash car
## 5 Wednesday oil change
As you can see, this organizes by day alphabetically, which is certainly not what we want. To organize by day chronologically, we could assign an ordering to each day of the week so that the arrange function would sort it the way we want. We can do this by coercing day into the “factor” type.
We first have to tell R the ordering we want. We do so by entering a vector with the days of the week in chronological order. This ordering establishes the levels of the day variable:
day_levels <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
We can then turn the day column into a factor with the above levels as follows:
schedule2 <- schedule %>%
  mutate(day = factor(day, levels = day_levels))
type_sum(schedule2$day)
## [1] "fct"
Now re-run the sort above:
arrange(schedule2, day)
## # A tibble: 5 x 2
## day task
## <fct> <chr>
## 1 Monday haircut
## 2 Tuesday wash car
## 3 Wednesday oil change
## 4 Friday doctor's appointment
## 5 Saturday laundry
When we define the levels of a factor, the values of that variable are replaced internally with integers that are determined by the levels. In the example above, after we turned the day column into a factor, the value “Friday” became a 6 since “Friday” was the sixth entry in the level vector. “Monday” became a 2, “Saturday” a 7, etc. The names of the days became aliases for the integers from 1 to 7 that they represent. That’s why the arrange function was able to sort chronologically; it was just sorting a column of numbers from 1 to 7 in ascending order.
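You can see these internal codes for yourself by stripping the labels off with as.integer. (This is a quick check; it assumes schedule2 from above is still in your environment.)

```r
# Each day is stored as its position in the level vector (Sunday = 1, ..., Saturday = 7)
as.integer(schedule2$day)
## [1] 3 6 2 7 4
```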
3.2.1 Exercises
You will need the nycflights13 library for these exercises.
1. The flights data set has a variable with a type that was not covered in this section. What is the variable, and what is its type?
2. Create a data set called mpg2 which is an exact copy of mpg except that the hwy and cty variables are doubles rather than integers.
3. Answer the following questions by experimenting. (Try just coercing single values rather than entire columns of data sets. For example, for part (a), try as.character(7L).)
- Can you coerce an integer into a character? What happens when you try?
- Can you coerce a double into a character? What happens when you try?
- Can you coerce a double into an integer? What happens when you try?
- Can you coerce a character into either a double or an integer? What happens when you try?
4. Download and import the data set found here.
- What is the type of the variable day? Change it to a more appropriate type.
- Use arrange to sort the data set so that the records are in chronological order. (You’ll have to change month to a different type.)
3.3 Renaming Columns
For the next few sections, we’ll be using the data set found here. You should download it and then import it into R using the name high_school_data. It contains high school data for incoming college students as well as data from their first college semester.
high_school_data
## # A tibble: 278 x 10
## `High School GPA` `ACT Comp` SAT Co~1 High ~2 Bridg~3 FA20 ~4 FA20 ~5 Acad ~6 Race Ethni~7
## <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 2.72 14 800 0.8880~ yes 13 13 Academ~ White Hispan~
## 2 4 NA 1380 0.8389~ <NA> 15 15 Dean's~ White Non Hi~
## 3 3.08 NA 1030 0.8940~ <NA> 14 13 Good S~ <NA> Hispan~
## 4 3.69 NA 1240 <NA> <NA> 17 17 Good S~ Asian Non Hi~
## 5 NA NA NA <NA> <NA> 13 13 Dean's~ <NA> Non Re~
## 6 3.47 NA 1030 0.5330~ <NA> 16 16 Good S~ White Non Hi~
## 7 3.43 27 NA <NA> <NA> 15 15 Dean's~ White Non Hi~
## 8 3.27 NA 1080 <NA> <NA> 15 15 Good S~ Blac~ Non Hi~
## 9 3.76 NA 1120 0.9160~ <NA> 15 15 Good S~ White Non Hi~
## 10 3.23 33 NA 0.7970~ <NA> 15 11 Academ~ Asian Non Hi~
## # ... with 268 more rows, and abbreviated variable names 1: `SAT Comp`,
## # 2: `High School Quality Academic`, 3: `Bridge Program?`,
## # 4: `FA20 Credit Hours Attempted`, 5: `FA20 Credit Hours Earned`,
## # 6: `Acad Standing After FA20 (Dean's List, Good Standing, Academic Warning, Academic Probation, Academic Dismissal, Readmission on Condition)`,
## # 7: Ethnicity
## # i Use `print(n = ...)` to see more rows
You probably immediately notice that one of the variable names, the one that starts with Acad Standing After FA20..., is extremely long. (You can’t see the column’s data in the clipped tibble in the console, but it’s one of the off-screen columns.) A data set should have variable names that are descriptive without being prohibitively long. Column names are easy to change in Excel, Google Sheets, etc., but you can also change them in R using the tidyverse function rename.
First, a few notes about the way R handles column names. By default, column names are not allowed to contain non-syntactic characters, i.e., any character other than letters, numbers, ., and _. Also, column names cannot begin with a number or an underscore _. In particular, column names cannot have spaces. It’s best to use an underscore in place of a space in a column name, but if you want to keep a space (maybe for aesthetic reasons), you can override these restrictions by enclosing the column name in backticks. For example, if we want to rename the long variable above, then `Acad Standing` is allowed, but Acad Standing is not.
Also, when the data set you’re importing has a column name with a space in it, the space will be retained in R. However, to refer to that column, you’ll have to enclose the column name in backticks. For example, high_school_data$`ACT Comp` will return the ACT Comp column in high_school_data, but high_school_data$ACT Comp will produce an error message.
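As a quick illustration, here’s a tiny sketch (the tibble demo and its column names are hypothetical):

```r
demo <- tibble(`Test Score` = c(88, 95), id = c(1, 2))
demo$`Test Score`    # backticks let us refer to the non-syntactic name
## [1] 88 95
# demo$Test Score    # without the backticks, this is a syntax error
```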
With the above in mind, renaming variables is actually very easy. The rename function is one of the transformation functions like filter, select, etc., and works the same way. Let’s rename the really long Acad Standing After FA20... column as just Acad_Standing:
high_school_data_clean <- high_school_data %>%
  rename("Acad_Standing" = `Acad Standing After FA20 (Dean's List, Good Standing, Academic Warning, Academic Probation, Academic Dismissal, Readmission on Condition)`)
high_school_data_clean
## # A tibble: 278 x 10
## `High School GPA` `ACT Comp` SAT Co~1 High ~2 Bridg~3 FA20 ~4 FA20 ~5 Acad_~6 Race Ethni~7
## <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 2.72 14 800 0.8880~ yes 13 13 Academ~ White Hispan~
## 2 4 NA 1380 0.8389~ <NA> 15 15 Dean's~ White Non Hi~
## 3 3.08 NA 1030 0.8940~ <NA> 14 13 Good S~ <NA> Hispan~
## 4 3.69 NA 1240 <NA> <NA> 17 17 Good S~ Asian Non Hi~
## 5 NA NA NA <NA> <NA> 13 13 Dean's~ <NA> Non Re~
## 6 3.47 NA 1030 0.5330~ <NA> 16 16 Good S~ White Non Hi~
## 7 3.43 27 NA <NA> <NA> 15 15 Dean's~ White Non Hi~
## 8 3.27 NA 1080 <NA> <NA> 15 15 Good S~ Blac~ Non Hi~
## 9 3.76 NA 1120 0.9160~ <NA> 15 15 Good S~ White Non Hi~
## 10 3.23 33 NA 0.7970~ <NA> 15 11 Academ~ Asian Non Hi~
## # ... with 268 more rows, and abbreviated variable names 1: `SAT Comp`,
## # 2: `High School Quality Academic`, 3: `Bridge Program?`,
## # 4: `FA20 Credit Hours Attempted`, 5: `FA20 Credit Hours Earned`, 6: Acad_Standing,
## # 7: Ethnicity
## # i Use `print(n = ...)` to see more rows
Notice that the syntax for rename is rename(<"NEW NAME"> = <OLD NAME>). Notice also that in the above code chunk, a new data set high_school_data_clean was created to house the renamed column. The original data set, high_school_data, will still have the original long column name. In general, it’s good practice not to change the originally imported data set. It’s safer to always create a new data set to accept whatever changes you need to make.
3.4 Missing Values
The other thing that stands out right away in high_school_data is the abundance of NAs. This is an issue in most data sets you will analyze. The way you handle the NAs in your analysis will depend on the context, and we’ll see various methods as we proceed in the course. (Recall that we’ve already seen one way to work around NAs, namely the na.rm = TRUE argument we can include when we calculate summary statistics, such as means, sums, and standard deviations.)
Knowing the scope of the missing values in a data set is important when you first start to explore and clean. We can count how many NAs there are in a given column using the is.na function as follows.
Let’s start with a small sample vector, some of whose entries are missing:
v <- c(4, "abc", NA, 16, 3.1, NA, 0, -3, NA)
We can apply is.na to the entire vector; it will be applied individually to each entry. (We say that the is.na function is therefore “vectorized.”)
is.na(v)
## [1] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE
We see the value TRUE in the entry spots that had an NA in v. Remember that R treats the logical values TRUE and FALSE arithmetically by giving TRUE the value 1 and FALSE the value 0. Thus the sum of the entries in is.na(v) is just the number of TRUE values, and hence the number of NAs in v:
sum(is.na(v))
## [1] 3
We can apply this process to the columns of high_school_data, replacing the vector v with any of the column vectors. For example, let’s count how many missing values there are in the SAT Comp column. (Notice the backticks around SAT Comp in the code below. Recall that these are necessary since SAT Comp has a space in it.)
sum(is.na(high_school_data$`SAT Comp`))
## [1] 40
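If you’d like an NA count for every column at once, one convenient option (a sketch; several equivalent approaches exist) is to apply is.na to the whole data set and take column sums:

```r
# is.na works on an entire data frame; colSums then totals each column separately
colSums(is.na(high_school_data))
```

The result is a named vector with one NA count per column.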
Since NA stands for “not available,” making the decision to change an NA to an actual value runs the risk of compromising the integrity of the data set. An analyst should be wary of “playing God” with the data. However, we can safely convert NAs to actual values when it’s clear what the NAs really mean or when we need to assign them a certain value in order to make a calculation work.
In high_school_data, the Bridge Program? column indicates whether a student was required to attend the summer Bridge Program as a preparation for taking college classes. The missing values in this column are clearly there because the person who entered the data meant it to be understood that a blank cell means “no.” While this convention is immediately understood by anyone viewing the data set, it sets up potential problems when doing our analysis. It’s always a better practice to reserve NA for values that are truly missing rather than using it as a stand-in for some known value.
We can replace NAs in a column with an actual value using the transformation function mutate together with replace_na. Let’s replace the NAs in the Bridge Program? column with “no.” We’ll display only this cleaned-up column so we can check our work.
high_school_data_clean <- high_school_data_clean %>%
  mutate(`Bridge Program?` = replace_na(`Bridge Program?`, "no"))
high_school_data_clean %>%
  select(`Bridge Program?`)
## # A tibble: 278 x 1
## `Bridge Program?`
## <chr>
## 1 yes
## 2 no
## 3 no
## 4 no
## 5 no
## 6 no
## 7 no
## 8 no
## 9 no
## 10 no
## # ... with 268 more rows
## # i Use `print(n = ...)` to see more rows
3.4.1 Exercises
1. Do you feel confident in replacing any of the NAs in any of the columns besides Bridge Program?? Why or why not?
2. Run the command summary(high_school_data). One of the many things you see is the number of NAs in some of the columns. For what types of columns is this NA count missing?
3. For the columns from the previous exercise that did not produce an NA count, find the number of NAs using the method from this section.
4. A rough formula for converting an ACT score to a comparable SAT score is \(\textrm{SAT} = 41.6 \times \textrm{ACT} + 102.4\).
- Use this formula to create a new column in high_school_data_clean that contains the converted ACT score for each student who took the ACT.
- Create another new column in high_school_data_clean that contains the better of each student’s SAT and converted ACT score. (Hint: Look up the pmax function.) You’ll have to think about how to handle the many NAs in the ACT Comp and SAT Comp columns.
5. This exercise requires the Lahman library. In baseball, a “plate appearance” is any instance during a game of a batter standing at home plate being pitched to. Only some plate appearances count as “at bats,” namely the ones for which the batter either gets a hit or fails to reach base through some fault of his own. For example, when a player draws a walk (which is the result of four bad pitches) or is hit by a pitch, these plate appearances do not count as at bats. Or if a batter purposely hits the ball in such a way that he gets out so that other base runners can advance, this type of plate appearance is also not counted as an at bat. The idea is that a player’s batting average (hits divided by at bats) should not be affected by walks, being hit by pitches, or sacrifice hits. A player’s plate appearance total is the sum of at bats (AB), walks (BB), intentional walks (IBB), number of times being hit by a pitch (HBP), sacrifice hits (SH), and sacrifice flies (SF).
- Add a column to Batting that records each player’s plate appearance total.
- Why are there so many NAs in the plate appearance column added in part (a)?
- For each of the six variables that contribute to the plate appearance total, determine the number of missing values.
- For all of the variables from part (c) that had missing values, replace the missing values with 0.
- Re-do part (a) and verify that there are no missing values in your new plate appearance column.
- Who holds the record for most plate appearances in a single season?
- Who holds the record for most plate appearances throughout his entire career?
3.5 Detecting Entry Errors
It’s very common for data entered into a spreadsheet manually to have a few entry errors, and there are standard methods for detecting these errors. One of the first things to look for is numerical data points that seem way too big or way too small. A good place to start is with the summary statistics. (We’ll be using the data set high_school_data_clean throughout this section, which is the partially cleaned version of the original high_school_data with the renamed Acad_Standing column and with the NAs in the Bridge Program? column replaced by "no".)
summary(high_school_data_clean)
## High School GPA ACT Comp SAT Comp High School Quality Academic
## Min. :2.279 Min. : 4.05 Min. : 100 Length:278
## 1st Qu.:3.225 1st Qu.:20.00 1st Qu.:1020 Class :character
## Median :3.625 Median :23.00 Median :1120 Mode :character
## Mean :3.534 Mean :22.62 Mean :1116
## 3rd Qu.:3.930 3rd Qu.:26.00 3rd Qu.:1220
## Max. :4.650 Max. :33.00 Max. :1480
## NA's :2 NA's :205 NA's :40
## Bridge Program? FA20 Credit Hours Attempted FA20 Credit Hours Earned Acad_Standing
## Length:278 Min. : 0.00 Min. : 0.00 Length:278
## Class :character 1st Qu.: 14.00 1st Qu.:13.00 Class :character
## Mode :character Median : 15.00 Median :15.00 Mode :character
## Mean : 15.06 Mean :13.94
## 3rd Qu.: 16.00 3rd Qu.:16.00
## Max. :150.00 Max. :19.00
##
## Race Ethnicity
## Length:278 Length:278
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
This compilation is most helpful for continuous data. We’ll deal with categorical data separately. Notice that the maximum value of FA20 Credit Hours Attempted is quite a bit bigger than the third quartile value (and totally unrealistic anyway). This is certainly an entry error. A visual way to detect this would be to plot the distribution of the FA20 Credit Hours Attempted variable. Since this is a continuous variable, we’ll use a histogram. (Recall that categorical distributions are visualized with bar graphs.)
ggplot(data = high_school_data_clean) +
geom_histogram(mapping = aes(x = `FA20 Credit Hours Attempted`))
This histogram is skewed very far to the right with a huge gap between the right-most bar and the nearest bar to its left. This is another possible sign of an entry error.
We can drill down and see what’s going on with this point by sorting the FA20 Credit Hours Attempted variable in descending order. In the code below, head is a useful function that displays only the beginning of a data set. (There’s also a tail function that displays the end of the data set.) By default, head displays the first 6 records, but you can override this and specify the number you want to display. Let’s display the records with the ten highest values of FA20 Credit Hours Attempted. (The reason for select is to make the FA20 Credit Hours Attempted column visible.)
head(arrange(high_school_data_clean, desc(`FA20 Credit Hours Attempted`)) %>%
       select(`FA20 Credit Hours Attempted`,
              `FA20 Credit Hours Earned`,
              Acad_Standing), 10)
## # A tibble: 10 x 3
## `FA20 Credit Hours Attempted` `FA20 Credit Hours Earned` Acad_Standing
## <dbl> <dbl> <chr>
## 1 150 15 Good Standing
## 2 19 19 Dean's List
## 3 18 18 Dean's List
## 4 18 18 Dean's List
## 5 18 17 Good Standing
## 6 18 18 Good Standing
## 7 18 18 Good Standing
## 8 18 18 Dean's List
## 9 18 18 Dean's List
## 10 18 18 Dean's List
Since we can see the entire row with the error, we can also get an insight into what probably happened. The student earned 15 credit hours and finished the semester in good academic standing, which likely means they didn’t fail any classes. Therefore the person doing the data entry probably absent-mindedly entered 150 instead of 15.
It’s a minor judgement call, but it’s probably safe to change the 150 to 15. (The other option would be to ignore this row altogether, which is not ideal.) The easiest way to do this is to make the change in the original spreadsheet and then re-import it. However, we can also make the change directly in R. We just have to see how to access an entry in a specific row and column of our data set.
First, let’s see how to pick out a specific entry from a vector. We’ll use the vector w to illustrate:
w <- c(4, -2, 2.4, 8, 10)
To extract the entry in, let’s say, the fourth spot in w, we just enter:
w[4]
## [1] 8
In a tibble, each column is a vector, so we can access the entry in a given spot the same way. Suppose we want the 27th entry in the FA20 Credit Hours Attempted column of high_school_data_clean:
high_school_data_clean$`FA20 Credit Hours Attempted`[27]
## [1] 14
If we instead know the actual value in a vector for which we’re searching but not the spot it’s in (which will be the situation as we search for the 150 attempted credit hours value), we can use the which function. Suppose for the vector w above, we know that one of the entries is 2.4, but we don’t know which spot it’s in. We can locate it as follows:
which(w == 2.4)
## [1] 3
We thus see that 2.4 is the entry in the third spot in w. Notice that we used a double equal sign ==. This is because the argument for which must be a truth condition on the vector. The which function will return the location of every entry that satisfies that truth condition.
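For instance, using the vector w from above with a condition that several entries satisfy:

```r
which(w > 2)    # every spot whose entry exceeds 2
## [1] 1 3 4 5
```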
We can use this method to find where 150 is in FA20 Credit Hours Attempted. Let’s store the spot number in the variable spot_150 so we can refer to it in the follow-up.
spot_150 <- which(high_school_data_clean$`FA20 Credit Hours Attempted` == 150)
spot_150
## [1] 74
So our problematic 150 value is in the 74th row. We can now make the change. We just have to overwrite the 150 in the FA20 Credit Hours Attempted column and row number 74 with 15.
high_school_data_clean$`FA20 Credit Hours Attempted`[spot_150] <- 15
Entry errors for categorical variables can be harder to detect than for continuous ones. In this case, entry errors often take the form of inconsistent naming. For example, let’s look at the Race column in high_school_data_clean. One way to check for inconsistencies is to use group_by and summarize to see how many times each value in Race shows up in the data set:
high_school_data_clean %>%
  group_by(Race) %>%
  summarize(count = n())
## # A tibble: 7 x 2
## Race count
## <chr> <int>
## 1 American/Alaska Native 5
## 2 Asian 9
## 3 Black or African American 11
## 4 Black/African American 1
## 5 Hawaiian/Pacific Islander 1
## 6 White 230
## 7 <NA> 21
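As an aside, dplyr’s count function collapses this group_by/summarize pattern into a single step (the counts land in a column named n rather than count):

```r
# Equivalent shortcut for the grouped summary above
high_school_data_clean %>%
  count(Race)
```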
We should especially pay attention to the values with low counts since these might indicate inconsistencies. Notice that “Black or African American” and “Black/African American” are both listed as distinct values, although this is certainly because of an inconsistent naming system. Since “Black/African American” only shows up once, it is probably a mistake and should be changed to “Black or African American.” We can make this change with the str_replace function within mutate. The syntax is str_replace(<VARIABLE>, <VALUE TO REPLACE>, <NEW VALUE>):
high_school_data_clean <- high_school_data_clean %>%
  mutate(Race = str_replace(Race, "Black/African American", "Black or African American"))
Running the grouped summary again on high_school_data_clean shows that the problem has been fixed:
high_school_data_clean %>%
  group_by(Race) %>%
  summarize(count = n())
## # A tibble: 6 x 2
## Race count
## <chr> <int>
## 1 American/Alaska Native 5
## 2 Asian 9
## 3 Black or African American 12
## 4 Hawaiian/Pacific Islander 1
## 5 White 230
## 6 <NA> 21
3.5.1 Exercises
1. Look for possible entry errors for other continuous variables in high_school_data_clean. Explain why you think there’s an entry error by:
- Citing the summary statistics.
- Referring to a visualization of the variable’s distribution.
- Sorting the data set by the variable and displaying the first or last few entries.
2. For any of the entry errors you found in the previous exercise, decide whether you think you should fix them. If so, proceed to do so, but if not, explain why.
3. Suppose the owner of the data set tells you that the 100 in the SAT Comp column was supposed to be 1000. Make this change directly in R.
4. We saw above that <VECTOR>[<NUMBER>] returns the entry from <VECTOR> in spot <NUMBER>. Suppose we want to return the entries from several of the spots within <VECTOR>. For the vector t <- c("a", "b", "c", "d", "e"), try to come up with a single-line command that will return the entries in spots 1, 4, and 5.
5. For this exercise, you will need the Lahman package.
- Use the which function to return all of the row numbers from Batting that have a home run (HR) value from 50 through 59.
- Use part (a) and the idea from Exercise 4 to write a single-line command that will return a list of the players who had a home run total from 50 through 59. (A player could be listed more than once.)
3.6 Tidying Data
Before beginning any data analysis project, we would ideally like to have data that is properly named, has as few missing values as possible, and is devoid of entry errors. We also want our data to be tidy, meaning that every variable must have its own dedicated column, every observation must have its own dedicated row, and every value must have its own dedicated cell. In this section, we’ll see how to deal with untidy data by examining four built-in untidy data sets. First, here’s an example of a tidy data set. The cases
variable refers to the number of cases of tuberculosis reported in the given year.
table1
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
It might seem at first that this is not tidy since it looks like observations are spread over two rows. However, though each country appears in two rows, the year
value in each row is different, making them different observations.
Below, we’ll pose four problems that could make our data untidy and offer a way to fix each one.
Problem: more than one value in a cell
Solution: separate
Our first untidy data set is table3
:
table3
## # A tibble: 6 x 3
## country year rate
## * <chr> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
The rate
column seems to contain the number of tuberculosis cases divided by the population, which would give the percentage of the population that contracted tuberculosis. There are a few problems with this. First, it’s not very helpful to report rates this way. Rates are usually more meaningful as decimals or percentages, but even if this is how rate
were displayed, remember that when comparing rates, it’s important to also display an overall count. In other words, we should convert the fractions in rate
to decimals, but we should also keep the actual case and population numbers, each in its own column. The way to do this is to use the separate
function. Notice that the into =
argument below is the vector of new columns we want to create to house the two numbers in rate
. We’ll also go ahead and add a true rate
column reported as a percentage:
table3_tidy <- table3 %>%
  separate(rate, into = c("cases", "population")) %>%
  mutate(rate = cases / population * 100)
## Error in `mutate()`:
## ! Problem while computing `rate = cases/population * 100`.
## Caused by error in `cases / population`:
## ! non-numeric argument to binary operator
What does this error message mean? There’s apparently a problem with rate = cases/population * 100
, and it’s that there’s a non-numeric argument to binary operator
. A binary operator is an arithmetic operation that takes two numbers as input and produces a single numeric output. The binary operator of concern here is division, so the error message must be saying that one or both of cases
or population
is non-numeric. Let’s check this out by running the above code again with the offending mutate
commented out:
table3_tidy <- table3 %>%
  separate(rate, into = c("cases", "population")) #%>%
  #mutate(rate = cases / population * 100)
table3_tidy
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <chr> <chr>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
Notice that, indeed, cases
and population
are both listed as characters. The reason is that the default procedure for separate
is to make the separated variables into characters. We can override this by including the convert = TRUE
argument in separate
:
table3_tidy <- table3 %>%
  separate(rate, into = c("cases", "population"), convert = TRUE) #%>%
  #mutate(rate = cases / population * 100)
table3_tidy
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
Since we’re now able to compute the rate, let’s remove the #
and run the mutate
:
table3_tidy <- table3 %>%
  separate(rate, into = c("cases", "population"), convert = TRUE) %>%
  mutate(rate = cases / population * 100)
table3_tidy
## # A tibble: 6 x 5
## country year cases population rate
## <chr> <int> <int> <int> <dbl>
## 1 Afghanistan 1999 745 19987071 0.00373
## 2 Afghanistan 2000 2666 20595360 0.0129
## 3 Brazil 1999 37737 172006362 0.0219
## 4 Brazil 2000 80488 174504898 0.0461
## 5 China 1999 212258 1272915272 0.0167
## 6 China 2000 213766 1280428583 0.0167
There’s still a problem with rate
, though. The percentages are so small as to be somewhat meaningless. The reason is that a percentage tells you how many cases there are for every hundred people (cent = hundred). When the number is so small, it’s better to report the rate as cases per some other power of 10. In this case, if we multiply the percentages by 1000, we get more meaningful numbers, although now we’re calculating cases per 100,000:
table3_tidy <- table3 %>%
  separate(rate, into = c("cases", "population"), convert = TRUE) %>%
  mutate(rate_per_100K = cases / population * 100000)
table3_tidy
## # A tibble: 6 x 5
## country year cases population rate_per_100K
## <chr> <int> <int> <int> <dbl>
## 1 Afghanistan 1999 745 19987071 3.73
## 2 Afghanistan 2000 2666 20595360 12.9
## 3 Brazil 1999 37737 172006362 21.9
## 4 Brazil 2000 80488 174504898 46.1
## 5 China 1999 212258 1272915272 16.7
## 6 China 2000 213766 1280428583 16.7
By default, separate will split the cell values at any non-alphanumeric character, such as a space, a comma, an arithmetic symbol, etc. However, you can specify where the separation is to occur by including the optional sep = <SEPARATION CHARACTER>
argument. The separate
function will look for instances of <SEPARATION CHARACTER>
in the column to be separated and split the values at that character.
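For example, naming the separator explicitly in the table3 example forces the split at the slash and gives the same result as the default behavior (convert = TRUE is included so the new columns come out numeric):

```r
library(tidyr)

# Split rate at the "/" character rather than relying on the default split
table3 %>%
  separate(rate, into = c("cases", "population"), sep = "/", convert = TRUE)
```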
While the sep =
argument allows you to split the cells at any named character, sometimes you’ll want to split the cells into strings of a certain length instead, regardless of the characters involved. Look at this small sample data set, which contains a runner’s split times in a 5K race:
split_times <- tribble(
  ~`distance (km)`, ~`time (min)`,
  1, 430,
  2, 920,
  3, 1424,
  4, 1920,
  5, 2425
)
split_times
## # A tibble: 5 x 2
## `distance (km)` `time (min)`
## <dbl> <dbl>
## 1 1 430
## 2 2 920
## 3 3 1424
## 4 4 1920
## 5 5 2425
It obviously didn’t take the runner 430 minutes to finish the first kilometer; the numbers in the time (min)
column are clearly meant to be interpreted as “minutes:seconds.” The first 1 or 2 digits are minutes, and the last two are seconds. Let’s fix this.
First, we should separate the last two digits from the first 1 or 2, although there’s no character we can give to sep =
. Luckily, sep =
also accepts a number that states how many characters into the string we make our separation. If we include the argument sep = <NUMBER>
, then the cell will be separated <NUMBER>
characters in from the left if <NUMBER>
is positive, and from the right otherwise. In our data set above, we want the separation to occur 2 characters in from the right, so we’ll use sep = -2
:
split_times <- split_times %>%
  separate(`time (min)`, into = c("minutes", "seconds"), sep = -2)
split_times
## # A tibble: 5 x 3
## `distance (km)` minutes seconds
## <dbl> <chr> <chr>
## 1 1 4 30
## 2 2 9 20
## 3 3 14 24
## 4 4 19 20
## 5 5 24 25
Next, we’ll look at the unite
function and see how we can combine the minutes
and seconds
columns back into a single column with a colon :
inserted.
Problem: one value is spread across two cells
Solution: unite
Consider the built-in table5
:
table5
## # A tibble: 6 x 4
## country century year rate
## * <chr> <chr> <chr> <chr>
## 1 Afghanistan 19 99 745/19987071
## 2 Afghanistan 20 00 2666/20595360
## 3 Brazil 19 99 37737/172006362
## 4 Brazil 20 00 80488/174504898
## 5 China 19 99 212258/1272915272
## 6 China 20 00 213766/1280428583
This also has the problematic rate
column, which we addressed with separate
above, but the issue to confront here is the unnecessary separation of the century
and year
variables. It would be better to paste them together into a single variable, and we can do so with the unite
function. Note below that "year"
(with quotes) refers to the newly created column that the two variables will be united in, while year
(no quotes) refers to the already existing year
variable in table5
.
table5_tidy <- table5 %>%
  unite("year", century, year)
table5_tidy
## # A tibble: 6 x 3
## country year rate
## <chr> <chr> <chr>
## 1 Afghanistan 19_99 745/19987071
## 2 Afghanistan 20_00 2666/20595360
## 3 Brazil 19_99 37737/172006362
## 4 Brazil 20_00 80488/174504898
## 5 China 19_99 212258/1272915272
## 6 China 20_00 213766/1280428583
You can see that by default, unite
separates combined values with an underscore _
. We can override this with the optional sep = <SEPARATION CHARACTER>
argument. In this case, we don’t want any separation, so we’ll use an empty pair of quotation marks as our separation character:
table5_tidy <- table5 %>%
  unite("year", century, year, sep = "")
table5_tidy
## # A tibble: 6 x 3
## country year rate
## <chr> <chr> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
The only other bit of cleaning we should do (aside from the mess with rate
) is to convert the new year
variable to an integer. (unite
doesn’t offer a convert = TRUE
option like separate
does.)
table5_tidy <- table5_tidy %>%
  mutate(year = as.integer(year))
table5_tidy
## # A tibble: 6 x 3
## country year rate
## <chr> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
We can now finish tidying the split_times
data set from above:
split_times <- split_times %>%
  unite("time", minutes, seconds, sep = ":")
split_times
## # A tibble: 5 x 2
## `distance (km)` time
## <dbl> <chr>
## 1 1 4:30
## 2 2 9:20
## 3 3 14:24
## 4 4 19:20
## 5 5 24:25
Of course, in practice, we’d combine the separate
and unite
in the pipe:
split_times <- split_times %>%
  separate(`time (min)`, into = c("minutes", "seconds"), sep = -2) %>%
  unite("time", minutes, seconds, sep = ":")
Problem: variables used as values
Solution: pivot_wider
Consider the data set:
table2
## # A tibble: 12 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
As you can see, each observation, which consists of a given country in a given year, is spread over two rows. The variable type
has values (cases
and population
) which are actually variables themselves, and their values are the corresponding numbers in the count
column. We should thus perform a pivot on the table that turns the values from the type
column into variables and assigns the corresponding values for these new variables from count
. A pivot of this type has the potential to give the data table more columns than it currently has (although not in this case) and thus make the table wider. The syntax below reflects exactly what we want to do:
table2 %>%
  pivot_wider(names_from = type, values_from = count)
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
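As a quick check that the widened table behaves like an ordinary tidy table, we can compute the per-100,000 rate from earlier in this section directly on the new columns:

```r
library(tidyverse)

# cases and population are now ordinary columns, so mutate works as usual
table2 %>%
  pivot_wider(names_from = type, values_from = count) %>%
  mutate(rate_per_100K = cases / population * 100000)
```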
Problem: values used as variables
Solution: pivot_longer
Now consider the following data sets:
table4a
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
and
table4b
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 19987071 20595360
## 2 Brazil 172006362 174504898
## 3 China 1272915272 1280428583
The column names 1999
and 2000
are certainly not variables; they’re values of a missing year
variable. This is the opposite of the problem with table2
above: values are being used as variables. We thus will need to perform a pivot which will send the variable names 1999
and 2000
to values of a year
variable. We will also have to say what to do with the values in the 1999
and 2000
columns. In table4a
, we can recognize these as numbers of tuberculosis cases. In table4b
, we can recognize them as populations of the countries.
Since we’re sending variable names to values, a pivot of this type has the potential to make the data table longer. The code below shows how to do this. The first argument is a vector that lists the column names that are being pivoted to values. (Notice the backticks around 1999
and 2000
, which are necessary for names that don’t start with a letter.)
table4a_tidy <- table4a %>%
  pivot_longer(cols = c(`1999`, `2000`), names_to = "year", values_to = "cases")
table4a_tidy
## # A tibble: 6 x 3
## country year cases
## <chr> <chr> <int>
## 1 Afghanistan 1999 745
## 2 Afghanistan 2000 2666
## 3 Brazil 1999 37737
## 4 Brazil 2000 80488
## 5 China 1999 212258
## 6 China 2000 213766
table4b_tidy <- table4b %>%
  pivot_longer(cols = c(`1999`, `2000`), names_to = "year", values_to = "population")
table4b_tidy
## # A tibble: 6 x 3
## country year population
## <chr> <chr> <int>
## 1 Afghanistan 1999 19987071
## 2 Afghanistan 2000 20595360
## 3 Brazil 1999 172006362
## 4 Brazil 2000 174504898
## 5 China 1999 1272915272
## 6 China 2000 1280428583
Notice that pivoting the years 1999
and 2000
away from variable names and into values converted them to character type. We should manually convert them back to integers since pivot_longer
also does not have a convert = TRUE
option.
table4a_tidy <- table4a_tidy %>%
  mutate(year = as.integer(year))

table4b_tidy <- table4b_tidy %>%
  mutate(year = as.integer(year))
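As an aside, recent versions of tidyr (1.1 and later) can do this conversion during the pivot itself via the names_transform argument, sparing the follow-up mutate. A sketch:

```r
library(tidyr)

# names_transform applies a function to the pivoted names before they
# become values, so year arrives already as an integer
table4a %>%
  pivot_longer(cols = c(`1999`, `2000`), names_to = "year", values_to = "cases",
               names_transform = list(year = as.integer))
```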
3.6.1 Exercises
You will need the nycflights13 package for these exercises.
1. Create a version of the flights data set that has a single column for the date, with dates entered as mm-dd-yyyy rather than three separate columns for the month, day, and year.
2. Click here to go to the Wikipedia page that lists all members of the Baseball Hall of Fame. Then do the following:
   - Copy and paste the list, including the column headings, into an Excel document and import it into R.
   - Create a data table that lists only the name of each hall of famer in a single column, alphabetized by last name. Be sure there aren't any NA entries. Try to execute this using a sequence of commands in the pipe. (Each name should appear in a single cell when you're done, with the first name listed first. For example, the first entry in your finished list should show up as: Hank Aaron.)
   - Are there any problematic entries in your list? What's going on with these? (Hint: Try to interpret the warning message you'll get.)
3. How would you tidy the following data table?
patients <- tribble(
  ~name, ~attribute, ~value,
  "Stephen Johnson", "age", 47,
  "Stephen Johnson", "weight", 185,
  "Ann Smith", "age", 39,
  "Ann Smith", "weight", 125,
  "Ann Smith", "height", 64,
  "Mark Davidson", "age", 51,
  "Mark Davidson", "weight", 210,
  "Mark Davidson", "height", 73
)
4. Repeat the previous exercise for this data table. What problem do you run into? (Use the View command to see the pivoted table.) What could be done to avoid this problem?
patients <- tribble(
  ~name, ~attribute, ~value,
  "Stephen Johnson", "age", 47,
  "Stephen Johnson", "weight", 185,
  "Ann Smith", "age", 39,
  "Ann Smith", "weight", 125,
  "Ann Smith", "height", 64,
  "Mark Davidson", "age", 51,
  "Mark Davidson", "weight", 210,
  "Mark Davidson", "height", 73,
  "Ann Smith", "age", 43,
  "Ann Smith", "weight", 140,
  "Ann Smith", "height", 66
)
5. How should the table below be tidied?
preg <- tribble(
  ~pregnant, ~male, ~female,
  "yes", NA, 10,
  "no", 20, 12
)
3.7 Merging Data
Ideally, we would like all of the information in table4a_tidy
and table4b_tidy
from the last section to appear in a single table. A set of more than one data table is called relational data, and it’s often necessary to have the related data tables communicate with each other.
Merging relational data into a single table is a very common practice. When merging relational data, the first step is to identify the key of each table. The key is a variable or set of variables that uniquely identifies each observation. In both table4a_tidy
and table4b_tidy
, the key is the pair of variables country
and year
. (Notice that neither of these variables alone can be a key since knowing only one of them does not uniquely identify an observation.)
A good way to check whether a set of variables is a key is to group the data by these variables and count how many observations show up in each group. If the variables constitute a key, there should only be one observation per group.
table4a_tidy %>%
  group_by(country, year) %>%
  summarize(count = n())
## # A tibble: 6 x 3
## # Groups: country [3]
## country year count
## <chr> <int> <int>
## 1 Afghanistan 1999 1
## 2 Afghanistan 2000 1
## 3 Brazil 1999 1
## 4 Brazil 2000 1
## 5 China 1999 1
## 6 China 2000 1
We thus see that the pair country
and year
is a key for table4a_tidy
. The same will be true for table4b_tidy
.
For these two tables, the values of the key are identical: We have the exact same six pairs of country
/year
values. In this case, an appropriate way to merge them into a single table is the inner_join
function. The by
argument below is the vector of key variables.
table4a4b_comb <- inner_join(table4a_tidy, table4b_tidy, by = c("country", "year"))
table4a4b_comb
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
It’s more often the case that the key values in your relational data are not identical. In this case, we might need to use other types of joins. We’ll explore these in the exercises.
One more type of merge we’ll examine here is useful when you have two data tables that you want to append to each other, meaning that you want to insert the rows of one table right after the rows of another. (Think of copying the cells of an Excel spreadsheet and pasting them after the last cell of another.) This can be accomplished with the union
function. It assumes that the two data tables have the exact same variables.
Let’s create two small data tables using the tribble
function:
grades_2021 <- tribble(
  ~year, ~class, ~grade,
  2021, "HY 102", "B",
  2021, "MS 121", "A",
  2021, "PH 101", "A-",
  2021, "BY 170", "B+"
)

grades_2022 <- tribble(
  ~year, ~class, ~grade,
  2022, "AT 150", "B-",
  2022, "MS 122", "A-",
  2022, "PG 100", "A",
  2022, "BY 171", "A-",
  2022, "CS 152", "B+"
)
Now let’s merge these together by appending the 2022 grades right after the 2021 grades:
grades <- union(grades_2021, grades_2022)
grades
## # A tibble: 9 x 3
## year class grade
## <dbl> <chr> <chr>
## 1 2021 HY 102 B
## 2 2021 MS 121 A
## 3 2021 PH 101 A-
## 4 2021 BY 170 B+
## 5 2022 AT 150 B-
## 6 2022 MS 122 A-
## 7 2022 PG 100 A
## 8 2022 BY 171 A-
## 9 2022 CS 152 B+
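One caution about union: it's a set operation, so any row that appears in both tables (or more than once) shows up only once in the result. That didn't matter above, since no grades row was duplicated across years, but when duplicates should survive, dplyr's bind_rows appends rows verbatim instead. A small sketch:

```r
library(dplyr)

a <- tibble(x = c(1, 2))
b <- tibble(x = c(2, 3))

# union() drops the row duplicated across the tables; bind_rows() keeps it
union(a, b)      # 3 rows
bind_rows(a, b)  # 4 rows
```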
3.7.1 Exercises
You will need the Lahman and nycflights13 packages for these exercises.
1. Identify a key variable or set of variables for each of the following data sets. Verify your answers by computing a grouped count.
   - msleep
   - Batting
   - flights
2. Does mpg have a key? If so, what is it? If not, how might you create one? (Hint: Research the row_number() function.)
3. Enter the following two small data sets:
sample1 <- tribble(
  ~x, ~y,
  1, "a",
  2, "b",
  3, "c"
)

sample2 <- tribble(
  ~x, ~z,
  1, "p",
  2, "q",
  4, "r"
)
Then perform each of the following joins, and based on the results, explain what each one does.
   - inner_join(sample1, sample2, by = "x")
   - left_join(sample1, sample2, by = "x")
   - right_join(sample1, sample2, by = "x")
   - full_join(sample1, sample2, by = "x")
4. Another data set included in Lahman is Salaries, which contains player salaries for various years. If you wanted to add a salary column to the Pitching data set, which of the four types of joins from the previous exercise would be appropriate? Perform this join.
5. Create a data set that contains the name of every major league baseball player since 1871 (both pitchers and batters) along with the year and their salary for that year. If a player's salary is not available for a given year, the player should still show up in the list with an NA in the salary column.
3.8 Cleaning Example
Data cleaning can be a long, tedious process, and almost all real-life data that you’ll have to analyze will need a lot of it. The people who prepare the data often aren’t doing so with a follow-up analysis in mind; they just need a way to represent it that lists all of the relevant information in some systematic way.
We’ll close this chapter by cleaning and analyzing a real-life data set based on two related built-in data sets, who
and population
, from the 2014 World Health Organization Global Tuberculosis Report. As always, the very first steps are to view the data:
who
## # A tibble: 7,240 x 60
## country iso2 iso3 year new_s~1 new_s~2 new_s~3 new_s~4 new_s~5 new_s~6 new_s~7 new_s~8
## <chr> <chr> <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 Afghanis~ AF AFG 1980 NA NA NA NA NA NA NA NA
## 2 Afghanis~ AF AFG 1981 NA NA NA NA NA NA NA NA
## 3 Afghanis~ AF AFG 1982 NA NA NA NA NA NA NA NA
## 4 Afghanis~ AF AFG 1983 NA NA NA NA NA NA NA NA
## 5 Afghanis~ AF AFG 1984 NA NA NA NA NA NA NA NA
## 6 Afghanis~ AF AFG 1985 NA NA NA NA NA NA NA NA
## 7 Afghanis~ AF AFG 1986 NA NA NA NA NA NA NA NA
## 8 Afghanis~ AF AFG 1987 NA NA NA NA NA NA NA NA
## 9 Afghanis~ AF AFG 1988 NA NA NA NA NA NA NA NA
## 10 Afghanis~ AF AFG 1989 NA NA NA NA NA NA NA NA
## # ... with 7,230 more rows, 48 more variables: new_sp_f1524 <int>, new_sp_f2534 <int>,
## # new_sp_f3544 <int>, new_sp_f4554 <int>, new_sp_f5564 <int>, new_sp_f65 <int>,
## # new_sn_m014 <int>, new_sn_m1524 <int>, new_sn_m2534 <int>, new_sn_m3544 <int>,
## # new_sn_m4554 <int>, new_sn_m5564 <int>, new_sn_m65 <int>, new_sn_f014 <int>,
## # new_sn_f1524 <int>, new_sn_f2534 <int>, new_sn_f3544 <int>, new_sn_f4554 <int>,
## # new_sn_f5564 <int>, new_sn_f65 <int>, new_ep_m014 <int>, new_ep_m1524 <int>,
## # new_ep_m2534 <int>, new_ep_m3544 <int>, new_ep_m4554 <int>, new_ep_m5564 <int>, ...
## # i Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
population
## # A tibble: 4,060 x 3
## country year population
## <chr> <int> <int>
## 1 Afghanistan 1995 17586073
## 2 Afghanistan 1996 18415307
## 3 Afghanistan 1997 19021226
## 4 Afghanistan 1998 19496836
## 5 Afghanistan 1999 19987071
## 6 Afghanistan 2000 20595360
## 7 Afghanistan 2001 21347782
## 8 Afghanistan 2002 22202806
## 9 Afghanistan 2003 23116142
## 10 Afghanistan 2004 24018682
## # ... with 4,050 more rows
## # i Use `print(n = ...)` to see more rows
As always, you can enter View(who)
and View(population)
to see the full tables.
We should then read the documentation. This documentation would often be found in the publication or web site from which the data is pulled, but since these are built-in data sets, we’re lucky to have access to their dictionaries right in R. They’re both found in the same place:
?who
We will clean these data sets and then formulate questions to ask about them. We’ll then try to answer these questions using visualizations and transformations from the previous chapters.
The population
data set looks ready to go. A couple of things might stand out right away in who
, though, namely the redundant iso2
and iso3
columns, the several NA
s, and the strange looking column names.
Let’s address the column names first. The documentation explains that the column names encode several pieces of information. They all begin with new
, indicating that these are newly diagnosed cases. After new_
, the next string encodes the method of diagnosis. For example, rel
means the case was a relapse. The next string is an m
(for male) or f
(for female), followed by numbers which indicate an age range. Thus each column heading is actually a set of four data values. The numbers in these columns are the numbers of cases. Since we have values being used as column names, we should pivot the column names into values. This will make the data table longer, so we’ll use pivot_longer
. We’ll also get rid of the iso2
column while we’re at it. (We’re going to leave iso3
since we’ll need it later. It provides a standardized name for each country.)
who_clean <- who %>%
  select(-iso2) %>%
  pivot_longer(cols = new_sp_m014:newrel_f65, names_to = "case_type", values_to = "cases")
who_clean
## # A tibble: 405,440 x 5
## country iso3 year case_type cases
## <chr> <chr> <int> <chr> <int>
## 1 Afghanistan AFG 1980 new_sp_m014 NA
## 2 Afghanistan AFG 1980 new_sp_m1524 NA
## 3 Afghanistan AFG 1980 new_sp_m2534 NA
## 4 Afghanistan AFG 1980 new_sp_m3544 NA
## 5 Afghanistan AFG 1980 new_sp_m4554 NA
## 6 Afghanistan AFG 1980 new_sp_m5564 NA
## 7 Afghanistan AFG 1980 new_sp_m65 NA
## 8 Afghanistan AFG 1980 new_sp_f014 NA
## 9 Afghanistan AFG 1980 new_sp_f1524 NA
## 10 Afghanistan AFG 1980 new_sp_f2534 NA
## # ... with 405,430 more rows
## # i Use `print(n = ...)` to see more rows
Another easy cleanup is to remove the rows with missing values for cases
, as any question we could ask about this data would depend on knowing the number of cases.
who_clean <- who_clean %>%
  filter(!is.na(cases))
who_clean
## # A tibble: 76,046 x 5
## country iso3 year case_type cases
## <chr> <chr> <int> <chr> <int>
## 1 Afghanistan AFG 1997 new_sp_m014 0
## 2 Afghanistan AFG 1997 new_sp_m1524 10
## 3 Afghanistan AFG 1997 new_sp_m2534 6
## 4 Afghanistan AFG 1997 new_sp_m3544 3
## 5 Afghanistan AFG 1997 new_sp_m4554 5
## 6 Afghanistan AFG 1997 new_sp_m5564 2
## 7 Afghanistan AFG 1997 new_sp_m65 0
## 8 Afghanistan AFG 1997 new_sp_f014 5
## 9 Afghanistan AFG 1997 new_sp_f1524 38
## 10 Afghanistan AFG 1997 new_sp_f2534 36
## # ... with 76,036 more rows
## # i Use `print(n = ...)` to see more rows
You may have noticed during the pivot above that there are, annoyingly, some inconsistent naming conventions. For example, in some of the case types, there is an underscore after new
and in some there is not. Let's perform a grouped count on the case_type
variable to see the extent of this problem. Notice that we’re not storing this grouped table in who_clean
since we’re only using it to detect entry errors.
case_type_count <- who_clean %>%
  group_by(case_type) %>%
  summarize(count = n())
By viewing the full case_type_count
table (View(case_type_count)
), we see that the last 14 values are entered as newrel
rather than new_rel
. Here’s what you’d see at the end of the table:
tail(case_type_count, 16)
## # A tibble: 16 x 2
## case_type count
## <chr> <int>
## 1 new_sp_m5564 3218
## 2 new_sp_m65 3209
## 3 newrel_f014 190
## 4 newrel_f1524 184
## 5 newrel_f2534 182
## 6 newrel_f3544 183
## 7 newrel_f4554 183
## 8 newrel_f5564 183
## 9 newrel_f65 185
## 10 newrel_m014 190
## 11 newrel_m1524 182
## 12 newrel_m2534 183
## 13 newrel_m3544 184
## 14 newrel_m4554 184
## 15 newrel_m5564 185
## 16 newrel_m65 182
We can fix this with str_replace
:
who_clean <- who_clean %>%
  mutate(case_type = str_replace(case_type, "newrel", "new_rel"))
Now let’s separate the case_type
variable into each of its four parts. We’ll first separate at the underscores, which separate
will do by default since they’re non-syntactic characters. (Do you now see why we first had to fix the newrel
entry error?)
who_clean <- who_clean %>%
  separate(case_type, into = c("new or old", "diagnosis type", "sex_age"))
who_clean
## # A tibble: 76,046 x 7
## country iso3 year `new or old` `diagnosis type` sex_age cases
## <chr> <chr> <int> <chr> <chr> <chr> <int>
## 1 Afghanistan AFG 1997 new sp m014 0
## 2 Afghanistan AFG 1997 new sp m1524 10
## 3 Afghanistan AFG 1997 new sp m2534 6
## 4 Afghanistan AFG 1997 new sp m3544 3
## 5 Afghanistan AFG 1997 new sp m4554 5
## 6 Afghanistan AFG 1997 new sp m5564 2
## 7 Afghanistan AFG 1997 new sp m65 0
## 8 Afghanistan AFG 1997 new sp f014 5
## 9 Afghanistan AFG 1997 new sp f1524 38
## 10 Afghanistan AFG 1997 new sp f2534 36
## # ... with 76,036 more rows
## # i Use `print(n = ...)` to see more rows
Now let’s separate sex_age
, which will mean splitting off the m
or f
character, hence the sep = 1
argument:
who_clean <- who_clean %>%
  separate(sex_age, into = c("sex", "age"), sep = 1)
who_clean
## # A tibble: 76,046 x 8
## country iso3 year `new or old` `diagnosis type` sex age cases
## <chr> <chr> <int> <chr> <chr> <chr> <chr> <int>
## 1 Afghanistan AFG 1997 new sp m 014 0
## 2 Afghanistan AFG 1997 new sp m 1524 10
## 3 Afghanistan AFG 1997 new sp m 2534 6
## 4 Afghanistan AFG 1997 new sp m 3544 3
## 5 Afghanistan AFG 1997 new sp m 4554 5
## 6 Afghanistan AFG 1997 new sp m 5564 2
## 7 Afghanistan AFG 1997 new sp m 65 0
## 8 Afghanistan AFG 1997 new sp f 014 5
## 9 Afghanistan AFG 1997 new sp f 1524 38
## 10 Afghanistan AFG 1997 new sp f 2534 36
## # ... with 76,036 more rows
## # i Use `print(n = ...)` to see more rows
Now that the columns are separated, notice that it looks like every value in new or old
is new
. Let’s check:
n_distinct(who_clean$`new or old`)
## [1] 1
Thus, in this case, the new or old column is not helpful, so let's remove it:
who_clean <- who_clean %>%
  select(-`new or old`)
Now let’s clean up the age
column, making it clearer that it represents an age range. Let’s first split off the last two numbers. (Do you see why we wouldn’t want to split off the first two?)
who_clean <- who_clean %>%
  separate(age, into = c("first", "last"), sep = -2)
who_clean
## # A tibble: 76,046 x 8
## country iso3 year `diagnosis type` sex first last cases
## <chr> <chr> <int> <chr> <chr> <chr> <chr> <int>
## 1 Afghanistan AFG 1997 sp m "0" 14 0
## 2 Afghanistan AFG 1997 sp m "15" 24 10
## 3 Afghanistan AFG 1997 sp m "25" 34 6
## 4 Afghanistan AFG 1997 sp m "35" 44 3
## 5 Afghanistan AFG 1997 sp m "45" 54 5
## 6 Afghanistan AFG 1997 sp m "55" 64 2
## 7 Afghanistan AFG 1997 sp m "" 65 0
## 8 Afghanistan AFG 1997 sp f "0" 14 5
## 9 Afghanistan AFG 1997 sp f "15" 24 38
## 10 Afghanistan AFG 1997 sp f "25" 34 36
## # ... with 76,036 more rows
## # i Use `print(n = ...)` to see more rows
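To see why we split off the last two characters rather than the first two, consider a toy column containing the three age formats that appear in the data. (This is just an illustrative sketch with made-up values.)

```r
library(tidyverse)

toy <- tibble(age = c("014", "1524", "65"))

# Splitting off the FIRST two characters (sep = 2) mangles two of the values:
# "014" becomes "01" / "4", and "65" becomes "65" / "" -- both wrong.
toy %>% separate(age, into = c("first", "last"), sep = 2)

# Splitting off the LAST two characters (sep = -2) works for all three:
# "014" -> "0" / "14", "1524" -> "15" / "24", "65" -> "" / "65"
toy %>% separate(age, into = c("first", "last"), sep = -2)
```

The last two characters are always the end of the age range, whereas the first two characters are sometimes part of the lower bound and sometimes the entire value.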
Now we can unite these columns back together, separating the values by a dash, so that the age ranges look like 25-34, for example:
who_clean <- who_clean %>%
  unite("age range", first, last, sep = "-")
who_clean
## # A tibble: 76,046 x 7
## country iso3 year `diagnosis type` sex `age range` cases
## <chr> <chr> <int> <chr> <chr> <chr> <int>
## 1 Afghanistan AFG 1997 sp m 0-14 0
## 2 Afghanistan AFG 1997 sp m 15-24 10
## 3 Afghanistan AFG 1997 sp m 25-34 6
## 4 Afghanistan AFG 1997 sp m 35-44 3
## 5 Afghanistan AFG 1997 sp m 45-54 5
## 6 Afghanistan AFG 1997 sp m 55-64 2
## 7 Afghanistan AFG 1997 sp m -65 0
## 8 Afghanistan AFG 1997 sp f 0-14 5
## 9 Afghanistan AFG 1997 sp f 15-24 38
## 10 Afghanistan AFG 1997 sp f 25-34 36
## # ... with 76,036 more rows
## # i Use `print(n = ...)` to see more rows
This creates a new problem, though, because the “65 and older” values are now expressed as -65. Another str_replace easily fixes this:
who_clean <- who_clean %>%
  mutate(`age range` = str_replace(`age range`, "-65", "65 and older"))
who_clean
## # A tibble: 76,046 x 7
## country iso3 year `diagnosis type` sex `age range` cases
## <chr> <chr> <int> <chr> <chr> <chr> <int>
## 1 Afghanistan AFG 1997 sp m 0-14 0
## 2 Afghanistan AFG 1997 sp m 15-24 10
## 3 Afghanistan AFG 1997 sp m 25-34 6
## 4 Afghanistan AFG 1997 sp m 35-44 3
## 5 Afghanistan AFG 1997 sp m 45-54 5
## 6 Afghanistan AFG 1997 sp m 55-64 2
## 7 Afghanistan AFG 1997 sp m 65 and older 0
## 8 Afghanistan AFG 1997 sp f 0-14 5
## 9 Afghanistan AFG 1997 sp f 15-24 38
## 10 Afghanistan AFG 1997 sp f 25-34 36
## # ... with 76,036 more rows
## # i Use `print(n = ...)` to see more rows
Lastly, we should check that the cases column doesn’t have any potential entry errors. We can start with the summary statistics:
summary(who_clean$cases)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 3.0 26.0 570.7 184.0 250051.0
The maximum value seems very large. Let’s look at the distribution:
ggplot(data = who_clean) +
geom_histogram(mapping = aes(x = cases))
This is skewed way to the right, which might indicate an entry error. Let’s sort by cases and look at the top of the list:
head(arrange(who_clean, desc(cases)))
## # A tibble: 6 x 7
## country iso3 year `diagnosis type` sex `age range` cases
## <chr> <chr> <int> <chr> <chr> <chr> <int>
## 1 India IND 2007 sn m 35-44 250051
## 2 India IND 2007 sn f 35-44 148811
## 3 China CHN 2013 rel m 65 and older 124476
## 4 China CHN 2013 rel m 55-64 112558
## 5 India IND 2007 ep m 35-44 105825
## 6 India IND 2007 ep f 35-44 101015
It looks like there actually are some very high case counts in the table, but they come from India and China, the two most populous countries in the world. While the value 250,051 does seem extremely high (maybe it should be 150,051?), there doesn’t seem to be enough rationale to change it. If there had been any unrealistically high values or negative values or decimal values (and we know there are not because we can see that the variable type is “integer”), then we would probably just filter out those observations.
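Had we found impossible values, the filtering step would look something like the following sketch. (The cutoff of 300,000 here is purely an arbitrary illustration, not a value suggested by the data.)

```r
library(tidyverse)

# A toy table containing a negative entry and an impossibly large one:
toy_cases <- tibble(cases = c(0, 26, -3, 570, 9999999))

# Keep only non-negative counts below an arbitrary sanity-check cutoff:
toy_cases %>%
  filter(cases >= 0, cases < 300000)
```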
At this point, who_clean is in good shape. It should be noted that the piecemeal approach above would probably not be done in practice. Using the pipe, we could condense all of the cleaning we just did into the following:
who_clean <- who %>%
  select(-iso2) %>%
  pivot_longer(cols = new_sp_m014:newrel_f65, names_to = "case_type", values_to = "cases") %>%
  filter(!is.na(cases)) %>%
  mutate(case_type = str_replace(case_type, "newrel", "new_rel")) %>%
  separate(case_type, into = c("new or old", "diagnosis type", "sex_age")) %>%
  separate(sex_age, into = c("sex", "age"), sep = 1) %>%
  select(-`new or old`) %>%
  separate(age, into = c("first", "last"), sep = -2) %>%
  unite("age range", first, last, sep = "-") %>%
  mutate(`age range` = str_replace(`age range`, "-65", "65 and older"))
We know that we can import Excel files into R data sets, but we can also export R data sets back into Excel files. This makes it easier to share data sets with others. Since we now have such a nice, clean version of who, let’s do this. It requires the writexl package:
install.packages("writexl")
Running the following will create an Excel version of who_clean named “who_clean.xlsx” in your working directory.
library(writexl)
write_xlsx(who_clean, "who_clean.xlsx")
Now that who_clean is ready for analysis, we can start to ask questions about the data in it. The ones we’ll address here are, “How has the rate of tuberculosis diagnoses changed over time?” and “How does a country’s geographic area impact tuberculosis rates?”
As usual, these are somewhat open-ended questions, and it’s the job of the analyst to think of a way to make them more specific. Let’s start by analyzing the rate over time question. By “rate,” we mean a measure of the number of cases relative to the population. As we saw in Section 3.6, a percentage might not be the best such measure. We might have to look at the number of cases per a power of 10 other than 100. But before we think about that, we have to acknowledge that who_clean doesn’t contain enough information to calculate rates; we need to know the countries’ populations in the years provided. This, finally, is where population comes in:
population
## # A tibble: 4,060 x 3
## country year population
## <chr> <int> <int>
## 1 Afghanistan 1995 17586073
## 2 Afghanistan 1996 18415307
## 3 Afghanistan 1997 19021226
## 4 Afghanistan 1998 19496836
## 5 Afghanistan 1999 19987071
## 6 Afghanistan 2000 20595360
## 7 Afghanistan 2001 21347782
## 8 Afghanistan 2002 22202806
## 9 Afghanistan 2003 23116142
## 10 Afghanistan 2004 24018682
## # ... with 4,050 more rows
## # i Use `print(n = ...)` to see more rows
We should add a column to who_clean that states the population of a given country in a given year by performing a join of who_clean with population. First, since our question doesn’t involve diagnosis type, sex, or age range, we should group the data by country (and iso3, which will soon be needed) and year and count the total number of cases for each group:
who_grouped <- who_clean %>%
  group_by(country, iso3, year) %>%
  summarize(cases = sum(cases))
who_grouped
## # A tibble: 3,484 x 4
## # Groups: country, iso3 [219]
## country iso3 year cases
## <chr> <chr> <int> <int>
## 1 Afghanistan AFG 1997 128
## 2 Afghanistan AFG 1998 1778
## 3 Afghanistan AFG 1999 745
## 4 Afghanistan AFG 2000 2666
## 5 Afghanistan AFG 2001 4639
## 6 Afghanistan AFG 2002 6509
## 7 Afghanistan AFG 2003 6528
## 8 Afghanistan AFG 2004 8245
## 9 Afghanistan AFG 2005 9949
## 10 Afghanistan AFG 2006 12469
## # ... with 3,474 more rows
## # i Use `print(n = ...)` to see more rows
Now we can join the population table to who_grouped by the key consisting of country and year. The values of country and year for the two tables are not identical; for example, population contains data for Afghanistan in 1995, but who_grouped does not. It’s also possible that who_grouped might contain key values that are missing from population. Since we won’t be able to use data with missing values, we should choose our join so that it only merges the tables on key values that are present in both tables. This is what inner_join does:
who_rates <- who_grouped %>%
  inner_join(population, by = c("country", "year"))
who_rates
## # A tibble: 3,432 x 5
## # Groups: country, iso3 [217]
## country iso3 year cases population
## <chr> <chr> <int> <int> <int>
## 1 Afghanistan AFG 1997 128 19021226
## 2 Afghanistan AFG 1998 1778 19496836
## 3 Afghanistan AFG 1999 745 19987071
## 4 Afghanistan AFG 2000 2666 20595360
## 5 Afghanistan AFG 2001 4639 21347782
## 6 Afghanistan AFG 2002 6509 22202806
## 7 Afghanistan AFG 2003 6528 23116142
## 8 Afghanistan AFG 2004 8245 24018682
## 9 Afghanistan AFG 2005 9949 24860855
## 10 Afghanistan AFG 2006 12469 25631282
## # ... with 3,422 more rows
## # i Use `print(n = ...)` to see more rows
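To see concretely how inner_join handles unmatched keys, here is a minimal sketch with made-up tables. (The country names and values are invented purely for illustration.)

```r
library(tidyverse)

cases_tbl <- tibble(country = c("A", "A", "B"),
                    year    = c(1997, 1998, 1997),
                    cases   = c(10, 20, 5))

pop_tbl <- tibble(country    = c("A", "A", "C"),
                  year       = c(1997, 1998, 1997),
                  population = c(1000, 1100, 500))

# Only the ("A", 1997) and ("A", 1998) keys appear in BOTH tables, so the
# result has two rows: country B (no population) and country C (no cases)
# are dropped entirely.
cases_tbl %>%
  inner_join(pop_tbl, by = c("country", "year"))
```

A left_join, by contrast, would have kept country B’s row with an NA population, which is exactly the kind of missing value we’re trying to avoid.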
Now we can calculate rates. We can try a percentage first to see how it looks.
who_rates_percent <- who_rates %>%
  mutate(percentage = cases/population * 100)
who_rates_percent
## # A tibble: 3,432 x 6
## # Groups: country, iso3 [217]
## country iso3 year cases population percentage
## <chr> <chr> <int> <int> <int> <dbl>
## 1 Afghanistan AFG 1997 128 19021226 0.000673
## 2 Afghanistan AFG 1998 1778 19496836 0.00912
## 3 Afghanistan AFG 1999 745 19987071 0.00373
## 4 Afghanistan AFG 2000 2666 20595360 0.0129
## 5 Afghanistan AFG 2001 4639 21347782 0.0217
## 6 Afghanistan AFG 2002 6509 22202806 0.0293
## 7 Afghanistan AFG 2003 6528 23116142 0.0282
## 8 Afghanistan AFG 2004 8245 24018682 0.0343
## 9 Afghanistan AFG 2005 9949 24860855 0.0400
## 10 Afghanistan AFG 2006 12469 25631282 0.0486
## # ... with 3,422 more rows
## # i Use `print(n = ...)` to see more rows
The percentage numbers look pretty small. Let’s see what their range is:
summary(who_rates_percent$percentage)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000000 0.007535 0.021223 0.041985 0.052621 0.813478
If we were to move the decimal point three places to the right, these numbers would be easier to digest. Thus, when we calculate the rate, we should multiply the quotient by 100,000 instead of 100 (since 100,000 has three more zeros than 100). This means we should measure rate in cases per 100,000 people (as in Section 3.6).
who_rates_100K <- who_rates %>%
  mutate(rate_per_100K = cases/population * 100000)
who_rates_100K
## # A tibble: 3,432 x 6
## # Groups: country, iso3 [217]
## country iso3 year cases population rate_per_100K
## <chr> <chr> <int> <int> <int> <dbl>
## 1 Afghanistan AFG 1997 128 19021226 0.673
## 2 Afghanistan AFG 1998 1778 19496836 9.12
## 3 Afghanistan AFG 1999 745 19987071 3.73
## 4 Afghanistan AFG 2000 2666 20595360 12.9
## 5 Afghanistan AFG 2001 4639 21347782 21.7
## 6 Afghanistan AFG 2002 6509 22202806 29.3
## 7 Afghanistan AFG 2003 6528 23116142 28.2
## 8 Afghanistan AFG 2004 8245 24018682 34.3
## 9 Afghanistan AFG 2005 9949 24860855 40.0
## 10 Afghanistan AFG 2006 12469 25631282 48.6
## # ... with 3,422 more rows
## # i Use `print(n = ...)` to see more rows
Now to answer the question about how the prevalence of tuberculosis diagnoses has changed over time, we could isolate a single country and create a line graph of rate_per_100K vs. year. For example, suppose we wanted to do this for Brazil:
Brazil_rates <- who_rates_100K %>%
  filter(country == "Brazil")
ggplot(data = Brazil_rates, mapping = aes(x = year, y = rate_per_100K)) +
geom_line() +
geom_point() +
labs(title = "Tuberculosis Rates in Brazil",
y = "cases per 100,000 people",
x = "year")
It looks like there was a spike in cases in 2000 and a relapse in 2006.
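If we wanted to compare several countries at once, one option is to filter for a handful of countries and map country to the color aesthetic. This is just a sketch; the three countries chosen here are arbitrary, and any countries present in the data would work.

```r
# Compare tuberculosis rates over time for a few countries at once:
several_rates <- who_rates_100K %>%
  filter(country %in% c("Brazil", "Argentina", "Peru"))

ggplot(data = several_rates,
       mapping = aes(x = year, y = rate_per_100K, color = country)) +
  geom_line() +
  labs(title = "Tuberculosis Rates in Three South American Countries",
       y = "cases per 100,000 people",
       x = "year")
```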
As for the question about geographic area, it would be helpful to be able to assign each country to a specific world region. The web page found here has exactly the information we need. We can copy and paste the data into an Excel spreadsheet and then import it into R. The code chunk below assumes that the data has been pasted into an Excel document named “regions.xlsx” which is stored in your working directory:
regions <- read_excel("regions.xlsx")
We can now add a column called Region to who_rates_100K by joining the regions data set, using iso3 as the key. (This is why we kept the iso3 column above. Do you see why we could not have used country as the key?) Since it won’t be helpful to have any missing country or region values, we’ll use inner_join so that the only rows that are retained in the merged data set are the ones for which the key values are present in both data sets. After merging, we’ll select only the relevant columns.
who_regions <- who_rates_100K %>%
  inner_join(regions, by = "iso3") %>%
  select(country, iso3, year, cases, population, rate_per_100K, Region)
who_regions
## # A tibble: 3,321 x 7
## # Groups: country, iso3 [206]
## country iso3 year cases population rate_per_100K Region
## <chr> <chr> <int> <int> <int> <dbl> <chr>
## 1 Afghanistan AFG 1997 128 19021226 0.673 South Asia
## 2 Afghanistan AFG 1998 1778 19496836 9.12 South Asia
## 3 Afghanistan AFG 1999 745 19987071 3.73 South Asia
## 4 Afghanistan AFG 2000 2666 20595360 12.9 South Asia
## 5 Afghanistan AFG 2001 4639 21347782 21.7 South Asia
## 6 Afghanistan AFG 2002 6509 22202806 29.3 South Asia
## 7 Afghanistan AFG 2003 6528 23116142 28.2 South Asia
## 8 Afghanistan AFG 2004 8245 24018682 34.3 South Asia
## 9 Afghanistan AFG 2005 9949 24860855 40.0 South Asia
## 10 Afghanistan AFG 2006 12469 25631282 48.6 South Asia
## # ... with 3,311 more rows
## # i Use `print(n = ...)` to see more rows
Creating a box plot of rate_per_100K vs. Region would be a good way to compare tuberculosis rates by region. Notice that we’re reordering the boxes to make the comparison easier to see and that we’re laying out the boxes horizontally since there are too many categories to display them vertically without crowding the x-axis labels.
ggplot(data = who_regions) +
geom_boxplot(mapping = aes(x = reorder(Region, rate_per_100K), y = rate_per_100K)) +
coord_flip() +
labs(title = "Tuberculosis Rates from Around the World",
y = "cases per 100,000 people",
x = "region")
It seems that African regions generally have the highest rates and European regions have the lowest. Factors such as degree of industrialization, poverty levels, climate, etc., might explain some of these trends.
In practice, the process of importing a data set, cleaning it, transforming it, merging it with other data sets, and visualizing it is very long and not at all linear. It’s often the case that we have to back-track and re-clean or re-transform the data set after our analysis reveals something we missed. However, our who project gives a good cross-section of the data analysis techniques we’ve covered in the course so far.
3.9 Dashboards
A dashboard is a compilation of data tables and visualizations meant to provide a quick snapshot of relevant performance indicators and metrics. Dashboards are often interactive and are meant to be a purely visual overview of the data describing a process or phenomenon. Click here for an example of a dashboard displaying COVID-related data for Kent County, Michigan.
Dashboards usually don’t contain much or any data storytelling like a data analysis report would. Text is kept to a minimum and it’s left to the viewer to extract their own insights. On the other hand, all displayed data tables are absolutely pristine, and all visualizations are aesthetically pleasing and fully labeled.
The COVID dashboard linked to above was created with Microsoft PowerBI, which is very powerful visualization software often used in industry, but we can also create dashboards in R Markdown. In doing so, we can learn a few best practices of dashboard building that would transfer to other software platforms. The project for this chapter will require you to create a dashboard, so we’ll use this section to go over the basics in R.
First, in a scratch work .R file, install the flexdashboard package, which provides the functionality to create nice-looking dashboards in R. You will also need the package htmltools.
install.packages("flexdashboard")
install.packages("htmltools")
Let’s create a sample dashboard that displays information about the mpg data set. Click here to access a .Rmd file that you can download and put into your working R directory. When you’ve done so, open it and knit it so you can see the output. We’ll use the rest of this section to walk through it. This file can serve as a template for building your own dashboards.
Lines 1-6: This is the standard heading setup for any R Markdown file. Notice that the output is flexdashboard::flex_dashboard. flex_dashboard is the output profile we want to use, but it’s part of the flexdashboard library you just installed. Since there’s nowhere to load this library before the heading setup, we reference it “on the fly” by including flexdashboard:: before the output profile we’re accessing.
Lines 8-11: This is a good place to load any libraries you’ll need to display information in your dashboard. You’ll probably at least need tidyverse and DT (for data tables), but you might need others as well. The include=FALSE option ensures that these lines of code don’t show up in your dashboard.
Lines 13-18: On these lines, some data transformation is done to mpg to create the data set mpg_manufacturer that will be analyzed in the dashboard. (Again, notice the include=FALSE.) This is a good spot to do any cleaning and transformation on your data sets to get them ready to display and visualize. Code should never be included in a dashboard, though, so remember include=FALSE.
It’s very important to note that any variables to which you refer in your .Rmd file must be defined within the .Rmd file. If you create a variable in a scratch .R file, for example, but not in your .Rmd file, then you’ll get an error message when you try to knit.
Line 20: Dashboards are separated into several tabbed pages, and each page is given a name that describes what the page contains. To set up and name a page, use a single # followed by a space and the desired page name.
Line 22: Each page is further subdivided into vertical columns. You can set up and name a column with a double ## followed by the column name. The column name is only for your benefit and is not visible in the compiled dashboard.
Line 24: Each column can be even further subdivided into horizontal rows. To set up a row, use a triple ### followed by the row name. The row name should be a description of the information it contains. Unlike column names, row names do appear in the compiled dashboard.
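Putting these last three pieces together, the overall skeleton of a flexdashboard .Rmd file looks roughly like this. (The page, column, and row names below are placeholders, and each parenthesized line stands in for an R code chunk.)

```
---
title: "Sample Dashboard"
output: flexdashboard::flex_dashboard
---

# Page 1 Name

## Column 1 (name not displayed)

### Row name describing the output shown here

(R code chunk producing a table or plot)

## Column 2 (name not displayed)

### Another descriptive row name

(R code chunk)

# Page 2 Name

## Column 1 (name not displayed)

### Row name

(R code chunk)
```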
Lines 26-28: After the page, column, and row are established, you’re ready to enter the code that will produce whatever you want to display there. In this sample dashboard, we’re displaying the mpg_manufacturer data table. Notice that the row name we choose describes what the data table is showing us.
Lines 30-38: Still on page 1, we’re starting a second column (using ##) and within that second column, a new row (using ###) named with a description of what’s to follow. We then enter the code chunk to produce a box plot.
Lines 40-48: We’re now starting a whole new page (which will show up as another tab in the dashboard). We establish a first column and a row within that column. Then we have a code chunk that will display the full mpg data table.
Lines 50-55: Columns can have more than one row, and here we’re setting up a second row within the first column of page 2. The code chunk produces a scatter plot.
Lines 57-81: We establish a second column on page 2, which contains three rows. Each row contains a code chunk that produces the desired outputs. Notice that in lines 69-71, we have code that transforms mpg to create class_mileage. Since this code only establishes the class_mileage data set and produces no actual output, we can include it here and not worry about it showing up in the dashboard.
Line 84: We start a third page, which we could then populate with columns and rows as needed.
We can structure our dashboards with as many pages, rows, and columns as we want, but the guiding principles are that it should contain immediately accessible information, should be uncluttered and not overwhelming, and should be free of R code and blocks of text. The sample dashboard we walked through here actually has some problems that are worth pointing out:
- Throughout, the original variable names from mpg such as hwy, cty, displ, etc., and some newly created variables such as avg_hwy show up in the dashboard. Every instance of these abbreviations should be replaced by more meaningful names.
- If we’re using row names to describe what the visualizations contain, we don’t have to also label the visualizations with titles, but we should still label the x- and y-axes (and the legend if it exists) using descriptive names.
- The box plots should be ordered by mean value. (Recall the reorder function.) All visualizations in a dashboard should be as aesthetically pleasing as you can make them.
- Most importantly, page 2 is very crowded. The plots are too small, and the tables are too compressed. The information on this page should be spread out over more pages with fewer rows.
3.10 Project
Project Description: The purpose of this project is to import and clean a real-world data set and to display the results in a dashboard.
Instructions:
Open the spreadsheet linked to here. This contains data from the 2019 STAR math placement exams. Import this data into R.
Give the data set a very thorough cleaning, considering all of the techniques discussed in this chapter.
This web site gives ACT/SAT concordance tables, which are used to convert scores of one test to those of the other. Find the conversion from ACT Math scores to SAT Math scores, and copy and paste the information into a spreadsheet. Then import this spreadsheet into R.
Use the imported concordance spreadsheet from the previous problem to add a column to the STAR data table that contains a converted SAT Math score for each student who took the ACT. Then create a new column called SAT_ACT_Max that lists the higher of the SAT Math or converted ACT Math score.
The rest of the project will involve analyzing and summarizing your cleaned data in a dashboard. Specifically, your dashboard should contain:
- The cleaned data table.
- A visualization which shows the distribution of the placement exam scores. (Recall the geom which does this for continuous data.)
- A visualization which shows the relationship between GPA and placement exam score.
- A visualization which shows the relationship between SAT_ACT_Max and placement exam score.
- A visualization which shows the relationship between STAR date and placement exam score. (Think carefully about which geom is most appropriate. What types of variables are STAR date and AQ Score?)
- A data table that ranks the high schools by average AQ math placement score. (You might want to consider filtering out the high schools that didn’t have many students at STARs.)
- A data table that ranks the career goals by average AQ math placement score. (You might want to consider filtering out the career goals that didn’t have many students at STARs.)
Guidelines:
Since your work in this project will be put into a dashboard rather than a report, the guidelines are a little different from those of the first two projects.
- Make sure that no code or warning/error messages show up in your dashboard.
- Give descriptive names to all of the pages and rows in the dashboard.
- Label all visualizations, although as noted above, you can omit the title of a visualization when the title is instead the row name in the dashboard.
- Use variable names in your displayed data tables and visualizations that are meaningful and are not just abbreviations.
- Make sure that none of your dashboard pages are too cluttered. Everything should be visually pleasing.
- Use the datatable function from DT to display your tables.
Grading Rubric:
- Cleanness: Did you properly clean the data? Did you pay attention to data types and convert them if necessary? Did you rename unnecessarily long columns? Did you handle missing values appropriately? Did you search for and fix entry errors? Is your data tidy? Did you merge relational data correctly? (30 points)
- Dashboard: Does your dashboard have all of the required components? (30 points)
- Professionalism: Does your dashboard look nice? Is the information neatly laid out in rows and columns? Would a potential viewer be able to make sense of it easily? Did you follow the guidelines listed above? (15 points)