Chapter 3 Importing and Cleaning Data
In the previous two chapters, we used only built-in data sets to practice visualizing and transforming data. In practice, of course, you’ll have to import a data set into R before starting your analysis. We address this important first step in this chapter.
We’ll need tidyverse:
library(tidyverse)
We’ll also need two new libraries used for importing data sets. They’re installed along with the tidyverse, but you’ll have to load them explicitly.
library(readxl) # For importing Excel files
library(readr) # For importing csv (comma separated values) files
3.1 Importing Tabular Data
All of our data sets in this course are tabular, meaning that they consist of several rows of observations and several columns of variables. Tabular data is usually entered into spreadsheet software (such as Excel, Google Sheets, etc), where it can be manipulated using the built-in functionality of the software. Importing the data into R, however, allows for greater flexibility and versatility when performing your analysis.
RStudio makes it very easy to import data sets, although the first step is placing your spreadsheet file in the right directory.
Click here to view a very small sample csv file. Our goal is to import this into R. We can do so as follows:
Download the file by clicking the download button in the upper-right corner of the screen.
The next step is to get this file into R.
- If you’re using the RStudio desktop version, move this downloaded file to your working directory in R. If you’re not sure what your working directory is, you can enter
getwd()
to find out.
- If you’re using RStudio Cloud, click the “Files” tab in the lower-right pane, and then click the “Upload” button. Then click “Choose File” and browse to the location of the downloaded csv file.
In either case (desktop or cloud), the file should now appear in your list of files.
Click on the csv file in your file list, and then choose “Import Dataset.” You’ll see a preview of what the data set will look like in R, a window of importing options, and the R code used to do the importing.
In the “Import Options” window, choose the name you want to use for the data set once it’s in R. Enter this in the “Name” field. Also, decide whether you want to use the values in the first row in your csv file as column names. If so, make sure the “First Row as Names” box is checked. There are other options in this window as well that you’ll sometimes need to adjust, but these two will get us started.
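These two options correspond directly to arguments of read_csv, so you can also set them in code. Here’s a minimal sketch, assuming the file is named sample_csv.csv as above:
# Equivalent to leaving "First Row as Names" checked (the default):
sample_csv <- read_csv("sample_csv.csv", col_names = TRUE)
# If the first row is data rather than names, readr generates placeholder
# column names like X1, X2, ...:
sample_csv <- read_csv("sample_csv.csv", col_names = FALSE)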
Now you’re ready to import. You can do so by either clicking the “Import” button at the bottom right or by copying and pasting the code into your document. I’d recommend copying and pasting the code, as it allows you to quickly re-import your csv over your original in case you have to start your analysis over. The code generated by the import feature is:
library(readr)
sample_csv <- read_csv("sample_csv.csv")
View(sample_csv)
readr is the package needed to access the read_csv
function. If you already loaded it, it’s unnecessary to do so again. Also, the View(sample_csv)
command is, of course, optional. All you really need to import the file is:
sample_csv <- read_csv("sample_csv.csv")
sample_csv
## # A tibble: 3 x 2
## state capital
## <chr> <chr>
## 1 Michigan Lansing
## 2 California Sacramento
## 3 New Jersey Trenton
Once you import it, you should see sample_csv
listed in the Global Environment pane, and you can now start to analyze it like any other data set.
Data entry is usually done in Excel (or other spreadsheet software) and then imported into R for analysis. However, you sometimes might want to enter tabular data directly into R, especially when the data set is small. This is often the case when you need a sample data set to test a piece of code.
You can do so in R using the tibble
or tribble
(= transposed tibble) functions from tidyverse.
small_data_set_tibble <- tibble(
  x = c(3, 4, 2, -5, 7),
  y = c(5, 6, -1, 0, 5),
  z = c(-3, -2, 0, 1, 6)
)
small_data_set_tibble
## # A tibble: 5 x 3
## x y z
## <dbl> <dbl> <dbl>
## 1 3 5 -3
## 2 4 6 -2
## 3 2 -1 0
## 4 -5 0 1
## 5 7 5 6
small_data_set_tribble <- tribble(
  ~name, ~points, ~year,
  "Wilt Chamberlain", 100, 1962,
  "Kobe Bryant", 81, 2006,
  "Wilt Chamberlain", 78, 1961,
  "Wilt Chamberlain", 73, 1962,
  "Wilt Chamberlain", 73, 1962,
  "David Thompson", 73, 1978
)
small_data_set_tribble
## # A tibble: 6 x 3
## name points year
## <chr> <dbl> <dbl>
## 1 Wilt Chamberlain 100 1962
## 2 Kobe Bryant 81 2006
## 3 Wilt Chamberlain 78 1961
## 4 Wilt Chamberlain 73 1962
## 5 Wilt Chamberlain 73 1962
## 6 David Thompson 73 1978
Notice that the column names do not require quotation marks, but any non-numeric values in the data set do.
3.1.1 Exercises
1. Click here to access another sample spreadsheet file. Download it as a .xlsx file and then import it into R. (Copy and paste the code generated for the import.) Why is the code a little different from that used to import the spreadsheet from above?
2. Click here for yet another sample spreadsheet. Download it and import it into R, but when you do, give the imported data set a more meaningful name than the default name. What else will you have to do to make the data set import correctly? (Copy and paste the generated import code.)
3. When you chose not to use the first row as the column names in the previous exercise, what default names did read_excel assign to the columns? By either checking the read_excel documentation in R (?read_excel) or by searching the internet, find out how to override the default column names. Then re-import the spreadsheet from the previous exercise and assign more meaningful column names.
4. Use the tribble command to directly import your current course schedule into R. Include a column for “course” (i.e., MS 282), “title” (i.e., Applied Statistics with R), and “credits” (i.e., 3).
3.2 Data Types
You may have noticed that the read_csv
, read_excel
, and tribble
functions all produce data sets in the structure of a “tibble.” Recall that tibbles are data structures in tidyverse that store tabular data sets in readable, convenient formats.
The built-in iris
data set, available in base R, is a data frame, not a tibble. Execute iris
in the console and note its appearance:
iris
To compare this data frame to a tibble, we can coerce iris
into a tibble and then re-execute:
iris_tibble <- as_tibble(iris)
iris_tibble
## # A tibble: 150 x 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # ... with 140 more rows
## # i Use `print(n = ...)` to see more rows
We can’t see the entire data set in the console anymore, just a preview. However, we’re told how many rows and columns the data set has without having to scroll, and, more importantly, we’re told the data type of each variable in the data set.
Knowing the data type of a variable is important, as different data types can behave very differently. The three most commonly occurring data types we’ll encounter are summarized below.
type | abbreviation in R | objects |
---|---|---|
integer | int | whole numbers |
double-precision floating point number | dbl | real numbers, possibly containing decimal digits |
character | chr | characters or strings of characters |
You can check the types of the variables in a tibble by displaying it in the console, or you can use the type_sum
function. Suppose we want to know the type of the variable hwy
in the mpg
data set. (Recall that mpg$hwy
extracts just the vector of entries in the hwy
column from mpg
.) If we don’t want to display the entire tibble, we can do this instead:
type_sum(mpg$hwy)
## [1] "int"
When you import a data set or create one with tibble
or tribble
, R can usually determine what the types of the variables are. However, look again at small_data_set_tribble
, which we created above. The type of points
and year
is “double.” This is because R converts all imported or input numeric data, whether there are decimal places or not, to the double type since it’s more inclusive and is handled the same way mathematically.
If we really want our numeric data to be treated like integers, there are two options:
First, if we’re entering the data directly into R using tibble
or tribble
, we can specify that the numbers are integers by including the suffix “L.” For example:
small_data_set2 <- tribble(
  ~name, ~points, ~year,
  "Wilt Chamberlain", 100L, 1962L,
  "Kobe Bryant", 81L, 2006L,
  "Wilt Chamberlain", 78L, 1961L,
  "Wilt Chamberlain", 73L, 1962L,
  "Wilt Chamberlain", 73L, 1962L,
  "David Thompson", 73L, 1978L
)
small_data_set2
## # A tibble: 6 x 3
## name points year
## <chr> <int> <int>
## 1 Wilt Chamberlain 100 1962
## 2 Kobe Bryant 81 2006
## 3 Wilt Chamberlain 78 1961
## 4 Wilt Chamberlain 73 1962
## 5 Wilt Chamberlain 73 1962
## 6 David Thompson 73 1978
Notice that points and year are now integers.
Second, we can coerce a variable of one type into a variable of another type using the as.<TYPE>
function. For example:
small_data_set3 <- small_data_set_tribble %>%
  mutate(points = as.integer(points),
         year = as.integer(year))
small_data_set3
## # A tibble: 6 x 3
## name points year
## <chr> <int> <int>
## 1 Wilt Chamberlain 100 1962
## 2 Kobe Bryant 81 2006
## 3 Wilt Chamberlain 78 1961
## 4 Wilt Chamberlain 73 1962
## 5 Wilt Chamberlain 73 1962
## 6 David Thompson 73 1978
One other data type worth mentioning here is factor. Notice that when we converted iris
to the tibble iris_tibble
above, the type of Species
was abbreviated fct for “factor.”
type_sum(iris_tibble$Species)
## [1] "fct"
To begin to see what factors are, let’s create a small data set:
schedule <- tribble(
  ~day, ~task,
  "Tuesday", "wash car",
  "Friday", "doctor's appointment",
  "Monday", "haircut",
  "Saturday", "laundry",
  "Wednesday", "oil change"
)
schedule
## # A tibble: 5 x 2
## day task
## <chr> <chr>
## 1 Tuesday wash car
## 2 Friday doctor's appointment
## 3 Monday haircut
## 4 Saturday laundry
## 5 Wednesday oil change
Suppose we wanted to organize this list of tasks by the day. We could try:
arrange(schedule, day)
## # A tibble: 5 x 2
## day task
## <chr> <chr>
## 1 Friday doctor's appointment
## 2 Monday haircut
## 3 Saturday laundry
## 4 Tuesday wash car
## 5 Wednesday oil change
As you can see, this organizes by day alphabetically, which is certainly not what we want. To organize by day chronologically, we could assign an ordering to each day of the week so that the arrange
function would sort it the way we want. We can do this by coercing day
into the “factor” type.
We first have to tell R the ordering we want. We do so by entering a vector with the days of the week in chronological order. This ordering establishes the levels of the day
variable:
<- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday") day_levels
We can then turn the day
column into a factor with the above levels as follows:
schedule2 <- schedule %>%
  mutate(day = factor(day, levels = day_levels))
type_sum(schedule2$day)
## [1] "fct"
Now re-run the sort above:
arrange(schedule2, day)
## # A tibble: 5 x 2
## day task
## <fct> <chr>
## 1 Monday haircut
## 2 Tuesday wash car
## 3 Wednesday oil change
## 4 Friday doctor's appointment
## 5 Saturday laundry
When we define the levels of a factor, the values of that variable are replaced internally with integers that are determined by the levels. In the example above, after we turned the day column into a factor, the value “Friday” became a 6 since “Friday” is the sixth entry in the level vector. “Monday” became a 2, “Saturday” a 7, etc. The names of the days became aliases for the integers from 1 to 7 that they represent. That’s why the arrange function was able to sort chronologically; it was just sorting a column of numbers from 1 to 7 in ascending order.
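We can peek at this internal representation ourselves: levels lists the levels in order, and as.integer reveals the underlying codes (here in the original row order of schedule2: Tuesday, Friday, Monday, Saturday, Wednesday):
levels(schedule2$day)
## [1] "Sunday" "Monday" "Tuesday" "Wednesday" "Thursday" "Friday" "Saturday"
as.integer(schedule2$day)
## [1] 3 6 2 7 4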
3.2.1 Exercises
You will need the nycflights13 library for these exercises.
1. The flights data set has a variable with a type that was not covered in this section. What is the variable, and what is its type?
2. Create a data set called mpg2 which is an exact copy of mpg except that the hwy and cty variables are doubles rather than integers.
3. Answer the following questions by experimenting. (Try just coercing single values rather than entire columns of data sets. For example, for part (a), try as.character(7L).)
- Can you coerce an integer into a character? What happens when you try?
- Can you coerce a double into a character? What happens when you try?
- Can you coerce a double into an integer? What happens when you try?
- Can you coerce a character into either a double or an integer? What happens when you try?
4. Download and import the data set found here.
- What is the type of the variable day? Change it to a more appropriate type.
- Sort the data set so that the records are in chronological order. (You’ll have to change month to a different type.)
3.3 Renaming Columns
For the next few sections, we’ll be using the data set found here. You should download it and then import it into R using the name high_school_data
. It contains high school data for incoming college students as well as data from their first college semester.
high_school_data
## # A tibble: 278 x 10
## `High School GPA` `ACT Comp` SAT Co~1 High ~2 Bridg~3 FA20 ~4 FA20 ~5 Acad ~6 Race Ethni~7
## <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 2.72 14 800 0.8880~ yes 13 13 Academ~ White Hispan~
## 2 4 NA 1380 0.8389~ <NA> 15 15 Dean's~ White Non Hi~
## 3 3.08 NA 1030 0.8940~ <NA> 14 13 Good S~ <NA> Hispan~
## 4 3.69 NA 1240 <NA> <NA> 17 17 Good S~ Asian Non Hi~
## 5 NA NA NA <NA> <NA> 13 13 Dean's~ <NA> Non Re~
## 6 3.47 NA 1030 0.5330~ <NA> 16 16 Good S~ White Non Hi~
## 7 3.43 27 NA <NA> <NA> 15 15 Dean's~ White Non Hi~
## 8 3.27 NA 1080 <NA> <NA> 15 15 Good S~ Blac~ Non Hi~
## 9 3.76 NA 1120 0.9160~ <NA> 15 15 Good S~ White Non Hi~
## 10 3.23 33 NA 0.7970~ <NA> 15 11 Academ~ Asian Non Hi~
## # ... with 268 more rows, and abbreviated variable names 1: `SAT Comp`,
## # 2: `High School Quality Academic`, 3: `Bridge Program?`,
## # 4: `FA20 Credit Hours Attempted`, 5: `FA20 Credit Hours Earned`,
## # 6: `Acad Standing After FA20 (Dean's List, Good Standing, Academic Warning, Academic Probation, Academic Dismissal, Readmission on Condition)`,
## # 7: Ethnicity
## # i Use `print(n = ...)` to see more rows
You probably immediately notice that one of the variable names, the one that starts with Acad Standing After FA20...
is extremely long. (You can’t see the column’s data in the clipped tibble in the console, but it’s one of the off-screen columns.) A data set should have variable names that are descriptive without being prohibitively long. Column names are easy to change in Excel, Google Sheets, etc, but you can also change them using the tidyverse function rename
.
First, a few notes about the way R handles column names. By default, column names are not allowed to contain non-syntactic characters, i.e., any character other than letters, numbers, ., and _. Also, column names cannot begin with a number or an underscore _. In particular, column names cannot have spaces. It’s best to use an underscore in place of a space in a column name, but if you want to keep the space (maybe for aesthetic reasons), you can override these restrictions by enclosing the column name in backticks. For example, if we want to rename the long variable name above, then `Acad Standing` is allowed, but Acad Standing is not.
Also, when the data set you’re importing has a column name with a space in it, the space will be retained in R. However, to refer to that column, you’ll have to enclose the column name in backticks. For example, high_school_data$`ACT Comp`
will return the ACT Comp
column in high_school_data
, but high_school_data$ACT Comp
will produce an error message.
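To see the backticks in action on our imported data set, here are the first few entries of that column (head, which we’ll meet again in Section 3.5, displays just the first six values):
head(high_school_data$`ACT Comp`)
## [1] 14 NA NA NA NA NA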
With the above in mind, renaming variables is actually very easy. The rename
function is one of the transformation functions like filter
, select
, etc, and works the same way. Let’s rename the really long Acad Standing After FA20...
column as just Acad_Standing
:
high_school_data_clean <- high_school_data %>%
  rename("Acad_Standing" = `Acad Standing After FA20 (Dean's List, Good Standing, Academic Warning, Academic Probation, Academic Dismissal, Readmission on Condition)`)
high_school_data_clean
## # A tibble: 278 x 10
## `High School GPA` `ACT Comp` SAT Co~1 High ~2 Bridg~3 FA20 ~4 FA20 ~5 Acad_~6 Race Ethni~7
## <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 2.72 14 800 0.8880~ yes 13 13 Academ~ White Hispan~
## 2 4 NA 1380 0.8389~ <NA> 15 15 Dean's~ White Non Hi~
## 3 3.08 NA 1030 0.8940~ <NA> 14 13 Good S~ <NA> Hispan~
## 4 3.69 NA 1240 <NA> <NA> 17 17 Good S~ Asian Non Hi~
## 5 NA NA NA <NA> <NA> 13 13 Dean's~ <NA> Non Re~
## 6 3.47 NA 1030 0.5330~ <NA> 16 16 Good S~ White Non Hi~
## 7 3.43 27 NA <NA> <NA> 15 15 Dean's~ White Non Hi~
## 8 3.27 NA 1080 <NA> <NA> 15 15 Good S~ Blac~ Non Hi~
## 9 3.76 NA 1120 0.9160~ <NA> 15 15 Good S~ White Non Hi~
## 10 3.23 33 NA 0.7970~ <NA> 15 11 Academ~ Asian Non Hi~
## # ... with 268 more rows, and abbreviated variable names 1: `SAT Comp`,
## # 2: `High School Quality Academic`, 3: `Bridge Program?`,
## # 4: `FA20 Credit Hours Attempted`, 5: `FA20 Credit Hours Earned`, 6: Acad_Standing,
## # 7: Ethnicity
## # i Use `print(n = ...)` to see more rows
Notice that the syntax for rename
is rename(<"NEW NAME"> = <OLD NAME>)
. Notice also that in the above code chunk, a new data set high_school_data_clean
was created to house the renamed column. The original data set, high_school_data
, will still have the original long column name. In general, it’s a good practice to not change the originally imported data set. It’s safer to always create a new data set to accept whatever changes you need to make.
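One more note on rename: it can rename several columns in a single call, which is handy when a data set has many awkward names. Here’s a quick sketch; it only displays the renamed copy rather than saving it, since the rest of this chapter relies on the remaining original names:
high_school_data_clean %>%
  rename("HS_GPA" = `High School GPA`,
         "ACT" = `ACT Comp`,
         "SAT" = `SAT Comp`)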
3.4 Missing Values
The other thing that stands out right away in high_school_data
is the abundance of NA
s. This is an issue in most data sets you will analyze. The way you handle the NA
s in your analysis will depend on the context, and we’ll see various methods as we proceed in the course. (Recall that we’ve already seen one way to work around NA
s, namely the na.rm = TRUE
argument we can include when we calculate summary statistics, such as means, sums, and standard deviations.)
Knowing the scope of the missing values in a data set is important when you first start to explore and clean. We can count how many NA
s there are in a given column using the is.na
function as follows.
Let’s start with a small sample vector, some of whose entries are missing:
<- c(4, "abc", NA, 16, 3.1, NA, 0, -3, NA) v
We can apply is.na
to the entire vector; it will be applied individually to each entry. (We say that this function is therefore “vectorized.”)
is.na(v)
## [1] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE
We see the value TRUE
in the entry spots that had an NA
in v
. Remember that R treats the logical values TRUE
and FALSE
arithmetically by giving TRUE
the value 1 and FALSE
0. Thus the sum of the entries in is.na(v)
would just be the number of TRUE
values, and hence the number of NA
s in v
:
sum(is.na(v))
## [1] 3
We can apply this process to the columns of high_school_data
, replacing the vector v
with any of the column vectors. For example, let’s count how many missing values there are in the SAT Comp
column. (Notice the backticks around SAT Comp
in the code below. Recall that this is necessary since SAT Comp
has a space in it.)
sum(is.na(high_school_data$`SAT Comp`))
## [1] 40
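Rather than repeating this for each of the ten columns, we can count the NAs in every column at once. Here’s one way, a sketch assuming dplyr 1.0 or later for the across function; the result is a one-row tibble with one NA count per column:
high_school_data %>%
  summarize(across(everything(), ~ sum(is.na(.))))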
Recall that NA
stands for “not available,” so making the decision to change an NA
to an actual value runs the risk of compromising the integrity of the data set. An analyst should be wary of “playing God” with the data. However, we can safely convert NA
s to actual values when it’s clear what the NA
s really mean or when we need to assign them a certain value in order to make a calculation work.
In high_school_data
, the Bridge Program?
column indicates whether a student was required to attend the summer Bridge Program as a preparation for taking college classes. The missing values in this column are clearly there because the person who entered the data meant it to be understood that a blank cell means “no.” While this convention is immediately understood by anyone viewing the data set, it sets up potential problems when doing our analysis. It’s always a better practice to reserve NA
for values that are truly missing rather than using it as a stand-in for some known value.
We can replace NA
s in a column with an actual value using the transformation function mutate
together with replace_na
. Let’s replace the NA
s in the Bridge Program?
column with “no.” We’ll display only this cleaned up column so we can check our work.
high_school_data_clean <- high_school_data_clean %>%
  mutate(`Bridge Program?` = replace_na(`Bridge Program?`, "no"))
high_school_data_clean %>%
  select(`Bridge Program?`)
## # A tibble: 278 x 1
## `Bridge Program?`
## <chr>
## 1 yes
## 2 no
## 3 no
## 4 no
## 5 no
## 6 no
## 7 no
## 8 no
## 9 no
## 10 no
## # ... with 268 more rows
## # i Use `print(n = ...)` to see more rows
3.4.1 Exercises
1. Do you feel confident in replacing any of the NAs in any of the columns besides Bridge Program?? Why or why not?
2. Run the command summary(high_school_data). One of the many things you see is the number of NAs in some of the columns. For what types of columns is this NA count missing?
3. For the columns from the previous exercise that did not produce an NA count, find the number of NAs using the method from this section.
4. A rough formula for converting an ACT score to a comparable SAT score is \(\textrm{SAT} = 41.6 \times \textrm{ACT} + 102.4\).
- Use this formula to create a new column in high_school_data_clean that contains the converted ACT scores for each student who took it.
- Create another new column in high_school_data_clean that contains the better of each student’s SAT and converted ACT score. (Hint: Look up the pmax function.) You’ll have to think about how to handle the many NAs in the ACT Comp and SAT Comp columns.
3.5 Detecting Entry Errors
It’s very common for data entered into a spreadsheet manually to have a few entry errors, and there are a few standard methods for detecting these errors. One of the first things to look for are numerical data points that seem way too big or way too small. A good place to start is with the summary statistics. (We’ll be using the data set high_school_data_clean
throughout this section, which is the partially cleaned version of the original high_school_data
with the renamed Acad_Standing
column and with the NA
s in the Bridge Program?
column replaced by "no"
.)
summary(high_school_data_clean)
## High School GPA ACT Comp SAT Comp High School Quality Academic
## Min. :2.279 Min. : 4.05 Min. : 100 Length:278
## 1st Qu.:3.225 1st Qu.:20.00 1st Qu.:1020 Class :character
## Median :3.625 Median :23.00 Median :1120 Mode :character
## Mean :3.534 Mean :22.62 Mean :1116
## 3rd Qu.:3.930 3rd Qu.:26.00 3rd Qu.:1220
## Max. :4.650 Max. :33.00 Max. :1480
## NA's :2 NA's :205 NA's :40
## Bridge Program? FA20 Credit Hours Attempted FA20 Credit Hours Earned Acad_Standing
## Length:278 Min. : 0.00 Min. : 0.00 Length:278
## Class :character 1st Qu.: 14.00 1st Qu.:13.00 Class :character
## Mode :character Median : 15.00 Median :15.00 Mode :character
## Mean : 15.06 Mean :13.94
## 3rd Qu.: 16.00 3rd Qu.:16.00
## Max. :150.00 Max. :19.00
##
## Race Ethnicity
## Length:278 Length:278
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
This compilation is most helpful for continuous data. We’ll deal with categorical data separately. Notice that the maximum value of FA20 Credit Hours Attempted
is quite a bit bigger than the third quartile value (and totally unrealistic anyway). This is certainly an entry error. A visual way to detect this would be to obtain a plot of the distribution of the FA20 Credit Hours Attempted
variable. Since this is a continuous variable, we’ll use a histogram. (Recall that categorical distributions are visualized with bar graphs.)
ggplot(data = high_school_data_clean) +
geom_histogram(mapping = aes(x = `FA20 Credit Hours Attempted`))
This histogram is skewed very far to the right with a huge gap between the right-most bar and the nearest bar to its left. This is another possible sign of an entry error.
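A boxplot of the same variable is another quick visual check; an extreme outlier shows up as an isolated point far from the box:
ggplot(data = high_school_data_clean) +
  geom_boxplot(mapping = aes(x = `FA20 Credit Hours Attempted`))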
We can drill down and see what’s going on with this point by sorting the FA20 Credit Hours Attempted
variable in descending order. In the code below, head
is a useful function that can be used to display only the beginning of a data set. (There’s also a tail
function that displays the end of the data set.) By default, head
displays the first 6 records, but you can override this and specify the number you want to display. Let’s display the records with the top ten highest values of FA20 Credit Hours Attempted
. (The reason for select
is to make the FA20 Credit Hours Attempted
column visible.)
head(arrange(high_school_data_clean, desc(`FA20 Credit Hours Attempted`)) %>%
       select(`FA20 Credit Hours Attempted`,
              `FA20 Credit Hours Earned`,
              Acad_Standing),
     10)
## # A tibble: 10 x 3
## `FA20 Credit Hours Attempted` `FA20 Credit Hours Earned` Acad_Standing
## <dbl> <dbl> <chr>
## 1 150 15 Good Standing
## 2 19 19 Dean's List
## 3 18 18 Dean's List
## 4 18 18 Dean's List
## 5 18 17 Good Standing
## 6 18 18 Good Standing
## 7 18 18 Good Standing
## 8 18 18 Dean's List
## 9 18 18 Dean's List
## 10 18 18 Dean's List
Since we can see the entire row with the error, we can also get an insight into what probably happened. The student earned 15 credit hours and finished the semester in good academic standing, which likely means they didn’t fail any classes. Therefore the person doing the data entry probably absent-mindedly entered 150 instead of 15.
It’s a minor judgement call, but it’s probably safe to change the 150 to 15. (The other option would be to ignore this row altogether, which is not ideal.) The easiest way to do this is to make the change in the original spreadsheet and then re-import it. However, we can also make the change directly in R. We just have to see how to access an entry in a specific row and column of our data set.
First, let’s see how to pick out a specific entry from a vector. We’ll use the vector w
to illustrate:
w <- c(4, -2, 2.4, 8, 10)
To extract the entry in the, let’s say, fourth spot in w
, we just enter:
w[4]
## [1] 8
In a tibble, each column is a vector, so we can access the entry in a given spot the same way. Suppose we want the 27th entry in the FA20 Credit Hours Attempted
column of high_school_data_clean
:
high_school_data_clean$`FA20 Credit Hours Attempted`[27]
## [1] 14
If we instead know the actual value in a vector for which we’re searching but not the spot it’s in (which will be the situation as we search for the 150 attempted credit hours value), we can use the which
function. Suppose for the vector w
above, we know that one of the entries is 2.4, but we don’t know which spot it’s in. We can locate it as follows:
which(w == 2.4)
## [1] 3
We thus see that 2.4 is the entry in the third spot in w
. Notice that we used a double equal sign ==
. This is because the argument for which
must be a truth condition on the vector. The which
function will return the location of every entry that satisfies that truth condition.
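Since which returns every matching location, a condition that several entries satisfy returns several spot numbers. For example, using the vector w from above:
which(w > 0)
## [1] 1 3 4 5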
We can use this method to find where 150 is in FA20 Credit Hours Attempted
. Let’s store the spot number in the variable spot_150
so we can refer to it in the follow-up.
spot_150 <- which(high_school_data_clean$`FA20 Credit Hours Attempted` == 150)
spot_150
## [1] 74
So our problematic 150 value is in the 74th row. We can now make the change. We just have to overwrite the 150 in the column FA20 Credit Hours Attempted
and row number 74 with 15.
high_school_data_clean$`FA20 Credit Hours Attempted`[spot_150] <- 15
Entry errors for categorical variables can be harder to detect than for continuous ones. In this case, entry errors often take the form of inconsistent naming. For example, let’s look at the Race
column in high_school_data_clean
. One way to check for inconsistencies is to use group_by
and summarize
to see how many times each value in Race
shows up in the data set:
high_school_data_clean %>%
  group_by(Race) %>%
  summarize(count = n())
## # A tibble: 7 x 2
## Race count
## <chr> <int>
## 1 American/Alaska Native 5
## 2 Asian 9
## 3 Black or African American 11
## 4 Black/African American 1
## 5 Hawaiian/Pacific Islander 1
## 6 White 230
## 7 <NA> 21
We should especially pay attention to the values with low counts since these might indicate inconsistencies. Notice that “Black or African American” and “Black/African American” are both listed as distinct values, although this is certainly because of an inconsistent naming system. Since “Black/African American” only shows up once, this is probably a mistake and should be changed to “Black or African American.” We can make this change with the str_replace
function within mutate
. The syntax is str_replace(<VARIABLE>, <VALUE TO REPLACE>, <NEW VALUE>)
:
high_school_data_clean <- high_school_data_clean %>%
  mutate(Race = str_replace(Race, "Black/African American", "Black or African American"))
Running the grouped summary again on high_school_data_clean
shows that the problem has been fixed:
high_school_data_clean %>%
  group_by(Race) %>%
  summarize(count = n())
## # A tibble: 6 x 2
## Race count
## <chr> <int>
## 1 American/Alaska Native 5
## 2 Asian 9
## 3 Black or African American 12
## 4 Hawaiian/Pacific Islander 1
## 5 White 230
## 6 <NA> 21
3.5.1 Exercises
1. Look for possible entry errors for other continuous variables in high_school_data_clean. Explain why you think there’s an entry error by:
- Citing the summary statistics.
- Referring to a visualization of the variable’s distribution.
- Sorting the data set by the variable and displaying the first or last few entries.
2. For any of the entry errors you found in the previous exercise, decide whether you think you should fix them. If so, proceed to do so, but if not, explain why.
3. Suppose the owner of the data set tells you that the 100 in the SAT Comp column was supposed to be 1000. Make this change directly in R.
4. We saw above that <VECTOR>[<NUMBER>] returns the entry from <VECTOR> in spot <NUMBER>. Suppose we want to return the entries from several of the spots within <VECTOR>. For the vector t <- c("a", "b", "c", "d", "e"), try to come up with a single-line command that will return the entries in spots 1, 4, and 5.
5. For this exercise, you will need the Lahman package.
- Use the which function to return all of the row numbers from Batting that have a home run (HR) value from 50 through 59.
- Use part (a) and the idea from Exercise 4 to write a single-line command that will return a list of the players who had a home run total from 50 through 59. (A player could be listed more than once.)
3.6 Tidying Data
Before beginning any data analysis, we would ideally like to have data that is properly named, has as few missing values as possible, and is devoid of entry errors. We also want our data to be tidy, meaning that every variable must have its own dedicated column, every observation must have its own dedicated row, and every value must have its own dedicated cell. In this section, we’ll see how to deal with untidy data by examining four built-in untidy data sets. First, here’s an example of a tidy data set. The cases
variable refers to the number of cases of tuberculosis reported in the given year.
table1
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
It might seem at first that this is not tidy since it looks like observations are spread over two rows. However, though each country appears in two rows, the year
value in each row is different, making them different observations.
Below, we’ll pose four problems that could make our data untidy and offer a way to fix each one.
Problem: more than one value in a cell
Solution: separate
Our first untidy data set is table3
:
table3
## # A tibble: 6 x 3
## country year rate
## * <chr> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
The rate
column seems to contain the number of tuberculosis cases divided by the population, which would give the proportion of the population that contracted tuberculosis. There are a few problems with this. First, it’s not very helpful to report rates this way. Rates are usually more meaningful as decimals or percentages, but even if this is how rate
were displayed, remember that when comparing rates, it’s important to also display an overall count. In other words, we should convert the fractions in rate
to decimals, but we should also keep the actual case and population numbers, each in its own column. The way to do this is to use the separate
function. Notice that the into =
argument below is the vector of new columns we want to create to house the two numbers in rate
. We’ll also go ahead and add a true rate
column reported as a percentage:
table3_tidy <- table3 %>%
  separate(rate, into = c("cases", "population")) %>%
  mutate(rate = cases / population * 100)
## Error in `mutate()`:
## ! Problem while computing `rate = cases/population * 100`.
## Caused by error in `cases / population`:
## ! non-numeric argument to binary operator
What does this error message mean? There’s apparently a problem with rate = cases/population * 100
, and it’s that there’s a non-numeric argument to binary operator
. A binary operator is an arithmetic operation that takes two numbers as input and produces a single numeric output. The binary operator of concern here is division, so the error message must be saying that one or both of cases
or population
is non-numeric. Let’s check this out by running the above code again with the offending mutate
commented out:
table3_tidy <- table3 %>%
  separate(rate, into = c("cases", "population")) #%>%
  #mutate(rate = cases / population * 100)
table3_tidy
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <chr> <chr>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
Notice that, indeed, cases
and population
are both listed as characters. The reason is that the default procedure for separate
is to make the separated variables into characters. We can override this by including the convert = TRUE
argument in separate
:
table3_tidy <- table3 %>%
  separate(rate, into = c("cases", "population"), convert = TRUE) #%>%
  #mutate(rate = cases / population * 100)
table3_tidy
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
Since we’re now able to compute the rate, let’s remove the #
and run the mutate
:
table3_tidy <- table3 %>%
  separate(rate, into = c("cases", "population"), convert = TRUE) %>%
  mutate(rate = cases / population * 100)
table3_tidy
## # A tibble: 6 x 5
## country year cases population rate
## <chr> <int> <int> <int> <dbl>
## 1 Afghanistan 1999 745 19987071 0.00373
## 2 Afghanistan 2000 2666 20595360 0.0129
## 3 Brazil 1999 37737 172006362 0.0219
## 4 Brazil 2000 80488 174504898 0.0461
## 5 China 1999 212258 1272915272 0.0167
## 6 China 2000 213766 1280428583 0.0167
There’s still a problem with rate
, though. The percentages are so small as to be somewhat meaningless. The reason is that a percentage tells you how many cases there are for every hundred people (cent = hundred). When the number is so small, it’s better to report the rate as cases per some other power of 10. In this case, if we multiply the percentages by 1000, we get more meaningful numbers, although now we’re calculating cases per 100,000:
table3_tidy <- table3 %>%
  separate(rate, into = c("cases", "population"), convert = TRUE) %>%
  mutate(rate_per_100K = cases / population * 100000)
table3_tidy
## # A tibble: 6 x 5
## country year cases population rate_per_100K
## <chr> <int> <int> <int> <dbl>
## 1 Afghanistan 1999 745 19987071 3.73
## 2 Afghanistan 2000 2666 20595360 12.9
## 3 Brazil 1999 37737 172006362 21.9
## 4 Brazil 2000 80488 174504898 46.1
## 5 China 1999 212258 1272915272 16.7
## 6 China 2000 213766 1280428583 16.7
By default, separate will split the cell values at any non-syntactic character, such as a space, a comma, an arithmetic symbol, etc. However, you can specify where the separation is to occur by including the optional sep = <SEPARATION CHARACTER>
argument. The separate
function will look for instances of <SEPARATION CHARACTER>
in the column to be separated and split the values at that character.
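Here’s a small illustration with a made-up tibble (names_df is hypothetical, created just for this example), splitting “last, first” style names at the comma and space:
names_df <- tibble(full_name = c("Chamberlain, Wilt", "Bryant, Kobe"))
names_df %>%
  separate(full_name, into = c("last", "first"), sep = ", ")
## # A tibble: 2 x 2
## last first
## <chr> <chr>
## 1 Chamberlain Wilt
## 2 Bryant Kobe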
While the sep =
argument allows you to split the cells at any named character, sometimes you’ll want to split the cells into strings of a certain length instead, regardless of the characters involved. Look at this small sample data set, which contains a runner’s split times in a 5K race:
split_times <- tribble(
  ~`distance (km)`, ~`time (min)`,
  1, 430,
  2, 920,
  3, 1424,
  4, 1920,
  5, 2425
)
split_times
## # A tibble: 5 x 2
## `distance (km)` `time (min)`
## <dbl> <dbl>
## 1 1 430
## 2 2 920
## 3 3 1424
## 4 4 1920
## 5 5 2425
It obviously didn’t take the runner 430 minutes to finish the first kilometer; the numbers in the time (min)
column are clearly meant to be interpreted as “minutes:seconds.” The first 1 or 2 digits are minutes, and the last two are seconds. Let’s fix this.
First, we should separate the last two digits from the first 1 or 2, although there’s no character we can give to sep =
. Luckily, sep =
also accepts a number that states how many characters into the string we make our separation. If we include the argument sep = <NUMBER>
, then the cell will be separated <NUMBER>
characters in from the left if <NUMBER>
is positive, and from the right otherwise. In our data set above, we want the separation to occur 2 characters in from the right, so we’ll use sep = -2
:
split_times <- split_times %>%
  separate(`time (min)`, into = c("minutes", "seconds"), sep = -2)
split_times
## # A tibble: 5 x 3
## `distance (km)` minutes seconds
## <dbl> <chr> <chr>
## 1 1 4 30
## 2 2 9 20
## 3 3 14 24
## 4 4 19 20
## 5 5 24 25
Next, we’ll look at the unite
function and see how we can combine the minutes
and seconds
columns back into a single column with a colon :
inserted.
Problem: one value is spread across two cells
Solution: unite
Consider the built-in table5
:
table5
## # A tibble: 6 x 4
## country century year rate
## * <chr> <chr> <chr> <chr>
## 1 Afghanistan 19 99 745/19987071
## 2 Afghanistan 20 00 2666/20595360
## 3 Brazil 19 99 37737/172006362
## 4 Brazil 20 00 80488/174504898
## 5 China 19 99 212258/1272915272
## 6 China 20 00 213766/1280428583
This also has the problematic rate
column, which we addressed with separate
above, but the issue to confront here is the unnecessary separation of the century
and year
variables. It would be better to paste them together into a single variable, and we can do so with the unite
function. Note below that "year"
(with quotes) refers to the newly created column that the two variables will be united in, while year
(no quotes) refers to the already existing year
variable in table5
.
table5_tidy <- table5 %>%
  unite("year", century, year)
table5_tidy
## # A tibble: 6 x 3
## country year rate
## <chr> <chr> <chr>
## 1 Afghanistan 19_99 745/19987071
## 2 Afghanistan 20_00 2666/20595360
## 3 Brazil 19_99 37737/172006362
## 4 Brazil 20_00 80488/174504898
## 5 China 19_99 212258/1272915272
## 6 China 20_00 213766/1280428583
You can see that by default, unite
separates combined values with an underscore _
. We can override this with the optional sep = <SEPARATION CHARACTER>
argument. In this case, we don’t want any separation, so we’ll use an empty pair of quotation marks as our separation character:
table5_tidy <- table5 %>%
  unite("year", century, year, sep = "")
table5_tidy
## # A tibble: 6 x 3
## country year rate
## <chr> <chr> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
The only other bit of cleaning we should do (aside from the mess with rate
) is to convert the new year
variable to an integer. (unite
doesn’t offer a convert = TRUE
option like separate
does.)
table5_tidy <- table5_tidy %>%
  mutate(year = as.integer(year))
table5_tidy
## # A tibble: 6 x 3
## country year rate
## <chr> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
We can now finish tidying the split_times
data set from above:
split_times <- split_times %>%
  unite("time", minutes, seconds, sep = ":")
split_times
## # A tibble: 5 x 2
## `distance (km)` time
## <dbl> <chr>
## 1 1 4:30
## 2 2 9:20
## 3 3 14:24
## 4 4 19:20
## 5 5 24:25
Of course, in practice, we’d combine the separate
and unite
in the pipe:
split_times <- split_times %>%
  separate(`time (min)`, into = c("minutes", "seconds"), sep = -2) %>%
  unite("time", minutes, seconds, sep = ":")
Problem: variables used as values
Solution: pivot_wider
Consider the data set:
table2
## # A tibble: 12 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
As you can see, each observation, which consists of a given country in a given year, is spread over two rows. The variable type
has values (cases
and population
) which are actually variables themselves, and their values are the corresponding numbers in the count
column. We should thus perform a pivot on the table that turns the values from the type
column into variables and assigns the corresponding values for these new variables from count
. A pivot of this type has the potential to give the data table more columns than it currently has (although not in this case) and thus make the table wider. The syntax below reflects exactly what we want to do:
table2 %>%
  pivot_wider(names_from = type, values_from = count)
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
Problem: values used as variables
Solution: pivot_longer
Now consider the following data sets:
table4a
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
and
table4b
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 19987071 20595360
## 2 Brazil 172006362 174504898
## 3 China 1272915272 1280428583
The column names 1999
and 2000
are certainly not variables; they’re values of a missing year
variable. This is the opposite of the problem with table2
above: values are being used as variables. We thus will need to perform a pivot which will send the variable names 1999
and 2000
to values of a year
variable. We will also have to say what to do with the values in the 1999
and 2000
columns. In table4a
, we can recognize these as numbers of tuberculosis cases. In table4b
, we can recognize them as populations of the countries.
Since we’re sending variable names to values, a pivot of this type has the potential to make the data table longer. The code below shows how to do this. The first argument is a vector that lists the column names that are being pivoted to values. (Notice the backticks around 1999
and 2000
, which are necessary for names that don’t start with a letter.)
table4a_tidy <- table4a %>%
  pivot_longer(cols = c(`1999`, `2000`), names_to = "year", values_to = "cases")
table4a_tidy
## # A tibble: 6 x 3
## country year cases
## <chr> <chr> <int>
## 1 Afghanistan 1999 745
## 2 Afghanistan 2000 2666
## 3 Brazil 1999 37737
## 4 Brazil 2000 80488
## 5 China 1999 212258
## 6 China 2000 213766
table4b_tidy <- table4b %>%
  pivot_longer(cols = c(`1999`, `2000`), names_to = "year", values_to = "population")
table4b_tidy
## # A tibble: 6 x 3
## country year population
## <chr> <chr> <int>
## 1 Afghanistan 1999 19987071
## 2 Afghanistan 2000 20595360
## 3 Brazil 1999 172006362
## 4 Brazil 2000 174504898
## 5 China 1999 1272915272
## 6 China 2000 1280428583
Notice that pivoting the years 1999
and 2000
away from variable names and into values converted them to character type. We should manually convert them back to integers since pivot_longer
also does not have a convert = TRUE
option.
table4a_tidy <- table4a_tidy %>%
  mutate(year = as.integer(year))
table4b_tidy <- table4b_tidy %>%
  mutate(year = as.integer(year))
3.6.1 Exercises
You will need the nycflights13 package for these exercises.
1. Create a version of the flights data set that has a single column for the date, with dates entered as mm-dd-yyyy, rather than three separate columns for the month, day, and year.
2. Click here to go to the Wikipedia page that lists all members of the Baseball Hall of Fame. Then do the following:
- Copy and paste the list, including the column headings, into an Excel document and import it into R.
- Create a data table that lists only the name of each hall of famer in a single column, alphabetized by last name. Be sure there aren’t any NA entries. Try to execute this using a sequence of commands in the pipe.
- Are there any problematic entries in your list? What’s going on with these? (Hint: Try to interpret the warning message you’ll get.)
3. How would you tidy the following data table?
patients <- tribble(
  ~name, ~attribute, ~value,
  "Stephen Johnson", "age", 47,
  "Stephen Johnson", "weight", 185,
  "Ann Smith", "age", 39,
  "Ann Smith", "weight", 125,
  "Ann Smith", "height", 64,
  "Mark Davidson", "age", 51,
  "Mark Davidson", "weight", 210,
  "Mark Davidson", "height", 73
)
4. Repeat the previous exercise for this data table. What problem do you run into? (Use the View command to see the pivoted table.) What could be done to avoid this problem?
patients <- tribble(
  ~name, ~attribute, ~value,
  "Stephen Johnson", "age", 47,
  "Stephen Johnson", "weight", 185,
  "Ann Smith", "age", 39,
  "Ann Smith", "weight", 125,
  "Ann Smith", "height", 64,
  "Mark Davidson", "age", 51,
  "Mark Davidson", "weight", 210,
  "Mark Davidson", "height", 73,
  "Ann Smith", "age", 43,
  "Ann Smith", "weight", 140,
  "Ann Smith", "height", 66
)
5. How should the table below be tidied?
preg <- tribble(
  ~pregnant, ~male, ~female,
  "yes", NA, 10,
  "no", 20, 12
)
3.7 Merging Data
Ideally, we would like all of the information in table4a_tidy
and table4b_tidy
from the last section to appear in a single table. A set of more than one data table is called relational data, and it’s often necessary to have the related data tables communicate with each other.
Merging relational data into a single table is a very common practice. When merging relational data, the first step is to identify the key of each table. The key is a variable or set of variables that uniquely identifies each observation. In both table4a_tidy
and table4b_tidy
, the key is the pair of variables country
and year
. (Notice that neither of these variables alone can be a key since knowing only one of them does not uniquely identify an observation.)
A good way to check whether a set of variables is a key is to group the data by these variables and count how many observations show up in each group. If the variables constitute a key, there should only be one observation per group.
table4a_tidy %>%
  group_by(country, year) %>%
  summarize(count = n())
## # A tibble: 6 x 3
## # Groups: country [3]
## country year count
## <chr> <int> <int>
## 1 Afghanistan 1999 1
## 2 Afghanistan 2000 1
## 3 Brazil 1999 1
## 4 Brazil 2000 1
## 5 China 1999 1
## 6 China 2000 1
We thus see that the pair country
and year
is a key for table4a_tidy
. The same will be true for table4b_tidy
.
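Indeed, running the same grouped count on table4b_tidy confirms it:
table4b_tidy %>%
  group_by(country, year) %>%
  summarize(count = n())
## # A tibble: 6 x 3
## # Groups: country [3]
## country year count
## <chr> <int> <int>
## 1 Afghanistan 1999 1
## 2 Afghanistan 2000 1
## 3 Brazil 1999 1
## 4 Brazil 2000 1
## 5 China 1999 1
## 6 China 2000 1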
For these two tables, the values of the key are identical: We have the exact same six pairs of country
/year
values. In this case, an appropriate way to merge them into a single table is the inner_join
function. The by
argument below is the vector of key variables.
table4a4b_comb <- inner_join(table4a_tidy, table4b_tidy, by = c("country", "year"))
table4a4b_comb
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
It’s more often the case that the key values in your relational data are not identical. In this case, we might need to use other types of joins. We’ll explore these in the exercises.
One more type of merge we’ll examine here is useful when you have two data tables that you want to append to each other, meaning that you want to insert the rows of one table right after the rows of another. (Think of copying the cells of an Excel spreadsheet and pasting them after the last cell of another.) This can be accomplished with the union function. It assumes that the two data tables have the exact same variables. (One caveat: union also drops any rows that are exact duplicates across the two tables, so if duplicate rows should be kept, bind_rows is an alternative.)
Let’s create two small data tables using the tribble
function:
grades_2021 <- tribble(
  ~year, ~class, ~grade,
  2021, "HY 102", "B",
  2021, "MS 121", "A",
  2021, "PH 101", "A-",
  2021, "BY 170", "B+"
)
grades_2022 <- tribble(
  ~year, ~class, ~grade,
  2022, "AT 150", "B-",
  2022, "MS 122", "A-",
  2022, "PG 100", "A",
  2022, "BY 171", "A-",
  2022, "CS 152", "B+"
)
Now let’s merge these together by appending the 2022 grades right after the 2021 grades:
grades <- union(grades_2021, grades_2022)
grades
## # A tibble: 9 x 3
## year class grade
## <dbl> <chr> <chr>
## 1 2021 HY 102 B
## 2 2021 MS 121 A
## 3 2021 PH 101 A-
## 4 2021 BY 170 B+
## 5 2022 AT 150 B-
## 6 2022 MS 122 A-
## 7 2022 PG 100 A
## 8 2022 BY 171 A-
## 9 2022 CS 152 B+
3.7.1 Exercises
You will need the Lahman and nycflights13 packages for these exercises.
1. Identify a key variable or set of variables for each of the following data sets. Verify your answers by computing a grouped count.
- msleep
- Batting
- flights
2. Does mpg have a key? If so, what is it? If not, how might you create one? (Hint: Research the row_number() function.)
3. Enter the following two small data sets:
sample1 <- tribble(
  ~x, ~y,
  1, "a",
  2, "b",
  3, "c"
)
sample2 <- tribble(
  ~x, ~z,
  1, "p",
  2, "q",
  4, "r"
)
Then perform each of the following joins, and based on the results, explain what each one does.
inner_join(sample1, sample2, by = "x")
left_join(sample1, sample2, by = "x")
right_join(sample1, sample2, by = "x")
full_join(sample1, sample2, by = "x")
4. Another data set included in Lahman is Salaries, which contains player salaries for various years. If you wanted to add a salary column to the Pitching data set, which of the four types of joins from the previous exercise would be appropriate? Perform this join.
5. Create a data set that contains the name of every major league baseball player since 1871 (both pitchers and batters) along with the year and their salary for that year. If a player’s salary is not available for a given year, the player should still show up in the list with an NA in the salary column.
3.8 Cleaning Example
Data cleaning can be a long, tedious process, and almost all real-life data that you’ll have to analyze will need a lot of it. The people who prepare the data often aren’t doing so with a follow-up analysis in mind; they just need a way to represent it that lists all of the relevant information in some systematic way.
We’ll close this chapter by preparing a real-life data set based on two related built-in data sets, who
and population
, from the 2014 World Health Organization Global Tuberculosis Report. As always, the very first steps are to view the data:
who
## # A tibble: 7,240 x 60
## country iso2 iso3 year new_s~1 new_s~2 new_s~3 new_s~4 new_s~5 new_s~6 new_s~7 new_s~8
## <chr> <chr> <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 Afghanis~ AF AFG 1980 NA NA NA NA NA NA NA NA
## 2 Afghanis~ AF AFG 1981 NA NA NA NA NA NA NA NA
## 3 Afghanis~ AF AFG 1982 NA NA NA NA NA NA NA NA
## 4 Afghanis~ AF AFG 1983 NA NA NA NA NA NA NA NA
## 5 Afghanis~ AF AFG 1984 NA NA NA NA NA NA NA NA
## 6 Afghanis~ AF AFG 1985 NA NA NA NA NA NA NA NA
## 7 Afghanis~ AF AFG 1986 NA NA NA NA NA NA NA NA
## 8 Afghanis~ AF AFG 1987 NA NA NA NA NA NA NA NA
## 9 Afghanis~ AF AFG 1988 NA NA NA NA NA NA NA NA
## 10 Afghanis~ AF AFG 1989 NA NA NA NA NA NA NA NA
## # ... with 7,230 more rows, 48 more variables: new_sp_f1524 <int>, new_sp_f2534 <int>,
## # new_sp_f3544 <int>, new_sp_f4554 <int>, new_sp_f5564 <int>, new_sp_f65 <int>,
## # new_sn_m014 <int>, new_sn_m1524 <int>, new_sn_m2534 <int>, new_sn_m3544 <int>,
## # new_sn_m4554 <int>, new_sn_m5564 <int>, new_sn_m65 <int>, new_sn_f014 <int>,
## # new_sn_f1524 <int>, new_sn_f2534 <int>, new_sn_f3544 <int>, new_sn_f4554 <int>,
## # new_sn_f5564 <int>, new_sn_f65 <int>, new_ep_m014 <int>, new_ep_m1524 <int>,
## # new_ep_m2534 <int>, new_ep_m3544 <int>, new_ep_m4554 <int>, new_ep_m5564 <int>, ...
## # i Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
population
## # A tibble: 4,060 x 3
## country year population
## <chr> <int> <int>
## 1 Afghanistan 1995 17586073
## 2 Afghanistan 1996 18415307
## 3 Afghanistan 1997 19021226
## 4 Afghanistan 1998 19496836
## 5 Afghanistan 1999 19987071
## 6 Afghanistan 2000 20595360
## 7 Afghanistan 2001 21347782
## 8 Afghanistan 2002 22202806
## 9 Afghanistan 2003 23116142
## 10 Afghanistan 2004 24018682
## # ... with 4,050 more rows
## # i Use `print(n = ...)` to see more rows
(Or use View(who) and View(population) to see the full tables.) The next step is to read the documentation. This documentation would often be found in the publication or web site from which the data is pulled, but since these are built-in data sets, we’re lucky to have access to their dictionaries right in R. They’re both found in the same place:
?who
We will clean these data sets and then formulate questions to ask about them. We’ll then try to answer these questions using visualizations and transformations from the previous chapters.
The population
data set looks ready to go. A couple of things might stand out right away in who
, though, namely the redundant iso2
and iso3
columns, the several NA
s, and the strange looking column names.
Let’s address the column names first. The documentation explains that the column names encode several pieces of information. They all begin with new
, indicating that these are newly diagnosed cases. After new_
, the next string encodes the method of diagnosis. For example, rel
means the case was a relapse. The next string is an m
(for male) or f
(for female), followed by numbers which indicate an age range. Thus each column heading is actually a set of four data values. The numbers in these columns are the numbers of cases. Since we have values being used as column names, we should pivot the column names into values. This will make the data table longer, so we’ll use pivot_longer
. We’ll also get rid of the iso2
and iso3
columns while we’re at it.
who_clean <- who %>%
  select(-iso2, -iso3) %>%
  pivot_longer(cols = new_sp_m014:newrel_f65, names_to = "case_type", values_to = "cases")
who_clean
## # A tibble: 405,440 x 4
## country year case_type cases
## <chr> <int> <chr> <int>
## 1 Afghanistan 1980 new_sp_m014 NA
## 2 Afghanistan 1980 new_sp_m1524 NA
## 3 Afghanistan 1980 new_sp_m2534 NA
## 4 Afghanistan 1980 new_sp_m3544 NA
## 5 Afghanistan 1980 new_sp_m4554 NA
## 6 Afghanistan 1980 new_sp_m5564 NA
## 7 Afghanistan 1980 new_sp_m65 NA
## 8 Afghanistan 1980 new_sp_f014 NA
## 9 Afghanistan 1980 new_sp_f1524 NA
## 10 Afghanistan 1980 new_sp_f2534 NA
## # ... with 405,430 more rows
## # i Use `print(n = ...)` to see more rows
Another easy cleanup is to remove the rows with missing values for `cases`, as any question we could ask about this data would depend on knowing the number of cases.
who_clean <- who_clean %>%
  filter(!is.na(cases))
who_clean
## # A tibble: 76,046 x 4
## country year case_type cases
## <chr> <int> <chr> <int>
## 1 Afghanistan 1997 new_sp_m014 0
## 2 Afghanistan 1997 new_sp_m1524 10
## 3 Afghanistan 1997 new_sp_m2534 6
## 4 Afghanistan 1997 new_sp_m3544 3
## 5 Afghanistan 1997 new_sp_m4554 5
## 6 Afghanistan 1997 new_sp_m5564 2
## 7 Afghanistan 1997 new_sp_m65 0
## 8 Afghanistan 1997 new_sp_f014 5
## 9 Afghanistan 1997 new_sp_f1524 38
## 10 Afghanistan 1997 new_sp_f2534 36
## # ... with 76,036 more rows
## # i Use `print(n = ...)` to see more rows
You may have noticed during the pivot above that there are, annoyingly, some inconsistent naming conventions. For example, some of the case types have an underscore after `new` and some do not. Let’s perform a grouped count on the `case_type` variable to see the extent of this problem. Notice that we’re not storing this grouped table in `who_clean`, since we’re only using it to detect the entry error.
case_type_count <- who_clean %>%
  group_by(case_type) %>%
  summarize(count = n())
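As an aside, dplyr’s `count` function collapses this group-then-summarize pattern into a single call; the only difference is that the count column comes back named `n` rather than `count`:

# Equivalent shortcut; the count column is named n
who_clean %>%
  count(case_type)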
By viewing the full `case_type_count` table (`View(case_type_count)`), we see that the last 14 values are entered as `newrel` rather than `new_rel`. Here’s what you’d see:
tail(case_type_count, 16)
## # A tibble: 16 x 2
## case_type count
## <chr> <int>
## 1 new_sp_m5564 3218
## 2 new_sp_m65 3209
## 3 newrel_f014 190
## 4 newrel_f1524 184
## 5 newrel_f2534 182
## 6 newrel_f3544 183
## 7 newrel_f4554 183
## 8 newrel_f5564 183
## 9 newrel_f65 185
## 10 newrel_m014 190
## 11 newrel_m1524 182
## 12 newrel_m2534 183
## 13 newrel_m3544 184
## 14 newrel_m4554 184
## 15 newrel_m5564 185
## 16 newrel_m65 182
We can fix this with `str_replace`:
who_clean <- who_clean %>%
  mutate(case_type = str_replace(case_type, "newrel", "new_rel"))
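As a quick sanity check, searching for the old spelling should now come up empty:

# Should return 0 now that every "newrel" has been rewritten as "new_rel"
who_clean %>%
  filter(str_detect(case_type, "newrel")) %>%
  nrow()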
Now let’s separate the `case_type` variable into each of its four parts. We’ll first separate at the underscores, which `separate` will do by default since they’re non-alphanumeric characters. (Do you see why we first had to fix the `newrel` entry error?)
who_clean <- who_clean %>%
  separate(case_type, into = c("new or old", "diagnosis type", "sex_age"))
who_clean
## # A tibble: 76,046 x 6
## country year `new or old` `diagnosis type` sex_age cases
## <chr> <int> <chr> <chr> <chr> <int>
## 1 Afghanistan 1997 new sp m014 0
## 2 Afghanistan 1997 new sp m1524 10
## 3 Afghanistan 1997 new sp m2534 6
## 4 Afghanistan 1997 new sp m3544 3
## 5 Afghanistan 1997 new sp m4554 5
## 6 Afghanistan 1997 new sp m5564 2
## 7 Afghanistan 1997 new sp m65 0
## 8 Afghanistan 1997 new sp f014 5
## 9 Afghanistan 1997 new sp f1524 38
## 10 Afghanistan 1997 new sp f2534 36
## # ... with 76,036 more rows
## # i Use `print(n = ...)` to see more rows
Now let’s separate `sex_age`, which will mean splitting off the `m` or `f` character, hence the `sep = 1` argument:
who_clean <- who_clean %>%
  separate(sex_age, into = c("sex", "age"), sep = 1)
who_clean
## # A tibble: 76,046 x 7
## country year `new or old` `diagnosis type` sex age cases
## <chr> <int> <chr> <chr> <chr> <chr> <int>
## 1 Afghanistan 1997 new sp m 014 0
## 2 Afghanistan 1997 new sp m 1524 10
## 3 Afghanistan 1997 new sp m 2534 6
## 4 Afghanistan 1997 new sp m 3544 3
## 5 Afghanistan 1997 new sp m 4554 5
## 6 Afghanistan 1997 new sp m 5564 2
## 7 Afghanistan 1997 new sp m 65 0
## 8 Afghanistan 1997 new sp f 014 5
## 9 Afghanistan 1997 new sp f 1524 38
## 10 Afghanistan 1997 new sp f 2534 36
## # ... with 76,036 more rows
## # i Use `print(n = ...)` to see more rows
Now that the columns are separated, notice that it looks like every value in `new or old` is `new`. Let’s check:
n_distinct(who_clean$`new or old`)
## [1] 1
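To see the value itself rather than just the count, base R’s `unique` does the trick (it returns the single value “new”):

# Every entry in `new or old` is the same value
unique(who_clean$`new or old`)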
In this case, the `new or old` column is not helpful, so let’s remove it:
who_clean <- who_clean %>%
  select(-`new or old`)
Now let’s clean up the `age` column, making it clearer that it represents an age range. Let’s first split off the last two digits. (Do you see why we wouldn’t want to split off the first two?)
who_clean <- who_clean %>%
  separate(age, into = c("first", "last"), sep = -2)
who_clean
## # A tibble: 76,046 x 7
## country year `diagnosis type` sex first last cases
## <chr> <int> <chr> <chr> <chr> <chr> <int>
## 1 Afghanistan 1997 sp m "0" 14 0
## 2 Afghanistan 1997 sp m "15" 24 10
## 3 Afghanistan 1997 sp m "25" 34 6
## 4 Afghanistan 1997 sp m "35" 44 3
## 5 Afghanistan 1997 sp m "45" 54 5
## 6 Afghanistan 1997 sp m "55" 64 2
## 7 Afghanistan 1997 sp m "" 65 0
## 8 Afghanistan 1997 sp f "0" 14 5
## 9 Afghanistan 1997 sp f "15" 24 38
## 10 Afghanistan 1997 sp f "25" 34 36
## # ... with 76,036 more rows
## # i Use `print(n = ...)` to see more rows
Now we can unite these columns back together, separating the values by a dash, so that the age ranges look like `25-34`, for example:
who_clean <- who_clean %>%
  unite("age range", first, last, sep = "-")
who_clean
## # A tibble: 76,046 x 6
## country year `diagnosis type` sex `age range` cases
## <chr> <int> <chr> <chr> <chr> <int>
## 1 Afghanistan 1997 sp m 0-14 0
## 2 Afghanistan 1997 sp m 15-24 10
## 3 Afghanistan 1997 sp m 25-34 6
## 4 Afghanistan 1997 sp m 35-44 3
## 5 Afghanistan 1997 sp m 45-54 5
## 6 Afghanistan 1997 sp m 55-64 2
## 7 Afghanistan 1997 sp m -65 0
## 8 Afghanistan 1997 sp f 0-14 5
## 9 Afghanistan 1997 sp f 15-24 38
## 10 Afghanistan 1997 sp f 25-34 36
## # ... with 76,036 more rows
## # i Use `print(n = ...)` to see more rows
This creates a new problem, though, because the “65 and older” values are now expressed as `-65`. Another `str_replace` easily fixes this:
who_clean <- who_clean %>%
  mutate(`age range` = str_replace(`age range`, "-65", "65 and older"))
who_clean
## # A tibble: 76,046 x 6
## country year `diagnosis type` sex `age range` cases
## <chr> <int> <chr> <chr> <chr> <int>
## 1 Afghanistan 1997 sp m 0-14 0
## 2 Afghanistan 1997 sp m 15-24 10
## 3 Afghanistan 1997 sp m 25-34 6
## 4 Afghanistan 1997 sp m 35-44 3
## 5 Afghanistan 1997 sp m 45-54 5
## 6 Afghanistan 1997 sp m 55-64 2
## 7 Afghanistan 1997 sp m 65 and older 0
## 8 Afghanistan 1997 sp f 0-14 5
## 9 Afghanistan 1997 sp f 15-24 38
## 10 Afghanistan 1997 sp f 25-34 36
## # ... with 76,036 more rows
## # i Use `print(n = ...)` to see more rows
Lastly, we should check that the `cases` column doesn’t have any potential entry errors. We can start with the summary statistics:
summary(who_clean$cases)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 3.0 26.0 570.7 184.0 250051.0
The maximum value seems very large. Let’s look at the distribution:
ggplot(data = who_clean) +
geom_histogram(mapping = aes(x = cases))
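Most of the bars are squeezed against zero. As an optional side trip, a log10 x-axis spreads the counts out (rows with 0 cases become non-finite on a log scale, so expect a warning about removed rows):

# Same histogram with a log10 x-axis; zero-case rows are dropped with a warning
ggplot(data = who_clean) +
  geom_histogram(mapping = aes(x = cases)) +
  scale_x_log10()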
In any case, the raw distribution is skewed way to the right, which might indicate an entry error. Let’s sort by `cases` and look at the top of the list:
head(arrange(who_clean, desc(cases)))
## # A tibble: 6 x 6
## country year `diagnosis type` sex `age range` cases
## <chr> <int> <chr> <chr> <chr> <int>
## 1 India 2007 sn m 35-44 250051
## 2 India 2007 sn f 35-44 148811
## 3 China 2013 rel m 65 and older 124476
## 4 China 2013 rel m 55-64 112558
## 5 India 2007 ep m 35-44 105825
## 6 India 2007 ep f 35-44 101015
It looks like there actually are some very high case counts in the table, but they come from India and China, the two most populous countries in the world. While the value 250,051 does seem extremely high (maybe it should be 150,051?), there doesn’t seem to be enough rationale to change it. If there had been any unrealistically high, negative, or decimal values (decimals are ruled out since the variable type is “integer,” and the summary above shows a minimum of 0, ruling out negatives), then we would probably just filter out those observations.
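Had such values turned up, a filter along these lines would remove them. This is purely hypothetical here, and the 300,000 cutoff is an arbitrary placeholder you would have to justify from domain knowledge:

# Hypothetical cleanup: keep only non-negative counts below some
# domain-justified cutoff (300000 is a placeholder, not a real threshold)
who_clean %>%
  filter(cases >= 0, cases < 300000)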
At this point, `who_clean` is in good shape. It should be noted that the piecemeal approach above would probably not be done in practice. Using the pipe, we could condense all of the cleaning we just did into the following:
who_clean <- who %>%
  select(-iso2, -iso3) %>%
  pivot_longer(cols = new_sp_m014:newrel_f65, names_to = "case_type", values_to = "cases") %>%
  filter(!is.na(cases)) %>%
  mutate(case_type = str_replace(case_type, "newrel", "new_rel")) %>%
  separate(case_type, into = c("new or old", "diagnosis type", "sex_age")) %>%
  separate(sex_age, into = c("sex", "age"), sep = 1) %>%
  select(-`new or old`) %>%
  separate(age, into = c("first", "last"), sep = -2) %>%
  unite("age range", first, last, sep = "-") %>%
  mutate(`age range` = str_replace(`age range`, "-65", "65 and older"))
We know that we can import Excel files into R, but we can also export them. Since we now have such a nice, clean version of `who`, let’s do this. It requires the writexl package:
install.packages("writexl")
Running the following will create an Excel version of `who_clean` named “who_clean.xlsx” in your working directory.
library(writexl)
write_xlsx(who_clean, "who_clean.xlsx")
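If you’d rather export a csv, readr (already loaded) provides the analogous function:

# Creates who_clean.csv in your working directory
write_csv(who_clean, "who_clean.csv")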
Now that `who_clean` is ready for analysis, we can start to ask questions about the data in it. The ones we’ll address here are, “How has the rate of tuberculosis diagnoses changed over time, and does the geographic area seem to make a difference?”
As usual, this is a somewhat open-ended question, and it’s the job of the analyst to think of a way to make it more specific. Let’s start by analyzing the rate-over-time question. By “rate,” we mean a measure of the number of cases relative to the population. As we saw in Section 3.6, a percentage might not be the best such measure; we might have to look at the number of cases per a power of 10 other than 100. But before we think about that, we have to acknowledge that `who_clean` doesn’t contain enough information to calculate rates; we need to know the countries’ populations in the years provided. This, finally, is where `population` comes in:
population
## # A tibble: 4,060 x 3
## country year population
## <chr> <int> <int>
## 1 Afghanistan 1995 17586073
## 2 Afghanistan 1996 18415307
## 3 Afghanistan 1997 19021226
## 4 Afghanistan 1998 19496836
## 5 Afghanistan 1999 19987071
## 6 Afghanistan 2000 20595360
## 7 Afghanistan 2001 21347782
## 8 Afghanistan 2002 22202806
## 9 Afghanistan 2003 23116142
## 10 Afghanistan 2004 24018682
## # ... with 4,050 more rows
## # i Use `print(n = ...)` to see more rows
We should add a column to `who_clean` that states the population of a given country in a given year by performing a join of `who_clean` with `population`. First, since our question doesn’t involve `diagnosis type`, `sex`, or `age range`, we should group the data by `country` and `year` and count the total number of cases for each group:
who_grouped <- who_clean %>%
  group_by(country, year) %>%
  summarize(cases = sum(cases))
who_grouped
## # A tibble: 3,484 x 3
## # Groups: country [219]
## country year cases
## <chr> <int> <int>
## 1 Afghanistan 1997 128
## 2 Afghanistan 1998 1778
## 3 Afghanistan 1999 745
## 4 Afghanistan 2000 2666
## 5 Afghanistan 2001 4639
## 6 Afghanistan 2002 6509
## 7 Afghanistan 2003 6528
## 8 Afghanistan 2004 8245
## 9 Afghanistan 2005 9949
## 10 Afghanistan 2006 12469
## # ... with 3,474 more rows
## # i Use `print(n = ...)` to see more rows
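Notice the `# Groups: country [219]` line in the output: `summarize` removes only the innermost level of grouping, so the result is still grouped by `country`. That’s harmless here, but if the leftover grouping ever gets in the way, newer versions of dplyr (1.0 and later) let you drop it as you summarize; a sketch:

# Same summary returned as an ungrouped tibble; we keep the grouped
# version above so the outputs below match what is shown
who_clean %>%
  group_by(country, year) %>%
  summarize(cases = sum(cases), .groups = "drop")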
Now we can join the `population` table to `who_grouped` by the key consisting of `country` and `year`. The values of `country` and `year` for the two tables are not identical; for example, `population` contains data for Afghanistan in 1995, but `who_grouped` does not. It’s also possible that `who_grouped` might contain key values that are missing from `population`. Since we won’t be able to use data with missing values, we should choose our join so that it only merges the tables on key values that are present in both tables. This is what `inner_join` does:
who_rates <- who_grouped %>%
  inner_join(population, by = c("country", "year"))
who_rates
## # A tibble: 3,432 x 4
## # Groups: country [217]
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1997 128 19021226
## 2 Afghanistan 1998 1778 19496836
## 3 Afghanistan 1999 745 19987071
## 4 Afghanistan 2000 2666 20595360
## 5 Afghanistan 2001 4639 21347782
## 6 Afghanistan 2002 6509 22202806
## 7 Afghanistan 2003 6528 23116142
## 8 Afghanistan 2004 8245 24018682
## 9 Afghanistan 2005 9949 24860855
## 10 Afghanistan 2006 12469 25631282
## # ... with 3,422 more rows
## # i Use `print(n = ...)` to see more rows
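If you’re curious exactly which rows the inner join discarded, dplyr’s `anti_join` returns the rows of the first table that have no match in the second:

# Rows of who_grouped with no matching country/year in population
who_grouped %>%
  anti_join(population, by = c("country", "year"))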
Now we can calculate rates. We can try a percentage first to see how it looks.
who_rates_percent <- who_rates %>%
  mutate(percentage = cases/population * 100)
who_rates_percent
## # A tibble: 3,432 x 5
## # Groups: country [217]
## country year cases population percentage
## <chr> <int> <int> <int> <dbl>
## 1 Afghanistan 1997 128 19021226 0.000673
## 2 Afghanistan 1998 1778 19496836 0.00912
## 3 Afghanistan 1999 745 19987071 0.00373
## 4 Afghanistan 2000 2666 20595360 0.0129
## 5 Afghanistan 2001 4639 21347782 0.0217
## 6 Afghanistan 2002 6509 22202806 0.0293
## 7 Afghanistan 2003 6528 23116142 0.0282
## 8 Afghanistan 2004 8245 24018682 0.0343
## 9 Afghanistan 2005 9949 24860855 0.0400
## 10 Afghanistan 2006 12469 25631282 0.0486
## # ... with 3,422 more rows
## # i Use `print(n = ...)` to see more rows
The percentage numbers look pretty small. Let’s see what their range is:
summary(who_rates_percent$percentage)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000000 0.007535 0.021223 0.041985 0.052621 0.813478
If we were to move the decimal point three places to the right, these numbers would be easier to digest. Thus, when we calculate the rate, we should multiply the quotient by 100,000 instead of 100 (since 100,000 has three more zeros than 100). For example, the median of 0.021223% becomes a much more readable 21.2 cases per 100,000. This means we should measure the rate in cases per 100,000 people (as in Section 3.6).
who_rates_100K <- who_rates %>%
  mutate(rate_per_100K = cases/population * 100000)
who_rates_100K
## # A tibble: 3,432 x 5
## # Groups: country [217]
## country year cases population rate_per_100K
## <chr> <int> <int> <int> <dbl>
## 1 Afghanistan 1997 128 19021226 0.673
## 2 Afghanistan 1998 1778 19496836 9.12
## 3 Afghanistan 1999 745 19987071 3.73
## 4 Afghanistan 2000 2666 20595360 12.9
## 5 Afghanistan 2001 4639 21347782 21.7
## 6 Afghanistan 2002 6509 22202806 29.3
## 7 Afghanistan 2003 6528 23116142 28.2
## 8 Afghanistan 2004 8245 24018682 34.3
## 9 Afghanistan 2005 9949 24860855 40.0
## 10 Afghanistan 2006 12469 25631282 48.6
## # ... with 3,422 more rows
## # i Use `print(n = ...)` to see more rows
As for the question about geographic area, there could be many ways to address this (and we’ll do so in a robust way in the exercises below). For our purposes, we’ll choose a set of countries from different locations and track their case rates over time. (Choosing these representative countries would have to be done carefully by someone with expertise, but our choice will still illustrate the process.) Let’s compare the case rates over time in France, Botswana, India, United States, Brazil, Australia, and Iraq. We’ll start by filtering out all other countries. (The `%in%` operator used below is new to us. It’s used to avoid a long string of “or” statements such as `country == "France" | country == "Botswana" | ...`.)
who_rate_compare <- who_rates_100K %>%
  filter(country %in% c("France", "Botswana", "India", "United States of America", "Brazil", "Australia", "Iraq"))
who_rate_compare
## # A tibble: 121 x 5
## # Groups: country [7]
## country year cases population rate_per_100K
## <chr> <int> <int> <int> <dbl>
## 1 Australia 1997 170 18563442 0.916
## 2 Australia 1998 204 18794552 1.09
## 3 Australia 1999 323 19027438 1.70
## 4 Australia 2000 251 19259377 1.30
## 5 Australia 2001 228 19487257 1.17
## 6 Australia 2002 212 19714625 1.08
## 7 Australia 2003 113 19953121 0.566
## 8 Australia 2004 166 20218481 0.821
## 9 Australia 2005 222 20520736 1.08
## 10 Australia 2006 1164 20865583 5.58
## # ... with 111 more rows
## # i Use `print(n = ...)` to see more rows
Since we’re tracking a quantity over time, a line plot would be a good way to visualize this. We can map `country` to the color aesthetic so we can compare:
ggplot(data = who_rate_compare, mapping = aes(x = year,
                                              y = rate_per_100K,
                                              color = country)) +
  geom_line() +
  geom_point() +
  labs(title = "Tuberculosis Rates from Around the World",
       y = "cases per 100,000")
Based on this visualization, we could make a conjecture that tuberculosis is largely under control in so-called “first world countries” but less so in developing countries. Of course, a much deeper analysis would have to be done to verify this conjecture.
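If the colored lines overlap too much to compare, one alternative sketch is to give each country its own panel with `facet_wrap` instead of mapping country to color:

# One panel per country instead of one color per country
ggplot(data = who_rate_compare, mapping = aes(x = year, y = rate_per_100K)) +
  geom_line() +
  facet_wrap(~ country) +
  labs(title = "Tuberculosis Rates from Around the World",
       y = "cases per 100,000")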
In practice, the process of importing a data set, cleaning it, transforming it, merging it with other data sets, and visualizing it is very long and not at all linear. It’s often the case that we have to back-track and re-clean or re-transform the data set after our analysis reveals something we missed. However, our `who` project gives a good cross-section of the data analysis techniques we’ve covered in the course so far.
3.8.1 Exercises
1. Access the web page found here. You’ll see a data table with information about countries.
2. Copy and paste the entire data table (column headings too) into an Excel spreadsheet and import it into R.
3. Use the data table you just imported to add the “Region” column to `who_rates_100K`. (Since the country names in `who` and your imported table don’t match perfectly, you’ll have to re-do `who_rates_100K`, this time keeping the `iso3` column. When you perform your join, you can then use `iso3` as the key since it’s present in both data sets.)
4. Since there are too many values of `Region` to duplicate the line graph we saw above, visualize the relationship between `Region` and `rate_per_100K` using a box plot. (If your box plot looks too crowded, try the `coord_flip()` option.) Be sure to use the `reorder` function within your box plot to arrange the boxes by their mean values.
5. Can you draw any conclusions about tuberculosis prevalence based on geographical region?
3.9 Dashboards
A dashboard is a compilation of data tables and visualizations meant to provide a quick snapshot of relevant performance indicators and metrics. Dashboards are often interactive and meant to be a purely visual overview of the data describing a process or phenomenon. Click here for an example of a dashboard displaying COVID-related data for Kent County, Michigan.
Dashboards usually don’t contain much (if any) data storytelling like a data analysis report would. Text is kept to a minimum, and it’s left to the viewer to extract their own insights. On the other hand, all displayed data tables are absolutely pristine, and all visualizations are aesthetically pleasing and fully labeled.
The COVID dashboard linked to above was created with Microsoft PowerBI, which is very powerful visualization software often used in industry, but we can also create dashboards in R Markdown. In doing so, we can learn a few best practices of dashboard building that would transfer to other software platforms. The project for this chapter will require you to create a dashboard, so we’ll use this section to go over the basics in R.
First, in a scratch work .R file, install the flexdashboard package, which provides the functionality to create nice-looking dashboards in R. You will also need the package htmltools.
install.packages("flexdashboard")
install.packages("htmltools")
Let’s create a sample dashboard that displays information about the `mpg` data set. Click here to access a .Rmd file that you can download and put into your working R directory. When you’ve done so, open it and knit it so you can see the output. We’ll use the rest of this section to walk through it. This file can serve as a template for building your own dashboards.
Lines 1-6
This is the standard heading setup for any R Markdown file. Notice that the output is `flexdashboard::flex_dashboard`. `flex_dashboard` is the output profile we want to use, but it’s part of the flexdashboard library you just installed. Since there’s nowhere to load this library before the heading setup, we access it “on the fly” by including the `flexdashboard::` prefix before the output profile we’re using.
Lines 8-11
This is a good place to load any libraries you’ll need to display information in your dashboard. You’ll probably at least need tidyverse and DT (for data tables), but you might need others as well. The `include=FALSE` option ensures that these lines of code don’t show up in your dashboard.
Lines 13-18
On these lines, some data transformation is done to `mpg` to create the data set `mpg_manufacturer` that will be analyzed in the dashboard. (Again, notice the `include=FALSE`.) This is a good spot to do any cleaning and transformation on your data sets to get them ready to display and visualize. Code should never be included in a dashboard, though, so remember `include=FALSE`.
Line 20
Dashboards are separated into several tabbed pages, and each page is given a name that describes what the page contains. To set up and name a page, use a single `#` followed by a space and the desired page name.
Line 22
Each page is further subdivided into vertical columns. You can set up and name a column with a double `##` followed by the column name. The column name is only for your benefit and is not visible in the dashboard.
Line 24
Each column can be even further subdivided into horizontal rows. To set up a row, use a triple `###` followed by the row name. The row name should be a description of the information it contains.
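Putting the three levels together, the skeleton of a single dashboard page looks like this (the names here are placeholders, not the ones in the sample file):

# Page Name

## Column 1

### Description of what this panel shows

```{r}
# code that produces this panel's table or plot
```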
Lines 26-28
After the page, column, and row are established, you’re ready to enter the code that will produce whatever you want to display there. In this sample dashboard, we’re displaying the `mpg_manufacturer` data table. Notice that the row name we choose describes what the data table is showing us.
Lines 30-38
Still on page 1, we’re starting a second column (using `##`) and, within that second column, a new row (using `###`) named with a description of what’s to follow. We then enter the code chunk to produce a box plot.
Lines 40-48
We’re now starting a whole new page (which will show up as another tab in the dashboard). We establish a first column and a row within that column. Then we have a code chunk that will display the full `mpg` data table.
Lines 50-55
Columns can have more than one row, and here we’re setting up a second row within the first column of page 2. The code chunk produces a scatter plot.
Lines 57-81
We establish a second column on page 2, which contains three rows. Each row contains a code chunk that produces the desired outputs. Notice that in lines 69-71, we have code that transforms `mpg` to create `class_mileage`. Since this code only establishes the `class_mileage` data set and produces no actual output, we can include it here and not worry about it showing up in the dashboard.
Line 84
We start a third page, which we could then populate with columns and rows as needed.
We can structure our dashboards with as many pages, rows, and columns as we want, but the guiding principles are that it should contain immediately accessible information, should be uncluttered and not overwhelming, and should be free of R code and blocks of text. The sample dashboard we walked through here actually has some issues that are worth pointing out:
- Throughout, the original variable names from `mpg` such as `hwy`, `cty`, `displ`, etc., and some newly created variables such as `avg_hwy` show up in the dashboard. Every instance of these abbreviations should be replaced by more meaningful names.
- If we’re using row names to describe what the visualizations contain, we don’t have to also label the visualizations with titles, but we should still label the x- and y-axes (and the legend if it exists) using descriptive names.
- The box plots should be ordered by mean value. (Recall the `reorder` function.) All visualizations in a dashboard should be as aesthetically pleasing as you can make them.
- Most importantly, page 2 is very crowded. The plots are too small, and the tables are too compressed. The information on this page should be spread out over more pages with fewer rows.
3.10 Project
Project Description: The purpose of this project is to import and clean a real-world data set and to display the results in a dashboard.
Instructions:
Open the spreadsheet linked to here. This contains data from the 2019 STAR math placement exams. Import this data into R.
Give the data set a very thorough cleaning, considering all of the techniques discussed in this chapter.
This web site gives ACT/SAT concordance tables, which are used to convert scores of one test to those of the other. Find the conversion from ACT Math scores to SAT Math scores, and copy and paste the information into a spreadsheet. Then import this spreadsheet into R.
Use the imported concordance spreadsheet from the previous problem to add a column to the STAR data table that contains a converted SAT Math score for each student who took the ACT. Then create a new column called `SAT-ACT_Max` that lists the higher of the SAT Math or converted ACT Math score.
The rest of the project will involve analyzing and summarizing your cleaned data in a dashboard. Specifically, your dashboard should contain:
- The cleaned data table.
- A visualization which shows the distribution of the placement exam scores. (Recall the geom which does this for continuous data.)
- A visualization which shows the relationship between GPA and placement exam score.
- A visualization which shows the relationship between `SAT-ACT_Max` and placement exam score.
- A visualization which shows the relationship between STAR date and placement exam score. (Think carefully about which geom is most appropriate. What types of variables are `STAR date` and `AQ Score`?)
- A data table that ranks the high schools by average AQ math placement score. (You might want to consider filtering out the high schools that didn’t have many students at STARs.)
- A data table that ranks the career goals by average AQ math placement score. (You might want to consider filtering out the career goals that didn’t have many students at STARs.)
Guidelines:
- Make sure that no code or warning/error messages show up in your dashboard.
- Give descriptive names to all of the pages and rows in the dashboard.
- Label all visualizations.
- Use variable names in your displayed data tables and visualizations that are meaningful and are not just abbreviations.
- Make sure that none of your dashboard pages are too cluttered. Everything should be visually pleasing.
- Use the `datatable` function from DT to display your tables.
Grading Rubric:
- Cleanness: Did you properly clean the data? Did you pay attention to data types and convert them if necessary? Did you rename unnecessarily long columns? Did you handle missing values appropriately? Did you search for and fix entry errors? Is your data tidy? Did you merge relational data correctly? (30 points)
- Dashboard: Does your dashboard have all of the required components? (30 points)
- Professionalism: Does your dashboard look nice? Is the information neatly laid out in rows and columns? Would a potential viewer be able to make sense of it easily? (15 points)