Chapter 2 Data Transformation
In this chapter, we will use a set of functions from the dplyr package (which is part of tidyverse) to transform a data set in various ways. In particular, we will see how to filter out observations, sort the rows of a data set, remove unnecessary variables, create new variables from old ones, and group observations together. We’ll also introduce a tidyverse construct called “the pipe” which provides a streamlined way to perform several transformations in succession.
We’ll need to install two new packages1:
install.packages("nycflights13")
install.packages("Lahman")
The first package contains a data set with records from all of the flights that originated from a major New York City airport in 2013. The second contains Major League Baseball data sets dating back to 1871. Be sure to load these libraries, along with tidyverse.
library(tidyverse)
library(nycflights13)
library(Lahman)
2.1 Filtering a Data Set
Often, we will want to filter a data set so that we only see the observations that satisfy a certain condition. For example, suppose we want to see only the cars in the mpg
data set that have at least 6-cylinder engines. The code for this is:
filter(mpg, cyl >= 6)
## # A tibble: 149 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
## 2 audi a4 2.8 1999 6 manual(m5) f 18 26 p compact
## 3 audi a4 3.1 2008 6 auto(av) f 18 27 p compact
## 4 audi a4 quattro 2.8 1999 6 auto(l5) 4 15 25 p compact
## 5 audi a4 quattro 2.8 1999 6 manual(m5) 4 17 25 p compact
## 6 audi a4 quattro 3.1 2008 6 auto(s6) 4 17 25 p compact
## 7 audi a4 quattro 3.1 2008 6 manual(m6) 4 15 25 p compact
## 8 audi a6 quattro 2.8 1999 6 auto(l5) 4 15 24 p midsize
## 9 audi a6 quattro 3.1 2008 6 auto(s6) 4 17 25 p midsize
## 10 audi a6 quattro 4.2 2008 8 auto(s6) 4 16 23 p midsize
## # i 139 more rows
You can see the full filtered data set by assigning it a name and then viewing it:
<- filter(mpg, cyl >= 6)
cyl6 View(cyl6)
In general, the syntax for the filter
function is:
filter(<DATA SET>, <CONDITIONS>)
The condition arguments in filter
are logical statements, meaning they are statements whose values are either TRUE
or FALSE
. Notice that the filter
function keeps the observations for which the conditions have a value of TRUE
and discards those with a value of FALSE
. (In other words, we specify the observations we want to keep, not the ones we want to get rid of.)
Logical statements are often expressed as equations or inequalities. For example, suppose we make the assignments x <- 7
and y <- 9
. The following table lists various ways to construct logical statements in terms of these two variables.
relation | statement | value |
---|---|---|
less than | x < y |
TRUE |
greater than | x > y |
FALSE |
less than or equal to | x+3 <= y |
FALSE |
greater than or equal to | x >= y-2 |
TRUE |
equal to | 3*x+5 == 4*y-10 |
TRUE |
Notice that the double equal sign ==
is used to check equality. You can also negate statements or form compound statements. For example:
relation | statement | value |
---|---|---|
not equal to | x != y |
TRUE |
and | (x < y) & (x+2 == y) |
TRUE |
(x < y) & (x != x) |
FALSE |
|
or | (x == 8) | (y == 9) |
TRUE |
(x >= y) | (y < 0) |
FALSE |
Notice that an “and” statement is only true when both of the conditions are true. An “or” statement is true as long as at least one of the conditions is true.
The values TRUE
and FALSE
can actually be handled arithmetically in R. The value TRUE
is given the numerical value 1, and FALSE
is given 0. For example:
TRUE + TRUE
## [1] 2
FALSE * 28
## [1] 0
This is useful when you want to count how many values in a column satisfy a certain condition. For example, suppose we want to know how many cars in mpg
have 4-wheel drive. In the following code chunk, mpg$drv
refers to the drv
column within the mpg
data set.
sum(mpg$drv == "4")
## [1] 103
The condition mpg$drv == "4"
assigns a value of TRUE
or FALSE
to all of the values in the drv
column of mpg
depending on whether the drv
value is “4” or not. (Since drv
is a non-numeric variable, we need quotes around its values, even though 4 itself is actually a number.) The sum
function then adds these TRUE
/FALSE
values, treating TRUE
as 1 and FALSE
as 0. Therefore, we just found that there are 103 cars with 4-wheel drive in mpg
.
We can also use the mean
function to calculate the percentage of values in a column that satisfy a condition. For example, the percentage of 4-wheel drive cars in mpg
is calculated by:
mean(mpg$drv == "4")
## [1] 0.4401709
The reason this works is that mean
finds the average of all of the 1’s and 0’s assigned to the values of the drv
column by the statement mpg$drv == "4"
. In doing so, it finds the sum of these 1’s and 0’s (which results in the number of 4-wheel drive cars) and then divides by the total number of 1’s and 0’s. Thus, we’re dividing the number of 4-wheel drive cars by the total number of cars, resulting in the percentage of 4-wheel drives.
It’s often the case that a data set contains missing values for various reasons. In R, these missing values are indicated with the letters NA
, which stand for “not available.” For example, examine the built-in msleep
data set, which contains data relating to the sleep habits of various mammals.
msleep
Missing data is a problem for data analysts, and we’ll eventually talk about different methods for dealing with it. One method, though, is to filter out the observations that are missing values. Suppose, for example, that we want to filter out the observations in msleep
whose conservation status is missing. Here’s the syntax:
filter(msleep, !is.na(conservation))
## # A tibble: 54 x 11
## name genus vore order conservation sleep_total sleep_rem sleep_cycle awake brainwt bodywt
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Cheetah Acin~ carni Carn~ lc 12.1 NA NA 11.9 NA 50
## 2 Mountain beaver Aplo~ herbi Rode~ nt 14.4 2.4 NA 9.6 NA 1.35
## 3 Greater short-tailed~ Blar~ omni Sori~ lc 14.9 2.3 0.133 9.1 0.00029 0.019
## 4 Cow Bos herbi Arti~ domesticated 4 0.7 0.667 20 0.423 600
## 5 Northern fur seal Call~ carni Carn~ vu 8.7 1.4 0.383 15.3 NA 20.5
## 6 Dog Canis carni Carn~ domesticated 10.1 2.9 0.333 13.9 0.07 14
## 7 Roe deer Capr~ herbi Arti~ lc 3 NA NA 21 0.0982 14.8
## 8 Goat Capri herbi Arti~ lc 5.3 0.6 NA 18.7 0.115 33.5
## 9 Guinea pig Cavis herbi Rode~ domesticated 9.4 0.8 0.217 14.6 0.0055 0.728
## 10 Grivet Cerc~ omni Prim~ lc 10 0.7 NA 14 NA 4.75
## # i 44 more rows
Let’s examine the condition !is.na(conservation)
more closely. First, notice that the way to check whether a value is missing is not to enter conservation == NA
. The reason is that NA
is an indication that a value is not available; it’s not a value itself. The code conservation == NA
would thus be the meaningless statement that says the value of conservation
is valueless. For this reason, the code is.na(conservation)
has to be used instead. Recalling that the !
preceding a condition negates the condition, we then see that !is.na(conservation)
is the condition that says the value of conservation
is not NA
. The above filter
call thus returns the observations which do not have an NA
in the conservation column.
Exercises
To complete the exercises below, download this R Notebook file and open it in RStudio using the Visual Editor. Then enter your answers in the space provided.
These exercises require the nycflights13 package you installed earlier. Be sure the library is loaded.
The nycflights13 package contains a data set called
flights
. Load this data set and read its documentation:?flights
. How many observations does it have? How many variables?Use
filter
to find all of the flights that- departed in February.
- were operated by United or American Airlines.
- departed in summer (June, July, and August).
- arrived more than two hours late, but did not leave late.
- were delayed by more than an hour, but made up over 30 minutes during the flight.
- departed between midnight and 6am.
How many flights were canceled? (Think about how a canceled flight might be detected from the data set.)
An airline’s on-time arrival rate is the percentage of its flights that arrived on time or early. What was the on-time arrival rate for Delta Airlines during 2013? What was it during the winter months (January, February, and December)? A flight that did not arrive, due to a cancellation, crash, emergency landing, etc, should not figure into the on-time arrival rate.
2.2 Sorting a Data Set
The function used to sort a data set is arrange
. For example, suppose we want to sort the observations in mpg
according to the highway gas mileage:
arrange(mpg, hwy)
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 dodge dakota pickup 4wd 4.7 2008 8 auto(l5) 4 9 12 e pickup
## 2 dodge durango 4wd 4.7 2008 8 auto(l5) 4 9 12 e suv
## 3 dodge ram 1500 pickup 4wd 4.7 2008 8 auto(l5) 4 9 12 e pickup
## 4 dodge ram 1500 pickup 4wd 4.7 2008 8 manual(m6) 4 9 12 e pickup
## 5 jeep grand cherokee 4wd 4.7 2008 8 auto(l5) 4 9 12 e suv
## 6 chevrolet k1500 tahoe 4wd 5.3 2008 8 auto(l4) 4 11 14 e suv
## 7 jeep grand cherokee 4wd 6.1 2008 8 auto(l5) 4 11 14 p suv
## 8 chevrolet c1500 suburban 2wd 5.3 2008 8 auto(l4) r 11 15 e suv
## 9 chevrolet k1500 tahoe 4wd 5.7 1999 8 auto(l4) 4 11 15 r suv
## 10 dodge dakota pickup 4wd 5.2 1999 8 auto(l4) 4 11 15 r pickup
## # i 224 more rows
You can see that by default, arrange
sorts the observations in ascending order. To sort in descending order, do the following:
arrange(mpg, desc(hwy))
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 volkswagen jetta 1.9 1999 4 manual(m5) f 33 44 d compact
## 2 volkswagen new beetle 1.9 1999 4 manual(m5) f 35 44 d subcompact
## 3 volkswagen new beetle 1.9 1999 4 auto(l4) f 29 41 d subcompact
## 4 toyota corolla 1.8 2008 4 manual(m5) f 28 37 r compact
## 5 honda civic 1.8 2008 4 auto(l5) f 25 36 r subcompact
## 6 honda civic 1.8 2008 4 auto(l5) f 24 36 c subcompact
## 7 toyota corolla 1.8 1999 4 manual(m5) f 26 35 r compact
## 8 toyota corolla 1.8 2008 4 auto(l4) f 26 35 r compact
## 9 honda civic 1.8 2008 4 manual(m5) f 26 34 r subcompact
## 10 honda civic 1.6 1999 4 manual(m5) f 28 33 r subcompact
## # i 224 more rows
You can also include a “tie-breaker” variable in arrange
. Notice that the Volkswagen Jetta and the Volkswagen New Beetle manual tie for the best highway mileage. Let’s use city mileage (in descending order) as the tie-breaker. You just have to add the tie-breaker variable as a third argument inside arrange
:
arrange(mpg, desc(hwy), desc(cty))
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 volkswagen new beetle 1.9 1999 4 manual(m5) f 35 44 d subcompact
## 2 volkswagen jetta 1.9 1999 4 manual(m5) f 33 44 d compact
## 3 volkswagen new beetle 1.9 1999 4 auto(l4) f 29 41 d subcompact
## 4 toyota corolla 1.8 2008 4 manual(m5) f 28 37 r compact
## 5 honda civic 1.8 2008 4 auto(l5) f 25 36 r subcompact
## 6 honda civic 1.8 2008 4 auto(l5) f 24 36 c subcompact
## 7 toyota corolla 1.8 1999 4 manual(m5) f 26 35 r compact
## 8 toyota corolla 1.8 2008 4 auto(l4) f 26 35 r compact
## 9 honda civic 1.8 2008 4 manual(m5) f 26 34 r subcompact
## 10 honda civic 1.6 1999 4 manual(m5) f 28 33 r subcompact
## # i 224 more rows
We can also sort by a function of a variable or variables. For example, suppose we want to sort the mpg
data according to the difference between the highway and city gas mileages (in descending order). We just have to use hwy - cty
as the sort variable:
arrange(mpg, desc(hwy - cty))
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 honda civic 1.8 2008 4 auto(l5) f 24 36 c subcompact
## 2 volkswagen new beetle 1.9 1999 4 auto(l4) f 29 41 d subcompact
## 3 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
## 4 audi a4 2 2008 4 manual(m6) f 20 31 p compact
## 5 chevrolet malibu 3.5 2008 6 auto(l4) f 18 29 r midsize
## 6 honda civic 1.8 2008 4 auto(l5) f 25 36 r subcompact
## 7 volkswagen jetta 1.9 1999 4 manual(m5) f 33 44 d compact
## 8 volkswagen passat 1.8 1999 4 auto(l5) f 18 29 p midsize
## 9 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
## 10 audi a4 quattro 2.8 1999 6 auto(l5) 4 15 25 p compact
## # i 224 more rows
The 2008 Honda Civic and 1999 Volkswagen New Beetle thus have the largest difference between city and highway gas mileage.
Notice in the previous sort that there is no column that actually lists the highway/city mileage differences. A better practice would be to create a new variable that records these values and then to sort by that new variable. We’ll see how to create new variables in Section 2.5.
Exercises
To complete the exercises below, download this R Notebook file and open it in RStudio using the Visual Editor. Then enter your answers in the space provided.
These exercises require the flights
data set from the nycflights13 library.
Where does
arrange
sort theNA
values in a column? (Experiment with a data set that has missing values.) How could you forcearrange
to sort all of theNA
values to the top of the list? (Try usingis.na
.)What was the longest delay of any flight?
What was the earliest departure time of any flight?
What flight averaged the fastest speed while in the air? (Average speed is the total distance traveled divided by the total time spent in the air.) What flight averaged the slowest speed?
What flight traveled the farthest distance? Which one traveled the shortest distance?
2.3 The Pipe
We will often need to use two or more transformation functions in succession. For example, suppose we want to filter out the cars in mpg
that get less than 20 miles per gallon on the highway and then sort the remaining cars according to engine size (displ
) in descending order. There are a couple of naive ways to do this. One is:
arrange(filter(mpg, hwy >= 20), desc(displ))
## # A tibble: 156 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 chevrolet corvette 7 2008 8 manual(m6) r 15 24 p 2seater
## 2 chevrolet corvette 6.2 2008 8 manual(m6) r 16 26 p 2seater
## 3 chevrolet corvette 6.2 2008 8 auto(s6) r 15 25 p 2seater
## 4 chevrolet corvette 5.7 1999 8 manual(m6) r 16 26 p 2seater
## 5 chevrolet corvette 5.7 1999 8 auto(l4) r 15 23 p 2seater
## 6 ford mustang 5.4 2008 8 manual(m6) r 14 20 p subcompact
## 7 chevrolet c1500 suburban 2wd 5.3 2008 8 auto(l4) r 14 20 r suv
## 8 chevrolet c1500 suburban 2wd 5.3 2008 8 auto(l4) r 14 20 r suv
## 9 pontiac grand prix 5.3 2008 8 auto(s4) f 16 25 p midsize
## 10 ford mustang 4.6 1999 8 auto(l4) r 15 21 r subcompact
## # i 146 more rows
Nesting functions in this way is often not a good practice because it becomes hard to keep track of parentheses, especially when you nest several functions. A slightly better (but still not ideal) method is to perform one transformation at a time, storing the intermediate results along the way:
<- filter(mpg, hwy >= 20)
filtered_mpg arrange(filtered_mpg, desc(displ))
## # A tibble: 156 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 chevrolet corvette 7 2008 8 manual(m6) r 15 24 p 2seater
## 2 chevrolet corvette 6.2 2008 8 manual(m6) r 16 26 p 2seater
## 3 chevrolet corvette 6.2 2008 8 auto(s6) r 15 25 p 2seater
## 4 chevrolet corvette 5.7 1999 8 manual(m6) r 16 26 p 2seater
## 5 chevrolet corvette 5.7 1999 8 auto(l4) r 15 23 p 2seater
## 6 ford mustang 5.4 2008 8 manual(m6) r 14 20 p subcompact
## 7 chevrolet c1500 suburban 2wd 5.3 2008 8 auto(l4) r 14 20 r suv
## 8 chevrolet c1500 suburban 2wd 5.3 2008 8 auto(l4) r 14 20 r suv
## 9 pontiac grand prix 5.3 2008 8 auto(s4) f 16 25 p midsize
## 10 ford mustang 4.6 1999 8 auto(l4) r 15 21 r subcompact
## # i 146 more rows
This approach avoids nesting, but it introduces a redundancy: the name filtered_mpg
is introduced and then immediately used as an argument inside arrange
. It would be much better to combine these steps. Luckily, tidyverse has a way to do this using a construct called the pipe.
The pipe provides a way to bypass the need to name an intermediate data set, only to immediately feed it into the next transformation function in line. Here’s how the above double transformation is accomplished using the pipe:
%>%
mpg filter(hwy >= 20) %>%
arrange(desc(displ))
## # A tibble: 156 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 chevrolet corvette 7 2008 8 manual(m6) r 15 24 p 2seater
## 2 chevrolet corvette 6.2 2008 8 manual(m6) r 16 26 p 2seater
## 3 chevrolet corvette 6.2 2008 8 auto(s6) r 15 25 p 2seater
## 4 chevrolet corvette 5.7 1999 8 manual(m6) r 16 26 p 2seater
## 5 chevrolet corvette 5.7 1999 8 auto(l4) r 15 23 p 2seater
## 6 ford mustang 5.4 2008 8 manual(m6) r 14 20 p subcompact
## 7 chevrolet c1500 suburban 2wd 5.3 2008 8 auto(l4) r 14 20 r suv
## 8 chevrolet c1500 suburban 2wd 5.3 2008 8 auto(l4) r 14 20 r suv
## 9 pontiac grand prix 5.3 2008 8 auto(s4) f 16 25 p midsize
## 10 ford mustang 4.6 1999 8 auto(l4) r 15 21 r subcompact
## # i 146 more rows
We should read this as, “Start with mpg
, feed it into filter
, then feed the output of filter
into arrange
.” We avoid nesting and introducing intermediate data set names. The operator %>%
takes the data set that precedes it and feeds it into the function that follows it.
The pipe leads to much clearer and readable code, and we’ll use it constantly.
Exercises
To complete the exercises below, download this R Notebook file and open it in RStudio using the Visual Editor. Then enter your answers in the space provided.
Use the pipe to answer the following questions, which require the nycflights13 package.
What is the farthest airport flown to by American Airlines?
What is the farthest airport flown to out of LaGuardia (LGA) by American Airlines?
What was the most delayed flight during the winter (January, February, and December)? What was the most delayed flight during the summer (June, July, and August)?
2.4 Selecting Columns
A very common procedure when analyzing a data set is to get rid of columns that are irrelevant to your analysis. This is accomplished using the select
function. For example, suppose we want to pare down the mpg
data set so that the only variables displayed are manufacturer
, model
, displ
, and hwy
. The way to do this is to use the select
function:
select(mpg, manufacturer, model, displ, hwy)
## # A tibble: 234 x 4
## manufacturer model displ hwy
## <chr> <chr> <dbl> <int>
## 1 audi a4 1.8 29
## 2 audi a4 1.8 29
## 3 audi a4 2 31
## 4 audi a4 2 30
## 5 audi a4 2.8 26
## 6 audi a4 2.8 26
## 7 audi a4 3.1 27
## 8 audi a4 quattro 1.8 26
## 9 audi a4 quattro 1.8 25
## 10 audi a4 quattro 2 28
## # i 224 more rows
There are a lot of built-in “helper” functions available within select
. You can read the documentation (?select
) to see them, but there are a few particularly useful ones. For one, it’s often easier to specify the columns you don’t want. For example, suppose we want to keep every column of mpg
except trans
and drv
:
select(mpg, -trans, -drv)
## # A tibble: 234 x 9
## manufacturer model displ year cyl cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 18 29 p compact
## 2 audi a4 1.8 1999 4 21 29 p compact
## 3 audi a4 2 2008 4 20 31 p compact
## 4 audi a4 2 2008 4 21 30 p compact
## 5 audi a4 2.8 1999 6 16 26 p compact
## 6 audi a4 2.8 1999 6 18 26 p compact
## 7 audi a4 3.1 2008 6 18 27 p compact
## 8 audi a4 quattro 1.8 1999 4 18 26 p compact
## 9 audi a4 quattro 1.8 1999 4 16 25 p compact
## 10 audi a4 quattro 2 2008 4 20 28 p compact
## # i 224 more rows
It’s also helpful to specify a set of columns to keep that appear consecutively. For example, suppose we want to keep manufacturer
, model
, displ
, year
, and cyl
. Since they appear in consecutive columns, there’s a shortcut we can use to select them:
select(mpg, manufacturer:cyl)
## # A tibble: 234 x 5
## manufacturer model displ year cyl
## <chr> <chr> <dbl> <int> <int>
## 1 audi a4 1.8 1999 4
## 2 audi a4 1.8 1999 4
## 3 audi a4 2 2008 4
## 4 audi a4 2 2008 4
## 5 audi a4 2.8 1999 6
## 6 audi a4 2.8 1999 6
## 7 audi a4 3.1 2008 6
## 8 audi a4 quattro 1.8 1999 4
## 9 audi a4 quattro 1.8 1999 4
## 10 audi a4 quattro 2 2008 4
## # i 224 more rows
Of course, we can use select
together with the other transformation functions via the pipe. Let’s rank the Chevrolets in mpg
in descending order of city gas mileage and show only the car model, engine size, year, number of cylinders, transmission type, and city mileage:
%>%
mpg filter(manufacturer == "chevrolet") %>%
arrange(desc(cty)) %>%
select(model:trans, cty)
## # A tibble: 19 x 6
## model displ year cyl trans cty
## <chr> <dbl> <int> <int> <chr> <int>
## 1 malibu 2.4 2008 4 auto(l4) 22
## 2 malibu 2.4 1999 4 auto(l4) 19
## 3 malibu 3.1 1999 6 auto(l4) 18
## 4 malibu 3.5 2008 6 auto(l4) 18
## 5 malibu 3.6 2008 6 auto(s6) 17
## 6 corvette 5.7 1999 8 manual(m6) 16
## 7 corvette 6.2 2008 8 manual(m6) 16
## 8 corvette 5.7 1999 8 auto(l4) 15
## 9 corvette 6.2 2008 8 auto(s6) 15
## 10 corvette 7 2008 8 manual(m6) 15
## 11 c1500 suburban 2wd 5.3 2008 8 auto(l4) 14
## 12 c1500 suburban 2wd 5.3 2008 8 auto(l4) 14
## 13 k1500 tahoe 4wd 5.3 2008 8 auto(l4) 14
## 14 k1500 tahoe 4wd 6.5 1999 8 auto(l4) 14
## 15 c1500 suburban 2wd 5.7 1999 8 auto(l4) 13
## 16 c1500 suburban 2wd 6 2008 8 auto(l4) 12
## 17 c1500 suburban 2wd 5.3 2008 8 auto(l4) 11
## 18 k1500 tahoe 4wd 5.3 2008 8 auto(l4) 11
## 19 k1500 tahoe 4wd 5.7 1999 8 auto(l4) 11
Exercises
To complete the exercises below, download this R Notebook file and open it in RStudio using the Visual Editor. Then enter your answers in the space provided.
Using the
msleep
data set, create a data set consisting of only the carnivores sorted by body weight in descending order. Display only the name of the animal and its body weight. Do all of this using only one pipe statement.Predict what will happen when you run each of the following. Then test your predictions.
select(mpg, -(cyl:fl))
select(msleep, starts_with("sleep"))
select(msleep, ends_with("wt"))
select(msleep, starts_with("sleep") | ends_with("wt"))
select(msleep, starts_with("sleep") & ends_with("wt"))
2.5 Creating New Variables from Old
This section requires the nycflights13 library.
It’s often necessary to use the variables in a data set to create a new variable. The transformation function used to do this is mutate
.
For example, we might want to add a column to flights
that contains each flight’s average air speed. Recalling that average speed is total distance divided by total time, we can calculate a flight’s air speed by calculating distance / air_time
. Well, almost. The units of distance
are miles, and the units of air_time
are minutes, so to calculate air speed in the more familiar units of miles per hour, we’d actually want to compute distance / air_time * 60
.
To make this calculation for each row of flights
and add it as a new column (let’s call this new column air_speed
), we apply the mutate
function as follows:
mutate(flights, air_speed = distance / air_time * 60)
When mutating a new variable onto a data set, it’s a better practice to assign the mutated data set a new name. You can do so by inserting a name assignment <NEW NAME> <-
at the beginning of the code above:
<- mutate(flights, air_speed = distance / air_time * 60) flights_w_speed
You can now refer to flights_w_speed
for subsequent analysis.
The creation of a new variable inside mutate
is always expressed by an equation of the form:
<NEW VARIABLE NAME> = <FORMULA INVOLVING OLD VARIABLES>
The formula is often constructed using basic arithmetic, as with the air speed calculation above. However, it’s worth mentioning a few other useful operations to use within formulas.
Run the following lines of code, and try to guess what the %/%
and %%
operators do:
27 %/% 4
27 %% 4
You may have guessed that %/%
returns the whole number quotient obtained when dividing one number by another, while %%
returns the remainder. We can see right away how these operations might be applied by doing the following: add a column to flights
that computes the number of minutes since midnight until each flight departs and another column that computes the number of minutes since midnight until each flight arrives.
For example, for the very first flight in the data set, the value of dep_time
is 517, and the value of arr_time
is 830. These numbers are really just abbreviations for 5:17am and 8:30am and are not numbers that could be directly used in calculations. Converting these times to minutes since midnight would provide an actual numerical measurement (not a time) that indicates when a flight departed or arrived.
For the 5:17 departure time, we would convert to minutes since midnight by calculating 5*60 + 17
, which is 317. But how would we tell R to do this? We’d have to have a way to separate the first one or two digits from the last two digits. This is where %/%
and %%
come in. Notice that 517 %/% 100 = 5
and 517 %% 100 = 17
. Thus, the conversion can be accomplished by the formula (517 %/% 100)*60 + (517 %% 100)
.
We can now add our minutes since midnight variables as follows. (Notice that you can create more than one new variable within a single mutate
call.) We’ll also select the relevant columns so they’re visible in the output. We’ll use the pipe to mutate
then select
:
<- flights %>%
flights_w_mins_midnight mutate(dep_mins_midnight = (dep_time %/% 100)*60 + (dep_time %% 100),
arr_mins_midnight = (arr_time %/% 100)*60 + (arr_time %% 100)) %>%
select(dep_time, dep_mins_midnight, arr_time, arr_mins_midnight)
flights_w_mins_midnight
## # A tibble: 336,776 x 4
## dep_time dep_mins_midnight arr_time arr_mins_midnight
## <int> <dbl> <int> <dbl>
## 1 517 317 830 510
## 2 533 333 850 530
## 3 542 342 923 563
## 4 544 344 1004 604
## 5 554 354 812 492
## 6 554 354 740 460
## 7 555 355 913 553
## 8 557 357 709 429
## 9 557 357 838 518
## 10 558 358 753 473
## # i 336,766 more rows
Another very useful operation to use within a mutate
formula is ifelse
. The ifelse
function returns a value that depends on the truth or falsehood of some condition. The basic syntax is
ifelse(<CONDITION>,
<VALUE IF CONDITION IS TRUE>,
<VALUE IF CONDITION IS FALSE)
For example, the ifelse
below says to check whether x < y
. If it is, then return the value of x^2
. If it’s not, then return the string “hello.”
<- 4
x <- 2
y
ifelse(x < y,
^2,
x"hello")
## [1] "hello"
ifelse
statements can be used within mutate
to create new variables in a way that depends on conditions on the old variables. It’s an especially useful way to categorize continuous variables, i.e, to assign categorical values to continuous ones. For example, maybe we want to add a column to flights
that states whether a flight arrived late or not late. (We’ll just select the relevant columns to make our new column visible.)
<- flights %>%
flights_w_arr_status mutate(arr_status = ifelse(arr_delay > 0,
"late",
" not late")) %>%
select(month, day, arr_delay, origin, dest, arr_status)
flights_w_arr_status
## # A tibble: 336,776 x 6
## month day arr_delay origin dest arr_status
## <int> <int> <dbl> <chr> <chr> <chr>
## 1 1 1 11 EWR IAH "late"
## 2 1 1 20 LGA IAH "late"
## 3 1 1 33 JFK MIA "late"
## 4 1 1 -18 JFK BQN " not late"
## 5 1 1 -25 LGA ATL " not late"
## 6 1 1 12 EWR ORD "late"
## 7 1 1 19 EWR FLL "late"
## 8 1 1 -14 LGA IAD " not late"
## 9 1 1 -8 JFK MCO " not late"
## 10 1 1 8 LGA ORD "late"
## # i 336,766 more rows
ifelse
statements work well when creating binary (two-value) categorical variables, such as a flight’s arrival status, because they are set up to produce two outputs. However, to create categorical variables with more than two possible values, ifelse
is not ideal.
There is a much better option for categorizations with more than two categories: case_when
. Suppose we want to label each flight as one of “canceled,” “not late,” or “late,” depending on the value of arr_delay
. Here’s how it works:
<- flights %>%
flights_w_arr_status2 mutate(arr_status = case_when(is.na(arr_delay) ~ "canceled",
<= 0 ~ "not late",
arr_delay >0 ~ "late")) %>%
arr_delay select(month, day, arr_delay, origin, dest, arr_status)
flights_w_arr_status2
## # A tibble: 336,776 x 6
## month day arr_delay origin dest arr_status
## <int> <int> <dbl> <chr> <chr> <chr>
## 1 1 1 11 EWR IAH late
## 2 1 1 20 LGA IAH late
## 3 1 1 33 JFK MIA late
## 4 1 1 -18 JFK BQN not late
## 5 1 1 -25 LGA ATL not late
## 6 1 1 12 EWR ORD late
## 7 1 1 19 EWR FLL late
## 8 1 1 -14 LGA IAD not late
## 9 1 1 -8 JFK MCO not late
## 10 1 1 8 LGA ORD late
## # i 336,766 more rows
In the case_when
statement above, the third condition arr_delay > 0
is the only option left after the first two, so it seems redundant to have to actually check it. If we don’t want to specify the last remaining condition explicitly but instead just let it catch all the remaining cases, we can just enter TRUE
for this condition. The following case_when
is thus equivalent to the one above:
case_when(is.na(arr_delay) ~ "canceled",
<= 0 ~ "not late",
arr_delay TRUE ~ "late")
Exercises
To complete the exercises below, download this R Notebook file and open it in RStudio using the Visual Editor. Then enter your answers in the space provided.
These exercises require the nycflights13 and Lahman libraries.
Find the fastest and slowest flights (in terms of air speed). State the actual average air speeds in miles per hour.
Use the variables
dep_mins_midnight
andarr_mins_midnight
to calculate the number of minutes fromdep_time
toarr_time
for each flight. Compare your answers to theair_time
values. Try to guess why they don’t match. (Notice that for some flights, the value ofair_time
is way more than the number of minutes betweendep_time
andarr_time
; for example, look at the United flight from JFK to LAX on January 1 that left at 5:58am. How would you explain this discrepancy?)Add a column to
flights
that indicates whether a flight was canceled. The new column’s values should be eithercanceled
ornot canceled
. Then find the first date on which a flight was canceled. (Have R do this for you; don’t just scroll through the data set.)Some of the flights in the data set
flights_w_arr_status
we created above will have anNA
in thearr_status
column. Why? (Hint: What happens when you ask R whetherNA > 0
?)Count the number of
NA
s in thearr_status
column offlights_w_arr_status
. (Have R do this; don’t manually count them yourself.)Add a column to
flights
that gives the season for each flight. For purposes of this exercise, let’s say “winter” is January, February, and December; “spring” is March, April, and May; “summer” is June, July, and August; and “fall” is September, October, and November.Categorize the observations in
diamonds
as “expensive,” “very expensive,” “insanely expensive,” or “priceless.” Decide for yourself how to define these labels. Then create a visualization that shows the distribution of these labels.View the
Batting
data set from the Lahman library and read its documentation. How many observations and variables does it have?Batting
does not have a column for a player’s batting average, which measures the percentage of a player’s at bats that result in a hit. The formula isH / AB
. Create a batting average column inBatting
and find the player(s) with the all-time highest single season batting average. What’s strange about your result? (You can just refer to a player by theirplayerID
value if you don’t know their real name.)To fix the problem from the previous exercise, a minimum number of at bats are required for a batting average to be considered “official.” Re-do Exercise 6, but only rank the players who had at least 350 at bats.
The modern era of baseball is generally considered to have begun in 1947 (after World War II ended). Who’s had the best single season batting average of the modern era (assuming at least 350 at bats)? Answer this question entirely by using transformation functions rather than by just scrolling through the data.
When was the last year a player had a batting average of at least 0.400 with a minimum of 350 at bats? (Again, answer this question using only transformation functions.)
2.6 Grouped Summaries
Suppose we want to create a data table that ranks the manufacturers in mpg
according to the average highway gas mileage for each manufacturer. This would entail grouping the cars by the manufacturer
values and then computing the average value of hwy
for each group. This is an example of a grouped summary and is done with a combination of the group_by
and summarize
functions. Here’s how it works:
%>%
mpg group_by(manufacturer) %>%
summarize(average_highway_mileage = mean(hwy)) %>%
arrange(desc(average_highway_mileage))
## # A tibble: 15 x 2
## manufacturer average_highway_mileage
## <chr> <dbl>
## 1 honda 32.6
## 2 volkswagen 29.2
## 3 hyundai 26.9
## 4 audi 26.4
## 5 pontiac 26.4
## 6 subaru 25.6
## 7 toyota 24.9
## 8 nissan 24.6
## 9 chevrolet 21.9
## 10 ford 19.4
## 11 mercury 18
## 12 dodge 17.9
## 13 jeep 17.6
## 14 lincoln 17
## 15 land rover 16.5
A couple of notes:
group_by
andsummarize
are always used together.summarize
works similarly tomutate
in that it creates a new column with a name that the user provides (average_highway_mileage
above).mean
is a built-in function that computes the averagehwy
value in each group.The displayed data table will only show the grouped column and any computed summary columns.
You can group by more than one variable as well. Suppose we want to rank highway gas mileage by manufacturer and model:
%>%
mpg group_by(manufacturer, model) %>%
summarize(average_highway_mileage = mean(hwy)) %>%
arrange(desc(average_highway_mileage))
## # A tibble: 38 x 3
## # Groups: manufacturer [15]
## manufacturer model average_highway_mileage
## <chr> <chr> <dbl>
## 1 toyota corolla 34
## 2 volkswagen new beetle 32.8
## 3 honda civic 32.6
## 4 volkswagen jetta 29.1
## 5 nissan altima 28.7
## 6 audi a4 28.3
## 7 toyota camry 28.3
## 8 toyota camry solara 28.1
## 9 hyundai sonata 27.7
## 10 chevrolet malibu 27.6
## # i 28 more rows
mean
is just one example of a summary statistic that can be used inside summarize
. Here are a few more common ones:
statistic | syntax | use |
---|---|---|
mean | mean(<VARIABLE>) |
computes the average value of <VARIABLE> within each group |
sum | sum(<VARIABLE>) |
adds the values within each group |
count | n() |
counts the number of observations within each group |
distinct count | n_distinct(<VARIABLE>) |
counts the number of different values of <VARIABLE> within each group |
maximum | max(<VARIABLE>) |
returns the largest value of <VARIABLE> within each group |
minimum | min(<VARIABLE>) |
returns the smallest value of <VARIABLE> within each group |
standard deviation | sd(<VARIABLE>) |
computes the standard deviation of <VARIABLE> within each group |
median | median(<VARIABLE>) |
computes the median value of <VARIABLE> within each group |
There are many others as well. It’s certainly possible to compute more than one summary statistic within a single summarize
call. In fact, whenever you’re computing a mean value, it’s a good idea to also compute the count. This is because comparing averages is really only appropriate when the groups have roughly the same number of observations. If we do this for the average highway mileage example above, we get:
%>%
mpg group_by(manufacturer) %>%
summarize(average_highway_mileage = mean(hwy),
count = n()) %>%
arrange(desc(average_highway_mileage))
## # A tibble: 15 x 3
## manufacturer average_highway_mileage count
## <chr> <dbl> <int>
## 1 honda 32.6 9
## 2 volkswagen 29.2 27
## 3 hyundai 26.9 14
## 4 audi 26.4 18
## 5 pontiac 26.4 5
## 6 subaru 25.6 14
## 7 toyota 24.9 34
## 8 nissan 24.6 13
## 9 chevrolet 21.9 19
## 10 ford 19.4 25
## 11 mercury 18 4
## 12 dodge 17.9 37
## 13 jeep 17.6 8
## 14 lincoln 17 3
## 15 land rover 16.5 4
We would therefore want to refrain from declaring that Pontiacs have better fuel efficiencies than Toyotas since Pontiacs are relatively underrepresented in the data set.
Referring to the flights
data set, suppose we want to know which destination airport has the worst arrival delays. We could group flights
by the dest
variable and then compute the average value of arr_delay
for each group. Following the above advice, we’ll also compute the count for each group.
%>%
flights group_by(dest) %>%
summarize(avg_arr_delay = mean(arr_delay),
count = n()) %>%
arrange(desc(avg_arr_delay))
## # A tibble: 105 x 3
## dest avg_arr_delay count
## <chr> <dbl> <int>
## 1 SBN 6.5 10
## 2 EYW 6.35 17
## 3 ABQ 4.38 254
## 4 ANC -2.5 8
## 5 LEX -22 1
## 6 ACK NA 265
## 7 ALB NA 439
## 8 ATL NA 17215
## 9 AUS NA 2439
## 10 AVL NA 275
## # i 95 more rows
What’s the deal with all the NA
s in the avg_arr_delay
column? Remember that when a flight is canceled, there is no arrival and hence no arrival delay. Thus, canceled flights have NA
in the arr_delay
column. So in any dest
group which contains at least one canceled flight, the mean
function will be attempting to calculate an average value when not all of the values are available. This means that the average itself will be reported as NA
.
Luckily, there’s an easy, built-in way to avoid this. We can remove the NA
s from the mean computation by inserting the argument na.rm = TRUE
into mean
:
%>%
flights group_by(dest) %>%
summarize(avg_arr_delay = mean(arr_delay, na.rm = TRUE),
count = n()) %>%
arrange(desc(avg_arr_delay))
## # A tibble: 105 x 3
## dest avg_arr_delay count
## <chr> <dbl> <int>
## 1 CAE 41.8 116
## 2 TUL 33.7 315
## 3 OKC 30.6 346
## 4 JAC 28.1 25
## 5 TYS 24.1 631
## 6 MSN 20.2 572
## 7 RIC 20.1 2454
## 8 CAK 19.7 864
## 9 DSM 19.0 569
## 10 GRR 18.2 765
## # i 95 more rows
It looks like CAE (Columbia Metropolitan Airport) has the highest average arrival delay, although there were only 116 flights there. On the other hand, RIC (Richmond International Airport) maintained a high average arrival delay over almost 2500 flights. RIC might get my vote for the worst airport to fly into from New York.
If you look at the end of the list, you see that, on average, flights to Blue Grass Airport in Lexington, KY, arrived an average of 22 minutes early! This seems amazing until you notice that there was only one flight there.
## # A tibble: 6 x 3
## dest avg_arr_delay count
## <chr> <dbl> <int>
## 1 ANC -2.5 8
## 2 STT -3.84 522
## 3 SNA -7.87 825
## 4 PSP -12.7 19
## 5 LEX -22 1
## 6 LGA NaN 1
Last on the list is LGA, with an average arrival delay value of NaN
. This stands for “not a number.” If we extract out this observation, we might be able to see why:
%>%
flights filter(dest == "LGA") %>%
select(month, day, dest, dep_delay, arr_delay)
## # A tibble: 1 x 5
## month day dest dep_delay arr_delay
## <int> <int> <chr> <dbl> <dbl>
## 1 7 27 LGA NA NA
It looks like there was only one flight to LGA, and it was canceled, giving us an NA
in the arr_delay
column. When we then compute the average arrival delay and use na.rm = TRUE
to remove the NA
, we’re left with no numbers to take the average of! This would lead to a division by 0, which is not arithmetically possible. Hence the NaN
. (Notice that removing the NA
s from the mean
computation does not affect the value of count
, though.)
Exercises
To complete the exercises below, download this R Notebook file and open it in RStudio using the Visual Editor. Then enter your answers in the space provided.
These exercises will require both the nycflights13 and Lahman libraries.
Create a data set from
flights
that shows the average departure delay on each day of the year.Which carrier from
flights
had the best on-time arrival rate in 2013? Which had the worst? (See Exercise 4 from Section 2.1). Be sure to include counts so we know whether the rates are comparable.For each destination in
flights
, can you find any flights that were suspiciously fast, possibly indicating a data entry error? (Hint: Think about whether any of the summary statistics mentioned in this section would help you spot flights that were suspiciously fast.)Which carrier in
flights
flew to the most different destinations?Which specific plane (identified by
tailnum
) had the most variation in the distances it flew? (Be sure to include counts in your analysis.)Which day of the year had the most cancellations? Which day had the least?
Using the
Batting
data set from Lahman, determine the maximum number of home runs hit by a player every year since 1871.Create a visualization of your findings from the previous exercise.
Create a data set that shows a ranked list of the all-time leaders in career batting average. (A player’s career batting average is his total number of hits divided by his total number of at bats.) Note that to be eligible for this list, a player must have at least 3000 total at bats. Be sure to take this into account when you create your list.
2.7 Project
Project Description: The purpose of this project is to explore and explain observations in the flights
data set by transforming and visualizing the data.
Instructions:
Answer the following questions using the appropriate transformation functions and visualizations.
Create a data set that lists the days during 2013 on which the percentage of flights that were either canceled or delayed on departure by more than one hour exceeded 35%.
Do a Google search to see what the weather was like in New York on each of the days from the previous problem in 2013. Do your findings explain the large number of delays/cancellations?
Create a data set that shows, for each scheduled departure hour during the day, the percentage of flights that were canceled or delayed by more than one hour.
Obtain a plot that shows the relationship between the hour and the percentage of flights canceled or delayed by more than an hour.
Explain the relationship in the previous problem. In particular:
- Why does the plot have the given shape?
- Are there outliers? If so, why might they be there?
- When is the riskiest time of day to fly if you want to avoid cancellations and long delays, and why might that be?
- Why does the plot have the given shape?
Above, you found dates on which a lot of cancellations and long delays occurred. However, several flights were able to leave on time or early on those dates. Create a data set that contains the flights that left on time or early on those dates.
It’s a reasonable guess that the flights that were able to leave early despite bad weather were probably the ones that left in the morning before the weather hit. For each date in the data set from the previous problem, find the average scheduled hour of departure. Do your findings mostly confirm this guess?
Did any of the problematic dates have an average departure hour in the afternoon? If so, why might that have been?
Guidelines:
Heading: Include a descriptive title for your report (do not just call it “Project 2”), your name, and the date in the heading.
Introduction:
- Begin with a description of the problem you’re solving or question you’re answering in your report. This should be concise but detailed enough to set the stage for the work you’ll be doing.
- Describe your data set. Say what it contains, how many observations and variables it contains, and where it comes from.
- Create a data dictionary, at least for the variables which will be referenced in your report.
- Use the
datatable
function to display your data set. (If the data set is too big to display when you try to knit your .Rmd file, you don’t have to display it.) - Include a code chunk in which you load the libraries you’ll need, and briefly say what each one is for.
Body:
- Follow the introduction with the body of your report which will contain your work, broken into relevant section headings.
- The body should include text that provides a running narrative that guides the reader through your work. This should be thorough and complete, but you should avoid large blocks of text.
- Do not use the problem-numbering in the Instructions above in the body. That’s just for your benefit as you prepare your work. Your narrative structure – not an enumerated list – should lead the reader from one question to the next.
- The body of the report should show all of your R code and its output, but it should not show any warning or error messages. (This is for my benefit. A professional report might not include the code, depending on the audience.)
- All graphics should look nice, be easy to read, and be fully labeled.
- You should include insightful statements about the meaning you’re extracting from your graphics rather than just superficial descriptions of your visualizations.
Conclusion:
- End with an overall concluding statement which summarizes your findings.
- Provide a clear answer to the question you set out to answer.
- If you were not able to answer the question fully, possibly because of insufficient data, explain why.
Report Preparation: The problems for the project should be worked out in a scratch work .R file which you will not turn in. Once you’re done, write up your results in an R Notebook report, making sure to follow the guidelines above. Then publish your report online using RPubs and send me the URL to submit your work.
Grading Rubric:
Transformations: Do you use the right transformation functions to answer the questions? (30 points)
Insights: Are insights fully explained and well-written? Are the insights derived from your results? (30 points)
Narrative: Is it clear what you’re trying to do in this project? Do you maintain a readable narrative throughout that guides the reader through your analysis? (20 points)
Professionalism: Does your report look nice? Do you provide insights based on your analysis? Is your code clear and readable? Did you follow the guidelines? (15 points)
Depending on the version of R you’re running, you might first also have to install the Rtools package:
install.packages("Rtools")
↩︎