Chapter 2 Data Transformation
In this chapter, we will use a set of functions from the dplyr package (which is part of tidyverse) to transform a data set in various ways. In particular, we will see how to filter out observations, sort the rows of a data set, remove unnecessary variables, create new variables from old ones, and group observations together. We’ll also introduce a tidyverse construct called “the pipe” which provides a streamlined way to perform several transformations in succession.
We’ll need to install two new packages:
install.packages("nycflights13")
install.packages("Lahman")
The first package contains a data set with records from all of the flights that originated from a major New York City airport in 2013. The second contains Major League Baseball data sets dating back to 1871. Be sure to load these libraries, along with tidyverse.
library(tidyverse)
library(nycflights13)
library(Lahman)
2.1 Filtering a Data Set
Often, we will want to filter a data set so that we only see the observations that satisfy a certain condition. For example, suppose we want to see only the cars in the mpg
data set that have at least 6-cylinder engines. The code for this is:
filter(mpg, cyl >= 6)
## # A tibble: 149 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
## 2 audi a4 2.8 1999 6 manual(m5) f 18 26 p compact
## 3 audi a4 3.1 2008 6 auto(av) f 18 27 p compact
## 4 audi a4 quattro 2.8 1999 6 auto(l5) 4 15 25 p compact
## 5 audi a4 quattro 2.8 1999 6 manual(m5) 4 17 25 p compact
## 6 audi a4 quattro 3.1 2008 6 auto(s6) 4 17 25 p compact
## 7 audi a4 quattro 3.1 2008 6 manual(m6) 4 15 25 p compact
## 8 audi a6 quattro 2.8 1999 6 auto(l5) 4 15 24 p midsize
## 9 audi a6 quattro 3.1 2008 6 auto(s6) 4 17 25 p midsize
## 10 audi a6 quattro 4.2 2008 8 auto(s6) 4 16 23 p midsize
## # ... with 139 more rows
## # i Use `print(n = ...)` to see more rows
You can see the full filtered data set by assigning it a name and then viewing it:
<- filter(mpg, cyl >= 6)
cyl6 View(cyl)
In general, the syntax for the filter
function is:
filter(<DATA SET>, <CONDITIONS>)
The condition arguments in filter
are statements whose values are either TRUE
or FALSE
. These statements are often constructed as equalities and inequalities. For example, suppose we make the assignments x <- 7
and y <- 9
. Then we would have:
relation | statement | value |
---|---|---|
less than | x < y |
TRUE |
greater than | x > y |
FALSE |
less than or equal to | x+3 <= y |
FALSE |
greater than or equal to | x >= y-2 |
TRUE |
equal to | 3*x+5 == 4*y-10 |
TRUE |
Notice that the double equal sign ==
is used to check equality, while the single equal sign =
is used to make assignments of values to variables.
You can also negate or form compound statements. For example:
relation | statement | value |
---|---|---|
not equal to | x != y |
TRUE |
and | (x < y) & (x+2 == y) |
TRUE |
(x < y) & (x != x) |
FALSE |
|
or | (x == 8) | (y == 9) |
TRUE |
(x >= y) | (y < 0) |
FALSE |
Notice that an “and” statement is only true when both of the conditions are true. An “or” statement is true as long as at least one of the conditions is true.
The values TRUE
and FALSE
can actually be handled arithmetically in R. The value TRUE
is given the value 1, and FALSE
is given 0. For example:
TRUE + TRUE
## [1] 2
FALSE * 28
## [1] 0
This is useful when you want to count how many values in a column satisfy a certain condition. For example, suppose we want to know how many cars in mpg
have 4-wheel drive. In the following code chunk, mpg$drv
refers to the drv
column within mpg
.
sum(mpg$drv == "4")
## [1] 103
The condition mpg$drv == "4"
assigns a value of TRUE
or FALSE
to all of the values in the drv
column of mpg
depending on whether the drv
value is “4” or not. (Since drv
is a non-numeric variable, we need quotes around its values, even though 4 itself is actually a number.) The sum
function then adds these TRUE
/FALSE
values, treating TRUE
as 1 and FALSE
as 0.
We can also use the mean
function to calculate the percentage of values in a column that satisfy a condition. For example, the percentage of 4-wheel drive cars in mpg
is given by:
mean(mpg$drv == "4")
## [1] 0.4401709
The reason this works is that mean
finds the average of all of the 1’s and 0’s assigned to the values of the drv
column by the statement mpg$drv == "4"
. In doing so, it finds the sum of these 1’s and 0’s (which results in the number of 4-wheel drive cars) and then divides by the total number of 1’s and 0’s. Thus, we’re dividing the number of 4-wheel drive cars by the total number of cars, resulting in the percentage of 4-wheel drives.
It’s often the case that a data set contains missing values for various reasons. In R, these missing values are indicated with the letters NA
, which stand for “not available.” For example, examine the built-in msleep
data set, which contains data relating to the sleep habits of various mammals.
msleep
Missing data is a problem for data analysts, and we’ll eventually talk about different methods for dealing with it. One method, though, is to filter out the observations that are missing values. Suppose, for example, that we want to filter out the observations in msleep
whose conservation status is missing. Here’s the syntax:
filter(msleep, !is.na(conservation))
## # A tibble: 54 x 11
## name genus vore order conse~1 sleep~2 sleep~3 sleep~4 awake brainwt bodywt
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Cheetah Acin~ carni Carn~ lc 12.1 NA NA 11.9 NA 50
## 2 Mountain beaver Aplo~ herbi Rode~ nt 14.4 2.4 NA 9.6 NA 1.35
## 3 Greater short-tai~ Blar~ omni Sori~ lc 14.9 2.3 0.133 9.1 0.00029 0.019
## 4 Cow Bos herbi Arti~ domest~ 4 0.7 0.667 20 0.423 600
## 5 Northern fur seal Call~ carni Carn~ vu 8.7 1.4 0.383 15.3 NA 20.5
## 6 Dog Canis carni Carn~ domest~ 10.1 2.9 0.333 13.9 0.07 14
## 7 Roe deer Capr~ herbi Arti~ lc 3 NA NA 21 0.0982 14.8
## 8 Goat Capri herbi Arti~ lc 5.3 0.6 NA 18.7 0.115 33.5
## 9 Guinea pig Cavis herbi Rode~ domest~ 9.4 0.8 0.217 14.6 0.0055 0.728
## 10 Grivet Cerc~ omni Prim~ lc 10 0.7 NA 14 NA 4.75
## # ... with 44 more rows, and abbreviated variable names 1: conservation, 2: sleep_total,
## # 3: sleep_rem, 4: sleep_cycle
## # i Use `print(n = ...)` to see more rows
Let’s examine the condition !is.na(conservation)
more closely. First, notice that the way to check whether a value is missing is not to enter conservation == NA
. The reason is that NA
is an indication that a value is not available; it’s not a value itself. The code conservation == NA
would thus be the meaningless statement that says the value of conservation
is valueless. For this reason, the code is.na(conservation)
has to be used instead. Recalling that the !
preceding a condition negates the condition, we then see that !is.na(conservation)
is the condition that says the value of conservation
is not NA
. The above filter
call thus returns the observations which do not have an NA
in the conservation column.
2.1.1 Exercises
These exercises require the nycflights13 package you installed earlier. Be sure the library is loaded.
The nycflights13 package contains a data set called
flights
. Load this data set and read its documentation:?flights
. How many observations does it have? How many variables?Find all of the flights that
- departed in February.
- were operated by United or American Airlines.
- departed in summer (June, July, and August).
- arrived more than two hours late, but did not leave late.
- were delayed by more than an hour, but made up over 30 minutes during the flight.
- departed between midnight and 6am.
How many flights were canceled? (Think about how a canceled flight might be detected from the data set.)
What was the on-time arrival rate for Delta Airlines during 2013? What was it during the winter months (January, February, and December)? A flight that did not arrive, due to a cancellation, crash, emergency landing, etc, should not figure into the on-time arrival rate.
2.2 Sorting a Data Set
The function used to sort a data set is arrange
. For example, suppose we want to sort the observations in mpg
according to the highway gas mileage:
arrange(mpg, hwy)
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 dodge dakota pickup 4wd 4.7 2008 8 auto(l5) 4 9 12 e pick~
## 2 dodge durango 4wd 4.7 2008 8 auto(l5) 4 9 12 e suv
## 3 dodge ram 1500 pickup 4wd 4.7 2008 8 auto(l5) 4 9 12 e pick~
## 4 dodge ram 1500 pickup 4wd 4.7 2008 8 manual(m6) 4 9 12 e pick~
## 5 jeep grand cherokee 4wd 4.7 2008 8 auto(l5) 4 9 12 e suv
## 6 chevrolet k1500 tahoe 4wd 5.3 2008 8 auto(l4) 4 11 14 e suv
## 7 jeep grand cherokee 4wd 6.1 2008 8 auto(l5) 4 11 14 p suv
## 8 chevrolet c1500 suburban 2wd 5.3 2008 8 auto(l4) r 11 15 e suv
## 9 chevrolet k1500 tahoe 4wd 5.7 1999 8 auto(l4) 4 11 15 r suv
## 10 dodge dakota pickup 4wd 5.2 1999 8 auto(l4) 4 11 15 r pick~
## # ... with 224 more rows
## # i Use `print(n = ...)` to see more rows
You can see that by default, arrange
sorts the observations in ascending order. To sort in descending order, do the following:
arrange(mpg, desc(hwy))
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 volkswagen jetta 1.9 1999 4 manual(m5) f 33 44 d compact
## 2 volkswagen new beetle 1.9 1999 4 manual(m5) f 35 44 d subcompact
## 3 volkswagen new beetle 1.9 1999 4 auto(l4) f 29 41 d subcompact
## 4 toyota corolla 1.8 2008 4 manual(m5) f 28 37 r compact
## 5 honda civic 1.8 2008 4 auto(l5) f 25 36 r subcompact
## 6 honda civic 1.8 2008 4 auto(l5) f 24 36 c subcompact
## 7 toyota corolla 1.8 1999 4 manual(m5) f 26 35 r compact
## 8 toyota corolla 1.8 2008 4 auto(l4) f 26 35 r compact
## 9 honda civic 1.8 2008 4 manual(m5) f 26 34 r subcompact
## 10 honda civic 1.6 1999 4 manual(m5) f 28 33 r subcompact
## # ... with 224 more rows
## # i Use `print(n = ...)` to see more rows
You can also include a “tie-breaker” variable in arrange
. Notice that the Volkswagen Jetta and the Volkswagen New Beetle manual tie for the best highway mileage. Let’s use city mileage (in descending order) as the tie-breaker. You just have to add the tie-breaker variable as a third argument inside arrange
:
arrange(mpg, desc(hwy), desc(cty))
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 volkswagen new beetle 1.9 1999 4 manual(m5) f 35 44 d subcompact
## 2 volkswagen jetta 1.9 1999 4 manual(m5) f 33 44 d compact
## 3 volkswagen new beetle 1.9 1999 4 auto(l4) f 29 41 d subcompact
## 4 toyota corolla 1.8 2008 4 manual(m5) f 28 37 r compact
## 5 honda civic 1.8 2008 4 auto(l5) f 25 36 r subcompact
## 6 honda civic 1.8 2008 4 auto(l5) f 24 36 c subcompact
## 7 toyota corolla 1.8 1999 4 manual(m5) f 26 35 r compact
## 8 toyota corolla 1.8 2008 4 auto(l4) f 26 35 r compact
## 9 honda civic 1.8 2008 4 manual(m5) f 26 34 r subcompact
## 10 honda civic 1.6 1999 4 manual(m5) f 28 33 r subcompact
## # ... with 224 more rows
## # i Use `print(n = ...)` to see more rows
We can also sort by a function of a variable or variables. For example, suppose we want to sort the mpg
data according to the difference between the highway and city gas mileages (in descending order). We just have to use hwy - cty
as the sort variable:
arrange(mpg, desc(hwy - cty))
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 honda civic 1.8 2008 4 auto(l5) f 24 36 c subcompact
## 2 volkswagen new beetle 1.9 1999 4 auto(l4) f 29 41 d subcompact
## 3 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
## 4 audi a4 2 2008 4 manual(m6) f 20 31 p compact
## 5 chevrolet malibu 3.5 2008 6 auto(l4) f 18 29 r midsize
## 6 honda civic 1.8 2008 4 auto(l5) f 25 36 r subcompact
## 7 volkswagen jetta 1.9 1999 4 manual(m5) f 33 44 d compact
## 8 volkswagen passat 1.8 1999 4 auto(l5) f 18 29 p midsize
## 9 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
## 10 audi a4 quattro 2.8 1999 6 auto(l5) 4 15 25 p compact
## # ... with 224 more rows
## # i Use `print(n = ...)` to see more rows
The 2008 Honda Civic thus has the largest difference between city and highway gas mileage.
Notice in the previous sort that there is no column that actually lists the highway/city mileage differences. A better practice would be to create a new variable that records these values and then to sort by that new variable. We’ll see how to create new variables in Section 2.5.
2.2.1 Exercises
These exercises require the flights
data set from the nycflights13 library.
Where does
arrange
sort theNA
values in a column? (Experiment with a data set that has missing values.) How could you forcearrange
to sort all of theNA
values to the top of the list? (Try usingis.na
.)What was the longest delay of any flight?
What flight left the earliest in the day?
What flight averaged the fastest speed while in the air? (Average speed is the total distance traveled divided by the total time spent in the air.) What flight averaged the slowest speed?
What flight traveled the farthest distance? Which one traveled the shortest distance?
2.3 The Pipe
We often will want to use two or more transformation functions in succession. For example, suppose we want to filter out the cars in mpg
that get less than 20 miles per gallon on the highway and then sort the remaining cars according to engine size (displ
) in descending order. There are a couple of naive ways to do this. One is:
arrange(filter(mpg, hwy >= 20), desc(displ))
## # A tibble: 156 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 chevrolet corvette 7 2008 8 manual(m6) r 15 24 p 2seat~
## 2 chevrolet corvette 6.2 2008 8 manual(m6) r 16 26 p 2seat~
## 3 chevrolet corvette 6.2 2008 8 auto(s6) r 15 25 p 2seat~
## 4 chevrolet corvette 5.7 1999 8 manual(m6) r 16 26 p 2seat~
## 5 chevrolet corvette 5.7 1999 8 auto(l4) r 15 23 p 2seat~
## 6 ford mustang 5.4 2008 8 manual(m6) r 14 20 p subco~
## 7 chevrolet c1500 suburban 2wd 5.3 2008 8 auto(l4) r 14 20 r suv
## 8 chevrolet c1500 suburban 2wd 5.3 2008 8 auto(l4) r 14 20 r suv
## 9 pontiac grand prix 5.3 2008 8 auto(s4) f 16 25 p midsi~
## 10 ford mustang 4.6 1999 8 auto(l4) r 15 21 r subco~
## # ... with 146 more rows
## # i Use `print(n = ...)` to see more rows
Nesting functions in this way is often not a good practice because it becomes hard to keep track of parentheses, especially when you nest several functions. A slightly better (but still not ideal) method is to perform one transformation at a time, storing the intermediate results along the way:
<- filter(mpg, hwy >= 20)
filtered_mpg arrange(filtered_mpg, desc(cyl))
## # A tibble: 156 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a6 quattro 4.2 2008 8 auto(s6) 4 16 23 p midsi~
## 2 chevrolet c1500 suburban 2wd 5.3 2008 8 auto(l4) r 14 20 r suv
## 3 chevrolet c1500 suburban 2wd 5.3 2008 8 auto(l4) r 14 20 r suv
## 4 chevrolet corvette 5.7 1999 8 manual(m6) r 16 26 p 2seat~
## 5 chevrolet corvette 5.7 1999 8 auto(l4) r 15 23 p 2seat~
## 6 chevrolet corvette 6.2 2008 8 manual(m6) r 16 26 p 2seat~
## 7 chevrolet corvette 6.2 2008 8 auto(s6) r 15 25 p 2seat~
## 8 chevrolet corvette 7 2008 8 manual(m6) r 15 24 p 2seat~
## 9 ford mustang 4.6 1999 8 auto(l4) r 15 21 r subco~
## 10 ford mustang 4.6 1999 8 manual(m5) r 15 22 r subco~
## # ... with 146 more rows
## # i Use `print(n = ...)` to see more rows
This approach avoids nesting, but it introduces a redundancy: the name filtered_mpg
is introduced and then immediately used as an argument inside arrange
. It would be much better to combine these steps. Luckily, tidyverse has a way to do this using a construct called the pipe.
The pipe provides a way to bypass the need to name an intermediate data set, only to immediately feed it into the next transformation function in line. Here’s how the above double transformation is accomplished using the pipe:
%>%
mpg filter(hwy >= 20) %>%
arrange(desc(cyl))
## # A tibble: 156 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a6 quattro 4.2 2008 8 auto(s6) 4 16 23 p midsi~
## 2 chevrolet c1500 suburban 2wd 5.3 2008 8 auto(l4) r 14 20 r suv
## 3 chevrolet c1500 suburban 2wd 5.3 2008 8 auto(l4) r 14 20 r suv
## 4 chevrolet corvette 5.7 1999 8 manual(m6) r 16 26 p 2seat~
## 5 chevrolet corvette 5.7 1999 8 auto(l4) r 15 23 p 2seat~
## 6 chevrolet corvette 6.2 2008 8 manual(m6) r 16 26 p 2seat~
## 7 chevrolet corvette 6.2 2008 8 auto(s6) r 15 25 p 2seat~
## 8 chevrolet corvette 7 2008 8 manual(m6) r 15 24 p 2seat~
## 9 ford mustang 4.6 1999 8 auto(l4) r 15 21 r subco~
## 10 ford mustang 4.6 1999 8 manual(m5) r 15 22 r subco~
## # ... with 146 more rows
## # i Use `print(n = ...)` to see more rows
We should read this as, “Start with mpg
, feed it into filter
, then feed the result into arrange
.” We avoid nesting and introducing intermediate data set names. The operator %>%
takes the data set that precedes it and feeds it into the function that follows it.
The pipe leads to much clearer and readable code, and we’ll use it constantly.
2.3.1 Exercises
Use the pipe to answer the following questions.
What is the farthest airport flown to by American Airlines?
What is the farthest airport flown to out of LaGuardia (LGA) by American Airlines?
What was the most delayed flight during the winter (January, February, and December)? What was the most delayed during the summer (June, July, and August)?
2.4 Selecting Columns
A very common procedure when analyzing a data set is to get rid of columns that are irrelevant to your analysis. This is accomplished using the select
function. For example, suppose we want to pare down the mpg
data set so that the only variables displayed are manufacturer
, model
, displ
, and hwy
:
select(mpg, manufacturer, model, displ, hwy)
## # A tibble: 234 x 4
## manufacturer model displ hwy
## <chr> <chr> <dbl> <int>
## 1 audi a4 1.8 29
## 2 audi a4 1.8 29
## 3 audi a4 2 31
## 4 audi a4 2 30
## 5 audi a4 2.8 26
## 6 audi a4 2.8 26
## 7 audi a4 3.1 27
## 8 audi a4 quattro 1.8 26
## 9 audi a4 quattro 1.8 25
## 10 audi a4 quattro 2 28
## # ... with 224 more rows
## # i Use `print(n = ...)` to see more rows
There are a lot of built-in “helper” functions available within select
. You can read the documentation (?select
) to see them, but there are a few particularly useful ones. For one, it’s often easier to specify the columns you don’t want. For example, suppose we want to keep every column of mpg
except trans
and drv
:
select(mpg, -trans, -drv)
## # A tibble: 234 x 9
## manufacturer model displ year cyl cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 18 29 p compact
## 2 audi a4 1.8 1999 4 21 29 p compact
## 3 audi a4 2 2008 4 20 31 p compact
## 4 audi a4 2 2008 4 21 30 p compact
## 5 audi a4 2.8 1999 6 16 26 p compact
## 6 audi a4 2.8 1999 6 18 26 p compact
## 7 audi a4 3.1 2008 6 18 27 p compact
## 8 audi a4 quattro 1.8 1999 4 18 26 p compact
## 9 audi a4 quattro 1.8 1999 4 16 25 p compact
## 10 audi a4 quattro 2 2008 4 20 28 p compact
## # ... with 224 more rows
## # i Use `print(n = ...)` to see more rows
It’s also helpful to specify a set of columns to keep that appear consecutively. For example, suppose we want to keep manufacturer
, model
, displ
, year
, and cyl
. Since they appear in consecutive columns, there’s a shortcut we can use to select them:
select(mpg, manufacturer:cyl)
## # A tibble: 234 x 5
## manufacturer model displ year cyl
## <chr> <chr> <dbl> <int> <int>
## 1 audi a4 1.8 1999 4
## 2 audi a4 1.8 1999 4
## 3 audi a4 2 2008 4
## 4 audi a4 2 2008 4
## 5 audi a4 2.8 1999 6
## 6 audi a4 2.8 1999 6
## 7 audi a4 3.1 2008 6
## 8 audi a4 quattro 1.8 1999 4
## 9 audi a4 quattro 1.8 1999 4
## 10 audi a4 quattro 2 2008 4
## # ... with 224 more rows
## # i Use `print(n = ...)` to see more rows
Of course, we can use select
together with the other transformation functions via the pipe. Let’s rank the Chevrolets in mpg
in descending order of city gas mileage and show only the car model, engine size, year, number of cylinders, transmission type, and city mileage:
%>%
mpg filter(manufacturer == "chevrolet") %>%
arrange(desc(cty)) %>%
select(model:trans, cty)
## # A tibble: 19 x 6
## model displ year cyl trans cty
## <chr> <dbl> <int> <int> <chr> <int>
## 1 malibu 2.4 2008 4 auto(l4) 22
## 2 malibu 2.4 1999 4 auto(l4) 19
## 3 malibu 3.1 1999 6 auto(l4) 18
## 4 malibu 3.5 2008 6 auto(l4) 18
## 5 malibu 3.6 2008 6 auto(s6) 17
## 6 corvette 5.7 1999 8 manual(m6) 16
## 7 corvette 6.2 2008 8 manual(m6) 16
## 8 corvette 5.7 1999 8 auto(l4) 15
## 9 corvette 6.2 2008 8 auto(s6) 15
## 10 corvette 7 2008 8 manual(m6) 15
## 11 c1500 suburban 2wd 5.3 2008 8 auto(l4) 14
## 12 c1500 suburban 2wd 5.3 2008 8 auto(l4) 14
## 13 k1500 tahoe 4wd 5.3 2008 8 auto(l4) 14
## 14 k1500 tahoe 4wd 6.5 1999 8 auto(l4) 14
## 15 c1500 suburban 2wd 5.7 1999 8 auto(l4) 13
## 16 c1500 suburban 2wd 6 2008 8 auto(l4) 12
## 17 c1500 suburban 2wd 5.3 2008 8 auto(l4) 11
## 18 k1500 tahoe 4wd 5.3 2008 8 auto(l4) 11
## 19 k1500 tahoe 4wd 5.7 1999 8 auto(l4) 11
2.5 Creating New Variables from Old
This section requires the nycflights13 library.
It’s often necessary to use the variables in a data set to create a new variable. The transformation function used to do this is mutate
.
For example, we might want to add a column to flights
that contains each flight’s average air speed. Recalling that average speed is total distance divided by total time, we can calculate a flight’s air speed by calculating distance / air_time
. Well, almost. The units of distance
are miles, and the units of air_time
are minutes, so to calculate air speed in the more familiar units of miles per hour, we’d actually want to compute distance / air_time * 60
.
To make this calculation for each row of flights
and add it as a new column (let’s call this new column air_speed
), we apply the mutate
function as follows:
%>%
flights mutate(air_speed = distance / air_time * 60)
When mutating a new variable onto a data set, it’s a better practice to assign the mutated data set a new name. You can do so by inserting a name assignment <NEW NAME> <-
at the beginning of the code above:
<- flights %>%
flights_w_speed mutate(air_speed = distance / air_time * 60)
You can now refer to flights_w_speed
for subsequent analysis.
The argument of mutate
is always of the form:
<NEW VARIABLE NAME> = <FORMULA INVOLVING OLD VARIABLES>
The formula is often constructed using basic arithmetic, as with the air speed calculation above. However, it’s worth mentioning a few other useful operations to use within formulas.
Run the following lines of code, and try to guess what the %/%
and %%
operators do:
27 %/% 4
27 %% 4
You may have guessed that %/%
returns the whole number quotient obtained when dividing one number by another, while %%
returns the remainder. We can see right away how these operations might be applied by doing the following: Add a column to flights
that computes the number of minutes since midnight until each flight departs and another column that computes the number of minutes since midnight until each flight arrives.
For example, for the very first flight in the data set, the value of dep_time
is 517, and the value of arr_time
is 830. These numbers are really just abbreviations for 5:17am and 8:30am, and are not numbers that could be directly used in calculations. Converting these times to minutes since midnight would provide an actual numerical measurement (not a time) that indicates when a flight departed or arrived.
For the 5:17 departure time, we would convert to minutes since midnight by calculating 5*60 + 17
, which is 317. But how would we tell R to do this? We’d have to have a way to separate the first one or two digits from the last two digits. This is where %/%
and %%
come in: Notice that 517 %/% 100 = 5
and 517 %% 100 = 17
. Thus, the conversion can be accomplished by the formula (517 %/% 100)*60 + (517 %% 100)
.
We can now add our minutes since midnight variables as follows. (Notice that you can create more than one new variable within a single mutate
call.) We’ll also select the relevant columns so they’re visible in the output.
<- flights %>%
flights_w_mins_midnight mutate(dep_mins_midnight = (dep_time %/% 100)*60 + (dep_time %% 100),
arr_mins_midnight = (arr_time %/% 100)*60 + (arr_time %% 100)) %>%
select(dep_time, dep_mins_midnight, arr_time, arr_mins_midnight)
flights_w_mins_midnight
## # A tibble: 336,776 x 4
## dep_time dep_mins_midnight arr_time arr_mins_midnight
## <int> <dbl> <int> <dbl>
## 1 517 317 830 510
## 2 533 333 850 530
## 3 542 342 923 563
## 4 544 344 1004 604
## 5 554 354 812 492
## 6 554 354 740 460
## 7 555 355 913 553
## 8 557 357 709 429
## 9 557 357 838 518
## 10 558 358 753 473
## # ... with 336,766 more rows
## # i Use `print(n = ...)` to see more rows
Another very useful operation to use within a mutate
formula is ifelse
. The ifelse
function returns a value that depends on the truth or falsehood of some condition. The basic syntax is
ifelse(<CONDITION>,
<VALUE IF CONDITION IS TRUE>,
<VALUE IF CONDITION IS FALSE)
For example, the ifelse
below says to check whether x < y
. If it is, then return the value of x^2
. If it’s not, then return the string “hello.”
<- 4
x <- 2
y
ifelse(x < y,
^2,
x"hello")
## [1] "hello"
ifelse
statements can be used within mutate
to create new variables in a way that depends on conditions on the old variables. It’s an especially useful way to categorize continuous variables, i.e, to assign categorical values to continuous ones. For example, maybe we want to add a column to flights
that states whether a flight arrived late or on time (or early). (We’ll just select the relevant columns to make our new column visible.)
<- flights %>%
flights_w_arr_status mutate(arr_status = ifelse(arr_delay <= 0,
"on time",
"late")) %>%
select(month, day, arr_delay, origin, dest, arr_status)
flights_w_arr_status
## # A tibble: 336,776 x 6
## month day arr_delay origin dest arr_status
## <int> <int> <dbl> <chr> <chr> <chr>
## 1 1 1 11 EWR IAH late
## 2 1 1 20 LGA IAH late
## 3 1 1 33 JFK MIA late
## 4 1 1 -18 JFK BQN on time
## 5 1 1 -25 LGA ATL on time
## 6 1 1 12 EWR ORD late
## 7 1 1 19 EWR FLL late
## 8 1 1 -14 LGA IAD on time
## 9 1 1 -8 JFK MCO on time
## 10 1 1 8 LGA ORD late
## # ... with 336,766 more rows
## # i Use `print(n = ...)` to see more rows
Simple ifelse
statements like the one above work well when creating binary (two-value) categorical variables like a flight’s arrival status. However, to create categorical variables with more than two possible values, ifelse
is not ideal.
Suppose we want to re-do the above categorization but with an extra “canceled” category for the flights with an NA
in the arr_delay
column. One (non-optimal) way to do this would be with a nested ifelse
statement. The logic would work like this: First, check whether arr_delay
is NA. If it is, assign a value of “canceled.” If it isn’t, then we’re not yet ready to assign a value since it could still be either “on time” or “late.” We have more checking to do. Thus, in the third argument of our ifelse
, we have to start a new ifelse
. This new ifelse
will check whether arr_delay
is negative or 0. If so, we’ll assign “on time.” If not, the only remaining option is for arr_delay
to be positive, so we’ll assign “late.” The filter
at the end is to ensure that a value of “canceled” actually shows up in the visible output.
<- flights %>%
flights_w_arr_status2 mutate(arr_status = ifelse(is.na(arr_delay),
"canceled",
ifelse(arr_delay <= 0,
"on time",
"late"))) %>%
select(month, day, arr_delay, origin, dest, arr_status) %>%
filter(row_number() >= 611)
flights_w_arr_status2
## # A tibble: 336,166 x 6
## month day arr_delay origin dest arr_status
## <int> <int> <dbl> <chr> <chr> <chr>
## 1 1 1 46 JFK RSW late
## 2 1 1 -12 EWR PBI on time
## 3 1 1 3 LGA DTW late
## 4 1 1 -21 JFK PDX on time
## 5 1 1 68 LGA MIA late
## 6 1 1 NA LGA XNA canceled
## 7 1 1 -24 LGA MCO on time
## 8 1 1 66 JFK RDU late
## 9 1 1 -2 EWR MCO on time
## 10 1 1 27 JFK TPA late
## # ... with 336,156 more rows
## # i Use `print(n = ...)` to see more rows
There are some problems with this, the most obvious of which is that nested ifelse
statements are hard to read, especially when there are a lot of categories. (The above only has three categories and is already very cluttered with parentheses.) Another problem is that nested ifelse
statements depend on the order in which the conditions are checked. One of the exercises for this section will illustrate what happens when the NA
condition above is not checked first.
There is a much better option for categorizations with more than two categories: case_when
. Here’s how it works:
<- flights %>%
flights_w_arr_status3 mutate(arr_status = case_when(is.na(arr_delay) ~ "canceled",
<= 0 ~ "on time",
arr_delay >0 ~ "late")) %>%
arr_delay select(month, day, arr_delay, origin, dest, arr_status) %>%
filter(row_number() >= 611)
flights_w_arr_status3
## # A tibble: 336,166 x 6
## month day arr_delay origin dest arr_status
## <int> <int> <dbl> <chr> <chr> <chr>
## 1 1 1 46 JFK RSW late
## 2 1 1 -12 EWR PBI on time
## 3 1 1 3 LGA DTW late
## 4 1 1 -21 JFK PDX on time
## 5 1 1 68 LGA MIA late
## 6 1 1 NA LGA XNA canceled
## 7 1 1 -24 LGA MCO on time
## 8 1 1 66 JFK RDU late
## 9 1 1 -2 EWR MCO on time
## 10 1 1 27 JFK TPA late
## # ... with 336,156 more rows
## # i Use `print(n = ...)` to see more rows
Not only is this much clearer code, it’s more robust since it does not depend on the order in which we list the conditions, as we’ll see in the exercises.
In the case_when
statement above, the third condition arr_delay > 0
is the only option left after the first two. If we don’t want to specify the last remaining condition explicitly but instead just let it catch all the remaining cases, we can just enter TRUE
for this condition. The following case_when
is thus equivalent to the one above:
case_when(is.na(arr_delay) ~ "canceled",
<= 0 ~ "on time",
arr_delay TRUE ~ "late")
2.5.1 Exercises
These exercise require the nycflights13 and Lahman libraries.
Find the fastest and slowest flights (in terms of air speed). State the actual average air speeds in miles per hour.
Use the variables
dep_mins_midnight
andarr_mins_midnight
to calculate the number of minutes fromdep_time
toarr_time
for each flight. Compare your answers to theair_time
values. Why don’t they match?Add a column to
flights
that indicates whether a flight was canceled. The new column’s values should be eithercanceled
ornot canceled
. Then find the first date on which a flight was canceled. (Have R do this for you; don’t just scroll through the data set.)Count the number of
NA
cells in thearr_status
variable inflights_w_arr_status
we created above. (Have R do this; don’t manually count them yourself.) Why did these flights not have a value forarr_status
? (Hint: What happens when you ask R whetherNA <= 0
?)In the nested
ifelse
we used to createflights_w_arr_status2
above, rearrange the conditions so thatis.na(arr_delay)
is not the first condition checked. Then find a row in which thearr_delay
value isNA
and check whether it assigned the correct value ofarr_status
. If it didn’t, try to explain why.Re-do the previous exercise for the
case_when
used to createflights_w_arr_status3
. Does it handle theNA
s inarr_delay
correctly? Use your answer to explain whycase_when
is better than a nestedifelse
.Add a column to
flights
that gives the season for each flight. For purposes of this exercise, let’s say “winter” is January, February, and December, “spring” is March, April, and May, “summer” is June, July, and August, and “fall” is September, October, and November.Categorize the observations in
diamonds
as “expensive,” “very expensive,” “insanely expensive,” or “priceless.” Decide for yourself how to define these labels. Then create a visualization that shows the distribution of these labels.View the
Batting
data set from the Lahman library and read its documentation. How many observations and variables does it have?Batting
does not have a column for a player’s batting average, which measures the percentage of a player’s at bats that result in a hit. The formula isH / AB
. Create a batting average column inBatting
and find the player(s) with the all-time highest single season batting average. What’s strange about your result? (You can just refer to a player by theirplayerID
value if you don’t know their real name.)To fix the problem from the previous exercise, a minimum number of at bats are required for a batting average to be considered “official.” Re-do Exercise 6, but only rank the players who had at least 400 at bats.
The modern era of baseball is generally considered to have begun in 1947 (after World War II ended). Who’s had the best single season batting average of the modern era (assuming at least 400 at bats)? Answer this question entirely by using transformation functions rather than by just scrolling through the data.
When was the last year a player had a batting average of at least 0.400 with a minimum of 400 at bats? (Again, answer this question using only transformation functions.)
2.6 Grouped Summaries
Suppose we want to create a data table that ranks the manufacturers in mpg
according to the average highway gas mileage for each manufacturer. This would entail grouping the cars by the manufacturer
values and then computing the average value of hwy
for each group. This is an example of a grouped summary and is done with a combination of the group_by
and summarize
functions. Here’s how it’s done:
%>%
mpg group_by(manufacturer) %>%
summarize(average_highway_mileage = mean(hwy)) %>%
arrange(desc(average_highway_mileage))
## # A tibble: 15 x 2
## manufacturer average_highway_mileage
## <chr> <dbl>
## 1 honda 32.6
## 2 volkswagen 29.2
## 3 hyundai 26.9
## 4 audi 26.4
## 5 pontiac 26.4
## 6 subaru 25.6
## 7 toyota 24.9
## 8 nissan 24.6
## 9 chevrolet 21.9
## 10 ford 19.4
## 11 mercury 18
## 12 dodge 17.9
## 13 jeep 17.6
## 14 lincoln 17
## 15 land rover 16.5
A couple of notes:
group_by
andsummarize
are always used together.summarize
works similarly tomutate
in that it creates a new column with a name that the user provides (average_highway_mileage
above).mean
is a built-in function that computes the averagehwy
value in each group.The displayed data table will only show the grouped column and any computed summary columns.
You can group by more than one variable as well. Suppose we want to rank highway gas mileage by manufacturer and model:
%>%
mpg group_by(manufacturer, model) %>%
summarize(average_highway_mileage = mean(hwy)) %>%
arrange(desc(average_highway_mileage))
## # A tibble: 38 x 3
## # Groups: manufacturer [15]
## manufacturer model average_highway_mileage
## <chr> <chr> <dbl>
## 1 toyota corolla 34
## 2 volkswagen new beetle 32.8
## 3 honda civic 32.6
## 4 volkswagen jetta 29.1
## 5 nissan altima 28.7
## 6 audi a4 28.3
## 7 toyota camry 28.3
## 8 toyota camry solara 28.1
## 9 hyundai sonata 27.7
## 10 chevrolet malibu 27.6
## # ... with 28 more rows
## # i Use `print(n = ...)` to see more rows
mean
is just one example of a summary statistic that can be used inside summarize
. Here are a few more common ones:
statistic | syntax | use |
---|---|---|
mean | mean(<VARIABLE>) |
computes the average value of <VARIABLE> within each group |
sum | sum(<VARIABLE>) |
adds the values within each group |
count | n() |
counts the number of observations within each group |
distinct count | n_distinct(<VARIABLE>) |
counts the number of distinct values of <VARIABLE> within each group |
maximum | max(<VARIABLE>) |
returns the largest value of <VARIABLE> within each group |
minimum | min(<VARIABLE>) |
returns the smallest value of <VARIABLE> within each group |
standard deviation | sd(<VARIABLE>) |
computes the standard deviation of <VARIABLE> within each group |
median | median(<VARIABLE>) |
computes the median value of <VARIABLE> within each group |
There are many others as well. It’s certainly possible to compute more than one summary statistic within a single summarize
call. In fact, whenever you’re computing a mean value, it’s a good idea to also compute the count. This is because comparing averages is really only appropriate when the groups have roughly the same number of observations. If we do this for the average highway mileage example above, we get:
%>%
mpg group_by(manufacturer) %>%
summarize(average_highway_mileage = mean(hwy),
count = n()) %>%
arrange(desc(average_highway_mileage))
## # A tibble: 15 x 3
## manufacturer average_highway_mileage count
## <chr> <dbl> <int>
## 1 honda 32.6 9
## 2 volkswagen 29.2 27
## 3 hyundai 26.9 14
## 4 audi 26.4 18
## 5 pontiac 26.4 5
## 6 subaru 25.6 14
## 7 toyota 24.9 34
## 8 nissan 24.6 13
## 9 chevrolet 21.9 19
## 10 ford 19.4 25
## 11 mercury 18 4
## 12 dodge 17.9 37
## 13 jeep 17.6 8
## 14 lincoln 17 3
## 15 land rover 16.5 4
We would therefore want to refrain from declaring that Pontiacs have better fuel efficiencies than Toyotas since Pontiacs are relatively underrepresented in the data set.
Referring to the flights
data set, suppose we want to know which destination airport has the worst arrival delays. We could group flights
by the dest
variable and then compute the average value of arr_delay
for each group. Following the above advice, we’ll also compute the count for each group.
%>%
flights group_by(dest) %>%
summarize(avg_arr_delay = mean(arr_delay),
count = n()) %>%
arrange(desc(avg_arr_delay))
## # A tibble: 105 x 3
## dest avg_arr_delay count
## <chr> <dbl> <int>
## 1 SBN 6.5 10
## 2 EYW 6.35 17
## 3 ABQ 4.38 254
## 4 ANC -2.5 8
## 5 LEX -22 1
## 6 ACK NA 265
## 7 ALB NA 439
## 8 ATL NA 17215
## 9 AUS NA 2439
## 10 AVL NA 275
## # ... with 95 more rows
## # i Use `print(n = ...)` to see more rows
What’s the deal with all the NA
s in the avg_arr_delay
column? Remember that when a flight is canceled, there is no arrival and hence to arrival delay. Thus, canceled flights have NA
in the arr_delay
column. So in any dest
group which contains at least one canceled flight, the mean
function will be attempting to calculate an average value when not all of the values are available. This means that the average itself will be reported as NA
.
Luckily, there’s an easy, built-in way to avoid this. We can remove the NA
s from the mean computation by inserting the argument na.rm = TRUE
into mean
:
%>%
flights group_by(dest) %>%
summarize(avg_arr_delay = mean(arr_delay, na.rm = TRUE),
count = n()) %>%
arrange(desc(avg_arr_delay))
## # A tibble: 105 x 3
## dest avg_arr_delay count
## <chr> <dbl> <int>
## 1 CAE 41.8 116
## 2 TUL 33.7 315
## 3 OKC 30.6 346
## 4 JAC 28.1 25
## 5 TYS 24.1 631
## 6 MSN 20.2 572
## 7 RIC 20.1 2454
## 8 CAK 19.7 864
## 9 DSM 19.0 569
## 10 GRR 18.2 765
## # ... with 95 more rows
## # i Use `print(n = ...)` to see more rows
It looks like CAE (Columbia Metropolitan Airport) has the highest average arrival delay, although there were only 116 flights there. On the other hand, RIC (Richmond International Airport) maintained a high average arrival delay over almost 2500 flights. RIC might get my vote for the worst airport to fly into from New York.
If you look at the end of the list, you see that, on average, flights to Blue Grass Airport in Lexington, KY, arrived an average of 22 minutes early! This seems amazing until you notice that there was only one flight there.
## # A tibble: 6 x 3
## dest avg_arr_delay count
## <chr> <dbl> <int>
## 1 ANC -2.5 8
## 2 STT -3.84 522
## 3 SNA -7.87 825
## 4 PSP -12.7 19
## 5 LEX -22 1
## 6 LGA NaN 1
Last on the list is LGA, with an average arrival delay value of NaN
. This stands for “not a number.” If we extract out this observation, we might be able to see why:
%>%
flights filter(dest == "LGA") %>%
select(month, day, dest, dep_delay, arr_delay)
## # A tibble: 1 x 5
## month day dest dep_delay arr_delay
## <int> <int> <chr> <dbl> <dbl>
## 1 7 27 LGA NA NA
It looks like there was only one flight to LGA, and it was canceled, giving us an NA
in the arr_delay
column. When we then compute the average arrival delay and use na.rm = TRUE
to remove the NA
, we’re left with no numbers to take the average of! This would lead to a division by 0, which is not arithmetically possible. Hence the NaN
.
2.6.1 Exercises
These exercises will require both the nycflights13 and Lahman libraries.
Create a data set from
flights
that shows the average departure delay on each day of the year.Which carrier from
flights
had the best on-time arrival rate in 2013? Which had the worst? (Be sure to include counts so we know whether the rates are comparable.)Look at each destination in
flights
. Can you find any flights that were suspiciously fast, possibly indicating a data entry error?Which carrier in
flights
flew to the most distinct destinations?Which specific plane (identified by
tailnum
) had the most variation in the distances it flew? (Be sure to include counts in your analysis.)Which day of the year had the most cancellations? Which day had the least?
Using the
Batting
data set from Lahman, determine the maximum number of home runs hit by a player every year since 1871.Create a visualization of your findings from the previous exercise.
Create a data set that shows a ranked list of the all-time leaders in career batting average. (A player’s career batting average is his total number of hits divided by his total number of at bats.) Note that to be eligible for this list, a player must have at least 3000 total at bats. Be sure to take this into account when you create your list.
2.7 Project
Project Description: The purpose of this project is to explore and explain observations in the flights
data set by transforming and visualizing the data.
Instructions:
Answer the following questions using the appropriate transformation functions and visualizations.
Create a data set that lists the days during 2013 on which the percentage of flights that were either canceled or delayed on departure by more than one hour exceeded 35%.
Do a Google search to see what the weather was like in New York on each of the days from the previous problem in 2013. Do your findings explain the large number of delays/cancellations?
Create a data set that shows, for each scheduled departure hour during the day, the percentage of flights that were canceled or delayed by more than one hour.
Obtain a plot that shows the relationship between the hour and the percentage of flights canceled or delayed by more than an hour.
Explain the relationship in the previous problem. In particular:
- Why does the plot have the given shape?
- Are there outliers? If so, why might they be there?
- When is the riskiest time of day to fly if you want to avoid cancellations and long delays, and why might that be?
- Why does the plot have the given shape?
Above, you found dates on which a lot of cancellations and long delays occurred. However, several flights were able to leave on time or early on those dates. Create a data set that contains the flights that left on time or early on those dates.
It’s a reasonable guess that the flights that were able to leave early despite bad weather were probably the ones that left in the morning before the weather hit. For each date in the data set from the previous problem, find the average scheduled hour of departure. Do your findings mostly confirm the guess?
Did any of the problematic dates have an average departure hour in the afternoon? If so, why might that have been?
Guidelines:
Include a title, your name, and the date in the heading.
Also include a preliminary code chunk in which you load the libraries you’ll need, and briefly say what each one is for.
Begin with an introductory paragraph containing, at least, a description of the data set (including what it contains, its size, and its source) and a nicely displayed data table using the
datatable
function. (If the data set is too big to display when you try to knit your .Rmd file, you don’t have to display it.)Clearly describe what you’ll be doing with the data, and include any questions you’ll be trying to answer.
Follow the introduction and problem statement with the body of your report which will contain your work, broken into relevant section headings.
The compiled report should show all of your R code and its output, but it should not show any warning or error messages. (This is for my benefit. A professional report might not include the code, depending on the audience.)
The body should also include text that provides a running narrative that guides the reader through your work. This should be thorough and complete, but it should avoid large blocks of text.
All graphics should look nice, be easy to read, and be fully labelled.
All data tables should be displayed with the
datatable
function.You should include insightful statements about the meaning you’re extracting from your graphics.
End with an overall concluding statement which summarizes your findings.
Grading Rubric:
Transformations: Do you use the right transformation functions to answer the questions? (30 points)
Insights: Are insights fully explained and well-written? Are the insights derived from your results? (30 points)
Narrative: Is it clear what you’re trying to do in this project? Do you maintain a readable narrative throughout that guides the reader through your analysis? (20 points)
Professionalism: Does your report look nice? Do you provide insights based on your analysis? Is your code clear and readable? (15 points)