Chapter 2 Data Transformation

In this chapter, we will use a set of functions from the dplyr package (which is part of tidyverse) to transform a data set in various ways. In particular, we will see how to filter out observations, sort the rows of a data set, remove unnecessary variables, create new variables from old ones, and group observations together. We’ll also introduce a tidyverse construct called “the pipe” which provides a streamlined way to perform several transformations in succession.

We’ll need to install two new packages:

install.packages("nycflights13")
install.packages("Lahman")

The first package contains a data set with records from all of the flights that originated from a major New York City airport in 2013. The second contains Major League Baseball data sets dating back to 1871. Be sure to load these libraries, along with tidyverse.

library(tidyverse)
library(nycflights13)
library(Lahman)

2.1 Filtering a Data Set

Often, we will want to filter a data set so that we only see the observations that satisfy a certain condition. For example, suppose we want to see only the cars in the mpg data set that have at least 6-cylinder engines. The code for this is:

filter(mpg, cyl >= 6)
## # A tibble: 149 x 11
##    manufacturer model      displ  year   cyl trans      drv     cty   hwy fl    class  
##    <chr>        <chr>      <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr>  
##  1 audi         a4           2.8  1999     6 auto(l5)   f        16    26 p     compact
##  2 audi         a4           2.8  1999     6 manual(m5) f        18    26 p     compact
##  3 audi         a4           3.1  2008     6 auto(av)   f        18    27 p     compact
##  4 audi         a4 quattro   2.8  1999     6 auto(l5)   4        15    25 p     compact
##  5 audi         a4 quattro   2.8  1999     6 manual(m5) 4        17    25 p     compact
##  6 audi         a4 quattro   3.1  2008     6 auto(s6)   4        17    25 p     compact
##  7 audi         a4 quattro   3.1  2008     6 manual(m6) 4        15    25 p     compact
##  8 audi         a6 quattro   2.8  1999     6 auto(l5)   4        15    24 p     midsize
##  9 audi         a6 quattro   3.1  2008     6 auto(s6)   4        17    25 p     midsize
## 10 audi         a6 quattro   4.2  2008     8 auto(s6)   4        16    23 p     midsize
## # ... with 139 more rows
## # i Use `print(n = ...)` to see more rows

You can see the full filtered data set by assigning it a name and then viewing it:

cyl6 <- filter(mpg, cyl >= 6)
View(cyl)

In general, the syntax for the filter function is:


filter(<DATA SET>, <CONDITIONS>)

The condition arguments in filter are statements whose values are either TRUE or FALSE. These statements are often constructed as equalities and inequalities. For example, suppose we make the assignments x <- 7 and y <- 9. Then we would have:

relation statement value
less than x < y TRUE
greater than x > y FALSE
less than or equal to x+3 <= y FALSE
greater than or equal to x >= y-2 TRUE
equal to 3*x+5 == 4*y-10 TRUE

Notice that the double equal sign == is used to check equality, while the single equal sign = is used to make assignments of values to variables.

You can also negate or form compound statements. For example:

relation statement value
not equal to x != y TRUE
and (x < y) & (x+2 == y) TRUE
(x < y) & (x != x) FALSE
or (x == 8) | (y == 9) TRUE
(x >= y) | (y < 0) FALSE

Notice that an “and” statement is only true when both of the conditions are true. An “or” statement is true as long as at least one of the conditions is true.

The values TRUE and FALSE can actually be handled arithmetically in R. The value TRUE is given the value 1, and FALSE is given 0. For example:

TRUE + TRUE
## [1] 2
FALSE * 28
## [1] 0

This is useful when you want to count how many values in a column satisfy a certain condition. For example, suppose we want to know how many cars in mpg have 4-wheel drive. In the following code chunk, mpg$drv refers to the drv column within mpg.

sum(mpg$drv == "4")
## [1] 103

The condition mpg$drv == "4" assigns a value of TRUE or FALSE to all of the values in the drv column of mpg depending on whether the drv value is “4” or not. (Since drv is a non-numeric variable, we need quotes around its values, even though 4 itself is actually a number.) The sum function then adds these TRUE/FALSE values, treating TRUE as 1 and FALSE as 0.

We can also use the mean function to calculate the percentage of values in a column that satisfy a condition. For example, the percentage of 4-wheel drive cars in mpg is given by:

mean(mpg$drv == "4")
## [1] 0.4401709

The reason this works is that mean finds the average of all of the 1’s and 0’s assigned to the values of the drv column by the statement mpg$drv == "4". In doing so, it finds the sum of these 1’s and 0’s (which results in the number of 4-wheel drive cars) and then divides by the total number of 1’s and 0’s. Thus, we’re dividing the number of 4-wheel drive cars by the total number of cars, resulting in the percentage of 4-wheel drives.


It’s often the case that a data set contains missing values for various reasons. In R, these missing values are indicated with the letters NA, which stand for “not available.” For example, examine the built-in msleep data set, which contains data relating to the sleep habits of various mammals.

msleep

Missing data is a problem for data analysts, and we’ll eventually talk about different methods for dealing with it. One method, though, is to filter out the observations that are missing values. Suppose, for example, that we want to filter out the observations in msleep whose conservation status is missing. Here’s the syntax:

filter(msleep, !is.na(conservation))
## # A tibble: 54 x 11
##    name               genus vore  order conse~1 sleep~2 sleep~3 sleep~4 awake  brainwt  bodywt
##    <chr>              <chr> <chr> <chr> <chr>     <dbl>   <dbl>   <dbl> <dbl>    <dbl>   <dbl>
##  1 Cheetah            Acin~ carni Carn~ lc         12.1    NA    NA      11.9 NA        50    
##  2 Mountain beaver    Aplo~ herbi Rode~ nt         14.4     2.4  NA       9.6 NA         1.35 
##  3 Greater short-tai~ Blar~ omni  Sori~ lc         14.9     2.3   0.133   9.1  0.00029   0.019
##  4 Cow                Bos   herbi Arti~ domest~     4       0.7   0.667  20    0.423   600    
##  5 Northern fur seal  Call~ carni Carn~ vu          8.7     1.4   0.383  15.3 NA        20.5  
##  6 Dog                Canis carni Carn~ domest~    10.1     2.9   0.333  13.9  0.07     14    
##  7 Roe deer           Capr~ herbi Arti~ lc          3      NA    NA      21    0.0982   14.8  
##  8 Goat               Capri herbi Arti~ lc          5.3     0.6  NA      18.7  0.115    33.5  
##  9 Guinea pig         Cavis herbi Rode~ domest~     9.4     0.8   0.217  14.6  0.0055    0.728
## 10 Grivet             Cerc~ omni  Prim~ lc         10       0.7  NA      14   NA         4.75 
## # ... with 44 more rows, and abbreviated variable names 1: conservation, 2: sleep_total,
## #   3: sleep_rem, 4: sleep_cycle
## # i Use `print(n = ...)` to see more rows

Let’s examine the condition !is.na(conservation) more closely. First, notice that the way to check whether a value is missing is not to enter conservation == NA. The reason is that NA is an indication that a value is not available; it’s not a value itself. The code conservation == NA would thus be the meaningless statement that says the value of conservation is valueless. For this reason, the code is.na(conservation) has to be used instead. Recalling that the ! preceding a condition negates the condition, we then see that !is.na(conservation) is the condition that says the value of conservation is not NA. The above filter call thus returns the observations which do not have an NA in the conservation column.

2.1.1 Exercises

These exercises require the nycflights13 package you installed earlier. Be sure the library is loaded.

  1. The nycflights13 package contains a data set called flights. Load this data set and read its documentation: ?flights. How many observations does it have? How many variables?

  2. Find all of the flights that

    1. departed in February.
    2. were operated by United or American Airlines.
    3. departed in summer (June, July, and August).
    4. arrived more than two hours late, but did not leave late.
    5. were delayed by more than an hour, but made up over 30 minutes during the flight.
    6. departed between midnight and 6am.
  3. How many flights were canceled? (Think about how a canceled flight might be detected from the data set.)

  4. What was the on-time arrival rate for Delta Airlines during 2013? What was it during the winter months (January, February, and December)? A flight that did not arrive, due to a cancellation, crash, emergency landing, etc, should not figure into the on-time arrival rate.

2.2 Sorting a Data Set

The function used to sort a data set is arrange. For example, suppose we want to sort the observations in mpg according to the highway gas mileage:

arrange(mpg, hwy)
## # A tibble: 234 x 11
##    manufacturer model               displ  year   cyl trans      drv     cty   hwy fl    class
##    <chr>        <chr>               <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr>
##  1 dodge        dakota pickup 4wd     4.7  2008     8 auto(l5)   4         9    12 e     pick~
##  2 dodge        durango 4wd           4.7  2008     8 auto(l5)   4         9    12 e     suv  
##  3 dodge        ram 1500 pickup 4wd   4.7  2008     8 auto(l5)   4         9    12 e     pick~
##  4 dodge        ram 1500 pickup 4wd   4.7  2008     8 manual(m6) 4         9    12 e     pick~
##  5 jeep         grand cherokee 4wd    4.7  2008     8 auto(l5)   4         9    12 e     suv  
##  6 chevrolet    k1500 tahoe 4wd       5.3  2008     8 auto(l4)   4        11    14 e     suv  
##  7 jeep         grand cherokee 4wd    6.1  2008     8 auto(l5)   4        11    14 p     suv  
##  8 chevrolet    c1500 suburban 2wd    5.3  2008     8 auto(l4)   r        11    15 e     suv  
##  9 chevrolet    k1500 tahoe 4wd       5.7  1999     8 auto(l4)   4        11    15 r     suv  
## 10 dodge        dakota pickup 4wd     5.2  1999     8 auto(l4)   4        11    15 r     pick~
## # ... with 224 more rows
## # i Use `print(n = ...)` to see more rows

You can see that by default, arrange sorts the observations in ascending order. To sort in descending order, do the following:

arrange(mpg, desc(hwy))
## # A tibble: 234 x 11
##    manufacturer model      displ  year   cyl trans      drv     cty   hwy fl    class     
##    <chr>        <chr>      <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr>     
##  1 volkswagen   jetta        1.9  1999     4 manual(m5) f        33    44 d     compact   
##  2 volkswagen   new beetle   1.9  1999     4 manual(m5) f        35    44 d     subcompact
##  3 volkswagen   new beetle   1.9  1999     4 auto(l4)   f        29    41 d     subcompact
##  4 toyota       corolla      1.8  2008     4 manual(m5) f        28    37 r     compact   
##  5 honda        civic        1.8  2008     4 auto(l5)   f        25    36 r     subcompact
##  6 honda        civic        1.8  2008     4 auto(l5)   f        24    36 c     subcompact
##  7 toyota       corolla      1.8  1999     4 manual(m5) f        26    35 r     compact   
##  8 toyota       corolla      1.8  2008     4 auto(l4)   f        26    35 r     compact   
##  9 honda        civic        1.8  2008     4 manual(m5) f        26    34 r     subcompact
## 10 honda        civic        1.6  1999     4 manual(m5) f        28    33 r     subcompact
## # ... with 224 more rows
## # i Use `print(n = ...)` to see more rows

You can also include a “tie-breaker” variable in arrange. Notice that the Volkswagen Jetta and the Volkswagen New Beetle manual tie for the best highway mileage. Let’s use city mileage (in descending order) as the tie-breaker. You just have to add the tie-breaker variable as a third argument inside arrange:

arrange(mpg, desc(hwy), desc(cty))
## # A tibble: 234 x 11
##    manufacturer model      displ  year   cyl trans      drv     cty   hwy fl    class     
##    <chr>        <chr>      <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr>     
##  1 volkswagen   new beetle   1.9  1999     4 manual(m5) f        35    44 d     subcompact
##  2 volkswagen   jetta        1.9  1999     4 manual(m5) f        33    44 d     compact   
##  3 volkswagen   new beetle   1.9  1999     4 auto(l4)   f        29    41 d     subcompact
##  4 toyota       corolla      1.8  2008     4 manual(m5) f        28    37 r     compact   
##  5 honda        civic        1.8  2008     4 auto(l5)   f        25    36 r     subcompact
##  6 honda        civic        1.8  2008     4 auto(l5)   f        24    36 c     subcompact
##  7 toyota       corolla      1.8  1999     4 manual(m5) f        26    35 r     compact   
##  8 toyota       corolla      1.8  2008     4 auto(l4)   f        26    35 r     compact   
##  9 honda        civic        1.8  2008     4 manual(m5) f        26    34 r     subcompact
## 10 honda        civic        1.6  1999     4 manual(m5) f        28    33 r     subcompact
## # ... with 224 more rows
## # i Use `print(n = ...)` to see more rows

We can also sort by a function of a variable or variables. For example, suppose we want to sort the mpg data according to the difference between the highway and city gas mileages (in descending order). We just have to use hwy - cty as the sort variable:

arrange(mpg, desc(hwy - cty))
## # A tibble: 234 x 11
##    manufacturer model      displ  year   cyl trans      drv     cty   hwy fl    class     
##    <chr>        <chr>      <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr>     
##  1 honda        civic        1.8  2008     4 auto(l5)   f        24    36 c     subcompact
##  2 volkswagen   new beetle   1.9  1999     4 auto(l4)   f        29    41 d     subcompact
##  3 audi         a4           1.8  1999     4 auto(l5)   f        18    29 p     compact   
##  4 audi         a4           2    2008     4 manual(m6) f        20    31 p     compact   
##  5 chevrolet    malibu       3.5  2008     6 auto(l4)   f        18    29 r     midsize   
##  6 honda        civic        1.8  2008     4 auto(l5)   f        25    36 r     subcompact
##  7 volkswagen   jetta        1.9  1999     4 manual(m5) f        33    44 d     compact   
##  8 volkswagen   passat       1.8  1999     4 auto(l5)   f        18    29 p     midsize   
##  9 audi         a4           2.8  1999     6 auto(l5)   f        16    26 p     compact   
## 10 audi         a4 quattro   2.8  1999     6 auto(l5)   4        15    25 p     compact   
## # ... with 224 more rows
## # i Use `print(n = ...)` to see more rows

The 2008 Honda Civic thus has the largest difference between city and highway gas mileage.

Notice in the previous sort that there is no column that actually lists the highway/city mileage differences. A better practice would be to create a new variable that records these values and then to sort by that new variable. We’ll see how to create new variables in Section 2.5.

2.2.1 Exercises

These exercises require the flights data set from the nycflights13 library.

  1. Where does arrange sort the NA values in a column? (Experiment with a data set that has missing values.) How could you force arrange to sort all of the NA values to the top of the list? (Try using is.na.)

  2. What was the longest delay of any flight?

  3. What flight left the earliest in the day?

  4. What flight averaged the fastest speed while in the air? (Average speed is the total distance traveled divided by the total time spent in the air.) What flight averaged the slowest speed?

  5. What flight traveled the farthest distance? Which one traveled the shortest distance?

2.3 The Pipe

We often will want to use two or more transformation functions in succession. For example, suppose we want to filter out the cars in mpg that get less than 20 miles per gallon on the highway and then sort the remaining cars according to engine size (displ) in descending order. There are a couple of naive ways to do this. One is:

arrange(filter(mpg, hwy >= 20), desc(displ))
## # A tibble: 156 x 11
##    manufacturer model              displ  year   cyl trans      drv     cty   hwy fl    class 
##    <chr>        <chr>              <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
##  1 chevrolet    corvette             7    2008     8 manual(m6) r        15    24 p     2seat~
##  2 chevrolet    corvette             6.2  2008     8 manual(m6) r        16    26 p     2seat~
##  3 chevrolet    corvette             6.2  2008     8 auto(s6)   r        15    25 p     2seat~
##  4 chevrolet    corvette             5.7  1999     8 manual(m6) r        16    26 p     2seat~
##  5 chevrolet    corvette             5.7  1999     8 auto(l4)   r        15    23 p     2seat~
##  6 ford         mustang              5.4  2008     8 manual(m6) r        14    20 p     subco~
##  7 chevrolet    c1500 suburban 2wd   5.3  2008     8 auto(l4)   r        14    20 r     suv   
##  8 chevrolet    c1500 suburban 2wd   5.3  2008     8 auto(l4)   r        14    20 r     suv   
##  9 pontiac      grand prix           5.3  2008     8 auto(s4)   f        16    25 p     midsi~
## 10 ford         mustang              4.6  1999     8 auto(l4)   r        15    21 r     subco~
## # ... with 146 more rows
## # i Use `print(n = ...)` to see more rows

Nesting functions in this way is often not a good practice because it becomes hard to keep track of parentheses, especially when you nest several functions. A slightly better (but still not ideal) method is to perform one transformation at a time, storing the intermediate results along the way:

filtered_mpg <- filter(mpg, hwy >= 20)
arrange(filtered_mpg, desc(cyl))
## # A tibble: 156 x 11
##    manufacturer model              displ  year   cyl trans      drv     cty   hwy fl    class 
##    <chr>        <chr>              <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
##  1 audi         a6 quattro           4.2  2008     8 auto(s6)   4        16    23 p     midsi~
##  2 chevrolet    c1500 suburban 2wd   5.3  2008     8 auto(l4)   r        14    20 r     suv   
##  3 chevrolet    c1500 suburban 2wd   5.3  2008     8 auto(l4)   r        14    20 r     suv   
##  4 chevrolet    corvette             5.7  1999     8 manual(m6) r        16    26 p     2seat~
##  5 chevrolet    corvette             5.7  1999     8 auto(l4)   r        15    23 p     2seat~
##  6 chevrolet    corvette             6.2  2008     8 manual(m6) r        16    26 p     2seat~
##  7 chevrolet    corvette             6.2  2008     8 auto(s6)   r        15    25 p     2seat~
##  8 chevrolet    corvette             7    2008     8 manual(m6) r        15    24 p     2seat~
##  9 ford         mustang              4.6  1999     8 auto(l4)   r        15    21 r     subco~
## 10 ford         mustang              4.6  1999     8 manual(m5) r        15    22 r     subco~
## # ... with 146 more rows
## # i Use `print(n = ...)` to see more rows

This approach avoids nesting, but it introduces a redundancy: the name filtered_mpg is introduced and then immediately used as an argument inside arrange. It would be much better to combine these steps. Luckily, tidyverse has a way to do this using a construct called the pipe.

The pipe provides a way to bypass the need to name an intermediate data set, only to immediately feed it into the next transformation function in line. Here’s how the above double transformation is accomplished using the pipe:

mpg %>%
  filter(hwy >= 20) %>%
  arrange(desc(cyl))
## # A tibble: 156 x 11
##    manufacturer model              displ  year   cyl trans      drv     cty   hwy fl    class 
##    <chr>        <chr>              <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
##  1 audi         a6 quattro           4.2  2008     8 auto(s6)   4        16    23 p     midsi~
##  2 chevrolet    c1500 suburban 2wd   5.3  2008     8 auto(l4)   r        14    20 r     suv   
##  3 chevrolet    c1500 suburban 2wd   5.3  2008     8 auto(l4)   r        14    20 r     suv   
##  4 chevrolet    corvette             5.7  1999     8 manual(m6) r        16    26 p     2seat~
##  5 chevrolet    corvette             5.7  1999     8 auto(l4)   r        15    23 p     2seat~
##  6 chevrolet    corvette             6.2  2008     8 manual(m6) r        16    26 p     2seat~
##  7 chevrolet    corvette             6.2  2008     8 auto(s6)   r        15    25 p     2seat~
##  8 chevrolet    corvette             7    2008     8 manual(m6) r        15    24 p     2seat~
##  9 ford         mustang              4.6  1999     8 auto(l4)   r        15    21 r     subco~
## 10 ford         mustang              4.6  1999     8 manual(m5) r        15    22 r     subco~
## # ... with 146 more rows
## # i Use `print(n = ...)` to see more rows

We should read this as, “Start with mpg, feed it into filter, then feed the result into arrange.” We avoid nesting and introducing intermediate data set names. The operator %>% takes the data set that precedes it and feeds it into the function that follows it.

The pipe leads to much clearer and readable code, and we’ll use it constantly.

2.3.1 Exercises

Use the pipe to answer the following questions.

  1. What is the farthest airport flown to by American Airlines?

  2. What is the farthest airport flown to out of LaGuardia (LGA) by American Airlines?

  3. What was the most delayed flight during the winter (January, February, and December)? What was the most delayed during the summer (June, July, and August)?

2.4 Selecting Columns

A very common procedure when analyzing a data set is to get rid of columns that are irrelevant to your analysis. This is accomplished using the select function. For example, suppose we want to pare down the mpg data set so that the only variables displayed are manufacturer, model, displ, and hwy:

select(mpg, manufacturer, model, displ, hwy)
## # A tibble: 234 x 4
##    manufacturer model      displ   hwy
##    <chr>        <chr>      <dbl> <int>
##  1 audi         a4           1.8    29
##  2 audi         a4           1.8    29
##  3 audi         a4           2      31
##  4 audi         a4           2      30
##  5 audi         a4           2.8    26
##  6 audi         a4           2.8    26
##  7 audi         a4           3.1    27
##  8 audi         a4 quattro   1.8    26
##  9 audi         a4 quattro   1.8    25
## 10 audi         a4 quattro   2      28
## # ... with 224 more rows
## # i Use `print(n = ...)` to see more rows

There are a lot of built-in “helper” functions available within select. You can read the documentation (?select) to see them, but there are a few particularly useful ones. For one, it’s often easier to specify the columns you don’t want. For example, suppose we want to keep every column of mpg except trans and drv:

select(mpg, -trans, -drv)
## # A tibble: 234 x 9
##    manufacturer model      displ  year   cyl   cty   hwy fl    class  
##    <chr>        <chr>      <dbl> <int> <int> <int> <int> <chr> <chr>  
##  1 audi         a4           1.8  1999     4    18    29 p     compact
##  2 audi         a4           1.8  1999     4    21    29 p     compact
##  3 audi         a4           2    2008     4    20    31 p     compact
##  4 audi         a4           2    2008     4    21    30 p     compact
##  5 audi         a4           2.8  1999     6    16    26 p     compact
##  6 audi         a4           2.8  1999     6    18    26 p     compact
##  7 audi         a4           3.1  2008     6    18    27 p     compact
##  8 audi         a4 quattro   1.8  1999     4    18    26 p     compact
##  9 audi         a4 quattro   1.8  1999     4    16    25 p     compact
## 10 audi         a4 quattro   2    2008     4    20    28 p     compact
## # ... with 224 more rows
## # i Use `print(n = ...)` to see more rows

It’s also helpful to specify a set of columns to keep that appear consecutively. For example, suppose we want to keep manufacturer, model, displ, year, and cyl. Since they appear in consecutive columns, there’s a shortcut we can use to select them:

select(mpg, manufacturer:cyl)
## # A tibble: 234 x 5
##    manufacturer model      displ  year   cyl
##    <chr>        <chr>      <dbl> <int> <int>
##  1 audi         a4           1.8  1999     4
##  2 audi         a4           1.8  1999     4
##  3 audi         a4           2    2008     4
##  4 audi         a4           2    2008     4
##  5 audi         a4           2.8  1999     6
##  6 audi         a4           2.8  1999     6
##  7 audi         a4           3.1  2008     6
##  8 audi         a4 quattro   1.8  1999     4
##  9 audi         a4 quattro   1.8  1999     4
## 10 audi         a4 quattro   2    2008     4
## # ... with 224 more rows
## # i Use `print(n = ...)` to see more rows

Of course, we can use select together with the other transformation functions via the pipe. Let’s rank the Chevrolets in mpg in descending order of city gas mileage and show only the car model, engine size, year, number of cylinders, transmission type, and city mileage:

mpg %>%
  filter(manufacturer == "chevrolet") %>%
  arrange(desc(cty)) %>%
  select(model:trans, cty)
## # A tibble: 19 x 6
##    model              displ  year   cyl trans        cty
##    <chr>              <dbl> <int> <int> <chr>      <int>
##  1 malibu               2.4  2008     4 auto(l4)      22
##  2 malibu               2.4  1999     4 auto(l4)      19
##  3 malibu               3.1  1999     6 auto(l4)      18
##  4 malibu               3.5  2008     6 auto(l4)      18
##  5 malibu               3.6  2008     6 auto(s6)      17
##  6 corvette             5.7  1999     8 manual(m6)    16
##  7 corvette             6.2  2008     8 manual(m6)    16
##  8 corvette             5.7  1999     8 auto(l4)      15
##  9 corvette             6.2  2008     8 auto(s6)      15
## 10 corvette             7    2008     8 manual(m6)    15
## 11 c1500 suburban 2wd   5.3  2008     8 auto(l4)      14
## 12 c1500 suburban 2wd   5.3  2008     8 auto(l4)      14
## 13 k1500 tahoe 4wd      5.3  2008     8 auto(l4)      14
## 14 k1500 tahoe 4wd      6.5  1999     8 auto(l4)      14
## 15 c1500 suburban 2wd   5.7  1999     8 auto(l4)      13
## 16 c1500 suburban 2wd   6    2008     8 auto(l4)      12
## 17 c1500 suburban 2wd   5.3  2008     8 auto(l4)      11
## 18 k1500 tahoe 4wd      5.3  2008     8 auto(l4)      11
## 19 k1500 tahoe 4wd      5.7  1999     8 auto(l4)      11

2.5 Creating New Variables from Old

This section requires the nycflights13 library.

It’s often necessary to use the variables in a data set to create a new variable. The transformation function used to do this is mutate.

For example, we might want to add a column to flights that contains each flight’s average air speed. Recalling that average speed is total distance divided by total time, we can calculate a flight’s air speed by calculating distance / air_time. Well, almost. The units of distance are miles, and the units of air_time are minutes, so to calculate air speed in the more familiar units of miles per hour, we’d actually want to compute distance / air_time * 60.

To make this calculation for each row of flights and add it as a new column (let’s call this new column air_speed), we apply the mutate function as follows:

flights %>%
  mutate(air_speed = distance / air_time * 60)

When mutating a new variable onto a data set, it’s a better practice to assign the mutated data set a new name. You can do so by inserting a name assignment <NEW NAME> <- at the beginning of the code above:

flights_w_speed <- flights %>%
  mutate(air_speed = distance / air_time * 60)

You can now refer to flights_w_speed for subsequent analysis.


The argument of mutate is always of the form:

<NEW VARIABLE NAME> = <FORMULA INVOLVING OLD VARIABLES>

The formula is often constructed using basic arithmetic, as with the air speed calculation above. However, it’s worth mentioning a few other useful operations to use within formulas.

Run the following lines of code, and try to guess what the %/% and %% operators do:

27 %/% 4
27 %% 4

You may have guessed that %/% returns the whole number quotient obtained when dividing one number by another, while %% returns the remainder. We can see right away how these operations might be applied by doing the following: Add a column to flights that computes the number of minutes since midnight until each flight departs and another column that computes the number of minutes since midnight until each flight arrives.

For example, for the very first flight in the data set, the value of dep_time is 517, and the value of arr_time is 830. These numbers are really just abbreviations for 5:17am and 8:30am, and are not numbers that could be directly used in calculations. Converting these times to minutes since midnight would provide an actual numerical measurement (not a time) that indicates when a flight departed or arrived.

For the 5:17 departure time, we would convert to minutes since midnight by calculating 5*60 + 17, which is 317. But how would we tell R to do this? We’d have to have a way to separate the first one or two digits from the last two digits. This is where %/% and %% come in: Notice that 517 %/% 100 = 5 and 517 %% 100 = 17. Thus, the conversion can be accomplished by the formula (517 %/% 100)*60 + (517 %% 100).

We can now add our minutes since midnight variables as follows. (Notice that you can create more than one new variable within a single mutate call.) We’ll also select the relevant columns so they’re visible in the output.

flights_w_mins_midnight <- flights %>%
  mutate(dep_mins_midnight = (dep_time %/% 100)*60 + (dep_time %% 100),
         arr_mins_midnight = (arr_time %/% 100)*60 + (arr_time %% 100)) %>%
  select(dep_time, dep_mins_midnight, arr_time, arr_mins_midnight)

flights_w_mins_midnight
## # A tibble: 336,776 x 4
##    dep_time dep_mins_midnight arr_time arr_mins_midnight
##       <int>             <dbl>    <int>             <dbl>
##  1      517               317      830               510
##  2      533               333      850               530
##  3      542               342      923               563
##  4      544               344     1004               604
##  5      554               354      812               492
##  6      554               354      740               460
##  7      555               355      913               553
##  8      557               357      709               429
##  9      557               357      838               518
## 10      558               358      753               473
## # ... with 336,766 more rows
## # i Use `print(n = ...)` to see more rows

Another very useful operation to use within a mutate formula is ifelse. The ifelse function returns a value that depends on the truth or falsehood of some condition. The basic syntax is


ifelse(<CONDITION>, 
       <VALUE IF CONDITION IS TRUE>,
       <VALUE IF CONDITION IS FALSE)

For example, the ifelse below says to check whether x < y. If it is, then return the value of x^2. If it’s not, then return the string “hello.”

x <- 4
y <- 2

ifelse(x < y, 
       x^2, 
       "hello")
## [1] "hello"

ifelse statements can be used within mutate to create new variables in a way that depends on conditions on the old variables. It’s an especially useful way to categorize continuous variables, i.e, to assign categorical values to continuous ones. For example, maybe we want to add a column to flights that states whether a flight arrived late or on time (or early). (We’ll just select the relevant columns to make our new column visible.)

flights_w_arr_status <- flights %>%
  mutate(arr_status = ifelse(arr_delay <= 0, 
                             "on time", 
                             "late")) %>%
  select(month, day, arr_delay, origin, dest, arr_status)

flights_w_arr_status
## # A tibble: 336,776 x 6
##    month   day arr_delay origin dest  arr_status
##    <int> <int>     <dbl> <chr>  <chr> <chr>     
##  1     1     1        11 EWR    IAH   late      
##  2     1     1        20 LGA    IAH   late      
##  3     1     1        33 JFK    MIA   late      
##  4     1     1       -18 JFK    BQN   on time   
##  5     1     1       -25 LGA    ATL   on time   
##  6     1     1        12 EWR    ORD   late      
##  7     1     1        19 EWR    FLL   late      
##  8     1     1       -14 LGA    IAD   on time   
##  9     1     1        -8 JFK    MCO   on time   
## 10     1     1         8 LGA    ORD   late      
## # ... with 336,766 more rows
## # i Use `print(n = ...)` to see more rows

Simple ifelse statements like the one above work well when creating binary (two-value) categorical variables like a flight’s arrival status. However, to create categorical variables with more than two possible values, ifelse is not ideal.

Suppose we want to re-do the above categorization but with an extra “canceled” category for the flights with an NA in the arr_delay column. One (non-optimal) way to do this would be with a nested ifelse statement. The logic would work like this: First, check whether arr_delay is NA. If it is, assign a value of “canceled.” If it isn’t, then we’re not yet ready to assign a value since it could still be either “on time” or “late.” We have more checking to do. Thus, in the third argument of our ifelse, we have to start a new ifelse. This new ifelse will check whether arr_delay is negative or 0. If so, we’ll assign “on time.” If not, the only remaining option is for arr_delay to be positive, so we’ll assign “late.” The filter at the end is to ensure that a value of “canceled” actually shows up in the visible output.

flights_w_arr_status2 <- flights %>%
  mutate(arr_status = ifelse(is.na(arr_delay), 
                             "canceled",
                             ifelse(arr_delay <= 0,
                                    "on time",
                                    "late"))) %>%
  select(month, day, arr_delay, origin, dest, arr_status) %>%
  filter(row_number() >= 611)

flights_w_arr_status2
## # A tibble: 336,166 x 6
##    month   day arr_delay origin dest  arr_status
##    <int> <int>     <dbl> <chr>  <chr> <chr>     
##  1     1     1        46 JFK    RSW   late      
##  2     1     1       -12 EWR    PBI   on time   
##  3     1     1         3 LGA    DTW   late      
##  4     1     1       -21 JFK    PDX   on time   
##  5     1     1        68 LGA    MIA   late      
##  6     1     1        NA LGA    XNA   canceled  
##  7     1     1       -24 LGA    MCO   on time   
##  8     1     1        66 JFK    RDU   late      
##  9     1     1        -2 EWR    MCO   on time   
## 10     1     1        27 JFK    TPA   late      
## # ... with 336,156 more rows
## # i Use `print(n = ...)` to see more rows

There are some problems with this, the most obvious of which is that nested ifelse statements are hard to read, especially when there are a lot of categories. (The above only has three categories and is already very cluttered with parentheses.) Another problem is that nested ifelse statements depend on the order in which the conditions are checked. One of the exercises for this section will illustrate what happens when the NA condition above is not checked first.

There is a much better option for categorizations with more than two categories: case_when. Here’s how it works:

flights_w_arr_status3 <- flights %>%
  mutate(arr_status = case_when(is.na(arr_delay) ~ "canceled",
                                arr_delay <= 0 ~ "on time",
                                arr_delay >0 ~ "late")) %>%
  select(month, day, arr_delay, origin, dest, arr_status) %>%
  filter(row_number() >= 611)

flights_w_arr_status3
## # A tibble: 336,166 x 6
##    month   day arr_delay origin dest  arr_status
##    <int> <int>     <dbl> <chr>  <chr> <chr>     
##  1     1     1        46 JFK    RSW   late      
##  2     1     1       -12 EWR    PBI   on time   
##  3     1     1         3 LGA    DTW   late      
##  4     1     1       -21 JFK    PDX   on time   
##  5     1     1        68 LGA    MIA   late      
##  6     1     1        NA LGA    XNA   canceled  
##  7     1     1       -24 LGA    MCO   on time   
##  8     1     1        66 JFK    RDU   late      
##  9     1     1        -2 EWR    MCO   on time   
## 10     1     1        27 JFK    TPA   late      
## # ... with 336,156 more rows
## # i Use `print(n = ...)` to see more rows

Not only is this much clearer code, it’s more robust since it does not depend on the order in which we list the conditions, as we’ll see in the exercises.

In the case_when statement above, the third condition arr_delay > 0 is the only option left after the first two. If we don’t want to specify the last remaining condition explicitly but instead just let it catch all the remaining cases, we can just enter TRUE for this condition. The following case_when is thus equivalent to the one above:

case_when(is.na(arr_delay) ~ "canceled",
          arr_delay <= 0 ~ "on time",
          TRUE ~ "late")

2.5.1 Exercises

These exercise require the nycflights13 and Lahman libraries.

  1. Find the fastest and slowest flights (in terms of air speed). State the actual average air speeds in miles per hour.

  2. Use the variables dep_mins_midnight and arr_mins_midnight to calculate the number of minutes from dep_time to arr_time for each flight. Compare your answers to the air_time values. Why don’t they match?

  3. Add a column to flights that indicates whether a flight was canceled. The new column’s values should be either canceled or not canceled. Then find the first date on which a flight was canceled. (Have R do this for you; don’t just scroll through the data set.)

  4. Count the number of NA cells in the arr_status variable in flights_w_arr_status we created above. (Have R do this; don’t manually count them yourself.) Why did these flights not have a value for arr_status? (Hint: What happens when you ask R whether NA <= 0?)

  5. In the nested ifelse we used to create flights_w_arr_status2 above, rearrange the conditions so that is.na(arr_delay) is not the first condition checked. Then find a row in which the arr_delay value is NA and check whether it assigned the correct value of arr_status. If it didn’t, try to explain why.

  6. Re-do the previous exercise for the case_when used to create flights_w_arr_status3. Does it handle the NAs in arr_delay correctly? Use your answer to explain why case_when is better than a nested ifelse.

  7. Add a column to flights that gives the season for each flight. For purposes of this exercise, let’s say “winter” is January, February, and December, “spring” is March, April, and May, “summer” is June, July, and August, and “fall” is September, October, and November.

  8. Categorize the observations in diamonds as “expensive,” “very expensive,” “insanely expensive,” or “priceless.” Decide for yourself how to define these labels. Then create a visualization that shows the distribution of these labels.

  9. View the Batting data set from the Lahman library and read its documentation. How many observations and variables does it have?

  10. Batting does not have a column for a player’s batting average, which measures the percentage of a player’s at bats that result in a hit. The formula is H / AB. Create a batting average column in Batting and find the player(s) with the all-time highest single season batting average. What’s strange about your result? (You can just refer to a player by their playerID value if you don’t know their real name.)

  11. To fix the problem from the previous exercise, a minimum number of at bats are required for a batting average to be considered “official.” Re-do Exercise 6, but only rank the players who had at least 400 at bats.

  12. The modern era of baseball is generally considered to have begun in 1947 (after World War II ended). Who’s had the best single season batting average of the modern era (assuming at least 400 at bats)? Answer this question entirely by using transformation functions rather than by just scrolling through the data.

  13. When was the last year a player had a batting average of at least 0.400 with a minimum of 400 at bats? (Again, answer this question using only transformation functions.)

2.6 Grouped Summaries

Suppose we want to create a data table that ranks the manufacturers in mpg according to the average highway gas mileage for each manufacturer. This would entail grouping the cars by the manufacturer values and then computing the average value of hwy for each group. This is an example of a grouped summary and is done with a combination of the group_by and summarize functions. Here’s how it’s done:

mpg %>%
  group_by(manufacturer) %>%
  summarize(average_highway_mileage = mean(hwy)) %>%
  arrange(desc(average_highway_mileage))
## # A tibble: 15 x 2
##    manufacturer average_highway_mileage
##    <chr>                          <dbl>
##  1 honda                           32.6
##  2 volkswagen                      29.2
##  3 hyundai                         26.9
##  4 audi                            26.4
##  5 pontiac                         26.4
##  6 subaru                          25.6
##  7 toyota                          24.9
##  8 nissan                          24.6
##  9 chevrolet                       21.9
## 10 ford                            19.4
## 11 mercury                         18  
## 12 dodge                           17.9
## 13 jeep                            17.6
## 14 lincoln                         17  
## 15 land rover                      16.5

A couple of notes:

  • group_by and summarize are always used together.

  • summarize works similarly to mutate in that it creates a new column with a name that the user provides (average_highway_mileage above).

  • mean is a built-in function that computes the average hwy value in each group.

  • The displayed data table will only show the grouped column and any computed summary columns.

You can group by more than one variable as well. Suppose we want to rank highway gas mileage by manufacturer and model:

mpg %>%
  group_by(manufacturer, model) %>%
  summarize(average_highway_mileage = mean(hwy)) %>%
  arrange(desc(average_highway_mileage))
## # A tibble: 38 x 3
## # Groups:   manufacturer [15]
##    manufacturer model        average_highway_mileage
##    <chr>        <chr>                          <dbl>
##  1 toyota       corolla                         34  
##  2 volkswagen   new beetle                      32.8
##  3 honda        civic                           32.6
##  4 volkswagen   jetta                           29.1
##  5 nissan       altima                          28.7
##  6 audi         a4                              28.3
##  7 toyota       camry                           28.3
##  8 toyota       camry solara                    28.1
##  9 hyundai      sonata                          27.7
## 10 chevrolet    malibu                          27.6
## # ... with 28 more rows
## # i Use `print(n = ...)` to see more rows

mean is just one example of a summary statistic that can be used inside summarize. Here are a few more common ones:

statistic syntax use
mean mean(<VARIABLE>) computes the average value of <VARIABLE> within each group
sum sum(<VARIABLE>) adds the values within each group
count n() counts the number of observations within each group
distinct count n_distinct(<VARIABLE>) counts the number of distinct values of <VARIABLE> within each group
maximum max(<VARIABLE>) returns the largest value of <VARIABLE> within each group
minimum min(<VARIABLE>) returns the smallest value of <VARIABLE> within each group
standard deviation sd(<VARIABLE>) computes the standard deviation of <VARIABLE> within each group
median median(<VARIABLE>) computes the median value of <VARIABLE> within each group

There are many others as well. It’s certainly possible to compute more than one summary statistic within a single summarize call. In fact, whenever you’re computing a mean value, it’s a good idea to also compute the count. This is because comparing averages is really only appropriate when the groups have roughly the same number of observations. If we do this for the average highway mileage example above, we get:

mpg %>%
  group_by(manufacturer) %>%
  summarize(average_highway_mileage = mean(hwy),
            count = n()) %>%
  arrange(desc(average_highway_mileage))
## # A tibble: 15 x 3
##    manufacturer average_highway_mileage count
##    <chr>                          <dbl> <int>
##  1 honda                           32.6     9
##  2 volkswagen                      29.2    27
##  3 hyundai                         26.9    14
##  4 audi                            26.4    18
##  5 pontiac                         26.4     5
##  6 subaru                          25.6    14
##  7 toyota                          24.9    34
##  8 nissan                          24.6    13
##  9 chevrolet                       21.9    19
## 10 ford                            19.4    25
## 11 mercury                         18       4
## 12 dodge                           17.9    37
## 13 jeep                            17.6     8
## 14 lincoln                         17       3
## 15 land rover                      16.5     4

We would therefore want to refrain from declaring that Pontiacs have better fuel efficiencies than Toyotas since Pontiacs are relatively underrepresented in the data set.


Referring to the flights data set, suppose we want to know which destination airport has the worst arrival delays. We could group flights by the dest variable and then compute the average value of arr_delay for each group. Following the above advice, we’ll also compute the count for each group.

flights %>%
  group_by(dest) %>%
  summarize(avg_arr_delay = mean(arr_delay),
            count = n()) %>%
  arrange(desc(avg_arr_delay))
## # A tibble: 105 x 3
##    dest  avg_arr_delay count
##    <chr>         <dbl> <int>
##  1 SBN            6.5     10
##  2 EYW            6.35    17
##  3 ABQ            4.38   254
##  4 ANC           -2.5      8
##  5 LEX          -22        1
##  6 ACK           NA      265
##  7 ALB           NA      439
##  8 ATL           NA    17215
##  9 AUS           NA     2439
## 10 AVL           NA      275
## # ... with 95 more rows
## # i Use `print(n = ...)` to see more rows

What’s the deal with all the NAs in the avg_arr_delay column? Remember that when a flight is canceled, there is no arrival and hence to arrival delay. Thus, canceled flights have NA in the arr_delay column. So in any dest group which contains at least one canceled flight, the mean function will be attempting to calculate an average value when not all of the values are available. This means that the average itself will be reported as NA.

Luckily, there’s an easy, built-in way to avoid this. We can remove the NAs from the mean computation by inserting the argument na.rm = TRUE into mean:

flights %>%
  group_by(dest) %>%
  summarize(avg_arr_delay = mean(arr_delay, na.rm = TRUE),
            count = n()) %>%
  arrange(desc(avg_arr_delay))
## # A tibble: 105 x 3
##    dest  avg_arr_delay count
##    <chr>         <dbl> <int>
##  1 CAE            41.8   116
##  2 TUL            33.7   315
##  3 OKC            30.6   346
##  4 JAC            28.1    25
##  5 TYS            24.1   631
##  6 MSN            20.2   572
##  7 RIC            20.1  2454
##  8 CAK            19.7   864
##  9 DSM            19.0   569
## 10 GRR            18.2   765
## # ... with 95 more rows
## # i Use `print(n = ...)` to see more rows

It looks like CAE (Columbia Metropolitan Airport) has the highest average arrival delay, although there were only 116 flights there. On the other hand, RIC (Richmond International Airport) maintained a high average arrival delay over almost 2500 flights. RIC might get my vote for the worst airport to fly into from New York.

If you look at the end of the list, you see that, on average, flights to Blue Grass Airport in Lexington, KY, arrived an average of 22 minutes early! This seems amazing until you notice that there was only one flight there.

## # A tibble: 6 x 3
##   dest  avg_arr_delay count
##   <chr>         <dbl> <int>
## 1 ANC           -2.5      8
## 2 STT           -3.84   522
## 3 SNA           -7.87   825
## 4 PSP          -12.7     19
## 5 LEX          -22        1
## 6 LGA          NaN        1

Last on the list is LGA, with an average arrival delay value of NaN. This stands for “not a number.” If we extract out this observation, we might be able to see why:

flights %>%
  filter(dest == "LGA") %>%
  select(month, day, dest, dep_delay, arr_delay)
## # A tibble: 1 x 5
##   month   day dest  dep_delay arr_delay
##   <int> <int> <chr>     <dbl>     <dbl>
## 1     7    27 LGA          NA        NA

It looks like there was only one flight to LGA, and it was canceled, giving us an NA in the arr_delay column. When we then compute the average arrival delay and use na.rm = TRUE to remove the NA, we’re left with no numbers to take the average of! This would lead to a division by 0, which is not arithmetically possible. Hence the NaN.

2.6.1 Exercises

These exercises will require both the nycflights13 and Lahman libraries.

  1. Create a data set from flights that shows the average departure delay on each day of the year.

  2. Which carrier from flights had the best on-time arrival rate in 2013? Which had the worst? (Be sure to include counts so we know whether the rates are comparable.)

  3. Look at each destination in flights. Can you find any flights that were suspiciously fast, possibly indicating a data entry error?

  4. Which carrier in flights flew to the most distinct destinations?

  5. Which specific plane (identified by tailnum) had the most variation in the distances it flew? (Be sure to include counts in your analysis.)

  6. Which day of the year had the most cancellations? Which day had the least?

  7. Using the Batting data set from Lahman, determine the maximum number of home runs hit by a player every year since 1871.

  8. Create a visualization of your findings from the previous exercise.

  9. Create a data set that shows a ranked list of the all-time leaders in career batting average. (A player’s career batting average is his total number of hits divided by his total number of at bats.) Note that to be eligible for this list, a player must have at least 3000 total at bats. Be sure to take this into account when you create your list.

2.7 Project

Project Description: The purpose of this project is to explore and explain observations in the flights data set by transforming and visualizing the data.

Instructions:

Answer the following questions using the appropriate transformation functions and visualizations.

  1. Create a data set that lists the days during 2013 on which the percentage of flights that were either canceled or delayed on departure by more than one hour exceeded 35%.

  2. Do a Google search to see what the weather was like in New York on each of the days from the previous problem in 2013. Do your findings explain the large number of delays/cancellations?

  3. Create a data set that shows, for each scheduled departure hour during the day, the percentage of flights that were canceled or delayed by more than one hour.

  4. Obtain a plot that shows the relationship between the hour and the percentage of flights canceled or delayed by more than an hour.

  5. Explain the relationship in the previous problem. In particular:

    1. Why does the plot have the given shape?
    2. Are there outliers? If so, why might they be there?
    3. When is the riskiest time of day to fly if you want to avoid cancellations and long delays, and why might that be?
  6. Above, you found dates on which a lot of cancellations and long delays occurred. However, several flights were able to leave on time or early on those dates. Create a data set that contains the flights that left on time or early on those dates.

  7. It’s a reasonable guess that the flights that were able to leave early despite bad weather were probably the ones that left in the morning before the weather hit. For each date in the data set from the previous problem, find the average scheduled hour of departure. Do your findings mostly confirm the guess?

  8. Did any of the problematic dates have an average departure hour in the afternoon? If so, why might that have been?

Guidelines:

  • Include a title, your name, and the date in the heading.

  • Also include a preliminary code chunk in which you load the libraries you’ll need, and briefly say what each one is for.

  • Begin with an introductory paragraph containing, at least, a description of the data set (including what it contains, its size, and its source) and a nicely displayed data table using the datatable function. (If the data set is too big to display when you try to knit your .Rmd file, you don’t have to display it.)

  • Clearly describe what you’ll be doing with the data, and include any questions you’ll be trying to answer.

  • Follow the introduction and problem statement with the body of your report which will contain your work, broken into relevant section headings.

  • The compiled report should show all of your R code and its output, but it should not show any warning or error messages. (This is for my benefit. A professional report might not include the code, depending on the audience.)

  • The body should also include text that provides a running narrative that guides the reader through your work. This should be thorough and complete, but it should avoid large blocks of text.

  • All graphics should look nice, be easy to read, and be fully labelled.

  • All data tables should be displayed with the datatable function.

  • You should include insightful statements about the meaning you’re extracting from your graphics.

  • End with an overall concluding statement which summarizes your findings.

Grading Rubric:

  • Transformations: Do you use the right transformation functions to answer the questions? (30 points)

  • Insights: Are insights fully explained and well-written? Are the insights derived from your results? (30 points)

  • Narrative: Is it clear what you’re trying to do in this project? Do you maintain a readable narrative throughout that guides the reader through your analysis? (20 points)

  • Professionalism: Does your report look nice? Do you provide insights based on your analysis? Is your code clear and readable? (15 points)