Chapter 2 Data Transformation

In this chapter, we will use a set of functions from the dplyr package (which is part of tidyverse) to transform a data set in various ways. In particular, we will see how to filter out observations, sort the rows of a data set, remove unnecessary variables, create new variables from old ones, and group observations together. We’ll also introduce a tidyverse construct called “the pipe” which provides a streamlined way to perform several transformations in succession.

We’ll need to install two new packages1:

install.packages("nycflights13")
install.packages("Lahman")

The first package contains a data set with records from all of the flights that originated from a major New York City airport in 2013. The second contains Major League Baseball data sets dating back to 1871. Be sure to load these libraries, along with tidyverse.

library(tidyverse)
library(nycflights13)
library(Lahman)

2.1 Filtering a Data Set

Often, we will want to filter a data set so that we only see the observations that satisfy a certain condition. For example, suppose we want to see only the cars in the mpg data set that have at least 6-cylinder engines. The code for this is:

filter(mpg, cyl >= 6)
## # A tibble: 149 x 11
##    manufacturer model      displ  year   cyl trans      drv     cty   hwy fl    class  
##    <chr>        <chr>      <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr>  
##  1 audi         a4           2.8  1999     6 auto(l5)   f        16    26 p     compact
##  2 audi         a4           2.8  1999     6 manual(m5) f        18    26 p     compact
##  3 audi         a4           3.1  2008     6 auto(av)   f        18    27 p     compact
##  4 audi         a4 quattro   2.8  1999     6 auto(l5)   4        15    25 p     compact
##  5 audi         a4 quattro   2.8  1999     6 manual(m5) 4        17    25 p     compact
##  6 audi         a4 quattro   3.1  2008     6 auto(s6)   4        17    25 p     compact
##  7 audi         a4 quattro   3.1  2008     6 manual(m6) 4        15    25 p     compact
##  8 audi         a6 quattro   2.8  1999     6 auto(l5)   4        15    24 p     midsize
##  9 audi         a6 quattro   3.1  2008     6 auto(s6)   4        17    25 p     midsize
## 10 audi         a6 quattro   4.2  2008     8 auto(s6)   4        16    23 p     midsize
## # i 139 more rows

You can see the full filtered data set by assigning it a name and then viewing it:

cyl6 <- filter(mpg, cyl >= 6)
View(cyl6)

In general, the syntax for the filter function is:


filter(<DATA SET>, <CONDITIONS>)

The condition arguments in filter are logical statements, meaning they are statements whose values are either TRUE or FALSE. Notice that the filter function keeps the observations for which the conditions have a value of TRUE and discards those with a value of FALSE. (In other words, we specify the observations we want to keep, not the ones we want to get rid of.)

Logical statements are often expressed as equations or inequalities. For example, suppose we make the assignments x <- 7 and y <- 9. The following table lists various ways to construct logical statements in terms of these two variables.

relation statement value
less than x < y TRUE
greater than x > y FALSE
less than or equal to x+3 <= y FALSE
greater than or equal to x >= y-2 TRUE
equal to 3*x+5 == 4*y-10 TRUE

Notice that the double equal sign == is used to check equality. You can also negate statements or form compound statements. For example:

relation statement value
not equal to x != y TRUE
and (x < y) & (x+2 == y) TRUE
(x < y) & (x != x) FALSE
or (x == 8) | (y == 9) TRUE
(x >= y) | (y < 0) FALSE

Notice that an “and” statement is only true when both of the conditions are true. An “or” statement is true as long as at least one of the conditions is true.

The values TRUE and FALSE can actually be handled arithmetically in R. The value TRUE is given the numerical value 1, and FALSE is given 0. For example:

TRUE + TRUE
## [1] 2
FALSE * 28
## [1] 0

This is useful when you want to count how many values in a column satisfy a certain condition. For example, suppose we want to know how many cars in mpg have 4-wheel drive. In the following code chunk, mpg$drv refers to the drv column within the mpg data set.

sum(mpg$drv == "4")
## [1] 103

The condition mpg$drv == "4" assigns a value of TRUE or FALSE to all of the values in the drv column of mpg depending on whether the drv value is “4” or not. (Since drv is a non-numeric variable, we need quotes around its values, even though 4 itself is actually a number.) The sum function then adds these TRUE/FALSE values, treating TRUE as 1 and FALSE as 0. Therefore, we just found that there are 103 cars with 4-wheel drive in mpg.

We can also use the mean function to calculate the percentage of values in a column that satisfy a condition. For example, the percentage of 4-wheel drive cars in mpg is calculated by:

mean(mpg$drv == "4")
## [1] 0.4401709

The reason this works is that mean finds the average of all of the 1’s and 0’s assigned to the values of the drv column by the statement mpg$drv == "4". In doing so, it finds the sum of these 1’s and 0’s (which results in the number of 4-wheel drive cars) and then divides by the total number of 1’s and 0’s. Thus, we’re dividing the number of 4-wheel drive cars by the total number of cars, resulting in the percentage of 4-wheel drives.


It’s often the case that a data set contains missing values for various reasons. In R, these missing values are indicated with the letters NA, which stand for “not available.” For example, examine the built-in msleep data set, which contains data relating to the sleep habits of various mammals.

msleep

Missing data is a problem for data analysts, and we’ll eventually talk about different methods for dealing with it. One method, though, is to filter out the observations that are missing values. Suppose, for example, that we want to filter out the observations in msleep whose conservation status is missing. Here’s the syntax:

filter(msleep, !is.na(conservation))
## # A tibble: 54 x 11
##    name                  genus vore  order conservation sleep_total sleep_rem sleep_cycle awake  brainwt  bodywt
##    <chr>                 <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>    <dbl>   <dbl>
##  1 Cheetah               Acin~ carni Carn~ lc                  12.1      NA        NA      11.9 NA        50    
##  2 Mountain beaver       Aplo~ herbi Rode~ nt                  14.4       2.4      NA       9.6 NA         1.35 
##  3 Greater short-tailed~ Blar~ omni  Sori~ lc                  14.9       2.3       0.133   9.1  0.00029   0.019
##  4 Cow                   Bos   herbi Arti~ domesticated         4         0.7       0.667  20    0.423   600    
##  5 Northern fur seal     Call~ carni Carn~ vu                   8.7       1.4       0.383  15.3 NA        20.5  
##  6 Dog                   Canis carni Carn~ domesticated        10.1       2.9       0.333  13.9  0.07     14    
##  7 Roe deer              Capr~ herbi Arti~ lc                   3        NA        NA      21    0.0982   14.8  
##  8 Goat                  Capri herbi Arti~ lc                   5.3       0.6      NA      18.7  0.115    33.5  
##  9 Guinea pig            Cavis herbi Rode~ domesticated         9.4       0.8       0.217  14.6  0.0055    0.728
## 10 Grivet                Cerc~ omni  Prim~ lc                  10         0.7      NA      14   NA         4.75 
## # i 44 more rows

Let’s examine the condition !is.na(conservation) more closely. First, notice that the way to check whether a value is missing is not to enter conservation == NA. The reason is that NA is an indication that a value is not available; it’s not a value itself. The code conservation == NA would thus be the meaningless statement that says the value of conservation is valueless. For this reason, the code is.na(conservation) has to be used instead. Recalling that the ! preceding a condition negates the condition, we then see that !is.na(conservation) is the condition that says the value of conservation is not NA. The above filter call thus returns the observations which do not have an NA in the conservation column.

Exercises

To complete the exercises below, download this R Notebook file and open it in RStudio using the Visual Editor. Then enter your answers in the space provided.

These exercises require the nycflights13 package you installed earlier. Be sure the library is loaded.

  1. The nycflights13 package contains a data set called flights. Load this data set and read its documentation: ?flights. How many observations does it have? How many variables?

  2. Use filter to find all of the flights that

    1. departed in February.
    2. were operated by United or American Airlines.
    3. departed in summer (June, July, and August).
    4. arrived more than two hours late, but did not leave late.
    5. were delayed by more than an hour, but made up over 30 minutes during the flight.
    6. departed between midnight and 6am.
  3. How many flights were canceled? (Think about how a canceled flight might be detected from the data set.)

  4. An airline’s on-time arrival rate is the percentage of its flights that arrived on time or early. What was the on-time arrival rate for Delta Airlines during 2013? What was it during the winter months (January, February, and December)? A flight that did not arrive, due to a cancellation, crash, emergency landing, etc, should not figure into the on-time arrival rate.

2.2 Sorting a Data Set

The function used to sort a data set is arrange. For example, suppose we want to sort the observations in mpg according to the highway gas mileage:

arrange(mpg, hwy)
## # A tibble: 234 x 11
##    manufacturer model               displ  year   cyl trans      drv     cty   hwy fl    class 
##    <chr>        <chr>               <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
##  1 dodge        dakota pickup 4wd     4.7  2008     8 auto(l5)   4         9    12 e     pickup
##  2 dodge        durango 4wd           4.7  2008     8 auto(l5)   4         9    12 e     suv   
##  3 dodge        ram 1500 pickup 4wd   4.7  2008     8 auto(l5)   4         9    12 e     pickup
##  4 dodge        ram 1500 pickup 4wd   4.7  2008     8 manual(m6) 4         9    12 e     pickup
##  5 jeep         grand cherokee 4wd    4.7  2008     8 auto(l5)   4         9    12 e     suv   
##  6 chevrolet    k1500 tahoe 4wd       5.3  2008     8 auto(l4)   4        11    14 e     suv   
##  7 jeep         grand cherokee 4wd    6.1  2008     8 auto(l5)   4        11    14 p     suv   
##  8 chevrolet    c1500 suburban 2wd    5.3  2008     8 auto(l4)   r        11    15 e     suv   
##  9 chevrolet    k1500 tahoe 4wd       5.7  1999     8 auto(l4)   4        11    15 r     suv   
## 10 dodge        dakota pickup 4wd     5.2  1999     8 auto(l4)   4        11    15 r     pickup
## # i 224 more rows

You can see that by default, arrange sorts the observations in ascending order. To sort in descending order, do the following:

arrange(mpg, desc(hwy))
## # A tibble: 234 x 11
##    manufacturer model      displ  year   cyl trans      drv     cty   hwy fl    class     
##    <chr>        <chr>      <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr>     
##  1 volkswagen   jetta        1.9  1999     4 manual(m5) f        33    44 d     compact   
##  2 volkswagen   new beetle   1.9  1999     4 manual(m5) f        35    44 d     subcompact
##  3 volkswagen   new beetle   1.9  1999     4 auto(l4)   f        29    41 d     subcompact
##  4 toyota       corolla      1.8  2008     4 manual(m5) f        28    37 r     compact   
##  5 honda        civic        1.8  2008     4 auto(l5)   f        25    36 r     subcompact
##  6 honda        civic        1.8  2008     4 auto(l5)   f        24    36 c     subcompact
##  7 toyota       corolla      1.8  1999     4 manual(m5) f        26    35 r     compact   
##  8 toyota       corolla      1.8  2008     4 auto(l4)   f        26    35 r     compact   
##  9 honda        civic        1.8  2008     4 manual(m5) f        26    34 r     subcompact
## 10 honda        civic        1.6  1999     4 manual(m5) f        28    33 r     subcompact
## # i 224 more rows

You can also include a “tie-breaker” variable in arrange. Notice that the Volkswagen Jetta and the Volkswagen New Beetle manual tie for the best highway mileage. Let’s use city mileage (in descending order) as the tie-breaker. You just have to add the tie-breaker variable as a third argument inside arrange:

arrange(mpg, desc(hwy), desc(cty))
## # A tibble: 234 x 11
##    manufacturer model      displ  year   cyl trans      drv     cty   hwy fl    class     
##    <chr>        <chr>      <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr>     
##  1 volkswagen   new beetle   1.9  1999     4 manual(m5) f        35    44 d     subcompact
##  2 volkswagen   jetta        1.9  1999     4 manual(m5) f        33    44 d     compact   
##  3 volkswagen   new beetle   1.9  1999     4 auto(l4)   f        29    41 d     subcompact
##  4 toyota       corolla      1.8  2008     4 manual(m5) f        28    37 r     compact   
##  5 honda        civic        1.8  2008     4 auto(l5)   f        25    36 r     subcompact
##  6 honda        civic        1.8  2008     4 auto(l5)   f        24    36 c     subcompact
##  7 toyota       corolla      1.8  1999     4 manual(m5) f        26    35 r     compact   
##  8 toyota       corolla      1.8  2008     4 auto(l4)   f        26    35 r     compact   
##  9 honda        civic        1.8  2008     4 manual(m5) f        26    34 r     subcompact
## 10 honda        civic        1.6  1999     4 manual(m5) f        28    33 r     subcompact
## # i 224 more rows

We can also sort by a function of a variable or variables. For example, suppose we want to sort the mpg data according to the difference between the highway and city gas mileages (in descending order). We just have to use hwy - cty as the sort variable:

arrange(mpg, desc(hwy - cty))
## # A tibble: 234 x 11
##    manufacturer model      displ  year   cyl trans      drv     cty   hwy fl    class     
##    <chr>        <chr>      <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr>     
##  1 honda        civic        1.8  2008     4 auto(l5)   f        24    36 c     subcompact
##  2 volkswagen   new beetle   1.9  1999     4 auto(l4)   f        29    41 d     subcompact
##  3 audi         a4           1.8  1999     4 auto(l5)   f        18    29 p     compact   
##  4 audi         a4           2    2008     4 manual(m6) f        20    31 p     compact   
##  5 chevrolet    malibu       3.5  2008     6 auto(l4)   f        18    29 r     midsize   
##  6 honda        civic        1.8  2008     4 auto(l5)   f        25    36 r     subcompact
##  7 volkswagen   jetta        1.9  1999     4 manual(m5) f        33    44 d     compact   
##  8 volkswagen   passat       1.8  1999     4 auto(l5)   f        18    29 p     midsize   
##  9 audi         a4           2.8  1999     6 auto(l5)   f        16    26 p     compact   
## 10 audi         a4 quattro   2.8  1999     6 auto(l5)   4        15    25 p     compact   
## # i 224 more rows

The 2008 Honda Civic and 1999 Volkswagen New Beetle thus have the largest difference between city and highway gas mileage.

Notice in the previous sort that there is no column that actually lists the highway/city mileage differences. A better practice would be to create a new variable that records these values and then to sort by that new variable. We’ll see how to create new variables in Section 2.5.

Exercises

To complete the exercises below, download this R Notebook file and open it in RStudio using the Visual Editor. Then enter your answers in the space provided.

These exercises require the flights data set from the nycflights13 library.

  1. Where does arrange sort the NA values in a column? (Experiment with a data set that has missing values.) How could you force arrange to sort all of the NA values to the top of the list? (Try using is.na.)

  2. What was the longest delay of any flight?

  3. What was the earliest departure time of any flight?

  4. What flight averaged the fastest speed while in the air? (Average speed is the total distance traveled divided by the total time spent in the air.) What flight averaged the slowest speed?

  5. What flight traveled the farthest distance? Which one traveled the shortest distance?

2.3 The Pipe

We will often need to use two or more transformation functions in succession. For example, suppose we want to filter out the cars in mpg that get less than 20 miles per gallon on the highway and then sort the remaining cars according to engine size (displ) in descending order. There are a couple of naive ways to do this. One is:

arrange(filter(mpg, hwy >= 20), desc(displ))
## # A tibble: 156 x 11
##    manufacturer model              displ  year   cyl trans      drv     cty   hwy fl    class     
##    <chr>        <chr>              <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr>     
##  1 chevrolet    corvette             7    2008     8 manual(m6) r        15    24 p     2seater   
##  2 chevrolet    corvette             6.2  2008     8 manual(m6) r        16    26 p     2seater   
##  3 chevrolet    corvette             6.2  2008     8 auto(s6)   r        15    25 p     2seater   
##  4 chevrolet    corvette             5.7  1999     8 manual(m6) r        16    26 p     2seater   
##  5 chevrolet    corvette             5.7  1999     8 auto(l4)   r        15    23 p     2seater   
##  6 ford         mustang              5.4  2008     8 manual(m6) r        14    20 p     subcompact
##  7 chevrolet    c1500 suburban 2wd   5.3  2008     8 auto(l4)   r        14    20 r     suv       
##  8 chevrolet    c1500 suburban 2wd   5.3  2008     8 auto(l4)   r        14    20 r     suv       
##  9 pontiac      grand prix           5.3  2008     8 auto(s4)   f        16    25 p     midsize   
## 10 ford         mustang              4.6  1999     8 auto(l4)   r        15    21 r     subcompact
## # i 146 more rows

Nesting functions in this way is often not a good practice because it becomes hard to keep track of parentheses, especially when you nest several functions. A slightly better (but still not ideal) method is to perform one transformation at a time, storing the intermediate results along the way:

filtered_mpg <- filter(mpg, hwy >= 20)
arrange(filtered_mpg, desc(displ))
## # A tibble: 156 x 11
##    manufacturer model              displ  year   cyl trans      drv     cty   hwy fl    class     
##    <chr>        <chr>              <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr>     
##  1 chevrolet    corvette             7    2008     8 manual(m6) r        15    24 p     2seater   
##  2 chevrolet    corvette             6.2  2008     8 manual(m6) r        16    26 p     2seater   
##  3 chevrolet    corvette             6.2  2008     8 auto(s6)   r        15    25 p     2seater   
##  4 chevrolet    corvette             5.7  1999     8 manual(m6) r        16    26 p     2seater   
##  5 chevrolet    corvette             5.7  1999     8 auto(l4)   r        15    23 p     2seater   
##  6 ford         mustang              5.4  2008     8 manual(m6) r        14    20 p     subcompact
##  7 chevrolet    c1500 suburban 2wd   5.3  2008     8 auto(l4)   r        14    20 r     suv       
##  8 chevrolet    c1500 suburban 2wd   5.3  2008     8 auto(l4)   r        14    20 r     suv       
##  9 pontiac      grand prix           5.3  2008     8 auto(s4)   f        16    25 p     midsize   
## 10 ford         mustang              4.6  1999     8 auto(l4)   r        15    21 r     subcompact
## # i 146 more rows

This approach avoids nesting, but it introduces a redundancy: the name filtered_mpg is introduced and then immediately used as an argument inside arrange. It would be much better to combine these steps. Luckily, tidyverse has a way to do this using a construct called the pipe.

The pipe provides a way to bypass the need to name an intermediate data set, only to immediately feed it into the next transformation function in line. Here’s how the above double transformation is accomplished using the pipe:

mpg %>%
  filter(hwy >= 20) %>%
  arrange(desc(displ))
## # A tibble: 156 x 11
##    manufacturer model              displ  year   cyl trans      drv     cty   hwy fl    class     
##    <chr>        <chr>              <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr>     
##  1 chevrolet    corvette             7    2008     8 manual(m6) r        15    24 p     2seater   
##  2 chevrolet    corvette             6.2  2008     8 manual(m6) r        16    26 p     2seater   
##  3 chevrolet    corvette             6.2  2008     8 auto(s6)   r        15    25 p     2seater   
##  4 chevrolet    corvette             5.7  1999     8 manual(m6) r        16    26 p     2seater   
##  5 chevrolet    corvette             5.7  1999     8 auto(l4)   r        15    23 p     2seater   
##  6 ford         mustang              5.4  2008     8 manual(m6) r        14    20 p     subcompact
##  7 chevrolet    c1500 suburban 2wd   5.3  2008     8 auto(l4)   r        14    20 r     suv       
##  8 chevrolet    c1500 suburban 2wd   5.3  2008     8 auto(l4)   r        14    20 r     suv       
##  9 pontiac      grand prix           5.3  2008     8 auto(s4)   f        16    25 p     midsize   
## 10 ford         mustang              4.6  1999     8 auto(l4)   r        15    21 r     subcompact
## # i 146 more rows

We should read this as, “Start with mpg, feed it into filter, then feed the output of filter into arrange.” We avoid nesting and introducing intermediate data set names. The operator %>% takes the data set that precedes it and feeds it into the function that follows it.

The pipe leads to much clearer and readable code, and we’ll use it constantly.

Exercises

To complete the exercises below, download this R Notebook file and open it in RStudio using the Visual Editor. Then enter your answers in the space provided.

Use the pipe to answer the following questions, which require the nycflights13 package.

  1. What is the farthest airport flown to by American Airlines?

  2. What is the farthest airport flown to out of LaGuardia (LGA) by American Airlines?

  3. What was the most delayed flight during the winter (January, February, and December)? What was the most delayed flight during the summer (June, July, and August)?

2.4 Selecting Columns

A very common procedure when analyzing a data set is to get rid of columns that are irrelevant to your analysis. This is accomplished using the select function. For example, suppose we want to pare down the mpg data set so that the only variables displayed are manufacturer, model, displ, and hwy. The way to do this is to use the select function:

select(mpg, manufacturer, model, displ, hwy)
## # A tibble: 234 x 4
##    manufacturer model      displ   hwy
##    <chr>        <chr>      <dbl> <int>
##  1 audi         a4           1.8    29
##  2 audi         a4           1.8    29
##  3 audi         a4           2      31
##  4 audi         a4           2      30
##  5 audi         a4           2.8    26
##  6 audi         a4           2.8    26
##  7 audi         a4           3.1    27
##  8 audi         a4 quattro   1.8    26
##  9 audi         a4 quattro   1.8    25
## 10 audi         a4 quattro   2      28
## # i 224 more rows

There are a lot of built-in “helper” functions available within select. You can read the documentation (?select) to see them, but there are a few particularly useful ones. For one, it’s often easier to specify the columns you don’t want. For example, suppose we want to keep every column of mpg except trans and drv:

select(mpg, -trans, -drv)
## # A tibble: 234 x 9
##    manufacturer model      displ  year   cyl   cty   hwy fl    class  
##    <chr>        <chr>      <dbl> <int> <int> <int> <int> <chr> <chr>  
##  1 audi         a4           1.8  1999     4    18    29 p     compact
##  2 audi         a4           1.8  1999     4    21    29 p     compact
##  3 audi         a4           2    2008     4    20    31 p     compact
##  4 audi         a4           2    2008     4    21    30 p     compact
##  5 audi         a4           2.8  1999     6    16    26 p     compact
##  6 audi         a4           2.8  1999     6    18    26 p     compact
##  7 audi         a4           3.1  2008     6    18    27 p     compact
##  8 audi         a4 quattro   1.8  1999     4    18    26 p     compact
##  9 audi         a4 quattro   1.8  1999     4    16    25 p     compact
## 10 audi         a4 quattro   2    2008     4    20    28 p     compact
## # i 224 more rows

It’s also helpful to specify a set of columns to keep that appear consecutively. For example, suppose we want to keep manufacturer, model, displ, year, and cyl. Since they appear in consecutive columns, there’s a shortcut we can use to select them:

select(mpg, manufacturer:cyl)
## # A tibble: 234 x 5
##    manufacturer model      displ  year   cyl
##    <chr>        <chr>      <dbl> <int> <int>
##  1 audi         a4           1.8  1999     4
##  2 audi         a4           1.8  1999     4
##  3 audi         a4           2    2008     4
##  4 audi         a4           2    2008     4
##  5 audi         a4           2.8  1999     6
##  6 audi         a4           2.8  1999     6
##  7 audi         a4           3.1  2008     6
##  8 audi         a4 quattro   1.8  1999     4
##  9 audi         a4 quattro   1.8  1999     4
## 10 audi         a4 quattro   2    2008     4
## # i 224 more rows

Of course, we can use select together with the other transformation functions via the pipe. Let’s rank the Chevrolets in mpg in descending order of city gas mileage and show only the car model, engine size, year, number of cylinders, transmission type, and city mileage:

mpg %>%
  filter(manufacturer == "chevrolet") %>%
  arrange(desc(cty)) %>%
  select(model:trans, cty)
## # A tibble: 19 x 6
##    model              displ  year   cyl trans        cty
##    <chr>              <dbl> <int> <int> <chr>      <int>
##  1 malibu               2.4  2008     4 auto(l4)      22
##  2 malibu               2.4  1999     4 auto(l4)      19
##  3 malibu               3.1  1999     6 auto(l4)      18
##  4 malibu               3.5  2008     6 auto(l4)      18
##  5 malibu               3.6  2008     6 auto(s6)      17
##  6 corvette             5.7  1999     8 manual(m6)    16
##  7 corvette             6.2  2008     8 manual(m6)    16
##  8 corvette             5.7  1999     8 auto(l4)      15
##  9 corvette             6.2  2008     8 auto(s6)      15
## 10 corvette             7    2008     8 manual(m6)    15
## 11 c1500 suburban 2wd   5.3  2008     8 auto(l4)      14
## 12 c1500 suburban 2wd   5.3  2008     8 auto(l4)      14
## 13 k1500 tahoe 4wd      5.3  2008     8 auto(l4)      14
## 14 k1500 tahoe 4wd      6.5  1999     8 auto(l4)      14
## 15 c1500 suburban 2wd   5.7  1999     8 auto(l4)      13
## 16 c1500 suburban 2wd   6    2008     8 auto(l4)      12
## 17 c1500 suburban 2wd   5.3  2008     8 auto(l4)      11
## 18 k1500 tahoe 4wd      5.3  2008     8 auto(l4)      11
## 19 k1500 tahoe 4wd      5.7  1999     8 auto(l4)      11

Exercises

To complete the exercises below, download this R Notebook file and open it in RStudio using the Visual Editor. Then enter your answers in the space provided.

  1. Using the msleep data set, create a data set consisting of only the carnivores sorted by body weight in descending order. Display only the name of the animal and its body weight. Do all of this using only one pipe statement.

  2. Predict what will happen when you run each of the following. Then test your predictions.

    1. select(mpg, -(cyl:fl))
    2. select(msleep, starts_with("sleep"))
    3. select(msleep, ends_with("wt"))
    4. select(msleep, starts_with("sleep") | ends_with("wt"))
    5. select(msleep, starts_with("sleep") & ends_with("wt"))

2.5 Creating New Variables from Old

This section requires the nycflights13 library.

It’s often necessary to use the variables in a data set to create a new variable. The transformation function used to do this is mutate.

For example, we might want to add a column to flights that contains each flight’s average air speed. Recalling that average speed is total distance divided by total time, we can calculate a flight’s air speed by calculating distance / air_time. Well, almost. The units of distance are miles, and the units of air_time are minutes, so to calculate air speed in the more familiar units of miles per hour, we’d actually want to compute distance / air_time * 60.

To make this calculation for each row of flights and add it as a new column (let’s call this new column air_speed), we apply the mutate function as follows:

mutate(flights, air_speed = distance / air_time * 60)

When mutating a new variable onto a data set, it’s a better practice to assign the mutated data set a new name. You can do so by inserting a name assignment <NEW NAME> <- at the beginning of the code above:

flights_w_speed <- mutate(flights, air_speed = distance / air_time * 60)

You can now refer to flights_w_speed for subsequent analysis.


The creation of a new variable inside mutate is always expressed by an equation of the form:

<NEW VARIABLE NAME> = <FORMULA INVOLVING OLD VARIABLES>

The formula is often constructed using basic arithmetic, as with the air speed calculation above. However, it’s worth mentioning a few other useful operations to use within formulas.

Run the following lines of code, and try to guess what the %/% and %% operators do:

27 %/% 4
27 %% 4

You may have guessed that %/% returns the whole number quotient obtained when dividing one number by another, while %% returns the remainder. We can see right away how these operations might be applied by doing the following: add a column to flights that computes the number of minutes since midnight until each flight departs and another column that computes the number of minutes since midnight until each flight arrives.

For example, for the very first flight in the data set, the value of dep_time is 517, and the value of arr_time is 830. These numbers are really just abbreviations for 5:17am and 8:30am and are not numbers that could be directly used in calculations. Converting these times to minutes since midnight would provide an actual numerical measurement (not a time) that indicates when a flight departed or arrived.

For the 5:17 departure time, we would convert to minutes since midnight by calculating 5*60 + 17, which is 317. But how would we tell R to do this? We’d have to have a way to separate the first one or two digits from the last two digits. This is where %/% and %% come in. Notice that 517 %/% 100 = 5 and 517 %% 100 = 17. Thus, the conversion can be accomplished by the formula (517 %/% 100)*60 + (517 %% 100).

We can now add our minutes since midnight variables as follows. (Notice that you can create more than one new variable within a single mutate call.) We’ll also select the relevant columns so they’re visible in the output. We’ll use the pipe to mutate then select:

flights_w_mins_midnight <- flights %>%
  mutate(dep_mins_midnight = (dep_time %/% 100)*60 + (dep_time %% 100),
         arr_mins_midnight = (arr_time %/% 100)*60 + (arr_time %% 100)) %>%
  select(dep_time, dep_mins_midnight, arr_time, arr_mins_midnight)

flights_w_mins_midnight
## # A tibble: 336,776 x 4
##    dep_time dep_mins_midnight arr_time arr_mins_midnight
##       <int>             <dbl>    <int>             <dbl>
##  1      517               317      830               510
##  2      533               333      850               530
##  3      542               342      923               563
##  4      544               344     1004               604
##  5      554               354      812               492
##  6      554               354      740               460
##  7      555               355      913               553
##  8      557               357      709               429
##  9      557               357      838               518
## 10      558               358      753               473
## # i 336,766 more rows

Another very useful operation to use within a mutate formula is ifelse. The ifelse function returns a value that depends on the truth or falsehood of some condition. The basic syntax is


ifelse(<CONDITION>, 
       <VALUE IF CONDITION IS TRUE>,
       <VALUE IF CONDITION IS FALSE)

For example, the ifelse below says to check whether x < y. If it is, then return the value of x^2. If it’s not, then return the string “hello.”

x <- 4
y <- 2

ifelse(x < y, 
       x^2, 
       "hello")
## [1] "hello"

ifelse statements can be used within mutate to create new variables in a way that depends on conditions on the old variables. It’s an especially useful way to categorize continuous variables, i.e, to assign categorical values to continuous ones. For example, maybe we want to add a column to flights that states whether a flight arrived late or not late. (We’ll just select the relevant columns to make our new column visible.)

flights_w_arr_status <- flights %>%
  mutate(arr_status = ifelse(arr_delay > 0, 
                             "late", 
                             " not late")) %>%
  select(month, day, arr_delay, origin, dest, arr_status)

flights_w_arr_status
## # A tibble: 336,776 x 6
##    month   day arr_delay origin dest  arr_status 
##    <int> <int>     <dbl> <chr>  <chr> <chr>      
##  1     1     1        11 EWR    IAH   "late"     
##  2     1     1        20 LGA    IAH   "late"     
##  3     1     1        33 JFK    MIA   "late"     
##  4     1     1       -18 JFK    BQN   " not late"
##  5     1     1       -25 LGA    ATL   " not late"
##  6     1     1        12 EWR    ORD   "late"     
##  7     1     1        19 EWR    FLL   "late"     
##  8     1     1       -14 LGA    IAD   " not late"
##  9     1     1        -8 JFK    MCO   " not late"
## 10     1     1         8 LGA    ORD   "late"     
## # i 336,766 more rows

ifelse statements work well when creating binary (two-value) categorical variables, such as a flight’s arrival status, because they are set up to produce two outputs. However, to create categorical variables with more than two possible values, ifelse is not ideal.

There is a much better option for categorizations with more than two categories: case_when. Suppose we want to label each flight as one of “canceled,” “not late,” or “late,” depending on the value of arr_delay. Here’s how it works:

flights_w_arr_status2 <- flights %>%
  mutate(arr_status = case_when(is.na(arr_delay) ~ "canceled",
                                arr_delay <= 0 ~ "not late",
                                arr_delay >0 ~ "late")) %>%
  select(month, day, arr_delay, origin, dest, arr_status) 

flights_w_arr_status2
## # A tibble: 336,776 x 6
##    month   day arr_delay origin dest  arr_status
##    <int> <int>     <dbl> <chr>  <chr> <chr>     
##  1     1     1        11 EWR    IAH   late      
##  2     1     1        20 LGA    IAH   late      
##  3     1     1        33 JFK    MIA   late      
##  4     1     1       -18 JFK    BQN   not late  
##  5     1     1       -25 LGA    ATL   not late  
##  6     1     1        12 EWR    ORD   late      
##  7     1     1        19 EWR    FLL   late      
##  8     1     1       -14 LGA    IAD   not late  
##  9     1     1        -8 JFK    MCO   not late  
## 10     1     1         8 LGA    ORD   late      
## # i 336,766 more rows

In the case_when statement above, the third condition arr_delay > 0 is the only option left after the first two, so it seems redundant to have to actually check it. If we don’t want to specify the last remaining condition explicitly but instead just let it catch all the remaining cases, we can just enter TRUE for this condition. The following case_when is thus equivalent to the one above:

case_when(is.na(arr_delay) ~ "canceled",
          arr_delay <= 0 ~ "not late",
          TRUE ~ "late")

Exercises

To complete the exercises below, download this R Notebook file and open it in RStudio using the Visual Editor. Then enter your answers in the space provided.

These exercises require the nycflights13 and Lahman libraries.

  1. Find the fastest and slowest flights (in terms of air speed). State the actual average air speeds in miles per hour.

  2. Use the variables dep_mins_midnight and arr_mins_midnight to calculate the number of minutes from dep_time to arr_time for each flight. Compare your answers to the air_time values. Try to guess why they don’t match. (Notice that for some flights, the value of air_time is way more than the number of minutes between dep_time and arr_time; for example, look at the United flight from JFK to LAX on January 1 that left at 5:58am. How would you explain this discrepancy?)

  3. Add a column to flights that indicates whether a flight was canceled. The new column’s values should be either canceled or not canceled. Then find the first date on which a flight was canceled. (Have R do this for you; don’t just scroll through the data set.)

  4. Some of the flights in the data set flights_w_arr_status we created above will have an NA in the arr_status column. Why? (Hint: What happens when you ask R whether NA > 0?)

  5. Count the number of NAs in the arr_status column of flights_w_arr_status. (Have R do this; don’t manually count them yourself.)

  6. Add a column to flights that gives the season for each flight. For purposes of this exercise, let’s say “winter” is January, February, and December; “spring” is March, April, and May; “summer” is June, July, and August; and “fall” is September, October, and November.

  7. Categorize the observations in diamonds as “expensive,” “very expensive,” “insanely expensive,” or “priceless.” Decide for yourself how to define these labels. Then create a visualization that shows the distribution of these labels.

  8. View the Batting data set from the Lahman library and read its documentation. How many observations and variables does it have?

  9. Batting does not have a column for a player’s batting average, which measures the percentage of a player’s at bats that result in a hit. The formula is H / AB. Create a batting average column in Batting and find the player(s) with the all-time highest single season batting average. What’s strange about your result? (You can just refer to a player by their playerID value if you don’t know their real name.)

  10. To fix the problem from the previous exercise, a minimum number of at bats are required for a batting average to be considered “official.” Re-do Exercise 6, but only rank the players who had at least 350 at bats.

  11. The modern era of baseball is generally considered to have begun in 1947 (after World War II ended). Who’s had the best single season batting average of the modern era (assuming at least 350 at bats)? Answer this question entirely by using transformation functions rather than by just scrolling through the data.

  12. When was the last year a player had a batting average of at least 0.400 with a minimum of 350 at bats? (Again, answer this question using only transformation functions.)

2.6 Grouped Summaries

Suppose we want to create a data table that ranks the manufacturers in mpg according to the average highway gas mileage for each manufacturer. This would entail grouping the cars by the manufacturer values and then computing the average value of hwy for each group. This is an example of a grouped summary and is done with a combination of the group_by and summarize functions. Here’s how it works:

mpg %>%
  group_by(manufacturer) %>%
  summarize(average_highway_mileage = mean(hwy)) %>%
  arrange(desc(average_highway_mileage))
## # A tibble: 15 x 2
##    manufacturer average_highway_mileage
##    <chr>                          <dbl>
##  1 honda                           32.6
##  2 volkswagen                      29.2
##  3 hyundai                         26.9
##  4 audi                            26.4
##  5 pontiac                         26.4
##  6 subaru                          25.6
##  7 toyota                          24.9
##  8 nissan                          24.6
##  9 chevrolet                       21.9
## 10 ford                            19.4
## 11 mercury                         18  
## 12 dodge                           17.9
## 13 jeep                            17.6
## 14 lincoln                         17  
## 15 land rover                      16.5

A couple of notes:

  • group_by and summarize are always used together.

  • summarize works similarly to mutate in that it creates a new column with a name that the user provides (average_highway_mileage above).

  • mean is a built-in function that computes the average hwy value in each group.

  • The displayed data table will only show the grouped column and any computed summary columns.

You can group by more than one variable as well. Suppose we want to rank highway gas mileage by manufacturer and model:

mpg %>%
  group_by(manufacturer, model) %>%
  summarize(average_highway_mileage = mean(hwy)) %>%
  arrange(desc(average_highway_mileage))
## # A tibble: 38 x 3
## # Groups:   manufacturer [15]
##    manufacturer model        average_highway_mileage
##    <chr>        <chr>                          <dbl>
##  1 toyota       corolla                         34  
##  2 volkswagen   new beetle                      32.8
##  3 honda        civic                           32.6
##  4 volkswagen   jetta                           29.1
##  5 nissan       altima                          28.7
##  6 audi         a4                              28.3
##  7 toyota       camry                           28.3
##  8 toyota       camry solara                    28.1
##  9 hyundai      sonata                          27.7
## 10 chevrolet    malibu                          27.6
## # i 28 more rows

mean is just one example of a summary statistic that can be used inside summarize. Here are a few more common ones:

statistic syntax use
mean mean(<VARIABLE>) computes the average value of <VARIABLE> within each group
sum sum(<VARIABLE>) adds the values within each group
count n() counts the number of observations within each group
distinct count n_distinct(<VARIABLE>) counts the number of different values of <VARIABLE> within each group
maximum max(<VARIABLE>) returns the largest value of <VARIABLE> within each group
minimum min(<VARIABLE>) returns the smallest value of <VARIABLE> within each group
standard deviation sd(<VARIABLE>) computes the standard deviation of <VARIABLE> within each group
median median(<VARIABLE>) computes the median value of <VARIABLE> within each group

There are many others as well. It’s certainly possible to compute more than one summary statistic within a single summarize call. In fact, whenever you’re computing a mean value, it’s a good idea to also compute the count. This is because comparing averages is really only appropriate when the groups have roughly the same number of observations. If we do this for the average highway mileage example above, we get:

mpg %>%
  group_by(manufacturer) %>%
  summarize(average_highway_mileage = mean(hwy),
            count = n()) %>%
  arrange(desc(average_highway_mileage))
## # A tibble: 15 x 3
##    manufacturer average_highway_mileage count
##    <chr>                          <dbl> <int>
##  1 honda                           32.6     9
##  2 volkswagen                      29.2    27
##  3 hyundai                         26.9    14
##  4 audi                            26.4    18
##  5 pontiac                         26.4     5
##  6 subaru                          25.6    14
##  7 toyota                          24.9    34
##  8 nissan                          24.6    13
##  9 chevrolet                       21.9    19
## 10 ford                            19.4    25
## 11 mercury                         18       4
## 12 dodge                           17.9    37
## 13 jeep                            17.6     8
## 14 lincoln                         17       3
## 15 land rover                      16.5     4

We would therefore want to refrain from declaring that Pontiacs have better fuel efficiencies than Toyotas since Pontiacs are relatively underrepresented in the data set.


Referring to the flights data set, suppose we want to know which destination airport has the worst arrival delays. We could group flights by the dest variable and then compute the average value of arr_delay for each group. Following the above advice, we’ll also compute the count for each group.

flights %>%
  group_by(dest) %>%
  summarize(avg_arr_delay = mean(arr_delay),
            count = n()) %>%
  arrange(desc(avg_arr_delay))
## # A tibble: 105 x 3
##    dest  avg_arr_delay count
##    <chr>         <dbl> <int>
##  1 SBN            6.5     10
##  2 EYW            6.35    17
##  3 ABQ            4.38   254
##  4 ANC           -2.5      8
##  5 LEX          -22        1
##  6 ACK           NA      265
##  7 ALB           NA      439
##  8 ATL           NA    17215
##  9 AUS           NA     2439
## 10 AVL           NA      275
## # i 95 more rows

What’s the deal with all the NAs in the avg_arr_delay column? Remember that when a flight is canceled, there is no arrival and hence no arrival delay. Thus, canceled flights have NA in the arr_delay column. So in any dest group which contains at least one canceled flight, the mean function will be attempting to calculate an average value when not all of the values are available. This means that the average itself will be reported as NA.

Luckily, there’s an easy, built-in way to avoid this. We can remove the NAs from the mean computation by inserting the argument na.rm = TRUE into mean:

flights %>%
  group_by(dest) %>%
  summarize(avg_arr_delay = mean(arr_delay, na.rm = TRUE),
            count = n()) %>%
  arrange(desc(avg_arr_delay))
## # A tibble: 105 x 3
##    dest  avg_arr_delay count
##    <chr>         <dbl> <int>
##  1 CAE            41.8   116
##  2 TUL            33.7   315
##  3 OKC            30.6   346
##  4 JAC            28.1    25
##  5 TYS            24.1   631
##  6 MSN            20.2   572
##  7 RIC            20.1  2454
##  8 CAK            19.7   864
##  9 DSM            19.0   569
## 10 GRR            18.2   765
## # i 95 more rows

It looks like CAE (Columbia Metropolitan Airport) has the highest average arrival delay, although there were only 116 flights there. On the other hand, RIC (Richmond International Airport) maintained a high average arrival delay over almost 2500 flights. RIC might get my vote for the worst airport to fly into from New York.

If you look at the end of the list, you see that, on average, flights to Blue Grass Airport in Lexington, KY, arrived an average of 22 minutes early! This seems amazing until you notice that there was only one flight there.

## # A tibble: 6 x 3
##   dest  avg_arr_delay count
##   <chr>         <dbl> <int>
## 1 ANC           -2.5      8
## 2 STT           -3.84   522
## 3 SNA           -7.87   825
## 4 PSP          -12.7     19
## 5 LEX          -22        1
## 6 LGA          NaN        1

Last on the list is LGA, with an average arrival delay value of NaN. This stands for “not a number.” If we extract out this observation, we might be able to see why:

flights %>%
  filter(dest == "LGA") %>%
  select(month, day, dest, dep_delay, arr_delay)
## # A tibble: 1 x 5
##   month   day dest  dep_delay arr_delay
##   <int> <int> <chr>     <dbl>     <dbl>
## 1     7    27 LGA          NA        NA

It looks like there was only one flight to LGA, and it was canceled, giving us an NA in the arr_delay column. When we then compute the average arrival delay and use na.rm = TRUE to remove the NA, we’re left with no numbers to take the average of! This would lead to a division by 0, which is not arithmetically possible. Hence the NaN. (Notice that removing the NAs from the mean computation does not affect the value of count, though.)

Exercises

To complete the exercises below, download this R Notebook file and open it in RStudio using the Visual Editor. Then enter your answers in the space provided.

These exercises will require both the nycflights13 and Lahman libraries.

  1. Create a data set from flights that shows the average departure delay on each day of the year.

  2. Which carrier from flights had the best on-time arrival rate in 2013? Which had the worst? (See Exercise 4 from Section 2.1). Be sure to include counts so we know whether the rates are comparable.

  3. For each destination in flights, can you find any flights that were suspiciously fast, possibly indicating a data entry error? (Hint: Think about whether any of the summary statistics mentioned in this section would help you spot flights that were suspiciously fast.)

  4. Which carrier in flights flew to the most different destinations?

  5. Which specific plane (identified by tailnum) had the most variation in the distances it flew? (Be sure to include counts in your analysis.)

  6. Which day of the year had the most cancellations? Which day had the least?

  7. Using the Batting data set from Lahman, determine the maximum number of home runs hit by a player every year since 1871.

  8. Create a visualization of your findings from the previous exercise.

  9. Create a data set that shows a ranked list of the all-time leaders in career batting average. (A player’s career batting average is his total number of hits divided by his total number of at bats.) Note that to be eligible for this list, a player must have at least 3000 total at bats. Be sure to take this into account when you create your list.

2.7 Project

Project Description: The purpose of this project is to explore and explain observations in the flights data set by transforming and visualizing the data.

Instructions:

Answer the following questions using the appropriate transformation functions and visualizations.

  1. Create a data set that lists the days during 2013 on which the percentage of flights that were either canceled or delayed on departure by more than one hour exceeded 35%.

  2. Do a Google search to see what the weather was like in New York on each of the days from the previous problem in 2013. Do your findings explain the large number of delays/cancellations?

  3. Create a data set that shows, for each scheduled departure hour during the day, the percentage of flights that were canceled or delayed by more than one hour.

  4. Obtain a plot that shows the relationship between the hour and the percentage of flights canceled or delayed by more than an hour.

  5. Explain the relationship in the previous problem. In particular:

    1. Why does the plot have the given shape?
    2. Are there outliers? If so, why might they be there?
    3. When is the riskiest time of day to fly if you want to avoid cancellations and long delays, and why might that be?
  6. Above, you found dates on which a lot of cancellations and long delays occurred. However, several flights were able to leave on time or early on those dates. Create a data set that contains the flights that left on time or early on those dates.

  7. It’s a reasonable guess that the flights that were able to leave early despite bad weather were probably the ones that left in the morning before the weather hit. For each date in the data set from the previous problem, find the average scheduled hour of departure. Do your findings mostly confirm this guess?

  8. Did any of the problematic dates have an average departure hour in the afternoon? If so, why might that have been?

Guidelines:

  • Heading: Include a descriptive title for your report (do not just call it “Project 2”), your name, and the date in the heading.

  • Introduction:

    • Begin with a description of the problem you’re solving or question you’re answering in your report. This should be concise but detailed enough to set the stage for the work you’ll be doing.
    • Describe your data set. Say what it contains, how many observations and variables it contains, and where it comes from.
    • Create a data dictionary, at least for the variables which will be referenced in your report.
    • Use the datatable function to display your data set. (If the data set is too big to display when you try to knit your .Rmd file, you don’t have to display it.)
    • Include a code chunk in which you load the libraries you’ll need, and briefly say what each one is for.
  • Body:

    • Follow the introduction with the body of your report which will contain your work, broken into relevant section headings.
    • The body should include text that provides a running narrative that guides the reader through your work. This should be thorough and complete, but you should avoid large blocks of text.
    • Do not use the problem-numbering in the Instructions above in the body. That’s just for your benefit as you prepare your work. Your narrative structure – not an enumerated list – should lead the reader from one question to the next.
    • The body of the report should show all of your R code and its output, but it should not show any warning or error messages. (This is for my benefit. A professional report might not include the code, depending on the audience.)
    • All graphics should look nice, be easy to read, and be fully labeled.
    • You should include insightful statements about the meaning you’re extracting from your graphics rather than just superficial descriptions of your visualizations.
  • Conclusion:

    • End with an overall concluding statement which summarizes your findings.
    • Provide a clear answer to the question you set out to answer.
    • If you were not able to answer the question fully, possibly because of insufficient data, explain why.
  • Report Preparation: The problems for the project should be worked out in a scratch work .R file which you will not turn in. Once you’re done, write up your results in an R Notebook report, making sure to follow the guidelines above. Then publish your report online using RPubs and send me the URL to submit your work.

Grading Rubric:

  • Transformations: Do you use the right transformation functions to answer the questions? (30 points)

  • Insights: Are insights fully explained and well-written? Are the insights derived from your results? (30 points)

  • Narrative: Is it clear what you’re trying to do in this project? Do you maintain a readable narrative throughout that guides the reader through your analysis? (20 points)

  • Professionalism: Does your report look nice? Do you provide insights based on your analysis? Is your code clear and readable? Did you follow the guidelines? (15 points)


  1. Depending on the version of R you’re running, you might first also have to install the Rtools package: install.packages("Rtools")↩︎