Exercise 1 - data wrangling

Setup

This chunk of R code loads the packages that we will be using.

#Package load

library(tidyverse)
library(stringr)

# data files
library(nycflights13)

Introduction

For this homework exercise, use the functions that were introduced in class to solve the following problems. You shouldn’t have to use any functions that are not covered in class! The course book should be your first point of reference as you tackle the problems.

Questions

1. Assign a variable

Why does this code chunk not work?

stringlenght <- 10

stringlength

Look carefully! (This may seem like an exercise in pointlessness, but training your brain to notice even the tiniest difference will pay off when programming.)

Solution

First, the code chunk was missing the {r} portion.

Second, the object name is spelled differently in each case: it ends with ht in the first line, and th in the second.

2. Reducing the size of your data set

Starting with the {mpg} data frame, use filter() and select() to create a single table that contains only the following records and variables:

  • compact and subcompact Volkswagens from the model year 1999

  • manufacturer, model, mileage in both city and highway

Solution

First, view the data table:

mpg
## # A tibble: 234 × 12
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class mpg_per_cubic_litre
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>               <dbl>
##  1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…               16.1 
##  2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…               16.1 
##  3 audi         a4           2    2008     4 manu… f        20    31 p     comp…               15.5 
##  4 audi         a4           2    2008     4 auto… f        21    30 p     comp…               15   
##  5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…                9.29
##  6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…                9.29
##  7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…                8.71
##  8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…               14.4 
##  9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…               13.9 
## 10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…               14   
## # ℹ 224 more rows

In this first solution, each evaluation gets its own filter() function:

mpg |>
  filter(manufacturer == "volkswagen")  |> 
  filter(class == "compact" | class == "subcompact")  |>   ##  - use of "OR" `|` instead of `%in%`
  filter(year == 1999) |>
  select(manufacturer, model, cty, hwy)
## # A tibble: 12 × 4
##    manufacturer model        cty   hwy
##    <chr>        <chr>      <int> <int>
##  1 volkswagen   gti           21    29
##  2 volkswagen   gti           19    26
##  3 volkswagen   gti           17    24
##  4 volkswagen   jetta         33    44
##  5 volkswagen   jetta         21    29
##  6 volkswagen   jetta         19    26
##  7 volkswagen   jetta         16    23
##  8 volkswagen   jetta         17    24
##  9 volkswagen   new beetle    35    44
## 10 volkswagen   new beetle    29    41
## 11 volkswagen   new beetle    21    29
## 12 volkswagen   new beetle    19    26

Another possible solution combines the three evaluations into a single filter() function:

mpg |>
  filter(manufacturer == "volkswagen",
         class %in% c("compact", "subcompact"),
         year == 1999) |>
  select(manufacturer, model, cty, hwy)
## # A tibble: 12 × 4
##    manufacturer model        cty   hwy
##    <chr>        <chr>      <int> <int>
##  1 volkswagen   gti           21    29
##  2 volkswagen   gti           19    26
##  3 volkswagen   gti           17    24
##  4 volkswagen   jetta         33    44
##  5 volkswagen   jetta         21    29
##  6 volkswagen   jetta         19    26
##  7 volkswagen   jetta         16    23
##  8 volkswagen   jetta         17    24
##  9 volkswagen   new beetle    35    44
## 10 volkswagen   new beetle    29    41
## 11 volkswagen   new beetle    21    29
## 12 volkswagen   new beetle    19    26

In this variant, the filter() function uses “&” instead of commas at the end of each evaluation:

## variant
mpg |>
  filter(manufacturer == "volkswagen" &   # use of "&" instead of commas in filter
         class %in% c("compact", "subcompact") &
         year == 1999) |>
  select(manufacturer, model, cty, hwy)
## # A tibble: 12 × 4
##    manufacturer model        cty   hwy
##    <chr>        <chr>      <int> <int>
##  1 volkswagen   gti           21    29
##  2 volkswagen   gti           19    26
##  3 volkswagen   gti           17    24
##  4 volkswagen   jetta         33    44
##  5 volkswagen   jetta         21    29
##  6 volkswagen   jetta         19    26
##  7 volkswagen   jetta         16    23
##  8 volkswagen   jetta         17    24
##  9 volkswagen   new beetle    35    44
## 10 volkswagen   new beetle    29    41
## 11 volkswagen   new beetle    21    29
## 12 volkswagen   new beetle    19    26

3. Selecting variables

What happens if you include the name of a variable twice in a select() call? Demonstrate using the {gapminder} data. (The package is installed in the workspace, but you will have to load it to access it…)

Solution

The variable is not duplicated. In the example below, the variable “year” is duplicated but it appears only once in the output from the select() function.

gapminder |>
  select(country, year, lifeExp, year)
## # A tibble: 1,704 × 3
##    country      year lifeExp
##    <fct>       <int>   <dbl>
##  1 Afghanistan  1952    28.8
##  2 Afghanistan  1957    30.3
##  3 Afghanistan  1962    32.0
##  4 Afghanistan  1967    34.0
##  5 Afghanistan  1972    36.1
##  6 Afghanistan  1977    38.4
##  7 Afghanistan  1982    39.9
##  8 Afghanistan  1987    40.8
##  9 Afghanistan  1992    41.7
## 10 Afghanistan  1997    41.8
## # ℹ 1,694 more rows

4. R Markdown

The following chunk of code produces a chart. When this notebook is previewed, the R code inside the chunk is also rendered.

Enter a chunk option so that the chart shows in the output, but the code does not.

(Hint: use the RMarkdown cheatsheet to find the solution!)

# hide this code!
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))
Solution

The solution is to put the option “echo = FALSE” after the “r” in the opening of the code chunk.

There are other options that can also control the behaviour of the code and its output. For more details, see the R Markdown cheat sheet.

{r, echo=FALSE}

5. Flight data

For Q3a-3d, use the data nycflights13::flights

Sidebar: To see the names of all of the tables in the {nycflights13} package, use the data() function:

data(package = "nycflights13")

Pay close attention to how the variables are coded. Type ?flights into the console to access the details about this data in the “Help” pane.

Using the data in the nycflights13::flights table, find all flights that:

  1. Had an arrival delay of two or more hours

  2. Flew to Houston (there are two airports, with the codes IAH or HOU)

  3. Were operated by United, American, or Delta (you may want to review the separate airlines table to check the codes for each airline)

  4. Departed in summer (July, August, and September)

Solution

Solution a

a. Had an arrival delay of two or more hours

Note that variable arr_delay is measured in minutes, so we need to find the flights where that’s greater than or equal to 120 minutes.

filter(flights, arr_delay >= 120)
## # A tibble: 10,200 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
##  1  2013     1     1      811            630       101     1047            830       137 MQ     
##  2  2013     1     1      848           1835       853     1001           1950       851 MQ     
##  3  2013     1     1      957            733       144     1056            853       123 UA     
##  4  2013     1     1     1114            900       134     1447           1222       145 UA     
##  5  2013     1     1     1505           1310       115     1638           1431       127 EV     
##  6  2013     1     1     1525           1340       105     1831           1626       125 B6     
##  7  2013     1     1     1549           1445        64     1912           1656       136 EV     
##  8  2013     1     1     1558           1359       119     1718           1515       123 EV     
##  9  2013     1     1     1732           1630        62     2028           1825       123 EV     
## 10  2013     1     1     1803           1620       103     2008           1750       138 MQ     
## # ℹ 10,190 more rows
## # ℹ 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Solution b

b. Flew to Houston (there are two airports, with the codes IAH or HOU)

filter(flights, dest == "IAH" | dest == "HOU")
## # A tibble: 9,313 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
##  1  2013     1     1      517            515         2      830            819        11 UA     
##  2  2013     1     1      533            529         4      850            830        20 UA     
##  3  2013     1     1      623            627        -4      933            932         1 UA     
##  4  2013     1     1      728            732        -4     1041           1038         3 UA     
##  5  2013     1     1      739            739         0     1104           1038        26 UA     
##  6  2013     1     1      908            908         0     1228           1219         9 UA     
##  7  2013     1     1     1028           1026         2     1350           1339        11 UA     
##  8  2013     1     1     1044           1045        -1     1352           1351         1 UA     
##  9  2013     1     1     1114            900       134     1447           1222       145 UA     
## 10  2013     1     1     1205           1200         5     1503           1505        -2 UA     
## # ℹ 9,303 more rows
## # ℹ 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Note that you could also use %in%

filter(flights, dest %in% c("IAH", "HOU"))
## # A tibble: 9,313 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
##  1  2013     1     1      517            515         2      830            819        11 UA     
##  2  2013     1     1      533            529         4      850            830        20 UA     
##  3  2013     1     1      623            627        -4      933            932         1 UA     
##  4  2013     1     1      728            732        -4     1041           1038         3 UA     
##  5  2013     1     1      739            739         0     1104           1038        26 UA     
##  6  2013     1     1      908            908         0     1228           1219         9 UA     
##  7  2013     1     1     1028           1026         2     1350           1339        11 UA     
##  8  2013     1     1     1044           1045        -1     1352           1351         1 UA     
##  9  2013     1     1     1114            900       134     1447           1222       145 UA     
## 10  2013     1     1     1205           1200         5     1503           1505        -2 UA     
## # ℹ 9,303 more rows
## # ℹ 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# create named variable "houston"
houston <- c("IAH", "HOU")
filter(flights, dest %in% houston)
## # A tibble: 9,313 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
##  1  2013     1     1      517            515         2      830            819        11 UA     
##  2  2013     1     1      533            529         4      850            830        20 UA     
##  3  2013     1     1      623            627        -4      933            932         1 UA     
##  4  2013     1     1      728            732        -4     1041           1038         3 UA     
##  5  2013     1     1      739            739         0     1104           1038        26 UA     
##  6  2013     1     1      908            908         0     1228           1219         9 UA     
##  7  2013     1     1     1028           1026         2     1350           1339        11 UA     
##  8  2013     1     1     1044           1045        -1     1352           1351         1 UA     
##  9  2013     1     1     1114            900       134     1447           1222       145 UA     
## 10  2013     1     1     1205           1200         5     1503           1505        -2 UA     
## # ℹ 9,303 more rows
## # ℹ 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Solution c

c. Were operated by United, American, or Delta

This one takes a bit of effort–you have to find the codes that match the values in the variable carrier with the airline names.

In addition to the flights table, the {nycflights13} package has a number of other tables, including airlines

nycflights13::airlines
## # A tibble: 16 × 2
##    carrier name                       
##    <chr>   <chr>                      
##  1 9E      Endeavor Air Inc.          
##  2 AA      American Airlines Inc.     
##  3 AS      Alaska Airlines Inc.       
##  4 B6      JetBlue Airways            
##  5 DL      Delta Air Lines Inc.       
##  6 EV      ExpressJet Airlines Inc.   
##  7 F9      Frontier Airlines Inc.     
##  8 FL      AirTran Airways Corporation
##  9 HA      Hawaiian Airlines Inc.     
## 10 MQ      Envoy Air                  
## 11 OO      SkyWest Airlines Inc.      
## 12 UA      United Air Lines Inc.      
## 13 US      US Airways Inc.            
## 14 VX      Virgin America             
## 15 WN      Southwest Airlines Co.     
## 16 YV      Mesa Airlines Inc.

From this, we can find:

  • United == UA

  • American == AA

  • Delta == DL

# the `%in%` approach -- other methods are possible!
filter(flights, carrier %in% c("AA", "DL", "UA"))
## # A tibble: 139,504 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
##  1  2013     1     1      517            515         2      830            819        11 UA     
##  2  2013     1     1      533            529         4      850            830        20 UA     
##  3  2013     1     1      542            540         2      923            850        33 AA     
##  4  2013     1     1      554            600        -6      812            837       -25 DL     
##  5  2013     1     1      554            558        -4      740            728        12 UA     
##  6  2013     1     1      558            600        -2      753            745         8 AA     
##  7  2013     1     1      558            600        -2      924            917         7 UA     
##  8  2013     1     1      558            600        -2      923            937       -14 UA     
##  9  2013     1     1      559            600        -1      941            910        31 AA     
## 10  2013     1     1      559            600        -1      854            902        -8 UA     
## # ℹ 139,494 more rows
## # ℹ 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Solution d

  1. Departed in summer (July, August, and September)

The variable month is numeric, so we have to use greater than and less than to define the range.

filter(flights, month >= 7, month <= 9)
## # A tibble: 86,326 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
##  1  2013     7     1        1           2029       212      236           2359       157 B6     
##  2  2013     7     1        2           2359         3      344            344         0 B6     
##  3  2013     7     1       29           2245       104      151              1       110 B6     
##  4  2013     7     1       43           2130       193      322             14       188 B6     
##  5  2013     7     1       44           2150       174      300            100       120 AA     
##  6  2013     7     1       46           2051       235      304           2358       186 B6     
##  7  2013     7     1       48           2001       287      308           2305       243 VX     
##  8  2013     7     1       58           2155       183      335             43       172 B6     
##  9  2013     7     1      100           2146       194      327             30       177 B6     
## 10  2013     7     1      100           2245       135      337            135       122 B6     
## # ℹ 86,316 more rows
## # ℹ 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Using the : operator would also allow us to make a range.

filter(flights, month %in% 7:9)
## # A tibble: 86,326 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
##  1  2013     7     1        1           2029       212      236           2359       157 B6     
##  2  2013     7     1        2           2359         3      344            344         0 B6     
##  3  2013     7     1       29           2245       104      151              1       110 B6     
##  4  2013     7     1       43           2130       193      322             14       188 B6     
##  5  2013     7     1       44           2150       174      300            100       120 AA     
##  6  2013     7     1       46           2051       235      304           2358       186 B6     
##  7  2013     7     1       48           2001       287      308           2305       243 VX     
##  8  2013     7     1       58           2155       183      335             43       172 B6     
##  9  2013     7     1      100           2146       194      327             30       177 B6     
## 10  2013     7     1      100           2245       135      337            135       122 B6     
## # ℹ 86,316 more rows
## # ℹ 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

You could also use “or” |, or between()

filter(flights, month == 7 |
         month == 8 |
         month == 9)
## # A tibble: 86,326 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
##  1  2013     7     1        1           2029       212      236           2359       157 B6     
##  2  2013     7     1        2           2359         3      344            344         0 B6     
##  3  2013     7     1       29           2245       104      151              1       110 B6     
##  4  2013     7     1       43           2130       193      322             14       188 B6     
##  5  2013     7     1       44           2150       174      300            100       120 AA     
##  6  2013     7     1       46           2051       235      304           2358       186 B6     
##  7  2013     7     1       48           2001       287      308           2305       243 VX     
##  8  2013     7     1       58           2155       183      335             43       172 B6     
##  9  2013     7     1      100           2146       194      327             30       177 B6     
## 10  2013     7     1      100           2245       135      337            135       122 B6     
## # ℹ 86,316 more rows
## # ℹ 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
filter(flights, between(month, 7, 9))
## # A tibble: 86,326 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
##  1  2013     7     1        1           2029       212      236           2359       157 B6     
##  2  2013     7     1        2           2359         3      344            344         0 B6     
##  3  2013     7     1       29           2245       104      151              1       110 B6     
##  4  2013     7     1       43           2130       193      322             14       188 B6     
##  5  2013     7     1       44           2150       174      300            100       120 AA     
##  6  2013     7     1       46           2051       235      304           2358       186 B6     
##  7  2013     7     1       48           2001       287      308           2305       243 VX     
##  8  2013     7     1       58           2155       183      335             43       172 B6     
##  9  2013     7     1      100           2146       194      327             30       177 B6     
## 10  2013     7     1      100           2245       135      337            135       122 B6     
## # ℹ 86,316 more rows
## # ℹ 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Solution a-d

It is possible to combine all four of the elements in Q6 into a single code sequence:

# with each step as a separate `filter()` statement
flights |> 
  filter(arr_delay >= 120) |> 
  filter(dest == "IAH" | dest == "HOU") |> 
  filter(carrier %in% c("AA", "DL", "UA")) |> 
  filter(month %in% 7:9)
## # A tibble: 47 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
##  1  2013     7     1     1310           1057       133     1551           1338       133 UA     
##  2  2013     7     1     1707           1448       139     1943           1742       121 UA     
##  3  2013     7     1     2058           1735       203     2355           2030       205 AA     
##  4  2013     7     2     2001           1735       146     2335           2030       185 AA     
##  5  2013     7     3     2215           1909       186       45           2200       165 UA     
##  6  2013     7     9     1937           1735       122     2240           2030       130 AA     
##  7  2013     7    10       40           1909       331      301           2200       301 UA     
##  8  2013     7    10     1629           1520        69     2048           1754       174 UA     
##  9  2013     7    10     1913           1721       112     2214           2001       133 UA     
## 10  2013     7    17     1657           1446       131     2007           1745       142 UA     
## # ℹ 37 more rows
## # ℹ 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

We could go one step further and combine it all into a single filter() function:

flights |> 
  filter(arr_delay >= 120 &
           dest %in% c("IAH", "HOU") & 
           carrier %in% c("AA", "DL", "UA") & 
           month %in% 7:9)
## # A tibble: 47 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
##  1  2013     7     1     1310           1057       133     1551           1338       133 UA     
##  2  2013     7     1     1707           1448       139     1943           1742       121 UA     
##  3  2013     7     1     2058           1735       203     2355           2030       205 AA     
##  4  2013     7     2     2001           1735       146     2335           2030       185 AA     
##  5  2013     7     3     2215           1909       186       45           2200       165 UA     
##  6  2013     7     9     1937           1735       122     2240           2030       130 AA     
##  7  2013     7    10       40           1909       331      301           2200       301 UA     
##  8  2013     7    10     1629           1520        69     2048           1754       174 UA     
##  9  2013     7    10     1913           1721       112     2214           2001       133 UA     
## 10  2013     7    17     1657           1446       131     2007           1745       142 UA     
## # ℹ 37 more rows
## # ℹ 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

-30-

is licensed under a Creative Commons Attribution 4.0 International License https://creativecommons.org/licenses/by/4.0/ -->