Exercise 1 - data wrangling
Introduction
For this homework exercise, use the functions that were introduced in class to solve the following problems. You shouldn’t have to use any functions that are not covered in class! The course book should be your first point of reference as you tackle the problems.
Questions
1. Assign a variable
Why does this code chunk not work?
stringlenght <- 10
stringlength
Look carefully! (This may seem like an exercise in pointlessness, but training your brain to notice even the tiniest difference will pay off when programming.)
Solution
First, the code chunk was missing the {r} portion.
Second, the object name is spelled differently in each case: it ends with ht
in the first line, and th
in the second.
2. Reducing the size of your data set
Starting with the {mpg} data frame, use filter()
and select()
to create a single table that contains only the following records and variables:
compact and subcompact Volkswagens from the model year 1999
manufacturer, model, mileage in both city and highway
Solution
First, view the data table:
mpg
## # A tibble: 234 × 12
## manufacturer model displ year cyl trans drv cty hwy fl class mpg_per_cubic_litre
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr> <dbl>
## 1 audi a4 1.8 1999 4 auto… f 18 29 p comp… 16.1
## 2 audi a4 1.8 1999 4 manu… f 21 29 p comp… 16.1
## 3 audi a4 2 2008 4 manu… f 20 31 p comp… 15.5
## 4 audi a4 2 2008 4 auto… f 21 30 p comp… 15
## 5 audi a4 2.8 1999 6 auto… f 16 26 p comp… 9.29
## 6 audi a4 2.8 1999 6 manu… f 18 26 p comp… 9.29
## 7 audi a4 3.1 2008 6 auto… f 18 27 p comp… 8.71
## 8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp… 14.4
## 9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp… 13.9
## 10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp… 14
## # ℹ 224 more rows
In this first solution, each evaluation gets its own filter()
function:
mpg |>
filter(manufacturer == "volkswagen") |>
filter(class == "compact" | class == "subcompact") |> ## - use of "OR" `|` instead of `%in%`
filter(year == 1999) |>
select(manufacturer, model, cty, hwy)
## # A tibble: 12 × 4
## manufacturer model cty hwy
## <chr> <chr> <int> <int>
## 1 volkswagen gti 21 29
## 2 volkswagen gti 19 26
## 3 volkswagen gti 17 24
## 4 volkswagen jetta 33 44
## 5 volkswagen jetta 21 29
## 6 volkswagen jetta 19 26
## 7 volkswagen jetta 16 23
## 8 volkswagen jetta 17 24
## 9 volkswagen new beetle 35 44
## 10 volkswagen new beetle 29 41
## 11 volkswagen new beetle 21 29
## 12 volkswagen new beetle 19 26
Another possible solution combines the three evaluations into a single filter()
function:
mpg |>
filter(manufacturer == "volkswagen",
class %in% c("compact", "subcompact"),
year == 1999) |>
select(manufacturer, model, cty, hwy)
## # A tibble: 12 × 4
## manufacturer model cty hwy
## <chr> <chr> <int> <int>
## 1 volkswagen gti 21 29
## 2 volkswagen gti 19 26
## 3 volkswagen gti 17 24
## 4 volkswagen jetta 33 44
## 5 volkswagen jetta 21 29
## 6 volkswagen jetta 19 26
## 7 volkswagen jetta 16 23
## 8 volkswagen jetta 17 24
## 9 volkswagen new beetle 35 44
## 10 volkswagen new beetle 29 41
## 11 volkswagen new beetle 21 29
## 12 volkswagen new beetle 19 26
In this variant, the filter()
function uses “&” instead of commas at the end of each evaluation:
## variant
mpg |>
filter(manufacturer == "volkswagen" & # use of "&" instead of commas in filter
class %in% c("compact", "subcompact") &
year == 1999) |>
select(manufacturer, model, cty, hwy)
## # A tibble: 12 × 4
## manufacturer model cty hwy
## <chr> <chr> <int> <int>
## 1 volkswagen gti 21 29
## 2 volkswagen gti 19 26
## 3 volkswagen gti 17 24
## 4 volkswagen jetta 33 44
## 5 volkswagen jetta 21 29
## 6 volkswagen jetta 19 26
## 7 volkswagen jetta 16 23
## 8 volkswagen jetta 17 24
## 9 volkswagen new beetle 35 44
## 10 volkswagen new beetle 29 41
## 11 volkswagen new beetle 21 29
## 12 volkswagen new beetle 19 26
3. Selecting variables
What happens if you include the name of a variable twice in a select()
call? Demonstrate using the {gapminder} data. (The package is installed in the workspace, but you will have to load it to access it…)
Solution
The variable is not duplicated. In the example below, the variable “year” is duplicated but it appears only once in the output from the select()
function.
gapminder |>
select(country, year, lifeExp, year)
## # A tibble: 1,704 × 3
## country year lifeExp
## <fct> <int> <dbl>
## 1 Afghanistan 1952 28.8
## 2 Afghanistan 1957 30.3
## 3 Afghanistan 1962 32.0
## 4 Afghanistan 1967 34.0
## 5 Afghanistan 1972 36.1
## 6 Afghanistan 1977 38.4
## 7 Afghanistan 1982 39.9
## 8 Afghanistan 1987 40.8
## 9 Afghanistan 1992 41.7
## 10 Afghanistan 1997 41.8
## # ℹ 1,694 more rows
4. R Markdown
The following chunk of code produces a chart. When this notebook is previewed, the R code inside the chunk is also rendered.
Enter a chunk option so that the chart shows in the output, but the code does not.
(Hint: use the RMarkdown cheatsheet to find the solution!)
# hide this code!
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
Solution
The solution is to put the option “echo = FALSE” after the “r” in the opening of the code chunk.
There are other options that can also control the behaviour of the code and its output. For more details, see the R Markdown cheat sheet.
{r, echo=FALSE}
5. Flight data
For Q3a-3d, use the data nycflights13::flights
Sidebar: To see the names of all of the tables in the {nycflights13} package, use the
data()
function:
data(package = "nycflights13")
Pay close attention to how the variables are coded. Type ?flights
into the console to access the details about this data in the “Help” pane.
Using the data in the nycflights13::flights
table, find all flights that:
Had an arrival delay of two or more hours
Flew to Houston (there are two airports, with the codes IAH or HOU)
Were operated by United, American, or Delta (you may want to review the separate airlines table to check the codes for each airline)
Departed in summer (July, August, and September)
Solution
Solution a
a. Had an arrival delay of two or more hours
Note that variable arr_delay
is measured in minutes, so we need to find the flights where that’s greater than or equal to 120 minutes.
filter(flights, arr_delay >= 120)
## # A tibble: 10,200 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 811 630 101 1047 830 137 MQ
## 2 2013 1 1 848 1835 853 1001 1950 851 MQ
## 3 2013 1 1 957 733 144 1056 853 123 UA
## 4 2013 1 1 1114 900 134 1447 1222 145 UA
## 5 2013 1 1 1505 1310 115 1638 1431 127 EV
## 6 2013 1 1 1525 1340 105 1831 1626 125 B6
## 7 2013 1 1 1549 1445 64 1912 1656 136 EV
## 8 2013 1 1 1558 1359 119 1718 1515 123 EV
## 9 2013 1 1 1732 1630 62 2028 1825 123 EV
## 10 2013 1 1 1803 1620 103 2008 1750 138 MQ
## # ℹ 10,190 more rows
## # ℹ 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Solution b
b. Flew to Houston (there are two airports, with the codes IAH or HOU)
filter(flights, dest == "IAH" | dest == "HOU")
## # A tibble: 9,313 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA
## 2 2013 1 1 533 529 4 850 830 20 UA
## 3 2013 1 1 623 627 -4 933 932 1 UA
## 4 2013 1 1 728 732 -4 1041 1038 3 UA
## 5 2013 1 1 739 739 0 1104 1038 26 UA
## 6 2013 1 1 908 908 0 1228 1219 9 UA
## 7 2013 1 1 1028 1026 2 1350 1339 11 UA
## 8 2013 1 1 1044 1045 -1 1352 1351 1 UA
## 9 2013 1 1 1114 900 134 1447 1222 145 UA
## 10 2013 1 1 1205 1200 5 1503 1505 -2 UA
## # ℹ 9,303 more rows
## # ℹ 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Note that you could also use %in%
## # A tibble: 9,313 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA
## 2 2013 1 1 533 529 4 850 830 20 UA
## 3 2013 1 1 623 627 -4 933 932 1 UA
## 4 2013 1 1 728 732 -4 1041 1038 3 UA
## 5 2013 1 1 739 739 0 1104 1038 26 UA
## 6 2013 1 1 908 908 0 1228 1219 9 UA
## 7 2013 1 1 1028 1026 2 1350 1339 11 UA
## 8 2013 1 1 1044 1045 -1 1352 1351 1 UA
## 9 2013 1 1 1114 900 134 1447 1222 145 UA
## 10 2013 1 1 1205 1200 5 1503 1505 -2 UA
## # ℹ 9,303 more rows
## # ℹ 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
## # A tibble: 9,313 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA
## 2 2013 1 1 533 529 4 850 830 20 UA
## 3 2013 1 1 623 627 -4 933 932 1 UA
## 4 2013 1 1 728 732 -4 1041 1038 3 UA
## 5 2013 1 1 739 739 0 1104 1038 26 UA
## 6 2013 1 1 908 908 0 1228 1219 9 UA
## 7 2013 1 1 1028 1026 2 1350 1339 11 UA
## 8 2013 1 1 1044 1045 -1 1352 1351 1 UA
## 9 2013 1 1 1114 900 134 1447 1222 145 UA
## 10 2013 1 1 1205 1200 5 1503 1505 -2 UA
## # ℹ 9,303 more rows
## # ℹ 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Solution c
c. Were operated by United, American, or Delta
This one takes a bit of effort–you have to find the codes that match the values in the variable carrier
with the airline names.
In addition to the flights
table, the {nycflights13} package has a number of other tables, including airlines
nycflights13::airlines
## # A tibble: 16 × 2
## carrier name
## <chr> <chr>
## 1 9E Endeavor Air Inc.
## 2 AA American Airlines Inc.
## 3 AS Alaska Airlines Inc.
## 4 B6 JetBlue Airways
## 5 DL Delta Air Lines Inc.
## 6 EV ExpressJet Airlines Inc.
## 7 F9 Frontier Airlines Inc.
## 8 FL AirTran Airways Corporation
## 9 HA Hawaiian Airlines Inc.
## 10 MQ Envoy Air
## 11 OO SkyWest Airlines Inc.
## 12 UA United Air Lines Inc.
## 13 US US Airways Inc.
## 14 VX Virgin America
## 15 WN Southwest Airlines Co.
## 16 YV Mesa Airlines Inc.
From this, we can find:
United == UA
American == AA
Delta == DL
# the `%in%` approach -- other methods are possible!
filter(flights, carrier %in% c("AA", "DL", "UA"))
## # A tibble: 139,504 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA
## 2 2013 1 1 533 529 4 850 830 20 UA
## 3 2013 1 1 542 540 2 923 850 33 AA
## 4 2013 1 1 554 600 -6 812 837 -25 DL
## 5 2013 1 1 554 558 -4 740 728 12 UA
## 6 2013 1 1 558 600 -2 753 745 8 AA
## 7 2013 1 1 558 600 -2 924 917 7 UA
## 8 2013 1 1 558 600 -2 923 937 -14 UA
## 9 2013 1 1 559 600 -1 941 910 31 AA
## 10 2013 1 1 559 600 -1 854 902 -8 UA
## # ℹ 139,494 more rows
## # ℹ 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Solution d
- Departed in summer (July, August, and September)
The variable month
is numeric, so we have to use greater than and less than to define the range.
filter(flights, month >= 7, month <= 9)
## # A tibble: 86,326 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 7 1 1 2029 212 236 2359 157 B6
## 2 2013 7 1 2 2359 3 344 344 0 B6
## 3 2013 7 1 29 2245 104 151 1 110 B6
## 4 2013 7 1 43 2130 193 322 14 188 B6
## 5 2013 7 1 44 2150 174 300 100 120 AA
## 6 2013 7 1 46 2051 235 304 2358 186 B6
## 7 2013 7 1 48 2001 287 308 2305 243 VX
## 8 2013 7 1 58 2155 183 335 43 172 B6
## 9 2013 7 1 100 2146 194 327 30 177 B6
## 10 2013 7 1 100 2245 135 337 135 122 B6
## # ℹ 86,316 more rows
## # ℹ 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Using the :
operator would also allow us to make a range.
## # A tibble: 86,326 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 7 1 1 2029 212 236 2359 157 B6
## 2 2013 7 1 2 2359 3 344 344 0 B6
## 3 2013 7 1 29 2245 104 151 1 110 B6
## 4 2013 7 1 43 2130 193 322 14 188 B6
## 5 2013 7 1 44 2150 174 300 100 120 AA
## 6 2013 7 1 46 2051 235 304 2358 186 B6
## 7 2013 7 1 48 2001 287 308 2305 243 VX
## 8 2013 7 1 58 2155 183 335 43 172 B6
## 9 2013 7 1 100 2146 194 327 30 177 B6
## 10 2013 7 1 100 2245 135 337 135 122 B6
## # ℹ 86,316 more rows
## # ℹ 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
You could also use “or” |
, or between()
filter(flights, month == 7 |
month == 8 |
month == 9)
## # A tibble: 86,326 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 7 1 1 2029 212 236 2359 157 B6
## 2 2013 7 1 2 2359 3 344 344 0 B6
## 3 2013 7 1 29 2245 104 151 1 110 B6
## 4 2013 7 1 43 2130 193 322 14 188 B6
## 5 2013 7 1 44 2150 174 300 100 120 AA
## 6 2013 7 1 46 2051 235 304 2358 186 B6
## 7 2013 7 1 48 2001 287 308 2305 243 VX
## 8 2013 7 1 58 2155 183 335 43 172 B6
## 9 2013 7 1 100 2146 194 327 30 177 B6
## 10 2013 7 1 100 2245 135 337 135 122 B6
## # ℹ 86,316 more rows
## # ℹ 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
## # A tibble: 86,326 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 7 1 1 2029 212 236 2359 157 B6
## 2 2013 7 1 2 2359 3 344 344 0 B6
## 3 2013 7 1 29 2245 104 151 1 110 B6
## 4 2013 7 1 43 2130 193 322 14 188 B6
## 5 2013 7 1 44 2150 174 300 100 120 AA
## 6 2013 7 1 46 2051 235 304 2358 186 B6
## 7 2013 7 1 48 2001 287 308 2305 243 VX
## 8 2013 7 1 58 2155 183 335 43 172 B6
## 9 2013 7 1 100 2146 194 327 30 177 B6
## 10 2013 7 1 100 2245 135 337 135 122 B6
## # ℹ 86,316 more rows
## # ℹ 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Solution a-d
It is possible to combine all four of the elements in Q6 into a single code sequence:
# with each step as a separate `filter()` statement
flights |>
filter(arr_delay >= 120) |>
filter(dest == "IAH" | dest == "HOU") |>
filter(carrier %in% c("AA", "DL", "UA")) |>
filter(month %in% 7:9)
## # A tibble: 47 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 7 1 1310 1057 133 1551 1338 133 UA
## 2 2013 7 1 1707 1448 139 1943 1742 121 UA
## 3 2013 7 1 2058 1735 203 2355 2030 205 AA
## 4 2013 7 2 2001 1735 146 2335 2030 185 AA
## 5 2013 7 3 2215 1909 186 45 2200 165 UA
## 6 2013 7 9 1937 1735 122 2240 2030 130 AA
## 7 2013 7 10 40 1909 331 301 2200 301 UA
## 8 2013 7 10 1629 1520 69 2048 1754 174 UA
## 9 2013 7 10 1913 1721 112 2214 2001 133 UA
## 10 2013 7 17 1657 1446 131 2007 1745 142 UA
## # ℹ 37 more rows
## # ℹ 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
We could go one step further and combine it all into a single filter()
function:
flights |>
filter(arr_delay >= 120 &
dest %in% c("IAH", "HOU") &
carrier %in% c("AA", "DL", "UA") &
month %in% 7:9)
## # A tibble: 47 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 7 1 1310 1057 133 1551 1338 133 UA
## 2 2013 7 1 1707 1448 139 1943 1742 121 UA
## 3 2013 7 1 2058 1735 203 2355 2030 205 AA
## 4 2013 7 2 2001 1735 146 2335 2030 185 AA
## 5 2013 7 3 2215 1909 186 45 2200 165 UA
## 6 2013 7 9 1937 1735 122 2240 2030 130 AA
## 7 2013 7 10 40 1909 331 301 2200 301 UA
## 8 2013 7 10 1629 1520 69 2048 1754 174 UA
## 9 2013 7 10 1913 1721 112 2214 2001 133 UA
## 10 2013 7 17 1657 1446 131 2007 1745 142 UA
## # ℹ 37 more rows
## # ℹ 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
-30-