# Exercise Solutions

## Section 1.1.1

1. You should expect these variables to be very closely correlated, with the points clustered very close to a line.

```r
ggplot(data = mpg) +
  geom_point(mapping = aes(x = cty, y = hwy))
```

2. manufacturer, model, year, cyl, trans, drv, fl, and class are categorical. The others are continuous. (You could make an argument for year and cyl being continuous as well, but because each only has a few values and the possible values cannot occur across a continuous spectrum, they are more appropriately classified as categorical.)

3. You could use either hwy or cty to measure fuel efficiency. The scatter plot below indicates that front-wheel drives generally have the best fuel efficiency but exhibit the most variation. Rear-wheel drives get the worst fuel efficiency and exhibit the least variation.

```r
ggplot(data = mpg) +
  geom_point(mapping = aes(x = drv, y = hwy))
```

4. The points are stacked vertically. This happens because drv is categorical. (We'll see in Section 1.6 that box plots are better ways to visualize a continuous vs. categorical relationship.)

5. In the scatter plot below, a dot just indicates that a car with that class-drv combination exists in the data set. For example, there's a car with class = 2seater and drv = r. While this is somewhat helpful, it doesn't convey a trend, which is what scatter plots are supposed to do.

```r
ggplot(data = mpg) +
  geom_point(mapping = aes(x = class, y = drv))
```

6. Scatter plots are most useful when they illustrate a trend in the data, i.e., a way that a change in one variable brings about a change in a related variable. Changes in variables are easier to think about and visualize when the variables are continuous.
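One way to sanity-check the categorical/continuous classification above is to count the distinct values per column (this sketch assumes ggplot2 is loaded, since it supplies the mpg data):

```r
library(ggplot2)

# Categorical variables have only a handful of distinct values
sapply(mpg, function(col) length(unique(col)))
```

Variables like drv (3 values) and year (2 values) clearly read as categorical, while cty and hwy take many values across a numeric range.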

## Section 1.2.1

1. The warning message indicates that you can only map a variable to shape if it has 6 or fewer values since looking at more than 6 shapes can be visually confusing. The 62 rows that were removed must be those for which class = suv since suv is the seventh value of class when listed alphabetically.

```r
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = class))
## Warning: The shape palette can deal with a maximum of 6 discrete values because more than 6
## becomes difficult to discriminate; you have 7. Consider specifying shapes manually
## if you must have them.
## Warning: Removed 62 rows containing missing values (geom_point).
```

2. One example would be to use drv instead of class since it only has 3 values:
```r
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = drv))
```

3. The scatter plot below shows what happens when the continuous variable cty is mapped to shape. The error message indicates that this is not an option.
```r
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = cty))
## Error in scale_f():
## ! A continuous variable can not be mapped to shape
```

4. First, let's map the continuous cty to size. We see that the size of the dots corresponds to the magnitude of the values of cty. The legend displays representative dot sizes.
```r
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, size = cty))
```

If we map the categorical drv to size, we get a warning message that indicates that size is not an appropriate aesthetic for categorical variables. This is because such variables usually have no meaningful notion of magnitude, so the size aesthetic can be misleading. For example, in the plot below, one might get the idea that rear-wheel drives are somehow "bigger" than front-wheel and 4-wheel drives.

```r
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, size = drv))
## Warning: Using size for a discrete variable is not advised.
```

5. Mapping the continuous cty to color, we see that a gradation of colors is assigned to the continuous interval of cty values.
```r
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = cty))
```

6. The scatter plot below maps class to color and drv to shape. The result is a visualization that is somewhat hard to read. Generally speaking, mapping to more than one extra aesthetic crams a little too much information into a visualization.

```r
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class, shape = drv))
```

7. The scatter plot below maps the drv variable to both color and shape. Mapping drv to two different aesthetics is a good way to further differentiate the scatter plot points that represent the different drv values.
```r
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = drv, shape = drv))
```

8. This technique gives us a way to visually separate a scatter plot according to a given condition. Below, we can visualize the relationship between displ and hwy separately for cars with fewer than 6 cylinders and those with 6 or more cylinders.

```r
ggplot(mpg) +
  geom_point(aes(displ, hwy, color = cyl < 6))
```

## Section 1.4.1

1. The available aesthetics are found in the documentation by entering ?geom_smooth. The visualizations below map drv to linetype and size.

```r
ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))
```

```r
ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, size = drv))
```

2. The plot below layers a single smooth trend curve over the colored scatter plot and adds informative labels:

```r
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point(mapping = aes(color = drv)) +
  geom_smooth(se = FALSE) +
  labs(title = "The Relationship Between Highway and City Fuel Efficiency",
       x = "City Fuel Efficiency (miles per gallon)",
       y = "Highway Fuel Efficiency (miles per gallon)",
       color = "Drive Train Type")
```

3. The visualization below shows that the color aesthetic should probably not be used with a categorical variable with several values since it leads to a very cluttered plot.
```r
ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, color = class))
```

## Section 1.6.1

1. Since carat and price are both continuous, a scatter plot is appropriate. Notice below that price is the y-axis variable, and carat is the x-axis variable. This is because carat is an independent variable (meaning that it’s not affected by the other variables) and price is a dependent variable (meaning that it is affected by the other variables).
```r
ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price))
```

2. Since clarity is categorical, a better way to visualize its effect on price is to use a box plot.
```r
ggplot(data = diamonds) +
  geom_boxplot(mapping = aes(x = clarity, y = price))
```

3. Judging by the median value of price, it looks like SI2 is the highest priced clarity value. You could also make an argument for VS2 or VS1 since they have the highest values for Q3. VS2 and VS1 clarities also seem to have the most variation in price since they have the largest interquartile ranges.

4. The fact that, for any given clarity value, all of the outliers lie above the main cluster indicates that the data set contains several diamonds with extremely high prices that are much larger than the median prices. This is often the case with valuable commodities - there are always several items whose values are extremely high compared to the median. (Think houses, baseball cards, cars, antiques, etc.)
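The outlier rule behind these box plots can be checked with base R's boxplot.stats(), which flags points lying more than 1.5 interquartile ranges beyond the hinges. (The vector here is illustrative, not the diamonds data.)

```r
# A small skewed sample: one extreme value well above the main cluster
prices <- c(300, 320, 350, 400, 450, 500, 5000)

# $out holds the points a box plot would draw as outlier dots
boxplot.stats(prices)$out
#> [1] 5000
```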

5. The fill aesthetic seems to be the best among the three at visually distinguishing the values of cut.

```r
ggplot(data = diamonds) +
  geom_boxplot(mapping = aes(x = color, y = price, color = cut))
```

```r
ggplot(data = diamonds) +
  geom_boxplot(mapping = aes(x = color, y = price, fill = cut))
```

```r
ggplot(data = diamonds) +
  geom_boxplot(mapping = aes(x = color, y = price, linetype = cut))
```

6. To order the clarity values by price, we can reorder the clarity categories by price and flip the coordinates:

```r
ggplot(data = diamonds) +
  geom_boxplot(mapping = aes(x = reorder(clarity, price), y = price)) +
  coord_flip()
```

## Section 1.7.1

1. Since clarity is categorical, we should use a bar graph.
```r
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = clarity))
```

2. To display proportions instead of counts on the y-axis:

```r
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = clarity, y = stat(prop), group = 1))
```

3. Since carat is continuous, we should now use a histogram.
```r
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat))
```

4. The first histogram below has bins that are way too narrow. The histogram is too granular to visualize the true distribution. The second one has bins which are way too wide, producing a histogram that is too coarse.
```r
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.01)
```

```r
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 2)
```

5. The following bar graph looks so spiky because it's producing a bar for every single value of price in the data set, of which there are hundreds, at least. Since price is continuous, a histogram would have been the way to visualize its distribution.
```r
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = price))
```

6. Mapping cut to fill produces a much better result than mapping to color.
```r
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = clarity, color = cut))
```

```r
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = clarity, fill = cut))
```

7. The optional position = "dodge" argument produces a separate bar for each cut value within a given clarity value, so it's easier to see how cut values are distributed within each clarity value since it allows you to compare relative counts more easily.
```r
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = clarity, fill = cut), position = "dodge")
```

8. Mapping cut to two different aesthetics is redundant, but doing so makes it easier to distinguish the bars for the various cut values. (See Exercise 7 from Section 1.2.1.)
```r
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = cut))
```

9. You get a pie chart, where the radius of each piece of pie corresponds to the count for that cut value.
```r
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = cut)) +
  coord_polar()
```

10. In the first bar graph below, the names of the manufacturer values are crowded together on the x-axis, making them hard to read. The second bar graph fixes this problem by adding a coord_flip() layer.
```r
ggplot(data = mpg) +
  geom_bar(mapping = aes(x = manufacturer))
```

```r
ggplot(data = mpg) +
  geom_bar(mapping = aes(x = manufacturer)) +
  coord_flip()
```

11. A histogram of carat with cut mapped to fill stacks the colored segments within each bar:

```r
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat, fill = cut))
```

12. A frequency polygon contains the same information as a histogram but displays it in a more minimal way.
```r
ggplot(data = diamonds) +
  geom_freqpoly(mapping = aes(x = price))
```

13. First, you may have noticed that mapping cut to fill, as was done in Exercise 11, is not as helpful in geom_freqpoly. Mapping to color produces a nice plot. It seems a little better than the histogram in Exercise 11 since it avoids the stacking of the colored segments of the bars.
```r
ggplot(data = diamonds) +
  geom_freqpoly(mapping = aes(x = carat, color = cut))
```

14. The bar graph below shows the distribution of car type broken down by drive train, with informative labels:

```r
ggplot(data = mpg) +
  geom_bar(mapping = aes(x = class, fill = drv)) +
  labs(x = "type of car",
       y = "number of cars",
       fill = "drive train",
       title = "Distribution of Car Type Broken Down by Drive Train")
```

## Section 2.1.1

1. There are 336,776 observations (rows) and 19 variables (columns).
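These counts can be confirmed directly; this assumes the nycflights13 package, which supplies the flights data set, is installed:

```r
library(nycflights13)

dim(flights)   # rows, then columns
#> [1] 336776     19
```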

2. The following filter() calls answer each part in turn:

```r
filter(flights, month == 2)
filter(flights, carrier == "UA" | carrier == "AA")
filter(flights, month >= 6 & month <= 8)
filter(flights, arr_delay > 120 & dep_delay <= 0)
filter(flights, dep_delay > 60 & dep_delay - arr_delay > 30)
filter(flights, dep_time < 600)
```
3. A canceled flight would have an NA for its departure time, so we can just count the number of NAs in this column as shown below:

```r
sum(is.na(flights$dep_time))
##  8255
```

4. We would have to find the number of flights with a 0 or negative arr_delay value and then divide that number by the total number of arriving flights Delta flew. The following shows a way to do this. (Recall that when you compute the mean of a column of TRUE/FALSE values, you're computing the proportion of TRUEs in the column.)

```r
# First create a data set containing the Delta flights that arrived at their destinations:
Delta_arr <- filter(flights, carrier == "DL" & !is.na(arr_delay))

# Now find the proportion of rows of this filtered data set with arr_delay <= 0:
mean(Delta_arr$arr_delay <= 0)
##  0.6556087
```
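The trick noted above, that the mean of a logical vector is the proportion of TRUEs, is easy to verify on its own, since TRUE/FALSE coerce to 1/0:

```r
on_time <- c(TRUE, TRUE, FALSE, TRUE)
mean(on_time)   # 3 TRUEs out of 4 values
#> [1] 0.75
```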

So Delta’s on-time arrival rate in 2013 was about 66%. For the winter on-time arrival rate, we can repeat this calculation after filtering Delta_arr down to just the winter flights.

```r
winter_Delta_arr <- filter(Delta_arr, month == 12 | month <= 2)

mean(winter_Delta_arr$arr_delay <= 0)
##  0.6546032
```

The on-time arrival rate is about 65%.

## Section 2.2.1

1. Let's sort the msleep data set by conservation and then View it:

```r
View(arrange(msleep, conservation))
```

Scrolling through the sorted data set, we see the rows with NA in the conservation column appear at the end. To move them to the top, we can do the following, remembering that is.na(conservation) will return a 1 for the rows with an NA in the conservation column and a 0 otherwise.

```r
arrange(msleep, desc(is.na(conservation)))
```

2. The longest delay was 1301 minutes.

```r
arrange(flights, desc(dep_delay))
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_d~1 dep_d~2 arr_t~3 sched~4 arr_d~5 carrier flight tailnum
##    <int> <int> <int>    <int>     <int>   <dbl>   <int>   <int>   <dbl> <chr>    <int> <chr>
##  1  2013     1     9      641       900    1301    1242    1530    1272 HA          51 N384HA
##  2  2013     6    15     1432      1935    1137    1607    2120    1127 MQ        3535 N504MQ
##  3  2013     1    10     1121      1635    1126    1239    1810    1109 MQ        3695 N517MQ
##  4  2013     9    20     1139      1845    1014    1457    2210    1007 AA         177 N338AA
##  5  2013     7    22      845      1600    1005    1044    1815     989 MQ        3075 N665MQ
##  6  2013     4    10     1100      1900     960    1342    2211     931 DL        2391 N959DL
##  7  2013     3    17     2321       810     911     135    1020     915 DL        2119 N927DA
##  8  2013     6    27      959      1900     899    1236    2226     850 DL        2007 N3762Y
##  9  2013     7    22     2257       759     898     121    1026     895 DL        2047 N6716C
## 10  2013    12     5      756      1700     896    1058    2020     878 AA         172 N5DMAA
## # ... with 336,766 more rows, 7 more variables: origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated variable
## #   names 1: sched_dep_time, 2: dep_delay, 3: arr_time, 4: sched_arr_time, 5: arr_delay
## # i Use print(n = ...) to see more rows, and colnames() to see all variable names
```

3. It looks like several flights departed at 12:01am.
```r
arrange(flights, dep_time)
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_d~1 dep_d~2 arr_t~3 sched~4 arr_d~5 carrier flight tailnum
##    <int> <int> <int>    <int>     <int>   <dbl>   <int>   <int>   <dbl> <chr>    <int> <chr>
##  1  2013     1    13        1      2249      72     108    2357      71 B6          22 N206JB
##  2  2013     1    31        1      2100     181     124    2225     179 WN         530 N550WN
##  3  2013    11    13        1      2359       2     442     440       2 B6        1503 N627JB
##  4  2013    12    16        1      2359       2     447     437      10 B6         839 N607JB
##  5  2013    12    20        1      2359       2     430     440     -10 B6        1503 N608JB
##  6  2013    12    26        1      2359       2     437     440      -3 B6        1503 N527JB
##  7  2013    12    30        1      2359       2     441     437       4 B6         839 N508JB
##  8  2013     2    11        1      2100     181     111    2225     166 WN         530 N231WN
##  9  2013     2    24        1      2245      76     121    2354      87 B6         608 N216JB
## 10  2013     3     8        1      2355       6     431     440      -9 B6         739 N586JB
## # ... with 336,766 more rows, 7 more variables: origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated variable
## #   names 1: sched_dep_time, 2: dep_delay, 3: arr_time, 4: sched_arr_time, 5: arr_delay
## # i Use print(n = ...) to see more rows, and colnames() to see all variable names
```

4. The fastest average speed was Delta flight 1499 from LGA to ATL on May 25.

```r
arrange(flights, desc(distance/air_time))
```

The slowest average speed was US Airways flight 1860 from LGA to PHL on January 28.

```r
arrange(flights, distance/air_time)
```

5. The Hawaiian Airlines flights from JFK to HNL flew the farthest distance of 4983 miles.

```r
arrange(flights, desc(distance))
```

The shortest flight in the data set was US Airways flight 1632 from EWR to LGA on July 27. (It was canceled.)

```r
arrange(flights, distance)
```

## Section 2.3.1

1. Checking the dest column of the transformed data set created below, we see that the most distant American Airlines destination is SFO.

```r
flights %>%
  filter(carrier == "AA") %>%
  arrange(desc(distance))
```

2. We can refine the filter from the previous exercise to include only flights with LGA as the origin airport. The most distant destination is DFW.
```r
flights %>%
  filter(carrier == "AA" & origin == "LGA") %>%
  arrange(desc(distance))
```

3. The most delayed winter flight was Hawaiian Airlines flight 51 from JFK to HNL on January 9.

```r
flights %>%
  filter(month == 12 | month <= 2) %>%
  arrange(desc(dep_delay))
```

The most delayed summer flight was Envoy Air flight 3535 from JFK to CMH on June 15.

```r
flights %>%
  filter(month >= 6 & month <= 8) %>%
  arrange(desc(dep_delay))
```

## Section 2.5.1

1. The fastest flight was Delta flight 1499 from LGA to ATL on May 25. Its average air speed was about 703 mph.

```r
flights %>%
  filter(!is.na(dep_time)) %>%   # <---- Filters out canceled flights
  mutate(air_speed = distance/air_time * 60) %>%
  arrange(desc(air_speed)) %>%
  select(month, day, carrier, flight, origin, dest, air_speed)
```

The slowest flight was US Airways flight 1860 from LGA to PHL on January 28. Its average air speed was about 77 mph.

```r
flights %>%
  filter(!is.na(dep_time)) %>%   # <---- Filters out canceled flights
  mutate(air_speed = distance/air_time * 60) %>%
  arrange(air_speed) %>%
  select(month, day, carrier, flight, origin, dest, air_speed)
```

2. The difference column below shows the values of the newly created flight_minutes variable minus the air_time variable already in flights. A little digging on the internet could tell us that air_time only includes the times during which the plane is actually in the air and does not include taxiing on the runway, etc. dep_time and arr_time are the times when the plane departs from and arrives at the gate, so flight_minutes includes time spent on the runway. This would mean flight_minutes should be consistently larger than air_time, producing all positive values in the difference column. However, this is not what we observe. If we look carefully at the flights with a negative difference value, we might notice that the destination airport is not in the Eastern Time Zone.
The arr_time values in the Central Time Zone are thus 60 minutes behind Eastern, Mountain Time Zone arr_times are 120 minutes behind, and Pacific Time Zone arr_times are 180 minutes behind. This explains the negative difference values and also why you would see, for example, a flight from Newark to San Francisco with a flight_minutes value of only 205.

```r
flights %>%
  mutate(dep_mins_midnight = (dep_time %/% 100)*60 + (dep_time %% 100),
         arr_mins_midnight = (arr_time %/% 100)*60 + (arr_time %% 100),
         flight_minutes = arr_mins_midnight - dep_mins_midnight,
         difference = flight_minutes - air_time) %>%
  select(month, day, origin, dest, air_time, flight_minutes, difference)
## # A tibble: 336,776 x 7
##    month   day origin dest  air_time flight_minutes difference
##    <int> <int> <chr>  <chr>    <dbl>          <dbl>      <dbl>
##  1     1     1 EWR    IAH        227            193        -34
##  2     1     1 LGA    IAH        227            197        -30
##  3     1     1 JFK    MIA        160            221         61
##  4     1     1 JFK    BQN        183            260         77
##  5     1     1 LGA    ATL        116            138         22
##  6     1     1 EWR    ORD        150            106        -44
##  7     1     1 EWR    FLL        158            198         40
##  8     1     1 LGA    IAD         53             72         19
##  9     1     1 JFK    MCO        140            161         21
## 10     1     1 LGA    ORD        138            115        -23
## # ... with 336,766 more rows
## # i Use print(n = ...) to see more rows
```

3. The first day of the year with a canceled flight was January 1.
```r
flights %>%
  mutate(status = ifelse(is.na(dep_time), "canceled", "not canceled")) %>%
  arrange(status)
## # A tibble: 336,776 x 20
##     year month   day dep_time sched_d~1 dep_d~2 arr_t~3 sched~4 arr_d~5 carrier flight tailnum
##    <int> <int> <int>    <int>     <int>   <dbl>   <int>   <int>   <dbl> <chr>    <int> <chr>
##  1  2013     1     1       NA      1630      NA      NA    1815      NA EV        4308 N18120
##  2  2013     1     1       NA      1935      NA      NA    2240      NA AA         791 N3EHAA
##  3  2013     1     1       NA      1500      NA      NA    1825      NA AA        1925 N3EVAA
##  4  2013     1     1       NA       600      NA      NA     901      NA B6         125 N618JB
##  5  2013     1     2       NA      1540      NA      NA    1747      NA EV        4352 N10575
##  6  2013     1     2       NA      1620      NA      NA    1746      NA EV        4406 N13949
##  7  2013     1     2       NA      1355      NA      NA    1459      NA EV        4434 N10575
##  8  2013     1     2       NA      1420      NA      NA    1644      NA EV        4935 N759EV
##  9  2013     1     2       NA      1321      NA      NA    1536      NA EV        3849 N13550
## 10  2013     1     2       NA      1545      NA      NA    1910      NA AA         133 <NA>
## # ... with 336,766 more rows, 8 more variables: origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, status <chr>, and abbreviated
## #   variable names 1: sched_dep_time, 2: dep_delay, 3: arr_time, 4: sched_arr_time,
## #   5: arr_delay
## # i Use print(n = ...) to see more rows, and colnames() to see all variable names
```

4. There were 9430 flights for which arr_status is NA. The reason that NA was assigned to arr_status for these flights is that the arr_delay value for these flights is NA, so the condition in the ifelse would be checking whether NA <= 0. The result of this condition is neither TRUE nor FALSE; it is just NA.

```r
flights_w_arr_status <- flights %>%
  mutate(arr_status = ifelse(arr_delay <= 0, "on time", "late"))

sum(is.na(flights_w_arr_status$arr_status))
##  9430
```
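The NA propagation described above can be reproduced on a single value, independent of the flights data:

```r
# Comparing NA to anything yields NA, not TRUE or FALSE...
NA <= 0
#> [1] NA

# ...and ifelse() passes that NA straight through
ifelse(NA <= 0, "on time", "late")
#> [1] NA
```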
5. We will rearrange the conditions so that the first one checked is arr_delay <= 0 and the second one is is.na(arr_delay). We can then find a row with an NA in the arr_delay column by sorting by is.na(arr_delay) in descending order.

The results below show that arr_status was not correctly assigned a “canceled” value for these flights. This is because the first condition in the nested ifelse is arr_delay <= 0. As we saw in the previous exercise, when the arr_delay value is NA, this first condition is neither TRUE nor FALSE, so neither of the conditional statements will be executed. Instead, a value of NA will be assigned to arr_status.

```r
flights_w_arr_status2 <- flights %>%
  mutate(arr_status = ifelse(arr_delay <= 0,
                             "on time",
                             ifelse(is.na(arr_delay),
                                    "canceled",
                                    "late"))) %>%
  arrange(desc(is.na(arr_delay))) %>%
  select(arr_delay, arr_status)

flights_w_arr_status2
## # A tibble: 336,776 x 2
##    arr_delay arr_status
##        <dbl> <chr>
##  1        NA <NA>
##  2        NA <NA>
##  3        NA <NA>
##  4        NA <NA>
##  5        NA <NA>
##  6        NA <NA>
##  7        NA <NA>
##  8        NA <NA>
##  9        NA <NA>
## 10        NA <NA>
## # ... with 336,766 more rows
## # i Use print(n = ...) to see more rows
```
6. We'll interchange the first two conditions as we did in the previous exercise and then again sort by is.na(arr_delay) in descending order to find the flights for which arr_delay is NA. We can see that arr_status is correctly assigned a canceled value. This works because case_when treats a condition that evaluates to NA as a non-match and falls through to the next condition, so the NA rows are still caught by the is.na(arr_delay) clause. The previous exercise showed that a nested ifelse instead propagates the NA, which makes case_when the more robust way to handle conditional statements like these.
```r
flights_w_arr_status3 <- flights %>%
  mutate(arr_status = case_when(arr_delay <= 0 ~ "on time",
                                is.na(arr_delay) ~ "canceled",
                                arr_delay > 0 ~ "delayed")) %>%
  arrange(desc(is.na(arr_delay))) %>%
  select(arr_delay, arr_status)

flights_w_arr_status3
## # A tibble: 336,776 x 2
##    arr_delay arr_status
##        <dbl> <chr>
##  1        NA canceled
##  2        NA canceled
##  3        NA canceled
##  4        NA canceled
##  5        NA canceled
##  6        NA canceled
##  7        NA canceled
##  8        NA canceled
##  9        NA canceled
## 10        NA canceled
## # ... with 336,766 more rows
## # i Use print(n = ...) to see more rows
```
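This fall-through behavior can be checked on a plain vector, independent of the flights data (assuming dplyr is loaded):

```r
library(dplyr)

delays <- c(-3, NA, 12)
# For the NA element, delays <= 0 evaluates to NA, which case_when
# treats as a non-match, so the is.na() clause catches it next.
case_when(delays <= 0 ~ "on time",
          is.na(delays) ~ "canceled",
          delays > 0 ~ "late")
#> [1] "on time"  "canceled" "late"
```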
7. The following case_when assigns each flight a season based on its month:

```r
flights %>%
  mutate(season = case_when(month >= 3 & month <= 5 ~ "spring",
                            month >= 6 & month <= 8 ~ "summer",
                            month >= 9 & month <= 11 ~ "fall",
                            TRUE ~ "winter"))
```
8. It might help to recall the distribution of price in diamonds by looking at its histogram shown below.

```r
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = price))
```

The histogram might suggest the following cutoffs: expensive for price < 5000, very expensive for 5000 <= price < 10000, insanely expensive for 10000 <= price < 15000, and priceless for price >= 15000. We can assign these labels with the following case_when:

```r
diamonds2 <- diamonds %>%
  mutate(category = case_when(price < 5000 ~ "expensive",
                              price >= 5000 & price < 10000 ~ "very expensive",
                              price >= 10000 & price < 15000 ~ "insanely expensive",
                              TRUE ~ "priceless"))
```

Finally, we can visualize the distribution with a bar graph:

```r
ggplot(data = diamonds2) +
  geom_bar(mapping = aes(x = category))
```

9. Batting has 110,495 observations and 22 variables.

10. Our results are surprising because there are several players with perfect batting averages. This is explained by the fact that these players had very few at bats.

```r
Batting %>%
  mutate(batting_average = H/AB) %>%
  arrange(desc(batting_average))
```
11. The highest single season batting average of 0.440 belonged to Hugh Duffy in 1894.

```r
Batting %>%
  filter(AB >= 350) %>%
  mutate(batting_average = H/AB) %>%
  arrange(desc(batting_average))
```
12. The highest batting average of the modern era belonged to Tony Gwynn, who hit 0.394 in 1994.

```r
Batting %>%
  filter(yearID >= 1947,
         AB >= 350) %>%
  mutate(batting_average = H/AB) %>%
  arrange(desc(batting_average))
```
13. The last player to hit 0.400 was Ted Williams in 1941.

```r
Batting %>%
  filter(AB >= 350) %>%
  mutate(batting_average = H/AB) %>%
  filter(batting_average >= 0.4) %>%
  arrange(desc(yearID))
```
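As a quick spot check of the H/AB formula, Ted Williams's well-known 1941 line was 185 hits in 456 at bats:

```r
H <- 185
AB <- 456
round(H / AB, 3)   # his famous .406 season
#> [1] 0.406
```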

## Section 2.6.1

1. Be sure to remove the NA values when computing the mean.

```r
flights %>%
  group_by(month, day) %>%
  summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE))
```
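A minimal illustration of why na.rm = TRUE matters here:

```r
delays <- c(4, 8, NA)

mean(delays)                 # a single NA poisons the mean
#> [1] NA

mean(delays, na.rm = TRUE)   # drop the NAs first
#> [1] 6
```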
2. The carrier with the best on-time rate was AS (Alaska Airlines) at about 73%. However, American Airlines and Delta Airlines should be recognized as well since they maintained comparably high on-time rates while flying tens of thousands more flights than Alaska Airlines.

The carrier with the worst on-time rate was FL (Airtran Airways) at about 40%. It should also be mentioned that Express Jet (EV) flew over 50,000 flights while compiling an on-time rate of just 52%.

```r
arr_rate <- flights %>%
  group_by(carrier) %>%
  summarize(on_time_rate = mean(arr_delay <= 0, na.rm = TRUE),
            count = n())

arrange(arr_rate, desc(on_time_rate))
## # A tibble: 16 x 3
##    carrier on_time_rate count
##    <chr>          <dbl> <int>
##  1 AS             0.733   714
##  2 HA             0.716   342
##  3 AA             0.665 32729
##  4 VX             0.659  5162
##  5 DL             0.656 48110
##  6 OO             0.655    32
##  7 US             0.629 20536
##  8 9E             0.616 18460
##  9 UA             0.615 58665
## 10 B6             0.563 54635
## 11 WN             0.560 12275
## 12 MQ             0.533 26397
## 13 YV             0.526   601
## 14 EV             0.521 54173
## 15 F9             0.424   685
## 16 FL             0.403  3260
```
```r
arrange(arr_rate, on_time_rate)
## # A tibble: 16 x 3
##    carrier on_time_rate count
##    <chr>          <dbl> <int>
##  1 FL             0.403  3260
##  2 F9             0.424   685
##  3 EV             0.521 54173
##  4 YV             0.526   601
##  5 MQ             0.533 26397
##  6 WN             0.560 12275
##  7 B6             0.563 54635
##  8 UA             0.615 58665
##  9 9E             0.616 18460
## 10 US             0.629 20536
## 11 OO             0.655    32
## 12 DL             0.656 48110
## 13 VX             0.659  5162
## 14 AA             0.665 32729
## 15 HA             0.716   342
## 16 AS             0.733   714
```
3. One way to answer this question is to compare the minimum air_time value to the average air_time value at each destination, sorting by the difference to find destinations whose minimums are far smaller than their averages.

It looks like there was a flight to Minneapolis-St. Paul that arrived almost 1 hour faster than the average flight to that destination. (It looks like this flight left the origin gate at 3:58PM EST and arrived at the destination gate at 5:45PM CST, which would mean 107 minutes gate-to-gate. The air_time value of 93 is thus probably not an entry error. The flight had a departure delay of 45 minutes and may have been trying to make up time in the air.)

```r
flights %>%
  filter(!is.na(air_time)) %>%
  group_by(dest) %>%
  summarize(min_air_time = min(air_time),
            avg_air_time = mean(air_time)) %>%
  arrange(desc(avg_air_time - min_air_time))
```
4. Express Jet flew to 61 different destinations.

```r
flights %>%
  group_by(carrier) %>%
  summarize(dist_dest = n_distinct(dest)) %>%
  arrange(desc(dist_dest))
```
5. Standard deviation is the statistic most often used to measure variation in a variable. The plane with tail number N76062 had the highest standard deviation in distances flown, at about 1796 miles.

```r
flights %>%
  filter(!is.na(tailnum)) %>%
  group_by(tailnum) %>%
  summarize(variation = sd(distance, na.rm = TRUE),
            count = n()) %>%
  arrange(desc(variation))
```
6. February 8 had the most cancellations at 472. There were 7 days with no cancellations, including, luckily, Thanksgiving and the day after Thanksgiving.

```r
cancellations <- flights %>%
  group_by(month, day) %>%
  summarize(flights_canceled = sum(is.na(dep_time)))

arrange(cancellations, desc(flights_canceled))
## # A tibble: 365 x 3
## # Groups:   month
##    month   day flights_canceled
##    <int> <int>            <int>
##  1     2     8              472
##  2     2     9              393
##  3     5    23              221
##  4    12    10              204
##  5     9    12              192
##  6     3     6              180
##  7     3     8              180
##  8    12     5              158
##  9    12    14              125
## 10     6    28              123
## # ... with 355 more rows
## # i Use print(n = ...) to see more rows
```
```r
arrange(cancellations, flights_canceled)
## # A tibble: 365 x 3
## # Groups:   month
##    month   day flights_canceled
##    <int> <int>            <int>
##  1     4    21                0
##  2     5    17                0
##  3     5    26                0
##  4    10     5                0
##  5    10    20                0
##  6    11    28                0
##  7    11    29                0
##  8     1     6                1
##  9     1    19                1
## 10     2    16                1
## # ... with 355 more rows
## # i Use print(n = ...) to see more rows
```
7. First, create a data set containing the highest single-season home run total for each year:

```r
HR_by_year <- Batting %>%
  group_by(yearID) %>%
  summarize(max_HR = max(HR))
```
8. The huge jump around 1920 is explained by the arrival of Babe Ruth. Home run totals remained very high in his aftermath, showing how dramatically he changed the game. You might also recognize some high spikes during the PED-fueled years of Mark McGwire, Sammy Sosa, and Barry Bonds, and some low spikes during strike- or COVID-shortened seasons.

```r
ggplot(data = HR_by_year, mapping = aes(x = yearID, y = max_HR)) +
  geom_point() +
  geom_line()
```

9. The following computes career batting averages for players with at least 3000 career at bats:

```r
Batting %>%
  group_by(playerID) %>%
  summarize(at_bats = sum(AB, na.rm = TRUE),
            hits = sum(H, na.rm = TRUE),
            batting_avg = hits/at_bats) %>%
  filter(at_bats >= 3000) %>%
  arrange(desc(batting_avg))
## # A tibble: 1,774 x 4
##    playerID  at_bats  hits batting_avg
##    <chr>       <int> <int>       <dbl>
##  1 cobbty01    11436  4189       0.366
##  2 hornsro01    8173  2930       0.358
##  3 jacksjo01    4981  1772       0.356
##  4 odoulle01    3264  1140       0.349
##  5 delahed01    7510  2597       0.346
##  6 speaktr01   10195  3514       0.345
##  7 hamilbi01    6283  2164       0.344
##  8 willite01    7706  2654       0.344
##  9 broutda01    6726  2303       0.342
## 10 ruthba01     8398  2873       0.342
## # ... with 1,764 more rows
## # i Use print(n = ...) to see more rows
```