Exercise Solutions

Section 1.1.1

  1. You should expect these variables to be strongly correlated, with the points clustered tightly around a line.
ggplot(data = mpg) +
  geom_point(mapping = aes(x = cty, y = hwy))

  2. manufacturer, model, year, cyl, trans, drv, fl, and class are categorical. The others are continuous. (You could make an argument for year and cyl being continuous as well, but because each has only a few values and the possible values cannot occur across a continuous spectrum, they are more appropriately classified as categorical.)
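
One quick way to double-check this classification is to count each column’s distinct values; columns with only a handful of possible values are usually best treated as categorical:

library(ggplot2)  # provides the mpg data set
library(dplyr)    # provides n_distinct()

# Count the number of distinct values in each column of mpg
sapply(mpg, n_distinct)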

  3. You could use either hwy or cty to measure fuel efficiency. The scatter plot below indicates that front-wheel drives generally have the best fuel efficiency but exhibit the most variation. Four-wheel drives generally get the worst fuel efficiency, while rear-wheel drives exhibit the least variation.

ggplot(data = mpg) +
  geom_point(mapping = aes(x = drv, y = hwy))

  4. The points are stacked vertically. This happens because drv is categorical. (We’ll see in Section 1.6 that box plots are a better way to visualize a continuous vs. categorical relationship.)
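
For comparison, here is the box plot version of the same relationship, previewing Section 1.6:

ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = drv, y = hwy))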

  5. In the scatter plot below, a dot just indicates that a car with that class-drv combination exists in the data set. For example, there’s a car with class = 2seater and drv = r. While this is somewhat helpful, it doesn’t convey a trend, which is what scatter plots are supposed to do.

ggplot(data = mpg) +
  geom_point(mapping = aes(x = class, y = drv))

  6. Scatter plots are most useful when they illustrate a trend in the data, i.e., a way that a change in one variable brings about a change in a related variable. Changes in variables are easier to think about and visualize when the variables are continuous.

Section 1.2.1

  1. The warning message indicates that you can map a variable to shape only if it has 6 or fewer values, since more than 6 shapes become difficult to distinguish. The 62 rows that were removed must be those for which class = suv, since suv is the seventh value of class when listed alphabetically (the count after the plot confirms this).
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = class))
## Warning: The shape palette can deal with a maximum of 6 discrete values because more than 6
## becomes difficult to discriminate; you have 7. Consider specifying shapes manually
## if you must have them.
## Warning: Removed 62 rows containing missing values (geom_point).
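
We can confirm that claim by counting the rows for each class value:

library(dplyr)

# suv is the seventh of the seven class values alphabetically, with 62 rows
count(mpg, class)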

  2. One example would be to use drv instead of class since it only has 3 values:
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = drv))

  3. The code below attempts to map the continuous variable cty to shape. The error message indicates that this is not an option.
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = cty))
## Error in `scale_f()`:
## ! A continuous variable can not be mapped to shape

  4. First, let’s map the continuous cty to size. We see that the size of the dots corresponds to the magnitude of the values of cty. The legend displays representative dot sizes.
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, size = cty))

If we map the categorical drv to size, we get a warning message that indicates that size is not an appropriate aesthetic for categorical variables. This is because such variables usually have no meaningful notion of magnitude, so the size aesthetic can be misleading. For example, in the plot below, one might get the idea that rear-wheel drives are somehow “bigger” than front-wheel and 4-wheel drives.

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, size = drv))
## Warning: Using size for a discrete variable is not advised.

  5. Mapping the continuous cty to color, we see that a gradation of colors is assigned to the continuous interval of cty values.
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = cty))

  6. The scatter plot below maps class to color and drv to shape. The result is a visualization that is somewhat hard to read. Generally speaking, mapping to more than one extra aesthetic crams a little too much information into a visualization.
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class, shape = drv))

  7. The scatter plot below maps the drv variable to both color and shape. Mapping drv to two different aesthetics is a good way to further differentiate the scatter plot points that represent the different drv values.
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = drv, shape = drv))

  8. This technique gives us a way to visually separate a scatter plot according to a given condition. Below, we can visualize the relationship between displ and hwy separately for cars with fewer than 6 cylinders and those with 6 or more cylinders.
ggplot(mpg) +
  geom_point(aes(displ, hwy, color = cyl < 6))

Section 1.4.1

  1. The available aesthetics are found in the documentation by entering ?geom_smooth. The visualizations below map drv to linetype and size.
ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))

ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, size = drv))

  2. The plot below layers a smooth line over a scatter plot of hwy vs. cty, colors the points by drv, and uses labs() to add a title and descriptive axis and legend labels:
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point(mapping = aes(color = drv)) +
  geom_smooth(se = FALSE) +
  labs(title = "The Relationship Between Highway and City Fuel Efficiency",
       x = "City Fuel Efficiency (miles per gallon)",
       y = "Highway Fuel Efficiency (miles per gallon)",
       color = "Drive Train Type")

  3. The visualization below shows that the color aesthetic should probably not be used with a categorical variable with several values since it leads to a very cluttered plot.
ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, color = class))

Section 1.6.1

  1. Since carat and price are both continuous, a scatter plot is appropriate. Notice below that price is the y-axis variable, and carat is the x-axis variable. This is because carat is an independent variable (meaning that it’s not affected by the other variables) and price is a dependent variable (meaning that it is affected by the other variables).
ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price))

  2. Since clarity is categorical, a better way to visualize its effect on price is to use a box plot.
ggplot(data = diamonds) +
  geom_boxplot(mapping = aes(x = clarity, y = price))

  3. Judging by the median value of price, it looks like SI2 is the highest priced clarity value. You could also make an argument for VS2 or VS1 since they have the highest values for Q3. VS2 and VS1 clarities also seem to have the most variation in price since they have the largest interquartile ranges.
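
We can back up this visual judgment numerically by computing the median and interquartile range of price for each clarity value:

library(dplyr)

# Summarize the center and spread of price within each clarity value
diamonds %>%
  group_by(clarity) %>%
  summarize(median_price = median(price),
            iqr_price = IQR(price))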

  4. The fact that, for any given clarity value, all of the outliers lie above the main cluster indicates that the data set contains several diamonds with extremely high prices that are much larger than the median prices. This is often the case with valuable commodities: there are always a few items whose values are extremely high compared to the median. (Think houses, baseball cards, cars, antiques, etc.)

  5. The fill aesthetic seems to be the best of the three at visually distinguishing the values of cut.

ggplot(data = diamonds) +
  geom_boxplot(mapping = aes(x = color, y = price, color = cut))

ggplot(data = diamonds) +
  geom_boxplot(mapping = aes(x = color, y = price, fill = cut))

ggplot(data = diamonds) +
  geom_boxplot(mapping = aes(x = color, y = price, linetype = cut))

  6. We can use reorder() to sort the clarity boxes by their mean price (reorder()’s default summary) and add coord_flip() to make the category labels easier to read:
ggplot(data = diamonds) +
  geom_boxplot(mapping = aes(x = reorder(clarity, price), y = price)) +
  coord_flip()

Section 1.7.1

  1. Since clarity is categorical, we should use a bar graph.
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = clarity))

  2. To display proportions rather than counts, we can map stat(prop) to y:
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = clarity, y = stat(prop), group = 1))

  3. Since carat is continuous, we should now use a histogram.
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat))

  4. The first histogram below has bins that are way too narrow. The histogram is too granular to visualize the true distribution. The second one has bins that are way too wide, producing a histogram that is too coarse.
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.01)

ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 2)

  5. The following bar graph looks so spiky because it produces a bar for every distinct value of price in the data set, of which there are thousands (see the count after the plot). Since price is continuous, a histogram is the right way to visualize its distribution.
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = price))
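
A quick way to count the distinct price values:

library(dplyr)

n_distinct(diamonds$price)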

  6. Mapping cut to fill produces a much better result than mapping to color.
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = clarity, color = cut))

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = clarity, fill = cut))

  7. The optional position = "dodge" argument produces a separate bar for each cut value within a given clarity value. This makes it easier to see how cut is distributed within each clarity value, since you can compare the counts side by side.
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = clarity, fill = cut), position = "dodge")

  8. Mapping cut to two different aesthetics is redundant, but doing so makes it easier to distinguish the bars for the various cut values. (See Exercise 7 from Section 1.2.1.)
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = cut))

  9. You get a polar version of the bar graph, resembling a pie chart, in which each cut value gets an equal angular slice and the radius of each slice corresponds to the count for that cut value.
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = cut)) +
  coord_polar()

  10. In the first bar graph below, the names of the manufacturer values are crowded together on the x-axis, making them hard to read. The second bar graph fixes this problem by adding a coord_flip() layer.
ggplot(data = mpg) +
  geom_bar(mapping = aes(x = manufacturer))

ggplot(data = mpg) +
  geom_bar(mapping = aes(x = manufacturer)) +
  coord_flip()

  11. Mapping cut to fill breaks each histogram bar down by cut:
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat, fill = cut))

  12. A frequency polygon contains the same information as a histogram but displays it in a more minimal way.
ggplot(data = diamonds) +
  geom_freqpoly(mapping = aes(x = price))

  13. First, you may have noticed that mapping cut to fill, as was done in Exercise 11, is not as helpful with geom_freqpoly. Mapping cut to color instead produces a nice plot. It seems a little better than the histogram in Exercise 11 since it avoids stacking the colored segments of the bars.
ggplot(data = diamonds) +
  geom_freqpoly(mapping = aes(x = carat, color = cut))

  14. The bar graph below visualizes the distribution of car type broken down by drive train, with descriptive labels added via labs():
ggplot(data = mpg) +
  geom_bar(mapping = aes(x = class, fill = drv)) +
  labs(x = "type of car",
       y = "number of cars",
       fill = "drive train",
       title = "Distribution of Car Type Broken Down by Drive Train")

Section 2.1.1

  1. There are 336,776 observations (rows) and 19 variables (columns), as the dim() check below confirms. The filter() statements that follow select the subsets of flights described in the exercise:
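
Assuming the nycflights13 package is loaded to provide flights:

library(nycflights13)

dim(flights)
## [1] 336776     19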

    filter(flights, month == 2)
    filter(flights, carrier == "UA" | carrier == "AA")
    filter(flights, month >= 6 & month <= 8)
    filter(flights, arr_delay > 120 & dep_delay <= 0)
    filter(flights, dep_delay > 60 & dep_delay - arr_delay > 30)
    filter(flights, dep_time < 600)
  2. A canceled flight would have an NA for its departure time, so we can just count the number of NAs in this column as shown below:

sum(is.na(flights$dep_time))
## [1] 8255
  3. We would have to find the number of flights with a 0 or negative arr_delay value and then divide that number by the total number of arriving flights Delta flew. The following shows one way to do this. (Recall that when you compute the mean of a column of TRUE/FALSE values, you’re computing the proportion of TRUEs in the column; see the small example at the end of this solution.)
# First create a data set containing the Delta flights that arrived at their destinations:
Delta_arr <- filter(flights, carrier == "DL" & !is.na(arr_delay))

# Now find the percentage of rows of this filtered data set with `arr_delay <= 0`:
mean(Delta_arr$arr_delay <= 0)
## [1] 0.6556087

So Delta’s on-time arrival rate in 2013 was about 66%. For the winter on-time arrival rate, we can repeat this calculation after filtering Delta_arr down to just the winter flights.

winter_Delta_arr <- filter(Delta_arr, month == 12 | month <= 2)

mean(winter_Delta_arr$arr_delay <= 0)
## [1] 0.6546032

The on-time arrival rate is about 65%.
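
As an aside, here is the mean-of-logicals trick in isolation:

# TRUE counts as 1 and FALSE counts as 0, so the mean is the proportion of TRUEs
mean(c(TRUE, FALSE, TRUE, TRUE))
## [1] 0.75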

Section 2.2.1

  1. Let’s sort the msleep data set by conservation and then View it:
View(arrange(msleep, conservation))

Scrolling through the sorted data set, we see the rows with NA in the conservation column appear at the end. To move them to the top, we can sort in descending order by is.na(conservation), which returns TRUE (sorted as 1) for the rows with an NA in the conservation column and FALSE (sorted as 0) otherwise.

arrange(msleep, desc(is.na(conservation)))
  2. The longest delay was 1301 minutes.
arrange(flights, desc(dep_delay))
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_d~1 dep_d~2 arr_t~3 sched~4 arr_d~5 carrier flight tailnum
##    <int> <int> <int>    <int>     <int>   <dbl>   <int>   <int>   <dbl> <chr>    <int> <chr>  
##  1  2013     1     9      641       900    1301    1242    1530    1272 HA          51 N384HA 
##  2  2013     6    15     1432      1935    1137    1607    2120    1127 MQ        3535 N504MQ 
##  3  2013     1    10     1121      1635    1126    1239    1810    1109 MQ        3695 N517MQ 
##  4  2013     9    20     1139      1845    1014    1457    2210    1007 AA         177 N338AA 
##  5  2013     7    22      845      1600    1005    1044    1815     989 MQ        3075 N665MQ 
##  6  2013     4    10     1100      1900     960    1342    2211     931 DL        2391 N959DL 
##  7  2013     3    17     2321       810     911     135    1020     915 DL        2119 N927DA 
##  8  2013     6    27      959      1900     899    1236    2226     850 DL        2007 N3762Y 
##  9  2013     7    22     2257       759     898     121    1026     895 DL        2047 N6716C 
## 10  2013    12     5      756      1700     896    1058    2020     878 AA         172 N5DMAA 
## # ... with 336,766 more rows, 7 more variables: origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated variable
## #   names 1: sched_dep_time, 2: dep_delay, 3: arr_time, 4: sched_arr_time, 5: arr_delay
## # i Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
  3. It looks like several flights departed at 12:01am.
arrange(flights, dep_time)
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_d~1 dep_d~2 arr_t~3 sched~4 arr_d~5 carrier flight tailnum
##    <int> <int> <int>    <int>     <int>   <dbl>   <int>   <int>   <dbl> <chr>    <int> <chr>  
##  1  2013     1    13        1      2249      72     108    2357      71 B6          22 N206JB 
##  2  2013     1    31        1      2100     181     124    2225     179 WN         530 N550WN 
##  3  2013    11    13        1      2359       2     442     440       2 B6        1503 N627JB 
##  4  2013    12    16        1      2359       2     447     437      10 B6         839 N607JB 
##  5  2013    12    20        1      2359       2     430     440     -10 B6        1503 N608JB 
##  6  2013    12    26        1      2359       2     437     440      -3 B6        1503 N527JB 
##  7  2013    12    30        1      2359       2     441     437       4 B6         839 N508JB 
##  8  2013     2    11        1      2100     181     111    2225     166 WN         530 N231WN 
##  9  2013     2    24        1      2245      76     121    2354      87 B6         608 N216JB 
## 10  2013     3     8        1      2355       6     431     440      -9 B6         739 N586JB 
## # ... with 336,766 more rows, 7 more variables: origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated variable
## #   names 1: sched_dep_time, 2: dep_delay, 3: arr_time, 4: sched_arr_time, 5: arr_delay
## # i Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
  4. The flight with the fastest average speed was Delta flight 1499 from LGA to ATL on May 25.
arrange(flights, desc(distance/air_time))

The flight with the slowest average speed was US Airways flight 1860 from LGA to PHL on January 28.

arrange(flights, distance/air_time)
  5. The Hawaiian Airlines flights from JFK to HNL flew the farthest distance of 4983 miles.
arrange(flights, desc(distance))

The shortest flight in the data set was US Airways flight 1632 from EWR to LGA on July 27. (It was canceled.)

arrange(flights, distance)

Section 2.3.1

  1. Checking the dest column of the transformed data set created below, we see that the most distant American Airlines destination is SFO.
flights %>%
  filter(carrier == "AA") %>%
  arrange(desc(distance))
  2. We can refine the filter from the previous exercise to include only flights with LGA as the origin airport. The most distant destination is DFW.
flights %>%
  filter(carrier == "AA" & origin == "LGA") %>%
  arrange(desc(distance))
  3. The most delayed winter flight was Hawaiian Airlines flight 51 from JFK to HNL on January 9.
flights %>%
  filter(month == 12 | month <= 2) %>%
  arrange(desc(dep_delay))

The most delayed summer flight was Envoy Air flight 3535 from JFK to CMH on June 15.

flights %>%
  filter(month >= 6 & month <= 8) %>%
  arrange(desc(dep_delay))

Section 2.5.1

  1. The fastest flight was Delta flight 1499 from LGA to ATL on May 25. Its average air speed was about 703 mph.
flights %>%
  filter(!is.na(dep_time)) %>%  # <---- Filters out canceled flights
  mutate(air_speed = distance/air_time * 60) %>%
  arrange(desc(air_speed)) %>%
  select(month, day, carrier, flight, origin, dest, air_speed)

The slowest flight was US Airways flight 1860 from LGA to PHL on January 28. Its average air speed was about 77 mph.

flights %>%
  filter(!is.na(dep_time)) %>%  # <---- Filters out canceled flights
  mutate(air_speed = distance/air_time * 60) %>%
  arrange(air_speed) %>%
  select(month, day, carrier, flight, origin, dest, air_speed)
  2. The difference column below shows the values of the newly created flight_minutes variable minus the air_time variable already in flights. A little digging on the internet tells us that air_time includes only the time during which the plane is actually in the air, not taxiing on the runway, etc. dep_time and arr_time are the times when the plane departs from and arrives at the gate, so flight_minutes includes time spent on the runway. This would mean flight_minutes should be consistently larger than air_time, producing all positive values in the difference column.

However, this is not what we observe. If we look carefully at the flights with a negative difference value, we might notice that the destination airport is not in the Eastern Time Zone. The arr_time values in the Central Time Zone are thus 60 minutes behind Eastern, Mountain Time Zone arr_times are 120 minutes behind, and Pacific Time Zones are 180 minutes behind. This explains the negative difference values and also why you would see, for example, a flight from Newark to San Francisco with a flight_minutes value of only 205.

flights %>%
  mutate(dep_mins_midnight = (dep_time %/% 100)*60 + (dep_time %% 100),
         arr_mins_midnight = (arr_time %/% 100)*60 + (arr_time %% 100),
         flight_minutes = arr_mins_midnight - dep_mins_midnight,
         difference = flight_minutes - air_time) %>%
  select(month, day, origin, dest, air_time, flight_minutes, difference)
## # A tibble: 336,776 x 7
##    month   day origin dest  air_time flight_minutes difference
##    <int> <int> <chr>  <chr>    <dbl>          <dbl>      <dbl>
##  1     1     1 EWR    IAH        227            193        -34
##  2     1     1 LGA    IAH        227            197        -30
##  3     1     1 JFK    MIA        160            221         61
##  4     1     1 JFK    BQN        183            260         77
##  5     1     1 LGA    ATL        116            138         22
##  6     1     1 EWR    ORD        150            106        -44
##  7     1     1 EWR    FLL        158            198         40
##  8     1     1 LGA    IAD         53             72         19
##  9     1     1 JFK    MCO        140            161         21
## 10     1     1 LGA    ORD        138            115        -23
## # ... with 336,766 more rows
## # i Use `print(n = ...)` to see more rows
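
The %/% and %% trick used above converts a clock time like 517 (i.e., 5:17am) into minutes after midnight:

517 %/% 100  # integer division extracts the hour
## [1] 5
517 %% 100   # the remainder extracts the minutes
## [1] 17
5*60 + 17    # so 517 corresponds to 317 minutes after midnight
## [1] 317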
  3. The first day of the year with a canceled flight was January 1.
flights %>%
  mutate(status = ifelse(is.na(dep_time), "canceled", "not canceled")) %>%
  arrange(status)
## # A tibble: 336,776 x 20
##     year month   day dep_time sched_d~1 dep_d~2 arr_t~3 sched~4 arr_d~5 carrier flight tailnum
##    <int> <int> <int>    <int>     <int>   <dbl>   <int>   <int>   <dbl> <chr>    <int> <chr>  
##  1  2013     1     1       NA      1630      NA      NA    1815      NA EV        4308 N18120 
##  2  2013     1     1       NA      1935      NA      NA    2240      NA AA         791 N3EHAA 
##  3  2013     1     1       NA      1500      NA      NA    1825      NA AA        1925 N3EVAA 
##  4  2013     1     1       NA       600      NA      NA     901      NA B6         125 N618JB 
##  5  2013     1     2       NA      1540      NA      NA    1747      NA EV        4352 N10575 
##  6  2013     1     2       NA      1620      NA      NA    1746      NA EV        4406 N13949 
##  7  2013     1     2       NA      1355      NA      NA    1459      NA EV        4434 N10575 
##  8  2013     1     2       NA      1420      NA      NA    1644      NA EV        4935 N759EV 
##  9  2013     1     2       NA      1321      NA      NA    1536      NA EV        3849 N13550 
## 10  2013     1     2       NA      1545      NA      NA    1910      NA AA         133 <NA>   
## # ... with 336,766 more rows, 8 more variables: origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, status <chr>, and abbreviated
## #   variable names 1: sched_dep_time, 2: dep_delay, 3: arr_time, 4: sched_arr_time,
## #   5: arr_delay
## # i Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
  4. There were 9430 flights for which arr_status is NA. The reason NA was assigned to arr_status for these flights is that their arr_delay value is NA, so the condition in the ifelse checks whether NA <= 0. This comparison evaluates to neither TRUE nor FALSE; it is simply NA, as the small example after the output shows.
flights_w_arr_status <- flights %>%
  mutate(arr_status = ifelse(arr_delay <= 0, 
                             "on time", 
                             "late"))

sum(is.na(flights_w_arr_status$arr_status))
## [1] 9430
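
The NA behavior in isolation:

NA <= 0                             # comparing NA to anything yields NA
## [1] NA
ifelse(NA <= 0, "on time", "late")  # so ifelse() returns NA, not a label
## [1] NA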
  5. We will rearrange the conditions so that the first one checked is arr_delay <= 0 and the second one is is.na(arr_delay). We can then find the rows with an NA in the arr_delay column by sorting by is.na(arr_delay) in descending order.

The results below show that arr_status was not correctly assigned a “canceled” value for these flights. This is because the first condition in the nested ifelse is arr_delay <= 0. As we saw in the previous exercise, when the arr_delay value is NA, this condition evaluates to NA rather than TRUE or FALSE, so ifelse returns NA without using either branch. A value of NA is therefore assigned to arr_status.

flights_w_arr_status2 <- flights %>%
  mutate(arr_status = ifelse(arr_delay <= 0, 
                             "on time",
                             ifelse(is.na(arr_delay),
                                    "canceled",
                                    "late"))) %>%
  arrange(desc(is.na(arr_delay))) %>%
  select(arr_delay, arr_status)

flights_w_arr_status2
## # A tibble: 336,776 x 2
##    arr_delay arr_status
##        <dbl> <chr>     
##  1        NA <NA>      
##  2        NA <NA>      
##  3        NA <NA>      
##  4        NA <NA>      
##  5        NA <NA>      
##  6        NA <NA>      
##  7        NA <NA>      
##  8        NA <NA>      
##  9        NA <NA>      
## 10        NA <NA>      
## # ... with 336,766 more rows
## # i Use `print(n = ...)` to see more rows
  6. We’ll interchange the first two conditions as we did in the previous exercise and then again sort by is.na(arr_delay) in descending order to find the flights for which arr_delay is NA. We can see that arr_status is now correctly assigned a canceled value. The difference is that when a case_when condition evaluates to NA, case_when treats it as a non-match and simply moves on to the next condition, so the NA rows fall through to is.na(arr_delay). A nested ifelse, as the previous exercise showed, propagates the NA instead. This makes case_when a more robust way to handle conditional statements when the data contain NAs. (Note that order still matters in case_when: each row gets the label from the first condition it satisfies.)
flights_w_arr_status3 <- flights %>%
  mutate(arr_status = case_when(arr_delay <= 0 ~ "on time",
                                is.na(arr_delay) ~ "canceled",
                                arr_delay > 0 ~ "delayed")) %>%
  arrange(desc(is.na(arr_delay))) %>%
  select(arr_delay, arr_status)

flights_w_arr_status3
## # A tibble: 336,776 x 2
##    arr_delay arr_status
##        <dbl> <chr>     
##  1        NA canceled  
##  2        NA canceled  
##  3        NA canceled  
##  4        NA canceled  
##  5        NA canceled  
##  6        NA canceled  
##  7        NA canceled  
##  8        NA canceled  
##  9        NA canceled  
## 10        NA canceled  
## # ... with 336,766 more rows
## # i Use `print(n = ...)` to see more rows
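
The fall-through behavior in isolation:

library(dplyr)

# The first condition evaluates to NA, so case_when skips it and the
# second condition catches the value
case_when(NA <= 0 ~ "on time",
          is.na(NA) ~ "canceled")
## [1] "canceled"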
  7. We can assign a season to each flight with a case_when on month:
flights %>%
  mutate(season = case_when(month >= 3 & month <= 5 ~ "spring",
                            month >= 6 & month <= 8 ~ "summer",
                            month >= 9 & month <= 11 ~ "fall",
                            TRUE ~ "winter"))
  8. It might help to recall the distribution of price in diamonds by looking at its histogram shown below.
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = price))

The histogram might suggest the following cutoffs: expensive for price < 5000, very expensive for 5000 <= price < 10000, insanely expensive for 10000 <= price < 15000 and priceless for price >= 15000. We can assign these labels with the following case_when:

diamonds2 <- diamonds %>%
  mutate(category = case_when(price < 5000 ~ "expensive",
                              price >= 5000 & price < 10000 ~ "very expensive",
                              price >= 10000 & price < 15000 ~ "insanely expensive",
                              TRUE ~ "priceless"))

Finally, we can visualize the distribution with a bar graph:

ggplot(data = diamonds2) +
  geom_bar(mapping = aes(x = category))

  9. Batting has 110,495 observations and 22 variables.
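
Assuming the Lahman package is loaded to provide the Batting data (the exact counts depend on the package version):

library(Lahman)

dim(Batting)
## [1] 110495     22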

  10. Our results are surprising because there are several players with perfect batting averages. This is explained by the fact that these players had very few at bats.

Batting %>%
  mutate(batting_average = H/AB) %>%
  arrange(desc(batting_average))
  11. The highest single-season batting average of 0.440 belonged to Hugh Duffy in 1894.
Batting %>%
  filter(AB >= 350) %>%
  mutate(batting_average = H/AB) %>%
  arrange(desc(batting_average))
  12. The highest batting average of the modern era belonged to Tony Gwynn, who hit 0.394 in 1994.
Batting %>%
  filter(yearID >= 1947,
         AB >= 350) %>%
  mutate(batting_average = H/AB) %>%
  arrange(desc(batting_average))
  13. The last player to hit 0.400 was Ted Williams in 1941.
Batting %>%
  filter(AB >= 350) %>%
  mutate(batting_average = H/AB) %>%
  filter(batting_average >= 0.4) %>%
  arrange(desc(yearID))

Section 2.6.1

  1. Be sure to remove the NA values when computing the mean.
flights %>%
  group_by(month, day) %>%
  summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE))
  2. The carrier with the best on-time rate was AS (Alaska Airlines) at about 73%. However, American Airlines and Delta should be recognized as well, since they maintained comparably high on-time rates while flying tens of thousands more flights than Alaska Airlines.

The carrier with the worst on-time rate was FL (AirTran Airways) at about 40%. It should also be mentioned that ExpressJet (EV) flew over 50,000 flights while compiling an on-time rate of just 52%.

arr_rate <- flights %>%
  group_by(carrier) %>%
  summarize(on_time_rate = mean(arr_delay <= 0, na.rm = TRUE),
            count = n())

arrange(arr_rate, desc(on_time_rate))
## # A tibble: 16 x 3
##    carrier on_time_rate count
##    <chr>          <dbl> <int>
##  1 AS             0.733   714
##  2 HA             0.716   342
##  3 AA             0.665 32729
##  4 VX             0.659  5162
##  5 DL             0.656 48110
##  6 OO             0.655    32
##  7 US             0.629 20536
##  8 9E             0.616 18460
##  9 UA             0.615 58665
## 10 B6             0.563 54635
## 11 WN             0.560 12275
## 12 MQ             0.533 26397
## 13 YV             0.526   601
## 14 EV             0.521 54173
## 15 F9             0.424   685
## 16 FL             0.403  3260
arrange(arr_rate, on_time_rate)
## # A tibble: 16 x 3
##    carrier on_time_rate count
##    <chr>          <dbl> <int>
##  1 FL             0.403  3260
##  2 F9             0.424   685
##  3 EV             0.521 54173
##  4 YV             0.526   601
##  5 MQ             0.533 26397
##  6 WN             0.560 12275
##  7 B6             0.563 54635
##  8 UA             0.615 58665
##  9 9E             0.616 18460
## 10 US             0.629 20536
## 11 OO             0.655    32
## 12 DL             0.656 48110
## 13 VX             0.659  5162
## 14 AA             0.665 32729
## 15 HA             0.716   342
## 16 AS             0.733   714
  3. One way we can answer this question is by comparing the minimum air_time value to the average air_time value at each destination and sorting by the difference to find destinations whose minimums are far smaller than their averages.

It looks like there was a flight to Minneapolis-St. Paul whose air_time was almost 1 hour shorter than that of the average flight to that destination. (It looks like this flight left the origin gate at 3:58PM EST and arrived at the destination gate at 5:45PM CST, i.e., 6:45PM EST, which would mean 167 minutes gate-to-gate. The air_time value of 93 is thus probably not an entry error. The flight had a departure delay of 45 minutes and may have been trying to make up time in the air.)

flights %>%
  filter(!is.na(air_time)) %>%
  group_by(dest) %>%
  summarize(min_air_time = min(air_time),
            avg_air_time = mean(air_time)) %>%
  arrange(desc(avg_air_time - min_air_time))
  4. ExpressJet flew to 61 different destinations.
flights %>%
  group_by(carrier) %>%
  summarize(dist_dest = n_distinct(dest)) %>%
  arrange(desc(dist_dest))
  5. Standard deviation is the statistic most often used to measure variation in a variable. The plane with tail number N76062 had the highest standard deviation in the distances it flew, at about 1796 miles.
flights %>%
  filter(!is.na(tailnum)) %>%
  group_by(tailnum) %>%
  summarize(variation = sd(distance, na.rm = TRUE),
            count = n()) %>%
  arrange(desc(variation))
  6. February 8 had the most cancellations at 472. There were 7 days with no cancellations, including, luckily, Thanksgiving and the day after Thanksgiving.
cancellations <- flights %>%
  group_by(month, day) %>%
  summarize(flights_canceled = sum(is.na(dep_time)))

arrange(cancellations, desc(flights_canceled))
## # A tibble: 365 x 3
## # Groups:   month [12]
##    month   day flights_canceled
##    <int> <int>            <int>
##  1     2     8              472
##  2     2     9              393
##  3     5    23              221
##  4    12    10              204
##  5     9    12              192
##  6     3     6              180
##  7     3     8              180
##  8    12     5              158
##  9    12    14              125
## 10     6    28              123
## # ... with 355 more rows
## # i Use `print(n = ...)` to see more rows
arrange(cancellations, flights_canceled)
## # A tibble: 365 x 3
## # Groups:   month [12]
##    month   day flights_canceled
##    <int> <int>            <int>
##  1     4    21                0
##  2     5    17                0
##  3     5    26                0
##  4    10     5                0
##  5    10    20                0
##  6    11    28                0
##  7    11    29                0
##  8     1     6                1
##  9     1    19                1
## 10     2    16                1
## # ... with 355 more rows
## # i Use `print(n = ...)` to see more rows
  7. The code below finds the highest single-season home run total for each year and plots it over time. The huge jump around 1920 is explained by the arrival of Babe Ruth. Home run totals remained very high in his aftermath, showing how dramatically he changed the game. You might also recognize some high spikes during the PED-fueled years of Mark McGwire, Sammy Sosa, and Barry Bonds, and some low spikes during strike- or COVID-shortened seasons.
HR_by_year <- Batting %>%
  group_by(yearID) %>%
  summarize(max_HR = max(HR))

ggplot(data = HR_by_year, mapping = aes(x = yearID, y = max_HR)) +
  geom_point() +
  geom_line()

  8. Among players with at least 3000 career at bats, the highest career batting average of about 0.366 belongs to Ty Cobb:
Batting %>%
  group_by(playerID) %>%
  summarize(at_bats = sum(AB, na.rm = TRUE),
            hits = sum(H, na.rm = TRUE),
            batting_avg = hits/at_bats) %>%
  filter(at_bats >= 3000) %>%
  arrange(desc(batting_avg))
## # A tibble: 1,774 x 4
##    playerID  at_bats  hits batting_avg
##    <chr>       <int> <int>       <dbl>
##  1 cobbty01    11436  4189       0.366
##  2 hornsro01    8173  2930       0.358
##  3 jacksjo01    4981  1772       0.356
##  4 odoulle01    3264  1140       0.349
##  5 delahed01    7510  2597       0.346
##  6 speaktr01   10195  3514       0.345
##  7 hamilbi01    6283  2164       0.344
##  8 willite01    7706  2654       0.344
##  9 broutda01    6726  2303       0.342
## 10 ruthba01     8398  2873       0.342
## # ... with 1,764 more rows
## # i Use `print(n = ...)` to see more rows