Exercise Solutions
Section 1.1.1
- You should expect these variables to be very closely correlated, with the points clustered tightly around a line.
ggplot(data = mpg) +
geom_point(mapping = aes(x = cty, y = hwy))
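As an optional numerical check (not part of the exercise), the correlation coefficient confirms the tight linear relationship:
cor(mpg$cty, mpg$hwy)   # roughly 0.96 for the mpg data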
- manufacturer, model, year, cyl, trans, drv, fl, and class are categorical. The others are continuous. (You could make an argument for year and cyl being continuous as well, but because each only has a few values and the possible values cannot occur across a continuous spectrum, they are more appropriately classified as categorical.)
- You could use either hwy or cty to measure fuel efficiency. The scatter plot below indicates that front-wheel drives generally have the best fuel efficiency but exhibit the most variation. Rear-wheel drives get the worst fuel efficiency and exhibit the least variation.
ggplot(data = mpg) +
geom_point(mapping = aes(x = drv, y = hwy))
- The points are stacked vertically. This happens because drv is categorical. (We’ll see in Section 1.6 that box plots are better ways to visualize a continuous vs. categorical relationship.)
- In the scatter plot below, a dot just indicates that a car with that class-drv combination exists in the data set. For example, there’s a car with class = 2seater and drv = r. While this is somewhat helpful, it doesn’t convey a trend, which is what scatter plots are supposed to do.
ggplot(data = mpg) +
geom_point(mapping = aes(x = class, y = drv))
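If you do want the plot to convey how many cars share each combination, one option (not required by the exercise) is geom_count(), which sizes each point by the number of observations at that position:
ggplot(data = mpg) +
  geom_count(mapping = aes(x = class, y = drv))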
- Scatter plots are most useful when they illustrate a trend in the data, i.e., a way that a change in one variable brings about a change in a related variable. Changes in variables are easier to think about and visualize when the variables are continuous.
Section 1.2.1
- The warning message indicates that you can only map a variable to shape if it has 6 or fewer values, since looking at more than 6 shapes can be visually confusing. The 62 rows that were removed must be those for which class = suv, since suv is the seventh value of class when listed alphabetically.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = class))
## Warning: The shape palette can deal with a maximum of 6 discrete values because more than 6
## becomes difficult to discriminate; you have 7. Consider specifying shapes manually
## if you must have them.
## Warning: Removed 62 rows containing missing values (geom_point).
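As a quick check (an optional addition), counting the suv rows in mpg confirms that these are the 62 dropped rows:
sum(mpg$class == "suv")
## [1] 62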
- One example would be to use drv instead of class, since it only has 3 values:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = drv))
- The code below attempts to map the continuous variable cty to shape. The error message indicates that this is not an option.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = cty))
## Error in `scale_f()`:
## ! A continuous variable can not be mapped to shape
- First, let’s map the continuous cty to size. We see that the size of the dots corresponds to the magnitude of the cty values. The legend displays representative dot sizes.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, size = cty))
If we map the categorical drv to size, we get a warning message indicating that size is not an appropriate aesthetic for categorical variables. This is because such variables usually have no meaningful notion of magnitude, so the size aesthetic can be misleading. For example, in the plot below, one might get the idea that rear-wheel drives are somehow “bigger” than front-wheel and 4-wheel drives.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, size = drv))
## Warning: Using size for a discrete variable is not advised.
- Mapping the continuous cty to color, we see that a gradation of colors is assigned to the continuous interval of cty values.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = cty))
- The scatter plot below maps class to color and drv to shape. The result is a visualization that is somewhat hard to read. Generally speaking, mapping to more than one extra aesthetic crams a little too much information into a visualization.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class, shape = drv))
- The scatter plot below maps the drv variable to both color and shape. Mapping drv to two different aesthetics is a good way to further differentiate the scatter plot points that represent the different drv values.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = drv, shape = drv))
- This technique gives us a way to visually separate a scatter plot according to a given condition. Below, we can visualize the relationship between displ and hwy separately for cars with fewer than 6 cylinders and those with 6 or more cylinders.
ggplot(mpg) +
geom_point(aes(displ, hwy, color = cyl<6))
Section 1.4.1
- The available aesthetics are found in the documentation by entering ?geom_smooth. The visualizations below map drv to linetype and size.
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, size = drv))
- The plot below colors the points by drv, overlays a single smooth trend line without its confidence band, and adds informative labels:
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point(mapping = aes(color = drv)) +
geom_smooth(se = FALSE) +
labs(title = "The Relationship Between Highway and City Fuel Efficiency",
x = "City Fuel Efficiency (miles per gallon)",
y = "Highway Fuel Efficiency (miles per gallon)",
color = "Drive Train Type")
- The visualization below shows that the color aesthetic should probably not be used with a categorical variable that has several values, since doing so leads to a very cluttered plot.
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, color = class))
Section 1.6.1
- Since carat and price are both continuous, a scatter plot is appropriate. Notice below that price is the y-axis variable and carat is the x-axis variable. This is because carat is an independent variable (meaning that it’s not affected by the other variables) and price is a dependent variable (meaning that it is affected by the other variables).
ggplot(data = diamonds) +
geom_point(mapping = aes(x = carat, y = price))
- Since clarity is categorical, a better way to visualize its effect on price is to use a box plot.
ggplot(data = diamonds) +
geom_boxplot(mapping = aes(x = clarity, y = price))
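As an optional numerical check on the interpretations that follow, base R’s tapply() computes the median price for each clarity value:
tapply(diamonds$price, diamonds$clarity, median)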
- Judging by the median value of price, it looks like SI2 is the highest priced clarity value. You could also make an argument for VS2 or VS1 since they have the highest values for Q3. VS2 and VS1 clarities also seem to have the most variation in price since they have the largest interquartile ranges.
- The fact that, for any given clarity value, all of the outliers lie above the main cluster indicates that the data set contains several diamonds with extremely high prices that are much larger than the median prices. This is often the case with valuable commodities: there are always several items whose values are extremely high compared to the median. (Think houses, baseball cards, cars, antiques, etc.)
- The fill aesthetic seems to be the best among the three at visually distinguishing the values of cut.
ggplot(data = diamonds) +
geom_boxplot(mapping = aes(x = color, y = price, color = cut))
ggplot(data = diamonds) +
geom_boxplot(mapping = aes(x = color, y = price, fill = cut))
ggplot(data = diamonds) +
geom_boxplot(mapping = aes(x = color, y = price, linetype = cut))
- The box plots below are ordered by each clarity value’s average price (reorder()’s default summary statistic), with the axes flipped to make the labels easier to read:
ggplot(data = diamonds) +
geom_boxplot(mapping = aes(x = reorder(clarity, price), y = price)) +
coord_flip()
Section 1.7.1
- Since clarity is categorical, we should use a bar graph.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = clarity))
- To display proportions rather than counts, we can map y to stat(prop):
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = clarity, y = stat(prop), group = 1))
- Since carat is continuous, we should now use a histogram.
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = carat))
- The first histogram below has bins that are way too narrow. The histogram is too granular to visualize the true distribution. The second one has bins which are way too wide, producing a histogram that is too coarse.
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = carat), binwidth = 0.01)
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = carat), binwidth = 2)
- The following bar graph looks so spiky because it produces a bar for every single value of price in the data set, of which there are hundreds at least. Since price is continuous, a histogram (like the one sketched after this bar graph) would have been the way to visualize its distribution.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = price))
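For comparison, here is that histogram; the binwidth of 500 is an arbitrary but reasonable choice:
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = price), binwidth = 500)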
- Mapping cut to fill produces a much better result than mapping it to color.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = clarity, color = cut))
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = clarity, fill = cut))
- Using the optional position = "dodge" argument produces a separate bar for each cut value within a given clarity value. This makes it easier to see how cut values are distributed within each clarity value, since you can compare relative counts directly.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = clarity, fill = cut), position = "dodge")
- Mapping cut to two different aesthetics is redundant, but doing so makes it easier to distinguish the bars for the various cut values. (See Exercise 7 from Section 1.2.1.)
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut))
- You get a pie-chart-like polar bar graph (sometimes called a coxcomb), where the radius, rather than the angle, of each wedge corresponds to the count for that cut value.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut)) +
coord_polar()
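As an optional aside, a conventional pie chart, where the angle encodes the count, can be built from a single stacked bar by mapping the y axis to the angle:
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = factor(1), fill = cut), width = 1) +
  coord_polar(theta = "y")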
- In the first bar graph below, the names of the manufacturer values are crowded together on the x-axis, making them hard to read. The second bar graph fixes this problem by adding a coord_flip() layer.
ggplot(data = mpg) +
geom_bar(mapping = aes(x = manufacturer))
ggplot(data = mpg) +
geom_bar(mapping = aes(x = manufacturer)) +
coord_flip()
- Mapping cut to fill shades each histogram bar by the counts of the cut values within that bin:
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = carat, fill = cut))
- A frequency polygon contains the same information as a histogram but displays it in a more minimal way.
ggplot(data = diamonds) +
geom_freqpoly(mapping = aes(x = price))
- First, you may have noticed that mapping cut to fill, as was done in Exercise 11, is not as helpful in geom_freqpoly. Mapping to color produces a nice plot. It seems a little better than the histogram in Exercise 11 since it avoids the stacking of the colored segments of the bars.
ggplot(data = diamonds) +
geom_freqpoly(mapping = aes(x = carat, color = cut))
- The labs() layer below adds a title along with more readable axis and legend labels:
ggplot(data = mpg) +
geom_bar(mapping = aes(x = class, fill = drv)) +
labs(x = "type of car",
y = "number of cars",
fill = "drive train",
title = "Distribution of Car Type Broken Down by Drive Train")
Section 2.1.1
There are 336,776 observations (rows) and 19 variables (columns).
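These numbers come from the dimensions of the data set:
dim(flights)
## [1] 336776     19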
filter(flights, month == 2)
filter(flights, carrier == "UA" | carrier == "AA")
filter(flights, month >= 6 & month <= 8)
filter(flights, arr_delay > 120 & dep_delay <= 0)
filter(flights, dep_delay > 60 & dep_delay - arr_delay > 30)
filter(flights, dep_time < 600)
- A canceled flight would have an NA for its departure time, so we can just count the number of NAs in this column as shown below:
sum(is.na(flights$dep_time))
## [1] 8255
- We would have to find the number of flights with a zero or negative arr_delay value and then divide that number by the total number of arriving flights Delta flew. The following shows a way to do this. (Recall that when you compute the mean of a column of TRUE/FALSE values, you’re computing the proportion of TRUEs in the column.)
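Here is a tiny illustration of that fact:
mean(c(TRUE, TRUE, FALSE, TRUE))
## [1] 0.75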
# First create a data set containing the Delta flights that arrived at their destinations:
Delta_arr <- filter(flights, carrier == "DL" & !is.na(arr_delay))

# Now find the proportion of rows of this filtered data set with `arr_delay <= 0`:
mean(Delta_arr$arr_delay <= 0)
## [1] 0.6556087
So Delta’s on-time arrival rate in 2013 was about 66%. For the winter on-time arrival rate, we can repeat this calculation after filtering Delta_arr down to just the winter flights.
winter_Delta_arr <- filter(Delta_arr, month == 12 | month <= 2)
mean(winter_Delta_arr$arr_delay <= 0)
## [1] 0.6546032
The on-time arrival rate is about 65%.
Section 2.2.1
- Let’s sort the msleep data set by conservation and then View it:
View(arrange(msleep, conservation))
Scrolling through the sorted data set, we see the rows with NA in the conservation column appear at the end. To move them to the top, we can do the following, remembering that is.na(conservation) returns TRUE (treated as 1 when sorting) for the rows with an NA in the conservation column and FALSE (treated as 0) otherwise.
arrange(msleep, desc(is.na(conservation)))
- The longest delay was 1301 minutes.
arrange(flights, desc(dep_delay))
## # A tibble: 336,776 x 19
## year month day dep_time sched_d~1 dep_d~2 arr_t~3 sched~4 arr_d~5 carrier flight tailnum
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr>
## 1 2013 1 9 641 900 1301 1242 1530 1272 HA 51 N384HA
## 2 2013 6 15 1432 1935 1137 1607 2120 1127 MQ 3535 N504MQ
## 3 2013 1 10 1121 1635 1126 1239 1810 1109 MQ 3695 N517MQ
## 4 2013 9 20 1139 1845 1014 1457 2210 1007 AA 177 N338AA
## 5 2013 7 22 845 1600 1005 1044 1815 989 MQ 3075 N665MQ
## 6 2013 4 10 1100 1900 960 1342 2211 931 DL 2391 N959DL
## 7 2013 3 17 2321 810 911 135 1020 915 DL 2119 N927DA
## 8 2013 6 27 959 1900 899 1236 2226 850 DL 2007 N3762Y
## 9 2013 7 22 2257 759 898 121 1026 895 DL 2047 N6716C
## 10 2013 12 5 756 1700 896 1058 2020 878 AA 172 N5DMAA
## # ... with 336,766 more rows, 7 more variables: origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated variable
## # names 1: sched_dep_time, 2: dep_delay, 3: arr_time, 4: sched_arr_time, 5: arr_delay
## # i Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
- It looks like several flights departed at 12:01am.
arrange(flights, dep_time)
## # A tibble: 336,776 x 19
## year month day dep_time sched_d~1 dep_d~2 arr_t~3 sched~4 arr_d~5 carrier flight tailnum
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr>
## 1 2013 1 13 1 2249 72 108 2357 71 B6 22 N206JB
## 2 2013 1 31 1 2100 181 124 2225 179 WN 530 N550WN
## 3 2013 11 13 1 2359 2 442 440 2 B6 1503 N627JB
## 4 2013 12 16 1 2359 2 447 437 10 B6 839 N607JB
## 5 2013 12 20 1 2359 2 430 440 -10 B6 1503 N608JB
## 6 2013 12 26 1 2359 2 437 440 -3 B6 1503 N527JB
## 7 2013 12 30 1 2359 2 441 437 4 B6 839 N508JB
## 8 2013 2 11 1 2100 181 111 2225 166 WN 530 N231WN
## 9 2013 2 24 1 2245 76 121 2354 87 B6 608 N216JB
## 10 2013 3 8 1 2355 6 431 440 -9 B6 739 N586JB
## # ... with 336,766 more rows, 7 more variables: origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated variable
## # names 1: sched_dep_time, 2: dep_delay, 3: arr_time, 4: sched_arr_time, 5: arr_delay
## # i Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
- The fastest average speed was Delta flight 1499 from LGA to ATL on May 25.
arrange(flights, desc(distance/air_time))
The slowest average speed was US Airways flight 1860 from LGA to PHL on January 28.
arrange(flights, distance/air_time)
- The Hawaiian Airlines flights from JFK to HNL flew the farthest distance of 4983 miles.
arrange(flights, desc(distance))
The shortest flight in the data set was US Airways flight 1632 from EWR to LGA on July 27. (It was canceled.)
arrange(flights, distance)
Section 2.3.1
- Checking the dest column of the transformed data set created below, we see that the most distant American Airlines destination is SFO.
flights %>%
  filter(carrier == "AA") %>%
  arrange(desc(distance))
- We can refine the filter from the previous exercise to include only flights with LGA as the origin airport. The most distant destination is DFW.
flights %>%
  filter(carrier == "AA" & origin == "LGA") %>%
  arrange(desc(distance))
- The most delayed winter flight was Hawaiian Airlines flight 51 from JFK to HNL on January 9.
flights %>%
  filter(month == 12 | month <= 2) %>%
  arrange(desc(dep_delay))
The most delayed summer flight was Envoy Air flight 3535 from JFK to CMH on June 15.
flights %>%
  filter(month >= 6 & month <= 8) %>%
  arrange(desc(dep_delay))
Section 2.5.1
- The fastest flight was Delta flight 1499 from LGA to ATL on May 25. Its average air speed was about 703 mph.
flights %>%
  filter(!is.na(dep_time)) %>%                    # filters out canceled flights
  mutate(air_speed = distance/air_time * 60) %>%  # miles per minute times 60 gives mph
  arrange(desc(air_speed)) %>%
  select(month, day, carrier, flight, origin, dest, air_speed)
The slowest flight was US Airways flight 1860 from LGA to PHL on January 28. Its average air speed was about 77 mph.
flights %>%
  filter(!is.na(dep_time)) %>%   # filters out canceled flights
  mutate(air_speed = distance/air_time * 60) %>%
  arrange(air_speed) %>%
  select(month, day, carrier, flight, origin, dest, air_speed)
- The difference column below shows the values of the newly created flight_minutes variable minus the air_time variable already in flights. A little digging on the internet could tell us that air_time only includes the time during which the plane is actually in the air and does not include taxiing on the runway, etc. dep_time and arr_time are the times when the plane departs from and arrives at the gate, so flight_minutes includes time spent on the runway. This would mean flight_minutes should be consistently larger than air_time, producing all positive values in the difference column.
However, this is not what we observe. If we look carefully at the flights with a negative difference value, we might notice that the destination airport is not in the Eastern Time Zone. The arr_time values in the Central Time Zone are thus 60 minutes behind Eastern, Mountain Time Zone arr_time values are 120 minutes behind, and Pacific Time Zone values are 180 minutes behind. This explains the negative difference values and also why you would see, for example, a flight from Newark to San Francisco with a flight_minutes value of only 205.
flights %>%
  mutate(dep_mins_midnight = (dep_time %/% 100)*60 + (dep_time %% 100),  # %/% 100 extracts the hour, %% 100 the minutes
         arr_mins_midnight = (arr_time %/% 100)*60 + (arr_time %% 100),
         flight_minutes = arr_mins_midnight - dep_mins_midnight,
         difference = flight_minutes - air_time) %>%
  select(month, day, origin, dest, air_time, flight_minutes, difference)
## # A tibble: 336,776 x 7
## month day origin dest air_time flight_minutes difference
## <int> <int> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1 1 EWR IAH 227 193 -34
## 2 1 1 LGA IAH 227 197 -30
## 3 1 1 JFK MIA 160 221 61
## 4 1 1 JFK BQN 183 260 77
## 5 1 1 LGA ATL 116 138 22
## 6 1 1 EWR ORD 150 106 -44
## 7 1 1 EWR FLL 158 198 40
## 8 1 1 LGA IAD 53 72 19
## 9 1 1 JFK MCO 140 161 21
## 10 1 1 LGA ORD 138 115 -23
## # ... with 336,766 more rows
## # i Use `print(n = ...)` to see more rows
- The first day of the year with a canceled flight was January 1.
flights %>%
  mutate(status = ifelse(is.na(dep_time), "canceled", "not canceled")) %>%
  arrange(status)   # "canceled" sorts before "not canceled" alphabetically
## # A tibble: 336,776 x 20
## year month day dep_time sched_d~1 dep_d~2 arr_t~3 sched~4 arr_d~5 carrier flight tailnum
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr>
## 1 2013 1 1 NA 1630 NA NA 1815 NA EV 4308 N18120
## 2 2013 1 1 NA 1935 NA NA 2240 NA AA 791 N3EHAA
## 3 2013 1 1 NA 1500 NA NA 1825 NA AA 1925 N3EVAA
## 4 2013 1 1 NA 600 NA NA 901 NA B6 125 N618JB
## 5 2013 1 2 NA 1540 NA NA 1747 NA EV 4352 N10575
## 6 2013 1 2 NA 1620 NA NA 1746 NA EV 4406 N13949
## 7 2013 1 2 NA 1355 NA NA 1459 NA EV 4434 N10575
## 8 2013 1 2 NA 1420 NA NA 1644 NA EV 4935 N759EV
## 9 2013 1 2 NA 1321 NA NA 1536 NA EV 3849 N13550
## 10 2013 1 2 NA 1545 NA NA 1910 NA AA 133 <NA>
## # ... with 336,766 more rows, 8 more variables: origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, status <chr>, and abbreviated
## # variable names 1: sched_dep_time, 2: dep_delay, 3: arr_time, 4: sched_arr_time,
## # 5: arr_delay
## # i Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
- There were 9430 flights for which arr_status is NA. The reason that NA was assigned to arr_status for these flights is that their arr_delay value is NA, so the condition in the ifelse would be checking whether NA <= 0. The result of this condition is neither TRUE nor FALSE; it is just NA.
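You can verify this at the console:
NA <= 0
## [1] NA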
flights_w_arr_status <- flights %>%
  mutate(arr_status = ifelse(arr_delay <= 0,
                             "on time",
                             "late"))
sum(is.na(flights_w_arr_status$arr_status))
## [1] 9430
- We will rearrange the conditions so that the first one checked is arr_delay <= 0 and the second one is is.na(arr_delay). We can then find the rows with an NA in the arr_delay column by sorting by is.na(arr_delay) in descending order.
The results below show that arr_status was not correctly assigned a “canceled” value for these flights. This is because the first condition in the nested ifelse is arr_delay <= 0. As we saw in the previous exercise, when the arr_delay value is NA, this first condition is neither TRUE nor FALSE, so neither branch of the ifelse is executed. Instead, a value of NA is assigned to arr_status.
flights_w_arr_status2 <- flights %>%
  mutate(arr_status = ifelse(arr_delay <= 0,
                             "on time",
                             ifelse(is.na(arr_delay),
                                    "canceled",
                                    "late"))) %>%
  arrange(desc(is.na(arr_delay))) %>%
  select(arr_delay, arr_status)
flights_w_arr_status2
## # A tibble: 336,776 x 2
## arr_delay arr_status
## <dbl> <chr>
## 1 NA <NA>
## 2 NA <NA>
## 3 NA <NA>
## 4 NA <NA>
## 5 NA <NA>
## 6 NA <NA>
## 7 NA <NA>
## 8 NA <NA>
## 9 NA <NA>
## 10 NA <NA>
## # ... with 336,766 more rows
## # i Use `print(n = ...)` to see more rows
- We’ll interchange the first two conditions as we did in the previous exercise and then again sort by is.na(arr_delay) in descending order to find the flights for which arr_delay is NA. We can see that arr_status is correctly assigned a canceled value. This works because a case_when condition that evaluates to NA is simply treated as unmet, and evaluation moves on to the next condition; a nested ifelse, as the previous exercise showed, instead propagates the NA. This makes case_when statements a more robust way to handle these conditional assignments.
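The contrast is visible in one-line examples (assuming dplyr is loaded for case_when()):
ifelse(NA <= 0, "on time", "late")         # the NA condition makes the result NA
## [1] NA
case_when(NA ~ "on time", TRUE ~ "late")   # an NA condition is treated as unmet
## [1] "late"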
flights_w_arr_status3 <- flights %>%
  mutate(arr_status = case_when(arr_delay <= 0 ~ "on time",
                                is.na(arr_delay) ~ "canceled",
                                arr_delay > 0 ~ "delayed")) %>%
  arrange(desc(is.na(arr_delay))) %>%
  select(arr_delay, arr_status)
flights_w_arr_status3
## # A tibble: 336,776 x 2
## arr_delay arr_status
## <dbl> <chr>
## 1 NA canceled
## 2 NA canceled
## 3 NA canceled
## 4 NA canceled
## 5 NA canceled
## 6 NA canceled
## 7 NA canceled
## 8 NA canceled
## 9 NA canceled
## 10 NA canceled
## # ... with 336,766 more rows
## # i Use `print(n = ...)` to see more rows
- The following case_when assigns a season to each flight based on its month:
flights %>%
  mutate(season = case_when(month >= 3 & month <= 5 ~ "spring",
                            month >= 6 & month <= 8 ~ "summer",
                            month >= 9 & month <= 11 ~ "fall",
                            TRUE ~ "winter"))
- It might help to recall the distribution of price in diamonds by looking at its histogram, shown below.
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = price))
The histogram might suggest the following cutoffs: expensive for price < 5000, very expensive for 5000 <= price < 10000, insanely expensive for 10000 <= price < 15000, and priceless for price >= 15000. We can assign these labels with the following case_when:
diamonds2 <- diamonds %>%
  mutate(category = case_when(price < 5000 ~ "expensive",
                              price >= 5000 & price < 10000 ~ "very expensive",
                              price >= 10000 & price < 15000 ~ "insanely expensive",
                              TRUE ~ "priceless"))
Finally, we can visualize the distribution with a bar graph:
ggplot(data = diamonds2) +
geom_bar(mapping = aes(x = category))
- Batting has 110,495 observations and 22 variables.
- Our results are surprising because there are several players with perfect batting averages. This is explained by the fact that these players had very few at bats.
Batting %>%
  mutate(batting_average = H/AB) %>%
  arrange(desc(batting_average))
- The highest single season batting average of 0.440 belonged to Hugh Duffy in 1894.
Batting %>%
  filter(AB >= 350) %>%
  mutate(batting_average = H/AB) %>%
  arrange(desc(batting_average))
- The highest batting average of the modern era belonged to Tony Gwynn, who hit 0.394 in 1994.
Batting %>%
  filter(yearID >= 1947,
         AB >= 350) %>%
  mutate(batting_average = H/AB) %>%
  arrange(desc(batting_average))
- The last player to hit 0.400 was Ted Williams in 1941.
Batting %>%
  filter(AB >= 350) %>%
  mutate(batting_average = H/AB) %>%
  filter(batting_average >= 0.4) %>%
  arrange(desc(yearID))
Section 2.6.1
- Be sure to remove the NA values when computing the mean.
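A quick illustration of why: a single NA makes the whole mean NA unless it is removed.
mean(c(10, 20, NA))
## [1] NA
mean(c(10, 20, NA), na.rm = TRUE)
## [1] 15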
flights %>%
  group_by(month, day) %>%
  summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE))
- The carrier with the best on-time rate was AS (Alaska Airlines) at about 73%. However, American Airlines and Delta Airlines should be recognized as well since they maintained comparably high on-time rates while flying tens of thousands more flights than Alaska Airlines.
The carrier with the worst on-time rate was FL (Airtran Airways) at about 40%. It should also be mentioned that Express Jet (EV) flew over 50,000 flights while compiling an on-time rate of just 52%.
arr_rate <- flights %>%
  group_by(carrier) %>%
  summarize(on_time_rate = mean(arr_delay <= 0, na.rm = TRUE),
            count = n())

arrange(arr_rate, desc(on_time_rate))
## # A tibble: 16 x 3
## carrier on_time_rate count
## <chr> <dbl> <int>
## 1 AS 0.733 714
## 2 HA 0.716 342
## 3 AA 0.665 32729
## 4 VX 0.659 5162
## 5 DL 0.656 48110
## 6 OO 0.655 32
## 7 US 0.629 20536
## 8 9E 0.616 18460
## 9 UA 0.615 58665
## 10 B6 0.563 54635
## 11 WN 0.560 12275
## 12 MQ 0.533 26397
## 13 YV 0.526 601
## 14 EV 0.521 54173
## 15 F9 0.424 685
## 16 FL 0.403 3260
arrange(arr_rate, on_time_rate)
## # A tibble: 16 x 3
## carrier on_time_rate count
## <chr> <dbl> <int>
## 1 FL 0.403 3260
## 2 F9 0.424 685
## 3 EV 0.521 54173
## 4 YV 0.526 601
## 5 MQ 0.533 26397
## 6 WN 0.560 12275
## 7 B6 0.563 54635
## 8 UA 0.615 58665
## 9 9E 0.616 18460
## 10 US 0.629 20536
## 11 OO 0.655 32
## 12 DL 0.656 48110
## 13 VX 0.659 5162
## 14 AA 0.665 32729
## 15 HA 0.716 342
## 16 AS 0.733 714
- One way we can answer this question is by comparing the minimum air_time value to the average air_time value at each destination and sorting by the difference to find destinations with minimums far smaller than their averages.
It looks like there was a flight to Minneapolis-St. Paul that arrived almost 1 hour faster than the average flight to that destination. (This flight apparently left the origin gate at 3:58 PM EST and arrived at the destination gate at 5:45 PM CST, which would mean 107 minutes gate-to-gate. The air_time value of 93 is thus probably not an entry error. The flight had a departure delay of 45 minutes and may have been trying to make up time in the air.)
flights %>%
  filter(!is.na(air_time)) %>%
  group_by(dest) %>%
  summarize(min_air_time = min(air_time),
            avg_air_time = mean(air_time)) %>%
  arrange(desc(avg_air_time - min_air_time))
- Express Jet flew to 61 different destinations.
flights %>%
  group_by(carrier) %>%
  summarize(dist_dest = n_distinct(dest)) %>%
  arrange(desc(dist_dest))
- Standard deviation is the statistic most often used to measure variation in a variable. The plane with tail number N76062 had the highest standard deviation in distances flown, at about 1796 miles.
flights %>%
  filter(!is.na(tailnum)) %>%
  group_by(tailnum) %>%
  summarize(variation = sd(distance, na.rm = TRUE),
            count = n()) %>%
  arrange(desc(variation))
- February 8 had the most cancellations at 472. There were 7 days with no cancellations, including, luckily, Thanksgiving and the day after Thanksgiving.
cancellations <- flights %>%
  group_by(month, day) %>%
  summarize(flights_canceled = sum(is.na(dep_time)))

arrange(cancellations, desc(flights_canceled))
## # A tibble: 365 x 3
## # Groups: month [12]
## month day flights_canceled
## <int> <int> <int>
## 1 2 8 472
## 2 2 9 393
## 3 5 23 221
## 4 12 10 204
## 5 9 12 192
## 6 3 6 180
## 7 3 8 180
## 8 12 5 158
## 9 12 14 125
## 10 6 28 123
## # ... with 355 more rows
## # i Use `print(n = ...)` to see more rows
arrange(cancellations, flights_canceled)
## # A tibble: 365 x 3
## # Groups: month [12]
## month day flights_canceled
## <int> <int> <int>
## 1 4 21 0
## 2 5 17 0
## 3 5 26 0
## 4 10 5 0
## 5 10 20 0
## 6 11 28 0
## 7 11 29 0
## 8 1 6 1
## 9 1 19 1
## 10 2 16 1
## # ... with 355 more rows
## # i Use `print(n = ...)` to see more rows
HR_by_year <- Batting %>%
  group_by(yearID) %>%
  summarize(max_HR = max(HR))
- The huge jump around 1920 is explained by the arrival of Babe Ruth. Home run totals remained very high in his wake, showing how dramatically he changed the game. You might also recognize some high spikes during the PED-fueled years of Mark McGwire, Sammy Sosa, and Barry Bonds, and some low spikes during strike- or COVID-shortened seasons.
ggplot(data = HR_by_year, mapping = aes(x = yearID, y = max_HR)) +
geom_point() +
geom_line()
- To find career batting averages, we group by player, total the hits and at bats across seasons, and keep only players with at least 3000 career at bats:
Batting %>%
  group_by(playerID) %>%
  summarize(at_bats = sum(AB, na.rm = TRUE),
            hits = sum(H, na.rm = TRUE),
            batting_avg = hits/at_bats) %>%
  filter(at_bats >= 3000) %>%
  arrange(desc(batting_avg))
## # A tibble: 1,774 x 4
## playerID at_bats hits batting_avg
## <chr> <int> <int> <dbl>
## 1 cobbty01 11436 4189 0.366
## 2 hornsro01 8173 2930 0.358
## 3 jacksjo01 4981 1772 0.356
## 4 odoulle01 3264 1140 0.349
## 5 delahed01 7510 2597 0.346
## 6 speaktr01 10195 3514 0.345
## 7 hamilbi01 6283 2164 0.344
## 8 willite01 7706 2654 0.344
## 9 broutda01 6726 2303 0.342
## 10 ruthba01 8398 2873 0.342
## # ... with 1,764 more rows
## # i Use `print(n = ...)` to see more rows