16 벡터 도구들
16.1 들어가기
%in%
library(tidyverse)
#> ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
#> ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
#> ✓ tibble 3.1.6 ✓ dplyr 1.0.7
#> ✓ tidyr 1.1.4 ✓ stringr 1.4.0.9000
#> ✓ readr 2.1.1 ✓ forcats 0.5.1
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag() masks stats::lag()
library(nycflights13)
not_cancelled <- flights %>%
filter(!is.na(dep_delay), !is.na(arr_delay))
16.2 Counts
-
Counts: You’ve seen
n()
, which takes no arguments, and returns the size of the current group. To count the number of non-missing values, usesum(!is.na(x))
. To count the number of distinct (unique) values, usen_distinct(x)
.# Which destinations have the most carriers? not_cancelled %>% group_by(dest) %>% summarise(carriers = n_distinct(carrier)) %>% arrange(desc(carriers)) #> # A tibble: 104 × 2 #> dest carriers #> <chr> <int> #> 1 ATL 7 #> 2 BOS 7 #> 3 CLT 7 #> 4 ORD 7 #> 5 TPA 7 #> 6 AUS 6 #> # … with 98 more rows
Counts are so useful that dplyr provides a simple helper if all you want is a count:
not_cancelled %>% count(dest) #> # A tibble: 104 × 2 #> dest n #> <chr> <int> #> 1 ABQ 254 #> 2 ACK 264 #> 3 ALB 418 #> 4 ANC 8 #> 5 ATL 16837 #> 6 AUS 2411 #> # … with 98 more rows
Just like with
group_by()
, you can also provide multiple variables tocount()
.not_cancelled %>% count(carrier, dest) #> # A tibble: 312 × 3 #> carrier dest n #> <chr> <chr> <int> #> 1 9E ATL 56 #> 2 9E AUS 2 #> 3 9E AVL 10 #> 4 9E BNA 452 #> 5 9E BOS 853 #> 6 9E BTV 2 #> # … with 306 more rows
You can optionally provide a weight variable. For example, you could use this to “count” (sum) the total number of miles a plane flew:
16.3 Window functions
-
Offsets:
lead()
andlag()
allow you to refer to leading or lagging values. This allows you to compute running differences (e.g.x - lag(x)
) or find when values change (x != lag(x)
). They are most useful in conjunction withgroup_by()
, which you’ll learn about shortly. -
Ranking: there are a number of ranking functions, but you should start with
min_rank()
. It does the most usual type of ranking (e.g. 1st, 2nd, 2nd, 4th). The default gives smallest values the small ranks; usedesc(x)
to give the largest values the smallest ranks.If
min_rank()
doesn’t do what you need, look at the variantsrow_number()
,dense_rank()
,percent_rank()
,cume_dist()
,ntile()
. See their help pages for more details.row_number(y) #> [1] 1 2 3 NA 4 5 dense_rank(y) #> [1] 1 2 2 NA 3 4 percent_rank(y) #> [1] 0.00 0.25 0.25 NA 0.75 1.00 cume_dist(y) #> [1] 0.2 0.6 0.6 NA 0.8 1.0
-
Measures of position:
first(x)
,nth(x, 2)
,last(x)
. These work similarly tox[1]
,x[2]
, andx[length(x)]
but let you set a default value if that position does not exist (i.e. you’re trying to get the 3rd element from a group that only has two elements). For example, we can find the first and last departure for each day:not_cancelled %>% group_by(year, month, day) %>% summarise( first_dep = first(dep_time), last_dep = last(dep_time) ) #> # A tibble: 365 × 5 #> # Groups: year, month [12] #> year month day first_dep last_dep #> <int> <int> <int> <int> <int> #> 1 2013 1 1 517 2356 #> 2 2013 1 2 42 2354 #> 3 2013 1 3 32 2349 #> 4 2013 1 4 25 2358 #> 5 2013 1 5 14 2357 #> 6 2013 1 6 16 2355 #> # … with 359 more rows
These functions are complementary to filtering on ranks. Filtering gives you all variables, with each observation in a separate row:
not_cancelled %>% group_by(year, month, day) %>% mutate(r = min_rank(desc(dep_time))) %>% filter(r %in% range(r)) #> # A tibble: 770 × 20 #> # Groups: year, month, day [365] #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time #> <int> <int> <int> <int> <int> <dbl> <int> <int> #> 1 2013 1 1 517 515 2 830 819 #> 2 2013 1 1 2356 2359 -3 425 437 #> 3 2013 1 2 42 2359 43 518 442 #> 4 2013 1 2 2354 2359 -5 413 437 #> 5 2013 1 3 32 2359 33 504 442 #> 6 2013 1 3 2349 2359 -10 434 445 #> # … with 764 more rows, and 12 more variables: arr_delay <dbl>, carrier <chr>, #> # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, #> # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, r <int>
16.3.1 Cumulative
- Cumulative and rolling aggregates: R provides functions for running sums, products, mins and maxes:
cumsum()
,cumprod()
,cummin()
,cummax()
; and dplyr providescummean()
for cumulative means. If you need rolling aggregates (i.e. a sum computed over a rolling window), try the RcppRoll package.
x <- 1:10
cumsum(x)
#> [1] 1 3 6 10 15 21 28 36 45 55
cummean(x)
#> [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5
Generalise to rolling and use slider package instead?
Another useful pair of functions are cumulative any, cumany()
, and cumulative all, cumall()
.
cumany()
will be TRUE
after it encounters the first TRUE
, and cumall()
will be FALSE
after it encounters its first FALSE
.
These are particularly useful in conjunction with filter()
because they allow you to select:
-
cumall(x)
: all cases until the firstFALSE
. -
cumall(!x)
: all cases until the firstTRUE
. -
cumany(x)
: all cases after the firstTRUE
. -
cumany(!x)
: all cases after the firstFALSE
.
df <- data.frame(
date = as.Date("2020-01-01") + 0:6,
balance = c(100, 50, 25, -25, -50, 30, 120)
)
# all rows after first overdraft
df %>% filter(cumany(balance < 0))
#> date balance
#> 1 2020-01-04 -25
#> 2 2020-01-05 -50
#> 3 2020-01-06 30
#> 4 2020-01-07 120
# all rows until first overdraft
df %>% filter(cumall(!(balance < 0)))
#> date balance
#> 1 2020-01-01 100
#> 2 2020-01-02 50
#> 3 2020-01-03 25
16.3.3 dplyr
-
Find the worst members of each group:
flights_sml %>% group_by(year, month, day) %>% filter(rank(desc(arr_delay)) < 10) #> # A tibble: 3,306 × 7 #> # Groups: year, month, day [365] #> year month day dep_delay arr_delay distance air_time #> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> #> 1 2013 1 1 853 851 184 41 #> 2 2013 1 1 290 338 1134 213 #> 3 2013 1 1 260 263 266 46 #> 4 2013 1 1 157 174 213 60 #> 5 2013 1 1 216 222 708 121 #> 6 2013 1 1 255 250 589 115 #> # … with 3,300 more rows
-
Find all groups bigger than a threshold:
popular_dests <- flights %>% group_by(dest) %>% filter(n() > 365) popular_dests #> # A tibble: 332,577 × 19 #> # Groups: dest [77] #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time #> <int> <int> <int> <int> <int> <dbl> <int> <int> #> 1 2013 1 1 517 515 2 830 819 #> 2 2013 1 1 533 529 4 850 830 #> 3 2013 1 1 542 540 2 923 850 #> 4 2013 1 1 544 545 -1 1004 1022 #> 5 2013 1 1 554 600 -6 812 837 #> 6 2013 1 1 554 558 -4 740 728 #> # … with 332,571 more rows, and 11 more variables: arr_delay <dbl>, #> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, #> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
-
Standardise to compute per group metrics:
popular_dests %>% filter(arr_delay > 0) %>% mutate(prop_delay = arr_delay / sum(arr_delay)) %>% select(year:day, dest, arr_delay, prop_delay) #> # A tibble: 131,106 × 6 #> # Groups: dest [77] #> year month day dest arr_delay prop_delay #> <int> <int> <int> <chr> <dbl> <dbl> #> 1 2013 1 1 IAH 11 0.000111 #> 2 2013 1 1 IAH 20 0.000201 #> 3 2013 1 1 MIA 33 0.000235 #> 4 2013 1 1 ORD 12 0.0000424 #> 5 2013 1 1 FLL 19 0.0000938 #> 6 2013 1 1 ORD 8 0.0000283 #> # … with 131,100 more rows
A grouped filter is a grouped mutate followed by an ungrouped filter. I generally avoid them except for quick and dirty manipulations: otherwise it’s hard to check that you’ve done the manipulation correctly.
Functions that work most naturally in grouped mutates and filters are known as window functions (vs. the summary functions used for summaries).
You can learn more about useful window functions in the corresponding vignette: vignette("window-functions")
.
16.3.4 Exercises
Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for
min_rank()
.Which plane (
tailnum
) has the worst on-time record?What time of day should you fly if you want to avoid delays as much as possible?
For each destination, compute the total minutes of delay. For each flight, compute the proportion of the total delay for its destination.
Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave. Using
lag()
, explore how the delay of a flight is related to the delay of the immediately preceding flight.Look at each destination. Can you find flights that are suspiciously fast? (i.e. flights that represent a potential data entry error). Compute the air time of a flight relative to the shortest flight to that destination. Which flights were most delayed in the air?
Find all destinations that are flown by at least two carriers. Use that information to rank the carriers.