Chapter 41 The Beauty of across() in the tidyverse, Part 2
41.1 The old pain point
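dplyr 1.0.0 introduced the across() function, which showed once again how powerful and user-friendly dplyr is. across() works very conveniently together with summarise() and mutate() (see Chapter 40), but it pairs less well with filter(). For example, suppose we want to keep the rows of a data frame that contain missing values: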
## # A tibble: 0 × 8
## # ℹ 8 variables: species <fct>, island <fct>, bill_length_mm <dbl>,
## # bill_depth_mm <dbl>, flipper_length_mm <int>, body_mass_g <int>, sex <fct>,
## # year <int>
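The call that produced this empty tibble is not shown above. One way to see how such a result can arise, assuming the attempt ran a per-column `is.na` check through `across()` inside `filter()`: `filter()` keeps a row only when every condition it receives is TRUE, so the per-column checks are combined with AND rather than OR. The sketch below imitates both combinations in base R on a small made-up data frame (`toy` and its columns are illustrative, not taken from the penguins data).

```r
# Toy data (hypothetical): only row 2 has a missing value, no row is all NA
toy <- data.frame(
  a = c(1, NA, 3),
  b = c("x", "y", "z")
)

# One logical vector per column: is this cell missing?
na_by_col <- lapply(toy, is.na)

# AND across columns: TRUE only for rows where every cell is missing
all_missing <- Reduce(`&`, na_by_col)

# OR across columns: TRUE for rows with at least one missing cell
any_missing <- Reduce(`|`, na_by_col)

toy[all_missing, ]  # no rows, like the empty result above
toy[any_missing, ]  # the row that contains an NA
```

With AND, a row survives only if all of its cells are missing, which is almost never what "rows with missing values" is meant to say.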
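The filter() call runs, but the result is clearly wrong: the data do contain rows with missing values. After a long search, I found that the only way to get them was filter_all(), a function that predates dplyr 1.0.0:
library(dplyr)
library(palmerpenguins)
penguins %>%
  filter_all(any_vars(is.na(.)))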
## # A tibble: 11 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen NA NA NA NA
## 2 Adelie Torgersen 34.1 18.1 193 3475
## 3 Adelie Torgersen 42 20.2 190 4250
## 4 Adelie Torgersen 37.8 17.1 186 3300
## 5 Adelie Torgersen 37.8 17.3 180 3700
## 6 Adelie Dream 37.5 18.9 179 2975
## 7 Gentoo Biscoe 44.5 14.3 216 4100
## 8 Gentoo Biscoe 46.2 14.4 214 4650
## 9 Gentoo Biscoe 47.3 13.8 216 4725
## 10 Gentoo Biscoe 44.5 15.7 217 4875
## 11 Gentoo Biscoe NA NA NA NA
## # ℹ 2 more variables: sex <fct>, year <int>
It left the feeling that, on the road to simplicity, something was still missing.
41.2 dplyr 1.0.4: if_any() and if_all()
Now dplyr 1.0.4 has introduced two functions, if_any() and if_all(), that fill exactly this gap:
penguins %>%
  filter(if_any(everything(), is.na))
## # A tibble: 11 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen NA NA NA NA
## 2 Adelie Torgersen 34.1 18.1 193 3475
## 3 Adelie Torgersen 42 20.2 190 4250
## 4 Adelie Torgersen 37.8 17.1 186 3300
## 5 Adelie Torgersen 37.8 17.3 180 3700
## 6 Adelie Dream 37.5 18.9 179 2975
## 7 Gentoo Biscoe 44.5 14.3 216 4100
## 8 Gentoo Biscoe 46.2 14.4 214 4650
## 9 Gentoo Biscoe 47.3 13.8 216 4725
## 10 Gentoo Biscoe 44.5 15.7 217 4875
## 11 Gentoo Biscoe NA NA NA NA
## # ℹ 2 more variables: sex <fct>, year <int>
In terms of their signatures, if_any() and if_all() stand alongside across():
across(.cols = everything(), .fns = NULL, ..., .names = NULL)
if_any(.cols, .fns = NULL, ..., .names = NULL)
if_all(.cols, .fns = NULL, ..., .names = NULL)
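This means that across() covers the column direction (apply a function to each selected column), while if_any() and if_all() cover the row direction (combine the per-column results within each row into a single TRUE or FALSE). Sword and saber together, as the wuxia saying goes: who could stand against them?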
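To make the contrast concrete, here is a minimal sketch on a made-up tibble (the name `scores` and its columns are hypothetical). It assumes dplyr 1.0.4 or later. `across()` yields one result per selected column, whereas `if_any()` and `if_all()` yield a single logical value per row, which is why they can also be used inside `mutate()`.

```r
library(dplyr)

# Hypothetical data: three rows, two numeric columns
scores <- tibble(
  x = c(1, NA, 3),
  y = c(NA, NA, 6)
)

# across(): apply a function to each selected column (column-wise)
col_means <- scores %>%
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))

# if_any() / if_all(): collapse the per-column checks into one logical per row
# (columns are named explicitly so the new any_na column is not selected)
row_flags <- scores %>%
  mutate(
    any_na = if_any(c(x, y), is.na),
    all_na = if_all(c(x, y), is.na)
  )

col_means
row_flags
```

The columns are listed explicitly in the second call because, inside `mutate()`, a selection such as `everything()` would also pick up columns created earlier in the same call.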
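41.3 Worked examples
The examples below show the two new functions in action; some of them come from the official website.
- Keep rows that contain missing values
penguins %>%
  filter(if_any(everything(), is.na))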
## # A tibble: 11 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen NA NA NA NA
## 2 Adelie Torgersen 34.1 18.1 193 3475
## 3 Adelie Torgersen 42 20.2 190 4250
## 4 Adelie Torgersen 37.8 17.1 186 3300
## 5 Adelie Torgersen 37.8 17.3 180 3700
## 6 Adelie Dream 37.5 18.9 179 2975
## 7 Gentoo Biscoe 44.5 14.3 216 4100
## 8 Gentoo Biscoe 46.2 14.4 214 4650
## 9 Gentoo Biscoe 47.3 13.8 216 4725
## 10 Gentoo Biscoe 44.5 15.7 217 4875
## 11 Gentoo Biscoe NA NA NA NA
## # ℹ 2 more variables: sex <fct>, year <int>
Or, more simply:
## # A tibble: 11 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen NA NA NA NA
## 2 Adelie Torgersen 34.1 18.1 193 3475
## 3 Adelie Torgersen 42 20.2 190 4250
## 4 Adelie Torgersen 37.8 17.1 186 3300
## 5 Adelie Torgersen 37.8 17.3 180 3700
## 6 Adelie Dream 37.5 18.9 179 2975
## 7 Gentoo Biscoe 44.5 14.3 216 4100
## 8 Gentoo Biscoe 46.2 14.4 214 4650
## 9 Gentoo Biscoe 47.3 13.8 216 4725
## 10 Gentoo Biscoe 44.5 15.7 217 4875
## 11 Gentoo Biscoe NA NA NA NA
## # ℹ 2 more variables: sex <fct>, year <int>
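The shorter call that produced this second copy of the output is not shown. As one possibility (an assumption, not necessarily the form the author had in mind), base R's `complete.cases()` flags rows without any missing value, so negating it should select the same rows as `if_any(everything(), is.na)`. The sketch assumes the palmerpenguins package is installed.

```r
library(dplyr)
library(palmerpenguins)

# Rows with at least one NA, written with if_any()
with_if_any <- penguins %>%
  filter(if_any(everything(), is.na))

# Same idea via base R: complete.cases() is TRUE for rows with no NA at all
with_complete_cases <- penguins %>%
  filter(!complete.cases(penguins))

# Compare the two results
all.equal(with_if_any, with_complete_cases)
```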
- Keep rows in which every value is missing
penguins %>%
  filter(if_all(everything(), is.na))
## # A tibble: 0 × 8
## # ℹ 8 variables: species <fct>, island <fct>, bill_length_mm <dbl>,
## # bill_depth_mm <dbl>, flipper_length_mm <int>, body_mass_g <int>, sex <fct>,
## # year <int>
- Keep rows where the bill measurements (length and depth) are both greater than 21 mm
## # A tibble: 6 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 38.6 21.2 191 3800
## 2 Adelie Torgersen 34.6 21.1 198 4400
## 3 Adelie Torgersen 46 21.5 194 4200
## 4 Adelie Dream 39.2 21.1 196 4150
## 5 Adelie Dream 42.3 21.2 191 4150
## 6 Adelie Biscoe 41.3 21.1 195 4400
## # ℹ 2 more variables: sex <fct>, year <int>
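The code for this filter is missing above. A plausible reconstruction, assuming the same `contains("bill")` selection that appears later in this chapter, passes a lambda that compares each bill column with 21:

```r
library(dplyr)
library(palmerpenguins)

# Keep rows where BOTH bill_length_mm and bill_depth_mm exceed 21
both_over_21 <- penguins %>%
  filter(if_all(contains("bill"), ~ .x > 21))

both_over_21
```

Rows with a missing bill measurement drop out as well, since `filter()` discards rows whose condition evaluates to NA.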
Of course, it can be written with a bit more flair:
## # A tibble: 6 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 38.6 21.2 191 3800
## 2 Adelie Torgersen 34.6 21.1 198 4400
## 3 Adelie Torgersen 46 21.5 194 4200
## 4 Adelie Dream 39.2 21.1 196 4150
## 5 Adelie Dream 42.3 21.2 191 4150
## 6 Adelie Biscoe 41.3 21.1 195 4400
## # ℹ 2 more variables: sex <fct>, year <int>
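The fancier variant is not shown either. One way to get a similar effect, sketched here on the assumption that any function returning a logical vector is acceptable as `.fns`, is a small function factory. `greater_than()` is a hypothetical helper, not part of dplyr.

```r
library(dplyr)
library(palmerpenguins)

# Hypothetical helper: builds a predicate that tests x > threshold
greater_than <- function(threshold) {
  function(x) x > threshold
}

# Rows where both bill columns exceed 21, with the threshold named at the call site
both_over_21 <- penguins %>%
  filter(if_all(contains("bill"), greater_than(21)))

both_over_21
```

The factory keeps the threshold out of the lambda syntax, which can read more cleanly when the same comparison is reused with different cut-offs.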
- Keep rows where the bill length or the bill depth is greater than 21 mm
## # A tibble: 342 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen 36.7 19.3 193 3450
## 5 Adelie Torgersen 39.3 20.6 190 3650
## 6 Adelie Torgersen 38.9 17.8 181 3625
## 7 Adelie Torgersen 39.2 19.6 195 4675
## 8 Adelie Torgersen 34.1 18.1 193 3475
## 9 Adelie Torgersen 42 20.2 190 4250
## 10 Adelie Torgersen 37.8 17.1 186 3300
## # ℹ 332 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
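The call is missing here as well. Assuming it mirrors the "both greater than 21 mm" case with `if_any()` in place of `if_all()`, a sketch would be:

```r
library(dplyr)
library(palmerpenguins)

# Keep rows where AT LEAST ONE of the bill columns exceeds 21
either_over_21 <- penguins %>%
  filter(if_any(contains("bill"), ~ .x > 21))

either_over_21
```

Every recorded bill length is above 21 mm, so only the two rows in which both measurements are missing drop out, which is consistent with the 342 rows shown above.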
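- Check the values in the selected columns (bill length and bill depth) row by row, and keep a row only if every value is greater than the mean of its own column
# TRUE where a value exceeds its column mean (NAs are ignored when computing the mean)
bigger_than_mean <- function(x) {
  x > mean(x, na.rm = TRUE)
}
penguins %>%
  filter(if_all(contains("bill"), bigger_than_mean))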
## # A tibble: 61 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgers… 46 21.5 194 4200
## 2 Adelie Dream 44.1 19.7 196 4400
## 3 Adelie Torgers… 45.8 18.9 197 4150
## 4 Adelie Biscoe 45.6 20.3 191 4600
## 5 Adelie Torgers… 44.1 18 210 4000
## 6 Gentoo Biscoe 44.4 17.3 219 5250
## 7 Gentoo Biscoe 50.8 17.3 228 5600
## 8 Chinstrap Dream 46.5 17.9 192 3500
## 9 Chinstrap Dream 50 19.5 196 3900
## 10 Chinstrap Dream 51.3 19.2 193 3650
## # ℹ 51 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
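- Check the values in the selected columns (bill length and bill depth) row by row: if every value is greater than the mean of its own column, label the row "both big"; if only one of them is, label it "one big"; otherwise label it "small" (note that if_all() must come before if_any() inside case_when())
penguins %>%
  filter(!is.na(bill_length_mm)) %>%
  mutate(
    category = case_when(
      if_all(contains("bill"), bigger_than_mean) ~ "both big",
      if_any(contains("bill"), bigger_than_mean) ~ "one big",
      TRUE ~ "small"
    )
  )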
## # A tibble: 342 × 9
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen 36.7 19.3 193 3450
## 5 Adelie Torgersen 39.3 20.6 190 3650
## 6 Adelie Torgersen 38.9 17.8 181 3625
## 7 Adelie Torgersen 39.2 19.6 195 4675
## 8 Adelie Torgersen 34.1 18.1 193 3475
## 9 Adelie Torgersen 42 20.2 190 4250
## 10 Adelie Torgersen 37.8 17.1 186 3300
## # ℹ 332 more rows
## # ℹ 3 more variables: sex <fct>, year <int>, category <chr>
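Why the order matters: `case_when()` uses the first condition that is TRUE for each row, and any row that passes `if_all()` also passes `if_any()`. A minimal sketch on made-up data (`toy` and its columns are hypothetical) shows what happens when the two conditions are swapped.

```r
library(dplyr)

bigger_than_mean <- function(x) {
  x > mean(x, na.rm = TRUE)
}

# Hypothetical data: column means are 3 and 20
toy <- tibble(
  bill_a = c(1, 5, 5, 1),
  bill_b = c(10, 50, 10, 10)
)

# if_all() first: rows that pass both checks are labelled before if_any() is tried
right_order <- toy %>%
  mutate(category = case_when(
    if_all(contains("bill"), bigger_than_mean) ~ "both big",
    if_any(contains("bill"), bigger_than_mean) ~ "one big",
    TRUE ~ "small"
  ))

# if_any() first: it also matches the "both big" rows, so that label is never used
wrong_order <- toy %>%
  mutate(category = case_when(
    if_any(contains("bill"), bigger_than_mean) ~ "one big",
    if_all(contains("bill"), bigger_than_mean) ~ "both big",
    TRUE ~ "small"
  ))

right_order$category  # "small" "both big" "one big" "small"
wrong_order$category  # "small" "one big"  "one big" "small"
```

In the swapped version the second row is labelled "one big" even though both of its values exceed their column means.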