Chapter 41 The beauty of across() in the tidyverse, part 2

41.1 An old pain point

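dplyr 1.0.0 introduced across(), which showed once again how powerful and user-friendly dplyr is. across() is very convenient in combination with summarise() and mutate() (see Chapter 40), but it works less well with filter(). Suppose, for example, that we want to keep the rows of a data frame that contain missing values: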

## # A tibble: 0 × 8
## # … with 8 variables: species <fct>, island <fct>,
## #   bill_length_mm <dbl>, bill_depth_mm <dbl>,
## #   flipper_length_mm <int>, body_mass_g <int>,
## #   sex <fct>, year <int>

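The code runs, but the result is clearly wrong. After a long search, I found that the only way to get the right answer was filter_all(), a function that predates dplyr 1.0.0: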

penguins %>% 
  filter_all( any_vars(is.na(.)) )
## # A tibble: 11 × 8
##   species island    bill_length_mm bill_depth_mm
##   <fct>   <fct>              <dbl>         <dbl>
## 1 Adelie  Torgersen           NA            NA  
## 2 Adelie  Torgersen           34.1          18.1
## 3 Adelie  Torgersen           42            20.2
## 4 Adelie  Torgersen           37.8          17.1
## 5 Adelie  Torgersen           37.8          17.3
## 6 Adelie  Dream               37.5          18.9
## # … with 5 more rows, and 4 more variables:
## #   flipper_length_mm <int>, body_mass_g <int>,
## #   sex <fct>, year <int>

It leaves the feeling that, on the road to simplicity, something was still missing.
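The call behind the empty result above is not shown. Assuming it applied is.na() to every column through across() inside filter(), the zero rows are what one would expect: filter() combines its conditions with AND, so only a row that is missing in every column would survive. The sketch below illustrates the difference between AND and OR on a small, hypothetical tibble (toy, x and y are made-up names, not part of penguins).

```r
library(dplyr)

# Hypothetical toy data: row 1 has one NA, row 2 is complete, row 3 is all NA
toy <- tibble(x = c(NA, 2, NA), y = c(1, 2, NA))

# Conditions separated by commas are combined with AND:
# only rows that are missing in every column remain
all_missing <- toy %>% filter(is.na(x), is.na(y))

# An explicit OR keeps rows with at least one missing value
any_missing <- toy %>% filter(is.na(x) | is.na(y))

nrow(all_missing)
nrow(any_missing)
```

Writing the OR by hand does not scale to many columns, which is why the filter_all(any_vars(...)) form was needed.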

41.2 dplyr 1.0.4: if_any() and if_all()

Now dplyr 1.0.4 provides two functions, if_any() and if_all(), that fill exactly this gap:

penguins %>% 
  filter(if_any(everything(), is.na))
## # A tibble: 11 × 8
##   species island    bill_length_mm bill_depth_mm
##   <fct>   <fct>              <dbl>         <dbl>
## 1 Adelie  Torgersen           NA            NA  
## 2 Adelie  Torgersen           34.1          18.1
## 3 Adelie  Torgersen           42            20.2
## 4 Adelie  Torgersen           37.8          17.1
## 5 Adelie  Torgersen           37.8          17.3
## 6 Adelie  Dream               37.5          18.9
## # … with 5 more rows, and 4 more variables:
## #   flipper_length_mm <int>, body_mass_g <int>,
## #   sex <fct>, year <int>

Judging by their signatures, if_any() and if_all() stand on the same footing as across():

across(.cols = everything(), .fns = NULL, ..., .names = NULL)

if_any(.cols, .fns = NULL, ..., .names = NULL)

if_all(.cols, .fns = NULL, ..., .names = NULL)

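This means we have across() for working along columns and if_any()/if_all() for working along rows. With the Heaven Sword and the Dragon Saber both in hand, who in the martial world could stand against us?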
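One way to see the "row direction" is to look at what the two functions return: a single logical value per row, obtained by combining the per-column results. The following is a minimal sketch on a hypothetical tibble (toy, x and y are made-up names). The columns are listed explicitly as c(x, y) rather than everything(), on the assumption that everything() inside mutate() would also pick up the column created just before it.

```r
library(dplyr)

# Hypothetical toy data: row 1 has one NA, row 2 is complete, row 3 is all NA
toy <- tibble(x = c(NA, 2, NA), y = c(1, 2, NA))

flagged <- toy %>%
  mutate(
    any_na = if_any(c(x, y), is.na),  # TRUE if at least one of x, y is NA
    all_na = if_all(c(x, y), is.na)   # TRUE only if both x and y are NA
  )

flagged
```

Because the result is an ordinary logical vector, it can feed filter(), mutate() or case_when() alike.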

41.3 Worked examples

The examples below show the two new functions in action; some of them come from the official website.

  • Keep the rows that contain missing values
penguins %>% 
  filter(if_any(everything(), is.na))
## # A tibble: 11 × 8
##   species island    bill_length_mm bill_depth_mm
##   <fct>   <fct>              <dbl>         <dbl>
## 1 Adelie  Torgersen           NA            NA  
## 2 Adelie  Torgersen           34.1          18.1
## 3 Adelie  Torgersen           42            20.2
## 4 Adelie  Torgersen           37.8          17.1
## 5 Adelie  Torgersen           37.8          17.3
## 6 Adelie  Dream               37.5          18.9
## # … with 5 more rows, and 4 more variables:
## #   flipper_length_mm <int>, body_mass_g <int>,
## #   sex <fct>, year <int>

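Or, more simply, leave out .cols so that every column is used (recent dplyr releases warn when .cols is omitted, so the explicit form above is the safer habit):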

penguins %>% 
  filter(if_any(.fns = is.na))
## # A tibble: 11 × 8
##   species island    bill_length_mm bill_depth_mm
##   <fct>   <fct>              <dbl>         <dbl>
## 1 Adelie  Torgersen           NA            NA  
## 2 Adelie  Torgersen           34.1          18.1
## 3 Adelie  Torgersen           42            20.2
## 4 Adelie  Torgersen           37.8          17.1
## 5 Adelie  Torgersen           37.8          17.3
## 6 Adelie  Dream               37.5          18.9
## # … with 5 more rows, and 4 more variables:
## #   flipper_length_mm <int>, body_mass_g <int>,
## #   sex <fct>, year <int>
  • Keep the rows in which every value is missing
penguins %>% 
  filter(if_all(everything(), is.na))
## # A tibble: 0 × 8
## # … with 8 variables: species <fct>, island <fct>,
## #   bill_length_mm <dbl>, bill_depth_mm <dbl>,
## #   flipper_length_mm <int>, body_mass_g <int>,
## #   sex <fct>, year <int>
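The two examples so far select rows with missing values. The complementary task, keeping only complete rows, can be expressed by negating if_any() or by pushing the negation into if_all(). This is a minimal sketch on a hypothetical tibble (toy, x and y are made-up names), not a replacement for dedicated helpers.

```r
library(dplyr)

# Hypothetical toy data: only row 2 is complete
toy <- tibble(x = c(NA, 2, NA), y = c(1, 2, NA))

# "No column is NA", written as the negation of "some column is NA"
complete_a <- toy %>% filter(!if_any(c(x, y), is.na))

# The same condition, written as "every column is not NA"
complete_b <- toy %>% filter(if_all(c(x, y), ~ !is.na(.x)))

complete_a
complete_b
```

Both forms describe the same rows; which one reads better is a matter of taste.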
  • Keep the rows in which the bill measurements (length and depth) are all greater than 21 mm
penguins %>% 
  filter(if_all(contains("bill"), ~ . > 21))
## # A tibble: 6 × 8
##   species island    bill_length_mm bill_depth_mm
##   <fct>   <fct>              <dbl>         <dbl>
## 1 Adelie  Torgersen           38.6          21.2
## 2 Adelie  Torgersen           34.6          21.1
## 3 Adelie  Torgersen           46            21.5
## 4 Adelie  Dream               39.2          21.1
## 5 Adelie  Dream               42.3          21.2
## 6 Adelie  Biscoe              41.3          21.1
## # … with 4 more variables: flipper_length_mm <int>,
## #   body_mass_g <int>, sex <fct>, year <int>

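Of course, it can be written with a bit more flair, passing `>` as the function and 21 as its extra argument (recent dplyr releases deprecate passing extra arguments through ..., so the formula version above ages better):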

penguins %>% 
  filter(if_all(contains("bill"), `>`, 21))
## # A tibble: 6 × 8
##   species island    bill_length_mm bill_depth_mm
##   <fct>   <fct>              <dbl>         <dbl>
## 1 Adelie  Torgersen           38.6          21.2
## 2 Adelie  Torgersen           34.6          21.1
## 3 Adelie  Torgersen           46            21.5
## 4 Adelie  Dream               39.2          21.1
## 5 Adelie  Dream               42.3          21.2
## 6 Adelie  Biscoe              41.3          21.1
## # … with 4 more variables: flipper_length_mm <int>,
## #   body_mass_g <int>, sex <fct>, year <int>
  • Keep the rows in which the bill length or the bill depth is greater than 21 mm
penguins %>% 
  filter(if_any(contains("bill"), ~ . > 21))
## # A tibble: 342 × 8
##   species island    bill_length_mm bill_depth_mm
##   <fct>   <fct>              <dbl>         <dbl>
## 1 Adelie  Torgersen           39.1          18.7
## 2 Adelie  Torgersen           39.5          17.4
## 3 Adelie  Torgersen           40.3          18  
## 4 Adelie  Torgersen           36.7          19.3
## 5 Adelie  Torgersen           39.3          20.6
## 6 Adelie  Torgersen           38.9          17.8
## # … with 336 more rows, and 4 more variables:
## #   flipper_length_mm <int>, body_mass_g <int>,
## #   sex <fct>, year <int>
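The result above has 342 rows, which suggests that rows whose bill measurements are missing were dropped. That fits how R treats NA in comparisons: NA > 21 is NA, and filter() keeps only rows whose condition is TRUE. The sketch below uses a hypothetical tibble (toy, id, a and b are made-up names) and assumes that if_any() and if_all() combine the per-column results in the same way as | and &.

```r
library(dplyr)

# Hypothetical toy data with different patterns of missing values
toy <- tibble(
  id = 1:4,
  a  = c(25, NA, NA, 10),
  b  = c(30, 30, NA, NA)
)

# Row 2 is kept: NA | TRUE is TRUE.
# Rows 3 and 4 are dropped: NA | NA and FALSE | NA are both NA.
any_big <- toy %>% filter(if_any(c(a, b), ~ .x > 21))

# Only row 1 is kept: NA & TRUE is NA, so row 2 is dropped as well.
all_big <- toy %>% filter(if_all(c(a, b), ~ .x > 21))

any_big$id
all_big$id
```

If rows with missing values should be handled differently, the condition has to say so explicitly, for example by testing is.na() alongside the comparison.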
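  • Check each row across the specified columns (bill length and bill depth) and keep the row if every value is greater than the mean of its own column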
bigger_than_mean <- function(x) {
  x > mean(x, na.rm = TRUE)
}

penguins %>% 
  filter(if_all(contains("bill"), bigger_than_mean))
## # A tibble: 61 × 8
##   species island    bill_length_mm bill_depth_mm
##   <fct>   <fct>              <dbl>         <dbl>
## 1 Adelie  Torgersen           46            21.5
## 2 Adelie  Dream               44.1          19.7
## 3 Adelie  Torgersen           45.8          18.9
## 4 Adelie  Biscoe              45.6          20.3
## 5 Adelie  Torgersen           44.1          18  
## 6 Gentoo  Biscoe              44.4          17.3
## # … with 55 more rows, and 4 more variables:
## #   flipper_length_mm <int>, body_mass_g <int>,
## #   sex <fct>, year <int>
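  • Check each row across the specified columns (bill length and bill depth): if every value is greater than the mean of its own column, label the row "both big"; if at least one value is greater than the mean of its own column, label it "one big" (note that in case_when() the if_all() condition must come before the if_any() condition)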
penguins %>% 
  filter(!is.na(bill_length_mm)) %>% 
  mutate(
    category = case_when(
      if_all(contains("bill"), bigger_than_mean) ~ "both big", 
      if_any(contains("bill"), bigger_than_mean) ~ "one big", 
      TRUE                          ~ "small"
    ))
## # A tibble: 342 × 9
##   species island    bill_length_mm bill_depth_mm
##   <fct>   <fct>              <dbl>         <dbl>
## 1 Adelie  Torgersen           39.1          18.7
## 2 Adelie  Torgersen           39.5          17.4
## 3 Adelie  Torgersen           40.3          18  
## 4 Adelie  Torgersen           36.7          19.3
## 5 Adelie  Torgersen           39.3          20.6
## 6 Adelie  Torgersen           38.9          17.8
## # … with 336 more rows, and 5 more variables:
## #   flipper_length_mm <int>, body_mass_g <int>,
## #   sex <fct>, year <int>, category <chr>
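The ordering note matters because case_when() uses the first condition that is TRUE for each row, and every row that satisfies if_all() also satisfies if_any(). The sketch below shows the effect of swapping the two conditions on a hypothetical tibble (toy, a and b are made-up names); bigger_than_mean() is redefined here only to keep the block self-contained.

```r
library(dplyr)

bigger_than_mean <- function(x) {
  x > mean(x, na.rm = TRUE)
}

# Hypothetical toy data: row 1 is below both means,
# row 2 is above both, row 3 is above the mean of b only
toy <- tibble(a = c(1, 5, 1), b = c(1, 5, 5))

# if_all() first: rows above both means are labelled before if_any() is tried
right_order <- toy %>%
  mutate(category = case_when(
    if_all(c(a, b), bigger_than_mean) ~ "both big",
    if_any(c(a, b), bigger_than_mean) ~ "one big",
    TRUE                              ~ "small"
  ))

# if_any() first: it already matches the "both big" rows,
# so the if_all() branch is never reached
wrong_order <- toy %>%
  mutate(category = case_when(
    if_any(c(a, b), bigger_than_mean) ~ "one big",
    if_all(c(a, b), bigger_than_mean) ~ "both big",
    TRUE                              ~ "small"
  ))

right_order$category
wrong_order$category
```

With the swapped order, no row can ever be labelled "both big".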