第 43 章 tidyverse中的across()之美4

本章讲讲mutate()中的across()c_acorss()map_df()map2_dfc()系列的纠缠。

内容涉及迭代、泛函、返回数据框、数据框并入等概念。

library(tidyverse)

penguins <- palmerpenguins::penguins %>% drop_na()

43.1 mutate() 数据框并入

在以往的学习中,我们了解到mutate()的功能是新增一列,

d <- tibble(
  x = 3:5
)

d %>% 
  mutate(x1 = x*2)
## # A tibble: 3 × 2
##       x    x1
##   <int> <dbl>
## 1     3     6
## 2     4     8
## 3     5    10

有时候我在想同时新增多列呢?

d %>% 
  mutate(
    x1 = x*2,
    x2 = x^2
  )
## # A tibble: 3 × 3
##       x    x1    x2
##   <int> <dbl> <dbl>
## 1     3     6     9
## 2     4     8    16
## 3     5    10    25

从形式上看,相当于在原来d的基础上并入了一个新的数据框。或许mutate()和我们认识的不一样, 于是我试试

t <- tibble(
  y = 4:6
)

d %>% 
  mutate(t)
## # A tibble: 3 × 2
##       x     y
##   <int> <int>
## 1     3     4
## 2     4     5
## 3     5     6

奥利给。我们再看看,让数据框是以函数的形式返回

my_fun <- function(x) {
  tibble(
    x1 = x * 2, 
    x2 = x^2)
}

d %>%
  mutate(
    my_fun(x)
  )
## # A tibble: 3 × 3
##       x    x1    x2
##   <int> <dbl> <dbl>
## 1     3     6     9
## 2     4     8    16
## 3     5    10    25

是不是很惊喜。 我们再看看across()函数,在mutate()across() 返回的就是数据框,正好并入原来d,道理是一样的。

d %>%
  mutate(
    across(x, list(f1 = ~ .x * 2, f2 = ~ .x^2))
  )
## # A tibble: 3 × 3
##       x  x_f1  x_f2
##   <int> <dbl> <dbl>
## 1     3     6     9
## 2     4     8    16
## 3     5    10    25

事实上,across()返回数据框的特性,结合mutate()并入数据框功能,让数据处理如鱼得水、如虎添翼。

43.2 从一个问题开始

计算每天水分和食物的所占比例,比如第一天water和food都是10.0,那么各自比例都是50%.

d <- tibble::tribble(
  ~water, ~food,
  10.0,   10.0,
  12.1,   10.3,
  13.5,   19.1,
  17.4,   16.0,
  25.8,   15.6,
  27.4,   19.8
)
d
## # A tibble: 6 × 2
##   water  food
##   <dbl> <dbl>
## 1  10    10  
## 2  12.1  10.3
## 3  13.5  19.1
## 4  17.4  16  
## 5  25.8  15.6
## 6  27.4  19.8

43.3 传统的方法

传统的方法是,把数据框旋转成长表格,计算所占比例后,再旋转回来

d %>%
  rownames_to_column() %>%
  pivot_longer(
    cols = !rowname
  ) %>%
  group_by(rowname) %>%
  mutate(
    percent = 100 * value / sum(value)
  ) %>%
  ungroup() %>%
  pivot_wider(
    names_from = name,
    values_from = c(value, percent),
    names_glue = "{name}_{.value}"
  )
## # A tibble: 6 × 5
##   rowname water_value food_value water_percent
##   <chr>         <dbl>      <dbl>         <dbl>
## 1 1              10         10            50  
## 2 2              12.1       10.3          54.0
## 3 3              13.5       19.1          41.4
## 4 4              17.4       16            52.1
## 5 5              25.8       15.6          62.3
## 6 6              27.4       19.8          58.1
## # … with 1 more variable: food_percent <dbl>

43.4 across()的方法

传统的方法,用到基本的dplyr函数,思路很清晰,但有点周折。下面,我列出几个比较新颖的方法,当然这些方法都来源于强大across()函数

43.4.1 方法1

d %>%
  mutate(100 * across(.names = "%{.col}") / rowSums(across())) %>%
  ungroup()
## # A tibble: 6 × 4
##   water  food `%water` `%food`
##   <dbl> <dbl>    <dbl>   <dbl>
## 1  10    10       50      50  
## 2  12.1  10.3     54.0    46.0
## 3  13.5  19.1     41.4    58.6
## 4  17.4  16       52.1    47.9
## 5  25.8  15.6     62.3    37.7
## 6  27.4  19.8     58.1    41.9

43.4.2 方法2

rowPercent <- function(df) {
  df / rowSums(df) * 100
}

d %>%
  mutate(rowPercent(across(.names = "%{.col}")))
## # A tibble: 6 × 4
##   water  food `%water` `%food`
##   <dbl> <dbl>    <dbl>   <dbl>
## 1  10    10       50      50  
## 2  12.1  10.3     54.0    46.0
## 3  13.5  19.1     41.4    58.6
## 4  17.4  16       52.1    47.9
## 5  25.8  15.6     62.3    37.7
## 6  27.4  19.8     58.1    41.9

43.4.3 方法3

d %>% 
  rowwise() %>% 
  mutate(
    across(everything(), ~ .x / sum(c_across()) )
  )


df %>% 
  rowwise() %>% 
  mutate(
    across(everything(), .names = "prop_{.col}", ~ .x / sum(c_across())  )
  )


df %>% 
  rowwise() %>% 
  mutate(
    across(.names = "prop_{.col}", .fns =  ~ .x / sum(c_across())  )
  )
d %>%
  rowwise() %>%
  mutate(100 * across(.names = "%{.col}") / sum(c_across())) %>%
  ungroup()
## # A tibble: 6 × 4
##   water  food `%water` `%food`
##   <dbl> <dbl>    <dbl>   <dbl>
## 1  10    10       50      50  
## 2  12.1  10.3     54.0    46.0
## 3  13.5  19.1     41.4    58.6
## 4  17.4  16       52.1    47.9
## 5  25.8  15.6     62.3    37.7
## 6  27.4  19.8     58.1    41.9

43.4.4 方法4

scale <- function(x) {
  100 * x / sum(x, na.rm = TRUE)
}

d %>%
  rowwise() %>%
  mutate(
    scale(across(.names = "%{.col}"))
  )
## # A tibble: 6 × 4
## # Rowwise: 
##   water  food `%water` `%food`
##   <dbl> <dbl>    <dbl>   <dbl>
## 1  10    10       50      50  
## 2  12.1  10.3     54.0    46.0
## 3  13.5  19.1     41.4    58.6
## 4  17.4  16       52.1    47.9
## 5  25.8  15.6     62.3    37.7
## 6  27.4  19.8     58.1    41.9

43.4.5 方法5

d %>%
  rowwise() %>% 
  mutate(100 * proportions(across(.names = "%{.col}")))
## # A tibble: 6 × 4
## # Rowwise: 
##   water  food `%water` `%food`
##   <dbl> <dbl>    <dbl>   <dbl>
## 1  10    10       50      50  
## 2  12.1  10.3     54.0    46.0
## 3  13.5  19.1     41.4    58.6
## 4  17.4  16       52.1    47.9
## 5  25.8  15.6     62.3    37.7
## 6  27.4  19.8     58.1    41.9

上面的方法虽然很多,但基本思路是一样的。

43.5 纠缠不清的迭代

我们先弄清楚迭代方向:

  • rowwise() 一行一行的处理
  • across() 一列一列的处理
  • rowwise() + across() 这种组合,双重迭代,(一行一行 + 一列一列)就变成了一个一个的处理
  • across() + purrr::map_dbl()这种组合分两种情形:
    • purrr::map_dbl() 作为across( .fns = ) 中的函数,即across(.cols = , .fns = map_dbl() )across()一列一列的迭代,每一列又传入purrr::map_dbl()再次迭代,因此这里是双重迭代
    • across()作为purrr::map_df(.x = )的数据,即purrr::map_df(.x = across(), .f = )。因为在mutate()across()返回数据框,因此可以把across()整体视为数据框,然后这个数据框传入purrr::map_df(.x = )进行迭代,因此这种情形可以认为只有purrr::map_*()一次迭代。
# rowwise() + across()
# rowwise() 设定行方向后,接着across() 就行方向上的元素一个一个的执行.fns
# 循环模式:第一层,一行一行的,第二层在每一行里,一个元素到一个元素


penguins %>% 
  group_by(species, year) %>% 
  summarise(flipper_length_mm = list(flipper_length_mm)) %>% 
  ungroup() %>% 
  pivot_wider(
    names_from = year,
    values_from = flipper_length_mm
  ) %>% 
  rowwise() %>% 
  mutate(
    across(where(is.list), .fns = length)
  )
## # A tibble: 3 × 4
## # Rowwise: 
##   species   `2007` `2008` `2009`
##   <fct>      <int>  <int>  <int>
## 1 Adelie        44     50     52
## 2 Chinstrap     26     18     24
## 3 Gentoo        33     45     41
# across(.cols = ,     .fns = purrr::map_dbl()  )
# 用across()就是一列一列的处理,
# 此时的一列是vector or list,又可以进入purrr::map_dbl()再次迭代,对这一列的每个元素,执行.f
# 然后across()到下一列
# 循环模式:第一层,一列一列,第二层在每一列里,一个元素到一个元素

penguins %>% 
  group_by(species, year) %>% 
  summarise(flipper_length_mm = list(flipper_length_mm)) %>% 
  ungroup() %>% 
  pivot_wider(
    names_from = year,
    values_from = flipper_length_mm
  ) %>% 
  mutate(
    across(where(is.list), ~ purrr::map_dbl(.x, length))
  )
## # A tibble: 3 × 4
##   species   `2007` `2008` `2009`
##   <fct>      <dbl>  <dbl>  <dbl>
## 1 Adelie        44     50     52
## 2 Chinstrap     26     18     24
## 3 Gentoo        33     45     41
# `purrr::map_df(.x = across(),  .f = )`
# mutate()中的`across()`整体被视为**数据框**,传入purrr::map_df(.x = across(),    .f = ),然后迭代,返回数据框最后并入最初的df


penguins %>%
  select(species, starts_with("bill_")) %>% 
  head(5) %>% 
  
  mutate(
    map_dfc(
      .x = across(ends_with("_mm"), .names = '{sub("_mm", "", .col)}'),
      .f = ~.x^2
    )
  )
## # A tibble: 5 × 5
##   species bill_length_mm bill_depth_mm bill_length
##   <fct>            <dbl>         <dbl>       <dbl>
## 1 Adelie            39.1          18.7       1529.
## 2 Adelie            39.5          17.4       1560.
## 3 Adelie            40.3          18         1624.
## 4 Adelie            36.7          19.3       1347.
## 5 Adelie            39.3          20.6       1544.
## # … with 1 more variable: bill_depth <dbl>

写成分步的形式,可能更好理解

penguins %>%
  select(species, starts_with("bill_")) %>% 
  head(5) %>% 
  
  mutate({
    data <- across(ends_with("_mm"), .names = '{sub("_mm", "", .col)}')
    out <- map_dfc(data, .f = ~.x^2)
    out
  })
## # A tibble: 5 × 5
##   species bill_length_mm bill_depth_mm bill_length
##   <fct>            <dbl>         <dbl>       <dbl>
## 1 Adelie            39.1          18.7       1529.
## 2 Adelie            39.5          17.4       1560.
## 3 Adelie            40.3          18         1624.
## 4 Adelie            36.7          19.3       1347.
## 5 Adelie            39.3          20.6       1544.
## # … with 1 more variable: bill_depth <dbl>

再回头看across()的常规用法,是否对它有了新的认识?

penguins %>%
  select(species, starts_with("bill_")) %>% 
  head(5) %>% 
  
  mutate(
    across(ends_with("_mm"), .fns = ~.x^2, .names = '{sub("_mm", "", .col)}')
  )
## # A tibble: 5 × 5
##   species bill_length_mm bill_depth_mm bill_length
##   <fct>            <dbl>         <dbl>       <dbl>
## 1 Adelie            39.1          18.7       1529.
## 2 Adelie            39.5          17.4       1560.
## 3 Adelie            40.3          18         1624.
## 4 Adelie            36.7          19.3       1347.
## 5 Adelie            39.3          20.6       1544.
## # … with 1 more variable: bill_depth <dbl>

43.5.1 案例1

觉得不过瘾,我们看下面复杂点的例子

tt <- penguins %>% 
  group_by(species, year) %>% 
  summarise(
    across(c(bill_length_mm, bill_depth_mm), list) 
  ) %>% 
  ungroup() 
tt
## # A tibble: 9 × 4
##   species    year bill_length_mm bill_depth_mm
##   <fct>     <int> <list>         <list>       
## 1 Adelie     2007 <dbl [44]>     <dbl [44]>   
## 2 Adelie     2008 <dbl [50]>     <dbl [50]>   
## 3 Adelie     2009 <dbl [52]>     <dbl [52]>   
## 4 Chinstrap  2007 <dbl [26]>     <dbl [26]>   
## 5 Chinstrap  2008 <dbl [18]>     <dbl [18]>   
## 6 Chinstrap  2009 <dbl [24]>     <dbl [24]>   
## # … with 3 more rows
tt %>%
  mutate(
    map_dfc(
      .x = across(ends_with("_mm"), .names = '{sub("_mm", "", .col)}'),
      .f = ~ map_dbl(.x, length)
    )
  )
## # A tibble: 9 × 6
##   species    year bill_length_mm bill_depth_mm
##   <fct>     <int> <list>         <list>       
## 1 Adelie     2007 <dbl [44]>     <dbl [44]>   
## 2 Adelie     2008 <dbl [50]>     <dbl [50]>   
## 3 Adelie     2009 <dbl [52]>     <dbl [52]>   
## 4 Chinstrap  2007 <dbl [26]>     <dbl [26]>   
## 5 Chinstrap  2008 <dbl [18]>     <dbl [18]>   
## 6 Chinstrap  2009 <dbl [24]>     <dbl [24]>   
## # … with 3 more rows, and 2 more variables:
## #   bill_length <dbl>, bill_depth <dbl>

分步写法

tt %>%
  mutate({
    data <- across(ends_with("_mm"), .names = '{sub("_mm", "", .col)}')
    out <- map_dfc(data, .f = ~ map_dbl(.x, length))
    out
  })
## # A tibble: 9 × 6
##   species    year bill_length_mm bill_depth_mm
##   <fct>     <int> <list>         <list>       
## 1 Adelie     2007 <dbl [44]>     <dbl [44]>   
## 2 Adelie     2008 <dbl [50]>     <dbl [50]>   
## 3 Adelie     2009 <dbl [52]>     <dbl [52]>   
## 4 Chinstrap  2007 <dbl [26]>     <dbl [26]>   
## 5 Chinstrap  2008 <dbl [18]>     <dbl [18]>   
## 6 Chinstrap  2009 <dbl [24]>     <dbl [24]>   
## # … with 3 more rows, and 2 more variables:
## #   bill_length <dbl>, bill_depth <dbl>

回到常规写法

tt %>%
  mutate(
    across(ends_with("_mm"), .fns = ~ map_dbl(.x, length), .names = '{sub("_mm", "", .col)}')
  )
## # A tibble: 9 × 6
##   species    year bill_length_mm bill_depth_mm
##   <fct>     <int> <list>         <list>       
## 1 Adelie     2007 <dbl [44]>     <dbl [44]>   
## 2 Adelie     2008 <dbl [50]>     <dbl [50]>   
## 3 Adelie     2009 <dbl [52]>     <dbl [52]>   
## 4 Chinstrap  2007 <dbl [26]>     <dbl [26]>   
## 5 Chinstrap  2008 <dbl [18]>     <dbl [18]>   
## 6 Chinstrap  2009 <dbl [24]>     <dbl [24]>   
## # … with 3 more rows, and 2 more variables:
## #   bill_length <dbl>, bill_depth <dbl>

43.5.2 案例2

更变态的案例

tt %>%
  mutate(
    purrr::map2_dfr(
      .x = across(bill_length_mm, .names = "cor"),
      .y = across(bill_depth_mm),
      .f = ~ map2_dbl(.x, .y, cor)
    )
  )
## # A tibble: 9 × 5
##   species    year bill_length_mm bill_depth_mm   cor
##   <fct>     <int> <list>         <list>        <dbl>
## 1 Adelie     2007 <dbl [44]>     <dbl [44]>    0.294
## 2 Adelie     2008 <dbl [50]>     <dbl [50]>    0.439
## 3 Adelie     2009 <dbl [52]>     <dbl [52]>    0.444
## 4 Chinstrap  2007 <dbl [26]>     <dbl [26]>    0.526
## 5 Chinstrap  2008 <dbl [18]>     <dbl [18]>    0.884
## 6 Chinstrap  2009 <dbl [24]>     <dbl [24]>    0.575
## # … with 3 more rows
tt %>%
  mutate(
    purrr::map2_dfr(
      .x = across(bill_length_mm, .names = "cor"),
      .y = across(bill_depth_mm),
      .f = ~ map2_dbl(.x, .y, cor)
    )
  )
## # A tibble: 9 × 5
##   species    year bill_length_mm bill_depth_mm   cor
##   <fct>     <int> <list>         <list>        <dbl>
## 1 Adelie     2007 <dbl [44]>     <dbl [44]>    0.294
## 2 Adelie     2008 <dbl [50]>     <dbl [50]>    0.439
## 3 Adelie     2009 <dbl [52]>     <dbl [52]>    0.444
## 4 Chinstrap  2007 <dbl [26]>     <dbl [26]>    0.526
## 5 Chinstrap  2008 <dbl [18]>     <dbl [18]>    0.884
## 6 Chinstrap  2009 <dbl [24]>     <dbl [24]>    0.575
## # … with 3 more rows

分步写法

tt %>%
  mutate({
    data1 <- across(bill_length_mm, .names = "cor")
    data2 <- across(bill_depth_mm)
    out <- purrr::map2_dfc(data1, data2, .f = ~ map2_dbl(.x, .y, cor))
    out
  })
## # A tibble: 9 × 5
##   species    year bill_length_mm bill_depth_mm   cor
##   <fct>     <int> <list>         <list>        <dbl>
## 1 Adelie     2007 <dbl [44]>     <dbl [44]>    0.294
## 2 Adelie     2008 <dbl [50]>     <dbl [50]>    0.439
## 3 Adelie     2009 <dbl [52]>     <dbl [52]>    0.444
## 4 Chinstrap  2007 <dbl [26]>     <dbl [26]>    0.526
## 5 Chinstrap  2008 <dbl [18]>     <dbl [18]>    0.884
## 6 Chinstrap  2009 <dbl [24]>     <dbl [24]>    0.575
## # … with 3 more rows

常规方法

tt %>% 
  rowwise() %>% 
  mutate(
    cor = cor(bill_length_mm, bill_depth_mm)
  )
## # A tibble: 9 × 5
## # Rowwise: 
##   species    year bill_length_mm bill_depth_mm   cor
##   <fct>     <int> <list>         <list>        <dbl>
## 1 Adelie     2007 <dbl [44]>     <dbl [44]>    0.294
## 2 Adelie     2008 <dbl [50]>     <dbl [50]>    0.439
## 3 Adelie     2009 <dbl [52]>     <dbl [52]>    0.444
## 4 Chinstrap  2007 <dbl [26]>     <dbl [26]>    0.526
## 5 Chinstrap  2008 <dbl [18]>     <dbl [18]>    0.884
## 6 Chinstrap  2009 <dbl [24]>     <dbl [24]>    0.575
## # … with 3 more rows

我们这样折腾只是为了展示各种迭代.

43.6 习题

43.6.1 习题1

对于数据

df <- tibble(
  id = 1:10,
  sex = c("m", "m", "m", "f", "f", "f", "m", "f", "f", "m"),
  lds1.x = c(NA, 1, 0, 1, NA, 0, 0, NA, 0, 1),
  lds1.y = c(1, NA, 1, 1, 0, NA, 0, 3, NA, 1),
  lds2.x = c(2, 1, NA, 0, 0, NA, 1, NA, NA, 1),
  lds2.y = c(0, 2, 2, NA, NA, 0, 0, 3, 0, NA)
)
df
## # A tibble: 10 × 6
##      id sex   lds1.x lds1.y lds2.x lds2.y
##   <int> <chr>  <dbl>  <dbl>  <dbl>  <dbl>
## 1     1 m         NA      1      2      0
## 2     2 m          1     NA      1      2
## 3     3 m          0      1     NA      2
## 4     4 f          1      1      0     NA
## 5     5 f         NA      0      0     NA
## 6     6 f          0     NA     NA      0
## # … with 4 more rows

希望两两coalesce,比如,

df %>%
  mutate(
    lds1 = coalesce(lds1.x, lds1.y),
    lds2 = coalesce(lds2.x, lds2.y)
  )

但要求是用across()写。

解题思路:

df %>%
  mutate(
    across(ends_with(".x"))
  )


df %>%
  mutate(
    across(ends_with(".x"), .names = '{sub(".x","",.col)}')
  )
df %>%
  mutate(
    map_dfc(
      .x = across(ends_with(".x"), .names = '{sub(".x","", .col)}'),
      .f = ~is.na(.x)
    )
  )
df %>%
  mutate(
    map2_dfr(
      .x = across(ends_with(".x"), .names = '{sub(".x","",.col)}'),
      .y = across(ends_with(".y")),
      .f = coalesce               # Vectors coalesce
    )
  )
## # A tibble: 10 × 8
##      id sex   lds1.x lds1.y lds2.x lds2.y  lds1  lds2
##   <int> <chr>  <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl>
## 1     1 m         NA      1      2      0     1     2
## 2     2 m          1     NA      1      2     1     1
## 3     3 m          0      1     NA      2     0     2
## 4     4 f          1      1      0     NA     1     0
## 5     5 f         NA      0      0     NA     0     0
## 6     6 f          0     NA     NA      0     0     0
## # … with 4 more rows
  • 分步写,更清晰和优雅。迭代过程:数据框df1的第一列和数据框data2的第一列coalesce,然后数据框df1的第二列和数据框df2的第二列coalesce.
df %>%
  mutate({
    df1 <- across(ends_with(".x"), .names = '{sub(".x","",.col)}')
    df2 <- across(ends_with(".y"))
    out <- purrr::map2_dfc(df1, df2, ~ coalesce(.x, .y)) 
    out
  })
## # A tibble: 10 × 8
##      id sex   lds1.x lds1.y lds2.x lds2.y  lds1  lds2
##   <int> <chr>  <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl>
## 1     1 m         NA      1      2      0     1     2
## 2     2 m          1     NA      1      2     1     1
## 3     3 m          0      1     NA      2     0     2
## 4     4 f          1      1      0     NA     1     0
## 5     5 f         NA      0      0     NA     0     0
## 6     6 f          0     NA     NA      0     0     0
## # … with 4 more rows

43.6.2 习题2

题目:如果符合某个条件,就让指定的列反号。比如,如果x小于4,x和y两列就反号。

事实上,完成这个任务的方法很多,我们只是演示across()的某些特征。

d <- tibble( x = 1:4, y = 1:4)
d
## # A tibble: 4 × 2
##       x     y
##   <int> <int>
## 1     1     1
## 2     2     2
## 3     3     3
## 4     4     4
# using data frame returns
d %>% 
  mutate({
    test <- x < 4
    x[test] <- -x[test]
    y[test] <- -y[test]
    data.frame(x = x, y = y)
  })
## # A tibble: 4 × 2
##       x     y
##   <int> <int>
## 1    -1    -1
## 2    -2    -2
## 3    -3    -3
## 4     4     4
# using across()
d %>% 
  mutate({
    test <- x < 4
    across(c(x, y), ~ {.x[test] <- -.x[test]; .x })
  })
## # A tibble: 4 × 2
##       x     y
##   <int> <int>
## 1    -1    -1
## 2    -2    -2
## 3    -3    -3
## 4     4     4
# further abstract
negate_if <- function(condition, cols) {
  across({{ cols }}, ~ {
    .x[condition] <- -.x[condition]
    .x
  })
}
d %>% 
  mutate(negate_if(x < 4, c(x, y)))
## # A tibble: 4 × 2
##       x     y
##   <int> <int>
## 1    -1    -1
## 2    -2    -2
## 3    -3    -3
## 4     4     4