第 43 章 tidyverse中的across()之美4
本章讲讲mutate()
中的across()
与c_acorss()
、map_df()
、map2_dfc()
系列的纠缠。
内容涉及迭代、泛函、返回数据框、数据框并入等概念。
43.1 mutate()
数据框并入
在以往的学习中,我们了解到mutate()
的功能是新增一列,
## # A tibble: 3 × 2
## x x1
## <int> <dbl>
## 1 3 6
## 2 4 8
## 3 5 10
有时候我在想同时新增多列呢?
## # A tibble: 3 × 3
## x x1 x2
## <int> <dbl> <dbl>
## 1 3 6 9
## 2 4 8 16
## 3 5 10 25
从形式上看,相当于在原来d的基础上并入了一个新的数据框。或许mutate()
和我们认识的不一样, 于是我试试
## # A tibble: 3 × 2
## x y
## <int> <int>
## 1 3 4
## 2 4 5
## 3 5 6
奥利给。我们再看看,让数据框是以函数的形式返回
## # A tibble: 3 × 3
## x x1 x2
## <int> <dbl> <dbl>
## 1 3 6 9
## 2 4 8 16
## 3 5 10 25
是不是很惊喜。 我们再看看across()
函数,在mutate()
中 across()
返回的就是数据框,正好并入原来d,道理是一样的。
## # A tibble: 3 × 3
## x x_f1 x_f2
## <int> <dbl> <dbl>
## 1 3 6 9
## 2 4 8 16
## 3 5 10 25
43.2 从一个问题开始
计算每天水分和食物的所占比例,比如第一天water和food都是10.0,那么各自比例都是50%.
d <- tibble::tribble(
~water, ~food,
10.0, 10.0,
12.1, 10.3,
13.5, 19.1,
17.4, 16.0,
25.8, 15.6,
27.4, 19.8
)
d
## # A tibble: 6 × 2
## water food
## <dbl> <dbl>
## 1 10 10
## 2 12.1 10.3
## 3 13.5 19.1
## 4 17.4 16
## 5 25.8 15.6
## 6 27.4 19.8
43.3 传统的方法
传统的方法是,把数据框旋转成长表格,计算所占比例后,再旋转回来
d %>%
rownames_to_column() %>%
pivot_longer(
cols = !rowname
) %>%
group_by(rowname) %>%
mutate(
percent = 100 * value / sum(value)
) %>%
ungroup() %>%
pivot_wider(
names_from = name,
values_from = c(value, percent),
names_glue = "{name}_{.value}"
)
## # A tibble: 6 × 5
## rowname water_value food_value water_percent food_percent
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 10 10 50 50
## 2 2 12.1 10.3 54.0 46.0
## 3 3 13.5 19.1 41.4 58.6
## 4 4 17.4 16 52.1 47.9
## 5 5 25.8 15.6 62.3 37.7
## 6 6 27.4 19.8 58.1 41.9
43.4 across()的方法
传统的方法,用到基本的dplyr函数,思路很清晰,但有点周折。下面,我列出几个比较新颖的方法,当然这些方法都来源于强大across()
函数
43.4.1 方法1
## # A tibble: 6 × 4
## water food `%water` `%food`
## <dbl> <dbl> <dbl> <dbl>
## 1 10 10 50 50
## 2 12.1 10.3 54.0 46.0
## 3 13.5 19.1 41.4 58.6
## 4 17.4 16 52.1 47.9
## 5 25.8 15.6 62.3 37.7
## 6 27.4 19.8 58.1 41.9
43.4.2 方法2
rowPercent <- function(df) {
df / rowSums(df) * 100
}
d %>%
mutate(rowPercent(across(.names = "%{.col}")))
## # A tibble: 6 × 4
## water food `%water` `%food`
## <dbl> <dbl> <dbl> <dbl>
## 1 10 10 50 50
## 2 12.1 10.3 54.0 46.0
## 3 13.5 19.1 41.4 58.6
## 4 17.4 16 52.1 47.9
## 5 25.8 15.6 62.3 37.7
## 6 27.4 19.8 58.1 41.9
43.4.3 方法3
d %>%
rowwise() %>%
mutate(
across(everything(), ~ .x / sum(c_across()) )
)
df %>%
rowwise() %>%
mutate(
across(everything(), .names = "prop_{.col}", ~ .x / sum(c_across()) )
)
df %>%
rowwise() %>%
mutate(
across(.names = "prop_{.col}", .fns = ~ .x / sum(c_across()) )
)
## # A tibble: 6 × 4
## water food `%water` `%food`
## <dbl> <dbl> <dbl> <dbl>
## 1 10 10 50 50
## 2 12.1 10.3 54.0 46.0
## 3 13.5 19.1 41.4 58.6
## 4 17.4 16 52.1 47.9
## 5 25.8 15.6 62.3 37.7
## 6 27.4 19.8 58.1 41.9
43.4.4 方法4
scale <- function(x) {
100 * x / sum(x, na.rm = TRUE)
}
d %>%
rowwise() %>%
mutate(
scale(across(.names = "%{.col}"))
)
## # A tibble: 6 × 4
## # Rowwise:
## water food `%water` `%food`
## <dbl> <dbl> <dbl> <dbl>
## 1 10 10 50 50
## 2 12.1 10.3 54.0 46.0
## 3 13.5 19.1 41.4 58.6
## 4 17.4 16 52.1 47.9
## 5 25.8 15.6 62.3 37.7
## 6 27.4 19.8 58.1 41.9
43.4.5 方法5
## # A tibble: 6 × 4
## # Rowwise:
## water food `%water` `%food`
## <dbl> <dbl> <dbl> <dbl>
## 1 10 10 50 50
## 2 12.1 10.3 54.0 46.0
## 3 13.5 19.1 41.4 58.6
## 4 17.4 16 52.1 47.9
## 5 25.8 15.6 62.3 37.7
## 6 27.4 19.8 58.1 41.9
43.4.6 方法6
更好的呈现比例
d %>%
rowwise() %>%
mutate(
across(c(water, food), ~scales::label_percent(scale = 100)(.x /sum(c_across())), .names = "{.col}_%")
)
## # A tibble: 6 × 4
## # Rowwise:
## water food `water_%` `food_%`
## <dbl> <dbl> <chr> <chr>
## 1 10 10 50% 50%
## 2 12.1 10.3 54% 46%
## 3 13.5 19.1 41% 59%
## 4 17.4 16 52% 48%
## 5 25.8 15.6 62% 38%
## 6 27.4 19.8 58% 42%
上面的方法虽然很多,但基本思路是一样的。
43.5 纠缠不清的迭代
我们先弄清楚迭代方向:
-
rowwise()
一行一行的处理 -
across()
一列一列的处理 -
rowwise() + across()
这种组合,双重迭代,(一行一行+
一列一列)就变成了一个一个的处理 -
across() + purrr::map_dbl()
这种组合分两种情形:-
purrr::map_dbl()
作为across( .fns = )
中的函数,即across(.cols = , .fns = map_dbl() )
。across()
一列一列的迭代,每一列又传入purrr::map_dbl()
再次迭代,因此这里是双重迭代 -
across()
作为purrr::map_df(.x = )
的数据,即purrr::map_df(.x = across(), .f = )
。因为在mutate()
中across()
返回数据框,因此可以把across()
整体视为数据框,然后这个数据框传入purrr::map_df(.x = )
进行迭代,因此这种情形可以认为只有purrr::map_*()
一次迭代。
-
# rowwise() + across()
# rowwise() 设定行方向后,接着across() 就行方向上的元素一个一个的执行.fns
# 循环模式:第一层,一行一行的,第二层在每一行里,一个元素到一个元素
penguins %>%
group_by(species, year) %>%
summarise(flipper_length_mm = list(flipper_length_mm)) %>%
ungroup() %>%
pivot_wider(
names_from = year,
values_from = flipper_length_mm
) %>%
rowwise() %>%
mutate(
across(where(is.list), .fns = length)
)
## # A tibble: 3 × 4
## # Rowwise:
## species `2007` `2008` `2009`
## <fct> <int> <int> <int>
## 1 Adelie 44 50 52
## 2 Chinstrap 26 18 24
## 3 Gentoo 33 45 41
# across(.cols = , .fns = purrr::map_dbl() )
# 用across()就是一列一列的处理,
# 此时的一列是vector or list,又可以进入purrr::map_dbl()再次迭代,对这一列的每个元素,执行.f
# 然后across()到下一列
# 循环模式:第一层,一列一列,第二层在每一列里,一个元素到一个元素
penguins %>%
group_by(species, year) %>%
summarise(flipper_length_mm = list(flipper_length_mm)) %>%
ungroup() %>%
pivot_wider(
names_from = year,
values_from = flipper_length_mm
) %>%
mutate(
across(where(is.list), ~ purrr::map_dbl(.x, length))
)
## # A tibble: 3 × 4
## species `2007` `2008` `2009`
## <fct> <dbl> <dbl> <dbl>
## 1 Adelie 44 50 52
## 2 Chinstrap 26 18 24
## 3 Gentoo 33 45 41
# `purrr::map_df(.x = across(), .f = )`
# mutate()中的`across()`整体被视为**数据框**,传入purrr::map_df(.x = across(), .f = ),然后迭代,返回数据框最后并入最初的df
penguins %>%
select(species, starts_with("bill_")) %>%
head(5) %>%
mutate(
map_dfc(
.x = across(ends_with("_mm"), .names = '{sub("_mm", "", .col)}'),
.f = ~.x^2
)
)
## # A tibble: 5 × 5
## species bill_length_mm bill_depth_mm bill_length bill_depth
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Adelie 39.1 18.7 1529. 350.
## 2 Adelie 39.5 17.4 1560. 303.
## 3 Adelie 40.3 18 1624. 324
## 4 Adelie 36.7 19.3 1347. 372.
## 5 Adelie 39.3 20.6 1544. 424.
写成分步的形式,可能更好理解
penguins %>%
select(species, starts_with("bill_")) %>%
head(5) %>%
mutate({
data <- across(ends_with("_mm"), .names = '{sub("_mm", "", .col)}')
out <- map_dfc(data, .f = ~.x^2)
out
})
## # A tibble: 5 × 5
## species bill_length_mm bill_depth_mm bill_length bill_depth
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Adelie 39.1 18.7 1529. 350.
## 2 Adelie 39.5 17.4 1560. 303.
## 3 Adelie 40.3 18 1624. 324
## 4 Adelie 36.7 19.3 1347. 372.
## 5 Adelie 39.3 20.6 1544. 424.
再回头看across()
的常规用法,是否对它有了新的认识?
penguins %>%
select(species, starts_with("bill_")) %>%
head(5) %>%
mutate(
across(ends_with("_mm"), .fns = ~.x^2, .names = '{sub("_mm", "", .col)}')
)
## # A tibble: 5 × 5
## species bill_length_mm bill_depth_mm bill_length bill_depth
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Adelie 39.1 18.7 1529. 350.
## 2 Adelie 39.5 17.4 1560. 303.
## 3 Adelie 40.3 18 1624. 324
## 4 Adelie 36.7 19.3 1347. 372.
## 5 Adelie 39.3 20.6 1544. 424.
43.5.1 案例1
觉得不过瘾,我们看下面复杂点的例子
tt <- penguins %>%
group_by(species, year) %>%
summarise(
across(c(bill_length_mm, bill_depth_mm), list)
) %>%
ungroup()
tt
## # A tibble: 9 × 4
## species year bill_length_mm bill_depth_mm
## <fct> <int> <list> <list>
## 1 Adelie 2007 <dbl [44]> <dbl [44]>
## 2 Adelie 2008 <dbl [50]> <dbl [50]>
## 3 Adelie 2009 <dbl [52]> <dbl [52]>
## 4 Chinstrap 2007 <dbl [26]> <dbl [26]>
## 5 Chinstrap 2008 <dbl [18]> <dbl [18]>
## 6 Chinstrap 2009 <dbl [24]> <dbl [24]>
## 7 Gentoo 2007 <dbl [33]> <dbl [33]>
## 8 Gentoo 2008 <dbl [45]> <dbl [45]>
## 9 Gentoo 2009 <dbl [41]> <dbl [41]>
tt %>%
mutate(
map_dfc(
.x = across(ends_with("_mm"), .names = '{sub("_mm", "", .col)}'),
.f = ~ map_dbl(.x, length)
)
)
## # A tibble: 9 × 6
## species year bill_length_mm bill_depth_mm bill_length bill_depth
## <fct> <int> <list> <list> <dbl> <dbl>
## 1 Adelie 2007 <dbl [44]> <dbl [44]> 44 44
## 2 Adelie 2008 <dbl [50]> <dbl [50]> 50 50
## 3 Adelie 2009 <dbl [52]> <dbl [52]> 52 52
## 4 Chinstrap 2007 <dbl [26]> <dbl [26]> 26 26
## 5 Chinstrap 2008 <dbl [18]> <dbl [18]> 18 18
## 6 Chinstrap 2009 <dbl [24]> <dbl [24]> 24 24
## 7 Gentoo 2007 <dbl [33]> <dbl [33]> 33 33
## 8 Gentoo 2008 <dbl [45]> <dbl [45]> 45 45
## 9 Gentoo 2009 <dbl [41]> <dbl [41]> 41 41
分步写法
tt %>%
mutate({
data <- across(ends_with("_mm"), .names = '{sub("_mm", "", .col)}')
out <- map_dfc(data, .f = ~ map_dbl(.x, length))
out
})
## # A tibble: 9 × 6
## species year bill_length_mm bill_depth_mm bill_length bill_depth
## <fct> <int> <list> <list> <dbl> <dbl>
## 1 Adelie 2007 <dbl [44]> <dbl [44]> 44 44
## 2 Adelie 2008 <dbl [50]> <dbl [50]> 50 50
## 3 Adelie 2009 <dbl [52]> <dbl [52]> 52 52
## 4 Chinstrap 2007 <dbl [26]> <dbl [26]> 26 26
## 5 Chinstrap 2008 <dbl [18]> <dbl [18]> 18 18
## 6 Chinstrap 2009 <dbl [24]> <dbl [24]> 24 24
## 7 Gentoo 2007 <dbl [33]> <dbl [33]> 33 33
## 8 Gentoo 2008 <dbl [45]> <dbl [45]> 45 45
## 9 Gentoo 2009 <dbl [41]> <dbl [41]> 41 41
回到常规写法
tt %>%
mutate(
across(ends_with("_mm"), .fns = ~ map_dbl(.x, length), .names = '{sub("_mm", "", .col)}')
)
## # A tibble: 9 × 6
## species year bill_length_mm bill_depth_mm bill_length bill_depth
## <fct> <int> <list> <list> <dbl> <dbl>
## 1 Adelie 2007 <dbl [44]> <dbl [44]> 44 44
## 2 Adelie 2008 <dbl [50]> <dbl [50]> 50 50
## 3 Adelie 2009 <dbl [52]> <dbl [52]> 52 52
## 4 Chinstrap 2007 <dbl [26]> <dbl [26]> 26 26
## 5 Chinstrap 2008 <dbl [18]> <dbl [18]> 18 18
## 6 Chinstrap 2009 <dbl [24]> <dbl [24]> 24 24
## 7 Gentoo 2007 <dbl [33]> <dbl [33]> 33 33
## 8 Gentoo 2008 <dbl [45]> <dbl [45]> 45 45
## 9 Gentoo 2009 <dbl [41]> <dbl [41]> 41 41
43.5.2 案例2
更变态的案例
tt %>%
mutate(
purrr::map2_dfr(
.x = across(bill_length_mm, .names = "cor"),
.y = across(bill_depth_mm),
.f = ~ map2_dbl(.x, .y, cor)
)
)
## # A tibble: 9 × 5
## species year bill_length_mm bill_depth_mm cor
## <fct> <int> <list> <list> <dbl>
## 1 Adelie 2007 <dbl [44]> <dbl [44]> 0.294
## 2 Adelie 2008 <dbl [50]> <dbl [50]> 0.439
## 3 Adelie 2009 <dbl [52]> <dbl [52]> 0.444
## 4 Chinstrap 2007 <dbl [26]> <dbl [26]> 0.526
## 5 Chinstrap 2008 <dbl [18]> <dbl [18]> 0.884
## 6 Chinstrap 2009 <dbl [24]> <dbl [24]> 0.575
## 7 Gentoo 2007 <dbl [33]> <dbl [33]> 0.724
## 8 Gentoo 2008 <dbl [45]> <dbl [45]> 0.547
## 9 Gentoo 2009 <dbl [41]> <dbl [41]> 0.670
tt %>%
mutate(
purrr::map2_dfr(
.x = across(bill_length_mm, .names = "cor"),
.y = across(bill_depth_mm),
.f = ~ map2_dbl(.x, .y, cor)
)
)
## # A tibble: 9 × 5
## species year bill_length_mm bill_depth_mm cor
## <fct> <int> <list> <list> <dbl>
## 1 Adelie 2007 <dbl [44]> <dbl [44]> 0.294
## 2 Adelie 2008 <dbl [50]> <dbl [50]> 0.439
## 3 Adelie 2009 <dbl [52]> <dbl [52]> 0.444
## 4 Chinstrap 2007 <dbl [26]> <dbl [26]> 0.526
## 5 Chinstrap 2008 <dbl [18]> <dbl [18]> 0.884
## 6 Chinstrap 2009 <dbl [24]> <dbl [24]> 0.575
## 7 Gentoo 2007 <dbl [33]> <dbl [33]> 0.724
## 8 Gentoo 2008 <dbl [45]> <dbl [45]> 0.547
## 9 Gentoo 2009 <dbl [41]> <dbl [41]> 0.670
分步写法
tt %>%
mutate({
data1 <- across(bill_length_mm, .names = "cor")
data2 <- across(bill_depth_mm)
out <- purrr::map2_dfc(data1, data2, .f = ~ map2_dbl(.x, .y, cor))
out
})
## # A tibble: 9 × 5
## species year bill_length_mm bill_depth_mm cor
## <fct> <int> <list> <list> <dbl>
## 1 Adelie 2007 <dbl [44]> <dbl [44]> 0.294
## 2 Adelie 2008 <dbl [50]> <dbl [50]> 0.439
## 3 Adelie 2009 <dbl [52]> <dbl [52]> 0.444
## 4 Chinstrap 2007 <dbl [26]> <dbl [26]> 0.526
## 5 Chinstrap 2008 <dbl [18]> <dbl [18]> 0.884
## 6 Chinstrap 2009 <dbl [24]> <dbl [24]> 0.575
## 7 Gentoo 2007 <dbl [33]> <dbl [33]> 0.724
## 8 Gentoo 2008 <dbl [45]> <dbl [45]> 0.547
## 9 Gentoo 2009 <dbl [41]> <dbl [41]> 0.670
常规方法
## # A tibble: 9 × 5
## # Rowwise:
## species year bill_length_mm bill_depth_mm cor
## <fct> <int> <list> <list> <dbl>
## 1 Adelie 2007 <dbl [44]> <dbl [44]> 0.294
## 2 Adelie 2008 <dbl [50]> <dbl [50]> 0.439
## 3 Adelie 2009 <dbl [52]> <dbl [52]> 0.444
## 4 Chinstrap 2007 <dbl [26]> <dbl [26]> 0.526
## 5 Chinstrap 2008 <dbl [18]> <dbl [18]> 0.884
## 6 Chinstrap 2009 <dbl [24]> <dbl [24]> 0.575
## 7 Gentoo 2007 <dbl [33]> <dbl [33]> 0.724
## 8 Gentoo 2008 <dbl [45]> <dbl [45]> 0.547
## 9 Gentoo 2009 <dbl [41]> <dbl [41]> 0.670
我们这样折腾只是为了展示各种迭代.
43.6 习题
43.6.1 习题1
对于数据
df <- tibble(
id = 1:10,
sex = c("m", "m", "m", "f", "f", "f", "m", "f", "f", "m"),
lds1.x = c(NA, 1, 0, 1, NA, 0, 0, NA, 0, 1),
lds1.y = c(1, NA, 1, 1, 0, NA, 0, 3, NA, 1),
lds2.x = c(2, 1, NA, 0, 0, NA, 1, NA, NA, 1),
lds2.y = c(0, 2, 2, NA, NA, 0, 0, 3, 0, NA)
)
df
## # A tibble: 10 × 6
## id sex lds1.x lds1.y lds2.x lds2.y
## <int> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 m NA 1 2 0
## 2 2 m 1 NA 1 2
## 3 3 m 0 1 NA 2
## 4 4 f 1 1 0 NA
## 5 5 f NA 0 0 NA
## 6 6 f 0 NA NA 0
## 7 7 m 0 0 1 0
## 8 8 f NA 3 NA 3
## 9 9 f 0 NA NA 0
## 10 10 m 1 1 1 NA
希望两两coalesce
,比如,
但要求是用across()
写。
解题思路:
df %>%
mutate(
across(ends_with(".x"))
)
df %>%
mutate(
across(ends_with(".x"), .names = '{sub(".x","",.col)}')
)
df %>%
mutate(
map_dfc(
.x = across(ends_with(".x"), .names = '{sub(".x","", .col)}'),
.f = ~is.na(.x)
)
)
- 两个
across()
对应两个数据框,传递给map2_dfc()
函数
df %>%
mutate(
map2_dfr(
.x = across(ends_with(".x"), .names = '{sub(".x","",.col)}'),
.y = across(ends_with(".y")),
.f = coalesce # Vectors coalesce
)
)
## # A tibble: 10 × 8
## id sex lds1.x lds1.y lds2.x lds2.y lds1 lds2
## <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 m NA 1 2 0 1 2
## 2 2 m 1 NA 1 2 1 1
## 3 3 m 0 1 NA 2 0 2
## 4 4 f 1 1 0 NA 1 0
## 5 5 f NA 0 0 NA 0 0
## 6 6 f 0 NA NA 0 0 0
## 7 7 m 0 0 1 0 0 1
## 8 8 f NA 3 NA 3 3 3
## 9 9 f 0 NA NA 0 0 0
## 10 10 m 1 1 1 NA 1 1
- 分步写,更清晰和优雅。迭代过程:数据框df1的第一列和数据框data2的第一列coalesce,然后数据框df1的第二列和数据框df2的第二列coalesce.
df %>%
mutate({
df1 <- across(ends_with(".x"), .names = '{sub(".x","",.col)}')
df2 <- across(ends_with(".y"))
out <- purrr::map2_dfc(df1, df2, ~ coalesce(.x, .y))
out
})
## # A tibble: 10 × 8
## id sex lds1.x lds1.y lds2.x lds2.y lds1 lds2
## <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 m NA 1 2 0 1 2
## 2 2 m 1 NA 1 2 1 1
## 3 3 m 0 1 NA 2 0 2
## 4 4 f 1 1 0 NA 1 0
## 5 5 f NA 0 0 NA 0 0
## 6 6 f 0 NA NA 0 0 0
## 7 7 m 0 0 1 0 0 1
## 8 8 f NA 3 NA 3 3 3
## 9 9 f 0 NA NA 0 0 0
## 10 10 m 1 1 1 NA 1 1
43.6.2 习题2
题目:如果符合某个条件,就让指定的列反号。比如,如果x小于4,x和y两列就反号。
事实上,完成这个任务的方法很多,我们只是演示across()
的某些特征。
d <- tibble( x = 1:4, y = 1:4)
d
## # A tibble: 4 × 2
## x y
## <int> <int>
## 1 1 1
## 2 2 2
## 3 3 3
## 4 4 4
# using data frame returns
d %>%
mutate({
test <- x < 4
x[test] <- -x[test]
y[test] <- -y[test]
data.frame(x = x, y = y)
})
## # A tibble: 4 × 2
## x y
## <int> <int>
## 1 -1 -1
## 2 -2 -2
## 3 -3 -3
## 4 4 4
## # A tibble: 4 × 2
## x y
## <int> <int>
## 1 -1 -1
## 2 -2 -2
## 3 -3 -3
## 4 4 4
# further abstract
negate_if <- function(condition, cols) {
across({{ cols }}, ~ {
.x[condition] <- -.x[condition]
.x
})
}
d %>%
mutate(negate_if(x < 4, c(x, y)))
## # A tibble: 4 × 2
## x y
## <int> <int>
## 1 -1 -1
## 2 -2 -2
## 3 -3 -3
## 4 4 4