第 39 章 tidyverse中的列方向和行方向

dplyr宏包是数据科学tidyverse集合的核心部件之一,Hadley Wickham大神说将会在5月15日发布dplyr 1.0版本,欢呼。

为迎接新时代的到来,我在线上同大家一起分享dplyr 1.0版本新的特点和功能,看看都为我们带来哪些惊喜?

39.1 体验新版本

New dplyr - 8 things to know:

  1. Built in tidyselect
  2. relocate()
  3. Superpowered summarise()
  4. colwise using across()
  5. cur_data(), cur_group() and cur_column()
  6. new rowwise() grammar
  7. easy modeling inside dataframes
  8. nest_by()
library(dplyr, warn.conflicts = FALSE)
library(tidyr)

39.2 简单回顾

39.3 summarise()更强大了

在dplyr 1.0之前,summarise()会把统计结果整理成一行一列的数据框,现在可以根据函数返回的结果,可以有多种形式:

  • 长度为 1 的向量,比如,min(x), n(), or sum(is.na(y))
  • 长度为 n 的向量,比如,quantile()
  • 数据框
df <- tibble(
  grp = rep(c("a", "b"), each = 5),
  x = c(rnorm(5, -0.25, 1), rnorm(5, 0, 1.5)),
  y = c(rnorm(5, 0.25, 1), rnorm(5, 0, 0.5))
)
df
## # A tibble: 10 × 3
##    grp        x      y
##    <chr>  <dbl>  <dbl>
##  1 a     -0.392  0.198
##  2 a     -0.197 -2.01 
##  3 a      2.04   2.44 
##  4 a     -1.10   0.421
##  5 a      0.693 -0.882
##  6 b      1.76  -0.600
##  7 b     -2.81   0.364
##  8 b     -0.326 -0.189
##  9 b     -0.508  0.964
## 10 b      2.83  -1.07
df %>%
  group_by(grp) %>%
  summarise(rng = mean(x))
## # A tibble: 2 × 2
##   grp     rng
##   <chr> <dbl>
## 1 a     0.209
## 2 b     0.189

当统计函数返回多个值的时候,比如range()返回是最小值和最大值,summarise()很贴心地将结果整理成多行,这样符合tidy的格式。

df %>%
  group_by(grp) %>%
  summarise(rng = range(x))
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
##   always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `summarise()` has grouped output by 'grp'. You can override using the `.groups`
## argument.
## # A tibble: 4 × 2
## # Groups:   grp [2]
##   grp     rng
##   <chr> <dbl>
## 1 a     -1.10
## 2 a      2.04
## 3 b     -2.81
## 4 b      2.83

类似的还有quantile()函数,也是返回多个值

df %>%
  group_by(grp) %>%
  summarise(
    rng = quantile(x, probs = c(0.05, 0.5, 0.95))
  )
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
##   always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `summarise()` has grouped output by 'grp'. You can override using the `.groups`
## argument.
## # A tibble: 6 × 2
## # Groups:   grp [2]
##   grp      rng
##   <chr>  <dbl>
## 1 a     -0.957
## 2 a     -0.197
## 3 a      1.77 
## 4 b     -2.35 
## 5 b     -0.326
## 6 b      2.62
df %>%
  group_by(grp) %>%
  summarise(
    x = quantile(x, c(0.25, 0.5, 0.75)),
    q = c(0.25, 0.5, 0.75)
  )
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
##   always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `summarise()` has grouped output by 'grp'. You can override using the `.groups`
## argument.
## # A tibble: 6 × 3
## # Groups:   grp [2]
##   grp        x     q
##   <chr>  <dbl> <dbl>
## 1 a     -0.392  0.25
## 2 a     -0.197  0.5 
## 3 a      0.693  0.75
## 4 b     -0.508  0.25
## 5 b     -0.326  0.5 
## 6 b      1.76   0.75

summarise()可以输出数据框,比如

my_quantile <- function(x, probs) {
  tibble(x = quantile(x, probs), probs = probs)
}
mtcars %>%
  group_by(cyl) %>%
  summarise(my_quantile(disp, c(0.25, 0.75)))
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
##   always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `summarise()` has grouped output by 'cyl'. You can override using the `.groups`
## argument.
## # A tibble: 6 × 3
## # Groups:   cyl [3]
##     cyl     x probs
##   <dbl> <dbl> <dbl>
## 1     4  78.8  0.25
## 2     4 121.   0.75
## 3     6 160    0.25
## 4     6 196.   0.75
## 5     8 302.   0.25
## 6     8 390    0.75

再比如:

dplyr 1.0 之前是需要group_modify()来实现数据框进,数据框出

mtcars %>%
  group_by(cyl) %>%
  group_modify(
    ~ broom::tidy(lm(mpg ~ wt, data = .))
  )
## # A tibble: 6 × 6
## # Groups:   cyl [3]
##     cyl term        estimate std.error statistic    p.value
##   <dbl> <chr>          <dbl>     <dbl>     <dbl>      <dbl>
## 1     4 (Intercept)    39.6      4.35       9.10 0.00000777
## 2     4 wt             -5.65     1.85      -3.05 0.0137    
## 3     6 (Intercept)    28.4      4.18       6.79 0.00105   
## 4     6 wt             -2.78     1.33      -2.08 0.0918    
## 5     8 (Intercept)    23.9      3.01       7.94 0.00000405
## 6     8 wt             -2.19     0.739     -2.97 0.0118

dplyr 1.0 之后,有了新的方案

mtcars %>%
  group_by(cyl) %>%
  summarise(
    broom::tidy(lm(mpg ~ wt))
  )
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
##   always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `summarise()` has grouped output by 'cyl'. You can override using the `.groups`
## argument.
## # A tibble: 6 × 6
## # Groups:   cyl [3]
##     cyl term        estimate std.error statistic    p.value
##   <dbl> <chr>          <dbl>     <dbl>     <dbl>      <dbl>
## 1     4 (Intercept)    39.6      4.35       9.10 0.00000777
## 2     4 wt             -5.65     1.85      -3.05 0.0137    
## 3     6 (Intercept)    28.4      4.18       6.79 0.00105   
## 4     6 wt             -2.78     1.33      -2.08 0.0918    
## 5     8 (Intercept)    23.9      3.01       7.94 0.00000405
## 6     8 wt             -2.19     0.739     -2.97 0.0118

39.4 summarise()后的分组信息是去是留?

group_by()summarise()配合使用的时候,summarise()默认会抵消掉最近一次的分组信息,比如下面按照cylvs分组,但summarise()后,就只剩下cyl的分组信息了。

mtcars %>%
  group_by(cyl, vs) %>%
  summarise(cyl_n = n())
## `summarise()` has grouped output by 'cyl'. You can override using the `.groups`
## argument.
## # A tibble: 5 × 3
## # Groups:   cyl [3]
##     cyl    vs cyl_n
##   <dbl> <dbl> <int>
## 1     4     0     1
## 2     4     1    10
## 3     6     0     3
## 4     6     1     4
## 5     8     0    14
mtcars %>%
  group_by(cyl, vs) %>%
  summarise(cyl_n = n()) %>%
  group_vars()
## `summarise()` has grouped output by 'cyl'. You can override using the `.groups`
## argument.
## [1] "cyl"

如果想保留vs的分组信息,就需要设置.groups = keep参数

mtcars %>%
  group_by(cyl, vs) %>%
  summarise(cyl_n = n(), .groups = "keep") %>%
  group_vars()
## [1] "cyl" "vs"

当然summarise()可以控制输出的更多形式

  • 丢弃所有的分组信息
mtcars %>%
  group_by(cyl, vs) %>%
  summarise(cyl_n = n(), .groups = "drop") %>%
  group_vars()
## character(0)
  • 变成行方向分组,即,每行是一个分组
mtcars %>%
  group_by(cyl, vs) %>%
  summarise(cyl_n = n(), .groups = "rowwise") %>%
  group_vars()
## [1] "cyl" "vs"

39.5 选择某列

  • 通过位置索引进行选取
df %>% select(1, 3)
## # A tibble: 10 × 2
##    grp        y
##    <chr>  <dbl>
##  1 a      0.198
##  2 a     -2.01 
##  3 a      2.44 
##  4 a      0.421
##  5 a     -0.882
##  6 b     -0.600
##  7 b      0.364
##  8 b     -0.189
##  9 b      0.964
## 10 b     -1.07
df %>% select(2:3)
## # A tibble: 10 × 2
##         x      y
##     <dbl>  <dbl>
##  1 -0.392  0.198
##  2 -0.197 -2.01 
##  3  2.04   2.44 
##  4 -1.10   0.421
##  5  0.693 -0.882
##  6  1.76  -0.600
##  7 -2.81   0.364
##  8 -0.326 -0.189
##  9 -0.508  0.964
## 10  2.83  -1.07
  • 通过列名
df %>% select(grp, x, y)
## # A tibble: 10 × 3
##    grp        x      y
##    <chr>  <dbl>  <dbl>
##  1 a     -0.392  0.198
##  2 a     -0.197 -2.01 
##  3 a      2.04   2.44 
##  4 a     -1.10   0.421
##  5 a      0.693 -0.882
##  6 b      1.76  -0.600
##  7 b     -2.81   0.364
##  8 b     -0.326 -0.189
##  9 b     -0.508  0.964
## 10 b      2.83  -1.07
df %>% select(x:y)
## # A tibble: 10 × 2
##         x      y
##     <dbl>  <dbl>
##  1 -0.392  0.198
##  2 -0.197 -2.01 
##  3  2.04   2.44 
##  4 -1.10   0.421
##  5  0.693 -0.882
##  6  1.76  -0.600
##  7 -2.81   0.364
##  8 -0.326 -0.189
##  9 -0.508  0.964
## 10  2.83  -1.07
  • 通过函数选取
## # A tibble: 10 × 1
##         x
##     <dbl>
##  1 -0.392
##  2 -0.197
##  3  2.04 
##  4 -1.10 
##  5  0.693
##  6  1.76 
##  7 -2.81 
##  8 -0.326
##  9 -0.508
## 10  2.83
## # A tibble: 10 × 1
##    grp  
##    <chr>
##  1 a    
##  2 a    
##  3 a    
##  4 a    
##  5 a    
##  6 b    
##  7 b    
##  8 b    
##  9 b    
## 10 b
## # A tibble: 10 × 1
##         x
##     <dbl>
##  1 -0.392
##  2 -0.197
##  3  2.04 
##  4 -1.10 
##  5  0.693
##  6  1.76 
##  7 -2.81 
##  8 -0.326
##  9 -0.508
## 10  2.83
df %>% select(matches("x"))
## # A tibble: 10 × 1
##         x
##     <dbl>
##  1 -0.392
##  2 -0.197
##  3  2.04 
##  4 -1.10 
##  5  0.693
##  6  1.76 
##  7 -2.81 
##  8 -0.326
##  9 -0.508
## 10  2.83
  • 通过类型
df %>% select(where(is.character))
## # A tibble: 10 × 1
##    grp  
##    <chr>
##  1 a    
##  2 a    
##  3 a    
##  4 a    
##  5 a    
##  6 b    
##  7 b    
##  8 b    
##  9 b    
## 10 b
df %>% select(where(is.numeric))
## # A tibble: 10 × 2
##         x      y
##     <dbl>  <dbl>
##  1 -0.392  0.198
##  2 -0.197 -2.01 
##  3  2.04   2.44 
##  4 -1.10   0.421
##  5  0.693 -0.882
##  6  1.76  -0.600
##  7 -2.81   0.364
##  8 -0.326 -0.189
##  9 -0.508  0.964
## 10  2.83  -1.07
  • 通过各种组合
df %>% select(!where(is.character))
## # A tibble: 10 × 2
##         x      y
##     <dbl>  <dbl>
##  1 -0.392  0.198
##  2 -0.197 -2.01 
##  3  2.04   2.44 
##  4 -1.10   0.421
##  5  0.693 -0.882
##  6  1.76  -0.600
##  7 -2.81   0.364
##  8 -0.326 -0.189
##  9 -0.508  0.964
## 10  2.83  -1.07
df %>% select(where(is.numeric) & starts_with("x"))
## # A tibble: 10 × 1
##         x
##     <dbl>
##  1 -0.392
##  2 -0.197
##  3  2.04 
##  4 -1.10 
##  5  0.693
##  6  1.76 
##  7 -2.81 
##  8 -0.326
##  9 -0.508
## 10  2.83
## # A tibble: 10 × 2
##    grp        y
##    <chr>  <dbl>
##  1 a      0.198
##  2 a     -2.01 
##  3 a      2.44 
##  4 a      0.421
##  5 a     -0.882
##  6 b     -0.600
##  7 b      0.364
##  8 b     -0.189
##  9 b      0.964
## 10 b     -1.07

注意any_of和all_of的区别

vars <- c("x", "y", "z")
df %>% select(all_of(vars))
df %>% select(any_of(vars))

39.6 重命名某列

df %>% rename(group = grp)
## # A tibble: 10 × 3
##    group      x      y
##    <chr>  <dbl>  <dbl>
##  1 a     -0.392  0.198
##  2 a     -0.197 -2.01 
##  3 a      2.04   2.44 
##  4 a     -1.10   0.421
##  5 a      0.693 -0.882
##  6 b      1.76  -0.600
##  7 b     -2.81   0.364
##  8 b     -0.326 -0.189
##  9 b     -0.508  0.964
## 10 b      2.83  -1.07
df %>% rename_with(toupper)
## # A tibble: 10 × 3
##    GRP        X      Y
##    <chr>  <dbl>  <dbl>
##  1 a     -0.392  0.198
##  2 a     -0.197 -2.01 
##  3 a      2.04   2.44 
##  4 a     -1.10   0.421
##  5 a      0.693 -0.882
##  6 b      1.76  -0.600
##  7 b     -2.81   0.364
##  8 b     -0.326 -0.189
##  9 b     -0.508  0.964
## 10 b      2.83  -1.07
df %>% rename_with(toupper, is.numeric)
## Warning: Use of bare predicate functions was deprecated in tidyselect 1.1.0.
## ℹ Please use wrap predicates in `where()` instead.
##   # Was:
##   data %>% select(is.numeric)
## 
##   # Now:
##   data %>% select(where(is.numeric))
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## # A tibble: 10 × 3
##    grp        X      Y
##    <chr>  <dbl>  <dbl>
##  1 a     -0.392  0.198
##  2 a     -0.197 -2.01 
##  3 a      2.04   2.44 
##  4 a     -1.10   0.421
##  5 a      0.693 -0.882
##  6 b      1.76  -0.600
##  7 b     -2.81   0.364
##  8 b     -0.326 -0.189
##  9 b     -0.508  0.964
## 10 b      2.83  -1.07
df %>% rename_with(toupper, starts_with("x"))
## # A tibble: 10 × 3
##    grp        X      y
##    <chr>  <dbl>  <dbl>
##  1 a     -0.392  0.198
##  2 a     -0.197 -2.01 
##  3 a      2.04   2.44 
##  4 a     -1.10   0.421
##  5 a      0.693 -0.882
##  6 b      1.76  -0.600
##  7 b     -2.81   0.364
##  8 b     -0.326 -0.189
##  9 b     -0.508  0.964
## 10 b      2.83  -1.07

39.7 调整列的位置

我们前面一章讲过arrange()排序,这是行方向的排序, 比如按照x变量绝对值的大小从高到低排序。

## # A tibble: 10 × 3
##    grp        x      y
##    <chr>  <dbl>  <dbl>
##  1 b      2.83  -1.07 
##  2 b     -2.81   0.364
##  3 a      2.04   2.44 
##  4 b      1.76  -0.600
##  5 a     -1.10   0.421
##  6 a      0.693 -0.882
##  7 b     -0.508  0.964
##  8 a     -0.392  0.198
##  9 b     -0.326 -0.189
## 10 a     -0.197 -2.01

我们现在想调整列的位置,比如,这里调整数据框三列的位置,让grp列放在x列的后面

df %>% select(x, grp, y)
## # A tibble: 10 × 3
##         x grp        y
##     <dbl> <chr>  <dbl>
##  1 -0.392 a      0.198
##  2 -0.197 a     -2.01 
##  3  2.04  a      2.44 
##  4 -1.10  a      0.421
##  5  0.693 a     -0.882
##  6  1.76  b     -0.600
##  7 -2.81  b      0.364
##  8 -0.326 b     -0.189
##  9 -0.508 b      0.964
## 10  2.83  b     -1.07

如果列变量很多的时候,上面的方法就不太好用,因此推荐大家使用relocate()

df %>% relocate(grp, .after = y)
## # A tibble: 10 × 3
##         x      y grp  
##     <dbl>  <dbl> <chr>
##  1 -0.392  0.198 a    
##  2 -0.197 -2.01  a    
##  3  2.04   2.44  a    
##  4 -1.10   0.421 a    
##  5  0.693 -0.882 a    
##  6  1.76  -0.600 b    
##  7 -2.81   0.364 b    
##  8 -0.326 -0.189 b    
##  9 -0.508  0.964 b    
## 10  2.83  -1.07  b
df %>% relocate(x, .before = grp)
## # A tibble: 10 × 3
##         x grp        y
##     <dbl> <chr>  <dbl>
##  1 -0.392 a      0.198
##  2 -0.197 a     -2.01 
##  3  2.04  a      2.44 
##  4 -1.10  a      0.421
##  5  0.693 a     -0.882
##  6  1.76  b     -0.600
##  7 -2.81  b      0.364
##  8 -0.326 b     -0.189
##  9 -0.508 b      0.964
## 10  2.83  b     -1.07

还有

df %>% relocate(grp, .after = last_col())
## # A tibble: 10 × 3
##         x      y grp  
##     <dbl>  <dbl> <chr>
##  1 -0.392  0.198 a    
##  2 -0.197 -2.01  a    
##  3  2.04   2.44  a    
##  4 -1.10   0.421 a    
##  5  0.693 -0.882 a    
##  6  1.76  -0.600 b    
##  7 -2.81   0.364 b    
##  8 -0.326 -0.189 b    
##  9 -0.508  0.964 b    
## 10  2.83  -1.07  b

39.8 强大的across函数

我们必须为这个函数点赞。大爱Hadley Wickham !!!

我们经常需要对数据框的多列执行相同的操作。比如

iris <- iris %>% as_tibble()
iris
## # A tibble: 150 × 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
##  8          5           3.4          1.5         0.2 setosa 
##  9          4.4         2.9          1.4         0.2 setosa 
## 10          4.9         3.1          1.5         0.1 setosa 
## # ℹ 140 more rows
iris %>%
  group_by(Species) %>%
  summarise(
    mean_Sepal_Length = mean(Sepal.Length),
    mean_Sepal_Width = mean(Sepal.Width),
    mean_Petal_Length = mean(Petal.Length),
    mean_Petal_Width = mean(Petal.Width)
  )
## # A tibble: 3 × 5
##   Species  mean_Sepal_Length mean_Sepal_Width mean_Petal_Length mean_Petal_Width
##   <fct>                <dbl>            <dbl>             <dbl>            <dbl>
## 1 setosa                5.01             3.43              1.46            0.246
## 2 versico…              5.94             2.77              4.26            1.33 
## 3 virgini…              6.59             2.97              5.55            2.03

dplyr 1.0之后,使用across()函数异常简练

iris %>%
  group_by(Species) %>%
  summarise(
    across(everything(), mean)
  )
## # A tibble: 3 × 5
##   Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
##   <fct>             <dbl>       <dbl>        <dbl>       <dbl>
## 1 setosa             5.01        3.43         1.46       0.246
## 2 versicolor         5.94        2.77         4.26       1.33 
## 3 virginica          6.59        2.97         5.55       2.03

或者更科学的

iris %>%
  group_by(Species) %>%
  summarise(
    across(is.numeric, mean)
  )
## # A tibble: 3 × 5
##   Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
##   <fct>             <dbl>       <dbl>        <dbl>       <dbl>
## 1 setosa             5.01        3.43         1.46       0.246
## 2 versicolor         5.94        2.77         4.26       1.33 
## 3 virginica          6.59        2.97         5.55       2.03

可以看到,以往是一列一列的处理,现在对多列同时操作,这主要得益于across()函数,它有两个主要的参数:

across(.cols = , .fns = )
  • 第一个参数.cols,选取我们要需要的若干列,选取多列的语法与select()的语法一致
  • 第二个参数.fns,我们要执行的函数(或者多个函数),函数的语法有三种形式可选:
    • A function, e.g. mean.
    • A purrr-style lambda, e.g. ~ mean(.x, na.rm = TRUE)
    • A list of functions/lambdas, e.g. list(mean = mean, n_miss = ~ sum(is.na(.x))

再看看这个案例

std <- function(x) {
  (x - mean(x)) / sd(x)
}

iris %>%
  group_by(Species) %>%
  summarise(
    across(starts_with("Sepal"), std)
  )
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
##   always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `summarise()` has grouped output by 'Species'. You can override using the
## `.groups` argument.
## # A tibble: 150 × 3
## # Groups:   Species [3]
##    Species Sepal.Length Sepal.Width
##    <fct>          <dbl>       <dbl>
##  1 setosa        0.267       0.190 
##  2 setosa       -0.301      -1.13  
##  3 setosa       -0.868      -0.601 
##  4 setosa       -1.15       -0.865 
##  5 setosa       -0.0170      0.454 
##  6 setosa        1.12        1.25  
##  7 setosa       -1.15       -0.0739
##  8 setosa       -0.0170     -0.0739
##  9 setosa       -1.72       -1.39  
## 10 setosa       -0.301      -0.865 
## # ℹ 140 more rows
# purrr style
iris %>%
  group_by(Species) %>%
  summarise(
    across(starts_with("Sepal"), ~ (.x - mean(.x)) / sd(.x))
  )
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
##   always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `summarise()` has grouped output by 'Species'. You can override using the
## `.groups` argument.
## # A tibble: 150 × 3
## # Groups:   Species [3]
##    Species Sepal.Length Sepal.Width
##    <fct>          <dbl>       <dbl>
##  1 setosa        0.267       0.190 
##  2 setosa       -0.301      -1.13  
##  3 setosa       -0.868      -0.601 
##  4 setosa       -1.15       -0.865 
##  5 setosa       -0.0170      0.454 
##  6 setosa        1.12        1.25  
##  7 setosa       -1.15       -0.0739
##  8 setosa       -0.0170     -0.0739
##  9 setosa       -1.72       -1.39  
## 10 setosa       -0.301      -0.865 
## # ℹ 140 more rows
iris %>%
  group_by(Species) %>%
  summarise(
    across(starts_with("Petal"), list(min = min, max = max))
    # across(starts_with("Petal"), list(min = min, max = max), .names = "{fn}_{col}")
  )
## # A tibble: 3 × 5
##   Species    Petal.Length_min Petal.Length_max Petal.Width_min Petal.Width_max
##   <fct>                 <dbl>            <dbl>           <dbl>           <dbl>
## 1 setosa                  1                1.9             0.1             0.6
## 2 versicolor              3                5.1             1               1.8
## 3 virginica               4.5              6.9             1.4             2.5
iris %>%
  group_by(Species) %>%
  summarise(
    across(starts_with("Sepal"), mean),
    Area = mean(Petal.Length * Petal.Width),
    across(c(Petal.Width), min),
    n = n()
  )
## # A tibble: 3 × 6
##   Species    Sepal.Length Sepal.Width   Area Petal.Width     n
##   <fct>             <dbl>       <dbl>  <dbl>       <dbl> <int>
## 1 setosa             5.01        3.43  0.366         0.1    50
## 2 versicolor         5.94        2.77  5.72          1      50
## 3 virginica          6.59        2.97 11.3           1.4    50

除了在summarise()里可以使用外,在其它函数也是可以使用的

iris %>% mutate(across(is.numeric, mean))
## # A tibble: 150 × 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1         5.84        3.06         3.76        1.20 setosa 
##  2         5.84        3.06         3.76        1.20 setosa 
##  3         5.84        3.06         3.76        1.20 setosa 
##  4         5.84        3.06         3.76        1.20 setosa 
##  5         5.84        3.06         3.76        1.20 setosa 
##  6         5.84        3.06         3.76        1.20 setosa 
##  7         5.84        3.06         3.76        1.20 setosa 
##  8         5.84        3.06         3.76        1.20 setosa 
##  9         5.84        3.06         3.76        1.20 setosa 
## 10         5.84        3.06         3.76        1.20 setosa 
## # ℹ 140 more rows
iris %>% mutate(across(starts_with("Sepal"), mean))
## # A tibble: 150 × 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1         5.84        3.06          1.4         0.2 setosa 
##  2         5.84        3.06          1.4         0.2 setosa 
##  3         5.84        3.06          1.3         0.2 setosa 
##  4         5.84        3.06          1.5         0.2 setosa 
##  5         5.84        3.06          1.4         0.2 setosa 
##  6         5.84        3.06          1.7         0.4 setosa 
##  7         5.84        3.06          1.4         0.3 setosa 
##  8         5.84        3.06          1.5         0.2 setosa 
##  9         5.84        3.06          1.4         0.2 setosa 
## 10         5.84        3.06          1.5         0.1 setosa 
## # ℹ 140 more rows
iris %>% mutate(across(is.numeric, std)) # std function has defined before
## # A tibble: 150 × 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1       -0.898      1.02          -1.34       -1.31 setosa 
##  2       -1.14      -0.132         -1.34       -1.31 setosa 
##  3       -1.38       0.327         -1.39       -1.31 setosa 
##  4       -1.50       0.0979        -1.28       -1.31 setosa 
##  5       -1.02       1.25          -1.34       -1.31 setosa 
##  6       -0.535      1.93          -1.17       -1.05 setosa 
##  7       -1.50       0.786         -1.34       -1.18 setosa 
##  8       -1.02       0.786         -1.28       -1.31 setosa 
##  9       -1.74      -0.361         -1.34       -1.31 setosa 
## 10       -1.14       0.0979        -1.28       -1.44 setosa 
## # ℹ 140 more rows
iris %>% mutate(
  across(is.numeric, ~ .x / 2),
  across(is.factor, stringr::str_to_upper)
)
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `across(is.factor, stringr::str_to_upper)`.
## Caused by warning:
## ! Use of bare predicate functions was deprecated in tidyselect 1.1.0.
## ℹ Please use wrap predicates in `where()` instead.
##   # Was:
##   data %>% select(is.factor)
## 
##   # Now:
##   data %>% select(where(is.factor))
## # A tibble: 150 × 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <chr>  
##  1         2.55        1.75         0.7         0.1  SETOSA 
##  2         2.45        1.5          0.7         0.1  SETOSA 
##  3         2.35        1.6          0.65        0.1  SETOSA 
##  4         2.3         1.55         0.75        0.1  SETOSA 
##  5         2.5         1.8          0.7         0.1  SETOSA 
##  6         2.7         1.95         0.85        0.2  SETOSA 
##  7         2.3         1.7          0.7         0.15 SETOSA 
##  8         2.5         1.7          0.75        0.1  SETOSA 
##  9         2.2         1.45         0.7         0.1  SETOSA 
## 10         2.45        1.55         0.75        0.05 SETOSA 
## # ℹ 140 more rows

39.9 “current” group or “current” variable

  • n(), 返回当前分组的多少行
  • cur_data(), 返回当前分组的数据内容(不包含分组变量)
  • cur_group(), 返回当前分组的分组变量(一行一列的数据框)
  • across(cur_column()), 返回当前列的列名

这些函数返回当前分组的信息,因此只能在特定函数内部使用,比如summarise() and mutate()

df <- tibble(
  g = sample(rep(letters[1:3], 1:3)),
  x = runif(6),
  y = runif(6)
)
df
## # A tibble: 6 × 3
##   g         x      y
##   <chr> <dbl>  <dbl>
## 1 c     0.314 0.0705
## 2 b     0.521 0.246 
## 3 c     0.171 0.939 
## 4 c     0.593 0.0751
## 5 a     0.746 0.890 
## 6 b     0.995 0.324
df %>%
  group_by(g) %>%
  summarise(
    n = n()
  )
## # A tibble: 3 × 2
##   g         n
##   <chr> <int>
## 1 a         1
## 2 b         2
## 3 c         3
df %>%
  group_by(g) %>%
  summarise(
    data = list(cur_group())
  )
## # A tibble: 3 × 2
##   g     data            
##   <chr> <list>          
## 1 a     <tibble [1 × 1]>
## 2 b     <tibble [1 × 1]>
## 3 c     <tibble [1 × 1]>
df %>%
  group_by(g) %>%
  summarise(
    data = list(cur_data())
  )
## Warning: There was 1 warning in `summarise()`.
## ℹ In argument: `data = list(cur_data())`.
## ℹ In group 1: `g = "a"`.
## Caused by warning:
## ! `cur_data()` was deprecated in dplyr 1.1.0.
## ℹ Please use `pick()` instead.
## # A tibble: 3 × 2
##   g     data            
##   <chr> <list>          
## 1 a     <tibble [1 × 2]>
## 2 b     <tibble [2 × 2]>
## 3 c     <tibble [3 × 2]>
mtcars %>%
  group_by(cyl) %>%
  summarise(
    broom::tidy(lm(mpg ~ wt, data = cur_data()))
  )
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
##   always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `summarise()` has grouped output by 'cyl'. You can override using the `.groups`
## argument.
## # A tibble: 6 × 6
## # Groups:   cyl [3]
##     cyl term        estimate std.error statistic    p.value
##   <dbl> <chr>          <dbl>     <dbl>     <dbl>      <dbl>
## 1     4 (Intercept)    39.6      4.35       9.10 0.00000777
## 2     4 wt             -5.65     1.85      -3.05 0.0137    
## 3     6 (Intercept)    28.4      4.18       6.79 0.00105   
## 4     6 wt             -2.78     1.33      -2.08 0.0918    
## 5     8 (Intercept)    23.9      3.01       7.94 0.00000405
## 6     8 wt             -2.19     0.739     -2.97 0.0118
## # A tibble: 6 × 3
## # Groups:   g [3]
##   g     x      y     
##   <chr> <chr>  <chr> 
## 1 c     x 0.31 y 0.07
## 2 b     x 0.52 y 0.25
## 3 c     x 0.17 y 0.94
## 4 c     x 0.59 y 0.08
## 5 a     x 0.75 y 0.89
## 6 b     x 0.99 y 0.32
wt <- c(x = 0.2, y = 0.8)

df %>%
  mutate(
    across(c(x, y), ~ .x * wt[cur_column()])
  )
## # A tibble: 6 × 3
##   g          x      y
##   <chr>  <dbl>  <dbl>
## 1 c     0.0628 0.0564
## 2 b     0.104  0.197 
## 3 c     0.0342 0.751 
## 4 c     0.119  0.0601
## 5 a     0.149  0.712 
## 6 b     0.199  0.259

39.10 行方向操作

数据框中向量de方向,事实上可以看做有两个方向,横着看是row-vector,竖着看是col-vector。

tidyverse遵循的tidy原则,一列表示一个变量,一行表示一次观察。 这种数据的存储格式,对ggplot2很方便,但对行方向的操作或者运算不友好。比如

39.10.1 行方向上的统计

df <- tibble(id = letters[1:6], w = 10:15, x = 20:25, y = 30:35, z = 40:45)
df
## # A tibble: 6 × 5
##   id        w     x     y     z
##   <chr> <int> <int> <int> <int>
## 1 a        10    20    30    40
## 2 b        11    21    31    41
## 3 c        12    22    32    42
## 4 d        13    23    33    43
## 5 e        14    24    34    44
## 6 f        15    25    35    45

计算每行的均值,

df %>% mutate(avg = mean(c(w, x, y, z)))
## # A tibble: 6 × 6
##   id        w     x     y     z   avg
##   <chr> <int> <int> <int> <int> <dbl>
## 1 a        10    20    30    40  27.5
## 2 b        11    21    31    41  27.5
## 3 c        12    22    32    42  27.5
## 4 d        13    23    33    43  27.5
## 5 e        14    24    34    44  27.5
## 6 f        15    25    35    45  27.5

好像不对?为什么呢?

  • 按照tidy的方法
df %>%
  pivot_longer(
    cols = -id,
    names_to = "variable",
    values_to = "value"
  ) %>%
  group_by(id) %>%
  summarize(
    r_mean = mean(value)
  )
## # A tibble: 6 × 2
##   id    r_mean
##   <chr>  <dbl>
## 1 a         25
## 2 b         26
## 3 c         27
## 4 d         28
## 5 e         29
## 6 f         30

如果保留原始数据,就还需要再left_join()一次,虽然思路清晰,但还是挺周转的。

  • 按照Jenny Bryan的方案,使用purrr宏包的pmap_dbl函数
## Warning: package 'purrr' was built under R version 4.2.2
df %>%
  mutate(r_mean = pmap_dbl(select_if(., is.numeric), lift_vd(mean)))
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `r_mean = pmap_dbl(select_if(., is.numeric), lift_vd(mean))`.
## Caused by warning:
## ! `lift_vd()` was deprecated in purrr 1.0.0.
## # A tibble: 6 × 6
##   id        w     x     y     z r_mean
##   <chr> <int> <int> <int> <int>  <dbl>
## 1 a        10    20    30    40     25
## 2 b        11    21    31    41     26
## 3 c        12    22    32    42     27
## 4 d        13    23    33    43     28
## 5 e        14    24    34    44     29
## 6 f        15    25    35    45     30

但需要学习新的语法,代价也很高。

df %>%
  rowwise() %>%
  mutate(avg = mean(c(w, x, y, z)))
## # A tibble: 6 × 6
## # Rowwise: 
##   id        w     x     y     z   avg
##   <chr> <int> <int> <int> <int> <dbl>
## 1 a        10    20    30    40    25
## 2 b        11    21    31    41    26
## 3 c        12    22    32    42    27
## 4 d        13    23    33    43    28
## 5 e        14    24    34    44    29
## 6 f        15    25    35    45    30

变量名要是很多的话,又变了体力活了,怎么才能变的轻巧一点呢?

  • rowwise() + c_across(),现在dplyr 1.0终于给出了一个很好的解决方案
df %>%
  rowwise() %>%
  mutate(
    avg = mean(c_across(w:z))
  )
## # A tibble: 6 × 6
## # Rowwise: 
##   id        w     x     y     z   avg
##   <chr> <int> <int> <int> <int> <dbl>
## 1 a        10    20    30    40    25
## 2 b        11    21    31    41    26
## 3 c        12    22    32    42    27
## 4 d        13    23    33    43    28
## 5 e        14    24    34    44    29
## 6 f        15    25    35    45    30

这个很好的解决方案中,rowwise()工作原理类似与group_by(),是按每一行进行分组,然后按行(行方向)统计

df %>%
  rowwise(id) %>%
  mutate(total = mean(c_across(w:z)))
## # A tibble: 6 × 6
## # Rowwise:  id
##   id        w     x     y     z total
##   <chr> <int> <int> <int> <int> <dbl>
## 1 a        10    20    30    40    25
## 2 b        11    21    31    41    26
## 3 c        12    22    32    42    27
## 4 d        13    23    33    43    28
## 5 e        14    24    34    44    29
## 6 f        15    25    35    45    30
df %>%
  rowwise(id) %>%
  mutate(mean = mean(c_across(is.numeric)))
## # A tibble: 6 × 6
## # Rowwise:  id
##   id        w     x     y     z  mean
##   <chr> <int> <int> <int> <int> <dbl>
## 1 a        10    20    30    40    25
## 2 b        11    21    31    41    26
## 3 c        12    22    32    42    27
## 4 d        13    23    33    43    28
## 5 e        14    24    34    44    29
## 6 f        15    25    35    45    30
df %>%
  rowwise(id) %>%
  summarise(
    m = mean(c_across(is.numeric))
  )
## `summarise()` has grouped output by 'id'. You can override using the `.groups`
## argument.
## # A tibble: 6 × 2
## # Groups:   id [6]
##   id        m
##   <chr> <dbl>
## 1 a        25
## 2 b        26
## 3 c        27
## 4 d        28
## 5 e        29
## 6 f        30

因此,我们可以总结成下面这张图

39.10.2 行方向处理与列表列是天然一对

rowwise()不仅仅用于计算行方向均值这样的简单统计,而是当处理列表列时,方才显示出rowwise()purrr::map一样的强大。那么,什么是列表列列表列指的是数据框的一列是一个列表, 比如

tb <- tibble(
  x = list(1, 2:3, 4:6)
)

如果想显示列表中每个元素的长度,用purrr包,可以这样写

tb %>% mutate(l = purrr::map_int(x, length))
## # A tibble: 3 × 2
##   x             l
##   <list>    <int>
## 1 <dbl [1]>     1
## 2 <int [2]>     2
## 3 <int [3]>     3

如果从行方向的角度理解,其实很简练

tb %>%
  rowwise() %>%
  mutate(l = length(x))
## # A tibble: 3 × 2
## # Rowwise: 
##   x             l
##   <list>    <int>
## 1 <dbl [1]>     1
## 2 <int [2]>     2
## 3 <int [3]>     3

39.10.3 行方向上的建模

mtcars <- mtcars %>% as_tibble()
mtcars
## # A tibble: 32 × 11
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
##  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
##  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
##  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
##  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
##  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
##  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
##  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
##  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
## 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
## # ℹ 22 more rows

以cyl分组,计算每组中mpg ~ wt的线性模型的系数.

mtcars %>%
  group_by(cyl) %>%
  nest()
## # A tibble: 3 × 2
## # Groups:   cyl [3]
##     cyl data              
##   <dbl> <list>            
## 1     6 <tibble [7 × 10]> 
## 2     4 <tibble [11 × 10]>
## 3     8 <tibble [14 × 10]>

39.10.3.1 列方向的做法

分组建模后,形成列表列,此时列表中的每个元素对应一个模型,我们需要依次提取每次模型的系数,列方向的做法是,借用purrr::map完成列表中每个模型的迭代,

mtcars %>%
  group_by(cyl) %>%
  nest() %>%
  mutate(model = purrr::map(data, ~ lm(mpg ~ wt, data = .))) %>%
  mutate(result = purrr::map(model, ~ broom::tidy(.))) %>%
  unnest(result)
## # A tibble: 6 × 8
## # Groups:   cyl [3]
##     cyl data               model  term      estimate std.error statistic p.value
##   <dbl> <list>             <list> <chr>        <dbl>     <dbl>     <dbl>   <dbl>
## 1     6 <tibble [7 × 10]>  <lm>   (Interce…    28.4      4.18       6.79 1.05e-3
## 2     6 <tibble [7 × 10]>  <lm>   wt           -2.78     1.33      -2.08 9.18e-2
## 3     4 <tibble [11 × 10]> <lm>   (Interce…    39.6      4.35       9.10 7.77e-6
## 4     4 <tibble [11 × 10]> <lm>   wt           -5.65     1.85      -3.05 1.37e-2
## 5     8 <tibble [14 × 10]> <lm>   (Interce…    23.9      3.01       7.94 4.05e-6
## 6     8 <tibble [14 × 10]> <lm>   wt           -2.19     0.739     -2.97 1.18e-2

purrr::map实现列表元素一个一个的依次迭代,从数据框的角度来看(数据框是列表的一种特殊形式),因此实质上就是一行一行的处理。所以,尽管purrr很强大,但需要一定学习成本,从解决问题的路径上也比较周折。

39.10.3.2 行方向的做法

事实上,分组建模后,形成列表列,这种存储格式,天然地符合行处理的范式,因此一开始就使用行方向分组(这里nest_by() 类似于 group_by()

mtcars %>%
  nest_by(cyl) %>%
  mutate(model = list(lm(mpg ~ wt, data = data))) %>%
  summarise(broom::tidy(model))
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
##   always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `summarise()` has grouped output by 'cyl'. You can override using the `.groups`
## argument.
## # A tibble: 6 × 6
## # Groups:   cyl [3]
##     cyl term        estimate std.error statistic    p.value
##   <dbl> <chr>          <dbl>     <dbl>     <dbl>      <dbl>
## 1     4 (Intercept)    39.6      4.35       9.10 0.00000777
## 2     4 wt             -5.65     1.85      -3.05 0.0137    
## 3     6 (Intercept)    28.4      4.18       6.79 0.00105   
## 4     6 wt             -2.78     1.33      -2.08 0.0918    
## 5     8 (Intercept)    23.9      3.01       7.94 0.00000405
## 6     8 wt             -2.19     0.739     -2.97 0.0118
# or
mtcars %>%
  nest_by(cyl) %>%
  summarise(
    broom::tidy(lm(mpg ~ wt, data = data))
  )
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
##   always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `summarise()` has grouped output by 'cyl'. You can override using the `.groups`
## argument.
## # A tibble: 6 × 6
## # Groups:   cyl [3]
##     cyl term        estimate std.error statistic    p.value
##   <dbl> <chr>          <dbl>     <dbl>     <dbl>      <dbl>
## 1     4 (Intercept)    39.6      4.35       9.10 0.00000777
## 2     4 wt             -5.65     1.85      -3.05 0.0137    
## 3     6 (Intercept)    28.4      4.18       6.79 0.00105   
## 4     6 wt             -2.78     1.33      -2.08 0.0918    
## 5     8 (Intercept)    23.9      3.01       7.94 0.00000405
## 6     8 wt             -2.19     0.739     -2.97 0.0118

至此,tidyverse框架下,实现分组统计中的数据框进,数据框输出, 现在有四种方法了

mtcars %>%
  group_nest(cyl) %>%
  mutate(model = purrr::map(data, ~ lm(mpg ~ wt, data = .))) %>%
  mutate(result = purrr::map(model, ~ broom::tidy(.))) %>%
  tidyr::unnest(result)


mtcars %>%
  group_by(cyl) %>%
  group_modify(
    ~ broom::tidy(lm(mpg ~ wt, data = .))
  )


mtcars %>%
  nest_by(cyl) %>%
  summarise(
    broom::tidy(lm(mpg ~ wt, data = data))
  )


mtcars %>%
  group_by(cyl) %>%
  summarise(
    broom::tidy(lm(mpg ~ wt, data = cur_data()))
  )

# or
mtcars %>%
  group_by(cyl) %>%
  summarise(broom::tidy(lm(mpg ~ wt)))