第 47 章 非标准性评估
Tidy Evaluation (Tidy Eval),不是一个宏包,而是一个非标准评估的框架,也叫延迟评估。主要目的是更方便地与tidyverse里的函数配合使用,事实上,很多时候我们不一定需要用到它。我这里尽可能规避较专业的词汇,用通俗的语言介绍一些简单用法,表述可能不准确。如果想了解背后复杂的机制请阅读advance R。
47.1 编写函数
写代码的过程中,我们会遇到对不同的数据框,执行相同的操作。比如
df1 %>% group_by(x1) %>% summarise(mean = mean(y1))
df2 %>% group_by(x2) %>% summarise(mean = mean(y2))
df3 %>% group_by(x3) %>% summarise(mean = mean(y3))
df4 %>% group_by(x4) %>% summarise(mean = mean(y4))
为了减少代码的重复,我们考虑将共同的部分保留,变化的部分用参数名提取出来
很自然地,我们想到写一个子函数的形式,比如
grouped_mean <- function(data, group_var, summary_var) {
data %>%
group_by(group_var) %>%
summarise(mean = mean(summary_var))
}
当我们试图运行这段代码的时候,却发现报错了
grouped_mean(mtcars, cyl, mpg)
## Error in `group_by()`:
## ! Must group by variables found in `.data`.
## ✖ Column `group_var` is not found.
Hadley Wickham告诉我们,正确的写法应该是,
grouped_mean <- function(data, group_var, summary_var) {
group_var <- enquo(group_var)
summary_var <- enquo(summary_var)
data %>%
group_by(!!group_var) %>%
summarise(mean = mean(!!summary_var))
}
然后再运行
grouped_mean(mtcars, cyl, mpg)
## # A tibble: 3 × 2
## cyl mean
## <dbl> <dbl>
## 1 4 26.7
## 2 6 19.7
## 3 8 15.1
或者更简便的
grouped_mean <- function(data, group_var, summary_var) {
data %>%
group_by({{group_var}}) %>%
summarise(mean = mean({{summary_var}}))
}
grouped_mean(mtcars, cyl, mpg)
## # A tibble: 3 × 2
## cyl mean
## <dbl> <dbl>
## 1 4 26.7
## 2 6 19.7
## 3 8 15.1
dplyr1.0之后,可以这样写
sum_group_vars <- function(df,
group_vars,
sum_vars){
df %>%
group_by(across({{ group_vars }})) %>%
summarise(n = n(),
across({{ sum_vars }},
list(mean = mean, sd = sd))
)
}
sum_group_vars(mpg, c(model, year), c(hwy, cty))
下面我们讲讲为什么要这样写。
47.2 看看发生了什么
弄清楚之前,这里需要明白两个概念:
环境变量(
env-variables
) ,一般你在Rstuido右上角的Environment
中发现它。比如n <- 10
这里的n
数据变量(
data-variables
),一般指数据框的某个变量。比如data <- data.frame(x = 1, n = 2)
中的data$n
那么,对于我们这里编写的函数中
grouped_mean(mtcars, cyl, mpg)
cyl
和mpg
是打算传递的参数,是环境变量,但我们期望他们在函数中当作mtcars中的数据变量,即当做mtcars的一个列的名字来使用, 那么要完成这个角色转换,就需要引用(quote)和解引用(unquote)两个工序:
第一步,用
enquo()
把用户传递过来的参数引用起来(引用可以理解为冷冻起来)第二步,用
!!
解开这个引用(解引用可以理解为解冷),然后使用参数的内容
这个quote-unquote
的过程让环境变量名变成了数据变量,也可以理解为在函数评估过程中,数据变量(data-variable)遮盖了环境变量(env-variable),即数据遮盖(data masking),看到cyl,正常情况下,本来应该是到环境变量里去找这个cyl对应的值,然而,数据遮盖机制,插队了,让代码去数据变量中去找cyl以及对应的值。
我们通过rlang::qq_show()
看看这个quote-unquote
机制是怎么工作的
先看看qq_show()
## height
再看看grouped_mean()
的代码
group_var <- quote(cyl)
summary_var <- quote(mpg)
rlang::qq_show(
data %>%
group_by(!!group_var) %>%
summarise(mean = mean(!!summary_var))
)
## data %>% group_by(cyl) %>% summarise(mean = mean(mpg))
关于数据遮盖更多细节请看Quote and unquote。
47.3 处理多个参数
前面讲了如何传递分组参数和统计参数到子函数。如果传递更多的参数,可以用...
代替group_var
,然后传递到group_by()
,比如
grouped_mean <- function(data, summary_var, ...) {
summary_var <- enquo(summary_var)
group_var <- enquos(...)
data %>%
group_by(!!!group_var) %>%
summarise(mean = mean(!!summary_var))
}
指定统计参数disp
,分组参数(cyl am)
,然后运行代码,
grouped_mean(mtcars, disp, cyl, am)
## # A tibble: 6 × 3
## # Groups: cyl [3]
## cyl am mean
## <dbl> <dbl> <dbl>
## 1 4 0 136.
## 2 4 1 93.6
## 3 6 0 205.
## 4 6 1 155
## 5 8 0 358.
## 6 8 1 326
或者指定统计参数disp
,更多的分组参数(cyl, am, vs)
grouped_mean(mtcars, disp, cyl, am, vs)
## # A tibble: 7 × 4
## # Groups: cyl, am [6]
## cyl am vs mean
## <dbl> <dbl> <dbl> <dbl>
## 1 4 0 1 136.
## 2 4 1 0 120.
## 3 4 1 1 89.8
## 4 6 0 1 205.
## 5 6 1 0 155
## 6 8 0 0 358.
## 7 8 1 0 326
注意到...
代表的是多个参数,因此在引用的时候用的是enquos()
,在解引用的时候
用的是group_by(!!!group_var)
. 事实上, ...
是一个特殊的符号,我们可以省略引用后再解引用的过程,直接传给给group_by()
, 比如
grouped_mean <- function(data, summary_var, ...) {
summary_var <- enquo(summary_var)
data %>%
group_by(...) %>%
summarise(mean = mean(!!summary_var))
}
grouped_mean(mtcars, disp, cyl, am, vs)
## # A tibble: 7 × 4
## # Groups: cyl, am [6]
## cyl am vs mean
## <dbl> <dbl> <dbl> <dbl>
## 1 4 0 1 136.
## 2 4 1 0 120.
## 3 4 1 1 89.8
## 4 6 0 1 205.
## 5 6 1 0 155
## 6 8 0 0 358.
## 7 8 1 0 326
47.4 调整输入的表达式
47.4.1 修改引用参数的默认名
我们希望输出的统计结果中,统计参数名加一个前缀 “avg_”, 可以分三步完成
- 获取引用参数的默认名
- 修改参数的默认名,比如加前缀或者后缀
-
!!
解引用并放在:=
左边
grouped_mean2 <- function(.data, .summary_var, ...) {
summary_var <- enquo(.summary_var)
group_vars <- enquos(...)
# Get and modify the default name
summary_nm <- as_label(summary_var)
summary_nm <- paste0("avg_", summary_nm)
.data %>%
group_by(!!!group_vars) %>%
summarise(!!summary_nm := mean(!!summary_var)) # Unquote the name
}
grouped_mean2(mtcars, disp, cyl, am)
## # A tibble: 6 × 3
## # Groups: cyl [3]
## cyl am avg_disp
## <dbl> <dbl> <dbl>
## 1 4 0 136.
## 2 4 1 93.6
## 3 6 0 205.
## 4 6 1 155
## 5 8 0 358.
## 6 8 1 326
或者更简洁的办法
my_summarise <- function(data, group_var, summarise_var) {
data %>%
group_by(across({{ group_var }})) %>%
summarise(across({{ summarise_var }}, mean, .names = "mean_{col}"))
}
my_summarise(starwars, species, height)
如果想调整多个分组变量的默认名,比如加个前缀”groups_“,方法和上面的步骤类似
- 引用传递过来的参数名,
.enquos(..., .named = TRUE)
, 增加了控制语句.named = TRUE
- 修改在每个参数的默认名,比如加前缀或者后缀
-
!!
解引用并放在:=
左边
grouped_mean3 <- function(.data, .summary_var, ...) {
summary_var <- enquo(.summary_var)
# Quote the dots with default names
group_vars <- enquos(..., .named = TRUE)
summary_nm <- as_label(summary_var)
summary_nm <- paste0("avg_", summary_nm)
# Modify the names of the list of quoted dots
names(group_vars) <- paste0("groups_", names(group_vars))
.data %>%
group_by(!!!group_vars) %>% # Unquote-splice as usual
summarise(!!summary_nm := mean(!!summary_var))
}
grouped_mean3(mtcars, disp, cyl, am)
## # A tibble: 6 × 3
## # Groups: groups_cyl [3]
## groups_cyl groups_am avg_disp
## <dbl> <dbl> <dbl>
## 1 4 0 136.
## 2 4 1 93.6
## 3 6 0 205.
## 4 6 1 155
## 5 8 0 358.
## 6 8 1 326
47.4.2 修改引用的表达式
有时候,我们不想“按多个变量分组,对一个变量统计”。而是“按一个变量分组,对多个变量统计”。这种情况,我们就需要调整引用的表达式
-
.group_var
放分组的变量species
-
...
放需要统计的多个变量height, mass
,期望完成mean(height)
,mean(mass)
- 需要用
purrr:map()
配合调整表达式, 如
vars <- list(quote(mass), quote(height))
purrr::map(vars, function(var) expr(mean(!!var, na.rm = TRUE)))
## [[1]]
## mean(mass, na.rm = TRUE)
##
## [[2]]
## mean(height, na.rm = TRUE)
完整代码可以这样写
grouped_mean4 <- function(.data, .group_var, ...) {
group_var <- enquo(.group_var)
summary_vars <- enquos(..., .named = TRUE)
# Wrap the summary variables with mean()
summary_vars <- purrr::map(summary_vars, function(var) {
expr(mean(!!var, na.rm = TRUE))
})
# Prefix the names with `avg_`
names(summary_vars) <- paste0("avg_", names(summary_vars))
.data %>%
group_by(!!group_var) %>%
summarise(!!!summary_vars)
}
grouped_mean4(starwars, species, height, mass)
## # A tibble: 38 × 3
## species avg_height avg_mass
## <chr> <dbl> <dbl>
## 1 Aleena 79 15
## 2 Besalisk 198 102
## 3 Cerean 198 82
## 4 Chagrian 196 NaN
## 5 Clawdite 168 55
## 6 Droid 131. 69.8
## 7 Dug 112 40
## 8 Ewok 88 20
## 9 Geonosian 183 80
## 10 Gungan 209. 74
## # ℹ 28 more rows
47.5 案例
47.5.1 统计并过滤
## # A tibble: 100 × 1
## index
## <chr>
## 1 c
## 2 a
## 3 c
## 4 d
## 5 c
## 6 a
## 7 a
## 8 c
## 9 d
## 10 b
## # ℹ 90 more rows
filter_which <- function(df, var, val) {
which_var <- enquo(var)
which_val <- as_name(enquo(val))
df %>%
count(!!which_var) %>%
filter(!!which_var == which_val)
}
df %>%
filter_which(index, a)
## # A tibble: 1 × 2
## index n
## <chr> <int>
## 1 a 25
47.5.2 自定义统计输出
my_summarise <- function(data, expr) {
data %>% summarise(
"mean_{{expr}}" := mean({{ expr }}),
"sum_{{expr}}" := sum({{ expr }}),
"n_{{expr}}" := n()
)
}
mtcars %>% my_summarise(mpg)
## mean_mpg sum_mpg n_mpg
## 1 20.09062 642.9 32
47.5.3 形成依次下滑的列
d <- tibble(x = seq_len(10))
jetlag <- function(data, variable, n = 10){
variable <- enquo(variable)
indices <- seq_len(n)
quosures <- purrr::map( indices, ~quo(lag(!!variable, !!.x)) ) %>%
purrr::set_names(nm = purrr::map_chr(indices, ~paste0("lag_", .x)))
dplyr::mutate(data, !!!quosures)
}
d %>% jetlag(x, 3)
## # A tibble: 10 × 4
## x lag_1 lag_2 lag_3
## <int> <int> <int> <int>
## 1 1 NA NA NA
## 2 2 1 NA NA
## 3 3 2 1 NA
## 4 4 3 2 1
## 5 5 4 3 2
## 6 6 5 4 3
## 7 7 6 5 4
## 8 8 7 6 5
## 9 9 8 7 6
## 10 10 9 8 7
47.6 可能会用到的函数
enquo()
vs quo()
vs expr()
vs as_name()
vs as_label()
vs sym()
a <- 1
b <- 1
var <- quote(a + b)
# returns a single quoted expression for the delayed computation
var
## a + b
qq_show(!!var)
## a + b
## mean(a + b, na.rm = TRUE)
## [1] "height"
# also returns a single string but supports any kind of R object as input, including quoted function calls and vectors. Its purpose is to summarise that object into a single label. That label is often suitable as a default name.
as_label(var)
## [1] "height"
# creates a symbol from a string
sym("height")
## height
47.7 Resources
-
tidyeval
book - https://tidyeval.tidyverse.org/ ortidyeval
post - https://rpubs.com/lionel-/tidyeval-introduction -
tidyeval
webinar - https://www.rstudio.com/resources/webinars/tidy-eval/ - “Tidy evaluation in 5 minutes” by Hadley Wickham - https://www.youtube.com/watch?v=nERXS3ssntw
- Metaprogramming chapters in “Advanced R” - https://adv-r.hadley.nz/meta.html
-
tidyeval
cheatsheet - https://www.rstudio.com/resources/cheatsheets/ - https://github.com/tidyverse/dplyr/blob/master/vignettes/programming.Rmd
- https://github.com/romatik/touring_the_tidyverse
- https://tidyeval.tidyverse.org/dplyr.html