2.8 Programming with dplyr
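This section is heavy on concepts, and they are not easy to grasp. Start by learning how to use the tools and let the concepts sink in over time. The material is complex but practical, especially when you need to write general-purpose helper functions. The following is quoted from the original text.
There are two cases: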
- When you have the data-variable in a function argument (i.e. an env-variable that holds a promise), you need to embrace the argument by surrounding it in doubled braces, like filter(df, {{ var }}).
The following function uses embracing to create a wrapper around summarise() that computes the minimum and maximum values of a variable, as well as the number of observations that were summarised:
var_summary <- function(data, var) {
  data %>%
    summarise(n = n(), min = min({{ var }}), max = max({{ var }}))
}
mtcars %>%
  group_by(cyl) %>%
  var_summary(mpg)
- When you have an env-variable that is a character vector, you need to index into the .data pronoun with [[, like summarise(df, mean = mean(.data[[var]])).
The following example uses .data to count the number of unique values in each variable of mtcars:
for (var in names(mtcars)) {
  mtcars %>% count(.data[[var]]) %>% print()
}
Note that .data is not a data frame; it’s a special construct, a pronoun, that allows you to access the current variables either directly, with .data$x or indirectly with .data[[var]]. Don’t expect other functions to work with it.
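To make the two cases easier to compare, here is a minimal sketch that computes the same summary both ways. The helper names mean_by_name and mean_by_col are hypothetical and are not part of dplyr; the sketch assumes dplyr 1.0 or later and uses the built-in mtcars data.

```r
library(dplyr)

# Hypothetical helper: `col_name` is a string naming a column of `data`,
# so it is looked up through the .data pronoun.
mean_by_name <- function(data, col_name) {
  data %>% summarise(mean = mean(.data[[col_name]]))
}

# Hypothetical helper: `col` is passed unquoted, so it is embraced.
mean_by_col <- function(data, col) {
  data %>% summarise(mean = mean({{ col }}))
}

by_name <- mean_by_name(mtcars, "mpg")  # column given as a string
by_col <- mean_by_col(mtcars, mpg)      # column given as a bare name

print(by_name)
print(by_col)
```

Both calls summarise the same column; the only difference is whether the caller supplies the column as a string or as a bare name.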
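2.8.1 Examples
When we do not know in advance which variable the data will be summarised by:
my_summarise <- function(data, group_var) {
  # assumes `data` has a mass column, e.g. starwars
  data %>%
    group_by({{ group_var }}) %>%
    summarise(mean = mean(mass))
}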
If the argument is used in several places:
my_summarise2 <- function(data, expr) {
  data %>% summarise(
    mean = mean({{ expr }}),
    sum = sum({{ expr }}),
    n = n()
  )
}
With multiple expressions:
my_summarise3 <- function(data, mean_var, sd_var) {
  data %>%
    summarise(mean = mean({{ mean_var }}), sd = sd({{ sd_var }}))
}
To include the variable name in the output column names:
my_summarise4 <- function(data, expr) {
  data %>% summarise(
    "mean_{{expr}}" := mean({{ expr }}),
    "sum_{{expr}}" := sum({{ expr }}),
    "n_{{expr}}" := n()
  )
}
my_summarise5 <- function(data, mean_var, sd_var) {
  data %>%
    summarise(
      "mean_{{mean_var}}" := mean({{ mean_var }}),
      "sd_{{sd_var}}" := sd({{ sd_var }})
    )
}
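A small sketch of what the name injection produces may help here. The helper mean_named is hypothetical and only illustrates the pattern; it assumes dplyr 1.0 or later (with an rlang version that supports "{{ }}" inside names on the left of :=) and uses the built-in mtcars data.

```r
library(dplyr)

# Hypothetical helper: the embraced argument is used both to compute
# the value and, through the glue-style string, to build the column name.
mean_named <- function(data, expr) {
  data %>% summarise("mean_{{expr}}" := mean({{ expr }}))
}

res <- mean_named(mtcars, mpg)
names(res)
#> [1] "mean_mpg"
```

The := operator is needed because the name on the left-hand side is a string to be interpolated rather than a fixed argument name.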
To accept any number of expressions, which is the more common use case:
my_summarise <- function(.data, ...) {
  .data %>%
    group_by(...) %>%
    summarise(mass = mean(mass, na.rm = TRUE), height = mean(height, na.rm = TRUE))
}
starwars %>% my_summarise(homeworld)
#> # A tibble: 49 x 3
#> homeworld mass height
#> <chr> <dbl> <dbl>
#> 1 Alderaan 64 176.
#> 2 Aleen Minor 15 79
#> 3 Bespin 79 175
#> 4 Bestine IV 110 180
#> 5 Cato Neimoidia 90 191
#> 6 Cerea 82 198
#> # ... with 43 more rows
starwars %>% my_summarise(sex, gender)
#> `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
#> # A tibble: 6 x 4
#> # Groups: sex [5]
#> sex gender mass height
#> <chr> <chr> <dbl> <dbl>
#> 1 female feminine 54.7 169.
#> 2 hermaphroditic masculine 1358 175
#> 3 male masculine 81.0 179.
#> 4 none feminine NaN 96
#> 5 none masculine 69.8 140
#> 6 <NA> <NA> 48 181.
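The message in the output above points to the .groups argument of summarise(). As a minimal sketch of one way to use it, the hypothetical helper mean_mpg_by below follows the same ... pattern but drops the grouping from the result; it is an illustration on the built-in mtcars data, not part of the original example.

```r
library(dplyr)

# Hypothetical helper: forwards any number of grouping variables via `...`
# and returns an ungrouped result by setting .groups = "drop".
mean_mpg_by <- function(.data, ...) {
  .data %>%
    group_by(...) %>%
    summarise(mpg = mean(mpg), .groups = "drop")
}

res <- mean_mpg_by(mtcars, cyl, gear)
is_grouped_df(res)
#> [1] FALSE
```

Setting .groups explicitly also silences the informational message, which can be convenient inside reusable functions.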
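This section is extension material. It is best studied after you have fully mastered the basic verbs, and ideally at a point when you have a concrete need for it. Most business data analysts do not need "programming with dplyr" at all and can skip this section.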