2.8 Programming with dplyr

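This section is concept-heavy and not easy to follow at first. Start by learning to use the techniques and let the concepts sink in over time. They are complex but practical, especially when you need to write general-purpose functions. The following is quoted from the original documentation.

There are two cases: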

  • When you have the data-variable in a function argument (i.e. an env-variable that holds a promise), you need to embrace the argument by surrounding it in doubled braces, like filter(df, {{ var }}).

The following function uses embracing to create a wrapper around summarise() that computes the minimum and maximum values of a variable, as well as the number of observations that were summarised:

var_summary <- function(data, var) {
  data %>%
    summarise(n = n(), min = min({{ var }}), max = max({{ var }}))
}
mtcars %>% 
  group_by(cyl) %>% 
  var_summary(mpg)

  • When you have an env-variable that is a character vector, you need to index into the .data pronoun with [[, like summarise(df, mean = mean(.data[[var]])).

The following example uses .data to count the number of unique values in each variable of mtcars:

for (var in names(mtcars)) {
  mtcars %>% count(.data[[var]]) %>% print()
}

Note that .data is not a data frame; it’s a special construct, a pronoun, that allows you to access the current variables either directly, with .data$x or indirectly with .data[[var]]. Don’t expect other functions to work with it.
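The same pronoun is useful inside a function whose argument is a column name supplied as a string. The following is a minimal sketch, not taken from the original text: the helper name mean_by_name is hypothetical, and mtcars is used only as a convenient built-in dataset.

```r
library(dplyr)

# Hypothetical helper: `var` is a character string naming a column,
# so it is looked up through the .data pronoun rather than embraced.
mean_by_name <- function(data, var) {
  data %>%
    summarise(mean = mean(.data[[var]], na.rm = TRUE))
}

res_by_name <- mean_by_name(mtcars, "mpg")
print(res_by_name)
```

Because var is an ordinary string here, the column can come from configuration, a loop, or user input without any quoting tricks.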

2.8.1 Examples

When the grouping variable is not known in advance:

my_summarise <- function(data, group_var) {
  data %>%
    group_by({{ group_var }}) %>%
    summarise(mean = mean(mass))
}
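The function above hard-codes the summarised column (mass), so it only works on data that has such a column. As a sketch of a more general variant, assuming both the grouping variable and the summarised variable should be supplied by the caller, both arguments can be embraced. The name grouped_mean is hypothetical and mtcars is used only for illustration.

```r
library(dplyr)

# Hypothetical variant: both the grouping column and the summarised
# column are passed as bare names and embraced with {{ }}.
grouped_mean <- function(data, group_var, summary_var) {
  data %>%
    group_by({{ group_var }}) %>%
    summarise(mean = mean({{ summary_var }}, na.rm = TRUE))
}

res_grouped <- grouped_mean(mtcars, cyl, mpg)
print(res_grouped)
```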

If the same expression is used in several places:

my_summarise2 <- function(data, expr) {
  data %>% summarise(
    mean = mean({{ expr }}),
    sum = sum({{ expr }}),
    n = n()
  )
}

With multiple expressions:

my_summarise3 <- function(data, mean_var, sd_var) {
  data %>% 
    summarise(mean = mean({{ mean_var }}), sd = sd({{ sd_var }}))
}

To include the variable name in the output column names:

my_summarise4 <- function(data, expr) {
  data %>% summarise(
    "mean_{{expr}}" := mean({{ expr }}),
    "sum_{{expr}}" := sum({{ expr }}),
    "n_{{expr}}" := n()
  )
}
my_summarise5 <- function(data, mean_var, sd_var) {
  data %>% 
    summarise(
      "mean_{{mean_var}}" := mean({{ mean_var }}), 
      "sd_{{sd_var}}" := sd({{ sd_var }})
    )
}
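To see what the name-interpolation syntax produces, it helps to inspect the column names of the result. The sketch below uses a hypothetical one-column helper, named_mean, and the built-in mtcars data; it is an illustration of the same "{{ }}" plus := pattern, not part of the original text.

```r
library(dplyr)

# Hypothetical helper: the embraced argument is interpolated into the
# output column name on the left of := and evaluated on the right.
named_mean <- function(data, expr) {
  data %>%
    summarise("mean_{{expr}}" := mean({{ expr }}, na.rm = TRUE))
}

res_named <- named_mean(mtcars, mpg)
print(names(res_named))
```

Note that := replaces = here, since a string with interpolation cannot appear on the left of an ordinary = in a function call.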

An arbitrary number of expressions, which is the more common use case:

my_summarise <- function(.data, ...) {
  .data %>%
    group_by(...) %>%
    summarise(mass = mean(mass, na.rm = TRUE), height = mean(height, na.rm = TRUE))
}
starwars %>% my_summarise(homeworld)
#> # A tibble: 49 x 3
#>   homeworld       mass height
#>   <chr>          <dbl>  <dbl>
#> 1 Alderaan          64   176.
#> 2 Aleen Minor       15    79 
#> 3 Bespin            79   175 
#> 4 Bestine IV       110   180 
#> 5 Cato Neimoidia    90   191 
#> 6 Cerea             82   198 
#> # ... with 43 more rows
starwars %>% my_summarise(sex, gender)
#> `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
#> # A tibble: 6 x 4
#> # Groups:   sex [5]
#>   sex            gender      mass height
#>   <chr>          <chr>      <dbl>  <dbl>
#> 1 female         feminine    54.7   169.
#> 2 hermaphroditic masculine 1358     175 
#> 3 male           masculine   81.0   179.
#> 4 none           feminine   NaN      96 
#> 5 none           masculine   69.8   140 
#> 6 <NA>           <NA>        48     181.
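The message in the second output appears because the result is still grouped by the first variable. As a minimal sketch, assuming an ungrouped result is wanted, the .groups argument of summarise() can be set to "drop" inside the wrapper. The name summarise_by is hypothetical, and mtcars with its mpg and hp columns stands in for the starwars data.

```r
library(dplyr)

# Hypothetical wrapper: forwards any number of grouping variables via ...
# and drops all grouping from the result, which also silences the message.
summarise_by <- function(.data, ...) {
  .data %>%
    group_by(...) %>%
    summarise(mpg = mean(mpg), hp = mean(hp), .groups = "drop")
}

res_dots <- summarise_by(mtcars, cyl, am)
print(res_dots)
```

The dots need no embracing: they are passed straight through to group_by(), which is why this form scales to any number of grouping variables.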

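This section is an extension topic. It is best studied after you have fully mastered the basic verbs, and it is most effective to dig into it when a concrete need arises. Most business data analysts do not need "programming with dplyr" at all and can skip this section.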