7.6 运用

7.6.1 自定义函数计算

1.自定义函数处理列

按照自定义函数计算修改单列或多列

# 测试函数

fun <- function(x){
  x <- x^2+1
}

DT <-  data.table(x=rep(c("b","a","c"),each=3), v=c(1,1,1,2,2,1,1,2,2), y=c(1,3,6), a=1:9, b=9:1)

DT[,.(newcol=fun(y)),by=.(x)]
#>    x newcol
#> 1: b      2
#> 2: b     10
#> 3: b     37
#> 4: a      2
#> 5: a     10
#> 6: a     37
#> 7: c      2
#> 8: c     10
#> 9: c     37

#Not run
#DT[,lapply(.SD,fun),.SDcols=c('y','a'),by=.(x)] #多列参与计算


# 批量修改列
#Not run

# myfun <- function(x){
#   return(x)
# }
# 
# dt <- dt[,colnames(dt):=lapply(.SD[,1:ncol(dt)],myfun)] #很重要的用法

7.6.2 带汇总的聚合运算

按照by的字段级别汇总.

  1. rollup

分组聚合后设置id=TRUE将各个级别的汇总显示清晰,当by字段只有一个是和正常聚合计算没有区别.以下是官方案例.

#Usage
#rollup(x, j, by, .SDcols, id = FALSE, ...)
n = 24L
set.seed(25)
DT <- data.table(
    color = sample(c("green","yellow","red"), n, TRUE),
    year = as.Date(sample(paste0(2011:2015,"-01-01"), n, TRUE)),
    status = as.factor(sample(c("removed","active","inactive","archived"), n, TRUE)),
    amount = sample(1:5, n, TRUE),
    value = sample(c(3, 3.5, 2.5, 2), n, TRUE)
)
rollup(DT, j = sum(value), by = c("color","year","status")) # default id=FALSE
#>      color       year   status   V1
#>  1:    red 2015-01-01   active  3.5
#>  2:  green 2015-01-01 inactive  5.5
#>  3:  green 2014-01-01 archived  3.5
#>  4:  green 2015-01-01 archived  2.0
#>  5: yellow 2014-01-01   active  4.5
#>  6:    red 2013-01-01 inactive  2.0
#>  7:  green 2011-01-01   active  6.0
#>  8:    red 2014-01-01 inactive  2.5
#>  9:  green 2011-01-01 archived  2.5
#> 10: yellow 2015-01-01   active  2.0
#> 11:    red 2012-01-01 archived  2.0
#> 12:    red 2011-01-01  removed  3.5
#> 13:  green 2014-01-01 inactive  8.0
#> 14:  green 2011-01-01  removed  2.0
#> 15: yellow 2012-01-01 archived  2.5
#> 16:    red 2013-01-01  removed  3.5
#> 17:  green 2013-01-01   active  3.0
#> 18:  green 2014-01-01  removed  2.5
#> 19:    red 2011-01-01 archived  3.0
#> 20:    red 2015-01-01     <NA>  3.5
#> 21:  green 2015-01-01     <NA>  7.5
#> 22:  green 2014-01-01     <NA> 14.0
#> 23: yellow 2014-01-01     <NA>  4.5
#> 24:    red 2013-01-01     <NA>  5.5
#> 25:  green 2011-01-01     <NA> 10.5
#> 26:    red 2014-01-01     <NA>  2.5
#> 27: yellow 2015-01-01     <NA>  2.0
#> 28:    red 2012-01-01     <NA>  2.0
#> 29:    red 2011-01-01     <NA>  6.5
#> 30: yellow 2012-01-01     <NA>  2.5
#> 31:  green 2013-01-01     <NA>  3.0
#> 32:    red       <NA>     <NA> 20.0
#> 33:  green       <NA>     <NA> 35.0
#> 34: yellow       <NA>     <NA>  9.0
#> 35:   <NA>       <NA>     <NA> 64.0
#>      color       year   status   V1
#rollup(DT, j = sum(value), by = c("color","year","status"), id=TRUE)

个人运用,实际工作中常常需要汇总项,汇总项在Excel透视表中很简单,在R中我之前是构造重复的数据源聚合汇总出现汇总项,极大浪费内存,运算速度减慢.

  • 新方法 rollup
set.seed(25)
N <- 1000
dt <- data.table(col1=sample(LETTERS[1:5],N,replace = T),col2=sample(letters[1:5],N,replace = T),num=1:N)

rollup(dt,j=c(list(sum(num))),by=c('col1','col2'))
#>     col1 col2     V1
#>  1:    E    a  19926
#>  2:    D    a  20966
#>  3:    A    d  12927
#>  4:    A    b  20862
#>  5:    A    c  15331
#>  6:    B    d  15414
#>  7:    C    e  20794
#>  8:    D    e  16110
#>  9:    C    d  22152
#> 10:    A    a  18378
#> 11:    C    c  19474
#> 12:    E    d  18831
#> 13:    B    b  19941
#> 14:    C    a  19652
#> 15:    E    c  16734
#> 16:    E    e  24137
#> 17:    E    b  21988
#> 18:    D    b  16607
#> 19:    B    c  25720
#> 20:    B    a  22109
#> 21:    A    e  18724
#> 22:    C    b  24323
#> 23:    D    d  20508
#> 24:    D    c  19668
#> 25:    B    e  29224
#> 26:    E <NA> 101616
#> 27:    D <NA>  93859
#> 28:    A <NA>  86222
#> 29:    B <NA> 112408
#> 30:    C <NA> 106395
#> 31: <NA> <NA> 500500
#>     col1 col2     V1
#同上 添加汇总项名称 total
#rollup(dt,j=c(list(total=sum(num))),by=c('col1','col2'))
#添加id=TRUE参数,多出的grouping 列显示聚合级别
#rollup(dt,j=c(list(total=sum(num))),by=c('col1','col2'),id=TRUE)

2.groupingsets

按照指定字段聚合.包作者说相同与SQL中的 GROUPING SETS 操作.详情参照postgresql

res <- groupingsets(DT, j = c(list(count=.N), lapply(.SD, sum)), by = c("color","year","status"),
             sets = list("color", c("year","status"), character()), id=TRUE)
head(res)
#>    grouping  color       year   status count amount value
#> 1:        3    red       <NA>     <NA>     7     19  20.0
#> 2:        3  green       <NA>     <NA>    13     43  35.0
#> 3:        3 yellow       <NA>     <NA>     4     10   9.0
#> 4:        4   <NA> 2015-01-01   active     2      8   5.5
#> 5:        4   <NA> 2015-01-01 inactive     2      5   5.5
#> 6:        4   <NA> 2014-01-01 archived     1      3   3.5

注意groupingsets函数中sets参数,用list()包裹想要聚合的字段组合,最后还有一个character(),加上该部分相当于全部聚合.当by只有一个字段时,相当于汇总.用法类似sql中“().”

上述语句结果等同于下面sql.

select color ,year, status,count(*) count,sum(amount) amount,sum(value) value 
FROM dbo.DT
GROUP BY
GROUPING SETS(
(color),
(year,status),
() ---- 类似 character()
)

最后还有cube()函数,可?cube查看用法

7.6.3 行列转变

  • 一列变多行

用tstrsplit()函数实现

n <- 10
dt <- data.table(name=LETTERS[1:n],char=rep('我-爱-R-语-言'),n)
res <- dt[,.(newcol=tstrsplit(char,'-')),by=.(name)]
head(res)
#>    name newcol
#> 1:    A     我
#> 2:    A     爱
#> 3:    A      R
#> 4:    A     语
#> 5:    A     言
#> 6:    B     我
  • 多行变一列
res[,.(char=paste0(newcol,collapse = '-')),by=.(name)]
#>     name          char
#>  1:    A 我-爱-R-语-言
#>  2:    B 我-爱-R-语-言
#>  3:    C 我-爱-R-语-言
#>  4:    D 我-爱-R-语-言
#>  5:    E 我-爱-R-语-言
#>  6:    F 我-爱-R-语-言
#>  7:    G 我-爱-R-语-言
#>  8:    H 我-爱-R-语-言
#>  9:    I 我-爱-R-语-言
#> 10:    J 我-爱-R-语-言
# 同上
# res[,.(char=stringr::str_c(newcol,collapse = '-')),by=.(name)]