7.6 运用
7.6.1 自定义函数计算
1.自定义函数处理列
按照自定义函数计算修改单列或多列
# 测试函数
<- function(x){
fun <- x^2+1
x
}
<- data.table(x=rep(c("b","a","c"),each=3), v=c(1,1,1,2,2,1,1,2,2), y=c(1,3,6), a=1:9, b=9:1)
DT
newcol=fun(y)),by=.(x)]
DT[,.(#> x newcol
#> 1: b 2
#> 2: b 10
#> 3: b 37
#> 4: a 2
#> 5: a 10
#> 6: a 37
#> 7: c 2
#> 8: c 10
#> 9: c 37
#Not run
#DT[,lapply(.SD,fun),.SDcols=c('y','a'),by=.(x)] #多列参与计算
# 批量修改列
#Not run
# myfun <- function(x){
# return(x)
# }
#
# dt <- dt[,colnames(dt):=lapply(.SD[,1:ncol(dt)],myfun)] #很重要的用法
7.6.2 带汇总的聚合运算
按照by的字段级别汇总.
- rollup
分组聚合后设置id=TRUE将各个级别的汇总显示清晰,当by字段只有一个是和正常聚合计算没有区别.以下是官方案例.
#Usage
#rollup(x, j, by, .SDcols, id = FALSE, ...)
= 24L
n set.seed(25)
<- data.table(
DT color = sample(c("green","yellow","red"), n, TRUE),
year = as.Date(sample(paste0(2011:2015,"-01-01"), n, TRUE)),
status = as.factor(sample(c("removed","active","inactive","archived"), n, TRUE)),
amount = sample(1:5, n, TRUE),
value = sample(c(3, 3.5, 2.5, 2), n, TRUE)
)rollup(DT, j = sum(value), by = c("color","year","status")) # default id=FALSE
#> color year status V1
#> 1: red 2015-01-01 active 3.5
#> 2: green 2015-01-01 inactive 5.5
#> 3: green 2014-01-01 archived 3.5
#> 4: green 2015-01-01 archived 2.0
#> 5: yellow 2014-01-01 active 4.5
#> 6: red 2013-01-01 inactive 2.0
#> 7: green 2011-01-01 active 6.0
#> 8: red 2014-01-01 inactive 2.5
#> 9: green 2011-01-01 archived 2.5
#> 10: yellow 2015-01-01 active 2.0
#> 11: red 2012-01-01 archived 2.0
#> 12: red 2011-01-01 removed 3.5
#> 13: green 2014-01-01 inactive 8.0
#> 14: green 2011-01-01 removed 2.0
#> 15: yellow 2012-01-01 archived 2.5
#> 16: red 2013-01-01 removed 3.5
#> 17: green 2013-01-01 active 3.0
#> 18: green 2014-01-01 removed 2.5
#> 19: red 2011-01-01 archived 3.0
#> 20: red 2015-01-01 <NA> 3.5
#> 21: green 2015-01-01 <NA> 7.5
#> 22: green 2014-01-01 <NA> 14.0
#> 23: yellow 2014-01-01 <NA> 4.5
#> 24: red 2013-01-01 <NA> 5.5
#> 25: green 2011-01-01 <NA> 10.5
#> 26: red 2014-01-01 <NA> 2.5
#> 27: yellow 2015-01-01 <NA> 2.0
#> 28: red 2012-01-01 <NA> 2.0
#> 29: red 2011-01-01 <NA> 6.5
#> 30: yellow 2012-01-01 <NA> 2.5
#> 31: green 2013-01-01 <NA> 3.0
#> 32: red <NA> <NA> 20.0
#> 33: green <NA> <NA> 35.0
#> 34: yellow <NA> <NA> 9.0
#> 35: <NA> <NA> <NA> 64.0
#> color year status V1
#rollup(DT, j = sum(value), by = c("color","year","status"), id=TRUE)
个人运用,实际工作中常常需要汇总项,汇总项在Excel透视表中很简单,在R中我之前是构造重复的数据源聚合汇总出现汇总项,极大浪费内存,运算速度减慢.
- 新方法 rollup
set.seed(25)
<- 1000
N <- data.table(col1=sample(LETTERS[1:5],N,replace = T),col2=sample(letters[1:5],N,replace = T),num=1:N)
dt
rollup(dt,j=c(list(sum(num))),by=c('col1','col2'))
#> col1 col2 V1
#> 1: E a 19926
#> 2: D a 20966
#> 3: A d 12927
#> 4: A b 20862
#> 5: A c 15331
#> 6: B d 15414
#> 7: C e 20794
#> 8: D e 16110
#> 9: C d 22152
#> 10: A a 18378
#> 11: C c 19474
#> 12: E d 18831
#> 13: B b 19941
#> 14: C a 19652
#> 15: E c 16734
#> 16: E e 24137
#> 17: E b 21988
#> 18: D b 16607
#> 19: B c 25720
#> 20: B a 22109
#> 21: A e 18724
#> 22: C b 24323
#> 23: D d 20508
#> 24: D c 19668
#> 25: B e 29224
#> 26: E <NA> 101616
#> 27: D <NA> 93859
#> 28: A <NA> 86222
#> 29: B <NA> 112408
#> 30: C <NA> 106395
#> 31: <NA> <NA> 500500
#> col1 col2 V1
#同上 添加汇总项名称 total
#rollup(dt,j=c(list(total=sum(num))),by=c('col1','col2'))
#添加id=TRUE参数,多出的grouping 列显示聚合级别
#rollup(dt,j=c(list(total=sum(num))),by=c('col1','col2'),id=TRUE)
2.groupingsets
按照指定字段聚合.包作者说相同与SQL中的 GROUPING SETS 操作.详情参照postgresql
<- groupingsets(DT, j = c(list(count=.N), lapply(.SD, sum)), by = c("color","year","status"),
res sets = list("color", c("year","status"), character()), id=TRUE)
head(res)
#> grouping color year status count amount value
#> 1: 3 red <NA> <NA> 7 19 20.0
#> 2: 3 green <NA> <NA> 13 43 35.0
#> 3: 3 yellow <NA> <NA> 4 10 9.0
#> 4: 4 <NA> 2015-01-01 active 2 8 5.5
#> 5: 4 <NA> 2015-01-01 inactive 2 5 5.5
#> 6: 4 <NA> 2014-01-01 archived 1 3 3.5
注意groupingsets函数中sets参数,用list()包裹想要聚合的字段组合,最后还有一个character(),加上该部分相当于全部聚合.当by只有一个字段时,相当于汇总.用法类似sql中“().”
上述语句结果等同于下面sql.
select color ,year, status,count(*) count,sum(amount) amount,sum(value) value
FROM dbo.DT
GROUP BY
GROUPING SETS(
(color),year,status),
(---- 类似 character()
() )
最后还有cube()函数,可?cube查看用法
7.6.3 行列转变
- 一列变多行
用tstrsplit()函数实现
<- 10
n <- data.table(name=LETTERS[1:n],char=rep('我-爱-R-语-言'),n)
dt <- dt[,.(newcol=tstrsplit(char,'-')),by=.(name)]
res head(res)
#> name newcol
#> 1: A 我
#> 2: A 爱
#> 3: A R
#> 4: A 语
#> 5: A 言
#> 6: B 我
- 多行变一列
char=paste0(newcol,collapse = '-')),by=.(name)]
res[,.(#> name char
#> 1: A 我-爱-R-语-言
#> 2: B 我-爱-R-语-言
#> 3: C 我-爱-R-语-言
#> 4: D 我-爱-R-语-言
#> 5: E 我-爱-R-语-言
#> 6: F 我-爱-R-语-言
#> 7: G 我-爱-R-语-言
#> 8: H 我-爱-R-语-言
#> 9: I 我-爱-R-语-言
#> 10: J 我-爱-R-语-言
# 同上
# res[,.(char=stringr::str_c(newcol,collapse = '-')),by=.(name)]