2.2 基础用法
基础用法部分,我们将从行筛选,重命名、列位置调整、新增计算列、排序、分组聚合几个方面阐述dplyr
动词功能。
首先加载包,加载包时可能会有一些重名函数的提示,可以通过warn.conflict参数禁掉提示。如下所示:
# 禁掉提示
library(dplyr,warn.conflicts = FALSE)
2.2.1 filter
filter
动词顾名思义即筛选功能,按照一定条件筛选data.frame;与Excel中的筛选功能和SQL
中where
条件一致。
filter条件筛选中可以分为单条件筛选和多条件筛选;多条件中间用,
分隔。
- 单条件
条件为species == "Droid"
时,如下所示:
%>%
starwars filter(species == "Droid")
#> # A tibble: 6 x 14
#> name height mass hair_color skin_color eye_color birth_year sex gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
#> 1 C-3PO 167 75 <NA> gold yellow 112 none masculi~
#> 2 R2-D2 96 32 <NA> white, blue red 33 none masculi~
#> 3 R5-D4 97 32 <NA> white, red red NA none masculi~
#> 4 IG-88 200 140 none metal red 15 none masculi~
#> 5 R4-P17 96 NA none silver, red red, blue NA none feminine
#> 6 BB8 NA NA none none black NA none masculi~
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
select * from starwars where species = "Droid" -- 注意=与==的区别
- 多条件
多条件筛选时,用英文逗号隔开多个条件。sql中用“and”连接多个条件,与 R 中用逗号隔开效果相同,“and” 在 R 中用 & 表示。
%>%
starwars filter(species == "Droid",skin_color == "gold")
#> # A tibble: 1 x 14
#> name height mass hair_color skin_color eye_color birth_year sex gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
#> 1 C-3PO 167 75 <NA> gold yellow 112 none masculine
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
# 同上
# starwars %>%
# filter(species == "Droid" & skin_color == "gold")
select * from starwars where species = "Droid" and skin_color = "gold"
- 多情况筛选
%in%
类似SQL
中 in
的用法,与Excel中筛选条件时“或”条件相当。
%>%
starwars filter(species %in% c("Droid",'Clawdite'))
#> # A tibble: 7 x 14
#> name height mass hair_color skin_color eye_color birth_year sex gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
#> 1 C-3PO 167 75 <NA> gold yellow 112 none mascu~
#> 2 R2-D2 96 32 <NA> white, blue red 33 none mascu~
#> 3 R5-D4 97 32 <NA> white, red red NA none mascu~
#> 4 IG-88 200 140 none metal red 15 none mascu~
#> 5 Zam We~ 168 55 blonde fair, green~ yellow NA fema~ femin~
#> 6 R4-P17 96 NA none silver, red red, blue NA none femin~
#> # ... with 1 more row, and 5 more variables: homeworld <chr>, species <chr>,
#> # films <list>, vehicles <list>, starships <list>
select * from starwars where species in ("Droid","Clawdite") --sql查询
- 逻辑关系筛选
条件运算分为逻辑运算、关系运算。
关系运算符 >、<、==、!=、>=、<=分别代表大于、小于、等于、不等于、大于等于、小于等于。
逻辑运算符 &、|、!。 |
为 或, &
为并、且条件,!
为非。
library(nycflights13)
filter(flights, !(arr_delay > 120 | dep_delay > 120))
#> # A tibble: 316,050 x 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 533 529 4 850 830
#> 3 2013 1 1 542 540 2 923 850
#> 4 2013 1 1 544 545 -1 1004 1022
#> 5 2013 1 1 554 600 -6 812 837
#> 6 2013 1 1 554 558 -4 740 728
#> # ... with 316,044 more rows, and 11 more variables: arr_delay <dbl>,
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
filter(flights, arr_delay <= 120, dep_delay <= 120)
#> # A tibble: 316,050 x 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 533 529 4 850 830
#> 3 2013 1 1 542 540 2 923 850
#> 4 2013 1 1 544 545 -1 1004 1022
#> 5 2013 1 1 554 600 -6 812 837
#> 6 2013 1 1 554 558 -4 740 728
#> # ... with 316,044 more rows, and 11 more variables: arr_delay <dbl>,
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# same above
filter(flights, arr_delay <= 120 & dep_delay <= 120)
#> # A tibble: 316,050 x 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 533 529 4 850 830
#> 3 2013 1 1 542 540 2 923 850
#> 4 2013 1 1 544 545 -1 1004 1022
#> 5 2013 1 1 554 600 -6 812 837
#> 6 2013 1 1 554 558 -4 740 728
#> # ... with 316,044 more rows, and 11 more variables: arr_delay <dbl>,
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# %in% 的反面
%>%
starwars filter(!species %in% c("Droid",'Clawdite'))
#> # A tibble: 80 x 14
#> name height mass hair_color skin_color eye_color birth_year sex gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
#> 1 Luke Sk~ 172 77 blond fair blue 19 male mascu~
#> 2 Darth V~ 202 136 none white yellow 41.9 male mascu~
#> 3 Leia Or~ 150 49 brown light brown 19 fema~ femin~
#> 4 Owen La~ 178 120 brown, grey light blue 52 male mascu~
#> 5 Beru Wh~ 165 75 brown light blue 47 fema~ femin~
#> 6 Biggs D~ 183 84 black light brown 24 male mascu~
#> # ... with 74 more rows, and 5 more variables: homeworld <chr>, species <chr>,
#> # films <list>, vehicles <list>, starships <list>
!的运算级别相比 %in% 更高
2.2.2 select
当完整数据集列较多时,我们某次分析可能并不需要那么多列,通过动词select()
筛选列,剔除不需要的列。
- 基础用法
通过指定列名称筛选,并指定列之间顺序
%>%
starwars select(name,height,mass,hair_color,skin_color,eye_color)
#> # A tibble: 87 x 6
#> name height mass hair_color skin_color eye_color
#> <chr> <int> <dbl> <chr> <chr> <chr>
#> 1 Luke Skywalker 172 77 blond fair blue
#> 2 C-3PO 167 75 <NA> gold yellow
#> 3 R2-D2 96 32 <NA> white, blue red
#> 4 Darth Vader 202 136 none white yellow
#> 5 Leia Organa 150 49 brown light brown
#> 6 Owen Lars 178 120 brown, grey light blue
#> # ... with 81 more rows
- 列索引
通过列名或数字向量索引,但是不建议用数字索引,避免原始数据列顺序变化后导致报错。
%>%
starwars select(name : eye_color)
#> # A tibble: 87 x 6
#> name height mass hair_color skin_color eye_color
#> <chr> <int> <dbl> <chr> <chr> <chr>
#> 1 Luke Skywalker 172 77 blond fair blue
#> 2 C-3PO 167 75 <NA> gold yellow
#> 3 R2-D2 96 32 <NA> white, blue red
#> 4 Darth Vader 202 136 none white yellow
#> 5 Leia Organa 150 49 brown light brown
#> 6 Owen Lars 178 120 brown, grey light blue
#> # ... with 81 more rows
# 同上
%>%
starwars select(1:6)
#> # A tibble: 87 x 6
#> name height mass hair_color skin_color eye_color
#> <chr> <int> <dbl> <chr> <chr> <chr>
#> 1 Luke Skywalker 172 77 blond fair blue
#> 2 C-3PO 167 75 <NA> gold yellow
#> 3 R2-D2 96 32 <NA> white, blue red
#> 4 Darth Vader 202 136 none white yellow
#> 5 Leia Organa 150 49 brown light brown
#> 6 Owen Lars 178 120 brown, grey light blue
#> # ... with 81 more rows
# starwars %>% select(c(1,2,4,5,7))
- 新增列筛选方式
# starwars %>% select(!(name:mass))
# iris %>% select(!ends_with("Width"))
# iris %>% select(starts_with("Petal") & ends_with("Width"))
# iris %>% select(starts_with("Petal") | ends_with("Width"))
2.2.3 rename
列重命名使用rename()
函数,新名称写前面,如下所示:
%>% rename(home_world = homeworld)
starwars #> # A tibble: 87 x 14
#> name height mass hair_color skin_color eye_color birth_year sex gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
#> 1 Luke Sk~ 172 77 blond fair blue 19 male mascu~
#> 2 C-3PO 167 75 <NA> gold yellow 112 none mascu~
#> 3 R2-D2 96 32 <NA> white, bl~ red 33 none mascu~
#> 4 Darth V~ 202 136 none white yellow 41.9 male mascu~
#> 5 Leia Or~ 150 49 brown light brown 19 fema~ femin~
#> 6 Owen La~ 178 120 brown, grey light blue 52 male mascu~
#> # ... with 81 more rows, and 5 more variables: home_world <chr>, species <chr>,
#> # films <list>, vehicles <list>, starships <list>
# 多列同换
%>% rename(home_world = homeworld,skincolor = skin_color)
starwars #> # A tibble: 87 x 14
#> name height mass hair_color skincolor eye_color birth_year sex gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
#> 1 Luke Sk~ 172 77 blond fair blue 19 male mascu~
#> 2 C-3PO 167 75 <NA> gold yellow 112 none mascu~
#> 3 R2-D2 96 32 <NA> white, bl~ red 33 none mascu~
#> 4 Darth V~ 202 136 none white yellow 41.9 male mascu~
#> 5 Leia Or~ 150 49 brown light brown 19 fema~ femin~
#> 6 Owen La~ 178 120 brown, grey light blue 52 male mascu~
#> # ... with 81 more rows, and 5 more variables: home_world <chr>, species <chr>,
#> # films <list>, vehicles <list>, starships <list>
select * ,homeworld as home_word from starwars
select * ,homeworld home_word from starwars
as 可以省略,但中间有一个以上空格。与R的差异是新增home_word列,原始列继续存在,R中是替换列名。
有时我们需要批量修改列名,改如何实现?
names(starwars) <- paste0('col',1:ncol(starwars))
colnames(starwars) <- paste0('col',1:ncol(starwars))
# not run
::set_names(head(starwars), paste0, "_foo") purrr
2.2.4 relocate
更改列顺序,与使用select()
动词指定列顺序功能相似。
relocate参数如下:
relocate(.data, ..., .before = NULL, .after = NULL)
sex:homeworld 列在 height 列前面
%>% relocate(sex:homeworld, .before = height)
starwars #> # A tibble: 87 x 14
#> name sex gender homeworld height mass hair_color skin_color eye_color
#> <chr> <chr> <chr> <chr> <int> <dbl> <chr> <chr> <chr>
#> 1 Luke Sk~ male mascul~ Tatooine 172 77 blond fair blue
#> 2 C-3PO none mascul~ Tatooine 167 75 <NA> gold yellow
#> 3 R2-D2 none mascul~ Naboo 96 32 <NA> white, bl~ red
#> 4 Darth V~ male mascul~ Tatooine 202 136 none white yellow
#> 5 Leia Or~ female femini~ Alderaan 150 49 brown light brown
#> 6 Owen La~ male mascul~ Tatooine 178 120 brown, gr~ light blue
#> # ... with 81 more rows, and 5 more variables: birth_year <dbl>, species <chr>,
#> # films <list>, vehicles <list>, starships <list>
2.2.5 mutate
动词mutate
可以新增计算列,删除列,更新已有列,列之间的计算都可以通过mutate实现。
- 新增计算列
%>%
starwars mutate(bmi = mass / ((height / 100) ^ 2)) %>%
select(name:mass,bmi)
#> # A tibble: 87 x 4
#> name height mass bmi
#> <chr> <int> <dbl> <dbl>
#> 1 Luke Skywalker 172 77 26.0
#> 2 C-3PO 167 75 26.9
#> 3 R2-D2 96 32 34.7
#> 4 Darth Vader 202 136 33.3
#> 5 Leia Organa 150 49 21.8
#> 6 Owen Lars 178 120 37.9
#> # ... with 81 more rows
- 新增计算列基础上新增列
%>%
starwars mutate(bmi = mass / ((height / 100) ^ 2),newbmi = bmi *2) %>%
select(name:mass,bmi,newbmi)
#> # A tibble: 87 x 5
#> name height mass bmi newbmi
#> <chr> <int> <dbl> <dbl> <dbl>
#> 1 Luke Skywalker 172 77 26.0 52.1
#> 2 C-3PO 167 75 26.9 53.8
#> 3 R2-D2 96 32 34.7 69.4
#> 4 Darth Vader 202 136 33.3 66.7
#> 5 Leia Organa 150 49 21.8 43.6
#> 6 Owen Lars 178 120 37.9 75.7
#> # ... with 81 more rows
- 删除列
%>% mutate(height = NULL)
starwars #> # A tibble: 87 x 13
#> name mass hair_color skin_color eye_color birth_year sex gender homeworld
#> <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
#> 1 Luke ~ 77 blond fair blue 19 male mascu~ Tatooine
#> 2 C-3PO 75 <NA> gold yellow 112 none mascu~ Tatooine
#> 3 R2-D2 32 <NA> white, bl~ red 33 none mascu~ Naboo
#> 4 Darth~ 136 none white yellow 41.9 male mascu~ Tatooine
#> 5 Leia ~ 49 brown light brown 19 fema~ femin~ Alderaan
#> 6 Owen ~ 120 brown, gr~ light blue 52 male mascu~ Tatooine
#> # ... with 81 more rows, and 4 more variables: species <chr>, films <list>,
#> # vehicles <list>, starships <list>
2.2.6 arrange
- 单列排序,默认升序,通过
desc()
降序排列
%>%
starwars arrange(desc(mass))
#> # A tibble: 87 x 14
#> name height mass hair_color skin_color eye_color birth_year sex gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
#> 1 Jabba ~ 175 1358 <NA> green-tan,~ orange 600 herm~ mascu~
#> 2 Grievo~ 216 159 none brown, whi~ green, y~ NA male mascu~
#> 3 IG-88 200 140 none metal red 15 none mascu~
#> 4 Darth ~ 202 136 none white yellow 41.9 male mascu~
#> 5 Tarfful 234 136 brown brown blue NA male mascu~
#> 6 Owen L~ 178 120 brown, grey light blue 52 male mascu~
#> # ... with 81 more rows, and 5 more variables: homeworld <chr>, species <chr>,
#> # films <list>, vehicles <list>, starships <list>
- 多列排序
%>%
starwars arrange(height,desc(mass))
#> # A tibble: 87 x 14
#> name height mass hair_color skin_color eye_color birth_year sex gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
#> 1 Yoda 66 17 white green brown 896 male mascu~
#> 2 Ratts Ty~ 79 15 none grey, blue unknown NA male mascu~
#> 3 Wicket S~ 88 20 brown brown brown 8 male mascu~
#> 4 Dud Bolt 94 45 none blue, grey yellow NA male mascu~
#> 5 R2-D2 96 32 <NA> white, bl~ red 33 none mascu~
#> 6 R4-P17 96 NA none silver, r~ red, blue NA none femin~
#> # ... with 81 more rows, and 5 more variables: homeworld <chr>, species <chr>,
#> # films <list>, vehicles <list>, starships <list>
select * from starwars order by height,mass desc
2.2.7 summarise
summarise
常与group_by
结合使用。
%>%
mtcars summarise(mean = mean(disp), n = n())
#> # A tibble: 1 x 2
#> mean n
#> <dbl> <int>
#> 1 231. 32
n()是dplyr包中的计算当前组大小的函数,用在summarise()和mutate()中。通常用来组计算。
2.2.8 group_by
聚合前一般都需要分组,group_by()
动词实现该功能,与SQL
中group by ···
类似。
%>%
starwars group_by(species) %>%
summarise(
n = n(),
mass = mean(mass, na.rm = TRUE)
)#> # A tibble: 38 x 3
#> species n mass
#> <chr> <int> <dbl>
#> 1 Aleena 1 15
#> 2 Besalisk 1 102
#> 3 Cerean 1 82
#> 4 Chagrian 1 NaN
#> 5 Clawdite 1 55
#> 6 Droid 6 69.8
#> # ... with 32 more rows
SELECT species,
count(species) n,
AVG(mass) mass
FROM [spb].[dbo].[starwars]
GROUP BY species