2.2 基础用法

基础用法部分,我们将从行筛选,重命名、列位置调整、新增计算列、排序、分组聚合几个方面阐述dplyr动词功能。

首先加载包,加载包时可能会有一些重名函数的提示,可以通过warn.conflict参数禁掉提示。如下所示:

# 禁掉提示
library(dplyr,warn.conflicts = FALSE)

2.2.1 filter

filter动词顾名思义即筛选功能,按照一定条件筛选data.frame;与Excel中的筛选功能和SQLwhere条件一致。

filter条件筛选中可以分为单条件筛选和多条件筛选;多条件中间用,分隔。

  • 单条件

条件为species == "Droid"时,如下所示:

starwars %>% 
  filter(species == "Droid")
#> # A tibble: 6 x 14
#>   name   height  mass hair_color skin_color  eye_color birth_year sex   gender  
#>   <chr>   <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr> <chr>   
#> 1 C-3PO     167    75 <NA>       gold        yellow           112 none  masculi~
#> 2 R2-D2      96    32 <NA>       white, blue red               33 none  masculi~
#> 3 R5-D4      97    32 <NA>       white, red  red               NA none  masculi~
#> 4 IG-88     200   140 none       metal       red               15 none  masculi~
#> 5 R4-P17     96    NA none       silver, red red, blue         NA none  feminine
#> 6 BB8        NA    NA none       none        black             NA none  masculi~
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>
select * from starwars where species = "Droid" -- 注意=与==的区别
  • 多条件

多条件筛选时,用英文逗号隔开多个条件。sql中用“and”连接多个条件,与 R 中用逗号隔开效果相同,“and” 在 R 中用 & 表示。

starwars %>% 
  filter(species == "Droid",skin_color == "gold")
#> # A tibble: 1 x 14
#>   name  height  mass hair_color skin_color eye_color birth_year sex   gender   
#>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>    
#> 1 C-3PO    167    75 <NA>       gold       yellow           112 none  masculine
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

# 同上
# starwars %>% 
#   filter(species == "Droid" & skin_color == "gold")
select * from starwars where species = "Droid" and skin_color = "gold" 
  • 多情况筛选

%in%类似SQLin 的用法,与Excel中筛选条件时“或”条件相当。

starwars %>% 
  filter(species %in%  c("Droid",'Clawdite'))
#> # A tibble: 7 x 14
#>   name    height  mass hair_color skin_color   eye_color birth_year sex   gender
#>   <chr>    <int> <dbl> <chr>      <chr>        <chr>          <dbl> <chr> <chr> 
#> 1 C-3PO      167    75 <NA>       gold         yellow           112 none  mascu~
#> 2 R2-D2       96    32 <NA>       white, blue  red               33 none  mascu~
#> 3 R5-D4       97    32 <NA>       white, red   red               NA none  mascu~
#> 4 IG-88      200   140 none       metal        red               15 none  mascu~
#> 5 Zam We~    168    55 blonde     fair, green~ yellow            NA fema~ femin~
#> 6 R4-P17      96    NA none       silver, red  red, blue         NA none  femin~
#> # ... with 1 more row, and 5 more variables: homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>
select * from starwars where species in ("Droid","Clawdite") --sql查询
  • 逻辑关系筛选

条件运算分为逻辑运算、关系运算。

关系运算符 >、<、==、!=、>=、<=分别代表大于、小于、等于、不等于、大于等于、小于等于。

逻辑运算符 &、|、!。 |为 或, & 为并、且条件,!为非。

library(nycflights13)
filter(flights, !(arr_delay > 120 | dep_delay > 120))
#> # A tibble: 316,050 x 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     1     1      517            515         2      830            819
#> 2  2013     1     1      533            529         4      850            830
#> 3  2013     1     1      542            540         2      923            850
#> 4  2013     1     1      544            545        -1     1004           1022
#> 5  2013     1     1      554            600        -6      812            837
#> 6  2013     1     1      554            558        -4      740            728
#> # ... with 316,044 more rows, and 11 more variables: arr_delay <dbl>,
#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
filter(flights, arr_delay <= 120, dep_delay <= 120)
#> # A tibble: 316,050 x 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     1     1      517            515         2      830            819
#> 2  2013     1     1      533            529         4      850            830
#> 3  2013     1     1      542            540         2      923            850
#> 4  2013     1     1      544            545        -1     1004           1022
#> 5  2013     1     1      554            600        -6      812            837
#> 6  2013     1     1      554            558        -4      740            728
#> # ... with 316,044 more rows, and 11 more variables: arr_delay <dbl>,
#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

# same above
filter(flights, arr_delay <= 120 & dep_delay <= 120)
#> # A tibble: 316,050 x 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     1     1      517            515         2      830            819
#> 2  2013     1     1      533            529         4      850            830
#> 3  2013     1     1      542            540         2      923            850
#> 4  2013     1     1      544            545        -1     1004           1022
#> 5  2013     1     1      554            600        -6      812            837
#> 6  2013     1     1      554            558        -4      740            728
#> # ... with 316,044 more rows, and 11 more variables: arr_delay <dbl>,
#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

# %in% 的反面
starwars %>% 
  filter(!species %in%  c("Droid",'Clawdite'))
#> # A tibble: 80 x 14
#>   name     height  mass hair_color  skin_color eye_color birth_year sex   gender
#>   <chr>     <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Luke Sk~    172    77 blond       fair       blue            19   male  mascu~
#> 2 Darth V~    202   136 none        white      yellow          41.9 male  mascu~
#> 3 Leia Or~    150    49 brown       light      brown           19   fema~ femin~
#> 4 Owen La~    178   120 brown, grey light      blue            52   male  mascu~
#> 5 Beru Wh~    165    75 brown       light      blue            47   fema~ femin~
#> 6 Biggs D~    183    84 black       light      brown           24   male  mascu~
#> # ... with 74 more rows, and 5 more variables: homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>

!的运算级别相比 %in% 更高

2.2.2 select

当完整数据集列较多时,我们某次分析可能并不需要那么多列,通过动词select()筛选列,剔除不需要的列。

  • 基础用法

通过指定列名称筛选,并指定列之间顺序

starwars %>% 
  select(name,height,mass,hair_color,skin_color,eye_color)
#> # A tibble: 87 x 6
#>   name           height  mass hair_color  skin_color  eye_color
#>   <chr>           <int> <dbl> <chr>       <chr>       <chr>    
#> 1 Luke Skywalker    172    77 blond       fair        blue     
#> 2 C-3PO             167    75 <NA>        gold        yellow   
#> 3 R2-D2              96    32 <NA>        white, blue red      
#> 4 Darth Vader       202   136 none        white       yellow   
#> 5 Leia Organa       150    49 brown       light       brown    
#> 6 Owen Lars         178   120 brown, grey light       blue     
#> # ... with 81 more rows
  • 列索引

通过列名或数字向量索引,但是不建议用数字索引,避免原始数据列顺序变化后导致报错。

starwars %>% 
  select(name : eye_color)
#> # A tibble: 87 x 6
#>   name           height  mass hair_color  skin_color  eye_color
#>   <chr>           <int> <dbl> <chr>       <chr>       <chr>    
#> 1 Luke Skywalker    172    77 blond       fair        blue     
#> 2 C-3PO             167    75 <NA>        gold        yellow   
#> 3 R2-D2              96    32 <NA>        white, blue red      
#> 4 Darth Vader       202   136 none        white       yellow   
#> 5 Leia Organa       150    49 brown       light       brown    
#> 6 Owen Lars         178   120 brown, grey light       blue     
#> # ... with 81 more rows

# 同上
starwars %>% 
  select(1:6)
#> # A tibble: 87 x 6
#>   name           height  mass hair_color  skin_color  eye_color
#>   <chr>           <int> <dbl> <chr>       <chr>       <chr>    
#> 1 Luke Skywalker    172    77 blond       fair        blue     
#> 2 C-3PO             167    75 <NA>        gold        yellow   
#> 3 R2-D2              96    32 <NA>        white, blue red      
#> 4 Darth Vader       202   136 none        white       yellow   
#> 5 Leia Organa       150    49 brown       light       brown    
#> 6 Owen Lars         178   120 brown, grey light       blue     
#> # ... with 81 more rows

# starwars %>% select(c(1,2,4,5,7))
  • 新增列筛选方式
# starwars %>% select(!(name:mass))
# iris %>% select(!ends_with("Width"))
# iris %>% select(starts_with("Petal") & ends_with("Width"))
# iris %>% select(starts_with("Petal") | ends_with("Width"))

2.2.3 rename

列重命名使用rename()函数,新名称写前面,如下所示:

starwars %>% rename(home_world = homeworld)
#> # A tibble: 87 x 14
#>   name     height  mass hair_color  skin_color eye_color birth_year sex   gender
#>   <chr>     <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Luke Sk~    172    77 blond       fair       blue            19   male  mascu~
#> 2 C-3PO       167    75 <NA>        gold       yellow         112   none  mascu~
#> 3 R2-D2        96    32 <NA>        white, bl~ red             33   none  mascu~
#> 4 Darth V~    202   136 none        white      yellow          41.9 male  mascu~
#> 5 Leia Or~    150    49 brown       light      brown           19   fema~ femin~
#> 6 Owen La~    178   120 brown, grey light      blue            52   male  mascu~
#> # ... with 81 more rows, and 5 more variables: home_world <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>
# 多列同换
starwars %>% rename(home_world = homeworld,skincolor = skin_color)
#> # A tibble: 87 x 14
#>   name     height  mass hair_color  skincolor  eye_color birth_year sex   gender
#>   <chr>     <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Luke Sk~    172    77 blond       fair       blue            19   male  mascu~
#> 2 C-3PO       167    75 <NA>        gold       yellow         112   none  mascu~
#> 3 R2-D2        96    32 <NA>        white, bl~ red             33   none  mascu~
#> 4 Darth V~    202   136 none        white      yellow          41.9 male  mascu~
#> 5 Leia Or~    150    49 brown       light      brown           19   fema~ femin~
#> 6 Owen La~    178   120 brown, grey light      blue            52   male  mascu~
#> # ... with 81 more rows, and 5 more variables: home_world <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>
select * ,homeworld as home_word from starwars 
select * ,homeworld  home_word from starwars 

as 可以省略,但中间有一个以上空格。与R的差异是新增home_word列,原始列继续存在,R中是替换列名。

有时我们需要批量修改列名,改如何实现?

names(starwars) <- paste0('col',1:ncol(starwars))
colnames(starwars) <- paste0('col',1:ncol(starwars))

# not run
purrr::set_names(head(starwars), paste0, "_foo")

2.2.4 relocate

更改列顺序,与使用select()动词指定列顺序功能相似。

relocate参数如下:

relocate(.data, ..., .before = NULL, .after = NULL)

sex:homeworld 列在 height 列前面

starwars %>% relocate(sex:homeworld, .before = height)
#> # A tibble: 87 x 14
#>   name     sex    gender  homeworld height  mass hair_color skin_color eye_color
#>   <chr>    <chr>  <chr>   <chr>      <int> <dbl> <chr>      <chr>      <chr>    
#> 1 Luke Sk~ male   mascul~ Tatooine     172    77 blond      fair       blue     
#> 2 C-3PO    none   mascul~ Tatooine     167    75 <NA>       gold       yellow   
#> 3 R2-D2    none   mascul~ Naboo         96    32 <NA>       white, bl~ red      
#> 4 Darth V~ male   mascul~ Tatooine     202   136 none       white      yellow   
#> 5 Leia Or~ female femini~ Alderaan     150    49 brown      light      brown    
#> 6 Owen La~ male   mascul~ Tatooine     178   120 brown, gr~ light      blue     
#> # ... with 81 more rows, and 5 more variables: birth_year <dbl>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>

2.2.5 mutate

动词mutate可以新增计算列,删除列,更新已有列,列之间的计算都可以通过mutate实现。

  • 新增计算列
starwars %>% 
  mutate(bmi = mass / ((height / 100)  ^ 2)) %>% 
  select(name:mass,bmi)
#> # A tibble: 87 x 4
#>   name           height  mass   bmi
#>   <chr>           <int> <dbl> <dbl>
#> 1 Luke Skywalker    172    77  26.0
#> 2 C-3PO             167    75  26.9
#> 3 R2-D2              96    32  34.7
#> 4 Darth Vader       202   136  33.3
#> 5 Leia Organa       150    49  21.8
#> 6 Owen Lars         178   120  37.9
#> # ... with 81 more rows
  • 新增计算列基础上新增列
starwars %>% 
  mutate(bmi = mass / ((height / 100)  ^ 2),newbmi = bmi *2) %>% 
  select(name:mass,bmi,newbmi)
#> # A tibble: 87 x 5
#>   name           height  mass   bmi newbmi
#>   <chr>           <int> <dbl> <dbl>  <dbl>
#> 1 Luke Skywalker    172    77  26.0   52.1
#> 2 C-3PO             167    75  26.9   53.8
#> 3 R2-D2              96    32  34.7   69.4
#> 4 Darth Vader       202   136  33.3   66.7
#> 5 Leia Organa       150    49  21.8   43.6
#> 6 Owen Lars         178   120  37.9   75.7
#> # ... with 81 more rows
  • 删除列
starwars %>% mutate(height = NULL)
#> # A tibble: 87 x 13
#>   name    mass hair_color skin_color eye_color birth_year sex   gender homeworld
#>   <chr>  <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>  <chr>    
#> 1 Luke ~    77 blond      fair       blue            19   male  mascu~ Tatooine 
#> 2 C-3PO     75 <NA>       gold       yellow         112   none  mascu~ Tatooine 
#> 3 R2-D2     32 <NA>       white, bl~ red             33   none  mascu~ Naboo    
#> 4 Darth~   136 none       white      yellow          41.9 male  mascu~ Tatooine 
#> 5 Leia ~    49 brown      light      brown           19   fema~ femin~ Alderaan 
#> 6 Owen ~   120 brown, gr~ light      blue            52   male  mascu~ Tatooine 
#> # ... with 81 more rows, and 4 more variables: species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

2.2.6 arrange

  • 单列排序,默认升序,通过desc()降序排列
starwars %>% 
  arrange(desc(mass))
#> # A tibble: 87 x 14
#>   name    height  mass hair_color  skin_color  eye_color birth_year sex   gender
#>   <chr>    <int> <dbl> <chr>       <chr>       <chr>          <dbl> <chr> <chr> 
#> 1 Jabba ~    175  1358 <NA>        green-tan,~ orange         600   herm~ mascu~
#> 2 Grievo~    216   159 none        brown, whi~ green, y~       NA   male  mascu~
#> 3 IG-88      200   140 none        metal       red             15   none  mascu~
#> 4 Darth ~    202   136 none        white       yellow          41.9 male  mascu~
#> 5 Tarfful    234   136 brown       brown       blue            NA   male  mascu~
#> 6 Owen L~    178   120 brown, grey light       blue            52   male  mascu~
#> # ... with 81 more rows, and 5 more variables: homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>
  • 多列排序
starwars %>% 
  arrange(height,desc(mass))
#> # A tibble: 87 x 14
#>   name      height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Yoda          66    17 white      green      brown            896 male  mascu~
#> 2 Ratts Ty~     79    15 none       grey, blue unknown           NA male  mascu~
#> 3 Wicket S~     88    20 brown      brown      brown              8 male  mascu~
#> 4 Dud Bolt      94    45 none       blue, grey yellow            NA male  mascu~
#> 5 R2-D2         96    32 <NA>       white, bl~ red               33 none  mascu~
#> 6 R4-P17        96    NA none       silver, r~ red, blue         NA none  femin~
#> # ... with 81 more rows, and 5 more variables: homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>
select * from starwars order by height,mass desc

2.2.7 summarise

summarise常与group_by结合使用。

mtcars %>%
  summarise(mean = mean(disp), n = n())
#> # A tibble: 1 x 2
#>    mean     n
#>   <dbl> <int>
#> 1  231.    32

n()是dplyr包中的计算当前组大小的函数,用在summarise()和mutate()中。通常用来组计算。

2.2.8 group_by

聚合前一般都需要分组,group_by()动词实现该功能,与SQLgroup by ···类似。

starwars %>%
  group_by(species) %>%
  summarise(
    n = n(),
    mass = mean(mass, na.rm = TRUE)
  )
#> # A tibble: 38 x 3
#>   species      n  mass
#>   <chr>    <int> <dbl>
#> 1 Aleena       1  15  
#> 2 Besalisk     1 102  
#> 3 Cerean       1  82  
#> 4 Chagrian     1 NaN  
#> 5 Clawdite     1  55  
#> 6 Droid        6  69.8
#> # ... with 32 more rows
SELECT species,
  count(species) n,
  AVG(mass) mass
FROM [spb].[dbo].[starwars]
GROUP BY  species