Chapter 4 Choosing the right columns: select()

Cheatsheet screenshot for reference

Part 1

select() subsets a data set by column (variable): it keeps the columns you name and drops the rest, without changing the rows.

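  1. Basic selection
# Load dplyr (for select() and %>%) and the flights data
library(dplyr)
library(nycflights13)
# Suppose I want the basic information for each flight, including origin and destination
flights %>% select(year, month, day, carrier, flight, tailnum, origin, dest)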
## # A tibble: 336,776 x 8
##     year month   day carrier flight tailnum origin
##    <int> <int> <int> <chr>    <int> <chr>   <chr> 
##  1  2013     1     1 UA        1545 N14228  EWR   
##  2  2013     1     1 UA        1714 N24211  LGA   
##  3  2013     1     1 AA        1141 N619AA  JFK   
##  4  2013     1     1 B6         725 N804JB  JFK   
##  5  2013     1     1 DL         461 N668DN  LGA   
##  6  2013     1     1 UA        1696 N39463  EWR   
##  7  2013     1     1 B6         507 N516JB  EWR   
##  8  2013     1     1 EV        5708 N829AS  LGA   
##  9  2013     1     1 B6          79 N593JB  JFK   
## 10  2013     1     1 AA         301 N3ALAA  LGA   
## # ... with 336,766 more rows, and 1 more variable:
## #   dest <chr>
  2. Selecting a range of consecutive columns
# How is this the same as, or different from, the version above?
flights %>% select(year:day, carrier:dest)
## # A tibble: 336,776 x 8
##     year month   day carrier flight tailnum origin
##    <int> <int> <int> <chr>    <int> <chr>   <chr> 
##  1  2013     1     1 UA        1545 N14228  EWR   
##  2  2013     1     1 UA        1714 N24211  LGA   
##  3  2013     1     1 AA        1141 N619AA  JFK   
##  4  2013     1     1 B6         725 N804JB  JFK   
##  5  2013     1     1 DL         461 N668DN  LGA   
##  6  2013     1     1 UA        1696 N39463  EWR   
##  7  2013     1     1 B6         507 N516JB  EWR   
##  8  2013     1     1 EV        5708 N829AS  LGA   
##  9  2013     1     1 B6          79 N593JB  JFK   
## 10  2013     1     1 AA         301 N3ALAA  LGA   
## # ... with 336,766 more rows, and 1 more variable:
## #   dest <chr>
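The two calls above give the same columns in the same order, because `year:day` and `carrier:dest` expand to every column between the two endpoints as they are positioned in the data. A minimal sketch to check this, assuming a small made-up data frame (`toy_flights` and its values are hypothetical stand-ins for `flights`, so the snippet runs without nycflights13):

```r
library(dplyr)

# Hypothetical miniature of flights; the columns appear in the same
# relative order as in the real data
toy_flights <- data.frame(
  year = c(2013L, 2013L),
  month = c(1L, 1L),
  day = c(1L, 1L),
  dep_time = c(517L, 533L),
  carrier = c("UA", "UA"),
  flight = c(1545L, 1714L),
  tailnum = c("N14228", "N24211"),
  origin = c("EWR", "LGA"),
  dest = c("IAH", "IAH"),
  stringsAsFactors = FALSE
)

# Name every column explicitly
by_name <- toy_flights %>%
  select(year, month, day, carrier, flight, tailnum, origin, dest)

# Use ranges of adjacent columns instead
by_range <- toy_flights %>% select(year:day, carrier:dest)

# Compare the selected column names
identical(names(by_name), names(by_range))
```

The range form is shorter, but it depends on the column order of the data, so it can silently pick up different columns if that order changes.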
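  3. Selecting columns by name pattern
# If I want the data about departure and arrival, contains() comes in handy. There are other helpers as well; see the cheatsheet.
# Note: contains("arr") matches "arr" anywhere in a name, so carrier is selected too (see the output below).
flights %>% select(contains("dep"), contains("arr"))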
## # A tibble: 336,776 x 7
##    dep_time sched_dep_time dep_delay arr_time
##       <int>          <int>     <dbl>    <int>
##  1      517            515         2      830
##  2      533            529         4      850
##  3      542            540         2      923
##  4      544            545        -1     1004
##  5      554            600        -6      812
##  6      554            558        -4      740
##  7      555            600        -5      913
##  8      557            600        -3      709
##  9      557            600        -3      838
## 10      558            600        -2      753
## # ... with 336,766 more rows, and 3 more variables:
## #   sched_arr_time <int>, arr_delay <dbl>,
## #   carrier <chr>
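The output above includes `carrier` because `contains()` matches the string anywhere in a column name. A short sketch of how the related helpers `starts_with()` and `ends_with()` behave, assuming a hypothetical one-row data frame `toy` that reuses a few `flights` column names:

```r
library(dplyr)

# Hypothetical one-row data frame reusing some flights column names
toy <- data.frame(
  dep_time = 517L, dep_delay = 2,
  arr_time = 830L, arr_delay = 11,
  carrier = "UA",
  stringsAsFactors = FALSE
)

# contains() matches the literal string anywhere in the name,
# so "arr" also picks up carrier
with_contains <- toy %>% select(contains("arr")) %>% names()

# starts_with() only matches a prefix, so carrier is left out
with_prefix <- toy %>% select(starts_with("arr")) %>% names()

# ends_with() only matches a suffix
with_suffix <- toy %>% select(ends_with("delay")) %>% names()

with_contains
with_prefix
with_suffix
```

When only the arrival columns are wanted, `starts_with("arr")` is probably the safer choice for this data.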
  4. Dropping certain variables
# If I notice that some variables will not help the later analysis...
flights %>% select(-c(time_hour))
## # A tibble: 336,776 x 18
##     year month   day dep_time sched_dep_time dep_delay
##    <int> <int> <int>    <int>          <int>     <dbl>
##  1  2013     1     1      517            515         2
##  2  2013     1     1      533            529         4
##  3  2013     1     1      542            540         2
##  4  2013     1     1      544            545        -1
##  5  2013     1     1      554            600        -6
##  6  2013     1     1      554            558        -4
##  7  2013     1     1      555            600        -5
##  8  2013     1     1      557            600        -3
##  9  2013     1     1      557            600        -3
## 10  2013     1     1      558            600        -2
## # ... with 336,766 more rows, and 12 more variables:
## #   arr_time <int>, sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>
# Can several variables be dropped at once? Yes: list them all inside -c().
flights %>% select(-c(time_hour, tailnum, flight))
## # A tibble: 336,776 x 16
##     year month   day dep_time sched_dep_time dep_delay
##    <int> <int> <int>    <int>          <int>     <dbl>
##  1  2013     1     1      517            515         2
##  2  2013     1     1      533            529         4
##  3  2013     1     1      542            540         2
##  4  2013     1     1      544            545        -1
##  5  2013     1     1      554            600        -6
##  6  2013     1     1      554            558        -4
##  7  2013     1     1      555            600        -5
##  8  2013     1     1      557            600        -3
##  9  2013     1     1      557            600        -3
## 10  2013     1     1      558            600        -2
## # ... with 336,766 more rows, and 10 more variables:
## #   arr_time <int>, sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, origin <chr>,
## #   dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>
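When the columns to drop are stored as strings (for example, in a vector built elsewhere in a script), one option is to wrap the vector in `all_of()` and negate it. This is a sketch under assumptions: `toy` and `drop_cols` are hypothetical names, and `all_of()` requires a reasonably recent dplyr/tidyselect.

```r
library(dplyr)

# Hypothetical one-row data frame reusing some flights column names
toy <- data.frame(
  year = 2013L,
  flight = 1545L,
  tailnum = "N14228",
  time_hour = "2013-01-01 05:00:00",
  stringsAsFactors = FALSE
)

# Hypothetical character vector of unwanted column names
drop_cols <- c("time_hour", "tailnum", "flight")

# all_of() selects the columns named in the vector; the minus sign drops them
kept <- toy %>% select(-all_of(drop_cols))

names(kept)
```

`all_of()` raises an error if any name in the vector is missing from the data, which helps catch typos early.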
  5. everything()
# everything() means "all variables".
# Later on, combining it with mutate() makes data wrangling more efficient.
flights %>% select(everything())
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay
##    <int> <int> <int>    <int>          <int>     <dbl>
##  1  2013     1     1      517            515         2
##  2  2013     1     1      533            529         4
##  3  2013     1     1      542            540         2
##  4  2013     1     1      544            545        -1
##  5  2013     1     1      554            600        -6
##  6  2013     1     1      554            558        -4
##  7  2013     1     1      555            600        -5
##  8  2013     1     1      557            600        -3
##  9  2013     1     1      557            600        -3
## 10  2013     1     1      558            600        -2
## # ... with 336,766 more rows, and 13 more variables:
## #   arr_time <int>, sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
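On its own, `select(everything())` returns the data unchanged. It becomes useful when a few columns are named first: those move to the front and `everything()` fills in the rest. A minimal sketch of that pattern together with `mutate()`, assuming a hypothetical data frame `toy` and a hypothetical new column `gain`:

```r
library(dplyr)

# Hypothetical two-row data frame reusing some flights column names
toy <- data.frame(
  dep_delay = c(2, 4),
  arr_delay = c(11, 20),
  carrier = c("UA", "UA"),
  stringsAsFactors = FALSE
)

reordered <- toy %>%
  # gain is a hypothetical new column, appended at the end by mutate()
  mutate(gain = dep_delay - arr_delay) %>%
  # Put gain first, then keep all remaining columns in their original order
  select(gain, everything())

names(reordered)
```

A column that was already named is not repeated by `everything()`, so `gain` appears only once.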

Example

When you are done, paste your answer directly into the open chat.

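  1. Suppose a buyer is very interested in these 32 cars, particularly their fuel economy (mpg, Miles/(US) gallon) and their power (hp, Gross horsepower). How would you prepare the data for them?
mtcars
# mtcars stores the car names as row names, not as a column, so turn them into a car_name column first
mtcars %>% tibble::rownames_to_column("car_name") %>% select(car_name, mpg, hp)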

Self-practice

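  1. Suppose a team general manager cares a lot about defense and wants to find out who the better defenders in the league are. He needs each player's name (Name), team (Team), position (Position), total rebounds (TotalRebounds), and steals (Steals). How should the data be prepared?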