Chapter 19 Looking Back on the tidyverse Journey

library(tidyverse)

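The previous chapters introduced several components of the tidyverse suite, one after another. Did it feel hard? If so, that means you were paying attention.

This chapter is a recap: through a case study, we review and tie together the core components most commonly used in the tidyverse (in fact, the tidyverse suite is richer than what we list here).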

Figure 19.1: Image source: Silvia Canelón's talk at R-Ladies Chicago

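19.1 The readr package

Reading in data is the first step, and we can use readr to import it.

Tips:

  • Comma-separated files: read_csv()
  • Tab-separated files: read_tsv()
  • Files with an arbitrary delimiter: read_delim()
  • Fixed-width files: read_fwf()
  • Whitespace-separated files: read_table()
  • Web log files: read_log()

Reading external data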

penguins <- read_csv("./demo_data/penguins.csv") 

Saving to an external file

penguins %>% write_csv("newdata.csv")
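The read and write steps above can be sketched as a round trip. This is a minimal illustration, not the chapter's own data: the `toy` table and the temporary file path are made up so the example runs without `./demo_data/penguins.csv`.

```r
library(readr)
library(tibble)

# A tiny hypothetical table standing in for the penguins data
toy <- tibble(
  species     = c("Adelie", "Gentoo", "Chinstrap"),
  body_mass_g = c(3750, 5000, NA)
)

# Write to a temporary file so the working directory is left untouched
path <- tempfile(fileext = ".csv")
write_csv(toy, path)

# Read it back, stating the column types explicitly instead of letting readr guess
toy_back <- read_csv(
  path,
  col_types = cols(species = col_character(), body_mass_g = col_double())
)

# read_delim() reads the same file once the delimiter is named
toy_delim <- read_delim(
  path,
  delim = ",",
  col_types = cols(species = col_character(), body_mass_g = col_double())
)

toy_back
```

Missing values are written as `NA` and read back as `NA` by default, so the round trip preserves them.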

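19.2 The tibble package

A tibble is an upgraded data frame. It counts as an upgrade because the tidyverse makes many refinements to it. Below you can see the difference between the two: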

as_tibble(penguins)
## # A tibble: 344 x 8
##    species island bill_length_mm bill_depth_mm
##    <chr>   <chr>           <dbl>         <dbl>
##  1 Adelie  Torge~           39.1          18.7
##  2 Adelie  Torge~           39.5          17.4
##  3 Adelie  Torge~           40.3          18  
##  4 Adelie  Torge~           NA            NA  
##  5 Adelie  Torge~           36.7          19.3
##  6 Adelie  Torge~           39.3          20.6
##  7 Adelie  Torge~           38.9          17.8
##  8 Adelie  Torge~           39.2          19.6
##  9 Adelie  Torge~           34.1          18.1
## 10 Adelie  Torge~           42            20.2
## # ... with 334 more rows, and 4 more variables:
## #   flipper_length_mm <dbl>, body_mass_g <dbl>,
## #   sex <chr>, year <dbl>
as.data.frame(penguins) %>% head()
##   species    island bill_length_mm bill_depth_mm
## 1  Adelie Torgersen           39.1          18.7
## 2  Adelie Torgersen           39.5          17.4
## 3  Adelie Torgersen           40.3          18.0
## 4  Adelie Torgersen             NA            NA
## 5  Adelie Torgersen           36.7          19.3
## 6  Adelie Torgersen           39.3          20.6
##   flipper_length_mm body_mass_g    sex year
## 1               181        3750   male 2007
## 2               186        3800 female 2007
## 3               195        3250 female 2007
## 4                NA          NA   <NA> 2007
## 5               193        3450 female 2007
## 6               190        3650   male 2007

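In R Markdown the two differ little, but in the console the difference is obvious. For example, a tibble differs in that it:

  • lists the type of each variable (which is quite nice)
  • prints only the first 10 rows
  • prints only as many columns as fit on the screen
  • highlights NAs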
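The printing behaviour listed above can be tried on any data frame. The sketch below uses a made-up 25-row table (the names `df` and `tb` are illustrative) and shows how the `n` argument of `print()` overrides the 10-row default.

```r
library(tibble)

# A hypothetical 25-row data frame, long enough to trigger truncated printing
df <- data.frame(id = 1:25, value = (1:25)^2)
tb <- as_tibble(df)

print(tb)           # first 10 rows, with column types under the names
print(tb, n = 3)    # only the first 3 rows
print(tb, n = Inf)  # every row

# A tibble is still a data frame; it just carries extra classes
class(df)
class(tb)
```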

19.3 The ggplot2 package

19.3.1 Inspecting the data

Let's first take a look at the data.

glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species           <chr> "Adelie", "Adelie", "Ade...
## $ island            <chr> "Torgersen", "Torgersen"...
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36...
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19...
## $ flipper_length_mm <dbl> 181, 186, 195, NA, 193, ...
## $ body_mass_g       <dbl> 3750, 3800, 3250, NA, 34...
## $ sex               <chr> "male", "female", "femal...
## $ year              <dbl> 2007, 2007, 2007, 2007, ...

19.3.2 Scatter plot

Does body mass differ much between the sexes?

ggplot(data = penguins, aes(x = sex, y = body_mass_g)) +
  geom_point()

19.3.3 Box plot

ggplot(data = penguins, aes(x = sex, y = body_mass_g)) +
  geom_boxplot()

ggplot(data = penguins, aes(x = sex, y = body_mass_g)) +
  geom_boxplot(aes(fill = species))

We may observe:

  • Gentoo penguins are heavier than Adelie and Chinstrap penguins
  • Among Gentoo penguins, males are heavier than females
  • For Adelie and Chinstrap penguins, the difference is not very pronounced
  • The sex variable has missing values, concentrated in the Adelie and Gentoo species

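So how many values are NA for each species in the data? Time to bring in dplyr!

19.4 The dplyr package

The dplyr package can:

  • create new variables: mutate()
  • compute grouped summaries: summarize() + group_by()
  • filter rows: filter()
  • rename variables: rename()
  • sort rows: arrange()
  • and more

19.4.1 Selecting columns

What is the difference between the following two?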

select(penguins, species, sex, body_mass_g)
## # A tibble: 344 x 3
##    species sex    body_mass_g
##    <chr>   <chr>        <dbl>
##  1 Adelie  male          3750
##  2 Adelie  female        3800
##  3 Adelie  female        3250
##  4 Adelie  <NA>            NA
##  5 Adelie  female        3450
##  6 Adelie  male          3650
##  7 Adelie  female        3625
##  8 Adelie  male          4675
##  9 Adelie  <NA>          3475
## 10 Adelie  <NA>          4250
## # ... with 334 more rows
penguins %>%
  select(species, sex, body_mass_g)
## # A tibble: 344 x 3
##    species sex    body_mass_g
##    <chr>   <chr>        <dbl>
##  1 Adelie  male          3750
##  2 Adelie  female        3800
##  3 Adelie  female        3250
##  4 Adelie  <NA>            NA
##  5 Adelie  female        3450
##  6 Adelie  male          3650
##  7 Adelie  female        3625
##  8 Adelie  male          4675
##  9 Adelie  <NA>          3475
## 10 Adelie  <NA>          4250
## # ... with 334 more rows
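The two calls return the same result: the pipe `%>%` passes its left-hand side as the first argument of the function on its right. A small check on a made-up table (`toy` is hypothetical, not the penguins data):

```r
library(dplyr)
library(tibble)

# A hypothetical four-column table
toy <- tibble(
  species     = c("Adelie", "Gentoo"),
  sex         = c("male", "female"),
  body_mass_g = c(3750, 4500),
  year        = c(2007, 2008)
)

# Ordinary function call: the data is the first argument
nested <- select(toy, species, sex, body_mass_g)

# Piped call: toy is inserted as the first argument of select()
piped <- toy %>% select(species, sex, body_mass_g)

identical(nested, piped)
```

The piped form pays off once several steps are chained, because the code then reads top to bottom in the order the steps run.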

19.4.2 Sorting rows

glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species           <chr> "Adelie", "Adelie", "Ade...
## $ island            <chr> "Torgersen", "Torgersen"...
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36...
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19...
## $ flipper_length_mm <dbl> 181, 186, 195, NA, 193, ...
## $ body_mass_g       <dbl> 3750, 3800, 3250, NA, 34...
## $ sex               <chr> "male", "female", "femal...
## $ year              <dbl> 2007, 2007, 2007, 2007, ...
penguins %>%
  select(species, sex, body_mass_g) %>%
  arrange(desc(body_mass_g))
## # A tibble: 344 x 3
##    species sex   body_mass_g
##    <chr>   <chr>       <dbl>
##  1 Gentoo  male         6300
##  2 Gentoo  male         6050
##  3 Gentoo  male         6000
##  4 Gentoo  male         6000
##  5 Gentoo  male         5950
##  6 Gentoo  male         5950
##  7 Gentoo  male         5850
##  8 Gentoo  male         5850
##  9 Gentoo  male         5850
## 10 Gentoo  male         5800
## # ... with 334 more rows

19.4.3 Grouped summaries

penguins %>% 
  group_by(species, sex) %>%
  summarize(count = n())
## # A tibble: 8 x 3
## # Groups:   species [3]
##   species   sex    count
##   <chr>     <chr>  <int>
## 1 Adelie    female    73
## 2 Adelie    male      73
## 3 Adelie    <NA>       6
## 4 Chinstrap female    34
## 5 Chinstrap male      34
## 6 Gentoo    female    58
## 7 Gentoo    male      61
## 8 Gentoo    <NA>       5
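As a side note, dplyr's `count()` wraps the same group-then-tally pattern. The sketch below compares the two on a made-up table (`toy` is hypothetical). It also passes `.groups = "drop"`, which assumes dplyr 1.0 or later; without it, `summarize()` leaves the result grouped by `species`, as the `# Groups:` line in the output above shows.

```r
library(dplyr)
library(tibble)

# A hypothetical table with one missing sex value
toy <- tibble(
  species = c("Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo"),
  sex     = c("female", "male", NA, "male", "male")
)

# Explicit version: group, then count the rows in each group
by_hand <- toy %>%
  group_by(species, sex) %>%
  summarize(count = n(), .groups = "drop")

# Shortcut: count() groups, tallies, and ungroups in one call
shortcut <- toy %>%
  count(species, sex, name = "count")

by_hand
shortcut
```

Rows where `sex` is NA form their own group in both versions, which is what makes this pattern useful for counting missing values per species.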

19.4.4 Adding columns

penguins %>% 
  group_by(species) %>%
  mutate(count_species = n()) %>%
  ungroup() %>%
  group_by(species, sex, count_species) %>%
  summarize(count = n()) %>%
  mutate(prop = count/count_species*100)
## # A tibble: 8 x 5
## # Groups:   species, sex [8]
##   species   sex    count_species count  prop
##   <chr>     <chr>          <int> <int> <dbl>
## 1 Adelie    female           152    73 48.0 
## 2 Adelie    male             152    73 48.0 
## 3 Adelie    <NA>             152     6  3.95
## 4 Chinstrap female            68    34 50   
## 5 Chinstrap male              68    34 50   
## 6 Gentoo    female           124    58 46.8 
## 7 Gentoo    male             124    61 49.2 
## 8 Gentoo    <NA>             124     5  4.03
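The same percentages can also be reached by counting first and then dividing by the per-species total. This is an alternative sketch on a made-up table (`toy` is hypothetical), not the pipeline used above:

```r
library(dplyr)
library(tibble)

# A hypothetical table: three Adelie rows and two Gentoo rows
toy <- tibble(
  species = c("Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo"),
  sex     = c("female", "male", NA, "male", "male")
)

props <- toy %>%
  count(species, sex, name = "count") %>%    # one row per species/sex pair
  group_by(species) %>%                      # totals are computed within species
  mutate(prop = count / sum(count) * 100) %>%
  ungroup()

props
```

Here `sum(count)` inside the grouped `mutate()` plays the role of `count_species`, so the species total never has to be carried along as a grouping variable.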

19.4.5 Filtering

penguins %>% 
  group_by(species) %>%
  mutate(count_species = n()) %>%
  ungroup() %>%
  group_by(species, sex, count_species) %>%
  summarize(count = n()) %>%
  mutate(percentage = count/count_species*100) %>%
  filter(species == "Chinstrap")
## # A tibble: 2 x 5
## # Groups:   species, sex [2]
##   species   sex    count_species count percentage
##   <chr>     <chr>          <int> <int>      <dbl>
## 1 Chinstrap female            68    34         50
## 2 Chinstrap male              68    34         50

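19.5 The forcats package

The forcats package is mainly for categorical and factor variables, such as species, island, and sex here.

A variable that is not a factor, such as year here, which is numeric, can also be converted to a factor with the factor() function.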

penguins %>%
  mutate(year_factor = factor(year, levels = unique(year)))
## # A tibble: 344 x 9
##    species island bill_length_mm bill_depth_mm
##    <chr>   <chr>           <dbl>         <dbl>
##  1 Adelie  Torge~           39.1          18.7
##  2 Adelie  Torge~           39.5          17.4
##  3 Adelie  Torge~           40.3          18  
##  4 Adelie  Torge~           NA            NA  
##  5 Adelie  Torge~           36.7          19.3
##  6 Adelie  Torge~           39.3          20.6
##  7 Adelie  Torge~           38.9          17.8
##  8 Adelie  Torge~           39.2          19.6
##  9 Adelie  Torge~           34.1          18.1
## 10 Adelie  Torge~           42            20.2
## # ... with 334 more rows, and 5 more variables:
## #   flipper_length_mm <dbl>, body_mass_g <dbl>,
## #   sex <chr>, year <dbl>, year_factor <fct>

Let's save the result to a new dataset and see what has changed.

penguins_new <-
  penguins %>%
  mutate(year_factor = factor(year, levels = unique(year)))
penguins_new
## # A tibble: 344 x 9
##    species island bill_length_mm bill_depth_mm
##    <chr>   <chr>           <dbl>         <dbl>
##  1 Adelie  Torge~           39.1          18.7
##  2 Adelie  Torge~           39.5          17.4
##  3 Adelie  Torge~           40.3          18  
##  4 Adelie  Torge~           NA            NA  
##  5 Adelie  Torge~           36.7          19.3
##  6 Adelie  Torge~           39.3          20.6
##  7 Adelie  Torge~           38.9          17.8
##  8 Adelie  Torge~           39.2          19.6
##  9 Adelie  Torge~           34.1          18.1
## 10 Adelie  Torge~           42            20.2
## # ... with 334 more rows, and 5 more variables:
## #   flipper_length_mm <dbl>, body_mass_g <dbl>,
## #   sex <chr>, year <dbl>, year_factor <fct>
class(penguins_new$year_factor)
## [1] "factor"
levels(penguins_new$year_factor)
## [1] "2007" "2008" "2009"

Think back: what are the benefits of making a variable a factor?
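One benefit is control over the order of categories, which tables and plots then follow. The sketch below uses a made-up `year` vector whose values are deliberately out of order; `fct_infreq()` and `fct_rev()` are forcats functions for reordering levels.

```r
library(forcats)

# A hypothetical year vector, deliberately not in sorted order
year <- c(2009, 2007, 2008, 2007, 2009, 2009)

# levels = unique(year) keeps the order of first appearance
f_appear <- factor(year, levels = unique(year))
levels(f_appear)

# The default sorts the levels
f_sorted <- factor(year)
levels(f_sorted)

# forcats helpers reorder levels without touching the values
f_freq <- fct_infreq(f_sorted)  # most frequent level first
f_rev  <- fct_rev(f_sorted)     # reversed level order

levels(f_freq)
levels(f_rev)
table(f_freq)                   # the table follows the level order
```

In the penguins data the years first appear as 2007, 2008, 2009, so `levels = unique(year)` happens to give the sorted order there.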

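19.6 The stringr package

The stringr package includes a very rich set of functions for working with strings, for example:

  • matching
  • subsetting strings
  • string length
  • joining strings
  • splitting strings
  • and more

19.6.1 String conversion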
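Each item in the list above maps to a stringr function. A short sketch on a made-up vector of island names (the `islands` vector is illustrative):

```r
library(stringr)

islands <- c("Torgersen", "Biscoe", "Dream")

# Matching: which names start with "B"?
starts_b <- str_detect(islands, "^B")

# Subsetting: the first three characters of each name
prefix <- str_sub(islands, 1, 3)

# Length: number of characters in each name
n_char <- str_length(islands)

# Joining: paste a species name and an island name with "_"
joined <- str_c("Adelie", islands, sep = "_")

# Splitting: break a joined string apart again (returns a list)
parts <- str_split("Adelie_Torgersen", "_")

starts_b
prefix
n_char
joined
parts
```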

penguins %>%
  select(species, island) %>%
  mutate(ISLAND = str_to_upper(island))
## # A tibble: 344 x 3
##    species island    ISLAND   
##    <chr>   <chr>     <chr>    
##  1 Adelie  Torgersen TORGERSEN
##  2 Adelie  Torgersen TORGERSEN
##  3 Adelie  Torgersen TORGERSEN
##  4 Adelie  Torgersen TORGERSEN
##  5 Adelie  Torgersen TORGERSEN
##  6 Adelie  Torgersen TORGERSEN
##  7 Adelie  Torgersen TORGERSEN
##  8 Adelie  Torgersen TORGERSEN
##  9 Adelie  Torgersen TORGERSEN
## 10 Adelie  Torgersen TORGERSEN
## # ... with 334 more rows

19.6.2 Joining strings

penguins %>%
  select(species, island) %>%
  mutate(ISLAND = str_to_upper(island)) %>%
  mutate(species_island = str_c(species, ISLAND, sep = "_"))
## # A tibble: 344 x 4
##    species island    ISLAND    species_island  
##    <chr>   <chr>     <chr>     <chr>           
##  1 Adelie  Torgersen TORGERSEN Adelie_TORGERSEN
##  2 Adelie  Torgersen TORGERSEN Adelie_TORGERSEN
##  3 Adelie  Torgersen TORGERSEN Adelie_TORGERSEN
##  4 Adelie  Torgersen TORGERSEN Adelie_TORGERSEN
##  5 Adelie  Torgersen TORGERSEN Adelie_TORGERSEN
##  6 Adelie  Torgersen TORGERSEN Adelie_TORGERSEN
##  7 Adelie  Torgersen TORGERSEN Adelie_TORGERSEN
##  8 Adelie  Torgersen TORGERSEN Adelie_TORGERSEN
##  9 Adelie  Torgersen TORGERSEN Adelie_TORGERSEN
## 10 Adelie  Torgersen TORGERSEN Adelie_TORGERSEN
## # ... with 334 more rows

19.7 The tidyr package

Think about it: what is tidy data?

19.7.1 From long to wide

untidy_penguins <-
  penguins %>%
    pivot_wider(names_from = sex,
                values_from = body_mass_g)
untidy_penguins
## # A tibble: 344 x 9
##    species island bill_length_mm bill_depth_mm
##    <chr>   <chr>           <dbl>         <dbl>
##  1 Adelie  Torge~           39.1          18.7
##  2 Adelie  Torge~           39.5          17.4
##  3 Adelie  Torge~           40.3          18  
##  4 Adelie  Torge~           NA            NA  
##  5 Adelie  Torge~           36.7          19.3
##  6 Adelie  Torge~           39.3          20.6
##  7 Adelie  Torge~           38.9          17.8
##  8 Adelie  Torge~           39.2          19.6
##  9 Adelie  Torge~           34.1          18.1
## 10 Adelie  Torge~           42            20.2
## # ... with 334 more rows, and 5 more variables:
## #   flipper_length_mm <dbl>, year <dbl>, male <dbl>,
## #   female <dbl>, `NA` <dbl>

19.7.2 From wide to long

untidy_penguins %>%
  pivot_longer(cols = male:`NA`, 
               names_to = "sex",
               values_to = "body_mass_g")
## # A tibble: 1,032 x 8
##    species island bill_length_mm bill_depth_mm
##    <chr>   <chr>           <dbl>         <dbl>
##  1 Adelie  Torge~           39.1          18.7
##  2 Adelie  Torge~           39.1          18.7
##  3 Adelie  Torge~           39.1          18.7
##  4 Adelie  Torge~           39.5          17.4
##  5 Adelie  Torge~           39.5          17.4
##  6 Adelie  Torge~           39.5          17.4
##  7 Adelie  Torge~           40.3          18  
##  8 Adelie  Torge~           40.3          18  
##  9 Adelie  Torge~           40.3          18  
## 10 Adelie  Torge~           NA            NA  
## # ... with 1,022 more rows, and 4 more variables:
## #   flipper_length_mm <dbl>, year <dbl>, sex <chr>,
## #   body_mass_g <dbl>
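The result has 1,032 rows rather than 344 because every wide row is expanded into one long row per sex column (344 × 3), most of them holding NA. A minimal sketch on a made-up table (`toy` is hypothetical) shows the effect, and how the `values_drop_na` argument of `pivot_longer()` removes the padding rows:

```r
library(tidyr)
library(dplyr)
library(tibble)

# A hypothetical tidy table: one row per penguin
toy <- tibble(
  id          = 1:3,
  sex         = c("male", "female", "male"),
  body_mass_g = c(3750, 3800, 4675)
)

# Long to wide: one column per sex, NA where a penguin is not of that sex
wide <- toy %>%
  pivot_wider(names_from = sex, values_from = body_mass_g)

# Wide to long: every id now gets one row per sex column (3 x 2 = 6 rows)
long <- wide %>%
  pivot_longer(cols = c(male, female),
               names_to = "sex",
               values_to = "body_mass_g")

# Dropping the NA padding recovers the original three rows
long_clean <- wide %>%
  pivot_longer(cols = c(male, female),
               names_to = "sex",
               values_to = "body_mass_g",
               values_drop_na = TRUE)

long
long_clean
```

Note that `values_drop_na = TRUE` also drops rows whose value was already missing before widening, so on real data the round trip is exact only when the value column has no NAs.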

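19.8 The purrr package

The purrr package provides map() and a family of related functions that replace for and while loops, giving efficient iteration with a consistent syntax and more readable code.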

penguins %>% map(~sum(is.na(.)))
## $species
## [1] 0
## 
## $island
## [1] 0
## 
## $bill_length_mm
## [1] 2
## 
## $bill_depth_mm
## [1] 2
## 
## $flipper_length_mm
## [1] 2
## 
## $body_mass_g
## [1] 2
## 
## $sex
## [1] 11
## 
## $year
## [1] 0
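`map()` always returns a list, which is why the output above prints one element per column. When each result is a single integer, the typed variant `map_int()` returns a named vector instead. The sketch below uses a made-up table (`toy` is hypothetical) and includes the for loop that the call replaces:

```r
library(purrr)
library(tibble)

# A hypothetical table with a few missing values
toy <- tibble(
  bill_length_mm = c(39.1, NA, 40.3),
  sex            = c(NA, "female", NA),
  year           = c(2007, 2007, 2008)
)

# map() returns a list with one element per column
na_list <- map(toy, ~ sum(is.na(.)))

# map_int() returns a named integer vector
na_vec <- map_int(toy, ~ sum(is.na(.)))

# The equivalent for loop, for comparison
na_loop <- integer(length(toy))
names(na_loop) <- names(toy)
for (col in names(toy)) {
  na_loop[[col]] <- sum(is.na(toy[[col]]))
}

na_vec
```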
penguins %>%
  group_nest(species) %>%
  mutate(model = purrr::map(data, ~ lm(bill_depth_mm ~ bill_length_mm, data = .))) %>%
  mutate(result = purrr::map(model, ~ broom::tidy(.))) %>%
  tidyr::unnest(result)
## # A tibble: 6 x 8
##   species      data model term  estimate std.error
##   <chr>   <list<tb> <lis> <chr>    <dbl>     <dbl>
## 1 Adelie  [152 x 7] <lm>  (Int~   11.4      1.34  
## 2 Adelie  [152 x 7] <lm>  bill~    0.179    0.0344
## 3 Chinst~  [68 x 7] <lm>  (Int~    7.57     1.55  
## 4 Chinst~  [68 x 7] <lm>  bill~    0.222    0.0317
## 5 Gentoo  [124 x 7] <lm>  (Int~    5.25     1.05  
## 6 Gentoo  [124 x 7] <lm>  bill~    0.205    0.0222
## # ... with 2 more variables: statistic <dbl>,
## #   p.value <dbl>