14.2 Exploring
14.2.1 tabyl
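tabyl() was designed as a replacement for table() in base R, which has several shortcomings:
- It does not accept data frame input
- It does not return a data frame
- Its results are hard to format further

tabyl() builds (cross-)frequency tables of one to three variables. It is built on dplyr and tidyr, so data frames are its basic input and output objects (although it also accepts a one-dimensional vector). janitor also provides the adorn_* family of functions for formatting the tables it returns. The examples below use a subset of starwars to demonstrate tabyl():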
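The humans object used in the examples is not defined in this section. A minimal sketch of how it could be built, assuming it is the subset of dplyr::starwars whose species is "Human" (this filter is an assumption, chosen because it matches the counts shown in the outputs):

```r
library(dplyr)

# Assumption: `humans` is the "Human" subset of dplyr::starwars
humans <- starwars %>%
  filter(species == "Human")

nrow(humans)
```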
One-way tabyl
A one-way frequency table:
t1 <- humans %>%
  tabyl(eye_color)
t1
#>  eye_color  n percent
#>       blue 12  0.3429
#>  blue-gray  1  0.0286
#>      brown 17  0.4857
#>       dark  1  0.0286
#>      hazel  2  0.0571
#>     yellow  2  0.0571
tabyl() handles missing values in the data sensibly:
x <- c("big", "big", "small", "small", "small", NA)
tabyl(x)
#>      x n percent valid_percent
#>    big 2   0.333           0.4
#>  small 3   0.500           0.6
#>   <NA> 1   0.167            NA
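tabyl(x, show_na = FALSE)
#>      x n percent
#>    big 2     0.4
#>  small 3     0.6
Most adorn_* functions are intended mainly for two-way contingency tables, but they can also be applied to one-way frequency tables: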
t1 %>% 
  adorn_pct_formatting()
#>  eye_color  n percent
#>       blue 12   34.3%
#>  blue-gray  1    2.9%
#>      brown 17   48.6%
#>       dark  1    2.9%
#>      hazel  2    5.7%
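#>     yellow  2    5.7%
Two-way tabyl
df %>% tabyl(var_1, var_2) is equivalent to df %>% count(var_1, var_2) followed by pivot_wider() to spread one of the columns into a contingency table. One difference: tabyl() fills empty combinations with 0, whereas pivot_wider() leaves them as NA by default: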
t2 <- humans %>%
  tabyl(gender, eye_color)
t2
#>  gender blue blue-gray brown dark hazel yellow
#>  female    3         0     5    0     1      0
#>    male    9         1    12    1     1      2
# count() + pivot_wider()
humans %>% 
  count(gender, eye_color) %>% 
  pivot_wider(names_from = eye_color, values_from = n)
#> # A tibble: 2 x 7
#>   gender  blue brown hazel `blue-gray`  dark yellow
#>   <chr>  <int> <int> <int>       <int> <int>  <int>
#> 1 female     3     5     1          NA    NA     NA
#> 2 male       9    12     1           1     1      2
The adorn_* functions used for formatting are:
- adorn_totals(c("row", "col")): adds row and/or column totals.
- adorn_percentages(denominator): replaces the cell values of the cross table with percentages; denominator is "row", "col", or "all".
- adorn_pct_formatting(digits, rounding): controls how the percentages are formatted.
- adorn_rounding(): rounds a data frame of numbers (usually the result of adorn_percentages()), using either the base R round() function or janitor's round_half_up() to round all ties up (thanks, StackOverflow). For example, 10.5 rounds up to 11, consistent with Excel's tie-breaking behavior. This contrasts with base R's round(10.5), which rounds down to 10. adorn_rounding() returns columns of class numeric, allowing for graphing, sorting, etc. It is a less aggressive substitute for adorn_pct_formatting(); the two functions should not be called together.
- adorn_ns(): adds Ns to a tabyl. These can be drawn from the tabyl's underlying counts, which are attached to the tabyl as metadata, or they can be supplied by the user.
- adorn_title(placement, row_name, col_name): placement is "combined" or "top" and controls where the variable names are placed.
Note that these helper functions should be applied in a logical order. For example, adorn_ns() and adorn_pct_formatting() should be called after adorn_percentages().
Applying the adorn_* functions to t2:
t2 %>% 
  adorn_totals("col") %>% 
  adorn_percentages("row") %>%
  adorn_pct_formatting(digits = 2) %>% 
  adorn_ns() %>%
  adorn_title("combined")
#>  gender/eye_color       blue blue-gray       brown      dark      hazel
#>            female 33.33% (3) 0.00% (0) 55.56%  (5) 0.00% (0) 11.11% (1)
#>              male 34.62% (9) 3.85% (1) 46.15% (12) 3.85% (1)  3.85% (1)
#>     yellow        Total
#>  0.00% (0) 100.00%  (9)
#>  7.69% (2) 100.00% (26)
Finally, a tabyl object can be passed to knitr::kable() for display:
t2 %>% 
  adorn_totals("row") %>% 
  adorn_percentages("col") %>%
  adorn_pct_formatting(digits = 1) %>% 
  adorn_ns() %>%
  adorn_title("top", row_name = "gender", col_name = "color") %>%
  knitr::kable()

| | color | | | | | |
|---|---|---|---|---|---|---|
| gender | blue | blue-gray | brown | dark | hazel | yellow | 
| female | 25.0% (3) | 0.0% (0) | 29.4% (5) | 0.0% (0) | 50.0% (1) | 0.0% (0) | 
| male | 75.0% (9) | 100.0% (1) | 70.6% (12) | 100.0% (1) | 50.0% (1) | 100.0% (2) | 
| Total | 100.0% (12) | 100.0% (1) | 100.0% (17) | 100.0% (1) | 100.0% (2) | 100.0% (2) | 
Three-way tabyl
When three variables are passed to tabyl(), it returns a list of two-way tabyls, one for each level of the third variable:
t3 <- humans %>%
  tabyl(eye_color, skin_color, gender)
t3
#> $female
#>  eye_color dark fair light pale tan white
#>       blue    0    2     1    0   0     0
#>  blue-gray    0    0     0    0   0     0
#>      brown    0    1     4    0   0     0
#>       dark    0    0     0    0   0     0
#>      hazel    0    0     1    0   0     0
#>     yellow    0    0     0    0   0     0
#> 
#> $male
#>  eye_color dark fair light pale tan white
#>       blue    0    7     2    0   0     0
#>  blue-gray    0    1     0    0   0     0
#>      brown    3    4     3    0   2     0
#>       dark    1    0     0    0   0     0
#>      hazel    0    1     0    0   0     0
#>     yellow    0    0     0    1   0     1
In this case, the adorn_* functions are applied to each tabyl element of the list:
t3 %>% 
  adorn_percentages("row") %>%
  adorn_pct_formatting(digits = 0) %>%
  adorn_ns()
#> $female
#>  eye_color   dark    fair    light   pale    tan  white
#>       blue 0% (0) 67% (2)  33% (1) 0% (0) 0% (0) 0% (0)
#>  blue-gray  - (0)   - (0)    - (0)  - (0)  - (0)  - (0)
#>      brown 0% (0) 20% (1)  80% (4) 0% (0) 0% (0) 0% (0)
#>       dark  - (0)   - (0)    - (0)  - (0)  - (0)  - (0)
#>      hazel 0% (0)  0% (0) 100% (1) 0% (0) 0% (0) 0% (0)
#>     yellow  - (0)   - (0)    - (0)  - (0)  - (0)  - (0)
#> 
#> $male
#>  eye_color     dark     fair   light    pale     tan   white
#>       blue   0% (0)  78% (7) 22% (2)  0% (0)  0% (0)  0% (0)
#>  blue-gray   0% (0) 100% (1)  0% (0)  0% (0)  0% (0)  0% (0)
#>      brown  25% (3)  33% (4) 25% (3)  0% (0) 17% (2)  0% (0)
#>       dark 100% (1)   0% (0)  0% (0)  0% (0)  0% (0)  0% (0)
#>      hazel   0% (0) 100% (1)  0% (0)  0% (0)  0% (0)  0% (0)
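#>     yellow   0% (0)   0% (0)  0% (0) 50% (1)  0% (0) 50% (1)
14.2.2 get_dupes
get_dupes(dat, ...) returns the observations in the data frame dat that are duplicated on the variables given in ..., along with the number of times each one occurs: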
mtcars %>% 
  get_dupes(wt, cyl)
#> # A tibble: 4 x 12
#>      wt   cyl dupe_count   mpg  disp    hp  drat  qsec    vs    am  gear  carb
#>   <dbl> <dbl>      <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  3.44     6          2  19.2  168.   123  3.92  18.3     1     0     4     4
#> 2  3.44     6          2  17.8  168.   123  3.92  18.9     1     0     4     4
#> 3  3.57     8          2  14.3  360    245  3.21  15.8     0     0     3     4
#> 4  3.57     8          2  15    301    335  3.54  14.6     0     1     5     8
14.2.3 remove_
14.2.3.1 remove_empty
remove_empty(c("rows", "cols")) removes rows or columns (or both) whose values are all NA:
q <- data.frame(v1 = c(1, NA, 3),
                v2 = c(NA, NA, NA),
                v3 = c("a", NA, "b"))
q %>%
  remove_empty(c("rows", "cols"))
#>   v1 v3
#> 1  1  a
#> 3  3  b
q %>% 
  remove_empty("rows")
#>   v1 v2 v3
#> 1  1 NA  a
#> 3  3 NA  b
q %>% 
  remove_empty("cols")
#>   v1   v3
#> 1  1    a
#> 2 NA <NA>
#> 3  3    b
The implementation of remove_empty() is simple. Taking the removal of empty rows as an example: if a row is entirely NA, then its rowSums(is.na(dat)) equals ncol(dat):
function (dat, which = c("rows", "cols")) 
{
    if (missing(which) && !missing(dat)) {
        message("value for \"which\" not specified, defaulting to c(\"rows\", \"cols\")")
        which <- c("rows", "cols")
    }
    if ((sum(which %in% c("rows", "cols")) != length(which)) && 
        !missing(dat)) {
        stop("\"which\" must be one of \"rows\", \"cols\", or c(\"rows\", \"cols\")")
    }
    if ("rows" %in% which) {
        dat <- dat[rowSums(is.na(dat)) != ncol(dat), , drop = FALSE]
    }
    if ("cols" %in% which) {
        dat <- dat[, colSums(!is.na(dat)) > 0, drop = FALSE]
    }
    dat
}
14.2.3.2 remove_constant
remove_constant() removes the constant columns of a data frame, that is, columns that contain a single repeated value:
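A minimal sketch of remove_constant() on a small made-up data frame; the object dat and its column names are hypothetical and chosen only for illustration:

```r
library(janitor)

# Hypothetical data: `grp` holds the same value in every row
dat <- data.frame(
  id  = 1:3,
  grp = c("a", "a", "a"),
  val = c(2.5, 3.1, 4.7)
)

# The constant column `grp` is dropped; `id` and `val` are kept
res <- remove_constant(dat)
names(res)
```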
14.2.4 round_half_up
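The base R rounding function round() follows the "round half to even" rule (banker's rounding): when the part being dropped is exactly .5, the result is rounded up if the preceding digit is odd and left unchanged if it is even, so the last retained digit is always even: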
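A quick base R check of this behavior. The inputs are chosen so that the .5 part is exactly representable in floating point; for values such as 0.15 the result can also depend on the binary representation:

```r
# Halves are rounded to the nearest even integer
x <- c(0.5, 1.5, 2.5, 10.5)
base_rounded <- round(x)
base_rounded
#> [1]  0  2  2 10
```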
round_half_up() follows the simple, familiar rule of always rounding halves up:
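The same inputs passed to janitor's round_half_up(), as a small sketch for comparison with round():

```r
library(janitor)

# Every tie is rounded away from zero, regardless of the preceding digit
x <- c(0.5, 1.5, 2.5, 10.5)
half_up <- round_half_up(x)
half_up
#> [1]  1  2  3 11
```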
To round to the nearest fraction instead, for example to one of 0, 0.25, 0.5, 0.75, 1, use round_to_fraction() and specify the denominator:
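A short sketch of rounding to quarters. janitor's function for this is round_to_fraction(x, denominator); the input values are made up for illustration:

```r
library(janitor)

# denominator = 4 rounds each value to the nearest multiple of 1/4
x <- c(0.1, 0.3, 0.6, 0.9)
quarters <- round_to_fraction(x, denominator = 4)
quarters
#> [1] 0.00 0.25 0.50 1.00
```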
14.2.5 excel_numeric_to_date
excel_numeric_to_date() converts Excel serial numbers into dates, following the rule Excel uses to encode dates (1 = 1899-12-31):
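A minimal sketch using the default date system; the serial numbers are arbitrary examples of values that might come out of a spreadsheet import:

```r
library(janitor)

# Serial numbers as they appear when a date column is read in as numeric
d1 <- excel_numeric_to_date(41103)
d2 <- excel_numeric_to_date(40000)
d1
#> [1] "2012-07-13"
d2
#> [1] "2009-07-06"
```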
14.2.6 top_levels
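When analyzing Likert-scale data, one often wants to know what share of responses falls in the top or bottom few levels of an attitude variable. Such variables are stored in R as factors with ordered levels. top_levels() splits the levels of the factor into three groups (top, middle, bottom) and reports the frequency of each group: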
f <- factor(c("strongly agree", "agree", "neutral", "neutral", "disagree", "strongly agree"),
            levels = c("strongly agree", "agree", "neutral", "disagree", "strongly disagree"))
top_levels(f)
#>                            f n percent
#>        strongly agree, agree 3   0.500
#>                      neutral 2   0.333
#>  disagree, strongly disagree 1   0.167
top_levels(as.factor(mtcars$hp))
#>                  as.factor(mtcars$hp)  n percent
#>                                52, 62  2  0.0625
#>  <<< Middle Group (18 categories) >>> 28  0.8750
#>                              264, 335  2  0.0625
To change the number of levels included in the groups on either side:
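A sketch using the n argument of top_levels(), which sets how many levels go into each outer group. The factor f is rebuilt here so the block runs on its own:

```r
library(janitor)

f <- factor(
  c("strongly agree", "agree", "neutral", "neutral", "disagree", "strongly agree"),
  levels = c("strongly agree", "agree", "neutral", "disagree", "strongly disagree")
)

# n = 1 keeps only the first and the last level in the outer groups;
# the remaining three levels are pooled into the middle group
res <- top_levels(f, n = 1)
res
```

With n = 1 the top group contains only "strongly agree", so its count is 2 rather than the 3 reported for the default n = 2.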
14.2.7 row_to_names
row_to_names() promotes a row of the data frame to the column names:
dirt <- data.frame(X_1 = c(NA, "ID", 1:3),
           X_2 = c(NA, "Value", 4:6))
dirt
#>    X_1   X_2
#> 1 <NA>  <NA>
#> 2   ID Value
#> 3    1     4
#> 4    2     5
#> 5    3     6
dirt %>% 
  row_to_names(row_number = 2, remove_rows_above = FALSE)
#>     ID Value
#> 1 <NA>  <NA>
#> 3    1     4
#> 4    2     5
#> 5    3     6
dirt %>% 
  row_to_names(row_number = 2, remove_rows_above = TRUE)
#>   ID Value
#> 3  1     4
#> 4  2     5
#> 5  3     6