14.2 Exploring
14.2.1 tabyl
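tabyl() is designed as a replacement for base R's table(), which has several shortcomings:
- It does not accept a data frame as input
- It does not return a data frame
- Its result is hard to format further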
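tabyl() builds (cross-)frequency tables of one to three variables. It is built on top of dplyr and tidyr, so its basic input and output objects are data frames (although it also accepts a one-dimensional vector). janitor additionally provides the adorn_* family of functions for formatting the tables it returns. The examples use humans, a subset of starwars, to demonstrate tabyl():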
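The definition of humans is not shown in this section. A minimal sketch, assuming it is simply the human characters from dplyr's starwars data (the filter condition is an assumption, not taken from the text; in recent dplyr releases the gender column is coded feminine/masculine, so printed results may differ from those shown here):

```r
library(dplyr)
library(janitor)

# Assumption: `humans` holds the rows of starwars whose species is "Human"
humans <- starwars %>%
  filter(species == "Human")

# Quick look at the variables used in the examples
humans %>%
  select(name, gender, eye_color, skin_color) %>%
  head()
```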
One-way tabyl
A one-way frequency table:
t1 <- humans %>%
tabyl(eye_color)
t1
#> eye_color n percent
#> blue 12 0.3429
#> blue-gray 1 0.0286
#> brown 17 0.4857
#> dark 1 0.0286
#> hazel 2 0.0571
#> yellow 2 0.0571
tabyl() handles missing values in the data sensibly:
x <- c("big", "big", "small", "small", "small", NA)
tabyl(x)
#> x n percent valid_percent
#> big 2 0.333 0.4
#> small 3 0.500 0.6
#> <NA> 1 0.167 NA
tabyl(x, show_na = FALSE)
#> x n percent
#> big 2 0.4
#> small 3 0.6
Most adorn_* functions are intended mainly for two-way contingency tables, but they can also be applied to one-way frequency tables:
t1 %>%
adorn_pct_formatting()
#> eye_color n percent
#> blue 12 34.3%
#> blue-gray 1 2.9%
#> brown 17 48.6%
#> dark 1 2.9%
#> hazel 2 5.7%
#> yellow 2 5.7%
Two-way tabyl
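df %>% tabyl(var_1, var_2) is essentially df %>% count(var_1, var_2) followed by pivot_wider() to spread one of the two variables across the columns, producing a contingency table. One visible difference is that tabyl() fills empty combinations with 0, whereas pivot_wider() leaves them as NA by default: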
t2 <- humans %>%
tabyl(gender, eye_color)
t2
#> gender blue blue-gray brown dark hazel yellow
#> female 3 0 5 0 1 0
#> male 9 1 12 1 1 2
# count() + pivot_wider()
humans %>%
count(gender, eye_color) %>%
pivot_wider(names_from = eye_color, values_from = n)
#> # A tibble: 2 x 7
#> gender blue brown hazel `blue-gray` dark yellow
#> <chr> <int> <int> <int> <int> <int> <int>
#> 1 female 3 5 1 NA NA NA
#> 2 male 9 12 1 1 1 2
The adorn_* functions used for formatting are:
- adorn_totals(c("row", "col")): add row and/or column totals.
- adorn_percentages(c("row", "col")): replace the cell values of the crosstab with row or column percentages.
- adorn_pct_formatting(digits, rounding): control how percentages are formatted.
- adorn_rounding(): round a data.frame of numbers (usually the result of adorn_percentages()), using either the base R round() function or janitor's round_half_up() to round all ties up (thanks, StackOverflow). For example, round_half_up() rounds 10.5 up to 11, consistent with Excel's tie-breaking behavior, whereas base R's round(10.5) rounds 10.5 down to 10. adorn_rounding() returns columns of class numeric, allowing for graphing, sorting, etc. It is a less aggressive substitute for adorn_pct_formatting(); these two functions should not be called together.
- adorn_ns(): add Ns to a tabyl. These can be drawn from the tabyl's underlying counts, which are attached to the tabyl as metadata, or they can be supplied by the user.
- adorn_title(placement, row_name, col_name): placement is "combined" or "top" and controls where the variable names are placed.
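To illustrate how adorn_rounding() differs from adorn_pct_formatting(), here is a small sketch on the built-in mtcars data (chosen only to keep the example self-contained; it is not part of the original example). The rounded percentages stay numeric rather than becoming character strings:

```r
library(janitor)
library(magrittr)  # provides %>%

rounded <- mtcars %>%
  tabyl(cyl, gear) %>%
  adorn_percentages("row") %>%   # row-wise proportions
  adorn_rounding(digits = 2)     # round, but keep the columns numeric

rounded

# The gear columns are still numeric, so they can be sorted or plotted
sapply(rounded[-1], is.numeric)
```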
Note that these helper functions must be applied in a logical order. For example, adorn_ns() and adorn_pct_formatting() should be called after adorn_percentages().
Applying the adorn_* functions to t2:
t2 %>%
adorn_totals("col") %>%
adorn_percentages("row") %>%
adorn_pct_formatting(digits = 2) %>%
adorn_ns() %>%
adorn_title("combined")
#> gender/eye_color blue blue-gray brown dark hazel
#> female 33.33% (3) 0.00% (0) 55.56% (5) 0.00% (0) 11.11% (1)
#> male 34.62% (9) 3.85% (1) 46.15% (12) 3.85% (1) 3.85% (1)
#> yellow Total
#> 0.00% (0) 100.00% (9)
#> 7.69% (2) 100.00% (26)
Finally, a tabyl object can be passed to knitr::kable() for display:
t2 %>%
adorn_totals("row") %>%
adorn_percentages("col") %>%
adorn_pct_formatting(digits = 1) %>%
adorn_ns() %>%
adorn_title("top", row_name = "gender", col_name = "color") %>%
knitr::kable()
 | color | | | | | |
---|---|---|---|---|---|---|
gender | blue | blue-gray | brown | dark | hazel | yellow |
female | 25.0% (3) | 0.0% (0) | 29.4% (5) | 0.0% (0) | 50.0% (1) | 0.0% (0) |
male | 75.0% (9) | 100.0% (1) | 70.6% (12) | 100.0% (1) | 50.0% (1) | 100.0% (2) |
Total | 100.0% (12) | 100.0% (1) | 100.0% (17) | 100.0% (1) | 100.0% (2) | 100.0% (2) |
Three-way tabyl
When three variables are passed to tabyl(), it returns a list of two-way tabyls, one for each level of the third variable:
t3 <- humans %>%
tabyl(eye_color, skin_color, gender)
t3
#> $female
#> eye_color dark fair light pale tan white
#> blue 0 2 1 0 0 0
#> blue-gray 0 0 0 0 0 0
#> brown 0 1 4 0 0 0
#> dark 0 0 0 0 0 0
#> hazel 0 0 1 0 0 0
#> yellow 0 0 0 0 0 0
#>
#> $male
#> eye_color dark fair light pale tan white
#> blue 0 7 2 0 0 0
#> blue-gray 0 1 0 0 0 0
#> brown 3 4 3 0 2 0
#> dark 1 0 0 0 0 0
#> hazel 0 1 0 0 0 0
#> yellow 0 0 0 1 0 1
In this case the adorn_* functions are applied to every tabyl element of the list:
t3 %>%
adorn_percentages("row") %>%
adorn_pct_formatting(digits = 0) %>%
adorn_ns()
#> $female
#> eye_color dark fair light pale tan white
#> blue 0% (0) 67% (2) 33% (1) 0% (0) 0% (0) 0% (0)
#> blue-gray - (0) - (0) - (0) - (0) - (0) - (0)
#> brown 0% (0) 20% (1) 80% (4) 0% (0) 0% (0) 0% (0)
#> dark - (0) - (0) - (0) - (0) - (0) - (0)
#> hazel 0% (0) 0% (0) 100% (1) 0% (0) 0% (0) 0% (0)
#> yellow - (0) - (0) - (0) - (0) - (0) - (0)
#>
#> $male
#> eye_color dark fair light pale tan white
#> blue 0% (0) 78% (7) 22% (2) 0% (0) 0% (0) 0% (0)
#> blue-gray 0% (0) 100% (1) 0% (0) 0% (0) 0% (0) 0% (0)
#> brown 25% (3) 33% (4) 25% (3) 0% (0) 17% (2) 0% (0)
#> dark 100% (1) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0)
#> hazel 0% (0) 100% (1) 0% (0) 0% (0) 0% (0) 0% (0)
#> yellow 0% (0) 0% (0) 0% (0) 50% (1) 0% (0) 50% (1)
14.2.2 get_dupes
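get_dupes(dat, ...) returns the observations of the data frame dat that are duplicated on the variables given in ..., together with the number of duplicates: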
mtcars %>%
get_dupes(wt, cyl)
#> # A tibble: 4 x 12
#> wt cyl dupe_count mpg disp hp drat qsec vs am gear carb
#> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 3.44 6 2 19.2 168. 123 3.92 18.3 1 0 4 4
#> 2 3.44 6 2 17.8 168. 123 3.92 18.9 1 0 4 4
#> 3 3.57 8 2 14.3 360 245 3.21 15.8 0 0 3 4
#> 4 3.57 8 2 15 301 335 3.54 14.6 0 1 5 8
14.2.3 remove_
14.2.3.1 remove_empty
remove_empty(c("rows", "cols")) removes rows or columns (or both) that consist entirely of NA values:
q <- data.frame(v1 = c(1, NA, 3),
v2 = c(NA, NA, NA),
v3 = c("a", NA, "b"))
q %>%
remove_empty(c("rows", "cols"))
#> v1 v3
#> 1 1 a
#> 3 3 b
q %>%
remove_empty("rows")
#> v1 v2 v3
#> 1 1 NA a
#> 3 3 NA b
q %>%
remove_empty("cols")
#> v1 v3
#> 1 1 a
#> 2 NA <NA>
#> 3 3 b
remove_empty is implemented very simply. Taking the removal of empty rows as an example: if a row consists entirely of NA, then rowSums(is.na(dat)) == ncol(dat) holds for that row:
function (dat, which = c("rows", "cols"))
{
if (missing(which) && !missing(dat)) {
message("value for \"which\" not specified, defaulting to c(\"rows\", \"cols\")")
which <- c("rows", "cols")
}
if ((sum(which %in% c("rows", "cols")) != length(which)) &&
!missing(dat)) {
stop("\"which\" must be one of \"rows\", \"cols\", or c(\"rows\", \"cols\")")
}
if ("rows" %in% which) {
dat <- dat[rowSums(is.na(dat)) != ncol(dat), , drop = FALSE]
}
if ("cols" %in% which) {
dat <- dat[, colSums(!is.na(dat)) > 0, drop = FALSE]
}
dat
}
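The row test used in the function body can be tried in isolation. The sketch below rebuilds the small data frame q from the earlier example and computes the mask step by step; the helper names na_per_row and empty_row are introduced here purely for illustration:

```r
q <- data.frame(v1 = c(1, NA, 3),
                v2 = c(NA, NA, NA),
                v3 = c("a", NA, "b"))

# Number of missing values in each row
na_per_row <- rowSums(is.na(q))

# A row is empty when all of its ncol(q) cells are NA
empty_row <- na_per_row == ncol(q)
empty_row

# Keep only the non-empty rows, mirroring the "rows" branch above
q[!empty_row, , drop = FALSE]
```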
14.2.3.2 remove_constant
remove_constant() removes constant columns, i.e. columns that contain a single value, from a data frame.
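A minimal sketch of remove_constant(), using a made-up data frame d (the data are illustrative only) in which the group column holds a single value:

```r
library(janitor)

d <- data.frame(id    = 1:3,
                group = c("a", "a", "a"),  # constant column
                value = c(2.5, 3.1, 4.0))

# The constant column `group` is dropped; `id` and `value` are kept
cleaned <- remove_constant(d)
names(cleaned)
```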
14.2.4 round_half_up
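Base R's rounding function round() follows the "round half to even" rule (banker's rounding): when the part to be dropped is exactly .5, the result is rounded up if the preceding digit is odd and rounded down if the preceding digit is even.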
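A short illustration with values that are exactly representable in floating point, so the ties behave as described:

```r
# Ties go to the nearest even integer
round(c(0.5, 1.5, 2.5, 3.5))
#> [1] 0 2 2 4
```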
round_half_up() follows the familiar "round half up" rule instead.
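For comparison, the same kind of tie values passed to round_half_up() are all rounded up:

```r
library(janitor)

# Every .5 tie is rounded up
round_half_up(c(0.5, 1.5, 2.5, 3.5))
#> [1] 1 2 3 4
```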
To round to the nearest fraction instead, for example to one of 0, 0.25, 0.5, 0.75, 1, use round_to_fraction() and specify the denominator.
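A minimal sketch of rounding to quarters with round_to_fraction(); the input values are arbitrary and chosen to avoid ties:

```r
library(janitor)

# denominator = 4 rounds each value to the nearest multiple of 1/4
round_to_fraction(c(0.1, 0.3, 0.6, 0.9), denominator = 4)
#> [1] 0.00 0.25 0.50 1.00
```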
14.2.5 excel_numeric_to_date
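excel_numeric_to_date() converts numbers to dates following Excel's date-encoding rule, in which a date is stored as a serial day number (counted from 1899/12/30 in the default "modern" date system).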
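A short sketch converting one Excel serial number; the value 41103 is an arbitrary example, and the default date_system = "modern" is assumed:

```r
library(janitor)

# 41103 is the Excel serial number of 13 July 2012
excel_numeric_to_date(41103)
#> [1] "2012-07-13"
```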
14.2.6 top_levels
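When analysing Likert-scale data, one often wants the combined share of the few highest (or lowest) levels of an attitude variable. Such variables are stored in R as factors with ordered levels. top_levels() splits all levels of the factor into three groups (top, middle, bottom) and reports the frequency of each group: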
f <- factor(c("strongly agree", "agree", "neutral", "neutral", "disagree", "strongly agree"),
levels = c("strongly agree", "agree", "neutral", "disagree", "strongly disagree"))
top_levels(f)
#> f n percent
#> strongly agree, agree 3 0.500
#> neutral 2 0.333
#> disagree, strongly disagree 1 0.167
top_levels(as.factor(mtcars$hp))
#> as.factor(mtcars$hp) n percent
#> 52, 62 2 0.0625
#> <<< Middle Group (18 categories) >>> 28 0.8750
#> 264, 335 2 0.0625
The number of levels included in each of the two outer groups can be changed.
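A sketch of this, assuming the n argument of top_levels() (default 2) sets how many levels go into each outer group. The factor f is rebuilt so the example is self-contained:

```r
library(janitor)

f <- factor(c("strongly agree", "agree", "neutral", "neutral", "disagree", "strongly agree"),
            levels = c("strongly agree", "agree", "neutral", "disagree", "strongly disagree"))

# With n = 1, only the first and last level form the outer groups;
# the three levels in between are pooled into the middle group
res <- top_levels(f, n = 1)
res
```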
14.2.7 row_to_names
row_to_names() promotes a row of observations to the column names:
dirt <- data.frame(X_1 = c(NA, "ID", 1:3),
X_2 = c(NA, "Value", 4:6))
dirt
#> X_1 X_2
#> 1 <NA> <NA>
#> 2 ID Value
#> 3 1 4
#> 4 2 5
#> 5 3 6
dirt %>%
row_to_names(row_number = 2, remove_rows_above = FALSE)
#> ID Value
#> 1 <NA> <NA>
#> 3 1 4
#> 4 2 5
#> 5 3 6
dirt %>%
row_to_names(row_number = 2, remove_rows_above = TRUE)
#> ID Value
#> 3 1 4
#> 4 2 5
#> 5 3 6