Chapter 11 因子資料處理
因子物件 (factor) 為處理 類別資料 (categorical data), 提供的一種有效的方法. 類別變數 將變數值分成互斥的類別水準, 在類別變數定義中的特定幾種類別水準是否有大小差距, 又可分類成 名目變數 (nominal variable) 與 有序變數 (ordinal variable), 或稱 名目尺度 與 順序尺度.
在 {R} 中特別使用
因子
(factor)
來表示.
因子是一種特殊的文字向量,
文字向量中的每一個元素, 取一個離散值,
因子物件有一個特殊屬性, 稱為
層次,
水平,
水準
或
類別水準
(levels),
表示這組所有可能的離散值.
因子常常是用文字或字串輸入,
有時會使用數值或整數代表,
一但變數設定為因素或因子向量,
{R} 在列印或輸出時,
並不會加上雙引號 "
,
且數值或有大小順序的文字,
{R} 在統計分析上都必須特別處理.
在 {R} 中若要設定因子,
可以簡單地用函式
factor()
產生無序因子物件.
tidyverse
套件系統提供 forcats
套件,
主要用來處理因子變數與類別資料.
11.1 forcats 套件: 基本函式
forcats
套件的基本函式包含
fct_count(f, sort = FALSE, prop = FALSE)
: 計算類別水準數目.fct_unique(f)
: 呈現專一類別水準名稱.fct_c(f1, f2)
: 合併不同類別水準的 2 個因子物件.
library(dplyr)
library(ggplot2)
library(forcats)
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.0.3
set.seed(100)
1:5]
letters[## [1] "a" "b" "c" "d" "e"
factor(sample(letters[1:5])[rpois(100, 5)])
f <-table(f)
## f
## a b c d e
## 21 2 3 21 15
fct_count(f)
## # A tibble: 6 x 2
## f n
## <fct> <int>
## 1 a 21
## 2 b 2
## 3 c 3
## 4 d 21
## 5 e 15
## 6 <NA> 38
fct_count(f, sort = TRUE)
## # A tibble: 6 x 2
## f n
## <fct> <int>
## 1 <NA> 38
## 2 a 21
## 3 d 21
## 4 e 15
## 5 c 3
## 6 b 2
fct_count(f, sort = TRUE, prop = TRUE)
## # A tibble: 6 x 3
## f n p
## <fct> <int> <dbl>
## 1 <NA> 38 0.38
## 2 a 21 0.21
## 3 d 21 0.21
## 4 e 15 0.15
## 5 c 3 0.03
## 6 b 2 0.02
##
factor(letters[1:3])
f1 <- factor(letters[c(1, 2, 23)])
f2 <-
f1## [1] a b c
## Levels: a b c
f2## [1] a b w
## Levels: a b w
fct_c(f1, f2)
## [1] a b c a b w
## Levels: a b c w
11.2 移除或增加部分類別水準
資料分析中常因抽樣 0 值, 而必須特別處理, 例如必須移除部分類別水準或是必須增加因子變數中的類別水準. 下列函式可以執行這些功能.
fct_drop(f, only) # R baes base::droplevels()
fct_expand(f, ...)
fct_explicit_na(f, na_level = "(Missing)")
函式 fct_drop()
可以移除部分類別水準,
函式 fct_expand()
可以增加因子變數中的類別水準.
函式 fct_explicit_na
可以明確設性缺失值為 1 項類別水準.
factor(c("F", "M"), levels = c("F", "M", "Other"))
f <-
f## [1] F M
## Levels: F M Other
fct_drop(f)
## [1] F M
## Levels: F M
# Set only to restrict which levels to drop
fct_drop(f, only = "F")
## [1] F M
## Levels: F M Other
fct_drop(f, only = "Other")
## [1] F M
## Levels: F M
##
fct_expand(f, "B", "T")
## [1] F M
## Levels: F M Other B T
##
factor(c("F", "M", "M", "F", "F", "B", "T", NA, NA))
f <-
f## [1] F M M F F B T <NA> <NA>
## Levels: B F M T
fct_explicit_na(f)
## [1] F M M F F B T (Missing)
## [9] (Missing)
## Levels: B F M T (Missing)
fct_explicit_na(f, na_level = "Other")
## [1] F M M F F B T Other Other
## Levels: B F M T Other
11.3 改變類別水準函式
因子變數中的類別水準的名稱呈現, 常常視需求而更改,
例如, H
改成 High
, F
改成 female
等等.
函式 fct_recode()
可以改變的名稱呈現.
factor(c("apple", "bear", "banana", "dear"))
x <-
x## [1] apple bear banana dear
## Levels: apple banana bear dear
fct_recode(x, fruit = "apple", fruit = "banana")
## [1] fruit bear fruit dear
## Levels: fruit bear dear
# If you name the level NULL it will be removed
fct_recode(x, NULL = "apple", fruit = "banana")
## [1] <NA> bear fruit dear
## Levels: fruit bear dear
# When passing a named vector to rename levels use !!! to splice
factor(c("apple", "bear", "banana", "dear"))
x <- c(fruit = "apple", fruit = "banana")
levels <-fct_recode(x, !!!levels)
## [1] fruit bear fruit dear
## Levels: fruit bear dear
##
factor(c("F", "M", "M", "F", "F", "B", "T"))
x <-
x## [1] F M M F F B T
## Levels: B F M T
fct_recode(x, Male = "M", Female = "F", Other = "B", Other = "T")
## [1] Female Male Male Female Female Other Other
## Levels: Other Female Male
函式 fct_collapse()
可以改變的名稱呈現,
將不同類別水準合併且更名.
##
factor(c("F", "M", "M", "F", "F", "B", "T"))
x <-
x## [1] F M M F F B T
## Levels: B F M T
fct_collapse(x, Biological = c("M", "F"),
Other = "B", Other = "T")
## [1] Biological Biological Biological Biological Biological Other Other
## Levels: Other Biological
函式 fct_other()
可將部分類別水準合併或移除.
fct_other(f, keep, drop, other_level = "Other")`
其中引數 keep
, drop
保留或移除部分原有類別水準,
將其餘類別水準合併成 other_level
.
factor(rep(LETTERS[1:4], times = 3))
x <-
x## [1] A B C D A B C D A B C D
## Levels: A B C D
fct_other(x, keep = c("A", "B"))
## [1] A B Other Other A B Other Other A B Other Other
## Levels: A B Other
fct_other(x, drop = c("A", "B"), other_level = "change")
## [1] change change C D change change C D change change C D
## Levels: C D change
11.4 改變或合併類別水準函式 fct_lump()
因子變數中的類別水準常常因頻率過低影響分析與推論,
常常常須要合併類別水準.
系列函式 fct_lump()
可將部分類別水準合併.
fct_lump_min()
: 合併類別水準頻率計數低於設定的最小值.
- `fct_lump_prop(): 合併類別水準相對頻率低於設定的最小值
fct_lump_n()
: 合併類別水準最多 n 種主要類別.fct_lump_lowfreq()
: 合併類別水準, 且確保other
類別的頻率仍是最低. 函式使用程式語言結構為
fct_lump(f, n, prop, w = NULL, other_level = "Other",
ties.method = c("min", "average", "first", "last", "random", "max"))
fct_lump_min(f, min, w = NULL, other_level = "Other")
fct_lump_prop(f, prop, w = NULL, other_level = "Other")
fct_lump_n(f, n, w = NULL, other_level = "Other",
ties.method = c("min", "average", "first", "last", "random", "max"))
fct_lump_lowfreq(f, other_level = "Other")
其中引數 f
為因子向量,
n
設定最多 n 種主要類別,
prop
設定正值百分率, 合併小於 prop
的類別,
設定負值百分率, 合併大於 prop
的類別.
w
設定權重.
other_level
設定合併後的類別名稱.
ties.method
處理相同排序方式.
min
保留至少出現 min 次類別.
factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))
x <-%>% table()
x ## .
## A B C D E F G H I
## 40 10 5 27 1 1 1 1 1
%>% fct_lump_n(3) %>% table()
x ## .
## A B D Other
## 40 10 27 10
%>% fct_lump_prop(0.10) %>% table()
x ## .
## A B D Other
## 40 10 27 10
%>% fct_lump_min(5) %>% table()
x ## .
## A B C D Other
## 40 10 5 27 5
%>% fct_lump_lowfreq() %>% table()
x ## .
## A D Other
## 40 27 20
##
set.seed(123)
factor(letters[rpois(50, 5)])
x <-
x## [1] d g d h i b e h e e i e f e b h c b d i h f f k f f e e d c i h f g a e f c d c c d
## [43] d d c c c e d g
## Levels: a b c d e f g h i k
table(x)
## x
## a b c d e f g h i k
## 1 3 8 9 9 7 3 5 4 1
table(fct_lump_lowfreq(x))
##
## b c d e f g h i Other
## 3 8 9 9 7 3 5 4 2
## Use positive values to collapse the rarest
fct_lump_n(x, n = 3)
## [1] d Other d Other Other Other e Other e e Other e Other e
## [15] Other Other c Other d Other Other Other Other Other Other Other e e
## [29] d c Other Other Other Other Other e Other c d c c d
## [43] d d c c c e d Other
## Levels: c d e Other
fct_lump_prop(x, prop = 0.1)
## [1] d Other d Other Other Other e Other e e Other e f e
## [15] Other Other c Other d Other Other f f Other f f e e
## [29] d c Other Other f Other Other e f c d c c d
## [43] d d c c c e d Other
## Levels: c d e f Other
## Use negative values to collapse the most common
fct_lump_n(x, n = -3)
## [1] Other g Other Other Other b Other Other Other Other Other Other Other Other
## [15] b Other Other b Other Other Other Other Other k Other Other Other Other
## [29] Other Other Other Other Other g a Other Other Other Other Other Other Other
## [43] Other Other Other Other Other Other Other g
## Levels: a b g k Other
fct_lump_prop(x, prop = -0.1)
## [1] Other g Other h i b Other h Other Other i Other Other Other
## [15] b h Other b Other i h Other Other k Other Other Other Other
## [29] Other Other i h Other g a Other Other Other Other Other Other Other
## [43] Other Other Other Other Other Other Other g
## Levels: a b g h i k Other
## Use weighted frequencies
c(rep(2, 25), rep(1, 25))
w <-fct_lump_n(x, n = 5, w = w)
## [1] d Other d h Other Other e h e e Other e f e
## [15] Other h c Other d Other h f f Other f f e e
## [29] d c Other h f Other Other e f c d c c d
## [43] d d c c c e d Other
## Levels: c d e f h Other
fct_lump_n(x, n = 6)
## [1] d Other d h i Other e h e e i e f e
## [15] Other h c Other d i h f f Other f f e e
## [29] d c i h f Other Other e f c d c c d
## [43] d d c c c e d Other
## Levels: c d e f h i Other
fct_lump_n(x, n = 6, ties.method = "max")
## [1] d Other d h i Other e h e e i e f e
## [15] Other h c Other d i h f f Other f f e e
## [29] d c i h f Other Other e f c d c c d
## [43] d d c c c e d Other
## Levels: c d e f h i Other
## Use fct_lump_min() to lump together all levels with fewer than `n` values
table(fct_lump_min(x, min = 10))
##
## Other
## 50
以第 5 章中 退伍軍人肺癌臨床試驗 資料 (survVATrial.csv) 為例,
細胞型態 cellcode
有 4 種類別.
將類別水準合併成 2 種主要類別, 其餘類別合併.
read.table("./Data/survVATrial.csv",
dd <-header = TRUE,
sep = ",",
quote = "\"'",
dec = ".",
row.names = NULL,
# col.names,
as.is = TRUE,
# as.is = !stringsAsFactors,
na.strings = c(".", "NA"))
head(dd)
## treat cellcode time censor diagtime kps age prior
## 1 0 1 72 1 60 7 69 0
## 2 0 1 411 1 70 5 64 10
## 3 0 1 228 1 60 3 38 0
## 4 0 1 126 1 60 9 63 10
## 5 0 1 118 1 70 11 65 10
## 6 0 1 10 1 20 5 49 0
str(dd)
## 'data.frame': 137 obs. of 8 variables:
## $ treat : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cellcode: int 1 1 1 1 1 1 1 1 1 1 ...
## $ time : int 72 411 228 126 118 10 82 110 314 100 ...
## $ censor : int 1 1 1 1 1 1 1 1 1 0 ...
## $ diagtime: int 60 70 60 60 70 20 40 80 50 70 ...
## $ kps : int 7 5 3 9 11 5 10 29 18 6 ...
## $ age : int 69 64 38 63 65 49 69 68 43 70 ...
## $ prior : int 0 10 0 10 10 0 10 0 0 0 ...
$treat <- factor(dd$treat, labels = c("placebo", "test"))
dd$cellcode <- factor(dd$cellcode,
ddlabels = c("squamous", "small", "adeno", "large"))
$censor <- factor(dd$censor, labels = c("survival", "dead"))
dd$prior <- factor(dd$prior, labels = c("no", "yes"))
ddhead(dd)
## treat cellcode time censor diagtime kps age prior
## 1 placebo squamous 72 dead 60 7 69 no
## 2 placebo squamous 411 dead 70 5 64 yes
## 3 placebo squamous 228 dead 60 3 38 no
## 4 placebo squamous 126 dead 60 9 63 yes
## 5 placebo squamous 118 dead 70 11 65 yes
## 6 placebo squamous 10 dead 20 5 49 no
str(dd)
## 'data.frame': 137 obs. of 8 variables:
## $ treat : Factor w/ 2 levels "placebo","test": 1 1 1 1 1 1 1 1 1 1 ...
## $ cellcode: Factor w/ 4 levels "squamous","small",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ time : int 72 411 228 126 118 10 82 110 314 100 ...
## $ censor : Factor w/ 2 levels "survival","dead": 2 2 2 2 2 2 2 2 2 1 ...
## $ diagtime: int 60 70 60 60 70 20 40 80 50 70 ...
## $ kps : int 7 5 3 9 11 5 10 29 18 6 ...
## $ age : int 69 64 38 63 65 49 69 68 43 70 ...
## $ prior : Factor w/ 2 levels "no","yes": 1 2 1 2 2 1 2 1 1 1 ...
##
%>% select(cellcode) %>% table()
dd ## .
## squamous small adeno large
## 35 48 27 27
%>% select(cellcode) %>%
dd mutate(cellcode = fct_lump(cellcode, n = 1)) %>%
table()
## .
## small Other
## 48 89
11.5 類別水準的頻率排序函式 fct_infreq()
函式系列 fct_infreq()
可將類別水準的頻率排序記錄.
fct_inorder()
: 依照類別水準出現的次序排序.
fct_infreq()
: 依照類別水準的頻率排序 (由多到少).
fct_inseq()
: 依照類別水準的儲存數值排序.
factor(c("b", "b", "a", "c", "c", "c"))
f <-
f## [1] b b a c c c
## Levels: a b c
fct_inorder(f)
## [1] b b a c c c
## Levels: b a c
fct_infreq(f)
## [1] b b a c c c
## Levels: c b a
##
factor(1:3, levels = c("3", "2", "1"))
f <-
f## [1] 1 2 3
## Levels: 3 2 1
#> Levels: 3 2 1
fct_inseq(f)
## [1] 1 2 3
## Levels: 1 2 3
以第 5 章中 退伍軍人肺癌臨床試驗 資料 (survVATrial.csv) 為例,
細胞型態 cellcode
有 4 種類別.
%>% select(cellcode) %>% table()
dd ## .
## squamous small adeno large
## 35 48 27 27
## fct_inorder()
%>% select(cellcode) %>%
dd mutate(cellcode = fct_inorder(cellcode)) %>%
table()
## .
## squamous small adeno large
## 35 48 27 27
## fct_infreq()
%>% select(cellcode) %>%
dd mutate(cellcode = fct_infreq(cellcode)) %>%
table()
## .
## small squamous adeno large
## 48 35 27 27
## fct_inseq()
%>% select(cellcode) %>%
dd mutate(cellcode = fct_inseq(factor(as.numeric(cellcode)))) %>% table()
## .
## 1 2 3 4
## 35 48 27 27
以第 5 章中 退伍軍人肺癌臨床試驗 資料 (survVATrial.csv) 為例,
細胞型態 cellcode
有 4 種類別.
首先繪製長條圖.
## barplot
ggplot(dd, aes(x = cellcode)) +
geom_bar() +
coord_flip()
圖中的類別出現並非照類別水準的頻率順序,
函式 fct_infreq()
可將類別水準的頻率排序記錄.
ggplot(dd, aes(x = fct_infreq(cellcode))) +
geom_bar() +
coord_flip()
11.6 依照其他變數將類別重新排序函式 fct_reorder()
因子變數中的類別水準常常因其他變數而須將類別重新排序.
fct_rev(f)
: 將反轉原有類別出現的排列順序.fct_shuffle(f, n = 1L)
: 將原有類別出現的排列順序隨機變更.fct_reorder(.f, .x, .fun = median, ..., .desc = FALSE)
fct_reorder2(.f, .x, .y, .fun = last2, ..., .desc = TRUE)
first2(.x, .y)
last2(.x, .y)
fct_reorder()
將因子 f
類別出現的排列順序依照其他變數更動,
fct_reorder2()
保留因子 f
原有類別出現的排列順序.
當 y
變數依照 x
變數排序, 函式 first2(.x, .y)
與 last2(.x, .y)
可尋找 y
變數的最前與最後的 2 個數值.
其中引數
.f
: 為主要因子變數..x, .y
: 為其他變數..fun
: 為摘要函式..desc = FALSE
由小到大排序.
factor(c("a", "b", "c"))
f <-fct_rev(f)
## [1] a b c
## Levels: c b a
fct_shuffle(f)
## [1] a b c
## Levels: a c b
fct_shuffle(f)
## [1] a b c
## Levels: c b a
##
tibble::tribble(
df <-~color, ~a, ~b,
"blue", 1, 2,
"green", 6, 2,
"purple", 3, 3,
"red", 2, 3,
"yellow", 5, 1
)$color <- factor(df$color)
df##
fct_reorder(df$color, df$a, min)
## [1] blue green purple red yellow
## Levels: blue red purple yellow green
fct_reorder2(df$color, df$a, df$b)
## [1] blue green purple red yellow
## Levels: purple red blue green yellow
以第 5 章中 退伍軍人肺癌臨床試驗 資料 (survVATrial.csv) 為例,
細胞型態 cellcode
有 4 種類別.
boxplot(time ~ cellcode, data = dd)
boxplot(time ~ fct_reorder(cellcode, time), data = dd)
boxplot(time ~ fct_reorder(cellcode, time, .desc = TRUE), data = dd)