5.2 Sorting
5.2.1 Sorting by frequency, appearance, or numeric order
fct_infreq(): reorders factor levels by the frequency of each level, most frequent first. NA levels come last regardless of frequency.
# What's the most frequent hair color in starwars?
ggplot(starwars) +
  geom_bar(aes(fct_infreq(hair_color)))
fct_inorder(): sorts the levels of a factor by the order in which they first appear in the data. This can be useful when dealing with time series data.
f <- factor(c("b", "b", "a", "c", "c", "c"))
levels(f) # alphabetic order
#> [1] "a" "b" "c"
fct_inorder(f)
#> [1] b b a c c c
#> Levels: b a c
fct_inseq(): sorts the levels of a factor by their numeric value. This only applies when at least one existing level can be coerced to numeric.
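A minimal sketch of fct_inseq(), assuming forcats 0.5.0 or later is installed. The factor f below is made-up toy data whose levels are numbers stored as strings:

```r
library(forcats)

# Levels stored as strings sort alphabetically by default: "10" comes before "2"
f <- factor(c("10", "2", "1", "2"))
levels(f)
#> [1] "1"  "10" "2"

# fct_inseq() reorders the levels by their numeric value
f_seq <- fct_inseq(f)
levels(f_seq)
#> [1] "1"  "2"  "10"
```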
5.2.2 Sorting by another variable
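fct_reorder() is essentially the forcats counterpart of stats::reorder(). It sorts the levels of a factor by a summary statistic (median, mean, ...) of another variable computed within each level, which is useful when drawing bar charts that show something other than counts.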
Use .fun to set the summary function (defaults to median()) and .desc = TRUE to sort the factor in descending order. NA levels always come last regardless of the corresponding variable; fct_explicit_na() in Section 5.3.4 fixes this.
# reorder hair_color by median height, then summarize
# na.rm = TRUE is passed on to median() so that missing heights are ignored,
# matching the summary computed below
med_height <- starwars %>%
  mutate(hair_color = fct_reorder(hair_color, height, na.rm = TRUE)) %>%
  group_by(hair_color) %>%
  summarize(med_height = median(height, na.rm = TRUE))
med_height %>%
  ggplot(aes(hair_color, med_height)) +
  geom_col()
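The effect of .fun and .desc is easiest to check on a small vector. The following is a sketch on made-up toy data (f and x are hypothetical, not from starwars):

```r
library(forcats)

# Hypothetical toy data: three levels, two values each
f <- factor(c("a", "a", "b", "b", "c", "c"))
x <- c(1, 100, 10, 20, 30, 31)

# Default: ascending by median (a = 50.5, b = 15, c = 30.5)
by_median <- fct_reorder(f, x)
levels(by_median)
#> [1] "b" "c" "a"

# A different summary function: ascending by minimum (a = 1, b = 10, c = 30)
by_min <- fct_reorder(f, x, .fun = min)
levels(by_min)
#> [1] "a" "b" "c"

# Descending by median
by_median_desc <- fct_reorder(f, x, .desc = TRUE)
levels(by_median_desc)
#> [1] "a" "c" "b"
```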
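Sometimes a factor is mapped to a non-position aesthetic such as color or linetype. fct_reorder2(.f, .x, .y, .fun = last2) is designed for this kind of 2D display: it reorders the levels by a summary of two variables, .x and .y, so that, for example, the legend order of a line plot can match the order of the lines at the right edge of the plot. last2() and first2() are helpers for fct_reorder2(): last2() finds the last value of .y when sorted by .x; first2() finds the first value.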
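As a rough illustration, the sketch below builds a hypothetical data frame df with three groups observed at x = 1, 2, 3 and inspects the resulting level order. Note that fct_reorder2() sorts in descending order by default (.desc = TRUE), unlike fct_reorder():

```r
library(forcats)

# Hypothetical toy data: three series, each observed at x = 1, 2, 3
df <- data.frame(
  group = factor(rep(c("a", "b", "c"), each = 3)),
  x = rep(1:3, times = 3),
  y = c(1, 2, 3,   # series a: starts at 1, ends at 3
        3, 5, 9,   # series b: starts at 3, ends at 9
        4, 5, 6)   # series c: starts at 4, ends at 6
)

# Default .fun = last2: order by y at the largest x, descending
by_last <- fct_reorder2(df$group, df$x, df$y)
levels(by_last)
#> [1] "b" "c" "a"

# first2: order by y at the smallest x, descending
by_first <- fct_reorder2(df$group, df$x, df$y, .fun = first2)
levels(by_first)
#> [1] "c" "b" "a"
```

In a line plot with color = by_last, the legend entries would then appear in the same top-to-bottom order as the line endpoints.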
5.2.3 Sorting manually
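fct_infreq() and fct_reorder() sort by a well-defined criterion, but sometimes we need to specify or adjust the order by hand. fct_relevel() takes a vector of level names and rearranges the factor levels accordingly.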
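This example uses forcats::gss_cat, a sample from the General Social Survey (GSS). The GSS is a long-running survey of American society conducted by NORC, an independent research organization at the University of Chicago.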
gss_cat
#> # A tibble: 21,483 x 9
#> year marital age race rincome partyid relig denom tvhours
#> <int> <fct> <int> <fct> <fct> <fct> <fct> <fct> <int>
#> 1 2000 Never mar~ 26 White $8000 to ~ Ind,near r~ Protesta~ Souther~ 12
#> 2 2000 Divorced 48 White $8000 to ~ Not str re~ Protesta~ Baptist~ NA
#> 3 2000 Widowed 67 White Not appli~ Independent Protesta~ No deno~ 2
#> 4 2000 Never mar~ 39 White Not appli~ Ind,near r~ Orthodox~ Not app~ 4
#> 5 2000 Divorced 25 White Not appli~ Not str de~ None Not app~ 1
#> 6 2000 Married 25 White $20000 - ~ Strong dem~ Protesta~ Souther~ NA
#> # ... with 21,477 more rows
levels(gss_cat$rincome)
#> [1] "No answer" "Don't know" "Refused" "$25000 or more"
#> [5] "$20000 - 24999" "$15000 - 19999" "$10000 - 14999" "$8000 to 9999"
#> [9] "$7000 to 7999" "$6000 to 6999" "$5000 to 5999" "$4000 to 4999"
#> [13] "$3000 to 3999" "$1000 to 2999" "Lt $1000" "Not applicable"
In this dataset the levels of the factor rincome are already in a sensible order. To demonstrate fct_relevel(), we first scramble the level order with fct_shuffle():
## reordering the levels of rincome randomly with fct_shuffle():
reshuffled_income <- fct_shuffle(gss_cat$rincome)
levels(reshuffled_income)
#> [1] "$3000 to 3999" "No answer" "$10000 - 14999" "$15000 - 19999"
#> [5] "Not applicable" "$8000 to 9999" "$25000 or more" "Don't know"
#> [9] "$7000 to 7999" "$20000 - 24999" "$5000 to 5999" "$1000 to 2999"
#> [13] "Refused" "$6000 to 6999" "Lt $1000" "$4000 to 4999"
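fct_relevel() rearranges levels according to a vector of level names. By default the first level in the vector moves to the first position, the second to the second position, and so on; you only need to list the levels you want to move. The after argument controls where the listed levels are placed (the default, 0, puts them at the front); after = Inf moves them to the end: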
## move Lt $1000 and $1000 to 2999 to the front
fct_relevel(reshuffled_income,
            c("Lt $1000", "$1000 to 2999")) %>%
  levels()
#> [1] "Lt $1000" "$1000 to 2999" "$3000 to 3999" "No answer"
#> [5] "$10000 - 14999" "$15000 - 19999" "Not applicable" "$8000 to 9999"
#> [9] "$25000 or more" "Don't know" "$7000 to 7999" "$20000 - 24999"
#> [13] "$5000 to 5999" "Refused" "$6000 to 6999" "$4000 to 4999"
## move Lt $1000 and $1000 to 2999 to the second and third place
fct_relevel(reshuffled_income,
            c("Lt $1000", "$1000 to 2999"), after = 1) %>%
  levels()
#> [1] "$3000 to 3999" "Lt $1000" "$1000 to 2999" "No answer"
#> [5] "$10000 - 14999" "$15000 - 19999" "Not applicable" "$8000 to 9999"
#> [9] "$25000 or more" "Don't know" "$7000 to 7999" "$20000 - 24999"
#> [13] "$5000 to 5999" "Refused" "$6000 to 6999" "$4000 to 4999"
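The after argument can also send levels to the end. Because the shuffled order above differs from run to run, the sketch below uses a small made-up factor f so that the results are reproducible:

```r
library(forcats)

# Hypothetical toy factor with levels a, b, c, d
f <- factor(c("a", "b", "c", "d"))

# Default (after = 0): listed levels move to the front
to_front <- fct_relevel(f, "c", "d")
levels(to_front)
#> [1] "c" "d" "a" "b"

# after = 2: insert after the second of the remaining levels
to_middle <- fct_relevel(f, "d", after = 2)
levels(to_middle)
#> [1] "a" "b" "d" "c"

# after = Inf: listed levels move to the end
to_end <- fct_relevel(f, "a", after = Inf)
levels(to_end)
#> [1] "b" "c" "d" "a"
```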