5.2 Sorting

5.2.1 Sorting by frequency, appearance, or numeric order

fct_infreq() reorders factor levels by the frequency of each level, most frequent first. NA levels come last regardless of frequency.

# What's the most frequent hair color in starwars?
ggplot(starwars) +
  geom_bar(aes(fct_infreq(hair_color)))
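
To check the resulting level order without drawing a plot, a minimal sketch on a toy factor may help. The vector `f` below is made up for illustration and is not part of starwars.

```r
library(forcats)

# Toy factor: "c" appears three times, "b" twice, "a" once
f <- factor(c("a", "b", "b", "c", "c", "c"))

levels(f)                # default (alphabetical) order: "a" "b" "c"

f_freq <- fct_infreq(f)  # most frequent level first
levels(f_freq)
#> [1] "c" "b" "a"
```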

fct_inorder() sorts the levels of a factor by the order in which they first appear in the data. This can be useful when dealing with time series data.

f <- factor(c("b", "b", "a", "c", "c", "c"))
levels(f) # alphabetic order
#> [1] "a" "b" "c"

fct_inorder(f)
#> [1] b b a c c c
#> Levels: b a c
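
The time series case can be sketched as follows, assuming the observations are already stored in chronological order. The `month` vector is hypothetical.

```r
library(forcats)

# Hypothetical monthly labels, recorded in chronological order
month <- factor(c("Jan", "Feb", "Mar", "Apr", "Jan", "Feb"))

levels(month)                        # alphabetical, which scrambles the calendar
#> [1] "Apr" "Feb" "Jan" "Mar"

month_ordered <- fct_inorder(month)  # levels follow first appearance
levels(month_ordered)
#> [1] "Jan" "Feb" "Mar" "Apr"
```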

fct_inseq() sorts the levels of a factor by their numeric value. This is only applicable when at least one existing level can be coerced to numeric.

f <- factor(1:3, levels = c("3", "2", "1"))
fct_inseq(f)
#> [1] 1 2 3
#> Levels: 1 2 3
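
A typical case where this matters is numbers stored as text, where the default sort is alphabetical rather than numeric. The sketch below uses a made-up vector `f_chr`.

```r
library(forcats)

# Numbers stored as character strings (hypothetical data)
f_chr <- factor(c("10", "2", "1", "2"))

levels(f_chr)               # character sort typically puts "10" before "2"

f_num <- fct_inseq(f_chr)   # sort levels by their numeric value instead
levels(f_num)
#> [1] "1"  "2"  "10"
```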

5.2.2 Sorting by another variable

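fct_reorder() is essentially the forcats implementation of base::reorder(). It sorts the levels of a factor by a summary statistic (median, mean, ...) of another variable, which is useful when drawing bar charts of something other than counts.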

Use .fun to set the summarizing function (it defaults to median()) and .desc = TRUE to sort the levels in descending order. NA levels always come last regardless of the corresponding variable; fct_explicit_na() in Section 5.3.4 can be used to change this.

# reorder hair_color by median of height, then summarize
# na.rm = TRUE is passed on to median() so that missing heights are ignored
med_height <- starwars %>%
  mutate(hair_color = fct_reorder(hair_color, height, na.rm = TRUE)) %>%
  group_by(hair_color) %>%
  summarize(med_height = median(height, na.rm = TRUE))

med_height %>%
  ggplot(aes(hair_color, med_height)) +
  geom_col()
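
The effect of .fun and .desc is easier to verify on a small example. This is a minimal sketch with made-up vectors `f` and `x`, chosen so that the summaries can be computed by hand.

```r
library(forcats)

f <- factor(c("a", "a", "b", "b", "c", "c"))
x <- c(5, 7, 1, 3, 9, 2)
# medians: a = 6, b = 2, c = 5.5
# maxima:  a = 7, b = 3, c = 9

by_median      <- fct_reorder(f, x)                # ascending median (default)
by_median_desc <- fct_reorder(f, x, .desc = TRUE)  # descending median
by_max         <- fct_reorder(f, x, .fun = max)    # ascending maximum

levels(by_median)
#> [1] "b" "c" "a"
levels(by_median_desc)
#> [1] "a" "c" "b"
levels(by_max)
#> [1] "b" "a" "c"
```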

Sometimes a factor is mapped to a non-position aesthetic such as color. fct_reorder2(.f, .x, .y, .fun = last2) is designed for this kind of 2d display, where the factor is ordered by a summary of two other variables. last2() and first2() are helpers for fct_reorder2(): last2() finds the last value of .y when sorted by .x, and first2() finds the first value.

chks <- ChickWeight %>%
  as_tibble() %>%
  filter(as.integer(Chick) < 10) %>%
  mutate(Chick = fct_shuffle(Chick))  # random order

ggplot(chks, aes(Time, weight, color = Chick)) +
  geom_point() +
  geom_line()


# reorder the levels of Chick so that the chick with the largest weight
# at the last time point is assigned the first color
# Note that the lines now match the order in the legend
ggplot(chks, aes(Time, weight, color = fct_reorder2(Chick, Time, weight))) +
  geom_point() +
  geom_line() +
  labs(colour = "Chick")
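
What last2() and first2() pick out can be checked on a toy data set. The sketch below uses made-up vectors (`id`, `time`, `value`) and relies on fct_reorder2() sorting in descending order by default.

```r
library(forcats)

id    <- factor(c("a", "a", "b", "b", "c", "c"))
time  <- c(1, 2, 1, 2, 1, 2)
value <- c(1, 5, 2, 9, 3, 4)
# value at the last time point:  a = 5, b = 9, c = 4
# value at the first time point: a = 1, b = 2, c = 3

by_last  <- fct_reorder2(id, time, value)                 # .fun = last2
by_first <- fct_reorder2(id, time, value, .fun = first2)

levels(by_last)    # largest final value first
#> [1] "b" "a" "c"
levels(by_first)   # largest initial value first
#> [1] "c" "b" "a"
```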

5.2.3 Sorting manually

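fct_infreq() and fct_reorder() sort by a well-defined criterion, but sometimes we need to specify or adjust the order by hand. fct_relevel() takes a vector of level names and rearranges the factor levels accordingly.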

gss_cat
#> # A tibble: 21,483 x 9
#>    year marital      age race  rincome    partyid     relig     denom    tvhours
#>   <int> <fct>      <int> <fct> <fct>      <fct>       <fct>     <fct>      <int>
#> 1  2000 Never mar~    26 White $8000 to ~ Ind,near r~ Protesta~ Souther~      12
#> 2  2000 Divorced      48 White $8000 to ~ Not str re~ Protesta~ Baptist~      NA
#> 3  2000 Widowed       67 White Not appli~ Independent Protesta~ No deno~       2
#> 4  2000 Never mar~    39 White Not appli~ Ind,near r~ Orthodox~ Not app~       4
#> 5  2000 Divorced      25 White Not appli~ Not str de~ None      Not app~       1
#> 6  2000 Married       25 White $20000 - ~ Strong dem~ Protesta~ Souther~      NA
#> # ... with 21,477 more rows

levels(gss_cat$rincome)
#>  [1] "No answer"      "Don't know"     "Refused"        "$25000 or more"
#>  [5] "$20000 - 24999" "$15000 - 19999" "$10000 - 14999" "$8000 to 9999"
#>  [9] "$7000 to 7999"  "$6000 to 6999"  "$5000 to 5999"  "$4000 to 4999"
#> [13] "$3000 to 3999"  "$1000 to 2999"  "Lt $1000"       "Not applicable"

## reordering the levels of rincome randomly with fct_shuffle():
reshuffled_income <- fct_shuffle(gss_cat$rincome)
levels(reshuffled_income)
#>  [1] "$3000 to 3999"  "No answer"      "$10000 - 14999" "$15000 - 19999"
#>  [5] "Not applicable" "$8000 to 9999"  "$25000 or more" "Don't know"
#>  [9] "$7000 to 7999"  "$20000 - 24999" "$5000 to 5999"  "$1000 to 2999"
#> [13] "Refused"        "$6000 to 6999"  "Lt $1000"       "$4000 to 4999"

In fct_relevel(), the order is adjusted with a vector of level names. By default, the first level in the vector is moved to the first position, the second level to the second position, and so on; you only need to name the levels you want to move. Use after to specify the position after which the named levels are inserted; with after = Inf they are moved to the end of the ordering:

## move Lt $1000 and $1000 to 2999 to the front
fct_relevel(reshuffled_income, c("Lt $1000", "$1000 to 2999")) %>% levels()
#>  [1] "Lt $1000"       "$1000 to 2999"  "$3000 to 3999"  "No answer"
#>  [5] "$10000 - 14999" "$15000 - 19999" "Not applicable" "$8000 to 9999"
#>  [9] "$25000 or more" "Don't know"     "$7000 to 7999"  "$20000 - 24999"
#> [13] "$5000 to 5999"  "Refused"        "$6000 to 6999"  "$4000 to 4999"

## move Lt $1000 and $1000 to 2999 to the second and third place
fct_relevel(reshuffled_income, c("Lt $1000", "$1000 to 2999"), after = 1) %>% levels()
#>  [1] "$3000 to 3999"  "Lt $1000"       "$1000 to 2999"  "No answer"
#>  [5] "$10000 - 14999" "$15000 - 19999" "Not applicable" "$8000 to 9999"
#>  [9] "$25000 or more" "Don't know"     "$7000 to 7999"  "$20000 - 24999"
#> [13] "$5000 to 5999"  "Refused"        "$6000 to 6999"  "$4000 to 4999"
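
Because fct_shuffle() gives a different order on every run, the output above is hard to reproduce exactly. A deterministic sketch on a made-up four-level factor `f` shows the same three moves: to the front, after a given position, and to the end with after = Inf.

```r
library(forcats)

f <- factor(c("a", "b", "c", "d"))   # levels: a b c d

to_front  <- fct_relevel(f, "c", "d")          # named levels go first
after_one <- fct_relevel(f, "d", after = 1)    # insert after the first level
to_end    <- fct_relevel(f, "a", after = Inf)  # move to the last position

levels(to_front)
#> [1] "c" "d" "a" "b"
levels(after_one)
#> [1] "a" "d" "b" "c"
levels(to_end)
#> [1] "b" "c" "d" "a"
```

Only the level order changes; the values stored in the factor stay the same.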