5.2 Sorting

5.2.1 Sorting by frequency, appearance, or numeric order

fct_infreq() reorders the levels of a factor by how often each level occurs, most frequent first. NA levels come last regardless of their frequency.

# What's the most frequent hair color in starwars?
ggplot(starwars) +
  geom_bar(aes(fct_infreq(hair_color)))
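
To see the reordering without a plot, here is a minimal sketch on a toy factor. The vector `f_toy` is made up for illustration and is not part of starwars.

```r
library(forcats)

# Hypothetical toy factor: "c" occurs three times, "a" twice, "b" once
f_toy <- factor(c("a", "c", "b", "c", "a", "c"))

# Default level order is alphabetical
lv_default <- levels(f_toy)
lv_default
#> [1] "a" "b" "c"

# fct_infreq() puts the most frequent level first
lv_infreq <- levels(fct_infreq(f_toy))
lv_infreq
#> [1] "c" "a" "b"

# fct_rev() reverses the order, giving the least frequent level first
lv_rev <- levels(fct_rev(fct_infreq(f_toy)))
lv_rev
#> [1] "b" "a" "c"
```

Wrapping the result in fct_rev() is a common way to get ascending bars in a bar chart.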

fct_inorder(): sorts the levels of a factor by the order in which they first appear in the data. This can be useful when dealing with time series data.

f <- factor(c("b", "b", "a", "c", "c", "c"))
levels(f) # alphabetic order
#> [1] "a" "b" "c"

fct_inorder(f)
#> [1] b b a c c c
#> Levels: b a c

fct_inseq(): sorts the levels of a factor by their numeric value. This only works when at least one existing level can be coerced to numeric.

f <- factor(1:3, levels = c("3", "2", "1"))
fct_inseq(f)
#> [1] 1 2 3
#> Levels: 1 2 3
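
A typical case where this matters is a factor built from numeric strings, whose default levels are sorted as text. The following is a small sketch with a made-up vector `f_num`:

```r
library(forcats)

# Hypothetical factor built from numeric strings
f_num <- factor(c("10", "2", "1", "2"))

# Default levels are sorted as text, so "10" comes before "2"
lv_text <- levels(f_num)
lv_text
#> [1] "1"  "10" "2"

# fct_inseq() sorts the levels by their numeric value instead
lv_seq <- levels(fct_inseq(f_num))
lv_seq
#> [1] "1"  "2"  "10"
```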

5.2.2 Sorting by another variable

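fct_reorder() is essentially the forcats counterpart of base::reorder(). It sorts the levels of a factor by a summary statistic (median, mean, ...) of another variable computed within each level, which is useful when drawing bar charts that show something other than counts.

Use .fun to set the summarizing function (defaults to median()) and .desc = TRUE to sort the levels in descending order. NA levels always come last regardless of the corresponding variable; fct_explicit_na() in Section 5.3.4 fixes this.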

# reorder hair_color by median of height, then summarize
# na.rm = TRUE is passed on to median(), since height contains missing values
med_height <- starwars %>% 
  mutate(hair_color = fct_reorder(hair_color, height, na.rm = TRUE)) %>% 
  group_by(hair_color) %>%
  summarize(med_height = median(height, na.rm = TRUE))

med_height %>% 
  ggplot(aes(hair_color, med_height)) + 
  geom_col()
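
The effect of .fun and .desc is easier to see on a small made-up example. In this sketch, `grp` and `value` are hypothetical vectors chosen so that the median and the minimum give different orders:

```r
library(forcats)

# Hypothetical data: three groups with two values each
grp   <- factor(c("a", "a", "b", "b", "c", "c"))
value <- c(1, 100, 10, 20, 30, 40)

# Medians: a = 50.5, b = 15, c = 35, so ascending order is b, c, a
lv_median <- levels(fct_reorder(grp, value))
lv_median
#> [1] "b" "c" "a"

# Same summary, descending order
lv_desc <- levels(fct_reorder(grp, value, .desc = TRUE))
lv_desc
#> [1] "a" "c" "b"

# Minimums: a = 1, b = 10, c = 30, so ascending order is a, b, c
lv_min <- levels(fct_reorder(grp, value, .fun = min))
lv_min
#> [1] "a" "b" "c"
```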

Sometimes a factor is mapped to a non-position aesthetic such as color. fct_reorder2(.f, .x, .y, .fun = last2) is designed for this kind of 2D display: it reorders the levels of .f by a summary of .y taken along .x. last2() and first2() are helpers for fct_reorder2(): last2() finds the last value of .y when sorted by .x, and first2() finds the first value.

chks <- ChickWeight %>% 
  as_tibble() %>% 
  filter(as.integer(Chick) < 10) %>% 
  mutate(Chick = fct_shuffle(Chick))  # random order

ggplot(chks, aes(Time, weight, color = Chick)) +
  geom_point() +
  geom_line()


# reorder the levels of Chick so that the chick with the largest weight
# at the last time point is assigned the first color
# Note that the order of the lines at the right edge now matches the legend
ggplot(chks, aes(Time, weight, color = fct_reorder2(Chick, Time, weight))) +
  geom_point() +
  geom_line() +
  labs(colour = "Chick")
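
To make the role of last2() and first2() concrete, here is a minimal sketch on a made-up data frame. The names `df`, `id`, `time`, and `y` are hypothetical and unrelated to ChickWeight.

```r
library(forcats)

# last2() sorts y by x and returns the final value; first2() returns the first
y_last  <- last2(c(2, 1), c(10, 20))   # sorted by x: 20, 10
y_first <- first2(c(2, 1), c(10, 20))
y_last
#> [1] 10
y_first
#> [1] 20

# Hypothetical series: three ids observed at time 1 and time 2
df <- data.frame(
  id   = factor(c("a", "a", "b", "b", "c", "c")),
  time = c(1, 2, 1, 2, 1, 2),
  y    = c(3, 1, 1, 5, 2, 3)
)

# Value of y at the last time point: a = 1, b = 5, c = 3.
# fct_reorder2() sorts in descending order by default.
lv_last <- levels(fct_reorder2(df$id, df$time, df$y))
lv_last
#> [1] "b" "c" "a"
```

Because the default is descending, the series that ends highest comes first in the legend, which is why the legend lines up with the right edge of a line plot.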

5.2.3 Sorting manually

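fct_infreq() and fct_reorder() sort by well-defined criteria, but sometimes we need to specify or adjust the order by hand. fct_relevel() takes a vector of level names and moves those levels to a given position.

This example uses forcats::gss_cat, a sample from the General Social Survey (GSS). The GSS is a long-running survey of American society conducted by NORC, an independent research organization at the University of Chicago.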

gss_cat
#> # A tibble: 21,483 x 9
#>    year marital      age race  rincome    partyid     relig     denom    tvhours
#>   <int> <fct>      <int> <fct> <fct>      <fct>       <fct>     <fct>      <int>
#> 1  2000 Never mar~    26 White $8000 to ~ Ind,near r~ Protesta~ Souther~      12
#> 2  2000 Divorced      48 White $8000 to ~ Not str re~ Protesta~ Baptist~      NA
#> 3  2000 Widowed       67 White Not appli~ Independent Protesta~ No deno~       2
#> 4  2000 Never mar~    39 White Not appli~ Ind,near r~ Orthodox~ Not app~       4
#> 5  2000 Divorced      25 White Not appli~ Not str de~ None      Not app~       1
#> 6  2000 Married       25 White $20000 - ~ Strong dem~ Protesta~ Souther~      NA
#> # ... with 21,477 more rows
levels(gss_cat$rincome)
#>  [1] "No answer"      "Don't know"     "Refused"        "$25000 or more"
#>  [5] "$20000 - 24999" "$15000 - 19999" "$10000 - 14999" "$8000 to 9999" 
#>  [9] "$7000 to 7999"  "$6000 to 6999"  "$5000 to 5999"  "$4000 to 4999" 
#> [13] "$3000 to 3999"  "$1000 to 2999"  "Lt $1000"       "Not applicable"

In this dataset the levels of the factor rincome are already in a sensible order. To demonstrate fct_relevel(), we first shuffle the levels with fct_shuffle():

reshuffled_income <- fct_shuffle(gss_cat$rincome)
## reordering the levels of rincome randomly with fct_shuffle():
levels(reshuffled_income)
#>  [1] "$3000 to 3999"  "No answer"      "$10000 - 14999" "$15000 - 19999"
#>  [5] "Not applicable" "$8000 to 9999"  "$25000 or more" "Don't know"    
#>  [9] "$7000 to 7999"  "$20000 - 24999" "$5000 to 5999"  "$1000 to 2999" 
#> [13] "Refused"        "$6000 to 6999"  "Lt $1000"       "$4000 to 4999"

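fct_relevel() adjusts the order using a vector of level names. By default, the first level in the vector is moved to the first position, the second level to the second position, and so on; you only need to list the levels you want to move. Use after to control where the listed levels are placed: they are inserted after that position, and after = Inf moves them to the end of the order: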

## move Lt $1000 and $1000 to 2999 to the front
fct_relevel(reshuffled_income, 
            c("Lt $1000", "$1000 to 2999")) %>%
  levels()
#>  [1] "Lt $1000"       "$1000 to 2999"  "$3000 to 3999"  "No answer"     
#>  [5] "$10000 - 14999" "$15000 - 19999" "Not applicable" "$8000 to 9999" 
#>  [9] "$25000 or more" "Don't know"     "$7000 to 7999"  "$20000 - 24999"
#> [13] "$5000 to 5999"  "Refused"        "$6000 to 6999"  "$4000 to 4999"

## move Lt $1000 and $1000 to 2999 to the second and third place
fct_relevel(reshuffled_income, 
            c("Lt $1000", "$1000 to 2999"), after = 1) %>%
  levels()
#>  [1] "$3000 to 3999"  "Lt $1000"       "$1000 to 2999"  "No answer"     
#>  [5] "$10000 - 14999" "$15000 - 19999" "Not applicable" "$8000 to 9999" 
#>  [9] "$25000 or more" "Don't know"     "$7000 to 7999"  "$20000 - 24999"
#> [13] "$5000 to 5999"  "Refused"        "$6000 to 6999"  "$4000 to 4999"
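
The after argument, including the after = Inf case, can be checked on a small factor. This is a minimal sketch with a made-up factor `f_small`:

```r
library(forcats)

# Hypothetical four-level factor
f_small <- factor(c("a", "b", "c", "d"))

# Listed levels move to the front, in the order given
lv_front <- levels(fct_relevel(f_small, "d", "c"))
lv_front
#> [1] "d" "c" "a" "b"

# after = 1 inserts the listed level after the first remaining level
lv_after1 <- levels(fct_relevel(f_small, "d", after = 1))
lv_after1
#> [1] "a" "d" "b" "c"

# after = Inf moves the listed level to the end
lv_end <- levels(fct_relevel(f_small, "a", after = Inf))
lv_end
#> [1] "b" "c" "d" "a"
```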