5.3 Changing the number of levels

5.3.1 Lumping levels

To demonstrate how to lump multiple levels of a factor, we will start with fct_count() to count factor levels. It is essentially a variant of dplyr::count() that takes a vector (a factor or a character vector) as its first argument instead of a data frame, so it can be applied directly to a single column.

skin_color has 31 levels overall, and the 5 or 6 most frequent levels account for more than 50% of the occurrences. In fact, there are 24 levels whose frequency is less than 3%.
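These figures can be checked with a short sketch. It assumes that skin_color is the column of that name in dplyr::starwars, which the text does not state explicitly. The sort and prop arguments of fct_count() order the result by frequency and add a proportion column p.

```r
library(forcats)

# Assumption: skin_color is the column from dplyr::starwars
skin <- dplyr::starwars$skin_color

# One row per level: f (level), n (count), p (proportion)
skin_counts <- fct_count(skin, sort = TRUE, prop = TRUE)
skin_counts

# Number of distinct levels
nrow(skin_counts)

# Share of observations covered by the six most frequent levels
sum(head(skin_counts$p, 6))

# Number of levels that each account for less than 3% of observations
sum(skin_counts$p < 0.03)
```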

In this case, we may want to collapse some of the less frequent levels into one, say, a level called “Other”.

forcats provides a family of functions that lump together factor levels meeting some criterion into a new level, named “Other” by default:

  • fct_lump_min(): lumps levels that appear fewer than min times

  • fct_lump_prop(): lumps levels that appear fewer than prop * n times

  • fct_lump_n(): lumps all levels except for the n most frequent (or least frequent if n < 0)

  • fct_lump_lowfreq(): lumps together the least frequent levels, ensuring that “Other” is still the smallest level
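A minimal sketch of the four functions on a toy factor. The factor f and its counts are made up for illustration and are not taken from the data above.

```r
library(forcats)

# Hypothetical toy factor: a = 10, b = 6, c = 2, d = 1, e = 1 (20 values)
f <- factor(rep(c("a", "b", "c", "d", "e"), times = c(10, 6, 2, 1, 1)))

# Keep levels that appear at least 2 times; d and e are lumped
fct_count(fct_lump_min(f, min = 2))

# Lump levels that fall below a 15% share; c, d and e are lumped
fct_count(fct_lump_prop(f, prop = 0.15))

# Keep the 2 most frequent levels; everything else is lumped
fct_count(fct_lump_n(f, n = 2))

# Let forcats choose the cutoff so that the lumped level stays the smallest
fct_count(fct_lump_lowfreq(f))
```

Wrapping each call in fct_count() is just a convenient way to see which levels survive and how many values ended up in the lumped level.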

The sign of prop works like the sign of n: a positive prop preserves values that appear at least prop of the time, while a negative prop preserves values that appear at most -prop of the time.

Use the other_level argument to change the default name “Other”.
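The sign convention and other_level can be sketched on a toy factor. Again, f is hypothetical and exists only for this example, as does the level name "rare".

```r
library(forcats)

# Hypothetical toy factor: a = 10, b = 6, c = 2, d = 1, e = 1 (20 values)
f <- factor(rep(c("a", "b", "c", "d", "e"), times = c(10, 6, 2, 1, 1)))

# Negative n: keep the 2 least frequent levels (d and e), lump the rest
levels(fct_lump_n(f, n = -2))

# Negative prop: keep levels that make up at most 15% of the values
levels(fct_lump_prop(f, prop = -0.15))

# Rename the lumped level from the default "Other" to "rare"
levels(fct_lump_n(f, n = 2, other_level = "rare"))
```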

fct_other(f, keep, drop, other_level) provides a way of manually replacing values with “Other”. Pick one of keep and drop:

  • keep will preserve listed levels, replacing all others with other_level
  • drop will replace listed levels with other_level, keeping all others as is.
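A brief sketch of both modes, using a made-up factor f and a made-up replacement name "rare":

```r
library(forcats)

# Hypothetical toy factor: a = 10, b = 6, c = 2, d = 1, e = 1 (20 values)
f <- factor(rep(c("a", "b", "c", "d", "e"), times = c(10, 6, 2, 1, 1)))

# keep: preserve a and b, replace every other level with "Other"
fct_count(fct_other(f, keep = c("a", "b")))

# drop: replace d and e with a custom level, leave a, b and c untouched
fct_count(fct_other(f, drop = c("d", "e"), other_level = "rare"))
```

Unlike the fct_lump_*() functions, fct_other() ignores frequencies entirely: which levels are replaced depends only on the names listed.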

5.3.3 Dropping levels

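Sometimes we want to take a subset of the data, which can leave some levels of a factor with a frequency of 0 in the subset. R does not drop such zero-frequency levels automatically.

You can also pass a vector to the only argument of fct_drop() to name the levels you want to drop. Only levels that have a frequency of 0 and are contained in that vector will be dropped.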
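A small sketch of this behavior, using forcats' fct_drop() on a made-up factor (f and sub are illustrative names):

```r
library(forcats)

f <- factor(c("a", "b", "c", "d"))

# Subsetting keeps all four levels, even though c and d no longer occur
sub <- f[f %in% c("a", "b")]
levels(sub)

# fct_drop() removes every level with zero frequency
levels(fct_drop(sub))

# With only, just the listed unused levels are removed; d is kept
levels(fct_drop(sub, only = "c"))
```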

5.3.4 Transforming NA levels

When a factor has missing values, these NAs are not listed as a valid level. In some cases, though, an NA in a factor is meaningful. When that is so, we can use fct_explicit_na() in place of factor().

fct_explicit_na() gives the NAs an explicit factor level, na_level, which defaults to “(Missing)”.
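A minimal sketch on a made-up vector x. Note that forcats 1.0.0 and later deprecate fct_explicit_na() in favor of fct_na_value_to_level(), so recent versions may print a deprecation warning when this runs. The level name "unknown" is chosen only for illustration.

```r
library(forcats)

x <- c("a", "b", NA, "a", NA)

# factor() keeps the NAs as missing values, not as a level
f <- factor(x)
levels(f)

# fct_explicit_na() turns them into a real level, "(Missing)" by default
# (may warn about deprecation in forcats >= 1.0.0)
levels(fct_explicit_na(f))

# na_level sets the name of the new level
fct_count(fct_explicit_na(f, na_level = "unknown"))
```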