Factors are used in R to represent categorical data. In the following, I will briefly introduce you to the forcats (Wickham 2022) package (nice anagram, Hadley!). Factors are augmented vectors that build upon integers. If you want to learn more about them, consider reading this paper.
[1] CDU Greens FDP CSU CSU CSU Greens AfD
[9] CDU AfD FDP Leftists AfD SPD CSU Greens
[17] Greens AfD CDU Greens AfD CSU Leftists CSU
[25] SPD Leftists AfD FDP CSU SPD Greens FDP
[33] Leftists Leftists CSU Leftists CSU CSU CDU SPD
[41] FDP SPD SPD CDU AfD AfD CSU AfD
[49] CSU <NA>
Levels: AfD CDU CSU FDP Greens Leftists SPD
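To make the "build upon integers" point concrete, here is a minimal sketch in base R. The vector `parties_toy` is made up for illustration and is not the factor printed above.

```r
# Toy example (hypothetical data, not the chapter's factor)
parties_toy <- factor(c("CDU", "SPD", "CDU", "Greens"))

typeof(parties_toy)      # storage type of the underlying vector
levels(parties_toy)      # the labels, sorted alphabetically by default
as.integer(parties_toy)  # integer codes that index into the levels
```

Each element is stored as an integer code, and the levels attribute maps those codes back to the labels.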
Rows: 18475 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): county, fips, cand, st, lead
dbl (4): pct_report, votes, total_votes, pct
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Sometimes you want to reorder factors – for instance, when you want to create plots. (Note: you will learn more about plots in the next session on data visualization.)
Two orders would make sense: alphabetical and according to their number of votes. fct_reorder() takes another variable and orders the factor according to it.
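As a small sketch of how fct_reorder() behaves, consider the toy data below. The names `cand` and `votes` and the vote counts are invented for illustration and are not taken from the election data set.

```r
library(forcats)

# Hypothetical toy data: three candidates with made-up vote counts
cand  <- factor(c("Clinton", "Johnson", "Trump"))
votes <- c(300, 20, 250)

levels(cand)  # default: alphabetical order

# Reorder the levels by the second variable (ascending by default)
cand_by_votes <- fct_reorder(cand, votes)
levels(cand_by_votes)

# .desc = TRUE puts the level with the most votes first
cand_by_votes_desc <- fct_reorder(cand, votes, .desc = TRUE)
levels(cand_by_votes_desc)
```

Only the order of the levels changes; the values themselves stay where they are.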
For bar plots, which depict how often each value occurs, you can order the levels by their frequency using fct_infreq(); fct_rev() then reverses that order:
election_data_w_fct |>
  mutate(lead = lead |> fct_infreq() |> fct_rev()) |>
  ggplot(aes(x = lead)) +
  geom_bar()
5.2.2 Modifying levels
Remember the first factor? You need to put some graphs together and decide that you would rather use the original German names for the parties. Go for fct_recode().
parties_fct_ger <- fct_recode(parties_fct, "Buendnis90/Die Gruenen" = "Greens", "Die Linke" = "Leftists")
Damn, now the levels are not in alphabetical order anymore.
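One way to restore alphabetical order is to pass a function to fct_relevel(), which calls it on the current levels. The sketch below uses a made-up stand-in, `parties_toy`, rather than the chapter's parties_fct.

```r
library(forcats)

# Toy stand-in for parties_fct (hypothetical values)
parties_toy <- factor(c("AfD", "Greens", "Leftists", "SPD", "CDU"))

parties_toy_ger <- fct_recode(parties_toy,
  "Buendnis90/Die Gruenen" = "Greens",
  "Die Linke" = "Leftists"
)
levels(parties_toy_ger)  # recoded levels keep their old positions

# Re-sort the levels alphabetically; the values are left untouched
parties_toy_sorted <- fct_relevel(parties_toy_ger, sort)
levels(parties_toy_sorted)
```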
Now you need to write something for someone who is not particularly familiar with the political landscape in Germany and would rather have “left,” “center,” and “right” instead of the parties’ names. Give fct_collapse() a shot – and feel free to change it if you disagree with my classification.
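A minimal sketch of fct_collapse() could look like this. Both the toy factor `parties_toy` and the left/center/right grouping are assumptions made for illustration; the grouping is just one possible classification.

```r
library(forcats)

# Toy factor with one observation per party (hypothetical data)
parties_toy <- factor(c("AfD", "CDU", "CSU", "FDP", "Greens", "Leftists", "SPD"))

# One possible grouping, chosen for illustration only
parties_collapsed <- fct_collapse(parties_toy,
  left   = c("Greens", "Leftists", "SPD"),
  center = c("CDU", "CSU", "FDP"),
  right  = "AfD"
)

table(parties_collapsed)  # counts per collapsed group
```

Each argument names a new level and lists the old levels that should be merged into it.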
Another thing you could do – and this is handy for the election data set – is collapse levels according to how often they appear. Here, the goal is to lump the candidates into three groups: Donald Trump, Hillary Clinton, and other.
election_data_w_fct |> mutate(candidate = fct_lump(candidate, n = 2))
# A tibble: 18,007 × 8
county candidate state pct_r…¹ votes total…² pct lead
<fct> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <fct>
1 Los Angeles County Hillary Clinton CA 1 1.65e6 2314275 0.715 Hill…
2 Los Angeles County Donald Trump CA 1 5.43e5 2314275 0.234 Hill…
3 Los Angeles County Gary Johnson CA 1 5.69e4 2314275 0.0246 Hill…
4 Los Angeles County Other CA 1 4.67e4 2314275 0.0202 Hill…
5 Los Angeles County Other CA 1 1.35e4 2314275 0.00582 Hill…
6 Cook County Hillary Clinton IL 0.975 1.53e6 2055215 0.744 Hill…
7 Cook County Donald Trump IL 0.975 4.40e5 2055215 0.214 Hill…
8 Cook County Gary Johnson IL 0.975 5.59e4 2055215 0.0272 Hill…
9 Cook County Other IL 0.975 3.05e4 2055215 0.0149 Hill…
10 Harris County Hillary Clinton TX 1 7.06e5 1302887 0.542 Hill…
# … with 17,997 more rows, and abbreviated variable names ¹pct_report,
# ²total_votes
The problem here is that Gary Johnson appears as often as the two other candidates (have you ever heard of him?). Hence, fct_lump() cannot decide which levels to lump together and keeps him as a level of his own. However, it has still saved me a couple of lines of code:
test <- election_data_w_fct |>
  mutate(candidate = fct_lump(candidate, n = 2) |> fct_recode("Other" = "Gary Johnson"))
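The tie behaviour can be reproduced on a small made-up factor. In the sketch below, `x` is hypothetical data in which three levels are tied for the top spot, and `ties.method` is the fct_lump() argument that controls how such ties are ranked.

```r
library(forcats)

# Hypothetical toy factor: a, b, and c each appear twice, d and e once
x <- factor(c("a", "a", "b", "b", "c", "c", "d", "e"))

# Default tie handling: all tied levels are kept, although n = 2
lumped_default <- fct_lump(x, n = 2)
levels(lumped_default)

# Breaking ties by level order keeps only the first two tied levels
lumped_first <- fct_lump(x, n = 2, ties.method = "first")
levels(lumped_first)
```

Breaking ties this way is arbitrary, so recoding the unwanted level explicitly, as done above with fct_recode(), states the intent more clearly.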