5.4 Recoding
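More powerful than changing the order of the levels or the number of levels is changing their values. This lets you clarify labels for publication and collapse levels for higher-level displays. The most general and powerful tool for changing level values is fct_recode(), which lets you recode, or change, the value of each level. For example, take the factor variable partyid from the General Social Survey data: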
fct_count(gss_cat$partyid)
#> # A tibble: 10 x 2
#> f n
#> <fct> <int>
#> 1 No answer 154
#> 2 Don't know 1
#> 3 Other party 393
#> 4 Strong republican 2314
#> 5 Not str republican 3032
#> 6 Ind,near rep 1791
#> # ... with 4 more rows
The levels of this factor are terse and inconsistent. Let's use fct_recode() to make them longer and give them a parallel construction. The format is fct_recode(f, level_new = level_old):
gss_cat %>%
  mutate(partyid = fct_recode(partyid,
    "Republican,strong" = "Strong republican",
    "Republican,weak" = "Not str republican",
    "Independent,near rep" = "Ind,near rep",
    "Independent,near dem" = "Ind,near dem",
    "Democrat,weak" = "Not str democrat",
    "Democrat,strong" = "Strong democrat")) %>%
  count(partyid)
#> # A tibble: 10 x 2
#> partyid n
#> <fct> <int>
#> 1 No answer 154
#> 2 Don't know 1
#> 3 Other party 393
#> 4 Republican,strong 2314
#> 5 Republican,weak 3032
#> 6 Independent,near rep 1791
#> # ... with 4 more rows
fct_recode() leaves levels that are not explicitly mentioned as they are, and warns you if you accidentally refer to a level that does not exist.
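Both behaviours can be checked on a small example. The following is a minimal sketch on a made-up factor (the factor f and the new labels "apple" and "zebra" are hypothetical and not part of gss_cat):

```r
library(forcats)

# Hypothetical toy factor, used only for illustration
f <- factor(c("a", "b", "c"))

# Only "a" is mentioned, so "b" and "c" keep their original values
only_a <- fct_recode(f, "apple" = "a")
levels(only_a)

# "z" is not a level of f, so fct_recode() issues a warning;
# tryCatch() turns the warning into a value we can inspect
warned <- tryCatch(
  {
    fct_recode(f, "zebra" = "z")
    FALSE
  },
  warning = function(w) TRUE
)
warned
```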
To combine groups, you can assign multiple old levels to the same new level:
# Combine "No answer", "Don't know", and "Other party" into "Other"
gss_cat %>%
  mutate(partyid_recode = fct_recode(partyid,
    "Republican,strong" = "Strong republican",
    "Republican,weak" = "Not str republican",
    "Independent,near rep" = "Ind,near rep",
    "Independent,near dem" = "Ind,near dem",
    "Democrat,weak" = "Not str democrat",
    "Democrat,strong" = "Strong democrat",
    "Other" = "No answer",
    "Other" = "Don't know",
    "Other" = "Other party"
  )) %>%
  count(partyid_recode)
#> # A tibble: 8 x 2
#> partyid_recode n
#> <fct> <int>
#> 1 Other 548
#> 2 Republican,strong 2314
#> 3 Republican,weak 3032
#> 4 Independent,near rep 1791
#> 5 Independent 4119
#> 6 Independent,near dem 2499
#> # ... with 2 more rows
fct_collapse() is a variant of fct_recode() that collapses factor levels into manually defined groups. For each new level, you supply a character vector of the old levels it should absorb:
gss_cat %>%
  mutate(partyid = fct_collapse(partyid,
    other = c("No answer", "Don't know", "Other party"),
    rep = c("Strong republican", "Not str republican"),
    ind = c("Ind,near rep", "Independent", "Ind,near dem"),
    dem = c("Not str democrat", "Strong democrat"))) %>%
  count(partyid)
#> # A tibble: 4 x 2
#> partyid n
#> <fct> <int>
#> 1 other 548
#> 2 rep 5346
#> 3 ind 8409
#> 4 dem 7180
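To see how fct_collapse() relates to fct_recode(), here is a minimal sketch on a made-up factor (the factor f and the group names ab and cd are hypothetical and not part of gss_cat). It writes the same grouping both ways: once with one character vector per new level, and once with one new = old pair per old level.

```r
library(forcats)

# Hypothetical toy factor, used only for illustration
f <- factor(c("a", "b", "c", "d", "a", "c"))

# fct_collapse(): one character vector of old levels per new level
collapsed <- fct_collapse(f, ab = c("a", "b"), cd = c("c", "d"))

# fct_recode(): the same grouping, repeating the new name for each old level
recoded <- fct_recode(f, ab = "a", ab = "b", cd = "c", cd = "d")

levels(collapsed)
levels(recoded)

# Both approaches assign every observation to the same group
all(as.character(collapsed) == as.character(recoded))
```

With many old levels per group, the fct_collapse() form tends to be shorter and easier to scan, which is presumably why it exists as a separate helper.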
Unmentioned levels stay as they are. To collapse all of the remaining levels into a single group, specify other_level; the resulting level is always placed at the end of the levels.
# collapse two republican levels into "rep", and others into "I don't care"
gss_cat %>%
  mutate(partyid = fct_collapse(partyid,
    rep = c("Strong republican", "Not str republican"),
    other_level = "I don't care")) %>%
  count(partyid)
#> # A tibble: 2 x 2
#> partyid n
#> <fct> <int>
#> 1 rep 5346
#> 2 I don't care 16137
5.4.1 Exercises
Exercise 5.1 How has the number of people identifying as Democrat, Republican, and Independent changed over time?
gss_cat_collapse <- gss_cat %>%
  mutate(partyid = fct_collapse(partyid,
    other = c("No answer", "Don't know", "Other party"),
    rep = c("Strong republican", "Not str republican"),
    ind = c("Ind,near rep", "Independent", "Ind,near dem"),
    dem = c("Not str democrat", "Strong democrat")))

gss_cat_collapse %>%
  group_by(year) %>%
  count(partyid) %>%
  ggplot(aes(year, n, color = partyid)) +
  geom_line() +
  geom_point(size = 2, shape = 1)
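Raw counts can be hard to compare if the number of respondents differs from year to year. As a possible variation on the solution above, the following sketch plots each group's share of respondents within a year instead of the count. The names party_prop and prop are introduced here for illustration only, and the sketch assumes that dplyr, forcats, and ggplot2 are installed.

```r
library(dplyr)
library(forcats)
library(ggplot2)

party_prop <- gss_cat %>%
  mutate(partyid = fct_collapse(partyid,
    other = c("No answer", "Don't know", "Other party"),
    rep = c("Strong republican", "Not str republican"),
    ind = c("Ind,near rep", "Independent", "Ind,near dem"),
    dem = c("Not str democrat", "Strong democrat"))) %>%
  # One row per year and collapsed party group
  count(year, partyid) %>%
  # Share of each group among that year's respondents
  group_by(year) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

ggplot(party_prop, aes(year, prop, color = partyid)) +
  geom_line() +
  geom_point(size = 2, shape = 1)
```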