6.2 Pivoting
细看两表,不难发现它们实质上相同的数据(相对于第二张表,第一张是以 id 为行字段,key 为列字段,val 为值的数据透视表)。第一种称为宽数据 (wide data,Cartesian data,笛卡尔型数据),需要看行和列的交叉点来找到对应的值。而第二种形式称为长数据(long data,indexed data,指标型数据),在长数据(指标型)数据汇总,你需要看指标来找到需要变量的数值(变量x,y,z的值)。。很难简单地说哪一种格式更优,因为两种形式都有可能是整洁的,这取决于值“A”、“B”、“C”、“D”的含义。
数据整理常需要化宽为长,但偶尔也需要化长为宽, tidyr 分别提供了cpivot_longer()
和 pivot_wider()
来实现以上两种形式的转换操作(统称为 pivoting)。在 tidyr 1.0.0 及更早的版本中,gather()
和 spread()
分别承担相同的工作,实际效果与 pivot_
函数一样,但后者有更易理解的命名和 api。
names_to
和 values_to
参数相当于原来 gather()
中的 key
和 value
,其中 “键” 列的默认名称变为 “name”
同理, names_from
和 values_from
相当于原来 spread()
中的 key
和 value
6.2.1 pivot_longer()
tidyr::relig_income
是一个典型的宽数据,除第一列以外的所有列表示收入的不同水平,值表示对应的人数:
relig_income
#> # A tibble: 18 x 11
#> religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Agnostic 27 34 60 81 76 137 122
#> 2 Atheist 12 27 37 52 35 70 73
#> 3 Buddhist 27 21 30 34 33 58 62
#> 4 Catholic 418 617 732 670 638 1116 949
#> 5 Don’t k~ 15 14 15 11 10 35 21
#> 6 Evangel~ 575 869 1064 982 881 1486 949
#> # ... with 12 more rows, and 3 more variables: `$100-150k` <dbl>,
#> # `>150k` <dbl>, `Don't know/refused` <dbl>
relig_income %>% pivot_longer(-religion,
names_to = "income",
values_to = "population")
#> pivot_longer: reorganized (<$10k, $10-20k, $20-30k, $30-40k, $40-50k, …) into (income, population) [was 18x11, now 180x3]
#> # A tibble: 180 x 3
#> religion income population
#> <chr> <chr> <dbl>
#> 1 Agnostic <$10k 27
#> 2 Agnostic $10-20k 34
#> 3 Agnostic $20-30k 60
#> 4 Agnostic $30-40k 81
#> 5 Agnostic $40-50k 76
#> 6 Agnostic $50-75k 137
#> # ... with 174 more rows
另一个例子:美国劳工市场的月度数据,首先创建一个 messy data:
ec2 <- economics %>% as_tibble() %>%
transmute(year = year(date),
month = month(date),
rate = unemploy) %>%
filter(year > 2005) %>%
pivot_wider(names_from = "month", values_from = "rate")
#> transmute: dropped 6 variables (date, pce, pop, psavert, uempmed, …)
#> new variable 'year' with 49 unique values and 0% NA
#> new variable 'month' with 12 unique values and 0% NA
#> new variable 'rate' with 550 unique values and 0% NA
#> filter: removed 462 rows (80%), 112 rows remaining
#> pivot_wider: reorganized (month, rate) into (1, 2, 3, 4, 5, …) [was 112x3, now 10x13]
ec2
#> # A tibble: 10 x 13
#> year `1` `2` `3` `4` `5` `6` `7` `8` `9` `10` `11` `12`
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2006 7064 7184 7072 7120 6980 7001 7175 7091 6847 6727 6872 6762
#> 2 2007 7116 6927 6731 6850 6766 6979 7149 7067 7170 7237 7240 7645
#> 3 2008 7685 7497 7822 7637 8395 8575 8937 9438 9494 10074 10538 11286
#> 4 2009 12058 12898 13426 13853 14499 14707 14601 14814 15009 15352 15219 15098
#> 5 2010 15046 15113 15202 15325 14849 14474 14512 14648 14579 14516 15081 14348
#> 6 2011 14013 13820 13737 13957 13855 13962 13763 13818 13948 13594 13302 13093
#> # ... with 4 more rows
化宽为长:
ec2 %>%
pivot_longer(-year, names_to = "month", values_to = "value")
#> pivot_longer: reorganized (1, 2, 3, 4, 5, …) into (month, value) [was 10x13, now 120x3]
#> # A tibble: 120 x 3
#> year month value
#> <dbl> <chr> <dbl>
#> 1 2006 1 7064
#> 2 2006 2 7184
#> 3 2006 3 7072
#> 4 2006 4 7120
#> 5 2006 5 6980
#> 6 2006 6 7001
#> # ... with 114 more rows
以上就是 pviot_longer
的基本用法,下面来处理一些更复杂的情况。
6.2.1.1 Numeric data in column names
pivot_longer()
提供了 names_ptype
和 values_ptypes
调整数据集变长后键列和值列的数据类型。看一下 billboard
数据集:
billboard
#> # A tibble: 317 x 79
#> artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
#> <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2 Pac Baby~ 2000-02-26 87 82 72 77 87 94 99 NA
#> 2 2Ge+h~ The ~ 2000-09-02 91 87 92 NA NA NA NA NA
#> 3 3 Doo~ Kryp~ 2000-04-08 81 70 68 67 66 57 54 53
#> 4 3 Doo~ Loser 2000-10-21 76 76 72 69 67 65 55 59
#> 5 504 B~ Wobb~ 2000-04-15 57 34 25 17 17 31 36 49
#> 6 98^0 Give~ 2000-08-19 51 39 34 26 26 19 2 2
#> # ... with 311 more rows, and 68 more variables: wk9 <dbl>, wk10 <dbl>,
#> # wk11 <dbl>, wk12 <dbl>, wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>,
#> # wk17 <dbl>, wk18 <dbl>, wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>,
#> # wk23 <dbl>, wk24 <dbl>, wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>,
#> # wk29 <dbl>, wk30 <dbl>, wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>,
#> # wk35 <dbl>, wk36 <dbl>, wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>,
#> # wk41 <dbl>, wk42 <dbl>, wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>,
#> # wk47 <dbl>, wk48 <dbl>, wk49 <dbl>, wk50 <dbl>, wk51 <dbl>, wk52 <dbl>,
#> # wk53 <dbl>, wk54 <dbl>, wk55 <dbl>, wk56 <dbl>, wk57 <dbl>, wk58 <dbl>,
#> # wk59 <dbl>, wk60 <dbl>, wk61 <dbl>, wk62 <dbl>, wk63 <dbl>, wk64 <dbl>,
#> # wk65 <dbl>, wk66 <lgl>, wk67 <lgl>, wk68 <lgl>, wk69 <lgl>, wk70 <lgl>,
#> # wk71 <lgl>, wk72 <lgl>, wk73 <lgl>, wk74 <lgl>, wk75 <lgl>, wk76 <lgl>
显然,我们希望将所有以 "wk"
开头的列聚合以得到整洁数据,键列和值列分别命名为 “week” 和 “rank”。另外要考虑的一点是,我们很可能之后想计算歌曲保持在榜单上的周数,故需要将 “week” 列转换为数值类型:
billboard_tidy <- billboard %>%
pivot_longer(cols = starts_with("wk"),
names_to = "week",
values_to = "rank",
names_prefix = "wk",
names_ptypes = list(week = integer()),
values_drop_na = T)
#> pivot_longer: reorganized (wk1, wk2, wk3, wk4, wk5, …) into (week, rank) [was 317x79, now 5307x5]
billboard_tidy
#> # A tibble: 5,307 x 5
#> artist track date.entered week rank
#> <chr> <chr> <date> <int> <dbl>
#> 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 1 87
#> 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 2 82
#> 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 3 72
#> 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 4 77
#> 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 5 87
#> 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 6 94
#> # ... with 5,301 more rows
names_prefix
去除前缀 “wk”(否则无法从字符串转换为数值),names_ptype
以列表的形式转换键列的数据类型。同理, values_ptype
可以转换值列的数据类型。
## 计算保持周数
billboard_tidy %>%
group_by(track) %>%
summarise(stay = max(week) - min(week) + 1) %>%
arrange(desc(stay))
#> group_by: one grouping variable (track)
#> summarise: now 316 rows and 2 columns, ungrouped
#> # A tibble: 316 x 2
#> track stay
#> <chr> <dbl>
#> 1 Higher 65
#> 2 Amazed 64
#> 3 Breathe 53
#> 4 Kryptonite 53
#> 5 With Arms Wide Open 47
#> 6 I Wanna Know 44
#> # ... with 310 more rows
6.2.1.2 Many variables in column names
在 tidyr 1.0.0 之前,当进行一定处理后发现多个变量被糅合到一列中时,可能会考虑使用 separate()
或者 extract()
:
who
#> # A tibble: 7,240 x 60
#> country iso2 iso3 year new_sp_m014 new_sp_m1524 new_sp_m2534 new_sp_m3544
#> <chr> <chr> <chr> <int> <int> <int> <int> <int>
#> 1 Afghan~ AF AFG 1980 NA NA NA NA
#> 2 Afghan~ AF AFG 1981 NA NA NA NA
#> 3 Afghan~ AF AFG 1982 NA NA NA NA
#> 4 Afghan~ AF AFG 1983 NA NA NA NA
#> 5 Afghan~ AF AFG 1984 NA NA NA NA
#> 6 Afghan~ AF AFG 1985 NA NA NA NA
#> # ... with 7,234 more rows, and 52 more variables: new_sp_m4554 <int>,
#> # new_sp_m5564 <int>, new_sp_m65 <int>, new_sp_f014 <int>,
#> # new_sp_f1524 <int>, new_sp_f2534 <int>, new_sp_f3544 <int>,
#> # new_sp_f4554 <int>, new_sp_f5564 <int>, new_sp_f65 <int>,
#> # new_sn_m014 <int>, new_sn_m1524 <int>, new_sn_m2534 <int>,
#> # new_sn_m3544 <int>, new_sn_m4554 <int>, new_sn_m5564 <int>,
#> # new_sn_m65 <int>, new_sn_f014 <int>, new_sn_f1524 <int>,
#> # new_sn_f2534 <int>, new_sn_f3544 <int>, new_sn_f4554 <int>,
#> # new_sn_f5564 <int>, new_sn_f65 <int>, new_ep_m014 <int>,
#> # new_ep_m1524 <int>, new_ep_m2534 <int>, new_ep_m3544 <int>,
#> # new_ep_m4554 <int>, new_ep_m5564 <int>, new_ep_m65 <int>,
#> # new_ep_f014 <int>, new_ep_f1524 <int>, new_ep_f2534 <int>,
#> # new_ep_f3544 <int>, new_ep_f4554 <int>, new_ep_f5564 <int>,
#> # new_ep_f65 <int>, newrel_m014 <int>, newrel_m1524 <int>,
#> # newrel_m2534 <int>, newrel_m3544 <int>, newrel_m4554 <int>,
#> # newrel_m5564 <int>, newrel_m65 <int>, newrel_f014 <int>,
#> # newrel_f1524 <int>, newrel_f2534 <int>, newrel_f3544 <int>,
#> # newrel_f4554 <int>, newrel_f5564 <int>, newrel_f65 <int>
## 原书 tidyr 一章中使用的方法
who %>%
gather(starts_with("new"),
key = key,
value = value,
na.rm = T) %>%
extract(key,
into = c("diagnosis", "gender", "age"),
regex = "new_?(.*)_(.)(.*)")
#> gather: reorganized (new_sp_m014, new_sp_m1524, new_sp_m2534, new_sp_m3544, new_sp_m4554, …) into (key, value) [was 7240x60, now 76046x6]
#> # A tibble: 76,046 x 8
#> country iso2 iso3 year diagnosis gender age value
#> <chr> <chr> <chr> <int> <chr> <chr> <chr> <int>
#> 1 Afghanistan AF AFG 1997 sp m 014 0
#> 2 Afghanistan AF AFG 1998 sp m 014 30
#> 3 Afghanistan AF AFG 1999 sp m 014 8
#> 4 Afghanistan AF AFG 2000 sp m 014 52
#> 5 Afghanistan AF AFG 2001 sp m 014 129
#> 6 Afghanistan AF AFG 2002 sp m 014 90
#> # ... with 76,040 more rows
现在,pivot_longer()
现在可以在化宽为长的下一步直接完成拆分任务,可以直接在 names_to
中传入一个向量表示分裂后的各个键列,并在 names_sep
(分隔符) 或者 names_pattern
中(正则表达式)指定分裂的模式:
who %>%
pivot_longer(starts_with("new"),
names_to = c("diagonosis", "gender", "age"),
names_pattern = "new_?(.*)_(.)(.*)",
values_to = "count",
values_drop_na = T)
#> pivot_longer: reorganized (new_sp_m014, new_sp_m1524, new_sp_m2534, new_sp_m3544, new_sp_m4554, …) into (diagonosis, gender, age, count) [was 7240x60, now 76046x8]
#> # A tibble: 76,046 x 8
#> country iso2 iso3 year diagonosis gender age count
#> <chr> <chr> <chr> <int> <chr> <chr> <chr> <int>
#> 1 Afghanistan AF AFG 1997 sp m 014 0
#> 2 Afghanistan AF AFG 1997 sp m 1524 10
#> 3 Afghanistan AF AFG 1997 sp m 2534 6
#> 4 Afghanistan AF AFG 1997 sp m 3544 3
#> 5 Afghanistan AF AFG 1997 sp m 4554 5
#> 6 Afghanistan AF AFG 1997 sp m 5564 2
#> # ... with 76,040 more rows
更进一步,顺便设定好整理后 gender
和 age
的类型:
who %>%
pivot_longer(cols = starts_with("new"),
names_to = c("diagonosis", "gender", "age"),
names_pattern = "new_?(.*)_(.)(.*)",
names_ptypes = list(
gender = factor(levels = c("f", "m")),
age = factor(
levels = c("014", "1524", "2534", "3544", "4554", "5564", "65"),
ordered = TRUE)
),
values_to = "count",
values_drop_na = T)
#> pivot_longer: reorganized (new_sp_m014, new_sp_m1524, new_sp_m2534, new_sp_m3544, new_sp_m4554, …) into (diagonosis, gender, age, count) [was 7240x60, now 76046x8]
#> # A tibble: 76,046 x 8
#> country iso2 iso3 year diagonosis gender age count
#> <chr> <chr> <chr> <int> <chr> <fct> <ord> <int>
#> 1 Afghanistan AF AFG 1997 sp m 014 0
#> 2 Afghanistan AF AFG 1997 sp m 1524 10
#> 3 Afghanistan AF AFG 1997 sp m 2534 6
#> 4 Afghanistan AF AFG 1997 sp m 3544 3
#> 5 Afghanistan AF AFG 1997 sp m 4554 5
#> 6 Afghanistan AF AFG 1997 sp m 5564 2
#> # ... with 76,040 more rows
6.2.1.3 Multiple observations per row
(多个值列)
So far, we have been working with data frames that have one observation per row, but many important pivotting problems involve multiple observations per row. You can usually recognise this case because name of the column that you want to appear in the output is part of the column name in the input. In this section, you’ll learn how to pivot this sort of data.
family <- tribble(
~family, ~dob_child1, ~dob_child2, ~gender_child1, ~gender_child2,
1L, "1998-11-26", "2000-01-29", 1L, 2L,
2L, "1996-06-22", NA, 2L, NA,
3L, "2002-07-11", "2004-04-05", 2L, 2L,
4L, "2004-10-10", "2009-08-27", 1L, 1L,
5L, "2000-12-05", "2005-02-28", 2L, 1L,
)
family <- family %>% mutate_at(vars(starts_with("dob")), ymd)
#> mutate_at: converted 'dob_child1' from character to Date (0 new NA)
#> converted 'dob_child2' from character to Date (0 new NA)
family
#> # A tibble: 5 x 5
#> family dob_child1 dob_child2 gender_child1 gender_child2
#> <int> <date> <date> <int> <int>
#> 1 1 1998-11-26 2000-01-29 1 2
#> 2 2 1996-06-22 NA 2 NA
#> 3 3 2002-07-11 2004-04-05 2 2
#> 4 4 2004-10-10 2009-08-27 1 1
#> 5 5 2000-12-05 2005-02-28 2 1
理想中的数据格式(两个值列)
family | child | dob | gender |
---|---|---|---|
1 | 1 | 1998-11-26 | 1 |
1 | 2 | 2000-01-29 | 2 |
2 | 1 | 1996-06-22 | 2 |
3 | 1 | 2002-07-11 | 2 |
3 | 2 | 2004-04-05 | 2 |
4 | 1 | 2004-10-10 | 1 |
4 | 2 | 2009-08-27 | 1 |
5 | 1 | 2000-12-05 | 2 |
5 | 2 | 2005-02-28 | 1 |
Note that we have two pieces of information (or values) for each child: their gender
and their dob
(date of birth). These need to go into separate columns in the result. Again we supply multiple variables to names_to
, using names_sep
to split up each variable name. Note the special name .value
: this tells pivot_longer()
that that part of the column name specifies the “value” being measured (which will become a variable in the output)
.value
在这里指代 dob
和 gender
两个值列
family %>%
pivot_longer(
-family,
names_to = c(".value", "child"), ## child 为每个 family 中的标识变量
names_sep = "_",
values_drop_na = TRUE
)
#> pivot_longer: reorganized (dob_child1, dob_child2, gender_child1, gender_child2) into (child, dob, gender) [was 5x5, now 9x4]
#> # A tibble: 9 x 4
#> family child dob gender
#> <int> <chr> <date> <int>
#> 1 1 child1 1998-11-26 1
#> 2 1 child2 2000-01-29 2
#> 3 2 child1 1996-06-22 2
#> 4 3 child1 2002-07-11 2
#> 5 3 child2 2004-04-05 2
#> 6 4 child1 2004-10-10 1
#> # ... with 3 more rows
在这里,dob_child1
、dob_child2
、gender_child1
、gender_child2
四个列名的后半部分被当做键列的值。例如,可以认为对于 family == 1
的观测,首先生成了如下的结构:
family | child | dob | dob | gender | gender | |
---|---|---|---|---|---|---|
1 | child1 | 1998-11-16 | 2000-01-29 | 1 | 2 | |
2 | child2 |
而后名称相同的值列合并:
family | child | dob | gender |
---|---|---|---|
1 | child1 | 1998-11-26 | 1 |
1 | child2 | 2000-01-29 | 2 |
另一个例子:
anscombe
#> x1 x2 x3 x4 y1 y2 y3 y4
#> 1 10 10 10 8 8.04 9.14 7.46 6.58
#> 2 8 8 8 8 6.95 8.14 6.77 5.76
#> 3 13 13 13 8 7.58 8.74 12.74 7.71
#> 4 9 9 9 8 8.81 8.77 7.11 8.84
#> 5 11 11 11 8 8.33 9.26 7.81 8.47
#> 6 14 14 14 8 9.96 8.10 8.84 7.04
#> 7 6 6 6 8 7.24 6.13 6.08 5.25
#> 8 4 4 4 19 4.26 3.10 5.39 12.50
#> 9 12 12 12 8 10.84 9.13 8.15 5.56
#> 10 7 7 7 8 4.82 7.26 6.42 7.91
#> 11 5 5 5 8 5.68 4.74 5.73 6.89
anscombe %>%
pivot_longer(everything(),
names_to = c(".value", "set"),
names_pattern = "([xy])([1234])")
#> pivot_longer: reorganized (x1, x2, x3, x4, y1, …) into (set, x, y) [was 11x8, now 44x3]
#> # A tibble: 44 x 3
#> set x y
#> <chr> <dbl> <dbl>
#> 1 1 10 8.04
#> 2 2 10 9.14
#> 3 3 10 7.46
#> 4 4 8 6.58
#> 5 1 8 6.95
#> 6 2 8 8.14
#> # ... with 38 more rows
叕一个例子:
pnl <- tibble(
x = 1:4,
a = c(1, 1,0, 0),
b = c(0, 1, 1, 1),
y1 = rnorm(4),
y2 = rnorm(4),
z1 = rep(3, 4),
z2 = rep(-2, 4),
)
pnl
#> # A tibble: 4 x 7
#> x a b y1 y2 z1 z2
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 0 0.788 -1.59 3 -2
#> 2 2 1 1 -0.422 0.597 3 -2
#> 3 3 0 1 0.0569 1.22 3 -2
#> 4 4 0 1 0.711 -0.312 3 -2
pnl %>%
pivot_longer(-(x:b),
names_to = c(".value", "time"),
names_pattern = "([yz])([12])")
#> pivot_longer: reorganized (y1, y2, z1, z2) into (time, y, z) [was 4x7, now 8x6]
#> # A tibble: 8 x 6
#> x a b time y z
#> <int> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 1 1 0 1 0.788 3
#> 2 1 1 0 2 -1.59 -2
#> 3 2 1 1 1 -0.422 3
#> 4 2 1 1 2 0.597 -2
#> 5 3 0 1 1 0.0569 3
#> 6 3 0 1 2 1.22 -2
#> # ... with 2 more rows
6.2.1.4 Duplicated column names
如果某个数据框中各列有重复的名字,用 gather()
聚合这些变量所在的列时会返回一条错误:
这是因为被聚合的列名被当做 key
列的值,又因这些值是重复的,故不能唯一标识一条记录。pivot_longer()
针对这一点做了优化,尝试聚合这些列时,会自动生成一个标识列:
# To create a tibble with duplicated names
# you have to explicitly opt out of the name repair
# that usually prevents you from creating such a dataset:
(df <- tibble(x = 1:3, y = 4:6, y = 5:7, y = 7:9, .name_repair = "minimal"))
#> # A tibble: 3 x 4
#> x y y y
#> <int> <int> <int> <int>
#> 1 1 4 5 7
#> 2 2 5 6 8
#> 3 3 6 7 9
6.2.2 pivot_wider()
pivot_wider()
是 pivot_longer()
的逆操作,虽然在获得 tidy data 上,前者没有后者常用,但它经常被用来创建一些 summary table。
fish_encounters
数据记录了一些沿河观测站对一批鱼群的观测情况(seen
为发现次数):
fish_encounters
#> # A tibble: 114 x 3
#> fish station seen
#> <fct> <fct> <int>
#> 1 4842 Release 1
#> 2 4842 I80_1 1
#> 3 4842 Lisbon 1
#> 4 4842 Rstr 1
#> 5 4842 Base_TD 1
#> 6 4842 BCE 1
#> # ... with 108 more rows
很多后续分析工具需要每个观测站的观测情况单独成一列,使用 pivot_wider()
:
# 数据中已有的列不需要引号便可引用
fish_encounters %>%
pivot_wider(names_from = station, values_from = seen)
#> pivot_wider: reorganized (station, seen) into (Release, I80_1, Lisbon, Rstr, Base_TD, …) [was 114x3, now 19x12]
#> # A tibble: 19 x 12
#> fish Release I80_1 Lisbon Rstr Base_TD BCE BCW BCE2 BCW2 MAE MAW
#> <fct> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 4842 1 1 1 1 1 1 1 1 1 1 1
#> 2 4843 1 1 1 1 1 1 1 1 1 1 1
#> 3 4844 1 1 1 1 1 1 1 1 1 1 1
#> 4 4845 1 1 1 1 1 NA NA NA NA NA NA
#> 5 4847 1 1 1 NA NA NA NA NA NA NA NA
#> 6 4848 1 1 1 1 NA NA NA NA NA NA NA
#> # ... with 13 more rows
关于 pivot_wider()
,很重要的一点是它会暴露出数据中的隐式缺失值(implicit missing value)。这些没有出现在原数据中的 NA 值不是源自于记录错误或者遗失,只是没有对应的观测而已(观测站只能记录发生了的观测)。参数 values_fill 可以以一个列表填充 pivot_wider()
结果中的 NA, 当然如何处理这些隐式缺失值要按具体情境而定,在鱼群的例子里,用 0 填充是合适的:
fish_encounters %>%
pivot_wider(names_from = station, values_from = seen,
values_fill = list(seen = 0))
#> pivot_wider: reorganized (station, seen) into (Release, I80_1, Lisbon, Rstr, Base_TD, …) [was 114x3, now 19x12]
#> # A tibble: 19 x 12
#> fish Release I80_1 Lisbon Rstr Base_TD BCE BCW BCE2 BCW2 MAE MAW
#> <fct> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 4842 1 1 1 1 1 1 1 1 1 1 1
#> 2 4843 1 1 1 1 1 1 1 1 1 1 1
#> 3 4844 1 1 1 1 1 1 1 1 1 1 1
#> 4 4845 1 1 1 1 1 0 0 0 0 0 0
#> 5 4847 1 1 1 0 0 0 0 0 0 0 0
#> 6 4848 1 1 1 1 0 0 0 0 0 0 0
#> # ... with 13 more rows
fish_encoutners
的贡献者 Myfanwy Johnston 在个人网站上有一篇相关的文章
6.2.2.1 Aggregation
pivot_wider()
可以用来执行一些简单的聚合操作。warpbreaks
是一个关于经纱强度的控制试验,每个处理 (wool, tension)
上进行了 9 次试验:
(warpbreaks <- warpbreaks %>% as_tibble() %>% select(wool, tension, breaks))
#> select: columns reordered (wool, tension, breaks)
#> # A tibble: 54 x 3
#> wool tension breaks
#> <fct> <fct> <dbl>
#> 1 A L 26
#> 2 A L 30
#> 3 A L 54
#> 4 A L 25
#> 5 A L 70
#> 6 A L 52
#> # ... with 48 more rows
warpbreaks %>% count(wool, tension)
#> count: now 6 rows and 3 columns, ungrouped
#> # A tibble: 6 x 3
#> wool tension n
#> <fct> <fct> <int>
#> 1 A L 9
#> 2 A M 9
#> 3 A H 9
#> 4 B L 9
#> 5 B M 9
#> 6 B H 9
现在想知道每个处理下的平均断头次数,只需展开 wool
或 tension
中的任意一个:
warpbreaks %>%
pivot_wider(names_from = wool, values_from = breaks)
#> Warning: Values in `breaks` are not uniquely identified; output will contain list-cols.
#> * Use `values_fn = list(breaks = list)` to suppress this warning.
#> * Use `values_fn = list(breaks = length)` to identify where the duplicates arise
#> * Use `values_fn = list(breaks = summary_fun)` to summarise duplicates
#> pivot_wider: reorganized (wool, breaks) into (A, B) [was 54x3, now 3x3]
#> # A tibble: 3 x 3
#> tension A B
#> <fct> <list> <list>
#> 1 L <dbl [9]> <dbl [9]>
#> 2 M <dbl [9]> <dbl [9]>
#> 3 H <dbl [9]> <dbl [9]>
由于 (wool, tension)
不能唯一确认一个观测,多个观测被压缩至一个列表中,values_fn(breaks = mean)
求得平均值:
warpbreaks %>%
pivot_wider(names_from = wool, values_from = breaks,
values_fn = list(breaks = mean))
#> pivot_wider: reorganized (wool, breaks) into (A, B) [was 54x3, now 3x3]
#> # A tibble: 3 x 3
#> tension A B
#> <fct> <dbl> <dbl>
#> 1 L 44.6 28.2
#> 2 M 24 28.8
#> 3 H 24.6 18.8
For more complex summary operations, I recommend summarising before reshaping, but for simple cases it’s often convenient to summarise within pivot_wider()
6.2.2.2 Generate column name from multiple variables
现有一个数据集存储了关于产品、生产国家、生产年份和产量的水平组合:
production <- expand_grid(
product = c("A", "B"),
country = c("AI", "EI"),
year = 2000:2014
) %>%
filter((product == "A" & country == "AI") | product == "B") %>%
mutate(production = rnorm(nrow(.)))
#> filter: removed 15 rows (25%), 45 rows remaining
#> mutate: new variable 'production' with 45 unique values and 0% NA
production
#> # A tibble: 45 x 4
#> product country year production
#> <chr> <chr> <int> <dbl>
#> 1 A AI 2000 -0.209
#> 2 A AI 2001 -0.369
#> 3 A AI 2002 0.330
#> 4 A AI 2003 1.88
#> 5 A AI 2004 -0.482
#> 6 A AI 2005 1.74
#> # ... with 39 more rows
假设现在希望对于每个 product
和 country
的组合均创建一列,关键是在 names_from
中传入一个向量:
production %>%
pivot_wider(names_from = c(product, country), values_from = production)
#> pivot_wider: reorganized (product, country, production) into (A_AI, B_AI, B_EI) [was 45x4, now 15x4]
#> # A tibble: 15 x 4
#> year A_AI B_AI B_EI
#> <int> <dbl> <dbl> <dbl>
#> 1 2000 -0.209 1.02 -1.14
#> 2 2001 -0.369 0.598 0.143
#> 3 2002 0.330 1.38 -0.0472
#> 4 2003 1.88 -1.06 -1.26
#> 5 2004 -0.482 0.197 3.39
#> 6 2005 1.74 1.37 0.120
#> # ... with 9 more rows
names_sep
可以指定除了 _
以外的分隔符。names_prefix
为展开后的各列添加前缀(as opposed to removing in pivot_longer()
)
6.2.2.3 Multiple value columns
The us_rent_income
dataset contains information about median income and rent for each state in the US for 2017 (from the American Community Survey, retrieved with the tidycensus package).
us_rent_income
#> # A tibble: 104 x 5
#> GEOID NAME variable estimate moe
#> <chr> <chr> <chr> <dbl> <dbl>
#> 1 01 Alabama income 24476 136
#> 2 01 Alabama rent 747 3
#> 3 02 Alaska income 32940 508
#> 4 02 Alaska rent 1200 13
#> 5 04 Arizona income 27517 148
#> 6 04 Arizona rent 972 4
#> # ... with 98 more rows
Here both estimate
and moe
are values columns, so we can supply them to values_from
:
us_rent_income %>% pivot_wider(names_from = variable, values_from = c(estimate, moe))
#> pivot_wider: reorganized (variable, estimate, moe) into (estimate_income, estimate_rent, moe_income, moe_rent) [was 104x5, now 52x6]
#> # A tibble: 52 x 6
#> GEOID NAME estimate_income estimate_rent moe_income moe_rent
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 01 Alabama 24476 747 136 3
#> 2 02 Alaska 32940 1200 508 13
#> 3 04 Arizona 27517 972 148 4
#> 4 05 Arkansas 23789 709 165 5
#> 5 06 California 29454 1358 109 3
#> 6 08 Colorado 32401 1125 109 5
#> # ... with 46 more rows
Note that the name of the variable is automatically appended to the output columns.
6.2.2.4 When there is no identifying variable
A final challenge is inspired by Jiena Gu. Imagine you have a contact list that you’ve copied and pasted from a website:
contacts <- tribble(
~field, ~value,
"name", "Jiena McLellan",
"company", "Toyota",
"name", "John Smith",
"company", "google",
"email", "john@google.com",
"name", "Huxley Ratcliffe"
)
contacts
#> # A tibble: 6 x 2
#> field value
#> <chr> <chr>
#> 1 name Jiena McLellan
#> 2 company Toyota
#> 3 name John Smith
#> 4 company google
#> 5 email john@google.com
#> 6 name Huxley Ratcliffe
This is challenging because there’s no variable that identifies which observations belong together.
直接化宽时,出现列表列(没有第三个标识变量)
contacts %>% pivot_wider(names_from = field, values_from = value)
#> pivot_wider: reorganized (field, value) into (name, company, email) [was 6x2, now 1x3]
#> # A tibble: 1 x 3
#> name company email
#> <list> <list> <list>
#> 1 <chr [3]> <chr [2]> <chr [1]>
We can fix this by noting that every contact starts with a name, so we can create a unique id by counting every time we see “name” as the field
:
(contacts <- contacts %>%
mutate(
person_id = cumsum(field == "name")
))
#> mutate: new variable 'person_id' with 3 unique values and 0% NA
#> # A tibble: 6 x 3
#> field value person_id
#> <chr> <chr> <int>
#> 1 name Jiena McLellan 1
#> 2 company Toyota 1
#> 3 name John Smith 2
#> 4 company google 2
#> 5 email john@google.com 2
#> 6 name Huxley Ratcliffe 3
contacts %>%
pivot_wider(names_from = field, values_from = value)
#> pivot_wider: reorganized (field, value) into (name, company, email) [was 6x3, now 3x4]
#> # A tibble: 3 x 4
#> person_id name company email
#> <int> <chr> <chr> <chr>
#> 1 1 Jiena McLellan Toyota <NA>
#> 2 2 John Smith google john@google.com
#> 3 3 Huxley Ratcliffe <NA> <NA>
6.2.3 Combining pivot_longer()
and pivot_wider()
Some problems can’t be solved by pivotting in a single direction. The examples in this section show how you might combine pivot_longer()
and pivot_wider()
to solve more complex problems.
6.2.3.1 world bank data
world_bank_pop
contains data from the World Bank about population per country from 2000 to 2018.
world_bank_pop
#> # A tibble: 1,056 x 20
#> country indicator `2000` `2001` `2002` `2003` `2004` `2005` `2006`
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 ABW SP.URB.T~ 4.24e4 4.30e4 4.37e4 4.42e4 4.47e+4 4.49e+4 4.49e+4
#> 2 ABW SP.URB.G~ 1.18e0 1.41e0 1.43e0 1.31e0 9.51e-1 4.91e-1 -1.78e-2
#> 3 ABW SP.POP.T~ 9.09e4 9.29e4 9.50e4 9.70e4 9.87e+4 1.00e+5 1.01e+5
#> 4 ABW SP.POP.G~ 2.06e0 2.23e0 2.23e0 2.11e0 1.76e+0 1.30e+0 7.98e-1
#> 5 AFG SP.URB.T~ 4.44e6 4.65e6 4.89e6 5.16e6 5.43e+6 5.69e+6 5.93e+6
#> 6 AFG SP.URB.G~ 3.91e0 4.66e0 5.13e0 5.23e0 5.12e+0 4.77e+0 4.12e+0
#> # ... with 1,050 more rows, and 11 more variables: `2007` <dbl>, `2008` <dbl>,
#> # `2009` <dbl>, `2010` <dbl>, `2011` <dbl>, `2012` <dbl>, `2013` <dbl>,
#> # `2014` <dbl>, `2015` <dbl>, `2016` <dbl>, `2017` <dbl>
It’s not obvious exactly what steps are needed yet, but I’ll start with the most obvious problem: year
is spread across multiple columns.
pop2 <- world_bank_pop %>%
pivot_longer(`2000`:`2017`, names_to = "year", values_to = "value")
#> pivot_longer: reorganized (2000, 2001, 2002, 2003, 2004, …) into (year, value) [was 1056x20, now 19008x4]
pop2
#> # A tibble: 19,008 x 4
#> country indicator year value
#> <chr> <chr> <chr> <dbl>
#> 1 ABW SP.URB.TOTL 2000 42444
#> 2 ABW SP.URB.TOTL 2001 43048
#> 3 ABW SP.URB.TOTL 2002 43670
#> 4 ABW SP.URB.TOTL 2003 44246
#> 5 ABW SP.URB.TOTL 2004 44669
#> 6 ABW SP.URB.TOTL 2005 44889
#> # ... with 19,002 more rows
Next we need to consider the indicator
variable:
world_bank_pop %>% count(indicator)
#> count: now 4 rows and 2 columns, ungrouped
#> # A tibble: 4 x 2
#> indicator n
#> <chr> <int>
#> 1 SP.POP.GROW 264
#> 2 SP.POP.TOTL 264
#> 3 SP.URB.GROW 264
#> 4 SP.URB.TOTL 264
Here SP.POP.GROW
is population growth, SP.POP.TOTL
is total population, and SP.URB.*
are the same but only for urban areas. Let’s split this up into two variables: area
(total or urban) and the actual variable (population or growth):
# Use NA to omit the variable in the output.
pop3 <- pop2 %>%
separate(indicator, c(NA, "area", "variable"), sep = "\\.") # sep takes a regex
pop3
#> # A tibble: 19,008 x 5
#> country area variable year value
#> <chr> <chr> <chr> <chr> <dbl>
#> 1 ABW URB TOTL 2000 42444
#> 2 ABW URB TOTL 2001 43048
#> 3 ABW URB TOTL 2002 43670
#> 4 ABW URB TOTL 2003 44246
#> 5 ABW URB TOTL 2004 44669
#> 6 ABW URB TOTL 2005 44889
#> # ... with 19,002 more rows
Now we can complete the tidying by pivoting variable
and value
to make TOTL
and GROW
columns:
pop3 %>%
pivot_wider(names_from = variable, values_from = value)
#> pivot_wider: reorganized (variable, value) into (TOTL, GROW) [was 19008x5, now 9504x5]
#> # A tibble: 9,504 x 5
#> country area year TOTL GROW
#> <chr> <chr> <chr> <dbl> <dbl>
#> 1 ABW URB 2000 42444 1.18
#> 2 ABW URB 2001 43048 1.41
#> 3 ABW URB 2002 43670 1.43
#> 4 ABW URB 2003 44246 1.31
#> 5 ABW URB 2004 44669 0.951
#> 6 ABW URB 2005 44889 0.491
#> # ... with 9,498 more rows
6.2.3.2 mutli choice data
Based on a suggestion by Maxime Wack, https://github.com/tidyverse/tidyr/issues/384), the final example shows how to deal with a common way of recording multiple choice data. Often you will get such data as follows:
(multi <- tribble(
~id, ~choice1, ~choice2, ~choice3,
1, "A", "B", "C",
2, "C", "B", NA,
3, "D", NA, NA,
4, "B", "D", NA
))
#> # A tibble: 4 x 4
#> id choice1 choice2 choice3
#> <dbl> <chr> <chr> <chr>
#> 1 1 A B C
#> 2 2 C B <NA>
#> 3 3 D <NA> <NA>
#> 4 4 B D <NA>
But the actual order isn’t important, and you’d prefer to have the individual questions in the columns. You can achieve the desired transformation in two steps. First, you make the data longer, eliminating the explcit NA
s, and adding a column to indicate that this choice was chosen:
multi2 <- multi %>%
pivot_longer(-id, values_drop_na = TRUE) %>%
mutate(checked = TRUE)
#> pivot_longer: reorganized (choice1, choice2, choice3) into (name, value) [was 4x4, now 8x3]
#> mutate: new variable 'checked' with one unique value and 0% NA
multi2
#> # A tibble: 8 x 4
#> id name value checked
#> <dbl> <chr> <chr> <lgl>
#> 1 1 choice1 A TRUE
#> 2 1 choice2 B TRUE
#> 3 1 choice3 C TRUE
#> 4 2 choice1 C TRUE
#> 5 2 choice2 B TRUE
#> 6 3 choice1 D TRUE
#> # ... with 2 more rows
Then you make the data wider, filling in the missing observations with FALSE
; note the use of id_cols = id
here, this eliminated the name
column and combines mutilples rows per person into one row, since we don’t need name
in identifying an observation:
multi2 %>%
pivot_wider(id_cols = id,
names_from = value, values_from = checked,
values_fill = list(checked = FALSE))
#> pivot_wider: reorganized (name, value, checked) into (A, B, C, D) [was 8x4, now 4x5]
#> # A tibble: 4 x 5
#> id A B C D
#> <dbl> <lgl> <lgl> <lgl> <lgl>
#> 1 1 TRUE TRUE TRUE FALSE
#> 2 2 FALSE TRUE TRUE FALSE
#> 3 3 FALSE FALSE FALSE TRUE
#> 4 4 FALSE TRUE FALSE TRUE
6.2.4 Exercises
pivot_longer()
和 pivot_wider()
不是完美对称的
(stocks <- tibble(
year = c(2015, 2015, 2016, 2016),
half = c(1, 2, 1, 2),
return = c(1.88, 0.59, 0.92, 0.17)
))
#> # A tibble: 4 x 3
#> year half return
#> <dbl> <dbl> <dbl>
#> 1 2015 1 1.88
#> 2 2015 2 0.59
#> 3 2016 1 0.92
#> 4 2016 2 0.17
stocks %>%
pivot_wider(names_from = year, values_from = return) %>%
pivot_longer(-half, names_to = "year", values_to = "return")
#> pivot_wider: reorganized (year, return) into (2015, 2016) [was 4x3, now 2x3]
#> pivot_longer: reorganized (2015, 2016) into (year, return) [was 2x3, now 4x3]
#> # A tibble: 4 x 3
#> half year return
#> <dbl> <chr> <dbl>
#> 1 1 2015 1.88
#> 2 1 2016 0.92
#> 3 2 2015 0.59
#> 4 2 2016 0.17
先后使用 pivot_wider()
和 pivot_longer()
无法得到一个相同的数据集(除了列的顺序)是因为,数据整理有时会丢失列的类型信息。当 pivot_wider()
将变量 year
的值 2015
和 2016
用作列的名字时,它们自然被转化为了字符串"2015"
和"2016"
;随后 pivot_longer()
把列名用作键列year
的值,从而year
自然变成了一个字符向量,可以用 names_ptypes
避免这一点
。
stocks %>%
pivot_wider(names_from = year, values_from = return) %>%
pivot_longer(-half, names_to = "year", values_to = "return",
names_ptypes = list(year = double()))
#> pivot_wider: reorganized (year, return) into (2015, 2016) [was 4x3, now 2x3]
#> pivot_longer: reorganized (2015, 2016) into (year, return) [was 2x3, now 4x3]
#> # A tibble: 4 x 3
#> half year return
#> <dbl> <dbl> <dbl>
#> 1 1 2015 1.88
#> 2 1 2016 0.92
#> 3 2 2015 0.59
#> 4 2 2016 0.17
pivot_wider()
?可以添加一列解决这个问题吗?
(people <- tribble(
~name, ~key, ~value,
"Phillip Woods", "age", 45,
"Phillip Woods", "height", 186,
"Phillip Woods", "age", 50,
"Jessica Cordero", "age", 37,
"Jessica Cordero", "height", 156
))
#> # A tibble: 5 x 3
#> name key value
#> <chr> <chr> <dbl>
#> 1 Phillip Woods age 45
#> 2 Phillip Woods height 186
#> 3 Phillip Woods age 50
#> 4 Jessica Cordero age 37
#> 5 Jessica Cordero height 156
people %>%
pivot_wider(names_from = key, values_from = value)
#> # A tibble: 2 x 3
#> name age height
#> <chr> <list> <list>
#> 1 Phillip Woods <dbl [2]> <dbl [1]>
#> 2 Jessica Cordero <dbl [1]> <dbl [1]>
这个例子和 6.2.2.4 中的 contact
很类似,虽然这里有第三列 name
,但仍不足以唯一标识任意观测
因为这个数据集里有两个对于 “Phillip Woods” 在 age
上年龄的观测,pivot_wider()
就要把由(Phillips Woods, age)
确定的单元格里“塞进两个值”。本质上因为 name
和 key
这两个变量上的值不能唯一确定一行,所以我们只要添加一列,让name
、key
和新列可以唯一确定一行即可:
people %>%
mutate(id = row_number()) %>%
pivot_wider(names_from = key, values_from = value)
#> mutate: new variable 'id' with 5 unique values and 0% NA
#> pivot_wider: reorganized (key, value) into (age, height) [was 5x4, now 5x4]
#> # A tibble: 5 x 4
#> name id age height
#> <chr> <int> <dbl> <dbl>
#> 1 Phillip Woods 1 45 NA
#> 2 Phillip Woods 2 NA 186
#> 3 Phillip Woods 3 50 NA
#> 4 Jessica Cordero 4 37 NA
#> 5 Jessica Cordero 5 NA 156