6.8 Miscellaneous Functions
There are several remaining useful functions in tidyr
that cannot be easily categorized.
6.8.1 chop() and unchop()
Chopping and unchopping preserve the width of a data frame, changing its length. chop() makes df shorter by converting rows within each group into list-columns. unchop() makes df longer by expanding list-columns so that each element of the list-column gets its own row in the output.
Note that chop() returns one row of output for each unique combination of the non-chopped variables. chop() differs from nest() in section 6.3 in that it does not collapse the selected columns into a tibble; each chopped column becomes its own list-column of vectors:
df <- tibble(x = c(1, 1, 1, 2, 2, 3),
             y = 1:6,
             z = 6:1)
df %>% chop(cols = c(y, z))
#> # A tibble: 3 x 3
#> x y z
#> <dbl> <list> <list>
#> 1 1 <int [3]> <int [3]>
#> 2 2 <int [2]> <int [2]>
#> 3 3 <int [1]> <int [1]>
df %>% nest(data = c(y, z))
#> # A tibble: 3 x 2
#> x data
#> <dbl> <list>
#> 1 1 <tibble [3 x 2]>
#> 2 2 <tibble [2 x 2]>
#> 3 3 <tibble [1 x 2]>
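To make the difference concrete, the following is a minimal sketch that inspects a single cell from each result and then reverses the chop. The name df_demo is a hypothetical copy of the data above, and the sketch assumes a tidyr version that re-exports tibble() and %>%.

```r
library(tidyr)

# Hypothetical demo data with the same shape as the example above
df_demo <- tibble(x = c(1, 1, 1, 2, 2, 3), y = 1:6, z = 6:1)

chopped <- df_demo %>% chop(cols = c(y, z))
nested  <- df_demo %>% nest(data = c(y, z))

# chop(): each cell of y is a plain integer vector
class(chopped$y[[1]])
# nest(): each cell of data is a tibble holding both y and z
class(nested$data[[1]])

# unchop() reverses chop(), expanding y and z in parallel
round_trip <- chopped %>% unchop(cols = c(y, z))
round_trip
```

Because y and z are unchopped together, each row of the result pairs the y and z values that shared a row before chopping.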
unchop() goes the other way, giving each element of a list-column its own row:
df <- tibble(x = 1:4, y = list(integer(), 1L, 1:2, 1:3))
df %>% unchop(y)
#> # A tibble: 6 x 2
#> x y
#> <int> <int>
#> 1 2 1
#> 2 3 1
#> 3 3 2
#> 4 4 1
#> 5 4 2
#> 6 4 3
If there’s a size-0 element (like NULL or an empty data frame), that entire row will be dropped from the output. If you want to preserve all rows, use keep_empty = TRUE to replace size-0 elements with a single row of missing values.
# equivalent to df %>% unnest_longer(y)
df %>% unchop(y, keep_empty = TRUE)
#> # A tibble: 7 x 2
#> x y
#> <int> <int>
#> 1 1 NA
#> 2 2 1
#> 3 3 1
#> 4 3 2
#> 5 4 1
#> 6 4 2
#> # ... with 1 more row
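The example above uses an empty integer vector. As a small sketch of the NULL case mentioned in the text, the following compares the row counts with and without keep_empty. The data frame df_null is hypothetical and not part of the original example.

```r
library(tidyr)

# Hypothetical data: the first element of y is NULL (size 0)
df_null <- tibble(x = 1:3, y = list(NULL, 1:2, 3L))

# Default: the row with x = 1 has nothing to expand, so it is dropped
dropped <- df_null %>% unchop(y)

# keep_empty = TRUE: the row with x = 1 is kept, with a missing y
kept <- df_null %>% unchop(y, keep_empty = TRUE)

nrow(dropped)
nrow(kept)
```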
# Incompatible types -------------------------------------------------
# If the list-col contains types that cannot be natively combined,
# unchop() fails unless a ptype is supplied
df <- tibble(x = 1:2, y = list("1", 1:3))
try(df %>% unchop(y))
#> Error : No common type for `..1$y` <character> and `..2$y` <integer>.
df %>% unchop(y, ptype = tibble(y = integer()))
#> # A tibble: 4 x 2
#> x y
#> <int> <int>
#> 1 1 1
#> 2 2 1
#> 3 2 2
#> 4 2 3
df %>% unchop(y, ptype = tibble(y = character()))
#> # A tibble: 4 x 2
#> x y
#> <int> <chr>
#> 1 1 1
#> 2 2 1
#> 3 2 2
#> 4 2 3
df %>% unchop(y, ptype = tibble(y = list()))
#> # A tibble: 4 x 2
#> x y
#> <int> <list>
#> 1 1 <chr [1]>
#> 2 2 <int [1]>
#> 3 2 <int [1]>
#> 4 2 <int [1]>
ptype: optionally, a data frame prototype for the output cols, overriding the default that would be guessed from the combination of the individual values.
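An alternative that does not rely on ptype is to convert the elements of the list-column to one type yourself before unchopping. The sketch below does this with base R's lapply(). The name df_mixed is hypothetical, and converting to character is only one possible choice of common type.

```r
library(tidyr)

# Hypothetical data mixing character and integer elements
df_mixed <- tibble(x = 1:2, y = list("1", 1:3))

# Convert every element to character so that a common type exists
df_mixed$y <- lapply(df_mixed$y, as.character)

unchopped <- df_mixed %>% unchop(y)
unchopped
```

Doing the conversion explicitly keeps the coercion rule visible in the code, at the cost of one extra step.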
6.8.2 uncount()
uncount() performs the opposite operation to dplyr::count(), duplicating rows according to a weighting variable (or expression):
df <- tibble(x = c("a", "b", "c"), n = c(1, 2, 3))
uncount(df, n)
#> uncount: now 6 rows and one column, ungrouped
#> # A tibble: 6 x 1
#> x
#> <chr>
#> 1 a
#> 2 b
#> 3 b
#> 4 c
#> 5 c
#> 6 c
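As a rough check of the "opposite of count()" claim, the sketch below expands a frequency table and counts it again. It also passes a constant and an expression as weights. The name df_counts is a hypothetical copy of the data above, and dplyr is assumed to be installed.

```r
library(tidyr)
library(dplyr)

df_counts <- tibble(x = c("a", "b", "c"), n = c(1, 2, 3))

# Round trip: expand to one row per case, then count the cases again
recounted <- df_counts %>% uncount(n) %>% count(x)
recounted

# Weights can also be a constant or an expression. The n column is kept
# here because the weights argument is not a bare column name.
doubled <- uncount(df_counts, 2)
scaled  <- uncount(df_counts, n * 2)
```

The round trip recovers the original counts, although count() returns n as an integer column rather than a double.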
We can supply a string to .id to create a new variable that gives a unique identifier for each created row:
uncount(df, n, .id = "id")
#> uncount: now 6 rows and 2 columns, ungrouped
#> # A tibble: 6 x 2
#> x id
#> <chr> <int>
#> 1 a 1
#> 2 b 1
#> 3 b 2
#> 4 c 1
#> 5 c 2
#> 6 c 3
uncount() can be helpful when converting frequency-form data to case-form data, e.g.:
fiber <- read_csv("data/Fiber.csv")
#> Parsed with column specification:
#> cols(
#> fiber = col_character(),
#> bloat = col_character(),
#> count = col_double()
#> )
fiber
#> # A tibble: 16 x 3
#> fiber bloat count
#> <chr> <chr> <dbl>
#> 1 bran high 0
#> 2 gum high 5
#> 3 both high 2
#> 4 none high 0
#> 5 bran medium 1
#> 6 gum medium 3
#> # ... with 10 more rows
fiber %>% uncount(count)
#> uncount: now 48 rows and 2 columns, ungrouped
#> # A tibble: 48 x 2
#> fiber bloat
#> <chr> <chr>
#> 1 gum high
#> 2 gum high
#> 3 gum high
#> 4 gum high
#> 5 gum high
#> 6 both high
#> # ... with 42 more rows
Another way to achieve this transformation is to repeat the row indices with rep().
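The rep() approach can be sketched as follows. Since data/Fiber.csv is not available here, fiber_head is a hypothetical stand-in that contains only the six rows displayed above, so the row counts differ from the full data set.

```r
library(tidyr)

# Hypothetical stand-in for fiber: only the six rows displayed above
fiber_head <- tibble(
  fiber = c("bran", "gum", "both", "none", "bran", "gum"),
  bloat = c("high", "high", "high", "high", "medium", "medium"),
  count = c(0, 5, 2, 0, 1, 3)
)

# Repeat each row index `count` times, then keep all columns except count
row_index <- rep(seq_len(nrow(fiber_head)), times = fiber_head$count)
by_rep <- fiber_head[row_index, c("fiber", "bloat")]

# The same transformation with uncount(), for comparison
by_uncount <- fiber_head %>% uncount(count)
```

Rows with a count of 0 receive no index at all, which is why they vanish from both results.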
6.8.3 Exercises
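When working with the who dataset, we said that iso2 and iso3 are redundant. Confirm this.
If iso2 and iso3 are redundant, then each value of the combination (country, year) uniquely identifies one observation in the dataset (because (country, year) can itself serve as a key).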
who %>%
  count(country, year) %>%
  filter(n > 1)
#> count: now 7,240 rows and 3 columns, ungrouped
#> filter: removed all rows (100%)
#> # A tibble: 0 x 3
#> # ... with 3 variables: country <chr>, year <int>, n <int>
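Another approach uses distinct(), which returns every unique combination of values that actually occurs in the given columns (note that complete(), by contrast, "generates" every possible combination). It is similar to unique(), but faster: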
who %>%
  distinct(country, iso2, iso3) %>%
  group_by(country) %>%
  summarize(n = n()) %>%
  filter(n > 1)
#> distinct: removed 7,021 rows (97%), 219 rows remaining
#> group_by: one grouping variable (country)
#> summarize: now 219 rows and 2 columns, ungrouped
#> filter: removed all rows (100%)
#> # A tibble: 0 x 2
#> # ... with 2 variables: country <chr>, n <int>
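A third, more direct check is to count how many distinct codes each country has. This is only a sketch: it uses the who dataset that ships with tidyr, and the column names n_iso2 and n_iso3 are made up for illustration.

```r
library(tidyr)
library(dplyr)

# Number of distinct iso2 and iso3 codes observed for each country
codes_per_country <- who %>%
  group_by(country) %>%
  summarize(n_iso2 = n_distinct(iso2), n_iso3 = n_distinct(iso3))

# TRUE when every country maps to exactly one iso2 and one iso3 code
all_single <- all(codes_per_country$n_iso2 == 1,
                  codes_per_country$n_iso3 == 1)
all_single
```

If the result is TRUE, both codes are fully determined by country and add no further information.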