6.8 Miscellaneous Functions

There are several remaining useful functions in tidyr that cannot be easily categorized.

6.8.1 `chop()` and `unchop()`

Chopping and unchopping preserve the width of a data frame, changing its length. chop() makes df shorter by converting rows within each group into list-columns. unchop() makes df longer by expanding list-columns so that each element of the list-column gets its own row in the output.

chop() differs from nest() (section 6.3) in that it collapses the chosen columns into separate list-columns of vectors rather than into a single list-column of tibbles. Note that we get one row of output for each unique combination of the non-chopped variables:

df <- tibble(x = c(1, 1, 1, 2, 2, 3), 
             y = 1:6, 
             z = 6:1)

df %>% chop(cols = c(y, z))
#> # A tibble: 3 x 3
#>       x y         z        
#>   <dbl> <list>    <list>   
#> 1     1 <int [3]> <int [3]>
#> 2     2 <int [2]> <int [2]>
#> 3     3 <int [1]> <int [1]>

df %>% nest(data = c(y, z))
#> # A tibble: 3 x 2
#>       x data            
#>   <dbl> <list>          
#> 1     1 <tibble [3 x 2]>
#> 2     2 <tibble [2 x 2]>
#> 3     3 <tibble [1 x 2]>
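
To make the difference concrete, the following sketch inspects one element of each list-column. It rebuilds the df from above, and the names chopped and nested are introduced here purely for illustration.

```r
library(tidyr)

df <- tibble(x = c(1, 1, 1, 2, 2, 3), y = 1:6, z = 6:1)

chopped <- chop(df, c(y, z))
nested  <- nest(df, data = c(y, z))

# Each element of a chopped column is a bare vector ...
is.integer(chopped$y[[1]])
# ... whereas each element of a nested column is a tibble.
is.data.frame(nested$data[[1]])

# Both results have one row per distinct value of x.
nrow(chopped) == nrow(nested)
```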

unchop() expands a list-column back into one row per element:

df <- tibble(x = 1:4, y = list(integer(), 1L, 1:2, 1:3))

df %>% unchop(y)
#> # A tibble: 6 x 2
#>       x     y
#>   <int> <int>
#> 1     2     1
#> 2     3     1
#> 3     3     2
#> 4     4     1
#> 5     4     2
#> 6     4     3

If there’s a size-0 element (like NULL or an empty data frame), that entire row will be dropped from the output. If you want to preserve all rows, use keep_empty = TRUE to replace size-0 elements with a single row of missing values.

# comparable to df %>% unnest_longer(y, keep_empty = TRUE) in tidyr >= 1.3.0
df %>% unchop(y, keep_empty = TRUE)
#> # A tibble: 7 x 2
#>       x     y
#>   <int> <int>
#> 1     1    NA
#> 2     2     1
#> 3     3     1
#> 4     3     2
#> 5     4     1
#> 6     4     2
#> # ... with 1 more row
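
As a quick check of the row-dropping behaviour described above, this sketch compares the two calls side by side. The names dropped and kept are illustrative only.

```r
library(tidyr)

df <- tibble(x = 1:4, y = list(integer(), 1L, 1:2, 1:3))

dropped <- unchop(df, y)                     # the row with x = 1 disappears
kept    <- unchop(df, y, keep_empty = TRUE)  # the row with x = 1 stays, y is NA

nrow(dropped)
nrow(kept)

# x values that only survive when keep_empty = TRUE
setdiff(kept$x, dropped$x)
```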

# Incompatible types -------------------------------------------------
# If the list-col contains types that cannot be natively combined, unchop() fails:
df <- tibble(x = 1:2, y = list("1", 1:3))

try(df %>% unchop(y))
#> Error : No common type for `..1$y` <character> and `..2$y` <integer>.

df %>% unchop(y, ptype = tibble(y = integer()))
#> # A tibble: 4 x 2
#>       x     y
#>   <int> <int>
#> 1     1     1
#> 2     2     1
#> 3     2     2
#> 4     2     3

df %>% unchop(y, ptype = tibble(y = character()))
#> # A tibble: 4 x 2
#>       x y    
#>   <int> <chr>
#> 1     1 1    
#> 2     2 1    
#> 3     2 2    
#> 4     2 3
  
df %>% unchop(y, ptype = tibble(y = list()))
#> # A tibble: 4 x 2
#>       x y        
#>   <int> <list>   
#> 1     1 <chr [1]>
#> 2     2 <int [1]>
#> 3     2 <int [1]>
#> 4     2 <int [1]>

The ptype argument optionally supplies a data frame prototype for the output columns, overriding the default that would otherwise be guessed from the combination of the individual values.

6.8.2 `uncount()`

uncount() performs the opposite operation to dplyr::count(), duplicating rows according to a weighting variable (or expression).

df <- tibble(x = c("a", "b", "c"), n = c(1, 2, 3))
uncount(df, n)
#> uncount: now 6 rows and one column, ungrouped
#> # A tibble: 6 x 1
#>   x    
#>   <chr>
#> 1 a    
#> 2 b    
#> 3 b    
#> 4 c    
#> 5 c    
#> 6 c

We can supply a string to .id to create a new variable that gives a unique identifier for each created row:

uncount(df, n, .id = "id")
#> uncount: now 6 rows and 2 columns, ungrouped
#> # A tibble: 6 x 2
#>   x        id
#>   <chr> <int>
#> 1 a         1
#> 2 b         1
#> 3 b         2
#> 4 c         1
#> 5 c         2
#> 6 c         3
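
The claim that uncount() reverses dplyr::count() can be checked with a small round trip. The sketch below also shows a constant weight and the .remove argument. The name round_trip is illustrative, and the comparison is on values because count() returns integer counts while the original n is a double.

```r
library(tidyr)
library(dplyr)

df <- tibble(x = c("a", "b", "c"), n = c(1, 2, 3))

# Expand to one row per case, then tally the cases again.
round_trip <- df %>%
  uncount(n) %>%
  count(x)

# Compare values rather than types (integer vs. double).
all(round_trip$n == df$n)

# A constant weight duplicates every row the same number of times.
uncount(df, 2)

# .remove = FALSE keeps the weighting column in the output.
uncount(df, n, .remove = FALSE)
```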

uncount() can be helpful in converting frequency-form data to case-form data, e.g.:

fiber <- read_csv("data/Fiber.csv") 
#> Parsed with column specification:
#> cols(
#>   fiber = col_character(),
#>   bloat = col_character(),
#>   count = col_double()
#> )
fiber
#> # A tibble: 16 x 3
#>   fiber bloat  count
#>   <chr> <chr>  <dbl>
#> 1 bran  high       0
#> 2 gum   high       5
#> 3 both  high       2
#> 4 none  high       0
#> 5 bran  medium     1
#> 6 gum   medium     3
#> # ... with 10 more rows

fiber %>% uncount(count)
#> uncount: now 48 rows and 2 columns, ungrouped
#> # A tibble: 48 x 2
#>   fiber bloat
#>   <chr> <chr>
#> 1 gum   high 
#> 2 gum   high 
#> 3 gum   high 
#> 4 gum   high 
#> 5 gum   high 
#> 6 both  high 
#> # ... with 42 more rows

Another way to achieve this transformation is to repeat row indices with rep():

fiber[rep(1:nrow(fiber), fiber$count), -3]
#> # A tibble: 48 x 2
#>   fiber bloat
#>   <chr> <chr>
#> 1 gum   high 
#> 2 gum   high 
#> 3 gum   high 
#> 4 gum   high 
#> 5 gum   high 
#> 6 both  high 
#> # ... with 42 more rows
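
Because data/Fiber.csv may not be at hand, here is a self-contained sketch comparing the two approaches on a small made-up frequency table. The freq tibble and its values are hypothetical stand-ins, not the real fiber data.

```r
library(tidyr)

# Hypothetical frequency table standing in for the fiber data.
freq <- tibble(
  fiber = c("bran", "gum", "both"),
  bloat = c("high", "high", "high"),
  count = c(0, 5, 2)
)

via_uncount <- uncount(freq, count)
via_rep     <- freq[rep(seq_len(nrow(freq)), freq$count), -3]

# Both approaches drop the zero-count row and yield the same 7 cases.
all.equal(as.data.frame(via_uncount), as.data.frame(via_rep))
```

seq_len(nrow(freq)) is used instead of 1:nrow(freq) so that the indexing also behaves sensibly for a zero-row table.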

6.8.3 Exercises

Exercise 6.6 When cleaning the who dataset, we said that iso2 and iso3 are redundant. Confirm this claim.

If iso2 and iso3 are redundant, then each value of the combination (country, year) uniquely identifies one observation in the dataset (because (country, year) can itself serve as a key).

who %>% 
  count(country, year) %>% 
  filter(n > 1)
#> count: now 7,240 rows and 3 columns, ungrouped
#> filter: removed all rows (100%)
#> # A tibble: 0 x 3
#> # ... with 3 variables: country <chr>, year <int>, n <int>

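Another approach is the distinct() function, which returns every distinct combination of values that actually occurs in the selected columns (note that complete(), by contrast, "manufactures" all possible combinations). It is similar to unique(), but faster: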

who %>%
  distinct(country, iso2, iso3) %>%
  group_by(country) %>%
  summarize(n = n()) %>% 
  filter(n > 1)
#> distinct: removed 7,021 rows (97%), 219 rows remaining
#> group_by: one grouping variable (country)
#> summarize: now 219 rows and 2 columns, ungrouped
#> filter: removed all rows (100%)
#> # A tibble: 0 x 2
#> # ... with 2 variables: country <chr>, n <int>
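
A further variation on the same idea, offered only as a sketch, counts the distinct codes per country with n_distinct(). The name code_counts is illustrative. Given the result above, the final filter would be expected to return no rows.

```r
library(tidyr)
library(dplyr)

# Count how many distinct codes each country is paired with.
code_counts <- who %>%
  group_by(country) %>%
  summarize(
    n_iso2 = n_distinct(iso2),
    n_iso3 = n_distinct(iso3)
  )

# Any country with more than one code of either kind would contradict redundancy.
filter(code_counts, n_iso2 > 1 | n_iso3 > 1)
```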