17.2 Wrangling

17.2.1 tidyr

tidyr provides a handful of tools for converting between implicit (absent rows) and explicit (NA) missing values, and for handling explicit NAs.

17.2.1.4 full_seq(): create the full sequence of values in a vector

This is useful if you want to fill in missing values that should have been observed but weren’t. For example, full_seq(c(1, 2, 4, 6), 1) will return 1:6.
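As a minimal sketch of the behaviour described above (the variable names `observed`, `filled` and `years` are illustrative only), full_seq() takes the vector and the expected gap between consecutive values:

```r
library(tidyr)

# Observed values with gaps at 3 and 5
observed <- c(1, 2, 4, 6)

# Fill in every value between the minimum and maximum, in steps of 1
filled <- full_seq(observed, 1)
filled

# The same idea with years: 2011 and 2013 were never recorded
years <- full_seq(c(2010, 2012, 2014), 1)
years
```

The second argument is the period of the sequence, so it should match the spacing at which observations were supposed to occur.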

17.2.1.5 expand(): expand a data frame to include all combinations of values

expand() creates a data frame containing all combinations of the specified columns. It is often used in conjunction with left_join() to convert implicit missing values to explicit missing values, or with anti_join() to figure out which combinations are missing.

To find all unique combinations of x, y and z, including those not found in the data, supply each variable as a separate argument. To find only the combinations that occur in the data, use nesting(): expand(df, nesting(x, y, z)).

You can combine the two forms. For example, expand(df, nesting(school_id, student_id), date) would produce a row for every student for each date.

For factors, the full set of levels (not just those that appear in the data) is used. For continuous variables, you may need to fill in values that don’t appear in the data: to do so, use expressions like year = 2010:2020 or year = full_seq(year, 1).
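The difference between these forms can be seen on a small made-up data frame. This is only a sketch: `df`, `year` and `size` are hypothetical names, chosen so that one year (2011) and one factor level ("M") never appear in the data.

```r
library(tidyr)

# Hypothetical data: 2011 is absent, and level "M" never appears
df <- tibble(
  year = c(2010, 2012, 2013),
  size = factor(c("S", "L", "S"), levels = c("S", "M", "L"))
)

# Only the combinations that occur in the data
observed <- df %>% expand(nesting(year, size))

# Observed years crossed with all factor levels, including "M"
crossed <- df %>% expand(year, size)

# Additionally fill the gap in year with full_seq()
full <- df %>% expand(year = full_seq(year, 1), size)

nrow(observed)
nrow(crossed)
nrow(full)
```

Under these assumptions the three calls should give 3, 9 and 12 rows respectively: three observed pairs, three years times three levels, and four years times three levels.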

library(dplyr)
library(tidyr)

# Each person was given one of two treatments, repeated three times.
# But some of the replications haven't happened yet, so we have
# incomplete data:
experiment <- tibble(
  name = rep(c("Alex", "Robert", "Sam"), c(3, 2, 1)),
  trt  = rep(c("a", "b", "a"), c(3, 2, 1)),
  rep = c(1, 2, 3, 1, 2, 1),
  measurement_1 = runif(6),
  measurement_2 = runif(6)
)

# We can figure out the complete set of data with expand()
# Each person only gets one treatment, so we nest name and trt together:
all <- experiment %>% expand(nesting(name, trt), rep)
all
#> # A tibble: 9 x 3
#>   name   trt     rep
#>   <chr>  <chr> <dbl>
#> 1 Alex   a         1
#> 2 Alex   a         2
#> 3 Alex   a         3
#> 4 Robert b         1
#> 5 Robert b         2
#> 6 Robert b         3
#> # ... with 3 more rows

# use left_join to convert implicit missing values to explicit missing values
all %>% left_join(experiment)
#> # A tibble: 9 x 5
#>   name   trt     rep measurement_1 measurement_2
#>   <chr>  <chr> <dbl>         <dbl>         <dbl>
#> 1 Alex   a         1         0.614         0.379
#> 2 Alex   a         2         0.915         0.811
#> 3 Alex   a         3         0.838         0.475
#> 4 Robert b         1         0.855         0.838
#> 5 Robert b         2         0.501         0.739
#> 6 Robert b         3        NA            NA    
#> # ... with 3 more rows

# Use anti_join() to figure out which observations are missing
all %>% anti_join(experiment)
#> # A tibble: 3 x 3
#>   name   trt     rep
#>   <chr>  <chr> <dbl>
#> 1 Robert b         3
#> 2 Sam    a         2
#> 3 Sam    a         3


# And use right_join to add in the appropriate missing values to the
# original data
experiment %>% right_join(all)
#> # A tibble: 9 x 5
#>   name   trt     rep measurement_1 measurement_2
#>   <chr>  <chr> <dbl>         <dbl>         <dbl>
#> 1 Alex   a         1         0.614         0.379
#> 2 Alex   a         2         0.915         0.811
#> 3 Alex   a         3         0.838         0.475
#> 4 Robert b         1         0.855         0.838
#> 5 Robert b         2         0.501         0.739
#> 6 Robert b         3        NA            NA    
#> # ... with 3 more rows

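complete() is a shorthand for expand() followed by a join back to the original data (tidyr documents it as a wrapper around expand(), dplyr::full_join() and replace_na()). It first creates the specified combinations and then joins the original data onto them, converting implicit missing values to explicit missing values.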
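A minimal sketch of that equivalence, using the same layout as the experiment above but with a single `measurement` column holding fixed, made-up values so the result is reproducible:

```r
library(tidyr)
library(dplyr)

# Same layout as the experiment above; measurement values are illustrative
experiment <- tibble(
  name = rep(c("Alex", "Robert", "Sam"), c(3, 2, 1)),
  trt  = rep(c("a", "b", "a"), c(3, 2, 1)),
  rep  = c(1, 2, 3, 1, 2, 1),
  measurement = c(0.61, 0.92, 0.84, 0.86, 0.50, 0.33)
)

# Two-step version: build all combinations, then join the data back
by_hand <- experiment %>%
  expand(nesting(name, trt), rep) %>%
  left_join(experiment, by = c("name", "trt", "rep"))

# One-step version
completed <- experiment %>% complete(nesting(name, trt), rep)

# Optionally replace the newly created NAs with a chosen value via `fill`
filled <- experiment %>%
  complete(nesting(name, trt), rep, fill = list(measurement = 0))
```

Both `by_hand` and `completed` should contain nine rows, three of which have NA in `measurement`. The `fill` argument is convenient when a missing combination has a known meaning, such as a count of zero.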

17.2.1.6 expand_grid(): create a tibble from all combinations of inputs

expand_grid() is analogous to a vector version of expand(). Instead of taking a data frame, expand_grid() takes multiple name-value pairs and generates all combinations of their values.

crossing() is a wrapper around expand_grid() that deduplicates and sorts each input.
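The difference between the two functions shows up when an input is unsorted or contains duplicates. In this small sketch the inputs `x` and `y` are arbitrary illustrative vectors:

```r
library(tidyr)

# All combinations of the inputs, in the order given (duplicates kept)
grid <- expand_grid(x = c(2, 1, 1), y = c("b", "a"))
nrow(grid)     # 3 * 2 = 6 rows

# crossing() deduplicates and sorts each input first
crossed <- crossing(x = c(2, 1, 1), y = c("b", "a"))
nrow(crossed)  # 2 * 2 = 4 rows
```

expand_grid() is the better fit when the input order matters or duplicates are intentional; crossing() is the better fit when you want the set of distinct combinations.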

17.2.3 visdat

visdat::vis_miss()
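As a brief sketch, assuming the visdat package is installed: the function it exports for this purpose is vis_miss(), which returns a ggplot object showing missing and present cells. The built-in airquality data set is used here only because it ships with R and has NAs in some columns.

```r
library(visdat)

# airquality ships with R and has NAs in Ozone and Solar.R
na_counts <- colSums(is.na(airquality))
na_counts

# Heatmap-style overview of missing versus present cells
p <- vis_miss(airquality)
p
```

Comparing the plot with `na_counts` is a quick sanity check that the missingness pattern is what you expect before deciding how to handle it.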

Advanced: for multiple imputation of missing values, see the mice and Amelia packages.