6.3 Nesting
Nesting creates a list-column of data frames; unnesting flattens it back out into regular columns. Nesting is an implicitly summarising operation: you get one row for each group defined by the non-nested columns. This is useful in conjunction with other summaries that work with whole datasets, most notably models.
A nested data frame is no more than a data frame with one (or more) list-columns of data frames, so you can create simple nested data frames by hand:
# df1 is a nested data frame
(df1 <- tibble(
  g = c(1, 2, 3),
  data = list(
    tibble(x = 1, y = 2),
    tibble(x = 4:5, y = 6:7),
    tibble(x = 10)
  )
))
#> # A tibble: 3 x 2
#> g data
#> <dbl> <list>
#> 1 1 <tibble [1 x 2]>
#> 2 2 <tibble [2 x 2]>
#> 3 3 <tibble [1 x 1]>
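Because the data column is an ordinary list, the inner data frames can be pulled out with the usual list tools. The following is a minimal sketch, assuming the tibble and purrr packages are attached; the object names inner and n_rows are only illustrative.

```r
library(tibble)
library(purrr)

df1 <- tibble(
  g = c(1, 2, 3),
  data = list(
    tibble(x = 1, y = 2),
    tibble(x = 4:5, y = 6:7),
    tibble(x = 10)
  )
)

# Extract the inner data frame stored in the second row (g == 2)
inner <- df1$data[[2]]

# Count the rows of every inner data frame
n_rows <- map_int(df1$data, nrow)
```

Double brackets return the inner tibble itself, whereas single brackets would return a list of length one.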
More commonly, we create nested data frames with tidyr::nest(). df %>% nest(data = c(x, y)) specifies the columns to be nested, i.e. the columns that will appear in the inner data frame. Alternatively, you can nest() a grouped data frame created by dplyr::group_by(). The grouping variables remain in the outer data frame and the others are nested. The result preserves the grouping of the input.
Variables supplied to nest() override grouping variables, so df %>% group_by(x, y) %>% nest(data = z) nests the same columns as df %>% nest(data = z).
df2 <- tribble(
  ~g, ~x, ~y,
   1,  1,  2,
   2,  4,  6,
   2,  5,  7,
   3, 10, NA
)
df2 %>% nest(data = c(x, y))
#> # A tibble: 3 x 2
#> g data
#> <dbl> <list>
#> 1 1 <tibble [1 x 2]>
#> 2 2 <tibble [2 x 2]>
#> 3 3 <tibble [1 x 2]>
# grouped nesting
df2 %>%
  group_by(g) %>%
  nest()
#> group_by: one grouping variable (g)
#> # A tibble: 3 x 2
#> # Groups: g [3]
#> g data
#> <dbl> <list>
#> 1 1 <tibble [1 x 2]>
#> 2 2 <tibble [2 x 2]>
#> 3 3 <tibble [1 x 2]>
# same nesting, but the result is not grouped
df2 %>%
  group_nest(g)
#> # A tibble: 3 x 2
#> g data
#> <dbl> <list>
#> 1 1 <tibble [1 x 2]>
#> 2 2 <tibble [2 x 2]>
#> 3 3 <tibble [1 x 2]>
Nesting is easiest to understand in connection with grouped data: each row in the output corresponds to one group in the input. We’ll see shortly that this is particularly convenient when you have other per-group objects.
The opposite of nest() is unnest(). You give it the name of a list-column containing data frames, and it row-binds the data frames together, repeating the outer columns the right number of times to line up.
df1 %>% unnest(data)
#> # A tibble: 4 x 3
#> g x y
#> <dbl> <dbl> <dbl>
#> 1 1 1 2
#> 2 2 4 6
#> 3 2 5 7
#> 4 3 10 NA
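As a quick check that unnest() undoes nest(), the sketch below nests a small data frame and flattens it again. It assumes tidyr and tibble are attached; the names df, nested and round_trip are made up for illustration.

```r
library(tibble)
library(tidyr)

df <- tribble(
  ~g, ~x, ~y,
   1,  1,  2,
   2,  4,  6,
   2,  5,  7
)

# Collapse x and y into a list-column: one row per value of g
nested <- df %>% nest(data = c(x, y))

# Row-bind the inner data frames again, repeating g as needed
round_trip <- nested %>% unnest(data)

# Compare the flattened result with the input
all.equal(round_trip, df)
```

The round trip is only this clean when every inner data frame has the same columns; in df1 above, the missing y in the third group is filled with NA instead.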
dplyr::group_split() puts the rows of each group into a separate tibble and returns them as a list, similar to base::split():
df2 %>% group_split(g)
#> [[1]]
#> # A tibble: 1 x 3
#> g x y
#> <dbl> <dbl> <dbl>
#> 1 1 1 2
#>
#> [[2]]
#> # A tibble: 2 x 3
#> g x y
#> <dbl> <dbl> <dbl>
#> 1 2 4 6
#> 2 2 5 7
#>
#> [[3]]
#> # A tibble: 1 x 3
#> g x y
#> <dbl> <dbl> <dbl>
#> 1 3 10 NA
#>
#> attr(,"ptype")
#> # A tibble: 0 x 3
#> # ... with 3 variables: g <dbl>, x <dbl>, y <dbl>
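One practical difference from base::split() is that group_split() returns an unnamed list. The sketch below compares the two, assuming dplyr, purrr and tibble are attached; group_keys() is used here to recover the group labels, and the object names pieces, keys and base_pieces are illustrative.

```r
library(dplyr)
library(purrr)
library(tibble)

df2 <- tribble(
  ~g, ~x, ~y,
   1,  1,  2,
   2,  4,  6,
   2,  5,  7,
   3, 10, NA
)

# group_split() returns an unnamed list of tibbles, one per group
pieces <- df2 %>% group_split(g)

# The group labels live in a separate tibble, in the same order
keys <- df2 %>% group_by(g) %>% group_keys()

# base::split() names the list elements after the grouping values instead
base_pieces <- split(df2, df2$g)
names(base_pieces)

# Either list can be iterated over, e.g. to count the rows per group
map_int(pieces, nrow)
```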
6.3.1 Example: Managing multiple models
Nested data is a great fit for problems where you have one of something for each group. A common place this arises is when you’re fitting multiple models.
gapminder <- gapminder::gapminder
gapminder_nest <- gapminder %>%
  mutate(year1950 = year - 1950) %>%
  group_nest(continent, country)
#> mutate: new variable 'year1950' with 12 unique values and 0% NA
gapminder_nest
#> # A tibble: 142 x 3
#> continent country data
#> <fct> <fct> <list>
#> 1 Africa Algeria <tibble [12 x 5]>
#> 2 Africa Angola <tibble [12 x 5]>
#> 3 Africa Benin <tibble [12 x 5]>
#> 4 Africa Botswana <tibble [12 x 5]>
#> 5 Africa Burkina Faso <tibble [12 x 5]>
#> 6 Africa Burundi <tibble [12 x 5]>
#> # ... with 136 more rows
Now gapminder_nest is a tibble with 142 rows, one per country, with each country's time series from 1952 to 2007 stored in the list-column data. We can then combine mutate() and map() to create a new column that holds a linear model fitted to each country:
mod_fit <- function(data) {
  lm(lifeExp ~ year1950, data = data)
}
gapminder_model <- gapminder_nest %>%
  mutate(model = map(data, mod_fit))
#> mutate: new variable 'model' with 142 unique values and 0% NA
gapminder_model
#> # A tibble: 142 x 4
#> continent country data model
#> <fct> <fct> <list> <list>
#> 1 Africa Algeria <tibble [12 x 5]> <lm>
#> 2 Africa Angola <tibble [12 x 5]> <lm>
#> 3 Africa Benin <tibble [12 x 5]> <lm>
#> 4 Africa Botswana <tibble [12 x 5]> <lm>
#> 5 Africa Burkina Faso <tibble [12 x 5]> <lm>
#> 6 Africa Burundi <tibble [12 x 5]> <lm>
#> # ... with 136 more rows
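Because model is just another list-column, the same mutate() plus map() pattern can pull a single number out of each fit. The sketch below repeats the steps above in one pipeline and then extracts the slope on year1950 with map_dbl(). The names gapminder_slopes and slope are illustrative choices, not part of the original analysis, and the gapminder package is assumed to be installed.

```r
library(dplyr)
library(purrr)

gapminder_slopes <- gapminder::gapminder %>%
  mutate(year1950 = year - 1950) %>%
  group_nest(continent, country) %>%
  mutate(
    # Fit one linear model per country
    model = map(data, ~ lm(lifeExp ~ year1950, data = .x)),
    # Pull the estimated yearly change in life expectancy out of each model
    slope = map_dbl(model, ~ coef(.x)[["year1950"]])
  ) %>%
  # Countries with the fastest estimated improvement come first
  arrange(desc(slope))
```

map_dbl() returns a plain numeric vector, so slope is a regular column that can be sorted and filtered without any unnesting.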
Then use broom functions to generate “tidy” model summaries:
gapminder_summary <- gapminder_model %>%
  mutate(
    glance = map(model, broom::glance),
    tidy = map(model, broom::tidy),
    augment = map(model, broom::augment)
  )
#> mutate: new variable 'glance' with 142 unique values and 0% NA
#> new variable 'tidy' with 142 unique values and 0% NA
#> new variable 'augment' with 142 unique values and 0% NA
gapminder_summary
#> # A tibble: 142 x 7
#> continent country data model glance tidy augment
#> <fct> <fct> <list> <list> <list> <list> <list>
#> 1 Africa Algeria <tibble [12~ <lm> <tibble [1 ~ <tibble [2~ <tibble [12~
#> 2 Africa Angola <tibble [12~ <lm> <tibble [1 ~ <tibble [2~ <tibble [12~
#> 3 Africa Benin <tibble [12~ <lm> <tibble [1 ~ <tibble [2~ <tibble [12~
#> 4 Africa Botswana <tibble [12~ <lm> <tibble [1 ~ <tibble [2~ <tibble [12~
#> 5 Africa Burkina F~ <tibble [12~ <lm> <tibble [1 ~ <tibble [2~ <tibble [12~
#> 6 Africa Burundi <tibble [12~ <lm> <tibble [1 ~ <tibble [2~ <tibble [12~
#> # ... with 136 more rows
Finally, unnest() each summary column as needed:
# which country has the best fit
gapminder_summary %>%
  unnest(glance) %>%
  arrange(desc(r.squared))
#> # A tibble: 142 x 17
#> continent country data model r.squared adj.r.squared sigma statistic p.value
#> <fct> <fct> <lis> <lis> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Americas Brazil <tib~ <lm> 0.998 0.998 0.326 5111. 6.99e-15
#> 2 Africa Maurit~ <tib~ <lm> 0.998 0.997 0.408 4290. 1.68e-14
#> 3 Europe France <tib~ <lm> 0.998 0.997 0.220 4200. 1.86e-14
#> 4 Europe Switze~ <tib~ <lm> 0.997 0.997 0.215 3823. 2.98e-14
#> 5 Asia Pakist~ <tib~ <lm> 0.997 0.997 0.403 3626. 3.88e-14
#> 6 Asia Indone~ <tib~ <lm> 0.997 0.997 0.646 3455. 4.93e-14
#> # ... with 136 more rows, and 8 more variables: df <int>, logLik <dbl>,
#> # AIC <dbl>, BIC <dbl>, deviance <dbl>, df.residual <int>, tidy <list>,
#> # augment <list>
gapminder_summary %>%
  unnest(tidy)
#> # A tibble: 284 x 11
#> continent country data model glance term estimate std.error statistic
#> <fct> <fct> <lis> <lis> <list> <chr> <dbl> <dbl> <dbl>
#> 1 Africa Algeria <tib~ <lm> <tibb~ (Int~ 42.2 0.756 55.8
#> 2 Africa Algeria <tib~ <lm> <tibb~ year~ 0.569 0.0221 25.7
#> 3 Africa Angola <tib~ <lm> <tibb~ (Int~ 31.7 0.804 39.4
#> 4 Africa Angola <tib~ <lm> <tibb~ year~ 0.209 0.0235 8.90
#> 5 Africa Benin <tib~ <lm> <tibb~ (Int~ 38.9 0.671 58.0
#> 6 Africa Benin <tib~ <lm> <tibb~ year~ 0.334 0.0196 17.0
#> # ... with 278 more rows, and 2 more variables: p.value <dbl>, augment <list>
gapminder_summary %>%
  unnest(augment)
#> # A tibble: 1,704 x 15
#> continent country data model glance tidy lifeExp year1950 .fitted .se.fit
#> <fct> <fct> <lis> <lis> <list> <lis> <dbl> <dbl> <dbl> <dbl>
#> 1 Africa Algeria <tib~ <lm> <tibb~ <tib~ 43.1 2 43.4 0.718
#> 2 Africa Algeria <tib~ <lm> <tibb~ <tib~ 45.7 7 46.2 0.627
#> 3 Africa Algeria <tib~ <lm> <tibb~ <tib~ 48.3 12 49.1 0.544
#> 4 Africa Algeria <tib~ <lm> <tibb~ <tib~ 51.4 17 51.9 0.472
#> 5 Africa Algeria <tib~ <lm> <tibb~ <tib~ 54.5 22 54.8 0.416
#> 6 Africa Algeria <tib~ <lm> <tibb~ <tib~ 58.0 27 57.6 0.386
#> # ... with 1,698 more rows, and 5 more variables: .resid <dbl>, .hat <dbl>,
#> # .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>
A similar case can be found in Section 9.1.
6.3.2 Example: Multiple choice data
multiple_choice <- tibble(method = c(
  "CNNs",
  "Bayesian, Logistic Regression",
  "Data Visualization, Decision Trees",
  "Linear Regression, A/B Testing",
  "Data Visualization, Text Analytics"
))
multiple_choice %>%
  mutate(method = str_split(method, ",")) %>%
  unnest(method)
#> mutate: converted 'method' from character to list (0 new NA)
#> # A tibble: 9 x 1
#> method
#> <chr>
#> 1 "CNNs"
#> 2 "Bayesian"
#> 3 " Logistic Regression"
#> 4 "Data Visualization"
#> 5 " Decision Trees"
#> 6 "Linear Regression"
#> # ... with 3 more rows
The trick here is that str_split() creates a list-column of character vectors, which unnest() then expands into one row per element. A more general function for this case, separate_rows(), is covered in Section 6.5.1.
We can then count() the methods and plot the ones mentioned most frequently.
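The counting and plotting step might look like the sketch below. It assumes dplyr, tidyr, stringr, tibble and ggplot2 are attached. Unlike the code above, it splits on a comma followed by optional whitespace (the pattern ",\\s*") so that entries such as " Logistic Regression" do not keep their leading space; that pattern, the name method_counts, and the choice of a bar chart are illustrative rather than taken from the original.

```r
library(dplyr)
library(tidyr)
library(stringr)
library(tibble)
library(ggplot2)

multiple_choice <- tibble(method = c(
  "CNNs",
  "Bayesian, Logistic Regression",
  "Data Visualization, Decision Trees",
  "Linear Regression, A/B Testing",
  "Data Visualization, Text Analytics"
))

method_counts <- multiple_choice %>%
  # Split on a comma plus any following whitespace to avoid leading spaces
  mutate(method = str_split(method, ",\\s*")) %>%
  # One row per mentioned method
  unnest(method) %>%
  # Tally the mentions, most frequent first
  count(method, sort = TRUE)

# Horizontal bar chart with the most frequent method at the top
ggplot(method_counts, aes(x = n, y = reorder(method, n))) +
  geom_col() +
  labs(x = "Mentions", y = NULL)
```

Without the whitespace handling, the same method could be counted under two different labels, one with and one without a leading space.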