# 11 God Spiked the Integers

The most common and useful generalized linear models are models for counts. Counts are non-negative integers–0, 1, 2, and so on. They are the basis of all mathematics, the first bits that children learn. But they are also intoxicatingly complicated to model–hence the apocryphal slogan that titles this chapter. The essential problem is this: When what we wish to predict is a count, the scale of the parameters is never the same as the scale of the outcome. A count golem, like a tide prediction engine, has a whirring machinery underneath that doesn’t resemble the output. Keeping the tide engine in mind, you can master these models and use them responsibly.

We will engineer complete examples of the two most common types of count model.

*Binomial regression* is the name we’ll use for a family of related procedures that all model a binary classification–alive/dead, accept/reject, left/right–for which the total of both categories is known. This is like the marble and globe tossing examples from Chapter 2. But now you get to incorporate predictor variables. *Poisson regression* is a GLM that models a count with an unknown maximum—number of elephants in Kenya, number of applications to a PhD program, number of significance tests in an issue of *Psychological Science*. As described in Chapter 10, the Poisson model is a special case of binomial. At the end, the chapter describes some other count regressions. (McElreath, 2020a, p. 323, *emphasis* in the original)

In this chapter, we focus on the two most common types of count models: the binomial and the Poisson.

## 11.1 Binomial regression

The basic binomial model follows the form

\[y \sim \operatorname{Binomial}(n, p),\]

where \(y\) is some count variable, \(n\) is the number of trials, and \(p\) is the probability a given trial was a 1, which is sometimes termed a *success*. When \(n = 1\), then \(y\) is a vector of 0’s and 1’s. Presuming the logit link, which we just covered in Chapter 10, models of this type are commonly termed logistic regression. When \(n > 1\), and still presuming the logit link, we might call our model an aggregated logistic regression model, or more generally an aggregated binomial regression model.
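To get a feel for the logit link in base R, note that `qlogis()` is the logit and `plogis()` is its inverse, so probabilities map onto the log-odds scale and back without loss. A minimal sketch:

```r
# the logit link maps probabilities in (0, 1) onto the whole real line
p <- 0.75
log_odds <- qlogis(p)        # logit: log(p / (1 - p))
p_back   <- plogis(log_odds) # inverse logit recovers p

log_odds  # log(3), about 1.1
p_back    # 0.75
```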

### 11.1.2 Relative shark and absolute deer.

Based on the full model, `b11.4`, here’s how you might compute the posterior mean and 95% intervals for the proportional odds of switching from `treatment == 2` to `treatment == 4`.

```
posterior_samples(b11.4) %>%
mutate(proportional_odds = exp(b_b_treatment4 - b_b_treatment2)) %>%
mean_qi(proportional_odds)
```

```
## proportional_odds .lower .upper .width .point .interval
## 1 0.9315351 0.5183732 1.536444 0.95 mean qi
```

On average, the switch multiplies the odds of pulling the left lever by 0.92, an 8% reduction in odds. This is what is meant by proportional odds. The new odds are calculated by taking the old odds and multiplying them by the proportional odds, which is 0.92 in this example. (p. 336)
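To sanity-check that arithmetic with a made-up baseline (the 0.6 probability below is my own choice for illustration, not from the text), we can multiply some old odds by the 0.92 proportional odds and convert back to a probability:

```r
p_old    <- 0.6                      # hypothetical baseline probability
odds_old <- p_old / (1 - p_old)      # convert to odds: 1.5
odds_new <- odds_old * 0.92          # apply the proportional odds
p_new    <- odds_new / (1 + odds_new) # back to a probability

p_new  # about 0.58, a modest drop from 0.6
```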

A limitation of relative measures like proportional odds is they ignore what you might think of as the reference or the baseline.

Consider for example a rare disease which occurs in 1 per 10-million people. Suppose also that reading this textbook increased the odds of the disease 5-fold. That would mean approximately 4 more cases of the disease per 10-million people. So only 5-in-10-million chance now. The book is safe. (p. 336)

Here that is in code.

```
tibble(disease_rate = 1/1e7,
fold_increase = 5) %>%
mutate(new_disease_rate = disease_rate * fold_increase)
```

```
## # A tibble: 1 x 3
## disease_rate fold_increase new_disease_rate
## <dbl> <dbl> <dbl>
## 1 0.0000001 5 0.0000005
```

The hard part, though, is that “neither absolute nor relative risk is sufficient for all purposes” (p. 337). Each provides its own unique perspective on the data. Again, welcome to applied statistics. 🤷‍♂️

### 11.1.3 Aggregated binomial: Chimpanzees again, condensed.

With the **tidyverse**, we can use `group_by()` and `summarise()` to achieve what McElreath did with `aggregate()`.

```
d_aggregated <-
  d %>%
  group_by(treatment, actor, side, cond) %>%
  summarise(left_pulls = sum(pulled_left)) %>%
  ungroup()

d_aggregated %>%
  head(n = 8)
```

```
## # A tibble: 8 x 5
## treatment actor side cond left_pulls
## <fct> <fct> <fct> <fct> <int>
## 1 1 1 1 1 6
## 2 1 2 1 1 18
## 3 1 3 1 1 5
## 4 1 4 1 1 6
## 5 1 5 1 1 6
## 6 1 6 1 1 14
## 7 1 7 1 1 14
## 8 2 1 2 1 9
```

To fit an aggregated binomial model with **brms**, we use the `<criterion> | trials()` syntax, where the value that goes in `trials()` is either a fixed number, as in this case, or a variable in the data indexing \(n\). Either way, at least some of those trials will have an \(n > 1\). Here we’ll use the hard-code method, just like McElreath did in the text.

```
b11.6 <-
  brm(data = d_aggregated,
      family = binomial,
      bf(left_pulls | trials(18) ~ a + b,
         a ~ 0 + actor,
         b ~ 0 + treatment,
         nl = TRUE),
      prior = c(prior(normal(0, 1.5), nlpar = a),
                prior(normal(0, 0.5), nlpar = b)),
      iter = 2000, warmup = 1000, chains = 4, cores = 4,
      seed = 11,
      file = "fits/b11.06")
```

Check the posterior summary.

`print(b11.6)`

```
## Family: binomial
## Links: mu = logit
## Formula: left_pulls | trials(18) ~ a + b
## a ~ 0 + actor
## b ~ 0 + treatment
## Data: d_aggregated (Number of observations: 28)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## a_actor1 -0.44 0.33 -1.08 0.20 1.00 1323 2122
## a_actor2 3.89 0.77 2.56 5.61 1.00 3794 2058
## a_actor3 -0.74 0.34 -1.43 -0.09 1.00 1220 2095
## a_actor4 -0.75 0.34 -1.43 -0.08 1.00 1265 1944
## a_actor5 -0.44 0.34 -1.11 0.21 1.01 1263 1958
## a_actor6 0.49 0.34 -0.16 1.13 1.00 1357 2446
## a_actor7 1.96 0.42 1.16 2.82 1.00 1671 2621
## b_treatment1 -0.05 0.29 -0.59 0.53 1.00 1072 1942
## b_treatment2 0.48 0.29 -0.08 1.05 1.00 1076 1749
## b_treatment3 -0.39 0.29 -0.94 0.17 1.01 1091 1875
## b_treatment4 0.36 0.29 -0.19 0.93 1.00 984 1933
##
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```

It might be easiest to compare `b11.4` and `b11.6` with a coefficient plot.

```
# this is just for fancy annotation
text <-
  tibble(value = c(1.4, 2.6),
         name  = "b_a_actor7",
         fit   = c("b11.6", "b11.4"))

# rope in the posterior draws and wrangle
bind_rows(posterior_samples(b11.4),
          posterior_samples(b11.6)) %>%
  mutate(fit = rep(c("b11.4", "b11.6"), each = n() / 2)) %>%
  pivot_longer(b_a_actor1:b_b_treatment4) %>%

  # plot
  ggplot(aes(x = value, y = name, color = fit)) +
  stat_pointinterval(.width = .95, size = 2/3,
                     position = position_dodge(width = 0.5)) +
  scale_color_manual(values = wes_palette("Moonrise2")[2:1]) +
  geom_text(data = text,
            aes(label = fit),
            family = "Times", position = position_dodge(width = 2.25)) +
  labs(x = "posterior (log-odds scale)",
       y = NULL) +
  theme(axis.ticks.y = element_blank(),
        legend.position = "none")
```

Did you catch our `position = position_dodge()` tricks? Try executing the plot without those parts of the code to get a sense of what they did. Now compute and save the PSIS-LOO estimates for the two models so we might compare them.

```
b11.4 <- add_criterion(b11.4, "loo")
b11.6 <- add_criterion(b11.6, "loo")
```

Here’s how we might attempt the comparison.

`loo_compare(b11.4, b11.6, criterion = "loo") %>% print(simplify = F)`

Unlike with McElreath’s `compare()` code in the text, `loo_compare()` wouldn’t even give us the results. All we get is the warning message informing us that because these two models are not based on the same data, comparing them with the LOO is invalid and **brms** refuses to let us do it. We can, however, look at their LOO summaries separately.

`loo(b11.4)`

```
##
## Computed from 4000 by 504 log-likelihood matrix
##
## Estimate SE
## elpd_loo -266.0 9.4
## p_loo 8.3 0.4
## looic 532.0 18.9
## ------
## Monte Carlo SE of elpd_loo is 0.0.
##
## All Pareto k estimates are good (k < 0.5).
## See help('pareto-k-diagnostic') for details.
```

`loo(b11.6)`

```
##
## Computed from 4000 by 28 log-likelihood matrix
##
## Estimate SE
## elpd_loo -57.7 4.3
## p_loo 8.9 1.7
## looic 115.3 8.6
## ------
## Monte Carlo SE of elpd_loo is NA.
##
## Pareto k diagnostic values:
## Count Pct. Min. n_eff
## (-Inf, 0.5] (good) 17 60.7% 1412
## (0.5, 0.7] (ok) 8 28.6% 768
## (0.7, 1] (bad) 3 10.7% 243
## (1, Inf) (very bad) 0 0.0% <NA>
## See help('pareto-k-diagnostic') for details.
```

To understand what’s going on, consider how you might describe six 1’s out of nine trials in the aggregated form,

\[\text{Pr}(6|9, p) = \frac{9!}{6!(9 - 6)!} p^6 (1 - p)^{9 - 6}.\]

If we still stick with the same data, but this time re-express those as nine dichotomous data points, we now describe their joint probability as

\[\text{Pr}(1, 1, 1, 1, 1, 1, 0, 0, 0 | p) = p^6 (1 - p)^{9 - 6}.\]

Let’s work this out in code.

```
# deviance of aggregated 6-in-9
-2 * dbinom(6, size = 9, prob = 0.2, log = TRUE)
```

`## [1] 11.79048`

```
# deviance of dis-aggregated
-2 * sum(dbinom(c(1, 1, 1, 1, 1, 1, 0, 0, 0), size = 1, prob = 0.2, log = TRUE))
```

`## [1] 20.65212`

But this difference is entirely meaningless. It is just a side effect of how we organized the data. The posterior distribution for the probability of success on each trial will end up the same, either way. (p. 339)
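We can verify the gap is just that multiplicity constant: the two log likelihoods differ only by the log of the binomial coefficient, \(\binom{9}{6} = 84\), which does not involve \(p\) at all.

```r
# deviance of aggregated 6-in-9
d_agg <- -2 * dbinom(6, size = 9, prob = 0.2, log = TRUE)

# deviance of dis-aggregated
d_dis <- -2 * sum(dbinom(c(1, 1, 1, 1, 1, 1, 0, 0, 0), size = 1, prob = 0.2, log = TRUE))

# the gap equals 2 * log(choose(9, 6)), no matter what prob is
d_dis - d_agg
2 * lchoose(9, 6)
```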

This is what our coefficient plot showed us, above. The posterior distribution was the same within simulation variance for `b11.4` and `b11.6`. Just like McElreath reported in the text, we also got a warning about high Pareto \(k\) values from the aggregated binomial model, `b11.6`. To access the message and its associated table directly, we can feed the results of `loo()` into the `loo::pareto_k_table()` function.

```
loo(b11.6) %>%
  loo::pareto_k_table()
```

```
## Pareto k diagnostic values:
## Count Pct. Min. n_eff
## (-Inf, 0.5] (good) 17 60.7% 1412
## (0.5, 0.7] (ok) 8 28.6% 768
## (0.7, 1] (bad) 3 10.7% 243
## (1, Inf) (very bad) 0 0.0% <NA>
```

Before looking at the Pareto \(k\) values, you might have noticed already that we didn’t get a similar warning before in the disaggregated logistic models of the same data. Why not? Because when we aggregated the data by actor-treatment, we forced PSIS (and WAIC) to imagine cross-validation that leaves out all 18 observations in each actor-treatment combination. So instead of leave-one-out cross-validation, it is more like leave-eighteen-out. This makes some observations more influential, because they are really now 18 observations.

What’s the bottom line? If you want to calculate WAIC or PSIS, you should use a logistic regression data format, not an aggregated format. Otherwise you are implicitly assuming that only large chunks of the data are separable. (p. 340)

### 11.1.4 Aggregated binomial: Graduate school admissions.

Load the infamous `UCBadmit` data (see Bickel et al., 1975).

```
data(UCBadmit, package = "rethinking")
d <- UCBadmit
rm(UCBadmit)

d
```

```
## dept applicant.gender admit reject applications
## 1 A male 512 313 825
## 2 A female 89 19 108
## 3 B male 353 207 560
## 4 B female 17 8 25
## 5 C male 120 205 325
## 6 C female 202 391 593
## 7 D male 138 279 417
## 8 D female 131 244 375
## 9 E male 53 138 191
## 10 E female 94 299 393
## 11 F male 22 351 373
## 12 F female 24 317 341
```

Now compute our new index variable, `gid`. We’ll also slip in a `case` variable that saves the row numbers as a factor. That’ll come in handy later when we plot.

```
d <-
  d %>%
  mutate(gid  = factor(applicant.gender, levels = c("male", "female")),
         case = factor(1:n()))
```

Note the difference in how we defined our `gid`. Whereas McElreath used numeral indices, we retained the text within an ordered factor. **brms** can handle either approach just fine. The advantage of the factor approach is it will be easier to understand the output. You’ll see in just a bit.

The univariable logistic model with `male` as the sole predictor of `admit` follows the form

\[\begin{align*} \text{admit}_i & \sim \operatorname{Binomial}(n_i, p_i) \\ \text{logit}(p_i) & = \alpha_{\text{gid}[i]} \\ \alpha_j & \sim \operatorname{Normal}(0, 1.5), \end{align*}\]

where \(n_i = \text{applications}_i\), the rows are indexed by \(i\), and the two levels of \(\text{gid}\) are indexed by \(j\). Since we’re only using our index variable `gid`

to model two intercepts with no further complications, we don’t need to use the verbose non-linear syntax to fit this model with **brms**.

```
b11.7 <-
  brm(data = d,
      family = binomial,
      admit | trials(applications) ~ 0 + gid,
      prior(normal(0, 1.5), class = b),
      iter = 2000, warmup = 1000, cores = 4, chains = 4,
      seed = 11,
      file = "fits/b11.07")
```

`print(b11.7)`

```
## Family: binomial
## Links: mu = logit
## Formula: admit | trials(applications) ~ 0 + gid
## Data: d (Number of observations: 12)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## gidmale -0.22 0.04 -0.30 -0.14 1.00 2684 2468
## gidfemale -0.83 0.05 -0.94 -0.73 1.00 3289 2318
##
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```

Our results are very similar to those in the text. But notice how our two rows have more informative row names than `a[1]` and `a[2]`. This is why you might consider using the ordered factor approach rather than using numeral indices.

Anyway, here we’ll compute the difference score in two metrics and summarize them with a little help from `mean_qi()`.

```
posterior_samples(b11.7) %>%
mutate(diff_a = b_gidmale - b_gidfemale,
diff_p = inv_logit_scaled(b_gidmale) - inv_logit_scaled(b_gidfemale)) %>%
pivot_longer(contains("diff")) %>%
group_by(name) %>%
mean_qi(value, .width = .89)
```

```
## # A tibble: 2 x 7
## name value .lower .upper .width .point .interval
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 diff_a 0.609 0.504 0.715 0.89 mean qi
## 2 diff_p 0.141 0.118 0.165 0.89 mean qi
```
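As a rough hand check, we can plug the posterior means from the `print(b11.7)` output above (rather than the full draws) into `plogis()`, and we land close to the `diff_p` summary:

```r
# posterior means for the two intercepts from print(b11.7), logit scale
a_male   <- -0.22
a_female <- -0.83

# difference on the probability scale, close to diff_p above
plogis(a_male) - plogis(a_female)  # about 0.14
```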

**brms** doesn’t have a convenience function that works quite like `rethinking::postcheck()`. But we have options, the most handy of which in this case is probably `predict()`.

```
p <-
  predict(b11.7) %>%
  data.frame() %>%
  bind_cols(d)

text <-
  d %>%
  group_by(dept) %>%
  summarise(case  = mean(as.numeric(case)),
            admit = mean(admit / applications) + .05)

p %>%
  ggplot(aes(x = case, y = admit / applications)) +
  geom_pointrange(aes(y = Estimate / applications,
                      ymin = Q2.5 / applications,
                      ymax = Q97.5 / applications),
                  color = wes_palette("Moonrise2")[1],
                  shape = 1, alpha = 1/3) +
  geom_point(color = wes_palette("Moonrise2")[2]) +
  geom_line(aes(group = dept),
            color = wes_palette("Moonrise2")[2]) +
  geom_text(data = text,
            aes(y = admit, label = dept),
            color = wes_palette("Moonrise2")[2],
            family = "serif") +
  scale_y_continuous("Proportion admitted", limits = 0:1) +
  ggtitle("Posterior validation check") +
  theme(axis.ticks.x = element_blank())
```

Sometimes a fit this bad is the result of a coding mistake. In this case, it is not. The model did correctly answer the question we asked of it: *What are the average probabilities of admission for women and men, across all departments?* The problem in this case is that men and women did not apply to the same departments, and departments vary in their rates of admission. This makes the answer misleading…. Instead of asking *“What are the average probabilities of admission for women and men across all departments?”* we want to ask *“What is the average difference in probability of admission between women and men within departments?”* (pp. 342–343, *emphasis* in the original)

The model better suited to answer that question follows the form

\[\begin{align*} \text{admit}_i & \sim \operatorname{Binomial} (n_i, p_i) \\ \text{logit}(p_i) & = \alpha_{\text{gid}[i]} + \delta_{\text{dept}[i]} \\ \alpha_j & \sim \operatorname{Normal} (0, 1.5) \\ \delta_k & \sim \operatorname{Normal} (0, 1.5), \end{align*}\]

where departments are indexed by \(k\). To fit a model including two index variables like this in **brms**, we’ll need to switch back to the non-linear syntax. Though if you’d like to see an analogous approach using conventional **brms** syntax, check out model `b10.9` in Section 10.1.3 of my translation of McElreath’s first edition.

```
b11.8 <-
  brm(data = d,
      family = binomial,
      bf(admit | trials(applications) ~ a + d,
         a ~ 0 + gid,
         d ~ 0 + dept,
         nl = TRUE),
      prior = c(prior(normal(0, 1.5), nlpar = a),
                prior(normal(0, 1.5), nlpar = d)),
      iter = 4000, warmup = 1000, cores = 4, chains = 4,
      seed = 11,
      file = "fits/b11.08")
```

`print(b11.8)`

```
## Family: binomial
## Links: mu = logit
## Formula: admit | trials(applications) ~ a + d
## a ~ 0 + gid
## d ~ 0 + dept
## Data: d (Number of observations: 12)
## Samples: 4 chains, each with iter = 4000; warmup = 1000; thin = 1;
## total post-warmup samples = 12000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## a_gidmale -0.56 0.53 -1.64 0.45 1.00 1014 1353
## a_gidfemale -0.46 0.53 -1.55 0.56 1.00 1013 1399
## d_deptA 1.14 0.53 0.12 2.22 1.00 1020 1390
## d_deptB 1.10 0.54 0.08 2.19 1.00 1024 1431
## d_deptC -0.12 0.53 -1.15 0.97 1.00 1024 1402
## d_deptD -0.15 0.53 -1.17 0.93 1.00 1022 1447
## d_deptE -0.59 0.54 -1.63 0.50 1.00 1038 1495
## d_deptF -2.15 0.54 -3.19 -1.06 1.00 1057 1566
##
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```

Like with the earlier model, here we compute the difference score for \(\alpha\) in two metrics.

```
posterior_samples(b11.8) %>%
mutate(diff_a = b_a_gidmale - b_a_gidfemale,
diff_p = inv_logit_scaled(b_a_gidmale) - inv_logit_scaled(b_a_gidfemale)) %>%
pivot_longer(contains("diff")) %>%
group_by(name) %>%
mean_qi(value, .width = .89)
```

```
## # A tibble: 2 x 7
## name value .lower .upper .width .point .interval
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 diff_a -0.0968 -0.227 0.0320 0.89 mean qi
## 2 diff_p -0.0215 -0.0514 0.00693 0.89 mean qi
```

Why did adding departments to the model change the inference about gender so much? The earlier figure gives you a hint–the rates of admission vary a lot across departments. Furthermore, women and men applied to different departments. Let’s do a quick tabulation to show that: (p. 344)

Here’s our **tidyverse**-style tabulation of the proportions of applicants in each department by `gid`.

```
d %>%
  group_by(dept) %>%
  mutate(proportion = applications / sum(applications)) %>%
  select(dept, gid, proportion) %>%
  pivot_wider(names_from = dept,
              values_from = proportion) %>%
  mutate_if(is.double, round, digits = 2)
```

```
## # A tibble: 2 x 7
## gid A B C D E F
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 male 0.88 0.96 0.35 0.53 0.33 0.52
## 2 female 0.12 0.04 0.65 0.47 0.67 0.48
```

To make it even easier to see, we’ll depict it in a tile plot.

```
d %>%
  group_by(dept) %>%
  mutate(proportion = applications / sum(applications)) %>%
  mutate(label = round(proportion, digits = 2),
         gid   = fct_rev(gid)) %>%
  ggplot(aes(x = dept, y = gid, fill = proportion, label = label)) +
  geom_tile() +
  geom_text(aes(color = proportion > .25),
            family = "serif") +
  scale_fill_gradient(low = wes_palette("Moonrise2")[4],
                      high = wes_palette("Moonrise2")[1],
                      limits = c(0, 1)) +
  scale_color_manual(values = wes_palette("Moonrise2")[c(1, 4)]) +
  scale_x_discrete(NULL, position = "top") +
  ylab(NULL) +
  theme(axis.text.y = element_text(hjust = 0),
        axis.ticks = element_blank(),
        legend.position = "none")
```

As it turns out, “The departments with a larger proportion of women applicants are also those with lower overall admissions rates” (p. 344). If we presume gender influences both choice of department and admission rates, we might depict that in a simple DAG where \(G\) is applicant gender, \(D\) is department, and \(A\) is acceptance into grad school.

```
library(ggdag)

dag_coords <-
  tibble(name = c("G", "D", "A"),
         x    = c(1, 2, 3),
         y    = c(1, 2, 1))

dagify(D ~ G,
       A ~ D + G,
       coords = dag_coords) %>%
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_text(color = wes_palette("Moonrise2")[4], family = "serif") +
  geom_dag_edges(edge_color = wes_palette("Moonrise2")[4]) +
  scale_x_continuous(NULL, breaks = NULL) +
  scale_y_continuous(NULL, breaks = NULL)
```

Although our `b11.8` model did not contain a parameter corresponding to the \(G \rightarrow D\) pathway, it did condition on both \(G\) and \(D\). If we make another figure like Figure 11.5, we’ll see conditioning on both substantially improved the posterior predictive distribution.

```
predict(b11.8) %>%
data.frame() %>%
bind_cols(d) %>%
ggplot(aes(x = case, y = admit / applications)) +
geom_pointrange(aes(y = Estimate / applications,
ymin = Q2.5 / applications ,
ymax = Q97.5 / applications),
color = wes_palette("Moonrise2")[1],
shape = 1, alpha = 1/3) +
geom_point(color = wes_palette("Moonrise2")[2]) +
geom_line(aes(group = dept),
color = wes_palette("Moonrise2")[2]) +
geom_text(data = text,
aes(y = admit, label = dept),
color = wes_palette("Moonrise2")[2],
family = "serif") +
scale_y_continuous("Proportion admitted", limits = 0:1) +
labs(title = "Posterior validation check",
subtitle = "Though imperfect, this model is a big improvement") +
theme(axis.ticks.x = element_blank())
```

Here’s the DAG that proposes an unobserved confound, \(U\), that might better explain the \(D \rightarrow A\) pathway.

```
dag_coords <-
  tibble(name = c("G", "D", "A", "U"),
         x    = c(1, 2, 3, 3),
         y    = c(1, 2, 1, 2))

dagify(D ~ G + U,
       A ~ D + G + U,
       coords = dag_coords) %>%
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_point(x = 3, y = 2,
             size = 5, color = wes_palette("Moonrise2")[2]) +
  geom_dag_text(color = wes_palette("Moonrise2")[4], family = "serif") +
  geom_dag_edges(edge_color = wes_palette("Moonrise2")[4]) +
  scale_x_continuous(NULL, breaks = NULL) +
  scale_y_continuous(NULL, breaks = NULL)
```

McElreath recommended we look at the `pairs()` plot to get a sense of how highly correlated the parameters in our `b11.8` model are. Why not get a little extra about it and use custom settings for the upper triangle, the diagonal, and the lower triangle with a `GGally::ggpairs()` plot? First we save our custom settings.

```
my_upper <- function(data, mapping, ...) {

  # get the x and y data to use the other code
  x <- eval_data_col(data, mapping$x)
  y <- eval_data_col(data, mapping$y)

  r  <- unname(cor.test(x, y)$estimate)
  rt <- format(r, digits = 2)[1]
  tt <- as.character(rt)

  # plot the cor value
  ggally_text(
    label = tt,
    mapping = aes(),
    size = 4,
    color = wes_palette("Moonrise2")[4],
    alpha = 4/5,
    family = "Times") +
    theme_void()
}

my_diag <- function(data, mapping, ...) {
  ggplot(data = data, mapping = mapping) +
    geom_density(fill = wes_palette("Moonrise2")[2], size = 0) +
    theme_void()
}

my_lower <- function(data, mapping, ...) {
  ggplot(data = data, mapping = mapping) +
    geom_point(color = wes_palette("Moonrise2")[1],
               size = 1/10, alpha = 1/10) +
    theme_void()
}
```

To learn more about the nature of the code for the `my_upper()` function, check out Issue #139 in the **GGally** GitHub repository. Here is the plot.

```
library(GGally)
posterior_samples(b11.8) %>%
select(-lp__) %>%
set_names(c("alpha[male]", "alpha[female]", str_c("delta[", LETTERS[1:6], "]"))) %>%
ggpairs(upper = list(continuous = my_upper),
diag = list(continuous = my_diag),
lower = list(continuous = my_lower),
labeller = "label_parsed") +
labs(title = "Model: b11.8",
subtitle = "The parameters are strongly correlated.") +
theme(strip.text = element_text(size = 11))
```

Why might we want to over-parameterize the model? Because it makes it easier to assign priors. If we made one of the genders baseline and measured the other as a deviation from it, we would stumble into the issue of assuming that the acceptance rate for one of the genders is pre-data more uncertain than the other. This isn’t to say that over-parameterizing a model is always a good idea. But it isn’t a violation of any statistical principle. You can always convert the posterior, post sampling, to any alternative parameterization. The only limitation is whether the algorithm we use to approximate the posterior can handle the high correlations. In this case, it can. (p. 345)

#### 11.1.4.1 Rethinking: Simpson’s paradox is not a paradox.

This empirical example is a famous one in statistical teaching. It is often used to illustrate a phenomenon known as *Simpson’s paradox*. Like most paradoxes, there is no violation of logic, just of intuition. And since different people have different intuition, Simpson’s paradox means different things to different people. The poor intuition being violated in this case is that a positive association in the entire population should also hold within each department. (p. 345, *emphasis* in the original)

In my field of clinical psychology, Simpson’s paradox is an important, if under-appreciated, phenomenon. If you’re in the social sciences as well, I highly recommend spending more time thinking about it. To get you started, I blogged about it here and Kievit et al. (2013) wrote a great tutorial paper called *Simpson’s paradox in psychological science: a practical guide*.

## 11.2 Poisson regression

When a binomial distribution has a very small probability of an event \(p\) and a very large number of trials \(N\), then it takes on a special shape. The expected value of a binomial distribution is just \(Np\), and its variance is \(Np(1 - p)\). But when \(N\) is very large and \(p\) is very small, then these are approximately the same. (p. 346)
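A minimal check of that limit: with \(N = 1000\) and \(p = 1/1000\), binomial probabilities are nearly indistinguishable from Poisson probabilities with \(\lambda = Np = 1\).

```r
n <- 1000
p <- 1 / 1000

# compare the binomial and Poisson pmfs at a few counts
dbinom(0:3, size = n, prob = p)
dpois(0:3, lambda = n * p)
```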

Data of this kind are often called count data. Here we simulate some.

```
set.seed(11)
tibble(y = rbinom(1e5, 1000, 1/1000)) %>%
summarise(y_mean = mean(y),
y_variance = var(y))
```

```
## # A tibble: 1 x 2
## y_mean y_variance
## <dbl> <dbl>
## 1 1.00 1.01
```

Yes, those statistics are virtually the same. When dealing with pure Poisson data, \(\mu = \sigma^2\). When the number of trials \(n\) is unknown or much larger than seen in the data, the Poisson likelihood is a useful tool. We define it as

\[y_i \sim \text{Poisson}(\lambda),\]

where \(\lambda\) expresses both mean and variance because within this model, the variance scales right along with the mean. Since \(\lambda\) is constrained to be positive, we typically use the log link. Thus the basic Poisson regression model is

\[\begin{align*} y_i & \sim \operatorname{Poisson}(\lambda_i) \\ \log(\lambda_i) & = \alpha + \beta (x_i - \bar x), \end{align*}\]

where all model parameters receive priors following the forms we’ve been practicing.
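The role of the log link here is simply to keep \(\lambda\) positive no matter what values \(\alpha\) and \(\beta\) take. A quick sketch with arbitrary parameter values of my own choosing (not from the text):

```r
# hypothetical parameter values, for illustration only
alpha <- 0.5
beta  <- -2
x     <- seq(from = -3, to = 3, length.out = 7)

# the log link guarantees lambda > 0, even with a large negative slope
lambda <- exp(alpha + beta * (x - mean(x)))
all(lambda > 0)  # TRUE by construction
```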

### 11.2.1 Example: Oceanic tool complexity.

Load the `Kline` data (see Kline & Boyd, 2010).

```
data(Kline, package = "rethinking")
d <- Kline
rm(Kline)

d
```

```
## culture population contact total_tools mean_TU
## 1 Malekula 1100 low 13 3.2
## 2 Tikopia 1500 low 22 4.7
## 3 Santa Cruz 3600 low 24 4.0
## 4 Yap 4791 high 43 5.0
## 5 Lau Fiji 7400 high 33 5.0
## 6 Trobriand 8000 high 19 4.0
## 7 Chuuk 9200 high 40 3.8
## 8 Manus 13000 low 28 6.6
## 9 Tonga 17500 high 55 5.4
## 10 Hawaii 275000 low 71 6.6
```

Here are our new columns.

```
d <-
  d %>%
  mutate(log_pop_std = (log(population) - mean(log(population))) / sd(log(population)),
         cid         = contact)
```

Our statistical model will follow the form

\[\begin{align*} \text{total_tools}_i & \sim \operatorname{Poisson}(\lambda_i) \\ \log(\lambda_i) & = \alpha_{\text{cid}[i]} + \beta_{\text{cid}[i]} \text{log_pop_std}_i \\ \alpha_j & \sim \; ? \\ \beta_j & \sim \; ?, \end{align*}\]

where the priors for \(\alpha_j\) and \(\beta_j\) have yet to be defined. If we continue our convention of using a Normal prior on the \(\alpha\) parameters, we should recognize those will be log-Normal distributed on the outcome scale. Why? Because we’re modeling \(\lambda\) with the log link. Here’s our version of Figure 11.7, depicting the two log-Normal priors considered in the text.

```
tibble(x = c(3, 22),
y = c(0.055, 0.04),
meanlog = c(0, 3),
sdlog = c(10, 0.5)) %>%
expand(nesting(x, y, meanlog, sdlog),
number = seq(from = 0, to = 100, length.out = 200)) %>%
mutate(density = dlnorm(number, meanlog, sdlog),
group = str_c("alpha%~%Normal(", meanlog, ", ", sdlog, ")")) %>%
ggplot(aes(fill = group, color = group)) +
geom_area(aes(x = number, y = density),
alpha = 3/4, size = 0, position = "identity") +
geom_text(data = . %>% group_by(group) %>% slice(1),
aes(x = x, y = y, label = group),
family = "Times", parse = T, hjust = 0) +
scale_fill_manual(values = wes_palette("Moonrise2")[1:2]) +
scale_color_manual(values = wes_palette("Moonrise2")[1:2]) +
scale_y_continuous(NULL, breaks = NULL) +
xlab("mean number of tools") +
theme(legend.position = "none")
```

In this context, \(\alpha \sim \operatorname{Normal}(0, 10)\) has a very long tail on the outcome scale. The mean of the log-Normal distribution, recall, is \(\exp (\mu + \sigma^2/2)\). Here that is in code.

`exp(0 + 10^2 / 2)`

`## [1] 5.184706e+21`

That is very large. Here’s the same thing in a simulation.

```
set.seed(11)
rnorm(1e4, 0, 10) %>%
exp() %>%
mean()
```

`## [1] 1.61276e+12`

Now compute the mean for the other prior under consideration, \(\alpha \sim \operatorname{Normal}(3, 0.5)\).

`exp(3 + 0.5^2 / 2)`

`## [1] 22.7599`

This is much smaller and more reasonable. In case you were curious, here are the same priors, this time on the scale of \(\lambda\).

```
tibble(x = c(10, 4),
y = c(0.05, 0.5),
mean = c(0, 3),
sd = c(10, 0.5)) %>%
expand(nesting(x, y, mean, sd),
number = seq(from = -25, to = 25, length.out = 500)) %>%
mutate(density = dnorm(number, mean, sd),
group = str_c("alpha%~%Normal(", mean, ", ", sd, ")")) %>%
ggplot(aes(fill = group, color = group)) +
geom_area(aes(x = number, y = density),
alpha = 3/4, size = 0, position = "identity") +
geom_text(data = . %>% group_by(group) %>% slice(1),
aes(x = x, y = y, label = group),
family = "Times", parse = T, hjust = 0) +
scale_fill_manual(values = wes_palette("Moonrise2")[1:2]) +
scale_color_manual(values = wes_palette("Moonrise2")[1:2]) +
scale_y_continuous(NULL, breaks = NULL) +
xlab(expression(lambda~scale)) +
theme(legend.position = "none")
```

Now let’s prepare to make the top row of Figure 11.8. In this portion of the figure, we consider the implications of two competing priors for \(\beta\) while holding the prior for \(\alpha\) at \(\operatorname{Normal}(3, 0.5)\). The two \(\beta\) priors under consideration are \(\operatorname{Normal}(0, 10)\) and \(\operatorname{Normal}(0, 0.2)\).

```
set.seed(11)

# how many lines would you like?
n <- 100

# simulate and wrangle
tibble(i = 1:n,
       a = rnorm(n, mean = 3, sd = 0.5)) %>%
  mutate(`beta%~%Normal(0*', '*10)`  = rnorm(n, mean = 0, sd = 10),
         `beta%~%Normal(0*', '*0.2)` = rnorm(n, mean = 0, sd = 0.2)) %>%
  pivot_longer(contains("beta"),
               values_to = "b",
               names_to = "prior") %>%
  expand(nesting(i, a, b, prior),
         x = seq(from = -2, to = 2, length.out = 100)) %>%

  # plot
  ggplot(aes(x = x, y = exp(a + b * x), group = i)) +
  geom_line(size = 1/4, alpha = 2/3,
            color = wes_palette("Moonrise2")[4]) +
  labs(x = "log population (std)",
       y = "total tools") +
  coord_cartesian(ylim = c(0, 100)) +
  facet_wrap(~ prior, labeller = label_parsed)
```

It turns out that many of the lines considered plausible under \(\operatorname{Normal}(0, 10)\) are disturbingly extreme. Here is what \(\alpha \sim \operatorname{Normal}(3, 0.5)\) and \(\beta \sim \operatorname{Normal}(0, 0.2)\) would mean when the \(x\)-axis is on the log population scale and the population scale.

```
set.seed(11)
prior <-
  tibble(i = 1:n,
a = rnorm(n, mean = 3, sd = 0.5),
b = rnorm(n, mean = 0, sd = 0.2)) %>%
expand(nesting(i, a, b),
x = seq(from = log(100), to = log(200000), length.out = 100))
# left
p1 <-
  prior %>%
  ggplot(aes(x = x, y = exp(a + b * x), group = i)) +
geom_line(size = 1/4, alpha = 2/3,
color = wes_palette("Moonrise2")[4]) +
labs(subtitle = expression(beta%~%Normal(0*', '*0.2)),
x = "log population",
y = "total tools") +
coord_cartesian(xlim = c(log(100), log(200000)),
ylim = c(0, 500))
# right
p2 <-
  prior %>%
  ggplot(aes(x = exp(x), y = exp(a + b * x), group = i)) +
geom_line(size = 1/4, alpha = 2/3,
color = wes_palette("Moonrise2")[4]) +
labs(subtitle = expression(beta%~%Normal(0*', '*0.2)),
x = "population",
y = "total tools") +
coord_cartesian(xlim = c(100, 200000),
ylim = c(0, 500))
# combine
p1 | p2
```

Okay, after settling on our two priors, the updated model formula is

\[\begin{align*} y_i & \sim \operatorname{Poisson}(\lambda_i) \\ \log(\lambda_i) & = \alpha + \beta (x_i - \bar x) \\ \alpha & \sim \operatorname{Normal}(3, 0.5) \\ \beta & \sim \operatorname{Normal}(0, 0.2). \end{align*}\]

We’re finally ready to fit the model. The only new thing in our model code is `family = poisson`. In this case, **brms** defaults to the `log()` link. We’ll fit both an intercept-only Poisson model and an interaction model.

```
# intercept only
b11.9 <-
  brm(data = d,
      family = poisson,
      total_tools ~ 1,
      prior = prior(normal(3, 0.5), class = Intercept),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 11,
file = "fits/b11.09")
# interaction model
b11.10 <-
  brm(data = d,
      family = poisson,
      bf(total_tools ~ a + b * log_pop_std,
         a + b ~ 0 + cid,
         nl = TRUE),
prior = c(prior(normal(3, 0.5), nlpar = a),
prior(normal(0, 0.2), nlpar = b)),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 11,
file = "fits/b11.10")
```

Check the model summaries.

`print(b11.9)`

```
## Family: poisson
## Links: mu = log
## Formula: total_tools ~ 1
## Data: d (Number of observations: 10)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 3.54 0.05 3.44 3.65 1.00 1815 1498
##
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```

`print(b11.10)`

```
## Family: poisson
## Links: mu = log
## Formula: total_tools ~ a + b * log_pop_std
## a ~ 0 + cid
## b ~ 0 + cid
## Data: d (Number of observations: 10)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## a_cidhigh 3.61 0.07 3.47 3.75 1.00 3941 2746
## a_cidlow 3.32 0.09 3.14 3.48 1.00 3300 2786
## b_cidhigh 0.19 0.16 -0.13 0.50 1.00 3907 2481
## b_cidlow 0.38 0.05 0.27 0.48 1.00 3051 2756
##
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```

Now compute the LOO estimates and compare the models by the LOO.

```
b11.9  <- add_criterion(b11.9, "loo")
b11.10 <- add_criterion(b11.10, "loo")

loo_compare(b11.9, b11.10, criterion = "loo") %>% print(simplify = F)
```

```
## elpd_diff se_diff elpd_loo se_elpd_loo p_loo se_p_loo looic se_looic
## b11.10 0.0 0.0 -42.7 6.6 7.0 2.6 85.4 13.2
## b11.9 -27.8 16.3 -70.5 16.6 8.0 3.5 141.0 33.3
```

Here’s the LOO weight.

`model_weights(b11.9, b11.10, weights = "loo") %>% round(digits = 2)`

```
## b11.9 b11.10
## 0 1
```

McElreath reported getting a warning from his `rethinking::compare()`. Our warning came from the `add_criterion()` function. We can inspect the Pareto \(k\) values with `loo::pareto_k_table()`.

`loo(b11.10) %>% loo::pareto_k_table()`

```
## Pareto k diagnostic values:
## Count Pct. Min. n_eff
## (-Inf, 0.5] (good) 6 60.0% 1225
## (0.5, 0.7] (ok) 2 20.0% 233
## (0.7, 1] (bad) 2 20.0% 40
## (1, Inf) (very bad) 0 0.0% <NA>
```

Let’s take a closer look.

```
tibble(culture = d$culture,
k = b11.10$criteria$loo$diagnostics$pareto_k) %>%
arrange(desc(k)) %>%
mutate_if(is.double, round, digits = 2)
```

```
## # A tibble: 10 x 2
## culture k
## <fct> <dbl>
## 1 Hawaii 0.96
## 2 Tonga 0.77
## 3 Yap 0.6
## 4 Trobriand 0.56
## 5 Tikopia 0.39
## 6 Malekula 0.37
## 7 Santa Cruz 0.32
## 8 Lau Fiji 0.31
## 9 Chuuk 0.09
## 10 Manus 0.08
```

It turns out Hawaii is very influential. Figure 11.9 will clarify why. Here we make the left panel.

```
cultures <- c("Hawaii", "Tonga", "Trobriand", "Yap")
library(ggrepel)
nd <-
  distinct(d, cid) %>%
expand(cid,
log_pop_std = seq(from = -4.5, to = 2.5, length.out = 100))
f <-
  fitted(b11.10,
newdata = nd,
probs = c(.055, .945)) %>%
data.frame() %>%
bind_cols(nd)
p1 <-
  f %>%
  ggplot(aes(x = log_pop_std, group = cid, color = cid)) +
geom_smooth(aes(y = Estimate, ymin = Q5.5, ymax = Q94.5, fill = cid),
stat = "identity",
alpha = 1/4, size = 1/2) +
geom_point(data = bind_cols(d, b11.10$criteria$loo$diagnostics),
aes(y = total_tools, size = pareto_k),
alpha = 4/5) +
geom_text_repel(data =
bind_cols(d, b11.10$criteria$loo$diagnostics) %>%
filter(culture %in% cultures) %>%
mutate(label = str_c(culture, " (", round(pareto_k, digits = 2), ")")),
aes(y = total_tools, label = label),
size = 3, seed = 11, color = "black", family = "Times") +
labs(x = "log population (std)",
y = "total tools") +
coord_cartesian(xlim = range(b11.10$data$log_pop_std),
ylim = c(0, 80))
```

Now make the right panel of Figure 11.9.

```
p2 <-
  f %>%
  mutate(population = exp((log_pop_std * sd(log(d$population))) + mean(log(d$population)))) %>%
ggplot(aes(x = population, group = cid, color = cid)) +
geom_smooth(aes(y = Estimate, ymin = Q5.5, ymax = Q94.5, fill = cid),
stat = "identity",
alpha = 1/4, size = 1/2) +
geom_point(data = bind_cols(d, b11.10$criteria$loo$diagnostics),
aes(y = total_tools, size = pareto_k),
alpha = 4/5) +
scale_x_continuous("population", breaks = c(0, 50000, 150000, 250000)) +
ylab("total tools") +
coord_cartesian(xlim = range(d$population),
ylim = c(0, 80))
```

Combine the two subplots with **patchwork** and adjust the settings a little.

```
(p1 | p2) &
  scale_fill_manual(values = wes_palette("Moonrise2")[1:2]) &
scale_color_manual(values = wes_palette("Moonrise2")[1:2]) &
scale_size(range = c(2, 5)) &
theme(legend.position = "none")
```

Hawaii is influential in that it has a very large population relative to the other islands.

#### 11.2.1.1 Overthinking: Modeling tool innovation.

McElreath’s theoretical, or scientific, model for `total_tools` is

\[\widehat{\text{total_tools}} = \frac{\alpha_{\text{cid}[i]} \: \text{population}^{\beta_{\text{cid}[i]}}}{\gamma}.\]

We can use the Poisson likelihood to express this in a Bayesian model as

\[\begin{align*} \text{total_tools} & \sim \operatorname{Poisson}(\lambda_i) \\ \lambda_i & = \left[ \exp (\alpha_{\text{cid}[i]}) \text{population}_i^{\beta_{\text{cid}[i]}} \right] / \gamma \\ \alpha_j & \sim \operatorname{Normal}(1, 1) \\ \beta_j & \sim \operatorname{Exponential}(1) \\ \gamma & \sim \operatorname{Exponential}(1), \end{align*}\]

where we exponentiate \(\alpha_{\text{cid}[i]}\) to restrict the posterior to zero and above. Here’s how we might fit that model with **brms**.

```
b11.11 <-
  brm(data = d,
      family = poisson(link = "identity"),
      bf(total_tools ~ exp(a) * population^b / g,
         a + b ~ 0 + cid,
         g ~ 1,
         nl = TRUE),
prior = c(prior(normal(1, 1), nlpar = a),
prior(exponential(1), nlpar = b, lb = 0),
prior(exponential(1), nlpar = g, lb = 0)),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 11,
control = list(adapt_delta = .95),
file = "fits/b11.11")
```

Did you notice the `family = poisson(link = "identity")` part of the code? Yes, it’s possible to use the Poisson likelihood without the log link. However, if you’re going to buck tradition and use some other link, make sure you know what you’re doing.

Check the model summary.

`print(b11.11)`

```
## Family: poisson
## Links: mu = identity
## Formula: total_tools ~ exp(a) * population^b/g
## a ~ 0 + cid
## b ~ 0 + cid
## g ~ 1
## Data: d (Number of observations: 10)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## a_cidhigh 0.95 0.85 -0.69 2.64 1.00 1565 1367
## a_cidlow 0.87 0.70 -0.59 2.19 1.00 1488 1419
## b_cidhigh 0.28 0.10 0.08 0.49 1.00 1376 998
## b_cidlow 0.26 0.03 0.19 0.33 1.00 2137 1753
## g_Intercept 1.13 0.78 0.20 3.16 1.00 1395 1373
##
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```

Compute and check the PSIS-LOO estimates along with their diagnostic Pareto \(k\) values.

```
b11.11 <- add_criterion(b11.11, criterion = "loo", moment_match = T)

loo(b11.11)
```

```
##
## Computed from 4000 by 10 log-likelihood matrix
##
## Estimate SE
## elpd_loo -40.6 5.9
## p_loo 5.1 1.9
## looic 81.3 11.9
## ------
## Monte Carlo SE of elpd_loo is 0.1.
##
## Pareto k diagnostic values:
## Count Pct. Min. n_eff
## (-Inf, 0.5] (good) 8 80.0% 249
## (0.5, 0.7] (ok) 2 20.0% 219
## (0.7, 1] (bad) 0 0.0% <NA>
## (1, Inf) (very bad) 0 0.0% <NA>
##
## All Pareto k estimates are ok (k < 0.7).
## See help('pareto-k-diagnostic') for details.
```

The first time through, we still had high Pareto \(k\) values. Recall that given the very small sample size, this isn’t entirely surprising. Newer versions of **brms** might prompt you to set `moment_match = TRUE`, which is what I did here. You might perform the operation both ways to get a sense of the difference.

Okay, it’s time to make Figure 11.10.

```
# for the annotation
text <-
  distinct(d, cid) %>%
mutate(population = c(210000, 72500),
total_tools = c(59, 68),
label = str_c(cid, " contact"))
# redefine the new data
nd <-
  distinct(d, cid) %>%
expand(cid,
population = seq(from = 0, to = 300000, length.out = 100))
# compute the posterior predictions for lambda
fitted(b11.11,
newdata = nd,
probs = c(.055, .945)) %>%
data.frame() %>%
bind_cols(nd) %>%
# plot!
ggplot(aes(x = population, group = cid, color = cid)) +
geom_smooth(aes(y = Estimate, ymin = Q5.5, ymax = Q94.5, fill = cid),
stat = "identity",
alpha = 1/4, size = 1/2) +
geom_point(data = bind_cols(d, b11.11$criteria$loo$diagnostics),
aes(y = total_tools, size = pareto_k),
alpha = 4/5) +
geom_text(data = text,
aes(y = total_tools, label = label),
family = "serif") +
scale_fill_manual(values = wes_palette("Moonrise2")[1:2]) +
scale_color_manual(values = wes_palette("Moonrise2")[1:2]) +
scale_size(range = c(2, 5)) +
scale_x_continuous("population", breaks = c(0, 50000, 150000, 250000)) +
ylab("total tools") +
coord_cartesian(xlim = range(d$population),
ylim = range(d$total_tools)) +
theme(legend.position = "none")
```

In case you were curious, here are the results if we compare `b11.10` and `b11.11` by the PSIS-LOO.

`loo_compare(b11.10, b11.11, criterion = "loo") %>% print(simplify = F)`

```
## elpd_diff se_diff elpd_loo se_elpd_loo p_loo se_p_loo looic se_looic
## b11.11 0.0 0.0 -40.6 5.9 5.1 1.9 81.3 11.9
## b11.10 -2.1 2.7 -42.7 6.6 7.0 2.6 85.4 13.2
```

`model_weights(b11.10, b11.11, weights = "loo") %>% round(digits = 3)`

```
## b11.10 b11.11
## 0.111 0.889
```

Finally, here’s a comparison of the two models by the Pareto \(k\) values.

```
tibble(b11.10 = b11.10$criteria$loo$diagnostics$pareto_k,
b11.11 = b11.11$criteria$loo$diagnostics$pareto_k) %>%
pivot_longer(everything()) %>%
ggplot(aes(x = value, y = name)) +
geom_vline(xintercept = c(.5, .7, 1), linetype = 3, color = wes_palette("Moonrise2")[2]) +
stat_dots(slab_fill = wes_palette("Moonrise2")[1],
slab_color = wes_palette("Moonrise2")[1]) +
scale_x_continuous(expression(Pareto~italic(k)), breaks = c(.5, .7, 1)) +
ylab(NULL) +
coord_cartesian(ylim = c(1.5, 2.4))
```

### 11.2.2 Negative binomial (gamma-Poisson) models.

Typically there is a lot of unexplained variation in Poisson models. Presumably this additional variation arises from unobserved influences that vary from case to case, generating variation in the true \(\lambda\)’s. Ignoring this variation, or *rate heterogeneity*, can cause confounds just like it can for binomial models. So a very common extension of Poisson GLMs is to swap the Poisson distribution for something called the *negative binomial* distribution. This is really a Poisson distribution in disguise, and it is also sometimes called the *gamma-Poisson* distribution for this reason. It is a Poisson in disguise, because it is a mixture of different Poisson distributions. This is the Poisson analogue of the Student-t model, which is a mixture of different normal distributions. We’ll work with mixtures in the next chapter. (p. 357, *emphasis* in the original)
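To make the mixture idea concrete, here is a small base-R simulation (my sketch, not from the text): we draw a Poisson rate for each case from a gamma distribution, then draw the count given that rate. The result keeps the mean of a plain Poisson but picks up extra variance.

```
# gamma-Poisson as a mixture: each case gets its own lambda
set.seed(11)
n <- 1e4

# pure Poisson with a common rate: variance equals the mean
y_pois <- rpois(n, lambda = 2)

# gamma-Poisson: heterogeneous rates with the same overall mean of 2
lambda <- rgamma(n, shape = 2, rate = 1)
y_gp   <- rpois(n, lambda = lambda)

# both means are near 2, but the mixture is overdispersed:
# Var(y_gp) is approximately mean + mean^2 / shape = 2 + 4 / 2 = 4
c(mean(y_pois), var(y_pois))
c(mean(y_gp),   var(y_gp))
```

This is exactly the sense in which the gamma-Poisson is "a Poisson in disguise": marginalizing over the gamma-distributed rates yields the negative binomial.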

### 11.2.3 Example: Exposure and the offset.

For the last Poisson example, we’ll look at a case where the exposure varies across observations. When the length of observation, area of sampling, or intensity of sampling varies, the counts we observe also naturally vary. Since a Poisson distribution assumes that the rate of events is constant in time (or space), it’s easy to handle this. All we need to do, as explained above, is to add the logarithm of the exposure to the linear model. The term we add is typically called an *offset*. (p. 357, *emphasis* in the original)

Here we simulate our data.

```
set.seed(11)
num_days <- 30
y        <- rpois(num_days, lambda = 1.5)

num_weeks <- 4
y_new     <- rpois(num_weeks, lambda = 0.5 * 7)
```

Now tidy the data and add `log_days`.

```
(
  d <-
  tibble(y = c(y, y_new),
days = rep(c(1, 7), times = c(num_days, num_weeks)), # this is the exposure
monastery = rep(0:1, times = c(num_days, num_weeks))) %>%
mutate(log_days = log(days))
)
```

```
## # A tibble: 34 x 4
## y days monastery log_days
## <int> <dbl> <int> <dbl>
## 1 1 1 0 0
## 2 0 1 0 0
## 3 1 1 0 0
## 4 0 1 0 0
## 5 0 1 0 0
## 6 4 1 0 0
## 7 0 1 0 0
## 8 1 1 0 0
## 9 3 1 0 0
## 10 0 1 0 0
## # … with 24 more rows
```

Within the context of the Poisson likelihood, we can decompose \(\lambda\) into two parts, \(\mu\) (mean) and \(\tau\) (exposure), like this:

\[ y_i \sim \operatorname{Poisson}(\lambda_i) \\ \log \lambda_i = \log \frac{\mu_i}{\tau_i} = \log \mu_i - \log \tau_i. \]

Therefore, you can rewrite the equation if the exposure (\(\tau\)) varies in your data and you still want to model the mean (\(\mu\)). Using the model we’re about to fit as an example, here’s what that might look like:

\[\begin{align*} y_i & \sim \operatorname{Poisson}(\mu_i) \\ \log \mu_i & = \color{#a4692f}{\log \tau_i} + \alpha + \beta \text{monastery}_i \\ \alpha & \sim \operatorname{Normal}(0, 1) \\ \beta & \sim \operatorname{Normal}(0, 1), \end{align*}\]

where the offset \(\log \tau_i\) does not get a prior. In this context, its value is added directly to the right side of the formula. With the **brms** package, you use the `offset()` function in the `formula` syntax. You just insert a pre-processed variable like `log_days`, or the log of a variable, such as `log(days)`. Fit the model.

```
b11.12 <-
  brm(data = d,
      family = poisson,
      y ~ 1 + offset(log_days) + monastery,
      prior = c(prior(normal(0, 1), class = Intercept),
prior(normal(0, 1), class = b)),
iter = 2000, warmup = 1000, cores = 4, chains = 4,
seed = 11,
file = "fits/b11.12")
```

As we look at the model summary, keep in mind that the parameters are on the per-one-unit-of-time scale. Since we simulated the data based on summary information from two units of time (one day and seven days), this means the parameters are on the scale of \(\log (\lambda)\) per one day.

`print(b11.12)`

```
## Family: poisson
## Links: mu = log
## Formula: y ~ 1 + offset(log_days) + monastery
## Data: d (Number of observations: 34)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept -0.01 0.18 -0.38 0.33 1.00 2350 1804
## monastery -0.89 0.33 -1.56 -0.26 1.00 2443 2306
##
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```

The model summary helps clarify that when you use `offset()`, `brm()` fixes the value. Thus there is no parameter estimate for the `offset()`. It’s a fixed part of the model, not unlike how the \(\nu\) parameter of the Student-\(t\) distribution gets fixed to infinity when you use the Gaussian likelihood.
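To see the offset mechanism outside of **brms**, here is a base-R sketch with `stats::glm()` (my example, with a hypothetical per-day rate of 0.5, not from the text): with mixed exposures, adding the log exposure as an offset recovers the per-unit rate.

```
set.seed(11)
# hypothetical true per-day rate
rate <- 0.5

# counts recorded over mixed exposures: 100 daily and 100 weekly records
days <- rep(c(1, 7), each = 100)
y    <- rpois(200, lambda = rate * days)

# the offset enters the linear model with its coefficient fixed to 1
fit <- glm(y ~ 1 + offset(log(days)), family = poisson)

# exponentiating the intercept returns the per-day rate
exp(coef(fit))  # close to 0.5
```

The same logic carries over to the Bayesian model above; the only difference is that `brm()` estimates the intercept and slope with priors while still holding the offset fixed.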

To get the posterior distributions for average daily outputs for the old and new monasteries, respectively, we’ll use the formulas

\[\begin{align*} \lambda_\text{old} & = \exp (\alpha) \;\;\; \text{and} \\ \lambda_\text{new} & = \exp (\alpha + \beta_\text{monastery}). \end{align*}\]

Following those transformations, we’ll summarize the \(\lambda\) distributions with means and 89% HDIs with help from the `tidybayes::mean_hdi()` function.

```
posterior_samples(b11.12) %>%
mutate(lambda_old = exp(b_Intercept),
lambda_new = exp(b_Intercept + b_monastery)) %>%
pivot_longer(contains("lambda")) %>%
mutate(name = factor(name, levels = c("lambda_old", "lambda_new"))) %>%
group_by(name) %>%
mean_hdi(value, .width = .89) %>%
mutate_if(is.double, round, digits = 2)
```

```
## # A tibble: 2 x 7
## name value .lower .upper .width .point .interval
## <fct> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 lambda_old 1 0.69 1.25 0.89 mean hdi
## 2 lambda_new 0.42 0.23 0.6 0.89 mean hdi
```

Because we don’t know what seed McElreath used to simulate his data, our simulated data differed a little from his and, as a consequence, our results differ a little, too.

## 11.3 Multinomial and categorical models

When more than two types of unordered events are possible, and the probability of each type of event is constant across trials, then the maximum entropy distribution is the *multinomial distribution*. [We] already met the multinomial, implicitly, in Chapter 10 when we tossed pebbles into buckets as an introduction to maximum entropy. The binomial is really a special case of this distribution. And so its distribution formula resembles the binomial, just extrapolated out to three or more types of events. If there are \(K\) types of events with probabilities \(p_1, \dots, p_K\), then the probability of observing \(y_1, \dots, y_K\) events of each type out of \(n\) total trials is:

\[\operatorname{Pr} (y_1, \dots, y_K | n, p_1, \dots, p_K) = \frac{n!}{\prod_i y_i!} \prod_{i = 1}^K p_i^{y_i}\]

The fraction with \(n!\) on top just expresses the number of different orderings that give the same counts \(y_1, \dots, y_K\). It’s the famous multiplicity from the previous chapter….
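As a quick numeric check of that formula (my sketch, not from the text), we can compute the probability by hand and compare it with base R’s `dmultinom()`.

```
# Pr(y_1, ..., y_K | n, p_1, ..., p_K) = n! / prod(y_i!) * prod(p_i^y_i)
y <- c(1, 2, 3)        # counts of each of K = 3 event types, n = 6 trials
p <- c(0.2, 0.3, 0.5)  # event-type probabilities

by_hand <- factorial(sum(y)) / prod(factorial(y)) * prod(p^y)
by_hand                 # 0.135
dmultinom(y, prob = p)  # same value
```

The leading fraction evaluates to \(6! / (1! \, 2! \, 3!) = 60\), the multiplicity of orderings that produce these counts.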

The conventional and natural link in this context is the *multinomial logit*, also known as the *softmax* function. This link function takes a vector of scores, one for each of \(K\) event types, and computes the probability of a particular type of event \(k\) as

\[\text{Pr} (k | s_1, s_2, \dots, s_K) = \frac{\exp (s_k)}{\sum_{i = 1}^K \exp (s_i)}\]

(p. 359, *emphasis* in the original)

McElreath then went on to explain how multinomial logistic regression models are among the more difficult of the GLMs to master. He wasn’t kidding. To get a grasp on these, we’ll cover them in a little more detail than he did in the text. Before we begin, I’d like to give a big shout out to Adam Bear, whose initial comment on a GitHub issue turned into a friendly and productive email collaboration on what, exactly, is going on with this section. Hopefully we got it.

### 11.3.1 Predictors matched to outcomes.

To begin, let’s simulate the data just like McElreath did in the **R** code 11.55 block.

```
library(rethinking)
# simulate career choices among 500 individuals
n      <- 500          # number of individuals
income <- c(1, 2, 5)   # expected income of each career
score  <- 0.5 * income # scores for each career, based on income
# next line converts scores to probabilities
p <- softmax(score[1], score[2], score[3])
# now simulate choice
# outcome career holds event type values, not counts
career <- rep(NA, n)  # empty vector of choices for each individual

# sample chosen career for each individual
set.seed(34302)
for(i in 1:n) career[i] <- sample(1:3, size = 1, prob = p)
```

Before moving on, it might be useful to examine what we just did. With the three lines below the “# simulate career choices among 500 individuals” comment, we defined the formulas for three scores. Those were

\[\begin{align*} s_1 & = 0.5 \times \text{income}_1 \\ s_2 & = 0.5 \times \text{income}_2 \\ s_3 & = 0.5 \times \text{income}_3, \end{align*}\]

where \(\text{income}_1 = 1\), \(\text{income}_2 = 2\), and \(\text{income}_3 = 5\). What’s a little odd about this setup and conceptually important to get is that although \(\text{income}_i\) varies across the three levels of \(s\), the \(\text{income}_i\) value is constant within each level of \(s\). E.g., \(\text{income}_1\) is not a variable within the context of \(s_1\). Therefore, we could also write the above as

\[\begin{align*} s_1 & = 0.5 \cdot 1 = 0.5 \\ s_2 & = 0.5 \cdot 2 = 1.0 \\ s_3 & = 0.5 \cdot 5 = 2.5. \end{align*}\]

Let’s confirm.

`print(score)`

`## [1] 0.5 1.0 2.5`

We then converted those `score` values to probabilities with the `softmax()` function. This will become important when we set up the model code. For now, here’s what the data look like.

```
# put them in a tibble
d <-
  tibble(career = career) %>%
mutate(career_income = ifelse(career == 3, 5, career))
# plot
d %>%
  ggplot(aes(x = career)) +
geom_bar(size = 0, fill = wes_palette("Moonrise2")[2])
```

Our `career` variable is composed of three categories, `1:3`, with each category more likely than the one before. Here’s a breakdown of the counts, percentages, and probabilities of each category.

```
d %>%
  count(career) %>%
mutate(percent = (100 * n / sum(n)),
probability = n / sum(n))
```

```
## # A tibble: 3 x 4
## career n percent probability
## <int> <int> <dbl> <dbl>
## 1 1 48 9.6 0.096
## 2 2 79 15.8 0.158
## 3 3 373 74.6 0.746
```

To further build an appreciation for how we simulated data with these proportions and how the process links in with the formulas above, we’ll retrace the first few simulation steps within a **tidyverse**-centric workflow. Recall how in those first few steps we defined values for `income`, `score`, and `p`. Here they are again in a tibble.

```
tibble(income = c(1, 2, 5)) %>%
mutate(score = 0.5 * income) %>%
mutate(p = exp(score) / sum(exp(score)))
```

```
## # A tibble: 3 x 3
## income score p
## <dbl> <dbl> <dbl>
## 1 1 0.5 0.0996
## 2 2 1 0.164
## 3 5 2.5 0.736
```

Notice how the values in the `p` column match up well with the `probability` values from the output of the block just above. Our simulation successfully produced data corresponding to the data-generating values. Woot! Also note how the code we just used to compute those `p` values, `p = exp(score) / sum(exp(score))`, corresponds nicely with the formula from above,

\[\text{Pr} (k |s_1, s_2, \dots, s_K) = \frac{\exp (s_k)}{\sum_{i = 1}^K \exp (s_i)}.\]

What still might seem mysterious is what those \(s\) values in the equation are. In the simulation and in the prose, McElreath called them *scores*. Another way to think about them is as weights. The thing to get is that their exact values aren’t as important as their differences from one another: the `score` for `income == 2` was 0.5 larger than that of `income == 1`, and the `score` for `income == 5` was 1.5 larger still. Those differences, not the absolute values, determine the probabilities. So if we add an arbitrary constant to each of those `score` values, like 11, we’ll get the same `p` values.

```
tibble(income = c(1, 2, 5),
some_constant = 11) %>%
mutate(score = (0.5 * income) + some_constant) %>%
mutate(p = exp(score) / sum(exp(score)))
```

```
## # A tibble: 3 x 4
## income some_constant score p
## <dbl> <dbl> <dbl> <dbl>
## 1 1 11 11.5 0.0996
## 2 2 11 12 0.164
## 3 5 11 13.5 0.736
```

Now keeping that in mind, recall how McElreath said that though we have \(K\) categories, \(K = 3\) in this case, we only estimate \(K - 1\) linear models. “In a multinomial (or categorical) GLM, you need \(K - 1\) linear models for \(K\) types of events. One of the outcome values is chosen as a ‘pivot’ and the others are modeled relative to it.” (p. 360). You could also think of the pivot category as the reference category.
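Here is a quick sketch of the pivot idea (my example, with a hand-rolled `softmax()` helper, not from the text): shifting all the scores so the pivot category sits at zero leaves the probabilities untouched, which is why only \(K - 1\) linear models are needed.

```
# softmax for a vector of scores
softmax <- function(s) exp(s) / sum(exp(s))

s <- c(0.5, 1.0, 2.5)  # the three scores from the simulation above

# identical probabilities before and after pivoting on the third category
softmax(s)
softmax(s - s[3])  # third score is now the pivot, fixed at 0
```

Fixing the pivot’s score at zero removes the redundancy in the parameterization; the remaining \(K - 1\) scores are then interpreted relative to the pivot.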

Before we practice fitting multinomial models with **brms**, it’ll be helpful if we first follow along with the text and fit the model directly in Stan. We will be working directly with Stan very infrequently in this ebook. If you’re interested in learning more about modeling directly with Stan, you might check out the *Stan user’s guide* (Stan Development Team, 2021c), the *Stan reference manual* (Stan Development Team, 2021b), and the *Stan functions reference* (Stan Development Team, 2021a). Fit the model with Stan.

```
# define the model
code_m11.13 <- "
data{
int N; // number of individuals
int K; // number of possible careers
int career[N]; // outcome
vector[K] career_income;
}
parameters{
vector[K - 1] a; // intercepts
real<lower=0> b; // association of income with choice
}
model{
vector[K] p;
vector[K] s;
a ~ normal(0, 1);
b ~ normal(0, 0.5);
s[1] = a[1] + b * career_income[1];
s[2] = a[2] + b * career_income[2];
s[3] = 0; // pivot
p = softmax(s);
career ~ categorical(p);
}
"
# wrangle the data
dat_list <-
  list(N = n,
K = 3,
career = career,
career_income = income)
# fit the model
m11.13 <-
  stan(data = dat_list,
model_code = code_m11.13,
chains = 4)
```

Check the summary.

`precis(m11.13, depth = 2) %>% round(digits = 2)`

```
## mean sd 5.5% 94.5% n_eff Rhat4
## a[1] -2.14 0.18 -2.44 -1.87 647.49 1
## a[2] -1.79 0.25 -2.25 -1.45 501.46 1
## b 0.13 0.11 0.01 0.36 497.10 1
```

One of the primary reasons we went through this exercise is to show that McElreath’s **R** code 11.56 and 11.57 do not return the results he reported on page 361. The plot thickens when we attempt the counterfactual simulation on page 362, as reported in **R** code 11.58.

```
post <- extract.samples(m11.13)
# set up logit scores
s1      <- with(post, a[, 1] + b * income[1])
s2_orig <- with(post, a[, 2] + b * income[2])
s2_new  <- with(post, a[, 2] + b * income[2] * 2)
# compute probabilities for original and counterfactual
p_orig <- sapply(1:length(post$b), function(i)
  softmax(c(s1[i], s2_orig[i], 0)))
p_new  <- sapply(1:length(post$b), function(i)
  softmax(c(s1[i], s2_new[i], 0)))
# summarize
p_diff <- p_new[2, ] - p_orig[2, ]
precis(p_diff)
```

```
## mean sd 5.5% 94.5% histogram
## p_diff 0.0419429 0.03941889 0.002461482 0.1214324 ▇▃▂▂▁▁▁▁▁▁▁
```

Even though we used the same code, our counterfactual simulation doesn’t match up with the results McElreath reported in the text, either. Keep this all in mind as we switch to **brms**. But before we move on to **brms**, check this out.

```
data.frame(s1 = score[3] + s1,
s2 = score[3] + s2_orig,
s3 = score[3] + 0) %>%
pivot_longer(everything()) %>%
group_by(name) %>%
mean_qi(value) %>%
mutate_if(is.double, round, digits = 2)
```

```
## # A tibble: 3 x 7
## name value .lower .upper .width .point .interval
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 s1 0.49 0.19 0.77 0.95 mean qi
## 2 s2 0.98 0.73 1.21 0.95 mean qi
## 3 s3 2.5 2.5 2.5 0.95 mean qi
```

In his Stan code (**R** code 11.56), you’ll see McElreath chose the third category to be his pivot and that he used zero as a constant value. As it turns out, it is common practice to set the score value for the reference category to zero. It’s also a common practice to use the first event type as the reference category. Importantly, in his (2021h) vignette, *Parameterization of response distributions in brms*, Bürkner clarified the **brms** default is to use the first response category as the reference and set its score to zero as well. However, we can control this behavior with the `refcat` argument. In the examples to follow, we’ll follow McElreath and use the third event type as the reference category by setting `refcat = 3`.

In addition to the discrepancies with the code and results in the text, one of the things I don’t care for in this section is how fast McElreath covered the material. Our approach will be to slow down a little and start off by fitting an intercepts-only model before adding the covariate. Before we fit the model, we might take a quick look at the prior structure with `brms::get_prior()`.

```
get_prior(data = d,
family = categorical(link = logit, refcat = 3),
career ~ 1)
```

```
## prior class coef group resp dpar nlpar bound source
## (flat) Intercept default
## student_t(3, 3, 2.5) Intercept mu1 default
## student_t(3, 3, 2.5) Intercept mu2 default
```

We have two “intercepts,” which are differentiated in the `dpar` column. We’ll talk more about what these are in just a bit; don’t worry. I show this here because as of **brms** 2.12.0, “specifying global priors for regression coefficients in categorical models is deprecated.” The upshot is that even if we want to use the same prior for both, we need to use the `dpar` argument for each. With that in mind, here’s our multinomial model in **brms**. Do note the specification `family = categorical(link = logit, refcat = 3)`. The `categorical` part is what instructs **brms** to use the multinomial likelihood and the `refcat = 3` part will allow us to use the third event type as the pivot.

```
b11.13io <-
  brm(data = d,
      family = categorical(link = logit, refcat = 3),
      career ~ 1,
      prior = c(prior(normal(0, 1), class = Intercept, dpar = mu1),
prior(normal(0, 1), class = Intercept, dpar = mu2)),
iter = 2000, warmup = 1000, cores = 4, chains = 4,
seed = 11,
file = "fits/b11.13io")
```

The summary can be difficult to interpret.

`print(b11.13io)`

```
## Family: categorical
## Links: mu1 = logit; mu2 = logit
## Formula: career ~ 1
## Data: d (Number of observations: 500)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## mu1_Intercept -2.01 0.15 -2.30 -1.73 1.00 3324 2631
## mu2_Intercept -1.53 0.12 -1.77 -1.29 1.00 2993 2768
##
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```

`brms::brm()` referred to the \(K\) categories as `mu1`, `mu2`, and `mu3`. Since `career == 3` is the reference category, the score for which was set to zero, there is no parameter for `mu3_Intercept`. That’s a zero. Now notice how `mu1_Intercept` is about -2 and `mu2_Intercept` is about -1.5. If we double back to the `income` and `score` values we played with at the beginning of this section, you’ll notice that the score for the reference category was 2.5. Here’s what happens if we rescale the three scores such that the `score` value for the reference category is 0.

```
tibble(income = c(1, 2, 5)) %>%
mutate(score = 0.5 * income) %>%
mutate(rescaled_score = score - 2.5)
```

```
## # A tibble: 3 x 3
## income score rescaled_score
## <dbl> <dbl> <dbl>
## 1 1 0.5 -2
## 2 2 1 -1.5
## 3 5 2.5 0
```

Now notice how the `rescaled_score` values for the first two rows correspond nicely to `mu1_Intercept` and `mu2_Intercept` from our model. What I hope this clarifies is that our statistical model returned the scores. But recall these are not quite probabilities. *Why?* Because the weights are all relative to one another. The easiest way to get what we want, the probabilities for the three categories, is with `brms::fitted()`. Since this model has no predictors, only intercepts, we won’t specify any `newdata`. In such a case, `fitted()` will return fitted values for each case in the data. Going slow, let’s take a look at the structure of the output.

`fitted(b11.13io) %>% str()`

```
## num [1:500, 1:4, 1:3] 0.0999 0.0999 0.0999 0.0999 0.0999 ...
## - attr(*, "dimnames")=List of 3
## ..$ : NULL
## ..$ : chr [1:4] "Estimate" "Est.Error" "Q2.5" "Q97.5"
## ..$ : chr [1:3] "P(Y = 1)" "P(Y = 2)" "P(Y = 3)"
```

Just as expected, we have 500 rows–one for each case in the original data. We have four summary columns, the typical `Estimate`, `Est.Error`, `Q2.5`, and `Q97.5`. We also have a third dimension composed of three levels, `P(Y = 1)`, `P(Y = 2)`, and `P(Y = 3)`. Those index which of the three career categories each probability summary is for. Since the results are identical for each row, we’ll simplify the output by only keeping the first row.

```
fitted(b11.13io)[1, , ] %>%
round(digits = 2)
```

```
## P(Y = 1) P(Y = 2) P(Y = 3)
## Estimate 0.10 0.16 0.74
## Est.Error 0.01 0.02 0.02
## Q2.5 0.08 0.13 0.70
## Q97.5 0.13 0.19 0.77
```

If we take the transpose of that, it will put the results in the format we’re more accustomed to.

```
fitted(b11.13io)[1, , ] %>%
round(digits = 2) %>%
t()
```

```
## Estimate Est.Error Q2.5 Q97.5
## P(Y = 1) 0.10 0.01 0.08 0.13
## P(Y = 2) 0.16 0.02 0.13 0.19
## P(Y = 3) 0.74 0.02 0.70 0.77
```

Now compare those summaries with the empirically-derived percent and probability values we computed earlier.

```
tibble(income = c(1, 2, 5)) %>%
mutate(score = 0.5 * income) %>%
mutate(p = exp(score) / sum(exp(score)))
```

```
## # A tibble: 3 x 3
## income score p
## <dbl> <dbl> <dbl>
## 1 1 0.5 0.0996
## 2 2 1 0.164
## 3 5 2.5 0.736
```

Now here’s how to make use of the formula from the last `mutate()` line, \(\frac{\exp (s_k)}{\sum_{i = 1}^K \exp (s_i)}\), to compute the marginal probabilities from `b11.13io` by hand.

```
posterior_samples(b11.13io) %>%
mutate(b_mu3_Intercept = 0) %>%
mutate(p1 = exp(b_mu1_Intercept) / (exp(b_mu1_Intercept) + exp(b_mu2_Intercept) + exp(b_mu3_Intercept)),
p2 = exp(b_mu2_Intercept) / (exp(b_mu1_Intercept) + exp(b_mu2_Intercept) + exp(b_mu3_Intercept)),
p3 = exp(b_mu3_Intercept) / (exp(b_mu1_Intercept) + exp(b_mu2_Intercept) + exp(b_mu3_Intercept))) %>%
pivot_longer(p1:p3) %>%
group_by(name) %>%
mean_qi(value) %>%
mutate_if(is.double, round, digits = 2)
```

```
## # A tibble: 3 x 7
## name value .lower .upper .width .point .interval
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 p1 0.1 0.08 0.13 0.95 mean qi
## 2 p2 0.16 0.13 0.19 0.95 mean qi
## 3 p3 0.74 0.7 0.77 0.95 mean qi
```

Hurray; we did it! Not only did we fit a simple multinomial model with **brms**, we actually made sense of the parameters by connecting them to the original data-generating values. We’re almost ready to contend with the model McElreath fit with `stan()`. But before we do, it’ll be helpful to show alternative ways to fit these models. We used conventional-style syntax when we fit `b11.13io`. There are at least two alternative ways to fit the model:

```
# verbose syntax
b11.13io_verbose <-
  brm(data = d,
      family = categorical(link = logit, refcat = 3),
      bf(career ~ 1,
         mu1 ~ 1,
         mu2 ~ 1),
      prior = c(prior(normal(0, 1), class = Intercept, dpar = mu1),
                prior(normal(0, 1), class = Intercept, dpar = mu2)),
      iter = 2000, warmup = 1000, cores = 4, chains = 4,
      seed = 11,
      file = "fits/b11.13io_verbose")

# nonlinear syntax
b11.13io_nonlinear <-
  brm(data = d,
      family = categorical(link = logit, refcat = 3),
      bf(career ~ 1,
         nlf(mu1 ~ a1),
         nlf(mu2 ~ a2),
         a1 + a2 ~ 1),
      prior = c(prior(normal(0, 1), class = b, nlpar = a1),
                prior(normal(0, 1), class = b, nlpar = a2)),
      iter = 2000, warmup = 1000, cores = 4, chains = 4,
      seed = 11,
      file = "fits/b11.13io_nonlinear")
```

For the sake of space, I’m not going to show the results for those two models. If you fit them yourself, you’ll see the results for `b11.13io` and `b11.13io_verbose` are exactly the same, and `b11.13io_nonlinear` differs from them only within simulation variation. I point this out because it’s the nonlinear approach that will allow us to fit a model like McElreath’s `m11.13`. My hope is the syntax we used in the `b11.13io_verbose` model will help clarify what’s going on with the non-linear syntax. When we fit multinomial models with **brms**, the terse conventional `formula` syntax might not make clear how there are actually \(K - 1\) formulas. The more verbose syntax of our `b11.13io_verbose` model shows how we can specify those models directly. In our case, that was with those `mu1 ~ 1, mu2 ~ 1` lines. Had we used the **brms** default and used the first level of `career` as the pivot, those lines would have instead been `mu2 ~ 1, mu3 ~ 1`. So anyway, when we switch to the non-linear syntax, we explicitly model `mu1` and `mu2` and, as is typical of the non-linear syntax, we name our parameters. You can see another comparison of these three ways of fitting a multinomial model at the *Nonlinear syntax with a multinomial model?* thread on the Stan Forums.
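Underneath all three syntaxes is the same identification constraint: the softmax is invariant to adding a constant to every score, which is why only \(K - 1\) of the \(K\) linear models get their own parameters. Here’s a quick base-**R** sketch of that invariance (the `softmax_by_hand()` helper is mine, just for illustration):

```r
# the softmax maps a vector of scores to probabilities
softmax_by_hand <- function(s) exp(s) / sum(exp(s))

# the three data-generating scores from earlier in the section
s <- c(0.5, 1.0, 2.5)

p_raw      <- softmax_by_hand(s)
p_rescaled <- softmax_by_hand(s - 2.5)  # pin the reference score to zero

# shifting all scores by a constant leaves the probabilities unchanged
all.equal(p_raw, p_rescaled)
```

This is also why the rescaled scores of -2, -1.5, and 0 from above reproduced the intercepts in the model summary.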

Now it’s time to focus on the **brms** version of McElreath’s `m11.13`. To my eye, McElreath’s model has two odd features. First, though he has two intercepts, he only has one \(\beta\) parameter. Second, if you look at McElreath’s `parameters` block, you’ll see that he restricted his \(\beta\) parameter to be zero and above (`real<lower=0> b;`).

With the **brms** non-linear syntax, we can fit the model with one \(\beta\) parameter or allow the one \(\beta\) parameter to differ for `mu1` and `mu2`. As to setting a lower bound to the `b` parameter[s], we can do that with the `lb` argument within the `prior()` function. If we fit our version of `m11.13` by systematically varying these two features, we’ll end up with the four versions listed in the table below.

```
crossing(b  = factor(c("b1 & b2", "b"), levels = c("b1 & b2", "b")),
         lb = factor(c("NA", 0), levels = c("NA", 0))) %>%
  mutate(fit = str_c("b11.13", letters[1:n()])) %>%
  select(fit, everything()) %>%
  flextable() %>%
  width(width = 1.25)
```

| fit | b | lb |
|---|---|---|
| b11.13a | b1 & b2 | NA |
| b11.13b | b1 & b2 | 0 |
| b11.13c | b | NA |
| b11.13d | b | 0 |

Fit `b11.13a` through `b11.13d`, the four variants on the model.

```
b11.13a <-
  brm(data = d,
      family = categorical(link = logit, refcat = 3),
      bf(career ~ 1,
         nlf(mu1 ~ a1 + b1 * 1),
         nlf(mu2 ~ a2 + b2 * 2),
         a1 + a2 + b1 + b2 ~ 1),
      prior = c(prior(normal(0, 1), class = b, nlpar = a1),
                prior(normal(0, 1), class = b, nlpar = a2),
                prior(normal(0, 0.5), class = b, nlpar = b1),
                prior(normal(0, 0.5), class = b, nlpar = b2)),
      iter = 2000, warmup = 1000, cores = 4, chains = 4,
      seed = 11,
      file = "fits/b11.13a")

b11.13b <-
  brm(data = d,
      family = categorical(link = logit, refcat = 3),
      bf(career ~ 1,
         nlf(mu1 ~ a1 + b1 * 1),
         nlf(mu2 ~ a2 + b2 * 2),
         a1 + a2 + b1 + b2 ~ 1),
      prior = c(prior(normal(0, 1), class = b, nlpar = a1),
                prior(normal(0, 1), class = b, nlpar = a2),
                prior(normal(0, 0.5), class = b, nlpar = b1, lb = 0),
                prior(normal(0, 0.5), class = b, nlpar = b2, lb = 0)),
      iter = 2000, warmup = 1000, cores = 4, chains = 4,
      seed = 11,
      control = list(adapt_delta = .99),
      file = "fits/b11.13b")

b11.13c <-
  brm(data = d,
      family = categorical(link = logit, refcat = 3),
      bf(career ~ 1,
         nlf(mu1 ~ a1 + b * 1),
         nlf(mu2 ~ a2 + b * 2),
         a1 + a2 + b ~ 1),
      prior = c(prior(normal(0, 1), class = b, nlpar = a1),
                prior(normal(0, 1), class = b, nlpar = a2),
                prior(normal(0, 0.5), class = b, nlpar = b)),
      iter = 2000, warmup = 1000, cores = 4, chains = 4,
      seed = 11,
      file = "fits/b11.13c")

b11.13d <-
  brm(data = d,
      family = categorical(link = logit, refcat = 3),
      bf(career ~ 1,
         nlf(mu1 ~ a1 + b * 1),
         nlf(mu2 ~ a2 + b * 2),
         a1 + a2 + b ~ 1),
      prior = c(prior(normal(0, 1), class = b, nlpar = a1),
                prior(normal(0, 1), class = b, nlpar = a2),
                prior(normal(0, 0.5), class = b, nlpar = b, lb = 0)),
      iter = 2000, warmup = 1000, cores = 4, chains = 4,
      seed = 11,
      control = list(adapt_delta = .99),
      file = "fits/b11.13d")
```

I’m not going to exhaustively show the `print()` output for each. If you check, you’ll see they all fit reasonably well. Here we’ll look at their parameter summaries in bulk with a coefficient plot.

```
tibble(fit = str_c("b11.13", letters[1:4])) %>%
  mutate(fixef = purrr::map(fit, ~get(.) %>%
                              fixef() %>%
                              data.frame() %>%
                              rownames_to_column("parameter"))) %>%
  unnest(fixef) %>%
  mutate(parameter = str_remove(parameter, "_Intercept"),
         fit = factor(fit, levels = str_c("b11.13", letters[4:1]))) %>%
  ggplot(aes(x = Estimate, xmin = Q2.5, xmax = Q97.5, y = fit)) +
  geom_vline(xintercept = 0, color = wes_palette("Moonrise2")[3]) +
  geom_pointrange(fatten = 3/2, color = wes_palette("Moonrise2")[4]) +
  ylab(NULL) +
  theme(axis.ticks.y = element_blank(),
        panel.background = element_rect(fill = alpha("white", 1/8), size = 0)) +
  facet_wrap(~ parameter, nrow = 1)
```

The results differed across models. None of them match up with the results McElreath reported in the text. However, the parameters from `b11.13d` are very close to those from our `m11.13`.

`precis(m11.13, depth = 2)`

```
## mean sd 5.5% 94.5% n_eff Rhat4
## a[1] -2.1410295 0.1774365 -2.439962491 -1.8725250 647.4891 1.003762
## a[2] -1.7875347 0.2483372 -2.250217501 -1.4535626 501.4588 1.003349
## b 0.1331626 0.1126707 0.009310701 0.3569797 497.0989 1.003801
```

`fixef(b11.13d) %>% round(digits = 2)`

```
## Estimate Est.Error Q2.5 Q97.5
## a1_Intercept -2.14 0.19 -2.53 -1.80
## a2_Intercept -1.79 0.25 -2.39 -1.40
## b_Intercept 0.13 0.11 0.00 0.42
```

It might be instructive to compare `b11.13a` through `b11.13d` with the PSIS-LOO.

```
b11.13a <- add_criterion(b11.13a, "loo")
b11.13b <- add_criterion(b11.13b, "loo")
b11.13c <- add_criterion(b11.13c, "loo")
b11.13d <- add_criterion(b11.13d, "loo")

loo_compare(b11.13a, b11.13b, b11.13c, b11.13d, criterion = "loo") %>%
  print(simplify = F)
```

```
## elpd_diff se_diff elpd_loo se_elpd_loo p_loo se_p_loo looic se_looic
## b11.13a 0.0 0.0 -369.5 17.0 1.9 0.1 739.0 34.1
## b11.13d 0.0 0.2 -369.5 16.9 1.9 0.1 739.0 33.8
## b11.13c -0.1 0.1 -369.6 17.1 2.0 0.1 739.2 34.2
## b11.13b -0.1 0.2 -369.6 16.9 2.0 0.1 739.3 33.7
```

```
model_weights(b11.13a, b11.13b, b11.13c, b11.13d, weights = "loo") %>%
round(digits = 2)
```

```
## b11.13a b11.13b b11.13c b11.13d
## 0.27 0.23 0.24 0.26
```

Two things pop out, here. First, all models are essentially equivalent in terms of LOO estimates and LOO weights. Second, the effective number of parameters (\(p_\text{LOO}\)) is about 2 for each model. At first glance, this might be surprising given that `b11.13a` and `b11.13b` both have four parameters and `b11.13c` and `b11.13d` both have three parameters. But recall that none of these models contain predictor variables from the data. All those \(\beta\) parameters, whether they’re held equal or allowed to vary across \(s_1\) and \(s_2\), are just constants. In the absence of actual `income` values that vary within the data, those \(\beta\) parameters are kinda like extra intercepts. For context, go back and review our multicollinear legs from Section 6.1.1 or our double intercepts from Section 9.5.4.

Now see what happens when we compare these four models with our intercepts-only model, `b11.13io`.

```
b11.13io <- add_criterion(b11.13io, "loo")

loo_compare(b11.13io, b11.13a, b11.13b, b11.13c, b11.13d, criterion = "loo") %>%
  print(simplify = F)
```

```
## elpd_diff se_diff elpd_loo se_elpd_loo p_loo se_p_loo looic se_looic
## b11.13a 0.0 0.0 -369.5 17.0 1.9 0.1 739.0 34.1
## b11.13d 0.0 0.2 -369.5 16.9 1.9 0.1 739.0 33.8
## b11.13io 0.0 0.1 -369.5 16.9 1.9 0.1 739.1 33.9
## b11.13c -0.1 0.1 -369.6 17.1 2.0 0.1 739.2 34.2
## b11.13b -0.1 0.2 -369.6 16.9 2.0 0.1 739.3 33.7
```

```
model_weights(b11.13io, b11.13a, b11.13b, b11.13c, b11.13d, weights = "loo") %>%
round(digits = 2)
```

```
## b11.13io b11.13a b11.13b b11.13c b11.13d
## 0.21 0.21 0.18 0.19 0.21
```

They’re all the same. Each model effectively has 2 parameters. Though it doesn’t do much by way of cross-validation, McElreath’s extra \(\beta\) parameter will let us perform a counterfactual simulation. Here is a **brms**/**tidyverse** workflow to make a counterfactual simulation for two levels of `income` based on our `b11.13d`, the **brms** model most closely corresponding to our **rethinking**-based `m11.13`.

```
posterior_samples(b11.13d) %>%
transmute(s1 = b_a1_Intercept + b_b_Intercept * income[1],
s2_orig = b_a2_Intercept + b_b_Intercept * income[2],
s2_new = b_a2_Intercept + b_b_Intercept * income[2] * 2) %>%
mutate(p_orig = purrr::map2_dbl(s1, s2_orig, ~softmax(.x, .y, 0)[2]),
p_new = purrr::map2_dbl(s1, s2_new, ~softmax(.x, .y, 0)[2])) %>%
mutate(p_diff = p_new - p_orig) %>%
mean_qi(p_diff) %>%
mutate_if(is.double, round, digits = 2)
```

```
## p_diff .lower .upper .width .point .interval
## 1 0.04 0 0.15 0.95 mean qi
```

Now let’s build.

### 11.3.2 Predictors matched to observations.

Now consider an example in which each observed outcome has unique predictor values. Suppose you are still modeling career choice. But now you want to estimate the association between each person’s family income and which career they choose. So the predictor variable must have the same value in each linear model, for each row in the data. But now there is a unique parameter multiplying it in each linear model. This provides an estimate of the impact of family income on choice, for each type of career. (p. 362)

```
n <- 500

set.seed(11)

# simulate family incomes for each individual
family_income <- runif(n)

# assign a unique coefficient for each type of event
b      <- c(-2, 0, 2)
career <- rep(NA, n)  # empty vector of choices for each individual

for (i in 1:n) {
  score     <- 0.5 * (1:3) + b * family_income[i]
  p         <- softmax(score[1], score[2], score[3])
  career[i] <- sample(1:3, size = 1, prob = p)
}
```

In effect, we now have three data-generating equations:

\[\begin{align*} s_1 & = 0.5 + -2 \cdot \text{family_income}_i \\ s_2 & = 1.0 + 0 \cdot \text{family_income}_i \\ s_3 & = 1.5 + 2 \cdot \text{family_income}_i, \end{align*}\]

where, because `family_income` is an actual variable that can take on unique values for each row in the data, we can call the first term in each equation the \(\alpha\) parameter and the second term in each equation the \(\beta\) parameter, AND those \(\beta\) parameters will be more than odd double intercepts.
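To build intuition for what those three equations imply, we can push the two extremes of `family_income` through the softmax by hand. This is just a base-**R** sketch (the `softmax_by_hand()` helper is mine); note how career 3 comes to dominate as `family_income` increases, because its \(\beta = 2\):

```r
softmax_by_hand <- function(s) exp(s) / sum(exp(s))

a <- 0.5 * (1:3)  # intercepts: 0.5, 1.0, 1.5
b <- c(-2, 0, 2)  # the family_income coefficients

# implied category probabilities at the extremes of family_income
round(softmax_by_hand(a + b * 0), 2)  # family_income = 0
round(softmax_by_hand(a + b * 1), 2)  # family_income = 1
```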

We might examine what the `family_income` distributions look like across the three levels of `career`. We’ll do it in two plots and combine them with the **patchwork** syntax. The first will be overlapping densities. For the second, we’ll display the proportions of `career` across a discretized version of `family_income` in a stacked area plot.

```
# put the data in a tibble
d <-
  tibble(career = career) %>%
  mutate(family_income = family_income)

p1 <-
  d %>%
  mutate(career = as.factor(career)) %>%
  ggplot(aes(x = family_income, fill = career)) +
  geom_density(size = 0, alpha = 3/4) +
  scale_fill_manual(values = wes_palette("Moonrise2")[c(4, 2, 1)]) +
  theme(legend.position = "none")

p2 <-
  d %>%
  mutate(career = as.factor(career)) %>%
  mutate(fi = santoku::chop_width(family_income, width = .1, start = 0, labels = 1:10)) %>%
  count(fi, career) %>%
  group_by(fi) %>%
  mutate(proportion = n / sum(n)) %>%
  mutate(f = as.double(fi)) %>%
  ggplot(aes(x = (f - 1) / 9, y = proportion, fill = career)) +
  geom_area() +
  scale_fill_manual(values = wes_palette("Moonrise2")[c(4, 2, 1)]) +
  xlab("family_income, discretized")

p1 + p2
```

Since McElreath’s simulation code in his **R** code 11.59 did not contain a `set.seed()` line, it won’t be possible to exactly reproduce his results. Happily, though, this time the results he reported in the text cohere reasonably well with what I got when I ran the code on my computer. They weren’t identical, but they were much closer than for `m11.13` from the last section. Since things are working more smoothly here, I’m going to jump directly to the **brms** code.

```
b11.14 <-
  brm(data = d,
      family = categorical(link = logit, refcat = 3),
      bf(career ~ 1,
         nlf(mu1 ~ a1 + b1 * family_income),
         nlf(mu2 ~ a2 + b2 * family_income),
         a1 + a2 + b1 + b2 ~ 1),
      prior = c(prior(normal(0, 1.5), class = b, nlpar = a1),
                prior(normal(0, 1.5), class = b, nlpar = a2),
                prior(normal(0, 1), class = b, nlpar = b1),
                prior(normal(0, 1), class = b, nlpar = b2)),
      iter = 2000, warmup = 1000, cores = 4, chains = 4,
      seed = 11,
      file = "fits/b11.14")
```

`print(b11.14)`

```
## Family: categorical
## Links: mu1 = logit; mu2 = logit
## Formula: career ~ 1
## mu1 ~ a1 + b1 * family_income
## mu2 ~ a2 + b2 * family_income
## a1 ~ 1
## a2 ~ 1
## b1 ~ 1
## b2 ~ 1
## Data: d (Number of observations: 500)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## a1_Intercept -1.29 0.26 -1.80 -0.80 1.00 2090 1831
## a2_Intercept -1.02 0.22 -1.46 -0.60 1.00 1965 1881
## b1_Intercept -2.50 0.56 -3.61 -1.44 1.00 2172 2176
## b2_Intercept -1.21 0.41 -2.01 -0.41 1.00 1988 2100
##
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```

Check the PSIS-LOO.

```
b11.14 <- add_criterion(b11.14, "loo")

loo(b11.14)
```

```
##
## Computed from 4000 by 500 log-likelihood matrix
##
## Estimate SE
## elpd_loo -330.3 17.0
## p_loo 3.2 0.3
## looic 660.6 34.0
## ------
## Monte Carlo SE of elpd_loo is 0.0.
##
## All Pareto k estimates are good (k < 0.5).
## See help('pareto-k-diagnostic') for details.
```

Now that we actually have predictor variables with which we might estimate conventional \(\beta\) parameters, we finally have more than 2 effective parameters (\(p_\text{LOO}\)).

“Again, computing implied predictions is the safest way to interpret these models. They do a great job of classifying discrete, unordered events. But the parameters are on a scale that is very hard to interpret” (p. 325). Like before, we’ll do that with `fitted()`. Now that we have a predictor, this time we will use the `newdata` argument.

```
nd <- tibble(family_income = seq(from = 0, to = 1, length.out = 60))

f <-
  fitted(b11.14,
         newdata = nd)
```

First we’ll plot the fitted probabilities for each `career` level across the full range of `family_income` values.

```
# wrangle
rbind(f[, , 1],
      f[, , 2],
      f[, , 3]) %>%
  data.frame() %>%
  bind_cols(nd %>% expand(career = 1:3, family_income)) %>%
  mutate(career = str_c("career: ", career)) %>%

  # plot
  ggplot(aes(x = family_income, y = Estimate,
             ymin = Q2.5, ymax = Q97.5,
             fill = career, color = career)) +
  geom_ribbon(alpha = 2/3, size = 0) +
  geom_line(size = 3/4) +
  scale_fill_manual(values = wes_palette("Moonrise2")[c(4, 2, 1)]) +
  scale_color_manual(values = wes_palette("Moonrise2")[c(4, 2, 1)]) +
  scale_x_continuous(breaks = 0:2 / 2) +
  scale_y_continuous("probability", limits = c(0, 1),
                     breaks = 0:3 / 3, labels = c("0", ".33", ".67", "1")) +
  theme(axis.text.y = element_text(hjust = 0),
        legend.position = "none") +
  facet_wrap(~ career)
```

If we’re willing to summarize those fitted lines by their posterior means, we could also make a model-implied version of the stacked area plot from above.

```
# annotation
text <-
  tibble(family_income = c(.45, .3, .15),
         proportion    = c(.65, .8, .95),
         label         = str_c("career: ", 3:1),
         color         = c("a", "a", "b"))

# wrangle
rbind(f[, , 1],
      f[, , 2],
      f[, , 3]) %>%
  data.frame() %>%
  bind_cols(nd %>% expand(career = 1:3, family_income)) %>%
  group_by(family_income) %>%
  mutate(proportion = Estimate / sum(Estimate),
         career = factor(career)) %>%

  # plot!
  ggplot(aes(x = family_income, y = proportion)) +
  geom_area(aes(fill = career)) +
  geom_text(data = text,
            aes(label = label, color = color),
            family = "Times", size = 4.25) +
  scale_color_manual(values = wes_palette("Moonrise2")[4:3]) +
  scale_fill_manual(values = wes_palette("Moonrise2")[c(4, 2, 1)]) +
  theme(legend.position = "none")
```

For more practice fitting multinomial models with **brms**, check out Chapter 22 of my (2020c) translation of Kruschke’s (2015) text.

#### 11.3.2.1 Multinomial in disguise as Poisson.

Here we fit a multinomial likelihood by refactoring it to a series of Poissons. Let’s retrieve the Berkeley data.

```
data(UCBadmit, package = "rethinking")
d <- UCBadmit
rm(UCBadmit)
```

Fit the models.

```
# binomial model of overall admission probability
b11.binom <-
  brm(data = d,
      family = binomial,
      admit | trials(applications) ~ 1,
      prior(normal(0, 1.5), class = Intercept),
      iter = 2000, warmup = 1000, cores = 3, chains = 3,
      seed = 11,
      file = "fits/b11.binom")

# Poisson model of overall admission rate and rejection rate
b11.pois <-
  brm(data = d %>%
        mutate(rej = reject),  # 'reject' is a reserved word
      family = poisson,
      mvbind(admit, rej) ~ 1,
      prior(normal(0, 1.5), class = Intercept),
      iter = 2000, warmup = 1000, cores = 3, chains = 3,
      seed = 11,
      file = "fits/b11.pois")
```

Note, the `mvbind()` syntax made `b11.pois` a multivariate Poisson model. Starting with version 2.0.0, **brms** supports a variety of multivariate models, which you might learn more about with Bürkner’s (2021d) vignette, *Estimating multivariate models with brms*. Anyway, here are the implications of `b11.pois`.

```
# extract the samples
post <- posterior_samples(b11.pois)

# wrangle
post %>%
  mutate(admit  = exp(b_admit_Intercept),
         reject = exp(b_rej_Intercept)) %>%
  pivot_longer(admit:reject) %>%

  # plot
  ggplot(aes(x = value, y = name, fill = name)) +
  stat_halfeye(point_interval = median_qi, .width = .95,
               color = wes_palette("Moonrise2")[4]) +
  scale_fill_manual(values = wes_palette("Moonrise2")[1:2]) +
  labs(title = "Mean admit/reject rates across departments",
       x = "# applications",
       y = NULL) +
  theme(axis.ticks.y = element_blank(),
        legend.position = "none")
```

We might compare the model summaries.

`summary(b11.binom)$fixed`

```
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept -0.4556919 0.02999385 -0.5123193 -0.3941477 1.00131 1245 1403
```

`summary(b11.pois)$fixed`

```
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## admit_Intercept 4.984153 0.02393449 4.937026 5.029603 1.001556 2534 1733
## rej_Intercept 5.440899 0.01910679 5.403373 5.478298 1.004164 2747 2061
```

Here’s the posterior mean for the probability of admission, based on `b11.binom`.

```
fixef(b11.binom)[, "Estimate"] %>%
inv_logit_scaled()
```

`## [1] 0.3880083`

Happily, we get the same value within simulation error from model `b11.pois`.

```
k <-
  fixef(b11.pois) %>%
  as.numeric()

exp(k[1]) / (exp(k[1]) + exp(k[2]))
```

`## [1] 0.3877581`

The formula for what we just did in code is

\[p_\text{admit} = \frac{\lambda_1}{\lambda_1 + \lambda_2} = \frac{\exp (\alpha_1)}{\exp (\alpha_1) + \exp (\alpha_2)}.\]
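For intercept-only models, that identity even holds at the level of simple point estimates: the Poisson rate estimate for a count is just its sample mean, so the ratio above reduces to the overall admission proportion. Here’s a toy check with made-up counts (these numbers are illustrative; they are not the `UCBadmit` data):

```r
# illustrative admit/reject counts for three hypothetical departments
admit  <- c(120, 90, 30)
reject <- c(80, 110, 70)

# binomial-style point estimate of the overall admission probability
p_binom <- sum(admit) / sum(admit + reject)

# intercept-only Poisson rate estimates are just the sample means
lambda_1 <- mean(admit)
lambda_2 <- mean(reject)
p_pois   <- lambda_1 / (lambda_1 + lambda_2)

c(p_binom, p_pois)  # the two estimates agree
```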

To get a better appreciation of how well the two model types converge on the same solution, we might plot the full posterior for the admissions probability from each.

```
# wrangle
bind_cols(
  posterior_samples(b11.pois) %>%
    mutate(`the Poisson` = exp(b_admit_Intercept) / (exp(b_admit_Intercept) + exp(b_rej_Intercept))),
  posterior_samples(b11.binom) %>%
    mutate(`the binomial` = inv_logit_scaled(b_Intercept))
) %>%
  pivot_longer(starts_with("the")) %>%

  # plot
  ggplot(aes(x = value, y = name, fill = name)) +
  stat_halfeye(point_interval = median_qi, .width = c(.95, .5),
               color = wes_palette("Moonrise2")[4]) +
  scale_fill_manual(values = c(wes_palette("Moonrise2")[2:1])) +
  labs(title = "Two models, same marginal posterior",
       x = "admissions probability",
       y = NULL) +
  coord_cartesian(ylim = c(1.5, 2.25)) +
  theme(axis.text.y = element_text(hjust = 0),
        axis.ticks.y = element_blank(),
        legend.position = "none")
```

## 11.4 Summary

This chapter described some of the most common generalized linear models, those used to model counts. It is important to never convert counts to proportions before analysis, because doing so destroys information about sample size. A fundamental difficulty with these models is that parameters are on a different scale, typically log-odds (for binomial) or log-rate (for Poisson), than the outcome variable they describe. Therefore computing implied predictions is even more important than before. (p. 365)

## 11.5 Bonus: Survival analysis

In the middle of the thirteenth lecture of his 2019 lecture series, McElreath briefly covered continuous-time survival analysis. Sadly, the topic didn’t make it into the text. Here we’ll slip it in as a bonus section. To fully understand this material, do listen to that portion of the lecture. It’s only about ten minutes.

Now let’s load the `AustinCats` data.

```
data(AustinCats, package = "rethinking")
d <- AustinCats
rm(AustinCats)

glimpse(d)
```

```
## Rows: 22,356
## Columns: 9
## $ id <fct> A730601, A679549, A683656, A709749, A733551, A756485, A732960, A664571, A727402, A7495…
## $ days_to_event <int> 1, 25, 4, 41, 9, 4, 4, 5, 24, 2, 34, 27, 3, 151, 106, 4, 55, 1, 4, 30, 18, 5, 34, 1, 1…
## $ date_out <fct> 07/08/2016 09:00:00 AM, 06/16/2014 01:54:00 PM, 07/17/2014 04:57:00 PM, 09/22/2015 12:…
## $ out_event <fct> Transfer, Transfer, Adoption, Transfer, Transfer, Adoption, Adoption, Adoption, Adopti…
## $ date_in <fct> 07/07/2016 12:11:00 PM, 05/22/2014 03:43:00 PM, 07/13/2014 01:20:00 PM, 08/12/2015 06:…
## $ in_event <fct> Stray, Stray, Stray, Stray, Stray, Stray, Stray, Owner Surrender, Stray, Stray, Stray,…
## $ breed <fct> Domestic Shorthair Mix, Domestic Shorthair Mix, Snowshoe Mix, Domestic Shorthair Mix, …
## $ color <fct> Blue Tabby, Black/White, Lynx Point, Calico, Brown Tabby/White, Blue Tabby, Calico, To…
## $ intake_age <int> 7, 1, 2, 12, 1, 1, 2, 24, 1, 3, 4, 12, 1, 7, 0, 12, 1, 12, 1, 24, 24, 3, 1, 24, 1, 24,…
```

At the moment, it doesn’t look like the **rethinking** package contains documentation for the `AustinCats` data. Based on McElreath’s lecture, he downloaded them from the website of an animal shelter in Austin, TX. We have data on 22,356 cats, including whether they were adopted and how long it took. The cats came in a variety of colors. Here are the first ten.

```
d %>%
  count(color) %>%
  slice(1:10)
```

```
## color n
## 1 Agouti 3
## 2 Agouti/Brown Tabby 1
## 3 Agouti/White 1
## 4 Apricot 1
## 5 Black 2965
## 6 Black Smoke 83
## 7 Black Smoke/Black Tiger 1
## 8 Black Smoke/White 21
## 9 Black Tabby 119
## 10 Black Tabby/White 43
```

McElreath wondered whether it took longer for black cats to be adopted. If you look at the `color` categories, above, you’ll see the people doing the data entry were creative with their descriptions. To keep things simple, we’ll just be comparing cats for whom `color == "Black"` to all the others.

```
d <-
  d %>%
  mutate(black = ifelse(color == "Black", "black", "other"))

d %>%
  count(black) %>%
  mutate(percent = 100 * n / sum(n)) %>%
  mutate(label = str_c(round(percent, digits = 1), "%")) %>%
  ggplot(aes(y = black)) +
  geom_col(aes(x = n, fill = black)) +
  geom_text(aes(x = n - 250, label = label),
            color = wes_palette("Moonrise2")[3], family = "Times", hjust = 1) +
  scale_fill_manual(values = wes_palette("Moonrise2")[c(4, 1)], breaks = NULL) +
  scale_x_continuous(expression(italic(n)), breaks = c(0, count(d, black) %>% pull(n))) +
  labs(title = "Cat color",
       y = NULL) +
  theme(axis.ticks.y = element_blank())
```

Another variable we need to consider is the `out_event`.

```
d %>%
  count(out_event)
```

```
## out_event n
## 1 Adoption 11351
## 2 Censored 549
## 3 Died 369
## 4 Disposal 9
## 5 Euthanasia 636
## 6 Missing 28
## 7 Transfer 9414
```

Happily, most of the cats had `Adoption` as their `out_event`. For our purposes, all of the other options are the same as if they were `Censored`. We’ll make a new variable to indicate that.

```
d <-
  d %>%
  mutate(adopted  = ifelse(out_event == "Adoption", 1, 0),
         censored = ifelse(out_event != "Adoption", 1, 0))

glimpse(d)
```

```
## Rows: 22,356
## Columns: 12
## $ id <fct> A730601, A679549, A683656, A709749, A733551, A756485, A732960, A664571, A727402, A7495…
## $ days_to_event <int> 1, 25, 4, 41, 9, 4, 4, 5, 24, 2, 34, 27, 3, 151, 106, 4, 55, 1, 4, 30, 18, 5, 34, 1, 1…
## $ date_out <fct> 07/08/2016 09:00:00 AM, 06/16/2014 01:54:00 PM, 07/17/2014 04:57:00 PM, 09/22/2015 12:…
## $ out_event <fct> Transfer, Transfer, Adoption, Transfer, Transfer, Adoption, Adoption, Adoption, Adopti…
## $ date_in <fct> 07/07/2016 12:11:00 PM, 05/22/2014 03:43:00 PM, 07/13/2014 01:20:00 PM, 08/12/2015 06:…
## $ in_event <fct> Stray, Stray, Stray, Stray, Stray, Stray, Stray, Owner Surrender, Stray, Stray, Stray,…
## $ breed <fct> Domestic Shorthair Mix, Domestic Shorthair Mix, Snowshoe Mix, Domestic Shorthair Mix, …
## $ color <fct> Blue Tabby, Black/White, Lynx Point, Calico, Brown Tabby/White, Blue Tabby, Calico, To…
## $ intake_age <int> 7, 1, 2, 12, 1, 1, 2, 24, 1, 3, 4, 12, 1, 7, 0, 12, 1, 12, 1, 24, 24, 3, 1, 24, 1, 24,…
## $ black <chr> "other", "other", "other", "other", "other", "other", "other", "other", "other", "othe…
## $ adopted <dbl> 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1,…
## $ censored <dbl> 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0,…
```

Here’s what the distribution of `days_to_event` looks like, when grouped by our new `censored` variable.

```
d %>%
  mutate(censored = factor(censored)) %>%
  filter(days_to_event < 300) %>%
  ggplot(aes(x = days_to_event, y = censored)) +
  # let's just mark off the 50% intervals
  stat_halfeye(.width = .5, fill = wes_palette("Moonrise2")[2], height = 4) +
  scale_y_discrete(NULL, labels = c("censored == 0", "censored == 1")) +
  coord_cartesian(ylim = c(1.5, 5.1)) +
  theme(axis.ticks.y = element_blank())
```

Do note there is a very long right tail that we’ve cut off for the sake of the plot. Anyway, the point of this plot is to show that the distribution of our primary variable, `days_to_event`, looks very different conditional on whether the data were censored. As McElreath covered in the lecture, we definitely don’t want to lose that information by excluding the censored cases from the analysis.
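A small simulation can make that point concrete. Suppose the true waiting times are exponential with a mean of 60 days, but the observation window closes at day 30. If we simply drop the censored cases, the remaining times are systematically short and the estimated mean waiting time is badly biased downward. This sketch is my own illustration, not something from the lecture:

```r
set.seed(11)

n          <- 1e5
true_mean  <- 60  # true mean days to adoption
censor_day <- 30  # the observation window closes here

t <- rexp(n, rate = 1 / true_mean)  # latent waiting times

# keeping only the uncensored cases biases the mean downward
observed <- t[t <= censor_day]
mean(observed)  # far below the true mean of 60
```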

McElreath fit his survival model using the exponential likelihood. We briefly met the exponential likelihood in Chapter 10. As McElreath wrote:

It is a fundamental distribution of distance and duration, kinds of measurements that represent displacement from some point of reference, either in time or space. If the probability of an event is constant in time or across space, then the distribution of events tends towards exponential. The exponential distribution has maximum entropy among all non-negative continuous distributions with the same average displacement. (p. 314)

If we let \(y\) be a non-negative continuous variable, the probability density function for the exponential distribution is

\[f(y) = \lambda e^{-\lambda y},\]

where \(\lambda\) is called the rate. The mean of the exponential distribution is the inverse of the rate

\[\operatorname{E}[y] = \frac{1}{\lambda}.\]
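As a quick simulation-based check of that mean-rate relation (a sketch with an arbitrary rate, not part of the text):

```r
set.seed(11)

lambda <- 0.02          # rate: about one event per 50 time units
y <- rexp(1e6, rate = lambda)

c(mean(y), 1 / lambda)  # the sample mean should be close to 1 / lambda = 50
```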

Importantly, **brms** parameterizes exponential models in terms of \(\operatorname{E}[y]\). By default, it uses the log link. This is the same set-up McElreath used for **rethinking** in his lecture. To get a sense of how this all works, we can write our continuous-time survival model as

\[\begin{align*} \text{days_to_event}_i | \text{censored}_i = 0 & \sim \operatorname{Exponential}(\lambda_i) \\ \text{days_to_event}_i | \text{censored}_i = 1 & \sim \operatorname{Exponential-CCDF}(\lambda_i) \\ \lambda_i & = 1 / \mu_i \\ \log \mu_i & = \alpha_{\text{black}[i]} \\ \alpha & \sim \operatorname{Normal}(0, 1). \end{align*}\]
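In log-likelihood terms, an adoption observed at day \(t\) contributes the exponential log density, while a case censored at day \(t\) contributes the log of the complementary cumulative distribution function, \(\log \Pr(T > t) = -\lambda t\). Here’s a base-**R** sketch of those two contributions, with an arbitrary rate:

```r
lambda <- 1 / 50  # rate implied by a mean of 50 days
t      <- 30      # days elapsed

# observed adoption at day t: log density
ll_observed <- dexp(t, rate = lambda, log = TRUE)

# still waiting at day t (censored): log survival probability
ll_censored <- pexp(t, rate = lambda, lower.tail = FALSE, log.p = TRUE)

c(ll_censored, -lambda * t)  # the analytic form, -lambda * t, matches
```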

This is the same model McElreath discussed in the lecture. We’ve just renamed a couple variables. When you fit a continuous-time survival analysis with `brm()`, you’ll want to tell the software about how the data have been censored with help from the `cens()` function. For many of the models in this chapter, we used the `trials()` function to include the \(n_i\) information into our binomial models. Both `trials()` and `cens()` are members of a class of functions designed to provide supplemental information about our criterion variables to `brm()`. The `cens()` function lets us add in information about censoring. In his lecture, McElreath mentioned there can be different kinds of censoring. **brms** can handle variables with left, right, or interval censoring. In the case of our `days_to_event` data, some of the values have been right censored, which is typical in survival models. We will feed this information into the model with the `formula` code `days_to_event | cens(censored)`, where `censored` is the name of the variable in our data that indexes the censoring. The `cens()` function has been set up to expect our data to be coded as either

`'left'`

,`'none'`

,`'right'`

, and/or`'interval'`

; or`-1`

,`0`

,`1`

, and/or`2`

.

Since we coded our `censored`

variable as *censored* = 1 and *not censored* = 0, we have followed the second coding scheme. For more on the topic, see the `Additional response information`

subsection within the `brmsformula`

section of **brms** reference manual (Bürkner, 2021i). Here’s how to fit our survival model with **brms**.

```
b11.15 <-
  brm(data = d,
      family = exponential,
      days_to_event | cens(censored) ~ 0 + black,
      prior(normal(0, 1), class = b),
      iter = 2000, warmup = 1000, chains = 4, cores = 4,
      seed = 11,
      file = "fits/b11.15")
```

Check the summary.

`print(b11.15)`

```
## Family: exponential
## Links: mu = log
## Formula: days_to_event | cens(censored) ~ 0 + black
## Data: d (Number of observations: 22356)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## blackblack 4.05 0.03 4.00 4.10 1.00 3256 2435
## blackother 3.88 0.01 3.86 3.90 1.00 3965 2441
##
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```

Since we modeled \(\log \mu_i\), we need to transform our \(\alpha\) parameters back into the \(\lambda\) metric using the formula

\[\begin{align*} \log \mu & = \alpha_\text{black}, && \text{and} \\ \lambda & = 1 / \mu, && \text{therefore} \\ \lambda_\text{black} & = 1 / \exp(\alpha_\text{black}). \end{align*}\]
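
To make that transformation concrete, here’s a quick Python check using the rounded posterior mean for `blackblack` from the summary above (the small discrepancy from the exact output below comes purely from the rounding):

```python
import math

# rounded posterior mean for alpha_black on the log-mu scale
alpha_black = 4.05

# brms models log(mu); the exponential rate is lambda = 1 / mu = exp(-alpha)
lam = 1 / math.exp(alpha_black)
print(round(lam, 5))  # about 0.0174 adoptions per day
```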

Here are the posterior means for our two \(\lambda\)’s.

`1 / exp(fixef(b11.15)[, -2])`

```
## Estimate Q2.5 Q97.5
## blackblack 0.01740042 0.01832528 0.01650372
## blackother 0.02065009 0.02105913 0.02023061
```

It still might not be clear what any of this all means. To get a better sense, let’s make our version of one of the plots from McElreath’s lecture.

```
# annotation
<-
text tibble(color = c("black", "other"),
days = c(40, 34),
p = c(.55, .45),
label = c("black cats", "other cats"),
hjust = c(0, 1))
# wrangle
<-
f fixef(b11.15) %>%
data.frame() %>%
rownames_to_column() %>%
mutate(color = str_remove(rowname, "black")) %>%
expand(nesting(Estimate, Q2.5, Q97.5, color),
days = 0:100) %>%
mutate(m = 1 - pexp(days, rate = 1 / exp(Estimate)),
ll = 1 - pexp(days, rate = 1 / exp(Q2.5)),
ul = 1 - pexp(days, rate = 1 / exp(Q97.5)))
# plot!
%>%
f ggplot(aes(x = days)) +
geom_hline(yintercept = .5, linetype = 3, color = wes_palette("Moonrise2")[2]) +
geom_ribbon(aes(ymin = ll, ymax = ul, fill = color),
alpha = 1/2) +
geom_line(aes(y = m, color = color)) +
geom_text(data = text,
aes(y = p, label = label, hjust = hjust, color = color),
family = "Times") +
scale_fill_manual(values = wes_palette("Moonrise2")[c(4, 1)], breaks = NULL) +
scale_color_manual(values = wes_palette("Moonrise2")[c(4, 1)], breaks = NULL) +
scale_y_continuous("proportion remaining", breaks = c(0, .5, 1), limits = 0:1) +
xlab("days to adoption")
```

McElreath’s hypothesis is correct: Black cats are adopted at lower rates than cats of other colors. Another way to explore this model is to ask: *About how many days would it take for half of the cats of a given color to be adopted?* We can do this with help from the `qexp()` function. For example:

`qexp(p = .5, rate = 1 / exp(fixef(b11.15)[1, 1]))`

`## [1] 39.83509`
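
That `qexp()` call is just the closed-form median of the exponential distribution, \(\ln(2) / \lambda\). A quick Python sketch of the same arithmetic, again using the rounded posterior mean 4.05 (so the result differs slightly from the exact R output above):

```python
import math

alpha_black = 4.05                    # rounded posterior mean on the log-mu scale
rate = 1 / math.exp(alpha_black)      # lambda = exp(-alpha)
median_days = math.log(2) / rate      # same quantity as R's qexp(p = .5, rate)
print(median_days)                    # roughly 39.8 days
```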

But that’s just using one of the posterior means. Here’s that information using the full posterior distributions for our two levels of `black`.

```
# wrangle
post <-
  posterior_samples(b11.15) %>%
  pivot_longer(starts_with("b_")) %>%
  mutate(color = str_remove(name, "b_black"),
         days  = qexp(p = .5, rate = 1 / exp(value)))

# axis breaks
medians <-
  group_by(post, color) %>%
  summarise(med = median(days)) %>%
  pull(med) %>%
  round(., digits = 1)

# plot!
post %>%
  ggplot(aes(x = days, y = color)) +
  stat_halfeye(.width = .95, fill = wes_palette("Moonrise2")[2], height = 4) +
  scale_x_continuous("days until 50% are adopted",
                     breaks = c(30, medians, 45), labels = c("30", medians, "45"),
                     limits = c(30, 45)) +
  ylab(NULL) +
  coord_cartesian(ylim = c(1.5, 5.1)) +
  theme(axis.ticks.y = element_blank())
```

The model suggests it takes about six days longer for half of the black cats to be adopted.
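
We can recover that roughly six-day gap directly from the posterior means in the summary, using the same \(\ln(2) / \lambda\) median formula. A hedged Python sketch with the rounded estimates:

```python
import math

# rounded posterior means from the model summary, log-mu scale
alpha_black, alpha_other = 4.05, 3.88

def median_days(alpha):
    # lambda = exp(-alpha), so the exponential median is ln(2) * exp(alpha)
    return math.log(2) * math.exp(alpha)

diff = median_days(alpha_black) - median_days(alpha_other)
print(diff)  # roughly 6 days longer for black cats
```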

### 11.5.1 Survival summary.

We’ve really just scratched the surface on survival models. In addition to those which use the exponential likelihood, **brms** supports a variety of survival models. Some of the more popular likelihoods are the log-Normal, the gamma, and the Weibull. For details, see the *Time-to-event models* section of Bürkner’s (2021h) vignette, *Parameterization of response distributions in brms*. Starting with version 2.13.5, **brms** also supports the Cox proportional hazards model via `family = cox`. If you’re tricky with your coding, you can also fit discrete-time survival models with the binomial likelihood (see here). For some examples of discrete and continuous-time survival models, you might check out my (2020b) ebook translation of Singer and Willett’s (2003) text, *Applied longitudinal data analysis: Modeling change and event occurrence*, the later chapters of which provide an exhaustive introduction to survival analysis.
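
The coding trick behind the discrete-time approach is expanding each subject’s duration into *person-period* records: one binary row per period at risk, scoring 1 only in the final period if the event occurred, so a binomial (Bernoulli) model of those rows estimates the discrete-time hazard. A minimal Python sketch of that expansion; `person_period` is a hypothetical helper, not a function from any of the packages used above:

```python
def person_period(duration, event):
    """Expand one subject's (duration, event) pair into discrete-time
    person-period records: one row per period at risk, with y = 1 only
    in the final period, and only if the event was observed (event == 1).
    A censored subject (event == 0) contributes all-zero rows."""
    return [{"period": t, "y": int(event == 1 and t == duration)}
            for t in range(1, duration + 1)]

# a cat adopted in period 3 yields rows with y = 0, 0, 1
print(person_period(3, 1))
```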

## Session info

`sessionInfo()`

```
## R version 4.0.4 (2021-02-15)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] parallel stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] rethinking_2.13 rstan_2.21.2 StanHeaders_2.21.0-7 ggrepel_0.9.1 GGally_2.1.1
## [6] ggdag_0.2.3 patchwork_1.1.1 tidybayes_2.3.1 ggthemes_4.2.4 wesanderson_0.3.6
## [11] brms_2.15.0 Rcpp_1.0.6 flextable_0.6.4 forcats_0.5.1 stringr_1.4.0
## [16] dplyr_1.0.5 purrr_0.3.4 readr_1.4.0 tidyr_1.1.3 tibble_3.1.0
## [21] ggplot2_3.3.3 tidyverse_1.3.0
##
## loaded via a namespace (and not attached):
## [1] readxl_1.3.1 uuid_0.1-4 backports_1.2.1 systemfonts_1.0.1 plyr_1.8.6
## [6] igraph_1.2.6 svUnit_1.0.3 splines_4.0.4 crosstalk_1.1.0.1 TH.data_1.0-10
## [11] rstantools_2.1.1 inline_0.3.17 digest_0.6.27 htmltools_0.5.1.1 viridis_0.5.1
## [16] rsconnect_0.8.16 fansi_0.4.2 magrittr_2.0.1 vembedr_0.1.4 graphlayouts_0.7.1
## [21] modelr_0.1.8 RcppParallel_5.0.2 matrixStats_0.57.0 officer_0.3.17 xts_0.12.1
## [26] sandwich_3.0-0 prettyunits_1.1.1 colorspace_2.0-0 rvest_0.3.6 ggdist_2.4.0.9000
## [31] haven_2.3.1 xfun_0.22 callr_3.5.1 crayon_1.4.1 jsonlite_1.7.2
## [36] lme4_1.1-25 survival_3.2-7 zoo_1.8-8 glue_1.4.2 polyclip_1.10-0
## [41] gtable_0.3.0 emmeans_1.5.2-1 V8_3.4.0 distributional_0.2.2 pkgbuild_1.2.0
## [46] shape_1.4.5 abind_1.4-5 scales_1.1.1 mvtnorm_1.1-1 emo_0.0.0.9000
## [51] DBI_1.1.0 miniUI_0.1.1.1 viridisLite_0.3.0 xtable_1.8-4 HDInterval_0.2.2
## [56] stats4_4.0.4 DT_0.16 htmlwidgets_1.5.2 httr_1.4.2 threejs_0.3.3
## [61] RColorBrewer_1.1-2 arrayhelpers_1.1-0 ellipsis_0.3.1 santoku_0.5.0 reshape_0.8.8
## [66] farver_2.0.3 pkgconfig_2.0.3 loo_2.4.1 dbplyr_2.0.0 utf8_1.1.4
## [71] labeling_0.4.2 tidyselect_1.1.0 rlang_0.4.10 reshape2_1.4.4 later_1.1.0.1
## [76] dagitty_0.3-1 munsell_0.5.0 cellranger_1.1.0 tools_4.0.4 cli_2.3.1
## [81] generics_0.1.0 broom_0.7.5 ggridges_0.5.2 evaluate_0.14 fastmap_1.0.1
## [86] processx_3.4.5 knitr_1.31 fs_1.5.0 tidygraph_1.2.0 zip_2.1.1
## [91] ggraph_2.0.4 nlme_3.1-152 mime_0.10 projpred_2.0.2 xml2_1.3.2
## [96] compiler_4.0.4 bayesplot_1.8.0 shinythemes_1.1.2 rstudioapi_0.13 gamm4_0.2-6
## [101] curl_4.3 reprex_0.3.0 tweenr_1.0.1 statmod_1.4.35 stringi_1.5.3
## [106] highr_0.8 ps_1.6.0 Brobdingnag_1.2-6 gdtools_0.2.2 lattice_0.20-41
## [111] Matrix_1.3-2 nloptr_1.2.2.2 markdown_1.1 shinyjs_2.0.0 vctrs_0.3.6
## [116] pillar_1.5.1 lifecycle_1.0.0 bridgesampling_1.0-0 estimability_1.3 data.table_1.14.0
## [121] httpuv_1.5.4 R6_2.5.0 bookdown_0.21 promises_1.1.1 gridExtra_2.3
## [126] codetools_0.2-18 boot_1.3-26 colourpicker_1.1.0 MASS_7.3-53 gtools_3.8.2
## [131] assertthat_0.2.1 withr_2.4.1 shinystan_2.5.0 multcomp_1.4-16 mgcv_1.8-33
## [136] hms_0.5.3 grid_4.0.4 coda_0.19-4 minqa_1.2.4 rmarkdown_2.7
## [141] ggforce_0.3.2 shiny_1.5.0 lubridate_1.7.9.2 base64enc_0.1-3 dygraphs_1.1.1.6
```

Though McElreath didn’t cover it here, it’s also fine to fit binomial models using the probit link. Gelman et al. (2020) covered probit regression in Section 15.4. With **brms**, it’s simply a matter of setting `family = binomial(link = "probit")` within `brm()`.↩︎