# 21 Dichotomous Predicted Variable

This chapter considers data structures that consist of a dichotomous predicted variable. The early chapters of the book were focused on this type of data, but now we reframe the analyses in terms of the generalized linear model…

The traditional treatment of these sorts of data structure is called “logistic regression.” In Bayesian software it is easy to generalize the traditional models so they are robust to outliers, allow different variances within levels of a nominal predictor, and have hierarchical structure to share information across levels or factors as appropriate.

## 21.1 Multiple metric predictors

“We begin by considering a situation with multiple metric predictors, because this case makes it easiest to visualize the concepts of logistic regression” (p. 623).

The 3D wireframe plot in Figure 21.1 is technically beyond the scope of our current ggplot2 paradigm. But we will discuss an alternative in the end of Section 21.1.2.

### 21.1.1 The model and implementation in JAGS brms.

Our statistical model will follow the form

\begin{align*} \mu & = \operatorname{logistic}(\beta_0 + \beta_1 x_1 + \beta_2 x_2) \\ y & \sim \operatorname{Bernoulli}(\mu) \end{align*}

where

$\operatorname{logistic}(x) = \frac{1}{[1 + \exp (-x)]}.$

The generic brms code for logistic regression using the Bernoulli likelihood looks like so.

fit <-
brm(data = my_data,
family = bernoulli,
y ~ 1 + x1 + x2,
prior = c(prior(normal(0, 2), class = Intercept),
prior(normal(0, 2), class = b)))

Note that this syntax presumes the predictor variables have already been standardized.

We’d be remiss not to point out that you can also specify the model using the binomial distribution. That code would look like this.

fit <-
brm(data = my_data,
family = binomial,
y | trials(1) ~ 1 + x1 + x2,
prior = c(prior(normal(0, 2), class = Intercept),
prior(normal(0, 2), class = b)))

As long as the data are not aggregated, the results of these two should be the same within simulation variance. In brms, the default link for both family = bernoulli and family = binomial models is logit, which is exactly what we want, here. Also, note the additional | trials(1) syntax on the left side of the model formula. You could get away with omitting this in older versions of brms. But newer versions prompt users to specify how many of trials each row in the data represents. This is because, as with the baseball data we’ll use later in the chapter, the binomial distribution includes an $$n$$ parameter. When working with un-aggregated data like what we’re about to do, below, it’s presumed that $$n = 1$$.

We won’t be making our version of Figure 21.1 until a little later in the chapter. However, we can go ahead and make our version of the model diagram in Kruschke’s Figure 21.2. Before we do, let’s discuss the issues of plot colors and theme. For this chapter, we’ll take our color palette from the PNWColors package , which provides color palettes inspired by the beauty of my birthplace,the US Pacific Northwest. Our color palette will be "Mushroom".

library(PNWColors)

pm <- pnw_palette(name = "Mushroom", n = 8)

pm

Our overall plot theme will be a "Mushroom" infused extension of theme_linedraw().

library(tidyverse)

theme_set(
theme_linedraw() +
theme(text = element_text(color = pm[1]),
axis.text = element_text(color = pm[1]),
axis.ticks = element_line(color = pm[1]),
legend.background = element_blank(),
legend.box.background = element_blank(),
legend.key = element_rect(fill = pm[8]),
panel.background = element_rect(fill = pm[8], color = pm[8]),
panel.border = element_rect(colour = pm[1]),
panel.grid = element_blank(),
strip.background = element_rect(fill = pm[1], color = pm[1]),
strip.text = element_text(color = pm[8]))
)

Now make Figure 21.2.

library(patchwork)

# normal density
p1 <-
tibble(x = seq(from = -3, to = 3, by = .1)) %>%
ggplot(aes(x = x, y = (dnorm(x)) / max(dnorm(x)))) +
geom_area(fill = pm[5]) +
annotate(geom = "text",
x = 0, y = .2,
label = "normal",
size = 7, color = pm[1]) +
annotate(geom = "text",
x = c(0, 1.5), y = .6,
label = c("italic(M)[0]", "italic(S)[0]"),
size = 7, color = pm[1], hjust = 0, family = "Times", parse = T) +
scale_x_continuous(expand = c(0, 0)) +
theme_void() +
theme(axis.line.x = element_line(size = 0.5),
plot.background = element_rect(fill = pm[8], color = "white", size = 1))

# second normal density
p2 <-
tibble(x = seq(from = -3, to = 3, by = .1)) %>%
ggplot(aes(x = x, y = (dnorm(x)) / max(dnorm(x)))) +
geom_area(fill = pm[5]) +
annotate(geom = "text",
x = 0, y = .2,
label = "normal",
size = 7, color = pm[1]) +
annotate(geom = "text",
x = c(0, 1.5), y = .6,
label = c("italic(M[j])", "italic(S[j])"),
size = 7, color = pm[1], hjust = 0, family = "Times", parse = T) +
scale_x_continuous(expand = c(0, 0)) +
theme_void() +
theme(axis.line.x = element_line(size = 0.5),
plot.background = element_rect(fill = pm[8], color = "white", size = 1))

## an annotated arrow
# save our custom arrow settings
my_arrow <- arrow(angle = 20, length = unit(0.35, "cm"), type = "closed")
p3 <-
tibble(x    = .5,
y    = 1,
xend = .85,
yend = 0) %>%

ggplot(aes(x = x, xend = xend,
y = y, yend = yend)) +
geom_segment(arrow = my_arrow, color = pm[1]) +
annotate(geom = "text",
x = .55, y = .4,
label = "'~'",
size = 10, color = pm[1], family = "Times", parse = T) +
xlim(0, 1) +
theme_void()

## another annotated arrow
p4 <-
tibble(x    = .5,
y    = 1,
xend = 1/3,
yend = 0) %>%

ggplot(aes(x = x, xend = xend,
y = y, yend = yend)) +
geom_segment(arrow = my_arrow, color = pm[1]) +
annotate(geom = "text",
x = c(.3, .48), y = .4,
label = c("'~'", "italic(j)"),
size = c(10, 7), color = pm[1], family = "Times", parse = T) +
xlim(0, 1) +
theme_void()

# likelihood formula
p5 <-
tibble(x = .5,
y = .5,
label = "logistic(beta[0]+sum()[italic(j)]~beta[italic(j)]~italic(x)[italic(ji)])") %>%

ggplot(aes(x = x, y = y, label = label)) +
geom_text(size = 7, color = pm[1], parse = T, family = "Times") +
scale_x_continuous(expand = c(0, 0), limits = c(0, 1)) +
ylim(0, 1) +
theme_void()

# a third annotated arrow
p6 <-
tibble(x     = c(.375, .6),
y     = c(1/2, 1/2),
label = c("'='", "italic(i)")) %>%

ggplot(aes(x = x, y = y, label = label)) +
geom_text(size = c(10, 7), color = pm[1], parse = T, family = "Times") +
geom_segment(x = .5, xend = .5,
y = 1, yend = 0,
arrow = my_arrow) +
xlim(0, 1) +
theme_void()

# bar plot of Bernoulli data
p7 <-
tibble(x = 0:1,
d = (dbinom(x, size = 1, prob = .6)) / max(dbinom(x, size = 1, prob = .6))) %>%

ggplot(aes(x = x, y = d)) +
geom_col(fill = pm[5], width = .4) +
annotate(geom = "text",
x = .5, y = .2,
label = "Bernoulli",
size = 7, color = pm[1]) +
annotate(geom = "text",
x = .5, y = .94,
label = "mu[italic(i)]",
size = 7, color = pm[1], family = "Times", parse = T) +
xlim(-.75, 1.75) +
theme_void() +
theme(axis.line.x = element_line(size = 0.5),
plot.background = element_rect(fill = pm[8], color = pm[8]))

# the final annotated arrow
p8 <-
tibble(x     = c(.375, .625),
y     = c(1/3, 1/3),
label = c("'~'", "italic(i)")) %>%

ggplot(aes(x = x, y = y, label = label)) +
geom_text(size = c(10, 7), color = pm[1], parse = T, family = "Times") +
geom_segment(x = .5, xend = .5,
y = 1, yend = 0,
color = pm[1], arrow = my_arrow) +
xlim(0, 1) +
theme_void()

# some text
p9 <-
tibble(x     = 1,
y     = .5,
label = "italic(y[i])") %>%

ggplot(aes(x = x, y = y, label = label)) +
geom_text(size = 7, color = pm[1], parse = T, family = "Times") +
xlim(0, 2) +
theme_void()

# define the layout
layout <- c(
area(t = 1, b = 2, l = 1, r = 2),
area(t = 1, b = 2, l = 3, r = 4),
area(t = 3, b = 3, l = 1, r = 2),
area(t = 3, b = 3, l = 3, r = 4),
area(t = 4, b = 4, l = 1, r = 4),
area(t = 5, b = 5, l = 2, r = 3),
area(t = 6, b = 7, l = 2, r = 3),
area(t = 8, b = 8, l = 2, r = 3),
area(t = 9, b = 9, l = 2, r = 3)
)

# combine and plot!
(p1 + p2 + p3 + p4 + p5 + p6 + p7 + p8 + p9) +
plot_layout(design = layout) &
ylim(0, 1) &
theme(plot.margin = margin(0, 5.5, 0, 5.5))

### 21.1.2 Example: Height, weight, and gender.

library(tidyverse)

glimpse(my_data)
## Rows: 110
## Columns: 3
## $male <dbl> 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,… ##$ height <dbl> 63.2, 68.7, 64.8, 67.9, 68.9, 67.8, 68.2, 64.8, 64.3, 64.7, 66.9, 66.9, 67.1, 70.2,…
## $weight <dbl> 168.7, 169.8, 176.6, 246.8, 151.6, 158.0, 168.6, 137.2, 177.0, 128.0, 168.4, 136.2,… Let’s standardize our predictors. my_data <- my_data %>% mutate(height_z = (height - mean(height)) / sd(height), weight_z = (weight - mean(weight)) / sd(weight)) Before we fit a model, we might take a quick look at the data to explore the relations among the continuous variables weight and height and the dummy variable male. The ggMarginal() function from the ggExtra package will help us get a sense of the multivariate distribution by allowing us to add marginal densities to a scatter plot. library(ggExtra) p <- my_data %>% ggplot(aes(x = weight, y = height, fill = male == 1)) + geom_point(aes(color = male == 1), alpha = 3/4) + scale_color_manual(values = pm[c(3, 6)]) + scale_fill_manual(values = pm[c(3, 6)]) + theme(legend.position = "none") p %>% ggMarginal(data = my_data, colour = pm[1], groupFill = T, alpha = .8, type = "density") Looks like the data for which male == 1 are concentrated in the upper right and those for which male == 0 are more so in the lower left. What we’d like is a model that would tell us the optimal dividing line(s) between our male categories with respect to those predictor variables. Open brms. library(brms) Our first logistic model with family = bernoulli uses only weight_z as a predictor. fit21.1 <- brm(data = my_data, family = bernoulli, male ~ 1 + weight_z, prior = c(prior(normal(0, 2), class = Intercept), prior(normal(0, 2), class = b)), iter = 2500, warmup = 500, chains = 4, cores = 4, seed = 21, file = "fits/fit21.01") Here’s the model summary. print(fit21.1) ## Family: bernoulli ## Links: mu = logit ## Formula: male ~ 1 + weight_z ## Data: my_data (Number of observations: 110) ## Draws: 4 chains, each with iter = 2500; warmup = 500; thin = 1; ## total post-warmup draws = 8000 ## ## Population-Level Effects: ## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS ## Intercept -0.08 0.22 -0.51 0.34 1.00 6863 5400 ## weight_z 1.19 0.28 0.65 1.75 1.00 5612 4505 ## ## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS ## and Tail_ESS are effective sample size measures, and Rhat is the potential ## scale reduction factor on split chains (at convergence, Rhat = 1). Now let’s get ready to make our version of Figure 21.3. First, we extract the posterior draws. draws <- as_draws_df(fit21.1) Now we wrangle a bit to make the top panel of Figure 21.3. length <- 200 n_draw <- 20 draws %>% tibble() %>% # take 20 random samples of the posterior draws slice_sample(n = n_draw) %>% # add in a sequence of weight_z expand(nesting(.draw, b_Intercept, b_weight_z), weight_z = seq(from = -2, to = 3.5, length.out = length)) %>% # compute the estimates of interest mutate(male = inv_logit_scaled(b_Intercept + b_weight_z * weight_z), weight = weight_z * sd(my_data$weight) + mean(my_data$weight), thresh = -b_Intercept / b_weight_z * sd(my_data$weight) + mean(my_data$weight)) %>% # plot! ggplot(aes(x = weight)) + geom_hline(yintercept = .5, color = pm[7], size = 1/2) + geom_vline(aes(xintercept = thresh, group = .draw), color = pm[6], size = 2/5, linetype = 2) + geom_line(aes(y = male, group = .draw), color = pm[1], size = 1/3, alpha = 2/3) + geom_point(data = my_data, aes(y = male), alpha = 1/3, color = pm[1]) + labs(title = "Data with Post. Pred.", y = "male") + coord_cartesian(xlim = range(my_data$weight))

We should discuss those thresholds (i.e., the vertical lines) a bit. Kruschke:

The spread of the logistic curves indicates the uncertainty of the estimate; the steepness of the logistic curves indicates the magnitude of the regression coefficient. The $$50\%$$ probability threshold is marked by arrows that drop down from the logistic curve to the $$x$$-axis, near a weight of approximately $$160$$ pounds. The threshold is the $$x$$ value at which $$\mu = 0.5$$, which is $$x = -\beta_0 / \beta_1$$. (p. 626)

It’s important to realize that when you compute the thresholds with $$-\beta_0 / \beta_1$$, this returns the values on the scale of the predictor. In our case, the predictor was weight_z. But since we wanted to plot the data on the scale of the unstandardized variable, weight, we have to convert the thresholds to that metric by multiplying their values by $$s_\text{weight}$$ and then add the product to $$\overline{\text{weight}}$$. If you study it closely, you’ll see that’s what we did when computing the thresh values, above.

Now here we show the marginal distributions in our versions of the lower panels of Figure 21.3.

library(tidybayes)

draws <-
draws %>%
# convert the parameter draws to their natural metric following Equation 21.1 (pp. 624--625)
transmute(Intercept = b_Intercept - (b_weight_z * mean(my_data$weight) / sd(my_data$weight)),
weight    = b_weight_z / sd(my_data$weight)) %>% pivot_longer(everything()) # plot draws %>% ggplot(aes(x = value, y = 0)) + stat_halfeye(point_interval = mode_hdi, .width = .95, shape = 15, point_size = 2.5, point_color = pm[4], slab_color = pm[1], fill = pm[7], color = pm[1], slab_size = 1/2, size = 2, normalize = "panels") + scale_y_continuous(NULL, breaks = NULL) + xlab(NULL) + facet_wrap(~ name, scales = "free", ncol = 2) And here are those exact posterior mode and 95% HDI values. draws %>% group_by(name) %>% mode_hdi() %>% mutate_if(is.double, round, digits = 3) ## # A tibble: 2 × 7 ## name value .lower .upper .width .point .interval ## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> ## 1 Intercept -5.38 -7.74 -2.91 0.95 mode hdi ## 2 weight 0.033 0.018 0.048 0.95 mode hdi If you look back at our code for the lower marginal plots for Figure 21.3, you’ll notice we did a whole lot of argument tweaking within tidybayes::stat_halfeye(). We have a lot of marginal densities ahead of us in this chapter, so we might streamline our code with those settings saved in a custom function. As the vibe I’m going for in those settings is based on some of the plots in Chapter 16 of Wilke’s (2019), Fundamentals of data visualization, we’ll call our function stat_wilke(). stat_wilke <- function(.width = .95, shape = 15, point_size = 2.5, point_color = pm[4], slab_color = pm[1], fill = pm[7], color = pm[1], slab_size = 1/2, size = 2, ...) { stat_halfeye(point_interval = mode_hdi, .width = .width, shape = shape, point_size = point_size, point_color = point_color, slab_color = slab_color, fill = fill, color = color, slab_size = slab_size, size = size, normalize = "panels", ...) } Now fit the two-predictor model using both weight_z and height_z. fit21.2 <- brm(data = my_data, family = bernoulli, male ~ 1 + weight_z + height_z, prior = c(prior(normal(0, 2), class = Intercept), prior(normal(0, 2), class = b)), iter = 2500, warmup = 500, chains = 4, cores = 4, seed = 21, file = "fits/fit21.02") Here’s the model summary. print(fit21.2) ## Family: bernoulli ## Links: mu = logit ## Formula: male ~ 1 + weight_z + height_z ## Data: my_data (Number of observations: 110) ## Draws: 4 chains, each with iter = 2500; warmup = 500; thin = 1; ## total post-warmup draws = 8000 ## ## Population-Level Effects: ## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS ## Intercept -0.35 0.30 -0.94 0.22 1.00 5726 5626 ## weight_z 0.67 0.35 0.01 1.39 1.00 6719 5765 ## height_z 2.62 0.51 1.68 3.71 1.00 5932 4777 ## ## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS ## and Tail_ESS are effective sample size measures, and Rhat is the potential ## scale reduction factor on split chains (at convergence, Rhat = 1). Before we make our plots for Figure 21.4, we’ll need to extract the posterior samples and transform a little. draws <- as_draws_df(fit21.2) %>% mutate(b_weight = b_weight_z / sd(my_data$weight),
b_height  = b_height_z / sd(my_data$height), Intercept = b_Intercept - ((b_weight_z * mean(my_data$weight) / sd(my_data$weight)) + (b_height_z * mean(my_data$height) / sd(my_data$height)))) %>% select(.draw, b_weight:Intercept) head(draws) ## # A tibble: 6 × 4 ## .draw b_weight b_height Intercept ## <int> <dbl> <dbl> <dbl> ## 1 1 0.0406 0.677 -51.7 ## 2 2 0.0371 0.847 -62.3 ## 3 3 0.0345 0.852 -62.2 ## 4 4 0.0433 0.425 -35.4 ## 5 5 0.0250 0.636 -46.7 ## 6 6 0.0174 0.748 -53.3 Here’s our version of Figure 21.4.a. set.seed(21) # we need this for the slice_sample() function draws %>% slice_sample(n = 20) %>% expand(nesting(.draw, Intercept, b_weight, b_height), weight = c(80, 280)) %>% # this follows the Equation near the top of p. 629 mutate(height = (-Intercept / b_height) + (-b_weight / b_height) * weight) %>% # now plot ggplot(aes(x = weight, y = height)) + geom_line(aes(group = .draw), color = pm[7], size = 2/5, alpha = 2/3) + geom_text(data = my_data, aes(label = male, color = male == 1)) + scale_color_manual(values = pm[c(4, 1)]) + ggtitle("Data with Post. Pred.") + coord_cartesian(xlim = range(my_data$weight),
ylim = range(my_data$height)) + theme(legend.position = "none") With just a tiny bit more wrangling, we’ll be ready to make the bottom panels of Figure 21.4. draws %>% pivot_longer(-.draw) %>% mutate(name = factor(str_remove(name, "b_"), levels = c("Intercept", "weight", "height"))) %>% ggplot(aes(x = value, y = 0)) + stat_wilke() + scale_y_continuous(NULL, breaks = NULL) + xlab(NULL) + facet_wrap(~ name, scales = "free", ncol = 3) Did you notice our use of stat_wilke()? By now, you know how to use mode_hdi() to return those exact summary values if you’d like them. Now remember how we backed away from Figure 21.1? Well, when you have a logistic regression with two predictors, there is a reasonable way to express those three dimensions on a two-dimensional grid. Now we have the results from fit21.2, let’s try it out. First, we need a grid of values for our two predictors, weight_z and height_z. length <- 100 nd <- crossing(weight_z = seq(from = -3.5, to = 3.5, length.out = length), height_z = seq(from = -3.5, to = 3.5, length.out = length)) Second, we plug those values into fitted() and wrangle. f <- fitted(fit21.2, newdata = nd, scale = "linear") %>% as_tibble() %>% # note we're only working with the posterior mean, here transmute(prob = Estimate %>% inv_logit_scaled()) %>% bind_cols(nd) %>% mutate(weight = (weight_z * sd(my_data$weight) + mean(my_data$weight)), height = (height_z * sd(my_data$height) + mean(my_data$height))) glimpse(f) ## Rows: 10,000 ## Columns: 5 ##$ prob     <dbl> 7.042908e-06, 8.476457e-06, 1.020180e-05, 1.227831e-05, 1.477749e-05, 1.778534e-0…
## $weight_z <dbl> -3.5, -3.5, -3.5, -3.5, -3.5, -3.5, -3.5, -3.5, -3.5, -3.5, -3.5, -3.5, -3.5, -3.… ##$ height_z <dbl> -3.500000, -3.429293, -3.358586, -3.287879, -3.217172, -3.146465, -3.075758, -3.0…
## $weight <dbl> 33.45626, 33.45626, 33.45626, 33.45626, 33.45626, 33.45626, 33.45626, 33.45626, 3… ##$ height   <dbl> 53.24736, 53.51234, 53.77731, 54.04229, 54.30727, 54.57224, 54.83722, 55.10219, 5…

Third, we’re ready to plot. Here we’ll express the third dimension, probability, on a color spectrum.

f %>%
ggplot(aes(x = weight, y = height)) +
geom_raster(aes(fill = prob),
interpolate = T) +
geom_text(data = my_data,
aes(label = male, color = male == 1),
show.legend = F) +
scale_color_manual(values = pm[c(8, 1)]) +
scale_fill_gradientn(colours = pnw_palette(name = "Mushroom", n = 101),
limits = c(0, 1)) +
scale_y_continuous(position = "right") +
coord_cartesian(xlim = range(my_data$weight), ylim = range(my_data$height)) +
theme(legend.position = "left")

If you look way back to Figure 21.1 (p. 623), you’ll see the following formula at the top:

$y \sim \operatorname{dbern}(m), m = \operatorname{logistic}(0.018 x_1 + 0.7 x_2 - 50).$

Now while you keep your finger on that equation, take another look at the last line in Kruschke’s Equation 21.1,

$\operatorname{logit}(\mu) = \underbrace{\zeta_0 - \sum_j \frac{\zeta_j}{s_{x_j}} \overline x_j}_{\beta_0} + \sum_j \underbrace{\frac{\zeta_j}{s_{x_j}} \overline x_j}_{\beta_j},$

where the $$\zeta$$’s are the parameters from the model based on standardized predictors. Our fit21.2 was based on standardized weight and height values (i.e., weight_z and height_z), yielding model coefficients in the $$\zeta$$ metric. Here we use the formula above to convert our fit21.2 estimates to their unstandardized $$\beta$$ metric. For simplicity, we’ll just take their means.

as_draws_df(fit21.2) %>%
transmute(beta_0 = b_Intercept - ((b_weight_z * mean(my_data$weight) / sd(my_data$weight)) +
((b_height_z * mean(my_data$height) / sd(my_data$height)))),
beta_1 = b_weight_z / sd(my_data$weight), beta_2 = b_height_z / sd(my_data$height)) %>%
summarise_all(~ mean(.) %>% round(., digits = 3))
## # A tibble: 1 × 3
##   beta_0 beta_1 beta_2
##    <dbl>  <dbl>  <dbl>
## 1  -49.7  0.019  0.699

Within rounding error, those values are the same ones in the formula at the top of Kruschke’s Figure 21.1! That is, our last plot was a version of Figure 21.1.

Hopefully this helps make sense of what the thresholds in Figure 21.4.a represented. But do note a major limitation of this visualization approach. By expressing the threshold with multiple lines drawn from the posterior in Figure 21.4.a, we expressed the uncertainty inherent in the posterior distribution. However, for this probability plane approach, we’ve taken a single value from the posterior, the mean (i.e., the Estimate), to compute the probabilities. Though beautiful, our probability-plane plot does a poor job expressing the uncertainty in the model. If you’re curious how one might include uncertainty into a plot like this, check out the intriguing blog post by Adam Pearce, Communicating model uncertainty over space.

## 21.2 Interpreting the regression coefficients

In this section, I’ll discuss how to interpret the parameters in logistic regression. The first subsection explains how to interpret the numerical magnitude of the slope coefficients in terms of “log odds.” The next subsection shows how data with relatively few $$1$$’s or $$0$$’s can yield ambiguity in the parameter estimates. Then an example with strongly correlated predictors reveals tradeoffs in slope coefficients. Finally, I briefly describe the meaning of multiplicative interaction for logistic regression. (p. 629)

### 21.2.1 Log odds.

When the logistic regression formula is written using the logit function, we have $$\operatorname{logit}(\mu) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$$. The formula implies that whenever $$x_1$$ goes up by $$1$$ unit (on the $$x_1$$ scale), then $$\operatorname{logit}(\mu)$$ goes up by an amount $$\beta_1$$. And whenever $$x_2$$ goes up by $$1$$ unit (on the $$x_2$$ scale), then $$\operatorname{logit}(\mu)$$ goes up by an amount $$\beta_2$$. Thus, the regression coefficients are telling us about increases in $$\operatorname{logit}(\mu)$$. To understand the regression coefficients, we need to understand $$\operatorname{logit}(\mu)$$. (pp. 629–630)

Given the logit function is the inverse of the logistic, which itself is

$\operatorname{logistic}(x) = \frac{1}{1 + \exp (−x)},$

and given the formula

$\operatorname{logit}(\mu) = \log \left (\frac{\mu}{1 - \mu} \right),$

where

$0 < \mu < 1,$

it may or may not be clear that the results of our logistic regression models have a nonlinear relation with the actual parameter of interest, $$\mu$$, which, recall, is the probability our criterion variable is 1 (e.g., male == 1). To get a sense of that nonlinear relation, we might make a plot.

tibble(mu = seq(from = 0, to = 1, length.out = 300)) %>%
mutate(logit_mu = log(mu / (1 - mu))) %>%

ggplot(aes(x = mu, y = logit_mu)) +
geom_line(color = pm[3], size = 1.5) +
labs(x = expression(mu~"(i.e., the probability space)"),
y = expression(logit~mu~"(i.e., the parameter space)")) +
theme(legend.position = "none")

So whereas our probability space is bound between 0 and 1, the parameter space shoots off into negative and positive infinity. Also,

$\operatorname{logit}(\mu) = \log \left (\frac{p(y = 1)}{p(y = 0)} \right ).$

Thus, “the ratio, $$p(y = 1) / p(y = 0)$$, is called the odds of outcome 1 to outcome 0, and therefore $$\operatorname{logit}(\mu)$$ is the log odds of outcome 1 to outcome 0” (p. 630).

Here’s a table layout of the height/weight examples in the middle of page 630.

tibble(b0     = -50,
b1     = .02,
b2     = .7,
weight = 160,
inches = c(63:64, 67:68)) %>%
mutate(logit_mu = b0 + b1 * weight + b2 * inches) %>%
mutate(log_odds = logit_mu) %>%
mutate(p_male = 1 / (1 + exp(-log_odds))) %>%
knitr::kable()
b0 b1 b2 weight inches logit_mu log_odds p_male
-50 0.02 0.7 160 63 -2.7 -2.7 0.0629734
-50 0.02 0.7 160 64 -2.0 -2.0 0.1192029
-50 0.02 0.7 160 67 0.1 0.1 0.5249792
-50 0.02 0.7 160 68 0.8 0.8 0.6899745

Thus, a regression coefficient in logistic regression indicates how much a $$1$$ unit change of the predictor increases the log odds of outcome $$1$$. A regression coefficient of $$0.5$$ corresponds to a rate of probability change of about $$12.5$$ percentage points per $$x$$-unit at the threshold $$x$$ value. A regression coefficient of $$1.0$$ corresponds to a rate of probability change of about $$24.4$$ percentage points per $$x$$-unit at the threshold $$x$$ value. When $$x$$ is much larger or smaller than the threshold $$x$$ value, the rate of change in probability is smaller, even though the rate of change in log odds is constant. (pp. 630–631)

### 21.2.2 When there are few 1’s or 0’s in the data.

In logistic regression, you can think of the parameters as describing the boundary between the $$0$$’s and the $$1$$’s. If there are many $$0$$’s and $$1$$’s, then the estimate of the boundary parameters can be fairly accurate. But if there are few $$0$$’s or few $$1$$’s, the boundary can be difficult to identify very accurately, even if there are many data points overall. (p. 631)

As far as I can tell, Kruschke must have used $$n = 500$$ to simulate the data he displayed in Figure 21.5. Using the coefficient values he displayed in the middle of page 631, here’s an attempt at replicating them.

b0 <- -3
b1 <- 1

n <- 500

set.seed(21)
d_rare <-
tibble(x = rnorm(n, mean = 0, sd = 1)) %>%
mutate(mu = b0 + b1 * x) %>%
mutate(y = rbinom(n, size = 1, prob = 1 / (1 + exp(-mu))))

glimpse(d_rare)
## Rows: 500
## Columns: 3
## $x <dbl> 0.793013171, 0.522251264, 1.746222241, -1.271336123, 2.197389533, 0.433130777, -1.57019… ##$ mu <dbl> -2.2069868, -2.4777487, -1.2537778, -4.2713361, -0.8026105, -2.5668692, -4.5701996, -3.…
## $y <int> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, … We’re ready to fit the model. So far, we’ve been following along with Kruschke by using the Bernoulli distribution (i.e., family = bernoulli) in our brms models. Let’s get frisky and use the $$n = 1$$ binomial distribution, here. You’ll see it yields the same results. fit21.3 <- brm(data = d_rare, family = binomial, y | trials(1) ~ 1 + x, prior = c(prior(normal(0, 2), class = Intercept), prior(normal(0, 2), class = b)), iter = 2500, warmup = 500, chains = 4, cores = 4, seed = 21, file = "fits/fit21.03") Recall that when you use the binomial distribution in newer versions of brms, you need to use the trials() syntax to tell brm() how many trials each row in the data corresponds to. Anyway, behold the summary. print(fit21.3) ## Family: binomial ## Links: mu = logit ## Formula: y | trials(1) ~ 1 + x ## Data: d_rare (Number of observations: 500) ## Draws: 4 chains, each with iter = 2500; warmup = 500; thin = 1; ## total post-warmup draws = 8000 ## ## Population-Level Effects: ## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS ## Intercept -3.01 0.24 -3.49 -2.56 1.00 2564 3145 ## x 1.03 0.20 0.65 1.43 1.00 2532 3897 ## ## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS ## and Tail_ESS are effective sample size measures, and Rhat is the potential ## scale reduction factor on split chains (at convergence, Rhat = 1). Looks like the model did a good job recapturing those data-generating b0 and b1 values. Now make the top left panel of Figure 21.5. draws <- as_draws_df(fit21.3) # unclear if Kruschke still used 20 draws or not # perhaps play with the n_draw values n_draw <- 20 length <- 100 set.seed(21) draws %>% # take 20 random samples of the posterior draws slice_sample(n = n_draw) %>% # add in a sequence of x expand(nesting(.draw, b_Intercept, b_x), x = seq(from = -3.5, to = 3.5, length.out = length)) %>% # compute the estimates of interest mutate(y = inv_logit_scaled(b_Intercept + b_x * x), thresh = -b_Intercept / b_x) %>% # plot! ggplot(aes(x = x)) + geom_hline(yintercept = .5, color = pm[7], size = 1/2) + geom_vline(aes(xintercept = thresh, group = .draw), color = pm[6], size = 2/5, linetype = 2) + geom_line(aes(y = y, group = .draw), color = pm[1], size = 1/3, alpha = 2/3) + geom_point(data = d_rare, aes(y = y), alpha = 1/3, color = pm[1]) + ggtitle("Data with Post. Pred.") + coord_cartesian(xlim = range(d_rare$x))

Here are the two subplots at the bottom, left.

draws %>%
mutate(Intercept = b_Intercept,
x         = b_x) %>%
pivot_longer(Intercept:x) %>%

ggplot(aes(x = value, y = 0)) +
stat_wilke() +
scale_y_continuous(NULL, breaks = NULL) +
xlab(NULL) +
facet_wrap(~ name, scales = "free", ncol = 2)

Since our data were simulated without the benefit of knowing how Kruschke set his seed and such, our results will only approximate those in the text.

Okay, now we need to simulate the complimentary data, those for which $$y = 1$$ is a less-rare event.

b0 <- 0
b1 <- 1

n <- 500

set.seed(21)
d_not_rare <-
tibble(x = rnorm(n, mean = 0, sd = 1)) %>%
mutate(mu = b0 + b1 * x) %>%
mutate(y = rbinom(n, size = 1, prob = 1 / (1 + exp(-mu))))

glimpse(d_not_rare)
## Rows: 500
## Columns: 3
## $x <dbl> 0.793013171, 0.522251264, 1.746222241, -1.271336123, 2.197389533, 0.433130777, -1.57019… ##$ mu <dbl> 0.793013171, 0.522251264, 1.746222241, -1.271336123, 2.197389533, 0.433130777, -1.57019…
ylim = range(nd$x2)) Instead of a simple threshold line, we get to visualize our interaction as a checkerboard-like probability plane. If you look back to Figure 21.7, you’ll see this is our 2D version of Kruschke’s wireframe plot on the top right. ## 21.3 Robust logistic regression With the robust logistic regression approach, we will describe the data as being a mixture of two different sources. One source is the logistic function of the predictor(s). The other source is sheer randomness or “guessing,” whereby the $$y$$ value comes from the flip of a fair coin: $$y \sim \operatorname{Bernoulli}(\mu = 1/2)$$. We suppose that every data point has a small chance, $$\alpha$$, of being generated by the guessing process, but usually, with probability $$1 - \alpha$$, the $$y$$ value comes from the logistic function of the predictor. With the two sources combined, the predicted probability that $$y = 1$$ is $\mu = \alpha \cdot \frac{1}{2} + (1 - \alpha) \cdot \operatorname{logistic} \bigg ( \beta_0 + \sum_j \beta_j x_j \bigg )$ Notice that when the guessing coefficient is zero, then the conventional logistic model is completely recovered. When the guessing coefficient is one, then the y values are completely random. (p. 635) Here’s what Kruschke’s $$\operatorname{Beta}(1, 9)$$ prior for $$\alpha$$ looks like. tibble(x = seq(from = 0, to = 1, length.out = 200)) %>% ggplot(aes(x = x, y = dbeta(x, 1, 9))) + geom_area(fill = pm[4]) + scale_x_continuous(NULL, breaks = 0:5 / 5, expand = c(0, 0)) + scale_y_continuous(NULL, breaks = NULL, expand = expansion(mult = c(0, 0.05))) + ggtitle(expression("Beta"*(1*", "*9))) To fit the brms analogue to Kruschke’s robust logistic regression model, we’ll need to adopt what Bürkner calls the non-linear syntax, which you can learn about in detail with his (2022d) vignette, Estimating non-linear models with brms. fit21.7 <- brm(data = my_data, family = bernoulli(link = identity), bf(male ~ a * .5 + (1 - a) * 1 / (1 + exp(-1 * (b0 + b1 * weight_z))), a + b0 + b1 ~ 1, nl = TRUE), prior = c(prior(normal(0, 2), nlpar = "b0"), prior(normal(0, 2), nlpar = "b1"), prior(beta(1, 9), nlpar = "a", lb = 0, ub = 1)), iter = 2500, warmup = 500, chains = 4, cores = 4, seed = 21, file = "fits/fit21.07") And just for clarity, you can do the same thing with family = binomial(link = identity), too. Just don’t forget to specify the number of trials with trials(). But to explain what’s going on with our syntax, above, I think it’s best to quote Bürkner at length: When looking at the above code, the first thing that becomes obvious is that we changed the formula syntax to display the non-linear formula including predictors (i.e., [weight_z]) and parameters (i.e., [a, b0, and b1]) wrapped in a call to [the bf() function]. This stands in contrast to classical R formulas, where only predictors are given and parameters are implicit. The argument [a + b0 + b1 ~ 1] serves two purposes. First, it provides information, which variables in formula are parameters, and second, it specifies the linear predictor terms for each parameter. In fact, we should think of non-linear parameters as placeholders for linear predictor terms rather than as parameters themselves (see also the following examples). In the present case, we have no further variables to predict [a, b0, and b1] and thus we just fit intercepts that represent our estimates of [$$\alpha$$, $$\beta_0$$, and $$\beta_1$$]. The formula [a + b0 + b1 ~ 1] is a short form of [a ~ 1, b0 ~ 1, b1 ~ 1] that can be used if multiple non-linear parameters share the same formula. Setting nl = TRUE tells brms that the formula should be treated as non-linear. In contrast to generalized linear models, priors on population-level parameters (i.e., ‘fixed effects’) are often mandatory to identify a non-linear model. Thus, brms requires the user to explicitely specify these priors. In the present example, we used a [beta(1, 9) prior on (the population-level intercept of) a, while we used a normal(0, 4) prior on both (population-level intercepts of) b0 and b1]. Setting priors is a non-trivial task in all kinds of models, especially in non-linear models, so you should always invest some time to think of appropriate priors. Quite often, you may be forced to change your priors after fitting a non-linear model for the first time, when you observe different MCMC chains converging to different posterior regions. This is a clear sign of an idenfication problem and one solution is to set stronger (i.e., more narrow) priors. (emphasis in the original) Now, behold the summary. print(fit21.7) ## Family: bernoulli ## Links: mu = identity ## Formula: male ~ a * 0.5 + (1 - a) * 1/(1 + exp(-1 * (b0 + b1 * weight_z))) ## a ~ 1 ## b0 ~ 1 ## b1 ~ 1 ## Data: my_data (Number of observations: 110) ## Draws: 4 chains, each with iter = 2500; warmup = 500; thin = 1; ## total post-warmup draws = 8000 ## ## Population-Level Effects: ## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS ## a_Intercept 0.20 0.09 0.04 0.39 1.00 3184 2459 ## b0_Intercept 0.34 0.44 -0.40 1.29 1.00 3676 3642 ## b1_Intercept 2.61 0.84 1.18 4.52 1.00 2916 3869 ## ## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS ## and Tail_ESS are effective sample size measures, and Rhat is the potential ## scale reduction factor on split chains (at convergence, Rhat = 1). Here’s a quick and dirty look at the conditional effects for weight_z. conditional_effects(fit21.7) %>% plot(points = T, line_args = list(color = pm[2], fill = pm[6]), point_args = list(color = pm[1], alpha = 3/4)) The way we prep for our version of Figure 21.8 is a minor extension what we did for Figure 21.3, above. Here we make the top panel. The biggest change, from before, is our adjusted formula for computing male. draws <- as_draws_df(fit21.7) p1 <- draws %>% slice_sample(n = n_draw) %>% expand(nesting(.draw, b_a_Intercept, b_b0_Intercept, b_b1_Intercept), weight_z = seq(from = -2, to = 3.5, length.out = length)) %>% # compare this to Kruschke's mu[i] code at the top of page 636 mutate(male = 0.5 * b_a_Intercept + (1 - b_a_Intercept) * inv_logit_scaled(b_b0_Intercept + b_b1_Intercept * weight_z), weight = weight_z * sd(my_data$weight) + mean(my_data$weight), thresh = -b_b0_Intercept / b_b1_Intercept * sd(my_data$weight) + mean(my_data$weight)) %>% ggplot(aes(x = weight, y = male)) + geom_hline(yintercept = .5, color = pm[7], size = 1/2) + geom_vline(aes(xintercept = thresh, group = .draw), color = pm[6], size = 2/5, linetype = 2) + geom_line(aes(group = .draw), color = pm[1], size = 1/3, alpha = 2/3) + geom_point(data = my_data, alpha = 1/3, color = pm[1]) + labs(title = "Data with Post. Pred.", y = "male") + coord_cartesian(xlim = range(my_data$weight))

Here we make the marginal-distribution plots for our versions of the lower panels of Figure 21.8.

p2 <-
draws %>%
mutate(Intercept = b_b0_Intercept - (b_b1_Intercept * mean(my_data$weight) / sd(my_data$weight)),
weight    = b_b1_Intercept / sd(my_dataweight), guessing = b_a_Intercept) %>% pivot_longer(Intercept:guessing) %>% mutate(name = factor(name, levels = c("Intercept", "weight", "guessing"))) %>% ggplot(aes(x = value, y = 0)) + stat_wilke() + scale_y_continuous(NULL, breaks = NULL) + xlab(NULL) + facet_wrap(~ name, scales = "free", ncol = 3) Now combine them with patchwork and behold the glory. p1 / p2 + plot_layout(height = c(4, 1)) Here are the ggpairs() plots. as_draws_df(fit21.7) %>% transmute(Intercept~(beta[0]) = b_b0_Intercept, weight~(beta[1]) = b_b1_Intercept, guessing = b_a_Intercept) %>% ggpairs(upper = list(continuous = my_upper), diag = list(continuous = my_diag), lower = list(continuous = my_lower), labeller = label_parsed) Look at how that sweet Stan-based HMC reduced the correlation between $$\beta_0$$ and $$\beta_1$$. For more on this approach to robust logistic regression in brms and an alternative suggested by Andrew Gelman, check out a thread from the Stan Forums, Translating robust logistic regression from rstan to brms. Yet do note that subsequent work suggests Gelman’s alternative approach is not as robust as originally thought (see here). ## 21.4 Nominal predictors “We now turn our attention from metric predictors to nominal predictors” (p. 636). ### 21.4.1 Single group. If we have just a single group and no other predictors, that’s just an intercept-only model. Back in the earlier chapters we thought of such a model as \begin{align*} y & \sim \operatorname{Bernoulli}(\theta) \\ \theta & \sim \operatorname{Beta}(a, b). \end{align*} Now we’re expressing the model as \begin{align*} y & \sim \operatorname{Bernoulli}(\mu) \\ \mu & \sim \operatorname{logistic}(\beta_0). \end{align*} For that $$\beta_0$$, we typically use a Gaussian prior of the form $\beta_0 \sim \operatorname{Normal}(M_0, S_0).$ To further explore what this means, we might make the model diagram in Figure 21.10. p1 <- tibble(x = seq(from = -3, to = 3, by = .1)) %>% ggplot(aes(x = x, y = (dnorm(x)) / max(dnorm(x)))) + geom_area(fill = pm[5]) + annotate(geom = "text", x = 0, y = .2, label = "normal", size = 7, color = pm[1]) + annotate(geom = "text", x = c(0, 1.5), y = .6, label = c("italic(M)[0]", "italic(S)[0]"), size = 7, color = pm[1], hjust = 0, family = "Times", parse = T) + scale_x_continuous(expand = c(0, 0)) + theme_void() + theme(axis.line.x = element_line(size = 0.5), plot.background = element_rect(fill = pm[8], color = pm[8])) # an annotated arrow p2 <- tibble(x = .5, y = 1, xend = .5, yend = 0) %>% ggplot(aes(x = x, xend = xend, y = y, yend = yend)) + geom_segment(arrow = my_arrow, color = pm[1]) + annotate(geom = "text", x = .4, y = .4, label = "'~'", size = 10, color = pm[1], family = "Times", parse = T) + xlim(0, 1) + theme_void() # likelihood formula p3 <- tibble(x = .5, y = .5, label = "logistic(beta[0])") %>% ggplot(aes(x = x, y = y, label = label)) + geom_text(size = 7, color = pm[1], parse = T, family = "Times") + scale_x_continuous(expand = c(0, 0), limits = c(0, 1)) + ylim(0, 1) + theme_void() # a second annotated arrow p4 <- tibble(x = .375, y = 1/2, label = "'='") %>% ggplot(aes(x = x, y = y, label = label)) + geom_text(size = 10, color = pm[1], parse = T, family = "Times") + geom_segment(x = .5, xend = .5, y = 1, yend = 0, arrow = my_arrow, color = pm[1]) + xlim(0, 1) + theme_void() # bar plot of Bernoulli data p5 <- tibble(x = 0:1, d = (dbinom(x, size = 1, prob = .6)) / max(dbinom(x, size = 1, prob = .6))) %>% ggplot(aes(x = x, y = d)) + geom_col(fill = pm[5], width = .4) + annotate(geom = "text", x = .5, y = .2, label = "Bernoulli", size = 7, color = pm[1]) + annotate(geom = "text", x = .5, y = .94, label = "mu", size = 7, color = pm[1], family = "Times", parse = T) + xlim(-.75, 1.75) + theme_void() + theme(axis.line.x = element_line(size = 0.5), plot.background = element_rect(fill = pm[8], color = pm[8])) # another annotated arrow p6 <- tibble(x = c(.375, .625), y = c(1/3, 1/3), label = c("'~'", "italic(i)")) %>% ggplot(aes(x = x, y = y, label = label)) + geom_text(size = c(10, 7), color = pm[1], parse = T, family = "Times") + geom_segment(x = .5, xend = .5, y = 1, yend = 0, arrow = my_arrow, color = pm[1]) + xlim(0, 1) + theme_void() # some text p7 <- tibble(x = 1, y = .5, label = "italic(y[i])") %>% ggplot(aes(x = x, y = y, label = label)) + geom_text(size = 7, color = pm[1], parse = T, family = "Times") + xlim(0, 2) + theme_void() # define the layout layout <- c( area(t = 1, b = 2, l = 2, r = 7), area(t = 3, b = 3, l = 2, r = 7), area(t = 4, b = 4, l = 1, r = 6), area(t = 5, b = 5, l = 1, r = 6), area(t = 6, b = 7, l = 1, r = 6), area(t = 8, b = 8, l = 1, r = 6), area(t = 9, b = 9, l = 1, r = 6) ) # combine and plot! (p1 + p2 + p3 + p4 + p5 + p6 + p7) + plot_layout(design = layout) & ylim(0, 1) & theme(plot.margin = margin(0, 5.5, 0, 5.5)) In a situation where we don’t have strong prior substantive knowledge, we often set $$M_0 = 0$$, which puts the probability mass around $$\theta = .5$$, a reasonable default hypothesis. Often times $$S_0$$ is some modest single-digit integer like 2 or 4. To get a sense of how different Gaussians translate to the beta distribution, we’ll recreate Figure 21.11. # this will help streamline the conversion logistic <- function(x) { 1 / (1 + exp(-x)) } # wrangle crossing(m_0 = 0:1, s_0 = c(.5, 1, 2)) %>% mutate(key = str_c("mu == logistic(beta %~%", " N(", m_0, ", ", s_0, "))"), sim = pmap(list(2e6, m_0, s_0), rnorm)) %>% unnest(sim) %>% mutate(sim = logistic(sim)) %>% # plot ggplot(aes(x = sim, y = ..density..)) + geom_histogram(color = pm[8], fill = pm[7], size = 1/3, bins = 20, boundary = 0) + geom_line(stat = "density", size = 1, color = pm[3]) + scale_y_continuous(NULL, breaks = NULL) + xlab(expression(mu)) + facet_wrap(~ key, scales = "free_y", labeller = label_parsed) So the prior distribution doesn’t even flatten out until you’re somewhere between $$S_0 = 1$$ and $$S_0 = 2$$. Just for kicks, here we break that down a little further. # wrangle tibble(m_0 = 0, s_0 = c(1.25, 1.5, 1.75)) %>% mutate(key = str_c("mu == logistic(beta %~%", " N(", m_0, ", ", s_0, "))"), sim = pmap(list(1e7, m_0, s_0), rnorm)) %>% unnest(sim) %>% mutate(sim = logistic(sim)) %>% # plot ggplot(aes(x = sim, y = ..density..)) + geom_histogram(color = pm[8], fill = pm[7], size = 1/3, bins = 20, boundary = 0) + geom_line(stat = "density", size = 1, color = pm[3]) + scale_y_continuous(NULL, breaks = NULL) + xlab(expression(mu)) + facet_wrap(~ key, scales = "free_y", labeller = label_parsed, ncol = 3) Here’s the basic brms analogue to Kruschke’s JAGS code from the bottom of page 639. fit <- brm(data = my_data, family = bernoulli(link = identity), y ~ 1, prior(beta(1, 1), class = Intercept)) Here’s the brms analogue to Kruschke’s JAGS code at the top of page 641. fit <- brm(data = my_data, family = bernoulli, y ~ 1, prior(normal(0, 2), class = "Intercept")) ### 21.4.2 Multiple groups. If there’s only one group, we don’t need a grouping variable. But that’s a special case. Now we show the more general approach with multiple groups. #### 21.4.2.1 Example: Baseball again. Load the baseball data. my_data <- read_csv("data.R/BattingAverage.csv") glimpse(my_data) ## Rows: 948 ## Columns: 6 ## Player       <chr> "Fernando Abad", "Bobby Abreu", "Tony Abreu", "Dustin Ackley", "Matt Adams", …
## $PriPos <chr> "Pitcher", "Left Field", "2nd Base", "2nd Base", "1st Base", "Pitcher", "Pitc… ##$ Hits         <dbl> 1, 53, 18, 137, 21, 0, 0, 2, 150, 167, 0, 128, 66, 3, 1, 81, 180, 36, 150, 0,…
## $AtBats <dbl> 7, 219, 70, 607, 86, 1, 1, 20, 549, 576, 1, 525, 275, 12, 8, 384, 629, 158, 5… ##$ PlayerNumber <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22…
## \$ PriPosNumber <dbl> 1, 7, 4, 4, 3, 1, 1, 3, 3, 4, 1, 5, 4, 2, 7, 4, 6, 8, 9, 1, 2, 5, 1, 1, 7, 2,…

#### 21.4.2.2 The model.

I’m not aware that Kruschke’s modeling approach for this example will work well within the brms paradigm. Accordingly, our version of the hierarchical model diagram in Figure 21.12 will differ in important ways from the one in the text.

# bracket
p1 <-
tibble(x = .4,
y = .25,
label = "{_}") %>%

ggplot(aes(x = x, y = y, label = label)) +
geom_text(size = 10, color = pm[3], family = "Times") +
scale_x_continuous(expand = c(0, 0), limits = c(0, 1)) +
ylim(0, 1) +
theme_void()

# normal density
p2 <-
tibble(x = seq(from = -3, to = 3, by = .1)) %>%
ggplot(aes(x = x, y = (dnorm(x)) / max(dnorm(x)))) +
geom_area(fill = pm[5]) +
annotate(geom = "text",
x = 0, y = .2,
label = "normal",
size = 7, color = pm[1]) +
annotate(geom = "text",
x = c(0, 1.5), y = .6,
label = c("italic(M)[0]", "italic(S)[0]"),
size = 7, color = pm[1], hjust = 0, family = "Times", parse = T) +
scale_x_continuous(expand = c(0, 0)) +
theme_void() +
theme(axis.line.x = element_line(size = 0.5),
plot.background = element_rect(fill = pm[8], color = "white", size = 1))

# second normal density
p3 <-
tibble(x = seq(from = -3, to = 3, by = .1)) %>%
ggplot(aes(x = x, y = (dnorm(x)) / max(dnorm(x)))) +
geom_area(fill = pm[5]) +
annotate(geom = "text",
x = 0, y = .2,
label = "normal",
size = 7, color = pm[1]) +
annotate(geom = "text",
x = c(0, 1.5), y = .6,
label = c("0", "sigma[beta]"),
size = 7, color = pm[1], hjust = 0, family = "Times", parse = T) +
scale_x_continuous(expand = c(0, 0)) +
theme_void() +
theme(axis.line.x = element_line(size = 0.5),
plot.background = element_rect(fill = pm[8], color = "white", size = 1))

#  plain arrow
p4 <-
tibble(x    = .4,
y    = 1,
xend = .4,
yend = .25) %>%

ggplot(aes(x = x, xend = xend,
y = y, yend = yend)) +
geom_segment(arrow = my_arrow, color = pm[1]) +
xlim(0, 1) +
theme_void()

# likelihood formula
p5 <-
tibble(x = .5,
y = .3,
label = "logistic(beta[0]+sum()[italic(j)]*beta['['*italic(j)*']']*italic(x)['['*italic(j)*']'](italic(i)))") %>%

ggplot(aes(x = x, y = y, label = label)) +
geom_text(size = 7, color = pm[1], parse = T, family = "Times") +
scale_x_continuous(expand = c(0, 0), limits = c(0, 1)) +
ylim(0, 1) +
theme_void()

# two annotated arrows
p6 <-
tibble(x    = c(.2, .8),
y    = 1,
xend = c(.37, .63),
yend = .1) %>%

ggplot(aes(x = x, xend = xend,
y = y, yend = yend)) +
geom_segment(arrow = my_arrow, color = pm[1]) +
annotate(geom = "text",
x = c(.22, .66, .77), y = .55,
label = c("'~'", "'~'", "italic(j)"),
size = c(10, 10, 7), color = pm[1], family = "Times", parse = T) +
xlim(0, 1) +
theme_void()

# binomial density
p7 <-
tibble(x = 0:7) %>%
ggplot(aes(x = x,
y = (dbinom(x, size = 7, prob = .625)) / max(dbinom(x, size = 7, prob = .625)))) +
geom_col(fill = pm[5], width = .4) +
annotate(geom = "text",
x = 3.5, y = .2,
label = "binomial",
size = 7) +
annotate(geom = "text",
x = 7, y = .85,
label = "italic(N)[italic(i)*'|'*italic(j)]",
size = 7, family = "Times", parse = TRUE) +
coord_cartesian(xlim = c(-1, 8),
ylim = c(0, 1.2)) +
theme_void() +
theme(axis.line.x = element_line(size = 0.5),
plot.background = element_rect(fill = pm[8], color = pm[8]))

# one annotated arrow
p8 <-
tibble(x     = c(.375, .5),
y     = c(.75, .3),
label = c("'='", "mu[italic(i)*'|'*italic(j)]")) %>%

ggplot(aes(x = x, y = y, label = label)) +
geom_text(size = c(10, 7), color = pm[1], parse = T, family = "Times") +
geom_segment(x = .5, xend = .5,
y = 1, yend = .425,
arrow = my_arrow, color = pm[1]) +
xlim(0, 1) +
theme_void()

# another annotated arrow
p9 <-
tibble(x     = c(.375, .625),
y     = c(1/3, 1/3),
label = c("'~'", "italic(i)*'|'*italic(j)")) %>%

ggplot(aes(x = x, y = y, label = label)) +
geom_text(size = c(10, 7), color = pm[1], parse = T, family = "Times") +
geom_segment(x = .5, xend = .5,
y = 1, yend = 0,
arrow = my_arrow, color = pm[1]) +
xlim(0, 1) +
theme_void()

# some text
p10 <-
tibble(x     = 1,
y     = .5,
label = "italic(y[i])['|'][italic(j)]") %>%

ggplot(aes(x = x, y = y, label = label)) +
geom_text(size = 7, color = pm[1], parse = T, family = "Times") +
xlim(0, 2) +
theme_void()

# define the layout
layout <- c(
area(t = 1, b = 2, l = 6, r = 8),
area(t = 4, b = 5, l = 1, r = 3),
area(t = 4, b = 5, l = 5, r = 7),
area(t = 3, b = 4, l = 6, r = 8),
area(t = 7, b = 8, l = 1, r = 7),
area(t = 6, b = 7, l = 1, r = 7),
area(t = 11, b = 12, l = 3, r = 5),
area(t = 9, b = 11, l = 3, r = 5),
area(t = 13, b = 14, l = 3, r = 5),
area(t = 15, b = 15, l = 3, r = 5)
)

# combine and plot!
(p1 + p2 + p3 + p4 + p5 + p6 + p7 + p8 + p9 + p10) +
plot_layout(design = layout) &
ylim(0, 1) &
theme(plot.margin = margin(0, 5.5, 0, 5.5))

I recommend we fit a hierarchical aggregated-binomial regression model in place of the one Kruschke used. With this approach, we might express the statistical model as

\begin{align*} \text{Hits}_i & \sim \operatorname{Binomial}(\text{AtBats}_i, p_i) \\ \operatorname{logit}(p_i) & = \beta_0 + \beta_{\text{PriPos}_i} + \beta_{\text{PriPos:Player}_i} \\ \beta_0 & \sim \operatorname{Normal}(0, 2) \\ \beta_\text{PriPos} & \sim \operatorname{Normal}(0, \sigma_\text{PriPos}) \\ \beta_\text{PriPos:Player} & \sim \operatorname{Normal}(0, \sigma_\text{PriPos:Player}) \\ \sigma_\text{PriPos} & \sim \operatorname{HalfCauchy}(0, 1) \\ \sigma_\text{PriPos:Player} & \sim \operatorname{HalfCauchy}(0, 1). \end{align*}

Here’s how to fit the model with brms. Notice we’re finally making good use of the trials() syntax. This is because we’re fitting an aggregated binomial model. Instead of our criterion variable Hits being a vector of 0’s and 1’s, it’s the number of successful trials given the total number of trials, which is listed in the AtBats vector. Aggregated binomial.

fit21.8 <-
brm(data = my_data,
Hits | trials(AtBats) ~ 1 + (1 | PriPos) + (1 | PriPos:Player),
prior = c(prior(normal(0, 2), class = Intercept),
prior(cauchy(0, 1), class = sd)),
iter = 2500, warmup = 500, chains = 4, cores = 4,
seed = 21,
file = "fits/fit21.08")

#### 21.4.2.3 Results.

Before we start plotting, review the model summary.

print(fit21.8)
##  Family: binomial
## Formula: Hits | trials(AtBats) ~ 1 + (1 | PriPos) + (1 | PriPos:Player)
##    Data: my_data (Number of observations: 948)
##   Draws: 4 chains, each with iter = 2500; warmup = 500; thin = 1;
##          total post-warmup draws = 8000
##
## Group-Level Effects:
## ~PriPos (Number of levels: 9)
##               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept)     0.32      0.10     0.19     0.57 1.00     2889     4640
##
## ~PriPos:Player (Number of levels: 948)
##               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept)     0.14      0.01     0.12     0.15 1.00     3535     5208
##
## Population-Level Effects:
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept    -1.17      0.11    -1.40    -0.94 1.00     1387     2427
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

If you look closely at our model versus the one in the text, you’ll see ours has fewer parameters. As a down-the-line consequence, our model doesn’t support a direct analogue to the plot at the top of Figure 21.13. However, we can come close. Rather than modeling the position-based probabilities as multiple draws of beta distributions, we can simply summarize our probabilities by their posterior distributions.

library(ggridges)

# define our new data, nd
nd <-
my_data %>%
group_by(PriPos) %>%
summarise(AtBats = mean(AtBats) %>% round(0)) %>%
# to help join with the draws
mutate(name = str_c("V", 1:n()))

# push the model through fitted() and wrangle
fitted(fit21.8,
newdata = nd,
re_formula = Hits | trials(AtBats) ~ 1 + (1 | PriPos),
scale = "linear",
summary = F) %>%
as_tibble() %>%
pivot_longer(everything()) %>%
mutate(probability = inv_logit_scaled(value)) %>%
left_join(nd, by = "name") %>%

# plot
ggplot(aes(x = probability, y = PriPos)) +
geom_vline(xintercept = fixef(fit21.8)[1] %>% inv_logit_scaled(), color = pm[6]) +
geom_density_ridges(color = pm[1], fill = pm[7], size = 1/2,
rel_min_height = 0.005, scale = .9) +
geom_jitter(data = my_data,
aes(x = Hits / AtBats),
height = .025, alpha = 1/6, size = 1/6, color = pm[1]) +
scale_x_continuous("Hits / AtBats", breaks = 0:5 / 5, expand = c(0, 0)) +
coord_cartesian(xlim = c(0, 1),
ylim = c(1, 9.5)) +
ggtitle("Data with Posterior Predictive Distrib.") +
theme(axis.text.y = element_text(hjust = 0),
axis.ticks.y = element_blank())

For kicks and giggles, we depicted the grand mean probability as the vertical line in the background with the geom_vline() line.

However, we can make our plot more directly analogous to Kruschke’s if we’re willing to stretch a little. Recall that Kruschke used the beta distribution with the $$\omega-\kappa$$ parameterization in both his statistical model and his plot code–both of which you can find detailed in his Jags-Ybinom-Xnom1fac-Mlogistic.R. file. We didn’t use the beta distribution in our brm() model and the parameters from that model didn’t have as direct correspondences to the beta distribution the way those from Kruschke’s JAGS model did. However, recall that we can re-parameterize the beta distribution in terms of its mean $$\mu$$ and sample size $$n$$, following the form

\begin{align*} \alpha & = \mu n \\ \beta & = (1 - \mu) n . \end{align*}

When we take the inverse logit of our intercepts, we do get vales in a probability metric. We might consider inserting those probabilities into the $$\mu$$ parameter. Furthermore, we can take our AtBats sample sizes and insert them directly into $$n$$. As before, we’ll use the average sample size per position.

# wrangle like a boss
nd %>%
ndraws = 20,
re_formula = Hits | trials(AtBats) ~ 1 + (1 | PriPos),
dpar = "mu") %>%
# use the equations from above
mutate(alpha = mu * AtBats,
beta  = (1 - mu) * AtBats) %>%
mutate(ll = qbeta(.025, shape1 = alpha, shape2 = beta),
ul = qbeta(.975, shape1 = alpha, shape2 = beta)) %>%
mutate(theta = map2(ll, ul, seq, length.out = 100)) %>%
unnest(theta) %>%
mutate(density = dbeta(theta, alpha, beta)) %>%
group_by(.draw) %>%
mutate(density = density / max(density)) %>%

# plot!
ggplot(aes(y = PriPos)) +
geom_ridgeline(aes(x = theta, height = -density, group = interaction(PriPos, .draw)),
fill = NA, color = adjustcolor(pm[1], alpha.f = 1/3),
size = 1/3, alpha = 2/3, min_height = NA) +
geom_jitter(data = my_data,
aes(x = Hits / AtBats, size = AtBats),
height = .05, alpha = 1/6, shape = 1, color = pm[1]) +
scale_size_continuous(range = c(1/4, 4)) +
scale_x_continuous("Hits / AtBats", expand = c(0, 0)) +
labs(title = "Data with Posterior Predictive Distrib.",
y = NULL) +
coord_flip(xlim = c(0, 1),
ylim = c(0.67, 8.67)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1),
axis.ticks.x = element_blank(),
legend.position = c(.956, .8))

⚠️ Since we didn’t actually presume the beta distribution anywhere in our brm() statistical model, I would not attempt to present this workflow in a scientific outlet. Go with the previous plot. This attempt seems dishonest. But it is kinda fun to see how far we can push our results. ⚠️

Happily, our contrasts will be less contentious. Here’s the initial wrangling.

# define our subset of positions
positions <- c("1st Base", "Catcher", "Pitcher")

# redefine nd
nd <-
my_data %>%
filter(PriPos %in% positions) %>%
group_by(PriPos) %>%
summarise(AtBats = mean(AtBats) %>% round(0))

# push the model through fitted() and wrangle
f <-
fitted(fit21.8,
newdata = nd,
re_formula = Hits | trials(AtBats) ~ 1 + (1 | PriPos),
scale = "linear",
summary = F) %>%
as_tibble() %>%
set_names(positions)

# what did we do?
head(f)
## # A tibble: 6 × 3
##   1st Base Catcher Pitcher
##        <dbl>   <dbl>   <dbl>
## 1      -1.11   -1.15   -1.85
## 2      -1.08   -1.16   -1.84
## 3      -1.11   -1.10   -1.88
## 4      -1.07   -1.15   -1.95
## 5      -1.05   -1.10   -1.85
## 6      -1.10   -1.13   -1.89

Here we make are our versions of the middle two panels of Figure 21.13.

p1 <-
f %>%
# compute the differences and put the data in the long format
mutate(Pitcher vs. Catcher  = Pitcher - Catcher,
Catcher vs. 1st Base = Catcher - 1st Base) %>%
pivot_longer(contains("vs.")) %>%
mutate(name = factor(name, levels = c("Pitcher vs. Catcher", "Catcher vs. 1st Base"))) %>%

ggplot(aes(x = value, y = 0)) +
geom_vline(xintercept = 0, color = pm[6]) +
stat_wilke() +
scale_y_continuous(NULL, breaks = NULL) +
xlab("Difference (in b)") +
facet_wrap(~ name, scales = "free", ncol = 2)

Now make our versions of the bottom two panels of Figure 21.13.

p2 <-
f %>%
# do the transformation before computing the differences
mutate_all(inv_logit_scaled) %>%
mutate(Pitcher vs. Catcher  = Pitcher - Catcher,
Catcher vs. 1st Base = Catcher - 1st Base) %>%
pivot_longer(contains("vs.")) %>%
mutate(name = factor(name, levels = c("Pitcher vs. Catcher", "Catcher vs. 1st Base"))) %>%

ggplot(aes(x = value, y = 0)) +
geom_vline(xintercept = 0, color = pm[6]) +
stat_wilke() +
scale_y_continuous(NULL, breaks = NULL) +
xlab("Difference (in probability)") +
facet_wrap(~ name, scales = "free", ncol = 2)

Combine and plot.

p1 / p2

Note how our distributions are described as differences in probability, rather than in $$\omega$$.

## Session info

sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
##
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
##  [1] ggridges_0.5.3  GGally_2.1.2    tidybayes_3.0.2 brms_2.17.0     Rcpp_1.0.8      ggExtra_0.10.0
##  [7] patchwork_1.1.1 forcats_0.5.1   stringr_1.4.0   dplyr_1.0.8     purrr_0.3.4     readr_2.0.1
## [13] tidyr_1.2.0     tibble_3.1.6    ggplot2_3.3.5   tidyverse_1.3.1 PNWColors_0.1.0
##
## loaded via a namespace (and not attached):
##   [1] readxl_1.3.1         backports_1.4.1      plyr_1.8.6           igraph_1.2.6
##   [5] svUnit_1.0.6         splines_4.1.2        crosstalk_1.1.1      TH.data_1.0-10
##   [9] rstantools_2.1.1     inline_0.3.19        digest_0.6.29        htmltools_0.5.2
##  [13] rsconnect_0.8.24     fansi_1.0.3          magrittr_2.0.3       checkmate_2.0.0
##  [17] tzdb_0.1.2           modelr_0.1.8         RcppParallel_5.1.4   matrixStats_0.61.0
##  [21] vroom_1.5.4          xts_0.12.1           sandwich_3.0-1       prettyunits_1.1.1
##  [25] colorspace_2.0-3     rvest_1.0.1          ggdist_3.1.1         haven_2.4.3
##  [29] xfun_0.30            callr_3.7.0          crayon_1.5.1         jsonlite_1.8.0
##  [33] lme4_1.1-27.1        survival_3.2-13      zoo_1.8-9            glue_1.6.2
##  [37] gtable_0.3.0         emmeans_1.7.1-1      distributional_0.3.0 pkgbuild_1.2.0
##  [41] rstan_2.21.5         abind_1.4-5          scales_1.1.1         mvtnorm_1.1-2
##  [45] emo_0.0.0.9000       DBI_1.1.1            miniUI_0.1.1.1       xtable_1.8-4
##  [49] HDInterval_0.2.2     bit_4.0.4            stats4_4.1.2         StanHeaders_2.21.0-7
##  [53] DT_0.19              htmlwidgets_1.5.3    httr_1.4.2           threejs_0.3.3
##  [57] RColorBrewer_1.1-3   arrayhelpers_1.1-0   posterior_1.1.0.9000 ellipsis_0.3.2
##  [61] reshape_0.8.8        pkgconfig_2.0.3      loo_2.5.1            farver_2.1.0
##  [65] sass_0.4.1           dbplyr_2.1.1         utf8_1.2.2           tidyselect_1.1.2
##  [69] labeling_0.4.2       rlang_1.0.2          reshape2_1.4.4       later_1.3.0
##  [73] munsell_0.5.0        cellranger_1.1.0     tools_4.1.2          cli_3.2.0
##  [77] generics_0.1.2       broom_0.7.12         evaluate_0.15        fastmap_1.1.0
##  [81] processx_3.5.2       knitr_1.38           bit64_4.0.5          fs_1.5.2
##  [85] nlme_3.1-153         projpred_2.0.2       mime_0.11            xml2_1.3.2
##  [89] compiler_4.1.2       bayesplot_1.9.0      shinythemes_1.2.0    rstudioapi_0.13
##  [93] gamm4_0.2-6          reprex_2.0.1         bslib_0.3.1          stringi_1.7.6
##  [97] highr_0.9            ps_1.6.0             Brobdingnag_1.2-6    lattice_0.20-45
## [101] Matrix_1.3-4         nloptr_1.2.2.2       markdown_1.1         shinyjs_2.0.0
## [105] tensorA_0.36.2       vctrs_0.4.0          pillar_1.7.0         lifecycle_1.0.1
## [109] jquerylib_0.1.4      bridgesampling_1.1-2 estimability_1.3     httpuv_1.6.2
## [113] R6_2.5.1             bookdown_0.26        promises_1.2.0.1     gridExtra_2.3
## [117] codetools_0.2-18     boot_1.3-28          MASS_7.3-54          colourpicker_1.1.0
## [121] gtools_3.9.2         assertthat_0.2.1     withr_2.5.0          shinystan_2.5.0
## [125] multcomp_1.4-17      mgcv_1.8-38          parallel_4.1.2       hms_1.1.0
## [129] grid_4.1.2           minqa_1.2.4          coda_0.19-4          rmarkdown_2.13
## [133] shiny_1.6.0          lubridate_1.7.10     base64enc_0.1-3      dygraphs_1.1.1.6

### References

Attali, D., & Baker, C. (2022). ggExtra: Add marginal histograms to ’ggplot2’, and more ’ggplot2’ enhancements. https://CRAN.R-project.org/package=ggExtra
Bürkner, P.-C. (2022d). Estimating non-linear models with brms. https://CRAN.R-project.org/package=brms/vignettes/brms_nonlinear.html
Kruschke, J. K. (2015). Doing Bayesian data analysis: A tutorial with R, JAGS, and Stan. Academic Press. https://sites.google.com/site/doingbayesiandataanalysis/
Lawlor, J. (2020). PNWColors: Color palettes inspired by nature in the US Pacific Northwest [Manual]. https://CRAN.R-project.org/package=PNWColors
Wilke, C. O. (2019). Fundamentals of data visualization. https://clauswilke.com/dataviz/