23 Ordinal Predicted Variable

This chapter considers data that have an ordinal predicted variable. For example, we might want to predict people’s happiness ratings on a 1-to-7 scale as a function of their total financial assets. Or we might want to predict ratings of movies as a function of the year they were made. One traditional treatment of this sort of data structure is called ordinal or ordered probit regression. We will consider a Bayesian approach to this model. As usual, in Bayesian software, it is easy to generalize the traditional model so it is robust to outliers, allows different variances within levels of a nominal predictor, or has hierarchical structure to share information across levels or factors as appropriate.

In the context of the generalized linear model (GLM) introduced in Chapter 15, this chapter’s situation involves an inverse-link function that is a thresholded cumulative normal with a categorical distribution for describing noise in the data, as indicated in the fourth row of Table 15.2 (p. 443). For a reminder of how this chapter’s combination of predicted and predictor variables relates to other combinations, see Table 15.3 (p. 444). (Kruschke, 2015, p. 671, emphasis in the original)

Load the primary packages.

library(tidyverse)
library(scico)
library(patchwork)
library(brms)
library(tidybayes)
library(GGally)
We might follow Kruschke’s advice and rewind a little before tackling this chapter. The cumulative normal function is denoted $\Phi(z)$, where $z$ is a standardized score. To get reacquainted, we’ll compute both the normal density $p(z)$ and the cumulative normal $\Phi(z)$ over a range of $z$ values.

d <- tibble(z = seq(from = -3, to = 3, by = 0.1)) |>
  # Add the density values
  mutate(`p(z)` = dnorm(z, mean = 0, sd = 1),
         # Add the CDF values
         `Phi(z)` = pnorm(z, mean = 0, sd = 1))

head(d)
# A tibble: 6 × 3
z `p(z)` `Phi(z)`
<dbl> <dbl> <dbl>
1 -3 0.00443 0.00135
2 -2.9 0.00595 0.00187
3 -2.8 0.00792 0.00256
4 -2.7 0.0104 0.00347
5 -2.6 0.0136 0.00466
6 -2.5 0.0175 0.00621
It’s time to talk color and theme. For this chapter, we’ll take our color palette from the scico package (Pedersen & Crameri, 2021), which provides 17 perceptually-uniform and colorblind-safe palettes based on the work of Fabio Crameri. Our palette of interest will be "lajolla".
sl <- scico(palette = "lajolla", n = 9, direction = -1)

scales::show_col(sl)
Our overall plot theme will be based on ggplot2::theme_linedraw()
with just a few adjustments.
theme_set(
theme_linedraw() +
theme(panel.grid = element_blank(),
strip.background = element_rect(color = sl[9], fill = sl[9]),
strip.text = element_text(color = sl[1]))
)
Now plot!
p1 <- d |>
  ggplot(aes(x = z, y = `p(z)`)) +
  geom_area(aes(fill = z <= 1),
            show.legend = FALSE) +
  geom_line(color = sl[3], linewidth = 1) +
  scale_x_continuous(NULL, breaks = NULL) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.05))) +
  scale_fill_manual(values = c("transparent", sl[2])) +
  ylab(expression(p(italic(z)))) +
  facet_grid("Normal Density" ~ .)

p2 <- d |>
  ggplot(aes(x = z, y = `Phi(z)`)) +
  geom_area(aes(fill = z <= 1),
            show.legend = FALSE) +
  geom_line(color = sl[3], linewidth = 1) +
  scale_x_continuous(breaks = -2:2) +
  scale_y_continuous(expand = expansion(mult = c(0, 0)), limits = 0:1) +
  scale_fill_manual(values = c("transparent", sl[2])) +
  ylab(expression(Phi(italic(z)))) +
  facet_grid("Cumulative Normal" ~ .)

# Combine, adjust, and display
p1 / p2 &
  coord_cartesian(xlim = c(-2.5, 2.5))
For both plots, z is in a standardized metric (i.e., a z-score, with $\mu = 0$ and $\sigma = 1$), though we could shift and rescale the curves with other values of $\mu$ and $\sigma$. The inverse of the cumulative normal, $\Phi^{-1}$, is called the probit function, and it will serve as our link function throughout this chapter.
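In R terms, pnorm() is $\Phi$ and qnorm() is its inverse, the probit, so we can confirm the one undoes the other.

# `pnorm()` is the cumulative normal; `qnorm()` is its inverse, the probit
pnorm(1)         # about 0.841
qnorm(pnorm(1))  # recovers 1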
23.1 Modeling ordinal data with an underlying metric variable
You can imagine that the distribution of ordinal values might not resemble a normal distribution, even though the underlying metric values are normally distributed. Figure 23.1 shows some examples of ordinal outcome probabilities generated from an underlying normal distribution. The horizontal axis is the underlying continuous metric value. Thresholds are plotted as vertical dashed lines, labeled $\theta_1$ through $\theta_6$. In all examples, the ordinal scale has 7 levels, and hence, there are 6 thresholds. The lowest threshold is set at $\theta_1 = 1.5$ (to separate outcomes 1 and 2), and the highest threshold is set at $\theta_6 = 6.5$ (to separate outcomes 6 and 7). The normal curve in each panel shows the distribution of underlying continuous values. What differs across panels are the settings of means, standard deviations, and remaining thresholds. (p. 672)
The four panels in Figure 23.1 require several steps. We’ll start with the top panel and build from there. Here is how we might make the values necessary for the density curve.
# Set the population Gaussian values
mu <- 4; sigma <- 1.5

# How many draws would you like?
n <- 1e5

# Simulate the primary data
set.seed(23)
d <- data.frame(x = rnorm(n = n, mean = mu, sd = sigma))

# What?
glimpse(d)
Rows: 100,000
Columns: 1
$ x <dbl> 4.289819, 3.347977, 5.369901, 6.690082, 5.494908, 5.661236, 3.582871…
These x values are the draws from the underlying continuous distribution, $\operatorname{Normal}(4, 1.5)$. Next we use the base-R cut() function to divide the continuous values into bins. We can define the cut-points for those bins (i.e., the thresholds) with the breaks argument. Here we define our breaks as a vector saved as thresholds_1, and take an initial look.
thresholds_1 <- 1:6 + 0.5

d |>
  mutate(y = cut(x, breaks = c(-Inf, thresholds_1, Inf))) |>
  head()
x y
1 4.289819 (3.5,4.5]
2 3.347977 (2.5,3.5]
3 5.369901 (4.5,5.5]
4 6.690082 (6.5, Inf]
5 5.494908 (4.5,5.5]
6 5.661236 (5.5,6.5]
By default, the cut()
function saved the bins with labels based on their upper and lower boundaries. To help define those boundaries, we nested the thresholds_1
values in between negative and positive infinity, which are the lower and upper bounds for the Gaussian range. However, it would be better if we labeled the bins with the integers 1
through 7
, as in the text, which we can do with the labels
argument. Though this will give the correct values, they will still be saved as a factor. As we will want these as integer values for some of the operations to come, we will convert those values with as.integer()
.
d <- d |>
  mutate(y = cut(x, breaks = c(-Inf, thresholds_1, Inf), labels = 1:7) |>
           as.integer())
head(d)
x y
1 4.289819 4
2 3.347977 3
3 5.369901 5
4 6.690082 7
5 5.494908 5
6 5.661236 6
Here we compute and save the sample mean and standard deviation for y
. Note how they match with the sample values displayed in the top panel of Figure 23.1.
(m <- mean(d$y) |> round(digits = 2))

[1] 4

(s <- sd(d$y) |> round(digits = 2))

[1] 1.47
As an intermediary step, we will want to compute and save the maximum value for the Gaussian likelihood function, given our mu
and sigma
values. For the Gaussian, the maximum value will be when x
is at the mean. For later reference, we save that value as ml
.
(ml <- dnorm(mu, mean = mu, sd = sigma))
[1] 0.2659615
Next we can use count()
and some follow-up mutate()
lines to determine the proportion and percentage of each of the seven possible values in the y
vector.
d <- d |>
  count(y) |>
  mutate(proportion = n / sum(n)) |>
  mutate(percent = (100 * proportion) |> round(digits = 0))

print(d)
y n proportion percent
1 1 4763 0.04763 5
2 2 10844 0.10844 11
3 3 21174 0.21174 21
4 4 26256 0.26256 26
5 5 21233 0.21233 21
6 6 10951 0.10951 11
7 7 4779 0.04779 5
Those proportion
and percent
values will be important for the bars displayed in the figure. When you look at the bars, you’ll see they are offset to the right of their thresholds. Thus we will save a vector of those desired x
values. In the code below, we add in those values, and further amend the y
columns to make the labels for the percentages and the y
values.
x_1 <- 1:7

d <- d |>
  mutate(label_p = str_c(percent, "%"),
         label_y = str_c("italic(y)==", y),
         density = ml * (proportion / max(proportion)),
         x = x_1)

print(d)
y n proportion percent label_p label_y density x
1 1 4763 0.04763 5 5% italic(y)==1 0.04824706 1
2 2 10844 0.10844 11 11% italic(y)==2 0.10984486 2
3 3 21174 0.21174 21 21% italic(y)==3 0.21448314 3
4 4 26256 0.26256 26 26% italic(y)==4 0.26596152 4
5 5 21233 0.21233 21 21% italic(y)==5 0.21508078 5
6 6 10951 0.10951 11 11% italic(y)==6 0.11092873 6
7 7 4779 0.04779 5 5% italic(y)==7 0.04840913 7
In this special case, the y
values are the same as the x
values; this will not be the case for some of the other panels. Also notice how we used the ml
value from above to scale the density
values for the bars. As a final preparatory step, we now define a supplemental data frame for the Gaussian density in the background.
# Set boundaries for `x`
xmin <- -0.5; xmax <- 8.5

# Make the data frame for the underlying normal
d_normal <- data.frame(x = seq(from = xmin, to = xmax, length.out = 101)) |>
  mutate(density = dnorm(x, mean = mu, sd = sigma))

glimpse(d_normal)
Rows: 101
Columns: 2
$ x <dbl> -0.50, -0.41, -0.32, -0.23, -0.14, -0.05, 0.04, 0.13, 0.22, 0.…
$ density <dbl> 0.002954566, 0.003530896, 0.004204484, 0.004988582, 0.00589763…
We can now make a version of the top panel like so.
d |>
  ggplot(aes(x = x, y = density)) +
geom_area(data = d_normal,
fill = sl[2]) +
geom_col(alpha = 0.85, fill = sl[5], width = 0.45,
color = alpha(sl[5], 0.85)) +
geom_text(aes(label = label_p),
size = 3.0, vjust = -2) +
geom_text(aes(label = label_y),
parse = TRUE, size = 3.0, vjust = -0.25) +
geom_vline(xintercept = thresholds_1, color = sl[7], linetype = 3) +
annotate(geom = "text",
x = xmax - 0.05, y = ml, label = str_c("mu==", mu, "*','*~sigma==", sigma),
hjust = 1, parse = TRUE, size = 3.2, vjust = -2.75) +
annotate(geom = "text",
x = xmax - 0.05, y = ml, label = str_c("italic(M[y])==", m, "*','*~italic(S[y])==", s),
hjust = 1, parse = TRUE, size = 3.2, vjust = -1) +
scale_x_continuous(NULL, breaks = thresholds_1, expand = c(0, 0), labels = thresholds_1,
sec.axis = dup_axis(labels = parse(text = str_c("theta[", 1:6, "]")))) +
scale_y_continuous(NULL, breaks = NULL, expand = expansion(mult = c(0, 0.41)))
Sadly, this kind of figure doesn’t lend itself well to a faceted plot. The main difficulty is that the thresholds (and thus their primary and secondary x-axis breaks) vary across panels, and I’m not aware of a simple way to achieve that with scale_x_continuous(). So we will instead wrap the code above into a custom function called plot_bar_den(), which we’ll use to make the four panels one at a time.
plot_bar_den <- function(mu, sigma, thresholds, x, digits = 1, n = 1e5, seed = 23) {
  
  # Simulate the primary data
  set.seed(seed)
  d <- data.frame(x = rnorm(n = n, mean = mu, sd = sigma)) |>
    mutate(y = cut(x, breaks = c(-Inf, thresholds, Inf), labels = 1:7) |>
             as.integer())
  
  # Compute sample statistics for `y`
  m <- mean(d$y) |> round(digits = digits)
  s <- sd(d$y) |> round(digits = digits)
  
  # Compute the maximum of the likelihood function
  ml <- dnorm(mu, mean = mu, sd = sigma)
  
  # Set boundaries for `x`
  xmin <- -0.5; xmax <- 8.5
  
  # Make the data frame for the underlying normal
  d_normal <- data.frame(x = seq(from = xmin, to = xmax, length.out = 101)) |>
    mutate(density = dnorm(x, mean = mu, sd = sigma))
  
  # Adjust the primary data
  d |>
    count(y) |>
    mutate(proportion = n / sum(n)) |>
    mutate(percent = (100 * proportion) |> round(digits = 0)) |>
    mutate(label_p = str_c(percent, "%"),
           label_y = str_c("italic(y)==", y),
           density = ml * (proportion / max(proportion)),
           x = x) |>
    
    # Plot
    ggplot(aes(x = x)) +
    geom_area(data = d_normal,
              aes(y = density),
              fill = sl[2]) +
    geom_col(aes(y = density),
             alpha = 0.85, fill = sl[5], width = 0.45,
             color = alpha(sl[5], 0.85)) +
    geom_text(aes(y = density, label = label_p),
              size = 3.0, vjust = -2) +
    geom_text(aes(y = density, label = label_y),
              parse = TRUE, size = 3.0, vjust = -0.25) +
    geom_vline(xintercept = thresholds, color = sl[7], linetype = 3) +
    annotate(geom = "text",
             x = xmax - 0.05, y = ml, label = str_c("mu==", mu, "*','*~sigma==", sigma),
             hjust = 1, parse = TRUE, size = 3.2, vjust = -2.75) +
    annotate(geom = "text",
             x = xmax - 0.05, y = ml, label = str_c("italic(M[y])==", m, "*','*~italic(S[y])==", s),
             hjust = 1, parse = TRUE, size = 3.2, vjust = -1) +
    scale_x_continuous(NULL, breaks = thresholds, expand = c(0, 0), labels = thresholds,
                       sec.axis = dup_axis(labels = parse(text = str_c("theta[", 1:6, "]")))) +
    scale_y_continuous(NULL, breaks = NULL, expand = expansion(mult = c(0, 0.41)))
}
Here’s how plot_bar_den()
works for the top panel.
plot_bar_den(mu = 4, sigma = 1.5, thresholds = thresholds_1, x = x_1)
The mu and sigma arguments take single numeric inputs for the desired Gaussian mean and standard deviation. The thresholds argument takes a vector of thresholds, the length of which is one less than the number of categories for y (i.e., six thresholds for the seven categories in all our use cases in this figure). The x argument takes a vector of x-axis positions for the bars. The remaining arguments digits, n, and seed all have sensible default values for our needs. Here’s how to use plot_bar_den() and patchwork syntax to make the full Figure 23.1.
# Save the threshold values for each panel
thresholds_1 <- 1:6 + 0.5
thresholds_2 <- thresholds_1
thresholds_3 <- c(1.5, 3.1, 3.7, 4.3, 4.9, 6.5)
thresholds_4 <- c(1.5, 2.25, 3, 5, 5.75, 6.5)

# Save the x-axis values for the bars in each panel
x_1 <- 1:7
x_2 <- x_1
x_3 <- c(1, 2.2, 3.4, 4, 4.6, 5.7, 7)
x_4 <- c(1, 1.875, 2.625, 4, 5.375, 6.125, 7)
# Combine and display
plot_bar_den(mu = 4, sigma = 1.5, thresholds = thresholds_1, x = x_1) /
plot_bar_den(mu = 1, sigma = 2.5, thresholds = thresholds_2, x = x_2) /
plot_bar_den(mu = 4, sigma = 1.0, thresholds = thresholds_3, x = x_3) /
plot_bar_den(mu = 4, sigma = 3.0, thresholds = thresholds_4, x = x_4)
Oh mamma. “The crucial concept in Figure 23.1 is that the probability of a particular ordinal outcome is the area under the normal curve between the thresholds of that outcome” (p. 672, emphasis in the original). In each of the subplots, we used six thresholds to discretize the continuous data into seven categories. More generally, we need $K - 1$ thresholds to discretize a continuous variable into $K$ ordered categories. Continuing with the text,

the idea is that we consider the cumulative area under the normal up to the high-side threshold, and subtract away the cumulative area under the normal up to the low-side threshold. Recall that the cumulative area under the standardized normal is denoted $\Phi(z)$, as was illustrated in Figure 15.8 [which we remade at the top of this chapter]. Thus, the area under the normal to the left of $\theta_k$ is $\Phi((\theta_k - \mu) / \sigma)$, and the area under the normal to the left of $\theta_{k-1}$ is $\Phi((\theta_{k-1} - \mu) / \sigma)$. Therefore, the area under the normal curve between the two thresholds, which is the probability of outcome $k$, is

$$p(y = k) = \Phi\!\left(\frac{\theta_k - \mu}{\sigma}\right) - \Phi\!\left(\frac{\theta_{k-1} - \mu}{\sigma}\right).$$

[This equation] applies even to the least and greatest ordinal values if we append two “virtual” thresholds at $-\infty$ and $+\infty$… Thus, a normally distributed underlying metric value can yield a clearly non-normal distribution of discrete ordinal values. This result does not imply that the ordinal values can be treated as if they were themselves metric and normally distributed; in fact it implies the opposite: We might be able to model a distribution of ordinal values as consecutive intervals of a normal distribution on an underlying metric scale with appropriately positioned thresholds. (pp. 674–675)
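To make the equation concrete, here is a minimal sketch applying it to the top-panel settings from above ($\mu = 4$, $\sigma = 1.5$, thresholds at 1.5 through 6.5). The probabilities should approximate the simulated proportions we computed earlier.

# Append the two "virtual" thresholds at -Inf and Inf
theta <- c(-Inf, 1:6 + 0.5, Inf)
mu    <- 4
sigma <- 1.5

# p(y = k) is the area between consecutive thresholds
p_k <- pnorm((theta[2:8] - mu) / sigma) - pnorm((theta[1:7] - mu) / sigma)
round(p_k, digits = 3)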
23.2 The case of a single group
Given a model with no predictors, if there are $K$ ordinal values, there are $K - 1$ thresholds, plus the mean $\mu$ and standard deviation $\sigma$ of the underlying metric variable. The data cannot identify all of those parameters at once, so two of them must be fixed. Kruschke’s convention is to fix the lowest and highest thresholds and estimate $\mu$, $\sigma$, and the interior thresholds. For example, all four subplots from Figure 23.1 had $\theta_1 = 1.5$ and $\theta_6 = 6.5$. As we’ll see, there are other ways to parameterize these models.
23.2.1 Implementation in JAGS brms.
The syntax to fit a basic ordered probit model with brms::brm()
is pretty simple.
fit <- brm(
  data = my_data,
  family = cumulative(probit),
  y ~ 1,
  prior(normal(0, 4), class = Intercept))
The family = cumulative(probit) argument tells brms you’d like to use the probit link for the ordered-categorical data. It’s important to specify probit because the brms default is to use the logit link, instead. We’ll talk more about that approach at the end of this chapter.
Remember how, at the end of the last section, we said there are other ways to parameterize the ordered probit model? As it turns out, brms does not follow Kruschke’s approach of fixing the thresholds on the ends. Rather, brms freely estimates all $K - 1$ thresholds and instead fixes the latent mean and standard deviation; brms::brm() uses the standard normal cumulative distribution function $\Phi$, for which $\mu = 0$ and $\sigma = 1$. This all probably seems abstract. We’ll get a lot of practice comparing the two approaches as we go along. Each has its strengths and weaknesses. At this point, the thing to get is that when fitting a single-group ordered-probit model with the brm() function, there will be no priors for $\mu$ or $\sigma$; the only priors are for the thresholds, for which prior(normal(0, 4), class = Intercept) might be a good starting place. Do feel free to experiment with different settings.
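If you’d like to see which parameters brms will assign priors to before fitting, the get_prior() function can help. As a sketch (assuming a data frame my_data with an ordinal variable y), the output should list one Intercept row per threshold, with no rows for a latent mean or standard deviation.

# Inspect the parameters and their default priors before fitting
get_prior(data = my_data,
          family = cumulative(probit),
          y ~ 1)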
23.2.2 Examples: Bayesian estimation recovers true parameter values.
The data for Kruschke’s first example come from his OrdinalProbitData-1grp-1.csv
file. Load the data.
my_data_1 <- read_csv("data.R/OrdinalProbitData-1grp-1.csv")
glimpse(my_data_1)
Rows: 100
Columns: 1
$ Y <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
Plot the distribution for Y
.
my_data_1 |>
  mutate(Y = factor(Y)) |>
  ggplot(aes(x = Y)) +
  geom_bar(fill = sl[5]) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.05)))
It looks a lot like the distribution of the data from one of the panels from Figure 23.1.
Now we fit the first cumulative-probit model.
fit23.1 <- brm(
  data = my_data_1,
  family = cumulative(probit),
  Y ~ 1,
  prior(normal(0, 4), class = Intercept),
  iter = 3000, warmup = 1000, chains = 4, cores = 4,
  seed = 23,
  file = "fits/fit23.01")
Examine the model summary.
print(fit23.1)
Family: cumulative
Links: mu = probit; disc = identity
Formula: Y ~ 1
Data: my_data_1 (Number of observations: 100)
Draws: 4 chains, each with iter = 3000; warmup = 1000; thin = 1;
total post-warmup draws = 8000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept[1] 0.18 0.13 -0.07 0.43 1.00 9765 6602
Intercept[2] 0.60 0.13 0.34 0.87 1.00 10580 6148
Intercept[3] 1.04 0.15 0.75 1.34 1.00 11813 6972
Intercept[4] 1.50 0.19 1.15 1.89 1.00 11818 7489
Intercept[5] 1.97 0.25 1.50 2.50 1.00 11836 6559
Intercept[6] 2.58 0.40 1.91 3.45 1.00 11454 7239
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
disc 1.00 0.00 1.00 1.00 NA NA NA
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
The brms output for these kinds of models names the thresholds Intercept[i]. Again, whereas Kruschke identified his model by fixing the first and last thresholds to 1.5 and 6.5 and estimating $\mu$ and $\sigma$, brms fixed $\mu = 0$ and $\sigma = 1$ and estimated all six thresholds.
Let’s extract the posterior draws.
draws <- as_draws_df(fit23.1)
glimpse(draws)
Rows: 8,000
Columns: 18
$ `b_Intercept[1]` <dbl> 0.22219600, 0.17617457, 0.12889934, 0.31154831, 0.200…
$ `b_Intercept[2]` <dbl> 0.6401313, 0.5768805, 0.4810705, 0.7504918, 0.6286168…
$ `b_Intercept[3]` <dbl> 1.0925275, 1.0247167, 0.8647831, 1.1303165, 1.0431510…
$ `b_Intercept[4]` <dbl> 1.538993, 1.768666, 1.274813, 1.712827, 1.458790, 1.4…
$ `b_Intercept[5]` <dbl> 1.707398, 2.093944, 2.053914, 1.958433, 1.802841, 1.8…
$ `b_Intercept[6]` <dbl> 2.581430, 2.581517, 2.395196, 2.801048, 2.387104, 2.2…
$ disc <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ `Intercept[1]` <dbl> 0.22219600, 0.17617457, 0.12889934, 0.31154831, 0.200…
$ `Intercept[2]` <dbl> 0.6401313, 0.5768805, 0.4810705, 0.7504918, 0.6286168…
$ `Intercept[3]` <dbl> 1.0925275, 1.0247167, 0.8647831, 1.1303165, 1.0431510…
$ `Intercept[4]` <dbl> 1.538993, 1.768666, 1.274813, 1.712827, 1.458790, 1.4…
$ `Intercept[5]` <dbl> 1.707398, 2.093944, 2.053914, 1.958433, 1.802841, 1.8…
$ `Intercept[6]` <dbl> 2.581430, 2.581517, 2.395196, 2.801048, 2.387104, 2.2…
$ lprior <dbl> -14.25641, -14.31861, -14.22442, -14.34868, -14.22515…
$ lp__ <dbl> -151.4283, -150.8879, -151.2538, -151.2717, -149.6003…
$ .chain <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ .iteration <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
$ .draw <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
Simplify the draws
object.
draws <- draws |>
  select(.draw, `b_Intercept[1]`:`b_Intercept[6]`)
Here’s our brms version of the bottom plot of Figure 23.2.
# Compute the posterior means for each threshold
means <- draws |>
  summarise_at(vars(`b_Intercept[1]`:`b_Intercept[6]`), mean) |>
  pivot_longer(cols = everything(), values_to = "mean")

# Further wrangle
draws |>
  pivot_longer(cols = -.draw, values_to = "threshold") |>
  group_by(.draw) |>
  mutate(theta_bar = mean(threshold)) |>
  
  # Plot
  ggplot(aes(x = threshold, y = theta_bar, color = name)) +
  geom_vline(data = means,
             aes(xintercept = mean, color = name),
             linetype = 2) +
  geom_point(alpha = 1/10) +
  scale_color_scico_d(palette = "lajolla", end = 0.85, direction = -1) +
  ylab("mean threshold") +
  theme(legend.position = "none")
The initial means data frame at the top contains the values we fed into geom_vline() for the dashed vertical reference lines. Did you see what we did there with those group_by() and mutate() lines? That’s how we computed the mean threshold within each step of the HMC chain, what Kruschke (p. 680) denoted as $\bar\theta$. Perhaps of greater interest, you might have noticed how different our plot is from the one in the text. We might compare the results of our brms parameterization with Kruschke’s parameterization by rescaling the thresholds within each posterior draw so the first and last take on the values 1.5 and 6.5. A handy tool for this is the scales::rescale() function, about which you might learn more here.
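In case rescale() is new, it linearly maps the range of a vector onto new limits. Here’s a toy example with made-up threshold-like values; the smallest maps to 1.5 and the largest to 6.5, with everything between stretched proportionally.

# Linearly rescale so the endpoints become 1.5 and 6.5
scales::rescale(c(0.2, 0.6, 1.0, 1.5, 2.0, 2.6), to = c(1.5, 6.5))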
# Primary data wrangling
p <- bind_rows(
  # brms parameterization
  draws |>
    pivot_longer(cols = -.draw, values_to = "threshold") |>
    group_by(.draw) |>
    mutate(theta_bar = mean(threshold)),
  # Kruschke's parameterization
  draws |>
    pivot_longer(cols = -.draw, values_to = "threshold") |>
    group_by(.draw) |>
    mutate(threshold = scales::rescale(threshold, to = c(1.5, 6.5))) |>
    mutate(theta_bar = mean(threshold))) |>
  # Add a parameterization index
  mutate(model = rep(c("brms parameterization", "Kruschke's parameterization"), each = n() / 2))

# Compute the means by model and threshold for the vertical lines
means <- p |>
  ungroup() |>
  group_by(model, name) |>
  summarise(mean = mean(threshold))
# Plot!
p |>
  ggplot(aes(x = threshold, y = theta_bar)) +
geom_vline(data = means,
aes(xintercept = mean, color = name),
linetype = 2) +
geom_point(aes(color = name),
alpha = 1/10, size = 1/2) +
scale_color_scico_d(palette = "lajolla", end = 0.85, direction = -1) +
ylab("mean threshold") +
theme(legend.position = "none") +
facet_wrap(~ model, ncol = 1, scales = "free")
We can take our rescaling approach further to convert the posterior distributions for the thresholds into posterior distributions for Kruschke’s $\mu$ and $\sigma$. If you’d like to compute $\mu$ and $\sigma$ from the brms thresholds, you can use the formulas

$$\sigma = \frac{y_1 - y_6}{\tau_1 - \tau_6} \quad \text{and} \quad \mu = y_1 - \tau_1 \sigma,$$

where $\tau_1$ and $\tau_6$ are the first and last brms thresholds, and $y_1 = 1.5$ and $y_6 = 6.5$ are the corresponding fixed threshold values in Kruschke’s parameterization. Big shout out to my math-stats savvy friends on academic twitter for the formulas, especially Ph.Demetri, Lukas Neugebauer, and Brenton Wiernik for walking out the formulas (see this twitter thread). For our application, Intercept[1] and Intercept[6] will be our two $\tau$ values.
draws |>
  select(.draw, `b_Intercept[1]`, `b_Intercept[6]`) |>
  mutate(`y[1]` = 1.5,
         `y[6]` = 6.5) |>
  mutate(sigma = (`y[1]` - `y[6]`) / (`b_Intercept[1]` - `b_Intercept[6]`)) |>
  mutate(mu = `y[1]` - `b_Intercept[1]` * sigma) |>
  mutate(`(mu-2)/sigma` = (mu - 2) / sigma) |>
  pivot_longer(cols = c(mu, sigma, `(mu-2)/sigma`)) |>
  mutate(name = factor(name, levels = c("mu", "sigma", "(mu-2)/sigma"))) |>
  ggplot(aes(x = value)) +
  stat_halfeye(point_interval = mode_hdi, .width = 0.95,
               color = sl[8], fill = sl[4],
               normalize = "panels") +
  scale_y_continuous(NULL, breaks = NULL) +
  facet_wrap(~ name, labeller = label_parsed, scales = "free")
Our results are similar to Kruschke’s. Given we used a different algorithm, a different parameterization, and different priors, I’m not terribly surprised they’re a little different. If you have more insight on the matter or have spotted a flaw in this method, please share with the rest of us.

It’s unclear, to me, how we’d interpret the effect size. The difficulty isn’t the arithmetic of Kruschke’s comparison of $\mu$ against the reference value in standard-deviation units; it’s what a standardized difference means on a latent scale we never observe directly.
In the large paragraph on the lower part of page 679, Kruschke discussed why the thresholds tend to have nontrivial covariances. This is what he was trying to convey with the bottom subplot in Figure 23.2. Just for practice, we might also explore the correlation matrix among the thresholds with a customized GGally::ggpairs()
plot.
# Customize
my_upper <- function(data, mapping, ...) {
  ggplot(data = data, mapping = mapping) +
    geom_point(alpha = 1/5, color = sl[7], size = 1/5)
}

my_diag <- function(data, mapping, ...) {
  ggplot(data = data, mapping = mapping) +
    geom_density(fill = sl[3], linewidth = 0) +
    scale_x_continuous(NULL, breaks = NULL) +
    scale_y_continuous(NULL, breaks = NULL)
}

my_lower <- function(data, mapping, ...) {
  
  # Get the x and y data to use the other code
  x <- eval_data_col(data, mapping$x)
  y <- eval_data_col(data, mapping$y)
  
  # Compute the correlations
  corr <- cor(x, y, method = "p", use = "pairwise")
  
  # Plot the cor value
  ggally_text(
    label = formatC(corr, digits = 2, format = "f") |> str_replace("0\\.", "."),
    mapping = aes(),
    color = "black", size = 4) +
    scale_x_continuous(NULL, breaks = NULL) +
    scale_y_continuous(NULL, breaks = NULL)
}
# Wrangle
as_draws_df(fit23.1) |>
select(contains("b_Intercept")) |>
set_names(str_c("theta[", 1:6, "]")) |>
# Plot
ggpairs(upper = list(continuous = my_upper),
diag = list(continuous = my_diag),
lower = list(continuous = my_lower),
labeller = label_parsed) +
ggtitle("The thresholds are highly correlated")
Kruschke didn’t do this in the text, but it might be informative to plot the probability distributions for the seven categories of Y.
draws |>
  select(-.draw) |>
mutate_all(.funs = pnorm) |>
transmute(`p[Y==1]` = `b_Intercept[1]`,
`p[Y==2]` = `b_Intercept[2]` - `b_Intercept[1]`,
`p[Y==3]` = `b_Intercept[3]` - `b_Intercept[2]`,
`p[Y==4]` = `b_Intercept[4]` - `b_Intercept[3]`,
`p[Y==5]` = `b_Intercept[5]` - `b_Intercept[4]`,
`p[Y==6]` = `b_Intercept[6]` - `b_Intercept[5]`,
`p[Y==7]` = 1 - `b_Intercept[6]`) |>
set_names(1:7) |>
pivot_longer(cols = everything(), names_to = "Y") |>
ggplot(aes(x = value, y = Y)) +
stat_halfeye(point_interval = mode_hdi, .width = 0.95,
color = sl[8], fill = sl[4], height = 2.5, size = 1/2) +
scale_x_continuous(expression(italic(p)*"["*Y==italic(i)*"]"),
breaks = 0:5 / 5, expand = c(0, 0), limits = c(0, 1))
Happily, the model produces data that look a lot like those from which it was generated.
set.seed(23)

draws |>
  mutate(z = rnorm(n(), mean = 0, sd = 1)) |>
  mutate(Y = case_when(
    z < `b_Intercept[1]` ~ 1,
    z < `b_Intercept[2]` ~ 2,
    z < `b_Intercept[3]` ~ 3,
    z < `b_Intercept[4]` ~ 4,
    z < `b_Intercept[5]` ~ 5,
    z < `b_Intercept[6]` ~ 6,
    z >= `b_Intercept[6]` ~ 7) |> as.factor()) |>
  ggplot(aes(x = Y)) +
  geom_bar(fill = sl[5]) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.05)))
Along similar lines, we can use the pp_check()
function to make a version of the upper right panel of Figure 23.2. The type = "bars"
argument will allow us to summarize the posterior predictions as a dot (mean) and standard error bars superimposed on a bar plot of the original data. Note how this differs a little from Kruschke’s use of the posterior median and 95% HDIs. The ndraws = 1000
argument controls how many posterior predictions we wanted to summarize over. The rest is just formatting.
bayesplot::color_scheme_set(sl[2:7])
set.seed(23)
pp_check(fit23.1, type = "bars", ndraws = 1000, fatten = 2) +
scale_x_continuous("y", breaks = 1:7) +
scale_y_continuous(NULL, breaks = NULL, expand = expansion(mult = c(0, 0.05))) +
ggtitle("Data with posterior predictions",
subtitle = "N = 100") +
theme(legend.background = element_blank(),
legend.position = c(0.9, 0.8))
Load the data for the next model.
my_data_2 <- read_csv("data.R/OrdinalProbitData-1grp-2.csv")
Since we’re reusing all the specifications from the last model for this one, we can just use update()
.
fit23.2 <- update(
  fit23.1,
  newdata = my_data_2,
  iter = 3000, warmup = 1000, chains = 4, cores = 4,
  seed = 23,
  file = "fits/fit23.02")
print(fit23.2)
Family: cumulative
Links: mu = probit; disc = identity
Formula: Y ~ 1
Data: my_data_2 (Number of observations: 70)
Draws: 4 chains, each with iter = 3000; warmup = 1000; thin = 1;
total post-warmup draws = 8000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept[1] -1.41 0.21 -1.84 -1.00 1.00 5545 5461
Intercept[2] -0.17 0.15 -0.47 0.12 1.00 9587 6805
Intercept[3] 0.17 0.15 -0.12 0.47 1.00 9258 7298
Intercept[4] 0.46 0.16 0.16 0.76 1.00 9223 6339
Intercept[5] 0.83 0.17 0.50 1.16 1.00 9120 7027
Intercept[6] 1.99 0.31 1.42 2.66 1.00 8518 6280
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
disc 1.00 0.00 1.00 1.00 NA NA NA
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Extract and wrangle the posterior draws.
draws <- as_draws_df(fit23.2) |>
  select(.draw, `b_Intercept[1]`:`b_Intercept[6]`)
Now we might compare the brms parameterization of the thresholds with Kruschke’s, as before. Since we’ll be making this kind of plot repeatedly, let’s wrap the relevant code from above into a function, which we’ll call compare_thresholds().
compare_thresholds <- function(data, lb = 1.5, ub = 6.5) {
  
  # `lb` = lower bound
  # `ub` = upper bound
  
  # Primary data wrangling
  p <- bind_rows(
    data |>
      pivot_longer(cols = -.draw, values_to = "threshold") |>
      group_by(.draw) |>
      mutate(theta_bar = mean(threshold)),
    data |>
      pivot_longer(cols = -.draw, values_to = "threshold") |>
      group_by(.draw) |>
      mutate(threshold = scales::rescale(threshold, to = c(lb, ub))) |>
      mutate(theta_bar = mean(threshold))) |>
    mutate(model = rep(c("brms parameterization", "Kruschke's parameterization"), each = n() / 2))
  
  # Compute the means by model and threshold for the vertical lines
  means <- p |>
    ungroup() |>
    group_by(model, name) |>
    summarise(mean = mean(threshold))
  
  # Plot
  p |>
    ggplot(aes(x = threshold, y = theta_bar)) +
    geom_vline(data = means,
               aes(xintercept = mean, color = name),
               linetype = 2) +
    geom_point(aes(color = name),
               alpha = 1/10, size = 1/2) +
    scale_color_scico_d(palette = "lajolla", end = 0.85, direction = -1) +
    ylab("mean threshold") +
    theme(legend.position = "none") +
    facet_wrap(~ model, ncol = 1, scales = "free")
}
Take that puppy for a spin.
draws |>
  compare_thresholds(lb = 1.5, ub = 6.5)
Oh man, that works sweet. Now let’s use the same parameter-transformation approach from before to get our un-standardized posteriors for $\mu$ and $\sigma$, along with the effect size for comparing $\mu$ against the reference value of 4.
draws |>
  select(.draw, `b_Intercept[1]`, `b_Intercept[6]`) |>
  mutate(`y[1]` = 1.5,
         `y[6]` = 6.5) |>
  mutate(sigma = (`y[1]` - `y[6]`) / (`b_Intercept[1]` - `b_Intercept[6]`)) |>
  mutate(mu = `y[1]` - `b_Intercept[1]` * sigma) |>
  mutate(`(mu-4)/sigma` = (mu - 4) / sigma) |>
  pivot_longer(cols = c(mu, sigma, `(mu-4)/sigma`)) |>
  mutate(name = factor(name, levels = c("mu", "sigma", "(mu-4)/sigma"))) |>
  ggplot(aes(x = value)) +
  stat_halfeye(point_interval = mode_hdi, .width = 0.95,
               color = sl[8], fill = sl[4],
               normalize = "panels") +
  scale_y_continuous(NULL, breaks = NULL) +
  facet_wrap(~ name, labeller = label_parsed, scales = "free")
Use pp_check()
to make our version of the upper-right panel of Figure 23.3.
set.seed(23)
pp_check(fit23.2, type = "bars", fatten = 2, ndraws = 1000) +
scale_x_continuous("y", breaks = 1:7) +
scale_y_continuous(NULL, breaks = NULL, expand = expansion(mult = c(0, 0.05))) +
ggtitle("Data with posterior predictions",
subtitle = "N = 70") +
theme(legend.background = element_blank(),
legend.position = c(0.9, 0.8))
Just as in the text, “the posterior predictive distribution in the top-right subpanel accurately describes the bimodal distribution of the outcomes” (p. 680).
Here are the probability distributions for each of the 7 categories of Y
.
draws |>
  select(-.draw) |>
mutate_all(.funs = pnorm) |>
transmute(`p[Y==1]` = `b_Intercept[1]`,
`p[Y==2]` = `b_Intercept[2]` - `b_Intercept[1]`,
`p[Y==3]` = `b_Intercept[3]` - `b_Intercept[2]`,
`p[Y==4]` = `b_Intercept[4]` - `b_Intercept[3]`,
`p[Y==5]` = `b_Intercept[5]` - `b_Intercept[4]`,
`p[Y==6]` = `b_Intercept[6]` - `b_Intercept[5]`,
`p[Y==7]` = 1 - `b_Intercept[6]`) |>
set_names(1:7) |>
pivot_longer(cols = everything(), names_to = "Y") |>
ggplot(aes(x = value, y = Y)) +
stat_halfeye(point_interval = mode_hdi, .width = 0.95,
color = sl[8], fill = sl[4], size = 1/2) +
scale_x_continuous(expression(italic(p)*"["*Y==italic(i)*"]"),
breaks = 0:5 / 5, expand = c(0, 0), limits = c(0, 1))
Before we move on, it might be helpful to nail down what the thresholds mean within the context of our brms parameterization. To keep things simple, we’ll focus on their posterior means.
tibble(x = seq(from = -3.5, to = 3.5, by = 0.01)) |>
mutate(density = dnorm(x)) |>
ggplot(aes(x = x, ymin = 0, ymax = density)) +
geom_ribbon(fill = sl[3]) +
geom_vline(xintercept = fixef(fit23.2)[, 1], color = "white", linetype = 3) +
scale_x_continuous(NULL, breaks = fixef(fit23.2)[, 1],
labels = parse(text = str_c("theta[", 1:6, "]"))) +
scale_y_continuous(NULL, breaks = NULL, expand = expansion(mult = c(0, 0.05))) +
ggtitle("Standard normal distribution underlying the ordinal Y data:",
subtitle = "The dashed vertical lines mark the posterior means for the thresholds.") +
coord_cartesian(xlim = c(-3, 3))
Compare that to Figure 23.1.
23.2.2.1 Not the same results as pretending the data are metric.
“In some conventional approaches to ordinal data, the data are treated as if they were metric and normally distributed” (p. 681). Here’s what that brms::brm()
model might look like using methods from back in Chapter 16. First, we’ll define our stanvars
.
mean_y <- mean(my_data_1$Y)
sd_y <- sd(my_data_1$Y)

stanvars <- stanvar(mean_y, name = "mean_y") +
  stanvar(sd_y, name = "sd_y")
Fit the model.
fit23.3 <- brm(
  data = my_data_1,
  family = gaussian,
  Y ~ 1,
  prior = c(prior(normal(mean_y, sd_y * 100), class = Intercept),
            prior(normal(0, sd_y), class = sigma)),
  chains = 4, cores = 4,
  stanvars = stanvars,
  seed = 23,
  file = "fits/fit23.03")
Check the results.
print(fit23.3)
Family: gaussian
Links: mu = identity; sigma = identity
Formula: Y ~ 1
Data: my_data_1 (Number of observations: 100)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 1.95 0.14 1.67 2.22 1.00 2995 2286
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 1.41 0.10 1.23 1.63 1.00 3529 2816
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
As Kruschke indicated in the text, it yielded a distributional mean of about 1.95 and a standard deviation of about 1.41. Here we’ll use a posterior predictive check to compare histograms of data generated from this model to that of the original data.
pp_check(fit23.3, type = "hist", binwidth = 1, ndraws = 10) +
scale_x_continuous(breaks = seq(from = -3, to = 7, by = 2)) +
theme(legend.position = c(0.9, 0.15))
Yeah, that’s not a good fit. We won’t be conducting a one-sample $t$-test like the one Kruschke discussed in the text. But we can compare the posterior for the intercept against the reference value of 2, and compute the corresponding effect size.
as_draws_df(fit23.3) |>
mutate(`2 - b_Intercept` = 2 - b_Intercept,
`effect size` = (2 - b_Intercept) / sigma) |>
pivot_longer(cols = `2 - b_Intercept`:`effect size`) |>
ggplot(aes(x = value)) +
stat_halfeye(point_interval = mode_hdi, .width = 0.95,
color = sl[8], fill = sl[4], normalize = "panels") +
scale_y_continuous(NULL, breaks = NULL) +
facet_wrap(~ name, scales = "free")
Yes indeed, 2 is a credible value for the intercept. And as reported in the text, we got a very small effect size. Now we prepare to fit the conventional Gaussian model to the second data set by redefining the stanvars and updating the fit.
mean_y <- mean(my_data_2$Y)
sd_y <- sd(my_data_2$Y)

stanvars <- stanvar(mean_y, name = "mean_y") +
  stanvar(sd_y, name = "sd_y")

fit23.4 <- update(
  fit23.3,
  newdata = my_data_2,
  chains = 4, cores = 4,
  stanvars = stanvars,
  seed = 23,
  file = "fits/fit23.04")
Let’s just jump to the plot. This time we’re comparing the b_Intercept
to the value of 4.0.
as_draws_df(fit23.4) |>
  mutate(`4 - b_Intercept` = 4 - b_Intercept,
         `effect size` = (4 - b_Intercept) / sigma) |>
  pivot_longer(cols = `4 - b_Intercept`:`effect size`) |>
ggplot(aes(x = value)) +
stat_halfeye(point_interval = mode_hdi, .width = 0.95,
color = sl[8], fill = sl[4], normalize = "panels") +
scale_y_continuous(NULL, breaks = NULL) +
facet_wrap(~ name, scales = "free")
As in the text, the conclusions from this model differ from those of its cumulative-normal counterpart. Here’s how well fit23.4 summarized these data.
pp_check(fit23.4, type = "hist", binwidth = 1, ndraws = 10) +
scale_x_continuous(breaks = seq(from = -3, to = 7, by = 2)) +
theme(legend.position = c(0.9, 0.15))
The histograms aren’t as awful as the ones from the previous model. But they’re still not great. We might further inspect the model misspecification with a cumulative distribution function overlay, this time comparing fit23.2
directly to fit23.4
.
p1 <- pp_check(fit23.2, type = "ecdf_overlay", ndraws = 50) +
  ggtitle("Cumulative-normal (fit23.2)")

p2 <- pp_check(fit23.4, type = "ecdf_overlay", ndraws = 50) +
  ggtitle("Conventional-normal (fit23.4)")

(p1 + p2 &
   scale_x_continuous(breaks = 0:7, limits = c(0, 7)) &
   scale_y_continuous(expand = c(0, 0)) &
   theme(title = element_text(size = 10.5))) +
  plot_layout(guides = "collect")
“Which of the analyses yields the more trustworthy conclusion? The one that describes the data better. In these cases, there is no doubt that the cumulative-normal model is the better description of the data” than the conventional Gaussian model (p. 682).
23.2.2.2 Ordinal outcomes versus Likert scales.
Just for fun,

rate how much you agree with the statement, “Bayesian estimation is more informative than null-hypothesis significance testing,” by selecting one option from the following: 1 = strongly disagree; 2 = disagree; 3 = undecided; 4 = agree; 5 = strongly agree. This sort of ordinal response interface is often called a Likert-type response (Likert, 1932; pronounced LICK-ert, not LIKE-ert). Sometimes, it is called a Likert “scale,” but the term “scale” in this context is more properly reserved for referring to an underlying metric variable that is indicated by the arithmetic mean of several meaningfully related Likert-type responses (Carifio & Perla, 2008; e.g., Carifio & Perla, 2007; Norman, 2010). (p. 681)
Kruschke then briefly introduced how one might combine several such meaningfully-related Likert-type responses with latent variable methods. He then clarified this text will not explore that approach further. The current version of brms (i.e., 2.12.0) has very limited latent variable capacities. However, they are in the works. Interested modelers can follow Bürkner’s progress in GitHub issue #304. He also has a (2020) paper on how one might use brms to fit item response theory models, which can be viewed as a special family of latent variable models. One can also fit Bayesian latent variable models with the blavaan package.
23.3 The case of two groups
In both examples in the preceding text, the two groups of outcomes were on the same ordinal scale. In the first example, both questionnaire statements were answered on the same disagree–agree scale. In the second example, both groups responded on the same very unhappy–very happy scale. Therefore, we assume that both groups have the same underlying metric variable with the same thresholds. (p. 682)
23.3.1 Implementation in JAGS brms.
The brm() syntax for adding a single categorical predictor to an ordered-probit model is much like that for any other likelihood. We just add the variable name to the right side of the ~ in the formula argument. If you’re like me and like to use the verbose 1 syntax for your model intercepts (thresholds in these models), just use the + operator between them. For example, this is what it’d look like for an ordered-categorical criterion y and a single categorical predictor x.
fit <- brm(
  data = my_data,
  family = cumulative(probit),
  y ~ 1 + x,
  prior = c(prior(normal(0, 4), class = Intercept),
            prior(normal(0, 4), class = b)))
Also of note, we’ve expanded the prior section to include a line for class = b. As with the thresholds, interpret this prior through the context of the underlying standard normal cumulative distribution. Because the latent standard deviation is fixed to 1, the $\beta$ coefficients for the levels of x will be differences on that standard-normal scale; in other words, those will be standardized mean differences, making them a kind of effect size.
Note, the above all presumes you’re only interested in comparing means between groups. Things get more complicated if you want the groups to vary by their latent standard deviations, too.
First, look back at the output from print(fit23.1) or print(fit23.2). The second line for both reads: Links: mu = probit; disc = identity. Hopefully the mu = probit part is no surprise. Probit regression is the primary focus of this chapter. But check out the disc = identity part and notice that nowhere in there is there any mention of sigma = identity like we get when treating the criterion as metric as in conventional Gaussian models (i.e., execute print(fit23.3) or print(fit23.4)).
Yes, there is a relationship between disc and sigma. disc is shorthand for discrimination. The term comes from the item response theory (IRT) literature, and discrimination is the inverse of the standard deviation, $\alpha = 1 / \sigma$. By default, both disc and sigma are modeled using the identity link. If you recall from earlier chapters, we switched to the log link to constrain the values to zero and above when we allowed $\sigma$ to vary across levels of a predictor; brms does the same for disc once we model it.
As with $\sigma$ in those earlier models, we use the bf() syntax when modeling the discrimination parameter in brms. For a general introduction to what Bürkner calls distributional modeling, see his (2022a) vignette, Estimating distributional models with brms. In the case of the discrimination parameter for the cumulative model, we’ll want more focused instructions. Happily, Bürkner & Vuorre (2019) have our backs. We read:
Conceptually, unequal variances are incorporated in the model by specifying an additional regression formula for the variance component of the latent variable. In brms, the parameter related to latent variances is called disc (short for “discrimination”), following conventions in item response theory. Note that disc is not the variance itself, but the inverse of the standard deviation. That is, $\text{disc} = \alpha = 1 / \sigma$. Further, because disc must be strictly positive, it is by default modeled on the log scale. Predicting auxiliary parameters (parameters of the distribution other than the mean/location) in brms is accomplished by passing multiple regression formulas to the
brm()
function. Each formula must first be wrapped in another function,bf()
orlf()
(for “linear formula”)–depending on whether it is a main or an auxiliary formula, respectively. The formulas are then combined and passed to theformula
argument ofbrm()
. Because the standard deviation of the latent variable is fixed to 1 for the baseline [group, disc cannot be estimated for the baseline group]. We must therefore ensure that disc is estimated only for [non-baseline groups]. To do so, we omit the intercept from the model of disc by writing0 + ...
on the right-hand side of the regression formula. By default, R applies cell-mean coding to factors in formulas without an intercept. That would lead to disc being estimated for [all groups], so we must deactivate it via thecmc
argument oflf()
. (pp. 11–12)
Here’s what that might look like.
fit <- brm(
  data = my_data,
  family = cumulative(probit),
  bf(y ~ 1 + x) +
    lf(disc ~ 0 + x, cmc = FALSE),
  prior = c(prior(normal(0, 4), class = Intercept),
            prior(normal(0, 4), class = b),
            prior(normal(0, 4), class = b, dpar = disc)))
Note how when using the disc ~ 0 + ... syntax, the disc parameters are of class = b within the prior() function. If you’d like to assign them priors differing from the other b parameters, you’ll need to specify dpar = disc. Again, though the mean structure for this model is on the probit scale, the discrimination structure is on the log scale. Recalling that $\alpha = 1 / \sigma$, this means the disc coefficients are deviations expressed in terms of $\log \alpha$.
To get a better sense of how one might set a prior on such a scale, we might compare what a few example $\sigma$ values look like when expressed as $\alpha$ and as $\log \alpha$.
tibble(mu = 0,
sigma = c(0.5, 1, 2)) |>
expand_grid(y = seq(from = -5, to = 5, by = 0.1)) |>
mutate(`p(y)` = dnorm(y, mu, sigma),
`Phi(y)` = pnorm(y, mu, sigma)) |>
mutate(alpha = 1 / sigma,
loga = log(1 / sigma)) |>
mutate(label = str_c("list(sigma==", sigma, ",alpha==", alpha, ",log(alpha)==", round(loga, 2), ")")) |>
pivot_longer(cols = `p(y)`:`Phi(y)`) |>
ggplot(aes(x = y, y = value)) +
geom_line(linewidth = 1.5, color = sl[7]) +
scale_y_continuous(NULL, breaks = NULL) +
xlab(NULL) +
coord_cartesian(xlim = c(-4, 4)) +
facet_grid(name ~ label, labeller = label_parsed, switch = "y")
Put another way, here’s how $\alpha$ and $\log \alpha$ behave across a range of $\sigma$ values.
tibble(sigma = seq(from = 0.0001, to = 10, by = 0.01)) |>
mutate(alpha = 1 / sigma,
`log(alpha)` = log(1 / sigma)) |>
pivot_longer(cols = -sigma, names_to = "labels") |>
ggplot(aes(x = sigma, y = value)) +
geom_hline(yintercept = 0, color = sl[3], linetype = 2) +
geom_vline(xintercept = 0, color = sl[3], linetype = 2) +
geom_line(linewidth = 1.5, color = sl[7]) +
coord_cartesian(ylim = c(-2, 10)) +
facet_grid(~ labels, labeller = label_parsed)
When $\sigma = 1$, we get $\alpha = 1$ and $\log \alpha = 0$. Here’s the relation in the opposite direction, mapping $\log \alpha$ back onto $\sigma$.
tibble(`log(alpha)` = seq(from = -3, to = 3, by = 0.01)) |>
mutate(sigma = 1 / exp(`log(alpha)`)) |>
ggplot(aes(x = `log(alpha)`, y = sigma)) +
geom_hline(yintercept = 0, color = sl[3], linetype = 2) +
geom_line(color = sl[7], linewidth = 1.5) +
labs(x = expression(log(alpha)),
y = expression(sigma)) +
coord_cartesian(xlim = c(-2.5, 2.5),
ylim = c(0, 10))
In the context where the underlying distribution for the reference category will be the standard normal, it seems like a normal(0, 1) prior on the $\log \alpha$ deviations is a reasonable starting point.
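One way to vet that choice is to simulate from the prior and inspect the implied $\sigma$ values. Here’s a minimal sketch; the implied distribution has a median near 1 with a 95% range of roughly 0.14 to 7.

# Simulate log(alpha) draws from normal(0, 1) and convert to the sigma scale
set.seed(23)
data.frame(log_alpha = rnorm(1e5, mean = 0, sd = 1)) |>
  mutate(sigma = 1 / exp(log_alpha)) |>
  summarise(`2.5%`  = quantile(sigma, probs = 0.025),
            median  = median(sigma),
            `97.5%` = quantile(sigma, probs = 0.975))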
23.3.2 Examples: Not funny.
Load the data for the next model.
my_data <- read_csv("data.R/OrdinalProbitData1.csv")
glimpse(my_data)
Rows: 88
Columns: 2
$ X <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A"…
$ Y <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
Fit the first two-group ordinal probit model, allowing both the latent mean and the latent standard deviation of Y to vary by group.
fit23.5 <- brm(
  data = my_data,
  family = cumulative(probit),
  bf(Y ~ 1 + X) +
    lf(disc ~ 0 + X, cmc = FALSE),
  prior = c(prior(normal(0, 4), class = Intercept),
            prior(normal(0, 4), class = b),
            prior(normal(0, 1), class = b, dpar = disc)),
  iter = 3000, warmup = 1000, chains = 4, cores = 4,
  seed = 23,
  file = "fits/fit23.05")
Look over the summary.
print(fit23.5)
Family: cumulative
Links: mu = probit; disc = log
Formula: Y ~ 1 + X
disc ~ 0 + X
Data: my_data (Number of observations: 88)
Draws: 4 chains, each with iter = 3000; warmup = 1000; thin = 1;
total post-warmup draws = 8000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept[1] 0.50 0.20 0.12 0.89 1.00 8682 5711
Intercept[2] 1.30 0.24 0.87 1.78 1.00 9614 6223
Intercept[3] 2.20 0.38 1.51 3.00 1.00 5284 5223
Intercept[4] 3.42 0.82 2.14 5.31 1.00 3958 4971
XB 0.43 0.34 -0.31 1.05 1.00 4816 3770
disc_XB -0.32 0.28 -0.86 0.25 1.00 3105 3916
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Because our fancy new parameter disc_XB is on the $\log \alpha$ scale, we’ll want to transform it back to the $\sigma$ scale with the formula $\sigma = 1 / \exp(\log \alpha)$.
1 / (exp(fixef(fit23.5)["disc_XB", 1]))
[1] 1.371802
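That was just a point estimate based on the posterior mean. Since the transformation is nonlinear, a minimal sketch for applying it draw by draw might look like this.

# Convert `b_disc_XB` to the sigma scale within each posterior draw
as_draws_df(fit23.5) |>
  transmute(`sigma[B]` = 1 / exp(b_disc_XB)) |>
  mean_qi(`sigma[B]`)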
Before we follow along with Kruschke, let’s hammer the meaning of these model parameters home. Here is a density plot of the two underlying latent distributions for Y
, given X
. We’ll throw in the thresholds for good measure. To keep things simple, we’ll just express the distributions in terms of the posterior means of each parameter.
tibble(X = LETTERS[1:2],
mu = c(0, fixef(fit23.5)["XB", 1]),
sigma = c(1, 1 / (exp(fixef(fit23.5)["disc_XB", 1])))) |>
expand_grid(y = seq(from = -5, to = 5, by = 0.1)) |>
mutate(density = dnorm(y, mean = mu, sd = sigma)) |>
ggplot(aes(x = y, y = density, fill = X)) +
geom_area(alpha = 2/3, position = "identity") +
geom_vline(xintercept = fixef(fit23.5)[1:4, 1], color = sl[9], linetype = 3) +
scale_fill_scico_d(palette = "lajolla", begin = 0.75, end = 0.4) +
scale_x_continuous(sec.axis = dup_axis(breaks = fixef(fit23.5)[1:4, 1] |> as.double(),
labels = parse(text = str_c("theta[", 1:4, "]")))) +
scale_y_continuous(NULL, breaks = NULL, expand = expansion(mult = c(0, 0.05))) +
labs(x = NULL,
title = "Underlying latent scale for Y, given X") +
theme(axis.ticks.x.top = element_blank())
Returning to our previous workflow, extract the posterior draws and wrangle.
draws <- as_draws_df(fit23.5)
glimpse(draws)
Rows: 8,000
Columns: 15
$ `b_Intercept[1]` <dbl> 0.6225250, 0.3970634, 0.4201213, 0.5717441, 0.6611438…
$ `b_Intercept[2]` <dbl> 1.2447460, 0.9979544, 0.9056600, 1.5148320, 1.6155517…
$ `b_Intercept[3]` <dbl> 2.015155, 1.432478, 1.869619, 2.601692, 2.592201, 1.9…
$ `b_Intercept[4]` <dbl> 3.182607, 1.760798, 2.241776, 3.869627, 3.710775, 3.1…
$ b_XB <dbl> 0.67725443, 0.23475709, 0.35607950, 0.47933554, 0.638…
$ b_disc_XB <dbl> -0.27934313, 0.31402332, 0.02621193, -0.14254980, -0.…
$ `Intercept[1]` <dbl> 0.28389776, 0.27968489, 0.24208154, 0.33207630, 0.342…
$ `Intercept[2]` <dbl> 0.9061188, 0.8805758, 0.7276202, 1.2751642, 1.2964657…
$ `Intercept[3]` <dbl> 1.676528, 1.315099, 1.691579, 2.362024, 2.273115, 1.6…
$ `Intercept[4]` <dbl> 2.843980, 1.643419, 2.063736, 3.629960, 3.391689, 2.8…
$ lprior <dbl> -12.86722, -12.66125, -12.69030, -13.10282, -13.06029…
$ lp__ <dbl> -107.9541, -111.8837, -110.4143, -109.7284, -108.2663…
$ .chain <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ .iteration <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
$ .draw <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
Now, let’s use our handy compare_thresholds()
function to make an expanded version of the lower-left plot of Figure 23.4.
draws |>
  select(`b_Intercept[1]`:`b_Intercept[4]`, .draw) |>
  compare_thresholds(lb = 1.5, ub = 4.5)
It will no longer be straightforward to use the formulas from 23.2.2 to convert the output from our brms parameterization to match the way Kruschke parameterized his conditional means and standard deviations. I will leave the conversion up to the interested reader. Going forward, we will focus on the output from our brms parameterization.
draws |>
  # Simple parameters
  mutate(`mu[A]` = 0,
`mu[B]` = b_XB,
`sigma[A]` = 1,
`sigma[B]` = 1 / exp(b_disc_XB)) |>
# Simple differences
mutate(`mu[B]-mu[A]` = `mu[B]` - `mu[A]`,
`sigma[B]-sigma[A]` = `sigma[B]` - `sigma[A]`) |>
# Effect size
mutate(`(mu[B]-mu[A])/sqrt((sigma[A]^2+sigma[B]^2)/2)` = (`mu[B]-mu[A]`) / sqrt((`sigma[A]`^2 + `sigma[B]`^2) / 2)) |>
# Wrangle
pivot_longer(cols = `mu[A]`:`(mu[B]-mu[A])/sqrt((sigma[A]^2+sigma[B]^2)/2)`) |>
mutate(name = factor(name,
levels = c("mu[A]", "mu[B]", "mu[B]-mu[A]",
"sigma[A]", "sigma[B]", "sigma[B]-sigma[A]",
"(mu[B]-mu[A])/sqrt((sigma[A]^2+sigma[B]^2)/2)"))) |>
# Plot
ggplot(aes(x = value)) +
stat_halfeye(point_interval = mode_hdi, .width = 0.95,
color = sl[8], fill = sl[4], normalize = "panels") +
scale_y_continuous(NULL, breaks = NULL) +
xlab(NULL) +
facet_wrap(~ name, labeller = label_parsed, scales = "free")
Do note the limitations of our stat_halfeye() approach with the scales freed across facets. If these plots really mattered for a scientific presentation or something for industry, you could experiment using either a common scale across all facets, or making the plots individually and then combining them with patchwork syntax. Returning to interpretation, because $\mu_A$ and $\sigma_A$ are fixed at 0 and 1 in this parameterization, the posteriors for $\mu_B$ and $\sigma_B$ can be read directly as differences from group A on the latent standard-normal scale, and the final panel expresses the mean difference as a standardized effect size.
Here we make good use of the type = "bars_grouped"
and group = "X"
arguments to make the posterior predictive plots at the top right of Figure 23.4 with the brms::pp_check()
function.
set.seed(23)
pp_check(fit23.5, type = "bars_grouped", fatten = 2, group = "X", ndraws = 100) +
scale_x_continuous("y", breaks = 1:7) +
scale_y_continuous(NULL, breaks = NULL, expand = expansion(mult = c(0, 0.05))) +
ggtitle("Data with posterior predictions",
subtitle = expression(list(italic(N[A])==44, italic(N[B])==44))) +
theme(legend.background = element_blank(),
legend.position = c(0.9, 0.75))
Using more tricks from back in Chapter 16, here’s the corresponding conventional Gaussian model for metric data.
mean_y <- mean(my_data$Y)
sd_y <- sd(my_data$Y)

stanvars <- stanvar(mean_y, name = "mean_y") +
  stanvar(sd_y, name = "sd_y")
fit23.6 <- brm(
  data = my_data,
  family = gaussian,
  bf(Y ~ 0 + X, sigma ~ 0 + X),
  prior = c(prior(normal(mean_y, sd_y * 100), class = b),
            prior(normal(0, 1), class = b, dpar = sigma)),
  chains = 4, cores = 4,
  stanvars = stanvars,
  seed = 23,
  file = "fits/fit23.06")
Check the summary.
print(fit23.6)
Family: gaussian
Links: mu = identity; sigma = log
Formula: Y ~ 0 + X
sigma ~ 0 + X
Data: my_data (Number of observations: 88)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
XA 1.43 0.12 1.20 1.66 1.00 4042 2503
XB 1.86 0.17 1.53 2.18 1.00 3749 2648
sigma_XA -0.26 0.11 -0.47 -0.03 1.00 4222 3112
sigma_XB 0.08 0.11 -0.12 0.30 1.00 3623 2465
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Here are the marginal posteriors, including the effect size (i.e., a standardized mean difference using the pooled standard deviation formula presuming equal sample sizes, $(\mu_B - \mu_A) \big/ \sqrt{(\sigma_A^2 + \sigma_B^2) / 2}$).
as_draws_df(fit23.6) |>
mutate(`A mean` = b_XA,
`B mean` = b_XB,
`A Std. Dev.` = exp(b_sigma_XA),
`B Std. Dev.` = exp(b_sigma_XB)) |>
mutate(`Difference of Means` = `B mean` - `A mean`,
`Difference of Std. Devs` = `B Std. Dev.` - `A Std. Dev.`) |>
mutate(`Effect Size` = `Difference of Means` / sqrt((`A Std. Dev.`^2 + `B Std. Dev.`^2) / 2)) |>
pivot_longer(cols = `A mean`:`Effect Size`) |>
ggplot(aes(x = value)) +
stat_halfeye(point_interval = mode_hdi, .width = 0.95,
color = sl[8], fill = sl[4], normalize = "panels") +
scale_y_continuous(NULL, breaks = NULL) +
labs(x = "Marginal posterior",
subtitle = "These are based on the conventional Gaussian model, NOT the cumulative-normal\nmodel Kruschke displayed in Figure 23.4") +
facet_wrap(~ name, ncol = 2, scales = "free")
Compare those results to those Kruschke reported from an NHST analysis in the note below Figure 23.4, wherein a $t$-test concluded the means were significantly different, with an accompanying effect size and CI, and wherein

an $F$ test of the variances concludes that the standard deviations are significantly different… Notice in this case that treating the values as metric greatly underestimates their variances, as well as erroneously concludes the variances are different. (p. 684)
As to the data in the analyses Kruschke reported in Figure 23.5 and the prose in the second paragraph on page 685, I’m not aware that Kruschke provided them. From his footnote #2, we read: “Data in Figure 23.5 are from an as-yet unpublished study I conducted with the collaboration of Allison Vollmer as part of her undergraduate honors project.” In place of the real data, I eyeballed the values based on the upper-right panels in Figure 23.5. Here they are.
d <- tibble(x = rep(str_c("joke ", c(1, 6)), each = 177),
            y = c(rep(1:7, times = c(95, 19, 18, 10, 17, 10, 8)),
                  rep(1:7, times = c(53, 33, 31, 22, 23, 14, 1))))

glimpse(d)
glimpse(d)
Rows: 354
Columns: 2
$ x <chr> "joke 1", "joke 1", "joke 1", "joke 1", "joke 1", "joke 1", "joke 1"…
$ y <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
My approximation to Kruschke’s data looks like this.
d |>
  ggplot(aes(x = y)) +
geom_bar(fill = sl[5]) +
scale_x_continuous(breaks = 1:7) +
scale_y_continuous(NULL, breaks = NULL, expand = expansion(mult = c(0, 0.05))) +
facet_wrap(~ x, ncol = 1)
Here we fit the cumulative-normal model based on our version of the data, continuing to allow both the latent mean and standard deviation to vary by group.
fit23.7 <- brm(
  data = d,
  family = cumulative(probit),
  bf(y ~ 1 + x) +
    lf(disc ~ 0 + x, cmc = FALSE),
  prior = c(prior(normal(0, 4), class = Intercept),
            prior(normal(0, 4), class = b),
            prior(normal(0, 1), class = b, dpar = disc)),
  iter = 3000, warmup = 1000, chains = 4, cores = 4,
  seed = 23,
  file = "fits/fit23.07")
Check the model summary.
print(fit23.7)
Family: cumulative
Links: mu = probit; disc = log
Formula: y ~ 1 + x
disc ~ 0 + x
Data: d (Number of observations: 354)
Draws: 4 chains, each with iter = 3000; warmup = 1000; thin = 1;
total post-warmup draws = 8000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept[1] 0.08 0.09 -0.11 0.26 1.00 4586 5209
Intercept[2] 0.37 0.09 0.19 0.54 1.00 6484 5951
Intercept[3] 0.65 0.09 0.47 0.83 1.00 7716 5929
Intercept[4] 0.87 0.10 0.68 1.06 1.00 7589 6623
Intercept[5] 1.26 0.12 1.03 1.49 1.00 5985 5797
Intercept[6] 1.81 0.16 1.51 2.13 1.00 6906 6579
xjoke6 0.39 0.10 0.20 0.58 1.00 6747 5827
disc_xjoke6 0.50 0.12 0.27 0.73 1.00 4011 5214
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Save and wrangle the posterior draws, then use our compare_thresholds() function to compare the brms parameterization of the thresholds with Kruschke’s.

draws <- as_draws_df(fit23.7)

draws |>
  select(`b_Intercept[1]`:`b_Intercept[6]`, .draw) |>
  compare_thresholds(lb = 1.5, ub = 6.5)
Given our data are only approximations of Kruschke’s, I think we did pretty good. Here are our brms versions of the marginal posteriors for the group-level means and standard deviations, their differences, and the effect size.
draws |>
  transmute(`mu[Joke~1]` = 0,
`mu[Joke~6]` = b_xjoke6,
`sigma[Joke~1]` = 1,
`sigma[Joke~6]` = 1 / exp(b_disc_xjoke6)) |>
mutate(`mu[Joke~6]-mu[Joke~1]` = `mu[Joke~6]` - `mu[Joke~1]`,
`sigma[Joke~6]-sigma[Joke~1]` = `sigma[Joke~6]` - `sigma[Joke~1]`) |>
mutate(`(mu[Joke~6]-mu[Joke~1])/sqrt((sigma[Joke~1]^2+sigma[Joke~6]^2)/2)` = (`mu[Joke~6]-mu[Joke~1]`) / sqrt((`sigma[Joke~1]`^2 + `sigma[Joke~6]`^2) / 2)) |>
pivot_longer(cols = everything()) |>
mutate(name = factor(name,
levels = c("mu[Joke~1]", "mu[Joke~6]", "mu[Joke~6]-mu[Joke~1]",
"sigma[Joke~1]", "sigma[Joke~6]", "sigma[Joke~6]-sigma[Joke~1]",
"(mu[Joke~6]-mu[Joke~1])/sqrt((sigma[Joke~1]^2+sigma[Joke~6]^2)/2)"))) |>
ggplot(aes(x = value)) +
stat_halfeye(point_interval = mode_hdi, .width = 0.95,
color = sl[8], fill = sl[4], normalize = "panels") +
scale_y_continuous(NULL, breaks = NULL) +
xlab(NULL) +
facet_wrap(~ name, labeller = label_parsed, scales = "free")
Here are our versions of the two panels in the upper right of Figure 23.5.
set.seed(23)
pp_check(fit23.7, type = "bars_grouped", fatten = 2, group = "x", ndraws = 100) +
scale_x_continuous("y", breaks = 1:7) +
scale_y_continuous(NULL, breaks = NULL, expand = expansion(mult = c(0, 0.05))) +
ggtitle("Data with posterior predictions",
subtitle = expression(list(italic(N["joke "*1])==177, italic(N["joke "*6])==177))) +
theme(legend.position = "none")
Now here’s the corresponding model where we treat the y
data as metric.
mean_y <- mean(d$y)
sd_y <- sd(d$y)

stanvars <- stanvar(mean_y, name = "mean_y") +
  stanvar(sd_y, name = "sd_y")

fit23.8 <- brm(
  data = d,
  family = gaussian,
  bf(y ~ 0 + x, sigma ~ 0 + x),
  prior = c(prior(normal(mean_y, sd_y * 100), class = b),
            prior(normal(0, exp(sd_y)), class = b, dpar = sigma)),
  chains = 4, cores = 4,
  stanvars = stanvars,
  seed = 23,
  file = "fits/fit23.08")
Check the summary.
print(fit23.8)
Family: gaussian
Links: mu = identity; sigma = log
Formula: y ~ 0 + x
sigma ~ 0 + x
Data: d (Number of observations: 354)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
xjoke1 2.41 0.14 2.14 2.69 1.00 3958 3066
xjoke6 2.86 0.12 2.61 3.09 1.00 4498 2809
sigma_xjoke1 0.64 0.05 0.54 0.75 1.00 4485 3035
sigma_xjoke6 0.52 0.05 0.42 0.63 1.00 5132 3413
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Make the marginal posteriors, including the effect size.
as_draws_df(fit23.8) |>
mutate(`Joke 1 Mean` = b_xjoke1,
`Joke 6 Mean` = b_xjoke6,
`Joke 1 Std. Dev.` = exp(b_sigma_xjoke1),
`Joke 6 Std. Dev.` = exp(b_sigma_xjoke6)) |>
mutate(`Difference of Means` = `Joke 6 Mean` - `Joke 1 Mean`,
`Difference of Std. Devs` = `Joke 6 Std. Dev.` - `Joke 1 Std. Dev.`) |>
mutate(`Effect Size` = `Difference of Means` / sqrt((`Joke 1 Std. Dev.`^2 + `Joke 6 Std. Dev.`^2) / 2)) |>
pivot_longer(cols = `Joke 1 Mean`:`Effect Size`) |>
mutate(name = factor(name,
levels = c("Joke 1 Mean", "Joke 1 Std. Dev.",
"Joke 6 Mean", "Joke 6 Std. Dev.",
"Difference of Means", "Difference of Std. Devs",
"Effect Size"))) |>
ggplot(aes(x = value)) +
stat_halfeye(point_interval = mode_hdi, .width = 0.95,
color = sl[8], fill = sl[4], normalize = "panels") +
scale_y_continuous(NULL, breaks = NULL) +
labs(x = "Marginal posterior",
subtitle = "These are based on the conventional Gaussian model, NOT the cumulative-normal\nmodel Kruschke displayed in Figure 23.5") +
facet_wrap(~ name, ncol = 2, scales = "free")
If you think you have a better approximation of Kruschke’s data, please share.
23.4 The case of metric predictors
“This type of model is often referred to as ordinal probit regression or ordered probit regression because the probit function is the link function corresponding to the cumulative-normal inverse-link function” (p. 688, emphasis in the original).
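As a quick refresher in R, the probit link is qnorm() and the cumulative-normal inverse link is pnorm(); the two undo one another.

# The probit maps a probability to a z-score;
# the cumulative normal maps the z-score back to a probability
qnorm(0.75)         # 0.6744898
pnorm(qnorm(0.75))  # 0.75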
23.4.1 Implementation in JAGS brms.
This model is easy to specify in brms. Just make sure to think clearly about your priors.
23.4.2 Example: Happiness and money.
Load the data for the next model.
my_data <- read_csv("data.R/OrdinalProbitData-LinReg-2.csv")
glimpse(my_data)
Rows: 200
Columns: 2
$ X <dbl> 1.386389, 1.223879, 1.454505, 1.112068, 1.222715, 1.545099, 1.360256…
$ Y <dbl> 1, 1, 5, 5, 1, 4, 6, 2, 5, 4, 1, 4, 4, 4, 4, 6, 1, 1, 6, 2, 1, 7, 1,…
Take a quick look at the data.
my_data |>
  ggplot(aes(x = X, y = Y)) +
geom_point(alpha = 1/3, color = sl[9]) +
scale_y_continuous(breaks = 1:7)
Kruschke standardized his predictor within his model code. Here we’ll standardize X
before fitting the model.
my_data <- my_data |>
  mutate(X_s = (X - mean(X)) / sd(X))
Fit the model.
fit23.9 <- brm(
  data = my_data,
  family = cumulative(probit),
  Y ~ 1 + X_s,
  prior = c(prior(normal(0, 4), class = Intercept),
            prior(normal(0, 4), class = b)),
  iter = 3000, warmup = 1000, chains = 4, cores = 4,
  seed = 23,
  file = "fits/fit23.09")
Check the summary.
print(fit23.9)
Family: cumulative
Links: mu = probit; disc = identity
Formula: Y ~ 1 + X_s
Data: my_data (Number of observations: 200)
Draws: 4 chains, each with iter = 3000; warmup = 1000; thin = 1;
total post-warmup draws = 8000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept[1] -1.18 0.12 -1.42 -0.94 1.00 4929 5788
Intercept[2] -0.76 0.11 -0.98 -0.55 1.00 6648 6423
Intercept[3] -0.29 0.10 -0.50 -0.09 1.00 8515 6940
Intercept[4] 0.25 0.11 0.04 0.45 1.00 9773 7244
Intercept[5] 0.73 0.11 0.51 0.95 1.00 8933 6684
Intercept[6] 1.26 0.13 1.01 1.52 1.00 8414 7401
X_s 1.16 0.10 0.97 1.35 1.00 6232 6258
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
disc 1.00 0.00 1.00 1.00 NA NA NA
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Extract the posterior draws and compare the brms parameterization of the thresholds with Kruschke's.
draws <- as_draws_df(fit23.9)

draws |>
  select(`b_Intercept[1]`:`b_Intercept[6]`, .draw) |>
  compare_thresholds(lb = 1.5, ub = 6.5)
Here's the marginal distribution of b_X_s, the effect size for our standardized predictor X_s.
draws |>
  ggplot(aes(x = b_X_s)) +
stat_halfeye(point_interval = mode_hdi, .width = 0.95,
color = sl[8], fill = sl[4]) +
scale_y_continuous(NULL, breaks = NULL)
This next plot differs from Kruschke's workflow. For a quick look at the model, we can use conditional_effects() within plot().
conditional_effects(fit23.9) |>
plot(line_args = list(color = sl[7], fill = sl[3]))
Here’s a more elaborated version of the same plot, this time depicting the model with 100 fitted lines randomly drawn from the posterior.
set.seed(23)
conditional_effects(fit23.9,
ndraws = 100,
spaghetti = TRUE) |>
plot(points = TRUE,
line_args = list(size = 0),
point_args = list(alpha = 1/3, color = sl[9]),
spaghetti_args = list(colour = alpha(sl[6], 0.2)))
Note the warning message. There was a similar one in the first plot, which I suppressed for simplicity's sake. The message suggests that treating the fitted lines as "continuous variables" might lead to a deceptive plot. Here's what happens if we follow the suggestion.
set.seed(23)
conditional_effects(fit23.9, categorical = TRUE)
Recall that our thresholds, the $\theta$'s, carve the latent scale into the probabilities of each Y value, given X_s. That is, at lower values of X_s, lower values of Y are more probable. At higher values of X_s, higher values of Y are more probable.
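To make that concrete, here's a minimal by-hand sketch of how the thresholds slice the latent standard normal into response probabilities, using the posterior means from fit23.9 at X_s = 0. This is the computation fitted() will do for us, draw by draw, in just a bit.

# Posterior means for the six thresholds and the slope
theta <- fixef(fit23.9)[1:6, 1]
mu    <- fixef(fit23.9)[7, 1] * 0  # the latent mean at X_s = 0

# p(Y = k) = Phi(theta_k - mu) - Phi(theta_{k - 1} - mu),
# with theta_0 = -Inf and theta_7 = Inf
diff(c(0, pnorm(theta - mu), 1))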
It might be useful to get more practice with this model. Here we'll use fitted() to make a similar plot, depicting the model with many fitted lines instead of summary statistics.
# How many fitted lines do you want?
n_draws <- 50
# Define the `X_s` values you want to condition on
# because the lines are nonlinear, you'll need many of them
nd <- tibble(X_s = seq(from = -2, to = 2, by = 0.05))

f <- fitted(fit23.9,
            newdata = nd,
            ndraws = n_draws,
            summary = FALSE)
# Inspect the output
str(f)
num [1:50, 1:81, 1:7] 0.8 0.898 0.789 0.869 0.874 ...
- attr(*, "dimnames")=List of 3
..$ : chr [1:50] "1" "2" "3" "4" ...
..$ : NULL
..$ : chr [1:7] "1" "2" "3" "4" ...
Our output came in three dimensions. We have 50 rows, corresponding to n_draws <- 50
(i.e., 50 posterior draws). There are 81 columns, based on how we defined the X_s
values within our nd
data (i.e., seq(from = -2, to = 2, by = 0.05)
). The third dimension has seven levels, one corresponding to each of the seven levels of our criterion variable Y
. Here’s a way to rearrange that output into a useful format for plotting.
# Rearrange the output
rbind(
  f[, , 1],
  f[, , 2],
  f[, , 3],
  f[, , 4],
  f[, , 5],
  f[, , 6],
  f[, , 7]) |>
  # Wrangle
data.frame() |>
set_names(nd |> pull(X_s)) |>
mutate(draw = rep(1:n_draws, times = 7),
rating = rep(1:7, each = n_draws)) |>
pivot_longer(cols = -c(draw, rating), names_to = "X_s", values_to = "probability") |>
mutate(rating = str_c("Y: ", rating),
X_s = X_s |> as.double()) |>
# Plot
ggplot(aes(x = X_s, y = probability,
group = interaction(draw, rating),
color = rating)) +
geom_line(alpha = 1/2, linewidth = 1/4) +
scale_y_continuous(breaks = c(0, 0.5, 1), expand = c(0, 0), limits = c(0, 1)) +
scale_color_scico_d(palette = "lajolla", end = 0.85, direction = -1) +
theme(legend.position = "none") +
facet_wrap(~ rating, ncol = 7)
So far, we've been plotting our model within the context of the default scale = "response" setting within fitted(). Within the context of the response variable Y, our model returns response probabilities. We can also look at the output within the context of scale = "linear". We'll plot these fitted lines across our nd values two different ways. For the first, p1, we'll use summary statistics. For the second, p2, we'll set summary = FALSE and work with individual posterior draws.
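In symbols, what scale = "linear" returns is the latent linear predictor, which is a linear function of X_s; the response probabilities come from slicing the cumulative normal at the thresholds, which is where the nonlinearity enters:

$$p(Y \leq k \mid X_s) = \Phi(\theta_k - \beta X_s), \qquad p(Y = k \mid X_s) = \Phi(\theta_k - \beta X_s) - \Phi(\theta_{k - 1} - \beta X_s).$$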
# Adjust the predictor data
nd <- tibble(X_s = seq(from = -2.5, to = 2.5, by = 0.1))
# Use summary statistics
p1 <- fitted(fit23.9,
             newdata = nd,
             scale = "linear") |>
data.frame() |>
bind_cols(nd) |>
ggplot(aes(x = X_s, y = Estimate, ymin = Q2.5, ymax = Q97.5)) +
geom_smooth(stat = "identity",
alpha = 1/2, color = sl[7], fill = sl[3]) +
labs(y = "underlying standard normal",
title = "summary statistics")
# Set `summary = FALSE`
set.seed(23)
p2 <- fitted(fit23.9,
             newdata = nd,
             ndraws = n_draws,
             scale = "linear",
             summary = FALSE) |>
data.frame() |>
set_names(nd |> pull(X_s)) |>
mutate(draw = 1:n_draws) |>
pivot_longer(cols = -draw, names_to = "X_s") |>
mutate(X_s = X_s |> as.double()) |>
ggplot(aes(x = X_s, y = value, group = draw)) +
geom_line(alpha = 1/2, color = sl[7]) +
scale_y_continuous(NULL, breaks = NULL) +
ggtitle("50 posterior draws")
# Combine, adjust, and display
p1 + p2 &
  coord_cartesian(ylim = c(-2, 2))
Both methods returned the fitted lines in the metric of the underlying standard normal distribution. The fitted lines are nonlinear in the metric of the raw data Y
, but they’re linear in the metric of the presumed underlying distribution. If it helps, we’ll make a marginal plot of the standard normal distribution and tack it onto the right.
# Make Phi
p3 <- tibble(x = seq(from = -3, to = 3, by = 0.01)) |>
  mutate(density = dnorm(x)) |>
ggplot(aes(x = x, y = density)) +
geom_area(fill = sl[5]) +
# Add the thresholds!
geom_vline(xintercept = posterior_summary(fit23.9)[1:6, 1],
color = sl[9], linetype = 3) +
# Mark the thresholds with the axis breaks
scale_x_reverse(NULL, breaks = fixef(fit23.9)[1:6, 1], position = "top",
labels = parse(text = str_c("theta[", 1:6, "]"))) +
scale_y_continuous(NULL, breaks = NULL) +
ggtitle(expression(italic(N)(0,1))) +
coord_flip(xlim = c(-2, 2))
# Combine, adjust, and display
(((p1 | p2) &
    geom_hline(yintercept = posterior_summary(fit23.9)[1:6, 1],
               color = sl[9], linetype = 3) &
    coord_cartesian(ylim = c(-2, 2))) | p3) +
  plot_layout(widths = c(4, 4, 1))
The lines intersecting the plots are the posterior means for the thresholds, $\theta_1$ through $\theta_6$.
But these still aren’t faithful depictions of the top panel of Figure 23.7, you say. Okay, fine. One of the distinctive elements of that panel is the left-tilted bar-and-error plots. If you look closely at the vertical lines at their bases, you’ll see that the leftmost subplot starts at the minimum value of X
and the rightmost subplot starts at the maximum value of X
. Since our plots, so far, have been based on X_s
, we’ll use the minimum and maximum values from that. Here are those values.
(r <- range(my_data$X_s))
[1] -1.774444 1.750168
To my eye, it appears that the three middle subplots are equally distributed between those at the ends. If we proceed under that assumption, here’s how we might use fitted()
to get us rolling on computing the relevant values.
nd <- tibble(X_s = seq(from = r[1], to = r[2], length.out = 5))

f <- fitted(fit23.9, newdata = nd)
# Inspect the output
str(f)
num [1:5, 1:4, 1:7] 0.807183 0.444354 0.124257 0.015638 0.000907 ...
- attr(*, "dimnames")=List of 3
..$ : NULL
..$ : chr [1:4] "Estimate" "Est.Error" "Q2.5" "Q97.5"
..$ : chr [1:7] "P(Y = 1)" "P(Y = 2)" "P(Y = 3)" "P(Y = 4)" ...
Here we’ll rearrange the output to make it useful for plotting.
# Rearrange the output
f <- rbind(
  f[, , 1],
  f[, , 2],
  f[, , 3],
  f[, , 4],
  f[, , 5],
  f[, , 6],
  f[, , 7]) |>
  # Wrangle
data.frame() |>
bind_cols(expand_grid(nd, Y = 1:7) |> arrange(Y, X_s))
head(f)
Estimate Est.Error Q2.5 Q97.5 X_s Y
1 0.8071829772 0.0469968777 0.7067266277 0.889247368 -1.77444380 1
2 0.4443543291 0.0485583395 0.3494432061 0.540381426 -0.89329094 1
3 0.1242573951 0.0248619516 0.0805231604 0.177020747 -0.01213809 1
4 0.0156381288 0.0067075474 0.0057884052 0.032051758 0.86901477 1
5 0.0009066288 0.0007959035 0.0001158735 0.003025388 1.75016762 1
6 0.0925281535 0.0252089786 0.0500434480 0.146541824 -1.77444380 2
Now we can make those bar-and-error plots.
f |>
  mutate(X_s = round(X_s, digits = 3)) |>
ggplot(aes(x = Y, y = Estimate,
ymin = Q2.5, ymax = Q97.5)) +
geom_col(fill = sl[4]) +
geom_pointrange(color = sl[7], fatten = 1.5, linewidth = 1) +
scale_x_continuous(breaks = 1:7) +
scale_y_reverse(NULL, position = "right", breaks = c(1, 0.5, 0), expand = c(0, 0),
labels = c(1, ".5", 0), limits = c(1, 0)) +
coord_flip() +
facet_wrap(~ X_s, ncol = 7, strip.position = "bottom")
The X_s
values are depicted in the panel strips on the bottom. The response probabilities are scaled based on the axis on the top. The points and leftmost sides of the bars are the posterior means. The thin, dark horizontal lines are the percentile-based 95% intervals. Here we reformat f
a little more to put those bar-and-error plots in a format more similar to that of Figure 23.7.
f |>
  select(-Est.Error) |>
# Rescale the probability summaries
mutate_at(.vars = vars(Estimate:Q97.5), .funs = \(x) x / 2) |>
# Plot
ggplot() +
geom_vline(xintercept = seq(from = r[1], to = r[2], length.out = 5),
color = sl[2]) +
# Bar marking the Estimate
geom_segment(aes(x = X_s, xend = X_s - Estimate,
y = Y + 0.1, yend = Y + 0.1),
color = sl[4], linewidth = 8) +
# Bar marking the 95% interval
geom_segment(aes(x = X_s - Q2.5, xend = X_s - Q97.5,
y = Y + 0.2, yend = Y + 0.2),
color = sl[7], linewidth = 1) +
# Data
geom_point(data = my_data,
aes(x = X_s, y = Y),
color = sl[9], shape = 1, size = 2) +
scale_y_continuous("Y", breaks = 1:7, limits = c(0.5, 7.5)) +
coord_cartesian(xlim = c(-2.4, 2.4))
I’m not going to attempt superimposing fitted lines on this plot the way Kruschke did. Given that our ordered-probit model is nonlinear on the scale of the criterion, it seems misleading to present linear fitted lines atop the raw data. If you’d like to do so, you’re on your own.
Now here's the corresponding model if we treat the y data as metric, with tricks from Chapter 17.
sd_x <- sd(my_data$X)
sd_y <- sd(my_data$Y)
m_x <- mean(my_data$X)
m_y <- mean(my_data$Y)

beta_0_sigma <- 10 * abs(m_x * sd_y / sd_x)
beta_1_sigma <- 10 * abs(sd_y / sd_x)

stanvars <- stanvar(beta_0_sigma, name = "beta_0_sigma") +
  stanvar(beta_1_sigma, name = "beta_1_sigma") +
  stanvar(sd_y, name = "sd_y")

fit23.10 <- brm(
  data = my_data,
  family = gaussian,
  Y ~ 1 + X,
  prior = c(prior(normal(0, beta_0_sigma), class = Intercept),
            prior(normal(0, beta_1_sigma), class = b),
            prior(normal(0, sd_y), class = sigma)),
  chains = 4, cores = 4,
  stanvars = stanvars,
  seed = 23,
  file = "fits/fit23.10")
It may not have been readily apparent from Kruschke’s prose in the note for Figure 23.7, but his OLS model was based on the fully unstandardized data (i.e., using X
as the predictor), not the partially standardized data he used in his JAGS code from 23.4.1. We followed the same sensibilities for fit23.10
. Speaking of which, here are the summaries for the marginal posteriors.
posterior_summary(fit23.10)[1:3, ] |> round(digits = 3)
Estimate Est.Error Q2.5 Q97.5
b_Intercept -5.412 0.629 -6.648 -4.199
b_X 6.701 0.440 5.849 7.570
sigma 1.527 0.076 1.385 1.686
These values are very close to those he reported at the bottom of page 690. Here's what the fitted lines from that model look like superimposed on the data, presuming both variables are metric.
set.seed(23)
conditional_effects(fit23.10,
ndraws = 100,
spaghetti = TRUE) |>
plot(points = TRUE,
line_args = list(size = 0),
point_args = list(alpha = 1/3, color = sl[9]),
spaghetti_args = list(colour = alpha(sl[6], 0.2)))
For the next example, we’ll load the HappinessAssetsDebt.csv
data from Shi (2009).
my_data <- read_csv("data.R/HappinessAssetsDebt.csv")
glimpse(my_data)
Rows: 6,759
Columns: 3
$ Happiness <dbl> 3, 3, 3, 3, 1, 3, 2, 2, 4, 2, 3, 5, 3, 3, 4, 3, 3, 2, 4, 3, …
$ Assets <dbl> 0, 10000, 30000, 40000, 21000, 20000, 20000, 0, 0, 20000, 50…
$ Debt <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5000, 0, 0, 0, 0, 0, 0, 0, 0, …
Here's a quick scatter plot of the data. To help with the overplotting, the points have been vertically jittered a bit, though as indicated in the text, Happiness is a discrete variable.
set.seed(23)
my_data |>
  ggplot(aes(x = Assets, y = Happiness)) +
geom_jitter(alpha = 1/4, color = sl[9], height = 0.25) +
scale_x_continuous(expand = expansion(0, 0.05))
Standardize our predictor.
my_data <- my_data |>
  mutate(Assets_s = (Assets - mean(Assets)) / sd(Assets))
Fit the model like before.
fit23.11 <- brm(
  data = my_data,
  family = cumulative(probit),
  Happiness ~ 1 + Assets_s,
  prior = c(prior(normal(0, 4), class = Intercept),
            prior(normal(0, 4), class = b)),
  iter = 3000, warmup = 1000, chains = 4, cores = 4,
  seed = 23,
  file = "fits/fit23.11")
Check the summary.
print(fit23.11)
Family: cumulative
Links: mu = probit; disc = identity
Formula: Happiness ~ 1 + Assets_s
Data: my_data (Number of observations: 6759)
Draws: 4 chains, each with iter = 3000; warmup = 1000; thin = 1;
total post-warmup draws = 8000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept[1] -2.03 0.03 -2.10 -1.96 1.00 5122 4419
Intercept[2] -1.17 0.02 -1.20 -1.13 1.00 7825 6240
Intercept[3] -0.15 0.02 -0.18 -0.12 1.00 8527 6266
Intercept[4] 1.48 0.02 1.44 1.53 1.00 9010 6816
Assets_s 0.15 0.01 0.12 0.17 1.00 8160 4947
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
disc 1.00 0.00 1.00 1.00 NA NA NA
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Extract the posterior draws and compare the brms parameterization of the thresholds with Kruschke's.
draws <- as_draws_df(fit23.11)

draws |>
  select(`b_Intercept[1]`:`b_Intercept[4]`, .draw) |>
  compare_thresholds(lb = 1.5, ub = 4.5)
Behold the marginal distribution of b_Assets_s
, our effect size for Assets
.
draws |>
  ggplot(aes(x = b_Assets_s)) +
stat_halfeye(point_interval = mode_hdi, .width = 0.95,
color = sl[8], fill = sl[4]) +
scale_y_continuous(NULL, breaks = NULL)
Here’s the fitted()
-oriented preparatory work for our version of the top panel of Figure 23.8.
# Define the range for the predictor
r <- range(my_data$Assets_s)

# Re-define the new data
nd <- tibble(Assets_s = seq(from = r[1], to = r[2], length.out = 5))

# Compute the fitted summaries
f <- fitted(fit23.11, newdata = nd)
# Rearrange the output
f <- rbind(
  f[, , 1],
  f[, , 2],
  f[, , 3],
  f[, , 4],
  f[, , 5]) |>
  # Wrangle
data.frame() |>
bind_cols(expand_grid(nd, Happiness = 1:5) |> arrange(Happiness, Assets_s))
# Examine
head(f)
Estimate Est.Error Q2.5 Q97.5 Assets_s Happiness
1 2.611798e-02 2.133404e-03 2.215960e-02 3.047004e-02 -0.5971712 1
2 3.159217e-03 7.290196e-04 1.953182e-03 4.774445e-03 4.7565107 1
3 2.339827e-04 1.329509e-04 6.859301e-05 5.735401e-04 10.1101925 1
4 1.151040e-05 1.296018e-05 1.039798e-06 4.714594e-05 15.4638743 1
5 4.162833e-07 8.774536e-07 6.685249e-09 2.528812e-06 20.8175561 1
6 1.147304e-01 4.207162e-03 1.065825e-01 1.230356e-01 -0.5971712 2
As with the analogous plot from Figure 23.7, we will not be superimposing linear fitted lines. The model is nonlinear on the scale of the data, and I don't want to confuse readers by pretending otherwise.
f |>
  select(-Est.Error) |>
mutate_at(.vars = vars(Estimate:Q97.5), .funs = \(x) x * 2.5) |>
ggplot() +
geom_vline(xintercept = seq(from = r[1], to = r[2], length.out = 5),
color = sl[2]) +
geom_segment(aes(x = Assets_s, xend = Assets_s - Estimate,
y = Happiness + 0.1, yend = Happiness + 0.1),
color = sl[4], linewidth = 8) +
geom_segment(aes(x = Assets_s - Q2.5, xend = Assets_s - Q97.5,
y = Happiness + 0.2, yend = Happiness + 0.2),
color = sl[7], linewidth = 1) +
geom_point(data = my_data,
aes(x = Assets_s, y = Happiness),
color = sl[9], shape = 1, size = 2) +
scale_y_continuous("Happiness", breaks = 1:5, limits = c(0.5, 5.5)) +
coord_cartesian(xlim = c(-4, 24))
Now here's the corresponding model if we treat Happiness as metric. Unlike our method for the corresponding model from Figure 23.7, fit23.10, we will use the standardized version of the predictor, Assets_s. The unstandardized values for Happiness and Assets are on vastly different scales, which can be difficult for HMC with broad priors of the type Kruschke often uses. Standardizing the predictor helps.
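If you'd like to see the mismatch for yourself, here's a quick diagnostic (output omitted; the point is the two raw variables differ by several orders of magnitude):

# Compare the raw scales of the criterion and the predictor
my_data |>
  summarise(sd_Happiness = sd(Happiness),
            sd_Assets    = sd(Assets))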
sd_y <- sd(my_data$Happiness)
stanvars <- stanvar(sd_y, name = "sd_y")

fit23.12 <- brm(
  data = my_data,
  family = gaussian,
  Happiness ~ 1 + Assets_s,
  prior = c(prior(normal(3.5, 5), class = Intercept),
            prior(normal(0, 5), class = b),
            prior(normal(0, sd_y), class = sigma)),
  chains = 4, cores = 4,
  stanvars = stanvars,
  seed = 23,
  file = "fits/fit23.12")
Here are the summaries for the marginal posteriors.
posterior_summary(fit23.12)[1:3, ] |> round(digits = 6)
Estimate Est.Error Q2.5 Q97.5
b_Intercept 3.484393 0.010273 3.464044 3.504634
b_Assets_s 0.115658 0.010255 0.095015 0.136033
sigma 0.847120 0.007128 0.833300 0.860943
They're just a bit different from those produced by OLS. Here's what the fitted lines from that model look like superimposed on the data, presuming both variables are metric.
set.seed(23)
conditional_effects(fit23.12,
ndraws = 100,
spaghetti = TRUE) |>
plot(points = TRUE,
line_args = list(size = 0),
point_args = list(alpha = 1/3, color = sl[9]),
spaghetti_args = list(colour = alpha(sl[6], 0.2)))
23.4.3 Example: Movies–They don’t make ’em like they used to.
For this section, we’ll load the Moore (2006) Movies.csv
data.
my_data <- read_csv("data.R/Movies.csv")
glimpse(my_data)
Rows: 100
Columns: 6
$ Title <chr> "A_Ticklish_Affair", "Action_in_the_North_Atlantic", "And_…
$ Year <dbl> 1963, 1943, 1984, 1978, 1931, 1930, 1950, 1989, 1940, 1947…
$ Length <dbl> 89, 127, 138, 97, 77, 69, 93, 119, 70, 69, 101, 62, 97, 85…
$ Cast <dbl> 5, 7, 7, 5, 6, 8, 5, 8, 9, 9, 9, 6, 10, 10, 9, 6, 10, 6, 1…
$ Rating <dbl> 2.0, 3.0, 3.0, 3.0, 2.5, 2.5, 3.0, 2.5, 2.5, 2.0, 3.0, 2.0…
$ Description <dbl> 7, 9, 15, 11, 7, 10, 8, 15, 8, 8, 11, 10, 12, 13, 9, 7, 10…
In footnote #5 at the bottom of page 693, Kruschke explained that whereas the original Ratings
data ranged from 1.0
to 4.0
in half-unit increments, he recoded them to range from 1
to 7
. Here we recode Ratings
in the same way using dplyr::case_match()
. While we’re at it, we’ll make standardized versions of the predictors, too.
my_data <- my_data |>
  mutate(Rating = case_match(Rating,
1.0 ~ 1,
1.5 ~ 2,
2.0 ~ 3,
2.5 ~ 4,
3.0 ~ 5,
3.5 ~ 6,
4.0 ~ 7),
Year_s = (Year - mean(Year)) / sd(Year),
Length_s = (Length - mean(Length)) / sd(Length))
Here’s a scatter plot of the data, with points colored by Rating
.
my_data |>
  mutate(Rating = factor(Rating)) |>
ggplot(aes(x = Year, y = Length, label = Rating)) +
geom_point(aes(color = Rating),
size = 3) +
geom_text(size = 3) +
scale_color_scico_d(palette = "lajolla", begin = 0.35, end = 0.90, direction = -1)
Fitting the multivariable ordered-probit model with brms is about as simple as fitting any other multivariable model. Just tack on predictors with the +
operator.
fit23.13 <- brm(
  data = my_data,
  family = cumulative(probit),
  Rating ~ 1 + Year_s + Length_s,
  prior = c(prior(normal(0, 4), class = Intercept),
            prior(normal(0, 4), class = b)),
  iter = 3000, warmup = 1000, chains = 4, cores = 4,
  seed = 23,
  file = "fits/fit23.13")
Check the model summary.
print(fit23.13)
Family: cumulative
Links: mu = probit; disc = identity
Formula: Rating ~ 1 + Year_s + Length_s
Data: my_data (Number of observations: 100)
Draws: 4 chains, each with iter = 3000; warmup = 1000; thin = 1;
total post-warmup draws = 8000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept[1] -1.69 0.20 -2.11 -1.30 1.00 6014 5318
Intercept[2] -0.91 0.15 -1.22 -0.62 1.00 8443 5737
Intercept[3] -0.22 0.13 -0.48 0.04 1.00 8669 7066
Intercept[4] 0.61 0.14 0.34 0.88 1.00 8518 7047
Intercept[5] 1.69 0.20 1.31 2.10 1.00 9009 6885
Intercept[6] 2.59 0.37 1.95 3.39 1.00 9103 6786
Year_s -0.49 0.13 -0.74 -0.24 1.00 6982 5936
Length_s 0.62 0.13 0.37 0.88 1.00 6975 6019
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
disc 1.00 0.00 1.00 1.00 NA NA NA
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Extract the posterior draws and compare the brms parameterization of the thresholds with Kruschke's.
draws <- as_draws_df(fit23.13)

draws |>
  select(`b_Intercept[1]`:`b_Intercept[6]`, .draw) |>
  compare_thresholds(lb = 1.5, ub = 6.5)
To sate any curiosity, here are the correlations among the parameters.
# Wrangle
as_draws_df(fit23.13) |>
select(contains("b_Intercept"), b_Year_s, b_Length_s) |>
set_names(str_c("theta[", 1:6, "]"), str_c("beta[", 1:2, "]")) |>
# Plot
ggpairs(upper = list(continuous = my_upper),
diag = list(continuous = my_diag),
lower = list(continuous = my_lower),
labeller = label_parsed) +
ggtitle("These parameters have milder correlations")
Now focus on the marginal distribution of our two effect-size parameters.
draws |>
  pivot_longer(cols = ends_with("_s")) |>
mutate(name = factor(name, levels = c("b_Year_s", "b_Length_s"))) |>
ggplot(aes(x = value)) +
stat_halfeye(point_interval = mode_hdi, .width = 0.95,
color = sl[8], fill = sl[4], normalize = "panels") +
scale_y_continuous(NULL, breaks = NULL) +
xlab("effect size") +
facet_wrap(~ name, scales = "free")
Before we make the top panel from Figure 23.9, I’d like to wander a bit and look at something related. We’ll use fitted()
.
# Define the new data
nd <- crossing(Year_s = seq(from = -3, to = 3, by = 0.25),
               Length_s = seq(from = -3, to = 3, by = 0.25))
# Compute the `Response` probabilities
f <- fitted(fit23.13, newdata = nd)

# Rearrange the output
f <- rbind(
  f[, , 1],
  f[, , 2],
  f[, , 3],
  f[, , 4],
  f[, , 5],
  f[, , 6],
  f[, , 7]) |>
  # Wrangle
  data.frame() |>
  bind_cols(
    nd |>
      expand_grid(Rating = 1:7) |>
      arrange(Rating, Year_s))
# What did we do?
head(f)
Estimate Est.Error Q2.5 Q97.5 Year_s Length_s Rating
1 0.11652439 0.07893031 0.018596666 0.31399252 -3 -3.00 1
2 0.08997773 0.06485363 0.013122143 0.25771735 -3 -2.75 1
3 0.06826646 0.05244825 0.008879778 0.20764293 -3 -2.50 1
4 0.05092608 0.04181886 0.005860147 0.16105489 -3 -2.25 1
5 0.03738891 0.03293782 0.003661057 0.12565676 -3 -2.00 1
6 0.02704712 0.02568048 0.002199393 0.09533809 -3 -1.75 1
We just computed the response probabilities across the two-dimensional grid of the predictor values. Now plot using the posterior means.
f |>
  mutate(strip = str_c("Rating: ", Rating)) |>
ggplot(aes(x = Year_s, y = Length_s)) +
geom_raster(aes(fill = Estimate),
interpolate = TRUE) +
geom_text(data = my_data |> mutate(strip = str_c("Rating: ", Rating)),
aes(label = Rating),
color = "white", size = 2.5) +
scale_x_continuous(breaks = seq(from = -2, to = 2, by = 2),
expand = c(0, 0)) +
scale_y_continuous(breaks = seq(from = -2, to = 2, by = 2),
expand = c(0, 0)) +
scale_fill_scico(palette = 'lajolla', breaks = c(0, 0.5, 1),
labels = c(0, ".5", 1), limits = c(0, 1)) +
theme(legend.position = "bottom") +
facet_wrap(~ strip, nrow = 1)
This model didn't do a great job capturing the Rating probabilities. If you're curious, you'll find you can do a little better if you allow the two predictors to interact (i.e., add + Year_s:Length_s to the formula line). Even then, the model isn't great. I leave that as an exercise for the interested reader, though there's a sketch just below to get you started.
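Here's what that interaction model might look like. The fit23.13b name and the file path are mine, and I haven't fit or summarized this model:

# A sketch only: add an interaction between the standardized predictors
fit23.13b <- brm(
  data = my_data,
  family = cumulative(probit),
  Rating ~ 1 + Year_s + Length_s + Year_s:Length_s,
  prior = c(prior(normal(0, 4), class = Intercept),
            prior(normal(0, 4), class = b)),
  iter = 3000, warmup = 1000, chains = 4, cores = 4,
  seed = 23,
  file = "fits/fit23.13b")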
For this model, however, we will follow Kruschke and make a more faithful version of the top panel of Figure 23.9. We’ll need to wrangle our draws
data a bit to get things ready. Here’s the work.
draws <- draws |>
  # We just need the data from three steps in the HMC chain
  slice(1:3) |>
mutate(.draw = .draw |> as.factor()) |>
rename(b1 = b_Year_s,
b2 = b_Length_s) |>
# Because these are straight lines, two extreme x1-values are all we need
expand_grid(x1 = c(-10, 10)) |>
pivot_longer(cols = contains("b_Intercept"), names_to = "theta") |>
# Use Kruschke's Formula 23.5
mutate(x2 = (value / b2) + (-b1 / b2) * x1,
# This just renames our x variables for easy plotting
Year_s = x1,
Length_s = x2)
glimpse(draws)
Rows: 36
Columns: 20
$ b1 <dbl> -0.5650158, -0.5650158, -0.5650158, -0.5650158, -0.5650…
$ b2 <dbl> 0.6424351, 0.6424351, 0.6424351, 0.6424351, 0.6424351, …
$ disc <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ `Intercept[1]` <dbl> -1.650021, -1.650021, -1.650021, -1.650021, -1.650021, …
$ `Intercept[2]` <dbl> -0.8908118, -0.8908118, -0.8908118, -0.8908118, -0.8908…
$ `Intercept[3]` <dbl> -0.10894122, -0.10894122, -0.10894122, -0.10894122, -0.…
$ `Intercept[4]` <dbl> 0.4600140, 0.4600140, 0.4600140, 0.4600140, 0.4600140, …
$ `Intercept[5]` <dbl> 1.469098, 1.469098, 1.469098, 1.469098, 1.469098, 1.469…
$ `Intercept[6]` <dbl> 2.494688, 2.494688, 2.494688, 2.494688, 2.494688, 2.494…
$ lprior <dbl> -18.84353, -18.84353, -18.84353, -18.84353, -18.84353, …
$ lp__ <dbl> -181.6031, -181.6031, -181.6031, -181.6031, -181.6031, …
$ .chain <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ .iteration <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2…
$ .draw <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2…
$ x1 <dbl> -10, -10, -10, -10, -10, -10, 10, 10, 10, 10, 10, 10, -…
$ theta <chr> "b_Intercept[1]", "b_Intercept[2]", "b_Intercept[3]", "…
$ value <dbl> -1.6500213, -0.8908118, -0.1089412, 0.4600140, 1.469098…
$ x2 <dbl> -11.363295, -10.181527, -8.964485, -8.078862, -6.508144…
$ Year_s <dbl> -10, -10, -10, -10, -10, -10, 10, 10, 10, 10, 10, 10, -…
$ Length_s <dbl> -11.363295, -10.181527, -8.964485, -8.078862, -6.508144…
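The x2 line in the block above is Kruschke's Formula 23.5. In this notation, the threshold line for a given $\theta_k$ is the level contour where the latent linear predictor equals that threshold, solved for $x_2$:

$$\beta_1 x_1 + \beta_2 x_2 = \theta_k \quad \implies \quad x_2 = \frac{\theta_k}{\beta_2} - \frac{\beta_1}{\beta_2} x_1.$$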
Now just plot.
draws |>
  ggplot(aes(x = Year_s, y = Length_s)) +
geom_line(aes(group = interaction(.draw, theta), color = theta, linetype = .draw)) +
geom_text(data = my_data,
aes(label = Rating)) +
scale_color_scico_d(expression(theta), palette = "lajolla", end = 0.85, direction = -1, labels = 1:6) +
coord_cartesian(xlim = range(my_data$Year_s),
ylim = range(my_data$Length_s))
It might be easier to see Kruschke’s main point if we facet by .draw
.
draws |>
  ggplot(aes(x = Year_s, y = Length_s)) +
geom_line(aes(group = interaction(.draw, theta), color = theta, linetype = .draw)) +
geom_text(data = my_data,
aes(label = Rating)) +
scale_color_scico_d(expression(theta), palette = "lajolla", end = 0.85, direction = -1, labels = 1:6) +
coord_cartesian(xlim = range(my_data$Year_s),
ylim = range(my_data$Length_s)) +
theme(legend.direction = "horizontal",
legend.position = c(0.75, 0.25)) +
facet_wrap(~ .draw, ncol = 2)
Both our versions of the plot show what Kruschke pointed out in the text:
Threshold lines from the same step in the chain must be parallel because the regression coefficients are constant at that step, but are different at another step. The threshold lines in Figure 23.9 are level contours on the underlying metric planar surface, and the lines reveal that the ratings increase toward the top left, that is, as Year decreases and Length increases. (p. 693)
Before we move on to the next section, what these diagonal two-dimensional threshold lines also hint at is that when we use two predictors to describe ordinal data as having been produced by an underlying unit Gaussian distribution, that underlying distribution is actually bivariate Gaussian. Here we'll use fitted() one more time to depict that bivariate-Gaussian distribution with a little geom_raster().
# Define the new data
nd <- crossing(Year_s = seq(from = -5, to = 5, by = 0.1),
               Length_s = seq(from = -5, to = 5, by = 0.1))
fitted(fit23.13,
newdata = nd,
# This will yield z-scores
scale = "linear") |>
data.frame() |>
bind_cols(nd) |>
# Convert the z-scores to density values
mutate(density = dnorm(Estimate, 0, 1)) |>
ggplot(aes(x = Year_s, y = Length_s)) +
geom_raster(aes(fill = density),
interpolate = TRUE) +
geom_text(data = my_data,
aes(label = Rating),
size = 2.5) +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0)) +
scale_fill_scico("density", palette = 'lajolla',
limits = c(0, 0.4))
As with many of our previous approaches with geom_raster()
, this plot is based on the posterior means in each cell and, therefore, does a poor job depicting the uncertainty in the posterior distribution.
23.4.4 Why are some thresholds outside the data?
Now load Kruschke’s simulated data.
my_data <- read_csv("data.R/OrdinalProbitData-Movies.csv")
glimpse(my_data)
Rows: 400
Columns: 3
$ Year <dbl> 1959, 1946, 1964, 1938, 1946, 1971, 1957, 1970, 1968, 1962, 192…
$ Length <dbl> 88, 117, 130, 85, 111, 105, 93, 119, 78, 77, 138, 107, 70, 60, …
$ Rating <dbl> 4, 5, 5, 4, 5, 4, 3, 4, 3, 3, 7, 4, 2, 2, 5, 5, 3, 4, 3, 2, 2, …
These data imitate the movie ratings, but with two key differences. First and foremost, the artificial data have much smaller noise, with
as opposed to in the real data. Second, the artificial data have points that span the entire range of both predictors, unlike the real data which had points mostly in the central region. (p. 695)
Like with the real movie data, we’ll inspect these data with a colored scatter plot.
my_data |>
  mutate(Rating = factor(Rating)) |>
ggplot(aes(x = Year, y = Length, label = Rating)) +
geom_point(aes(color = Rating),
size = 3) +
geom_text(size = 3) +
scale_color_scico_d(palette = "lajolla", begin = 0.35, end = 0.9, direction = -1)
Unlike in the last section, there appears to be a clear trend in Kruschke's simulated data. Kruschke's simulated critic liked his movies old and long. Time to standardize the predictors.
my_data <- my_data |>
  mutate(Year_s = (Year - mean(Year)) / sd(Year),
         Length_s = (Length - mean(Length)) / sd(Length))
Since the model formula is unchanged, we can simply pass the new data to update().
fit23.14 <- update(
  fit23.13,
  newdata = my_data,
  iter = 3000, warmup = 1000, chains = 4, cores = 4,
  seed = 23,
  file = "fits/fit23.14")
Check the model summary.
print(fit23.14)
Family: cumulative
Links: mu = probit; disc = identity
Formula: Rating ~ 1 + Year_s + Length_s
Data: my_data (Number of observations: 400)
Draws: 4 chains, each with iter = 3000; warmup = 1000; thin = 1;
total post-warmup draws = 8000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept[1] -9.77 0.67 -11.10 -8.51 1.00 1930 2823
Intercept[2] -5.96 0.43 -6.82 -5.14 1.01 2085 3268
Intercept[3] -2.25 0.21 -2.67 -1.84 1.00 3297 4854
Intercept[4] 2.43 0.21 2.02 2.85 1.00 3086 4399
Intercept[5] 7.79 0.55 6.73 8.89 1.00 2045 2886
Intercept[6] 10.88 0.77 9.41 12.43 1.00 2121 3422
Year_s -2.80 0.20 -3.20 -2.42 1.00 2018 3068
Length_s 4.73 0.32 4.12 5.37 1.00 1912 2866
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
disc 1.00 0.00 1.00 1.00 NA NA NA
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Extract the posterior draws and use compare_thresholds()
to make our expanded version of the bottom panel of Figure 23.10.
draws <- as_draws_df(fit23.14)

draws |>
  select(`b_Intercept[1]`:`b_Intercept[6]`, .draw) |>
  compare_thresholds(lb = 1.5, ub = 6.5)
Make the marginal distribution of our two effect-size parameters.
draws |>
  pivot_longer(cols = ends_with("_s")) |>
mutate(name = factor(name, levels = c("b_Year_s", "b_Length_s"))) |>
ggplot(aes(x = value)) +
stat_halfeye(point_interval = mode_hdi, .width = 0.95,
color = sl[8], fill = sl[4], normalize = "panels") +
scale_y_continuous(NULL, breaks = NULL) +
xlab("effect size") +
facet_wrap(~ name, scales = "free")
Make the top panel for Figure 23.10, just like we did for its analogue in Figure 23.9.
as_draws_df(fit23.14) |>
slice(1:3) |>
mutate(.draw = .draw |> as.factor()) |>
rename(b1 = b_Year_s,
b2 = b_Length_s) |>
expand_grid(x1 = c(-10, 10)) |>
pivot_longer(cols = contains("b_Intercept"), names_to = "theta") |>
mutate(x2 = (value / b2) + (-b1 / b2) * x1,
Year_s = x1,
Length_s = x2) |>
ggplot(aes(x = Year_s, y = Length_s)) +
geom_line(aes(group = interaction(.draw, theta), color = theta, linetype = .draw)) +
geom_text(data = my_data,
aes(label = Rating)) +
scale_color_scico_d(expression(theta), palette = "lajolla", end = 0.85, direction = -1, labels = 1:6) +
coord_cartesian(xlim = range(my_data$Year_s),
ylim = range(my_data$Length_s))
Those are some tight thresholds. They “very clearly cleave parallel regions of distinct ordinal values. The example demonstrates that the threshold lines do make intuitive sense when there is low noise and a broad range of data” (p. 695, emphasis in the original).
With our various bonus plots, we've been anticipating Figure 23.11 for some time now. The thresholds from fit23.14
result in beautifully nonlinear probability curves for the Rating
levels. Take a quick look with conditional_effects()
.
ce <- conditional_effects(fit23.14, categorical = TRUE)
plot(ce, plot = FALSE)[[1]] +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0), limits = c(0, 1)) +
scale_fill_scico_d(palette = "lajolla", end = 0.85, direction = -1) +
scale_color_scico_d(palette = "lajolla", end = 0.85, direction = -1)
plot(ce, plot = FALSE)[[2]] +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0), limits = c(0, 1)) +
scale_fill_scico_d(palette = "lajolla", end = 0.85, direction = -1) +
scale_color_scico_d(palette = "lajolla", end = 0.85, direction = -1)
Because the model had two predictors, we got two plots. What brms::conditional_effects() called Probability is the probability of each of the seven Rating levels, conditional on Year_s or Length_s. Whereas Kruschke marked off his different outcomes by line styles, ours were marked by color. Since we don't have the data Kruschke used to make Figure 23.11, we won't be able to reproduce it exactly. However, you'll note that our plot for Length_s corresponded nicely with his subplot on the top. By setting effects = "Length_s", we can use conditional_effects() to make a similar plot based on our model of the real movie data, fit23.13.
ce <- conditional_effects(fit23.13,
                          categorical = TRUE,
                          effects = "Length_s")
plot(ce, plot = FALSE)[[1]] +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = expansion(mult = c(0, 0.05)), limits = c(0, NA)) +
scale_fill_scico_d(palette = "lajolla", end = 0.85, direction = -1) +
scale_color_scico_d(palette = "lajolla", end = 0.85, direction = -1)
“You can see that each outcome has maximum probability within its corresponding interval, but there is considerable smearing of outcomes into adjacent intervals” (p. 695).
Finishing off, Kruschke clarified that his discussion in the text referred to $\sigma$
as “noise” merely for linguistic ease. Calling the outcomes “noisy” does not mean the underlying generator of the outcomes is inherently wildly random. The “noise” is merely variation in the outcome that cannot be accounted for by the particular model we have chosen with the particular predictors we have chosen. A different model and/or different predictors might account for the outcomes well with little residual noise. In this sense, the noise is in the model, not in the data. (p. 698, emphasis in the original)
Through this lens, noisy-looking data are a symptom of weak theory and/or poor data-collection methods.
23.5 Posterior prediction
The cumulative-normal model makes posterior predictions for the probabilities of each level of the ordinal criterion. To practice, let's revisit our happiness model, fit23.11. Here's that model's formula.
fit23.11$formula
Happiness ~ 1 + Assets_s
With brms, you can get this information with fitted(). Let's say we wanted to focus on the response probabilities for a case whose standardized assets are one standard deviation below the mean (i.e., Assets_s = -1). Here's what we get.
fitted(fit23.11,
newdata = tibble(Assets_s = -1))
, , P(Y = 1)
Estimate Est.Error Q2.5 Q97.5
[1,] 0.02997066 0.002483131 0.02533447 0.03505768
, , P(Y = 2)
Estimate Est.Error Q2.5 Q97.5
[1,] 0.1247085 0.004814203 0.1154987 0.1342713
, , P(Y = 3)
Estimate Est.Error Q2.5 Q97.5
[1,] 0.3435083 0.006438852 0.3306628 0.3562814
, , P(Y = 4)
Estimate Est.Error Q2.5 Q97.5
[1,] 0.450502 0.007042199 0.436879 0.4643867
, , P(Y = 5)
Estimate Est.Error Q2.5 Q97.5
[1,] 0.05131061 0.002930612 0.04583096 0.05721674
As is typical of brms, those probability summaries are in terms of the posterior mean and percentile-based 95% intervals. If you're like Kruschke and prefer posterior modes and HDIs, you'll need to set summary = FALSE and wrangle a bit.
f <- fitted(fit23.11,
            newdata = tibble(Assets_s = -1),
            summary = FALSE)

cbind(
  f[, , 1],
  f[, , 2],
  f[, , 3],
  f[, , 4],
  f[, , 5]) |>
  data.frame() |>
set_names(str_c("p(Happiness = ", 1:5, " | Assets_s = -1)")) |>
pivot_longer(cols = everything()) |>
group_by(name) |>
mode_hdi(value) |>
mutate_if(is.double, .funs = round, digits = 4)
# A tibble: 5 × 7
name value .lower .upper .width .point .interval
<chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 p(Happiness = 1 | Assets_s = -1) 0.0294 0.0253 0.035 0.95 mode hdi
2 p(Happiness = 2 | Assets_s = -1) 0.125 0.116 0.134 0.95 mode hdi
3 p(Happiness = 3 | Assets_s = -1) 0.344 0.330 0.356 0.95 mode hdi
4 p(Happiness = 4 | Assets_s = -1) 0.450 0.437 0.464 0.95 mode hdi
5 p(Happiness = 5 | Assets_s = -1) 0.0513 0.0457 0.0571 0.95 mode hdi
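As a final aside, fitted() returns the probabilities of the ordinal levels. If you instead want posterior draws of the ordinal outcomes themselves, a little predict() sketch might look like this:

set.seed(23)

# One ordinal category per posterior draw; tabulating the draws
# approximates the same probabilities as above
predict(fit23.11,
        newdata = tibble(Assets_s = -1),
        summary = FALSE) |>
  table() |>
  proportions()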
23.6 Generalizations and extensions
In this section, Kruschke mentioned extensions of this class of models, such as embedding the thresholded cumulative-normal likelihood within the kinds of more elaborate model structures covered in earlier chapters (e.g., hierarchical models).
In addition, there are other likelihoods one might use to model ordinal data using brms. Your first stop should be Bürkner and Vuorre's (2019) Ordinal regression models in psychology: A Tutorial, where, in addition to the cumulative normal model, they cover the sequential and adjacent category models. You might also check out Chapter 11 of my (2023) ebook recoding McElreath's Statistical rethinking, wherein I show how one might use the logit link (i.e., family = cumulative(logit)
) to fit ordered-categorical models with brms.
If you’d just like more practice with the cumulative probit model, you might check out my blog post called Notes on the Bayesian cumulative probit.
Session info
sessionInfo()
R version 4.4.3 (2025-02-28)
Platform: aarch64-apple-darwin20
Running under: macOS Ventura 13.4
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/Chicago
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] GGally_2.2.1 tidybayes_3.0.7 brms_2.22.0 Rcpp_1.0.14
[5] patchwork_1.3.0 scico_1.5.0 lubridate_1.9.3 forcats_1.0.0
[9] stringr_1.5.1 dplyr_1.1.4 purrr_1.0.4 readr_2.1.5
[13] tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.1 tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] tidyselect_1.2.1 svUnit_1.0.6 farver_2.1.2
[4] loo_2.8.0 fastmap_1.1.1 TH.data_1.1-2
[7] tensorA_0.36.2.1 digest_0.6.37 estimability_1.5.1
[10] timechange_0.3.0 lifecycle_1.0.4 StanHeaders_2.32.10
[13] survival_3.8-3 magrittr_2.0.3 posterior_1.6.0
[16] compiler_4.4.3 rlang_1.1.5 tools_4.4.3
[19] utf8_1.2.4 yaml_2.3.8 knitr_1.49
[22] labeling_0.4.3 bridgesampling_1.1-2 htmlwidgets_1.6.4
[25] curl_6.0.1 pkgbuild_1.4.4 bit_4.0.5
[28] plyr_1.8.9 RColorBrewer_1.1-3 multcomp_1.4-26
[31] abind_1.4-8 withr_3.0.2 stats4_4.4.3
[34] grid_4.4.3 inline_0.3.19 xtable_1.8-4
[37] colorspace_2.1-1 emmeans_1.10.3 scales_1.3.0
[40] MASS_7.3-64 cli_3.6.4 mvtnorm_1.2-5
[43] rmarkdown_2.29 crayon_1.5.3 generics_0.1.3
[46] RcppParallel_5.1.7 rstudioapi_0.16.0 reshape2_1.4.4
[49] tzdb_0.4.0 rstan_2.32.6 splines_4.4.3
[52] bayesplot_1.11.1 parallel_4.4.3 matrixStats_1.4.1
[55] vctrs_0.6.5 V8_4.4.2 Matrix_1.7-2
[58] sandwich_3.1-1 jsonlite_1.8.9 hms_1.1.3
[61] arrayhelpers_1.1-0 bit64_4.0.5 ggdist_3.3.2
[64] glue_1.8.0 ggstats_0.6.0 codetools_0.2-20
[67] distributional_0.5.0 stringi_1.8.4 gtable_0.3.6
[70] QuickJSR_1.1.3 munsell_0.5.1 pillar_1.10.2
[73] htmltools_0.5.8.1 Brobdingnag_1.2-9 R6_2.6.1
[76] vroom_1.6.5 evaluate_1.0.1 lattice_0.22-6
[79] backports_1.5.0 rstantools_2.4.0 gridExtra_2.3
[82] coda_0.19-4.1 nlme_3.1-167 checkmate_2.3.2
[85] xfun_0.49 zoo_1.8-12 pkgconfig_2.0.3