23 Ordinal Predicted Variable

This chapter considers data that have an ordinal predicted variable. For example, we might want to predict people’s happiness ratings on a 1-to-7 scale as a function of their total financial assets. Or we might want to predict ratings of movies as a function of the year they were made. One traditional treatment of this sort of data structure is called ordinal or ordered probit regression. We will consider a Bayesian approach to this model. As usual, in Bayesian software, it is easy to generalize the traditional model so it is robust to outliers, allows different variances within levels of a nominal predictor, or has hierarchical structure to share information across levels or factors as appropriate.

In the context of the generalized linear model (GLM) introduced in Chapter 15, this chapter’s situation involves an inverse-link function that is a thresholded cumulative normal with a categorical distribution for describing noise in the data, as indicated in the fourth row of Table 15.2 (p. 443). For a reminder of how this chapter’s combination of predicted and predictor variables relates to other combinations, see Table 15.3 (p. 444). (Kruschke, 2015, p. 671, emphasis in the original)

Load the primary packages.

library(tidyverse)
library(scico)
library(patchwork)
library(brms)
library(tidybayes)
library(GGally)
We might follow Kruschke’s advice and rewind a little before tackling this chapter. The cumulative normal function is denoted $\Phi(z)$, where $z$ is a standardized score. To get reacquainted, we’ll compute both the normal density $p(z)$ and the cumulative normal $\Phi(z)$ over a range of $z$ values.

d <- tibble(z = seq(from = -3, to = 3, by = 0.1)) |>
  # Add the density values
  mutate(`p(z)` = dnorm(z, mean = 0, sd = 1),
         # Add the CDF values
         `Phi(z)` = pnorm(z, mean = 0, sd = 1))

head(d)
# A tibble: 6 × 3
z `p(z)` `Phi(z)`
<dbl> <dbl> <dbl>
1 -3 0.00443 0.00135
2 -2.9 0.00595 0.00187
3 -2.8 0.00792 0.00256
4 -2.7 0.0104 0.00347
5 -2.6 0.0136 0.00466
6 -2.5 0.0175 0.00621
It’s time to talk color and theme. For this chapter, we’ll take our color palette from the scico package (Pedersen & Crameri, 2021), which provides 17 perceptually-uniform and colorblind-safe palettes based on the work of Fabio Crameri. Our palette of interest will be "lajolla".
sl <- scico(palette = "lajolla", n = 9, direction = -1)

scales::show_col(sl)
Our overall plot theme will be based on ggplot2::theme_linedraw()
with just a few adjustments.
theme_set(
theme_linedraw() +
theme(panel.grid = element_blank(),
strip.background = element_rect(color = sl[9], fill = sl[9]),
strip.text = element_text(color = sl[1]))
)
Now plot!
p1 <- d |>
  ggplot(aes(x = z, y = `p(z)`)) +
  geom_area(aes(fill = z <= 1),
            show.legend = FALSE) +
  geom_line(color = sl[3], linewidth = 1) +
  scale_x_continuous(NULL, breaks = NULL) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.05))) +
  scale_fill_manual(values = c("transparent", sl[2])) +
  ylab(expression(p(italic(z)))) +
  facet_grid("Normal Density" ~ .)

p2 <- d |>
  ggplot(aes(x = z, y = `Phi(z)`)) +
  geom_area(aes(fill = z <= 1),
            show.legend = FALSE) +
  geom_line(color = sl[3], linewidth = 1) +
  scale_x_continuous(breaks = -2:2) +
  scale_y_continuous(expand = expansion(mult = c(0, 0)), limits = 0:1) +
  scale_fill_manual(values = c("transparent", sl[2])) +
  ylab(expression(Phi(italic(z)))) +
  facet_grid("Cumulative Normal" ~ .)

# Combine, adjust, and display
p1 / p2 &
  coord_cartesian(xlim = c(-2.5, 2.5))
For both plots, z is in a standardized metric (i.e., a z-score, with $\mu = 0$ and $\sigma = 1$), though we could shift and rescale the curves with other values of $\mu$ and $\sigma$. The inverse of the cumulative normal, $\Phi^{-1}$, is called the probit function, and it will serve as our link function throughout this chapter.
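In R terms, pnorm() is $\Phi$ and qnorm() is its inverse, the probit, so we can confirm the one undoes the other.

# `pnorm()` is the cumulative normal; `qnorm()` is its inverse, the probit
pnorm(1)         # about 0.841
qnorm(pnorm(1))  # recovers 1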
23.1 Modeling ordinal data with an underlying metric variable
You can imagine that the distribution of ordinal values might not resemble a normal distribution, even though the underlying metric values are normally distributed. Figure 23.1 shows some examples of ordinal outcome probabilities generated from an underlying normal distribution. The horizontal axis is the underlying continuous metric value. Thresholds are plotted as vertical dashed lines, labeled $\theta_1$ through $\theta_6$. In all examples, the ordinal scale has 7 levels, and hence, there are 6 thresholds. The lowest threshold is set at $\theta_1 = 1.5$ (to separate outcomes 1 and 2), and the highest threshold is set at $\theta_6 = 6.5$ (to separate outcomes 6 and 7). The normal curve in each panel shows the distribution of underlying continuous values. What differs across panels are the settings of means, standard deviations, and remaining thresholds. (p. 672)
The four panels in Figure 23.1 require several steps. We’ll start with the top panel and build from there. Here is how we might make the values necessary for the density curve.
# Set the population Gaussian values
mu <- 4; sigma <- 1.5

# How many draws would you like?
n <- 1e5

# Simulate the primary data
set.seed(23)
d <- data.frame(x = rnorm(n = n, mean = mu, sd = sigma))

# What?
glimpse(d)
Rows: 100,000
Columns: 1
$ x <dbl> 4.289819, 3.347977, 5.369901, 6.690082, 5.494908, 5.661236, 3.582871…
These x values are the draws from the underlying continuous distribution, $\operatorname{Normal}(4, 1.5)$. Next we use the base-R cut() function to divide the continuous values into bins. We can define the cut-points for those bins (i.e., the thresholds) with the breaks argument. Here we define our breaks as a vector saved as thresholds_1, and take an initial look.
thresholds_1 <- 1:6 + 0.5

d |>
  mutate(y = cut(x, breaks = c(-Inf, thresholds_1, Inf))) |>
  head()
x y
1 4.289819 (3.5,4.5]
2 3.347977 (2.5,3.5]
3 5.369901 (4.5,5.5]
4 6.690082 (6.5, Inf]
5 5.494908 (4.5,5.5]
6 5.661236 (5.5,6.5]
By default, the cut()
function saved the bins with labels based on their upper and lower boundaries. To help define those boundaries, we nested the thresholds_1
values in between negative and positive infinity, which are the lower and upper bounds for the Gaussian range. However, it would be better if we labeled the bins with the integers 1
through 7
, as in the text, which we can do with the labels
argument. Though this will give the correct values, they will still be saved as a factor. As we will want these as integer values for some of the operations to come, we will convert those values with as.integer()
.
d <- d |>
  mutate(y = cut(x, breaks = c(-Inf, thresholds_1, Inf), labels = 1:7) |>
           as.integer())
head(d)
x y
1 4.289819 4
2 3.347977 3
3 5.369901 5
4 6.690082 7
5 5.494908 5
6 5.661236 6
Here we compute and save the sample mean and standard deviation for y
. Note how they match with the sample values displayed in the top panel of Figure 23.1.
(m <- mean(d$y) |> round(digits = 2))

[1] 4

(s <- sd(d$y) |> round(digits = 2))

[1] 1.47
As an intermediary step, we will want to compute and save the maximum value for the Gaussian likelihood function, given our mu
and sigma
values. For the Gaussian, the maximum value will be when x
is at the mean. For later reference, we save that value as ml
.
(ml <- dnorm(mu, mean = mu, sd = sigma))
[1] 0.2659615
Next we can use count()
and some follow-up mutate()
lines to determine the proportion and percentage of each of the seven possible values in the y
vector.
d <- d |>
  count(y) |>
  mutate(proportion = n / sum(n)) |>
  mutate(percent = (100 * proportion) |> round(digits = 0))

print(d)
y n proportion percent
1 1 4763 0.04763 5
2 2 10844 0.10844 11
3 3 21174 0.21174 21
4 4 26256 0.26256 26
5 5 21233 0.21233 21
6 6 10951 0.10951 11
7 7 4779 0.04779 5
Those proportion
and percent
values will be important for the bars displayed in the figure. When you look at the bars, you’ll see they are offset to the right of their thresholds. Thus we will save a vector of those desired x
values. In the code below, we add in those values, and further amend the y
columns to make the labels for the percentages and the y
values.
x_1 <- 1:7

d <- d |>
  mutate(label_p = str_c(percent, "%"),
         label_y = str_c("italic(y)==", y),
         density = ml * (proportion / max(proportion)),
         x = x_1)

print(d)
y n proportion percent label_p label_y density x
1 1 4763 0.04763 5 5% italic(y)==1 0.04824706 1
2 2 10844 0.10844 11 11% italic(y)==2 0.10984486 2
3 3 21174 0.21174 21 21% italic(y)==3 0.21448314 3
4 4 26256 0.26256 26 26% italic(y)==4 0.26596152 4
5 5 21233 0.21233 21 21% italic(y)==5 0.21508078 5
6 6 10951 0.10951 11 11% italic(y)==6 0.11092873 6
7 7 4779 0.04779 5 5% italic(y)==7 0.04840913 7
In this special case, the y
values are the same as the x
values; this will not be the case for some of the other panels. Also notice how we used the ml
value from above to scale the density
values for the bars. As a final preparatory step, we now define a supplemental data frame for the Gaussian density in the background.
# Set boundaries for `x`
xmin <- -0.5; xmax <- 8.5

# Make the data frame for the underlying normal
d_normal <- data.frame(x = seq(from = xmin, to = xmax, length.out = 101)) |>
  mutate(density = dnorm(x, mean = mu, sd = sigma))

glimpse(d_normal)
Rows: 101
Columns: 2
$ x <dbl> -0.50, -0.41, -0.32, -0.23, -0.14, -0.05, 0.04, 0.13, 0.22, 0.…
$ density <dbl> 0.002954566, 0.003530896, 0.004204484, 0.004988582, 0.00589763…
We can now make a version of the top panel like so.
d |>
  ggplot(aes(x = x, y = density)) +
geom_area(data = d_normal,
fill = sl[2]) +
geom_col(alpha = 0.85, fill = sl[5], width = 0.45,
color = alpha(sl[5], 0.85)) +
geom_text(aes(label = label_p),
size = 3.0, vjust = -2) +
geom_text(aes(label = label_y),
parse = TRUE, size = 3.0, vjust = -0.25) +
geom_vline(xintercept = thresholds_1, color = sl[7], linetype = 3) +
annotate(geom = "text",
x = xmax - 0.05, y = ml, label = str_c("mu==", mu, "*','*~sigma==", sigma),
hjust = 1, parse = TRUE, size = 3.2, vjust = -2.75) +
annotate(geom = "text",
x = xmax - 0.05, y = ml, label = str_c("italic(M[y])==", m, "*','*~italic(S[y])==", s),
hjust = 1, parse = TRUE, size = 3.2, vjust = -1) +
scale_x_continuous(NULL, breaks = thresholds_1, expand = c(0, 0), labels = thresholds_1,
sec.axis = dup_axis(labels = parse(text = str_c("theta[", 1:6, "]")))) +
scale_y_continuous(NULL, breaks = NULL, expand = expansion(mult = c(0, 0.41)))
Sadly, this kind of figure doesn’t lend itself well to a faceted plot. The main difficulty is that the thresholds (and thus their primary and secondary x-axis breaks) vary across panels, and I’m not aware of a simple way to achieve that with scale_x_continuous(). So we will instead wrap the code above into a custom function called plot_bar_den(), which we’ll use to make the four panels one at a time.
plot_bar_den <- function(mu, sigma, thresholds, x, digits = 1, n = 1e5, seed = 23) {
  
  # Simulate the primary data
  set.seed(seed)
  d <- data.frame(x = rnorm(n = n, mean = mu, sd = sigma)) |>
    mutate(y = cut(x, breaks = c(-Inf, thresholds, Inf), labels = 1:7) |>
             as.integer())
  
  # Compute sample statistics for `y`
  m <- mean(d$y) |> round(digits = digits)
  s <- sd(d$y) |> round(digits = digits)
  
  # Compute the maximum of the likelihood function
  ml <- dnorm(mu, mean = mu, sd = sigma)
  
  # Set boundaries for `x`
  xmin <- -0.5; xmax <- 8.5
  
  # Make the data frame for the underlying normal
  d_normal <- data.frame(x = seq(from = xmin, to = xmax, length.out = 101)) |>
    mutate(density = dnorm(x, mean = mu, sd = sigma))
  
  # Adjust the primary data
  d |>
    count(y) |>
    mutate(proportion = n / sum(n)) |>
    mutate(percent = (100 * proportion) |> round(digits = 0)) |>
    mutate(label_p = str_c(percent, "%"),
           label_y = str_c("italic(y)==", y),
           density = ml * (proportion / max(proportion)),
           x = x) |>
    
    # Plot
    ggplot(aes(x = x)) +
    geom_area(data = d_normal,
              aes(y = density),
              fill = sl[2]) +
    geom_col(aes(y = density),
             alpha = 0.85, fill = sl[5], width = 0.45,
             color = alpha(sl[5], 0.85)) +
    geom_text(aes(y = density, label = label_p),
              size = 3.0, vjust = -2) +
    geom_text(aes(y = density, label = label_y),
              parse = TRUE, size = 3.0, vjust = -0.25) +
    geom_vline(xintercept = thresholds, color = sl[7], linetype = 3) +
    annotate(geom = "text",
             x = xmax - 0.05, y = ml, label = str_c("mu==", mu, "*','*~sigma==", sigma),
             hjust = 1, parse = TRUE, size = 3.2, vjust = -2.75) +
    annotate(geom = "text",
             x = xmax - 0.05, y = ml, label = str_c("italic(M[y])==", m, "*','*~italic(S[y])==", s),
             hjust = 1, parse = TRUE, size = 3.2, vjust = -1) +
    scale_x_continuous(NULL, breaks = thresholds, expand = c(0, 0), labels = thresholds,
                       sec.axis = dup_axis(labels = parse(text = str_c("theta[", 1:6, "]")))) +
    scale_y_continuous(NULL, breaks = NULL, expand = expansion(mult = c(0, 0.41)))
}
Here’s how plot_bar_den()
works for the top panel.
plot_bar_den(mu = 4, sigma = 1.5, thresholds = thresholds_1, x = x_1)
The mu and sigma arguments take single numeric inputs for the desired Gaussian mean and standard deviation. The thresholds argument takes a vector of thresholds, the length of which is one less than the number of categories for y (i.e., six thresholds for the seven categories in all our use cases in this figure). The x argument takes a vector of x-axis positions for the bars. The remaining arguments digits, n, and seed all have sensible default values for our needs. Here’s how to use plot_bar_den() and patchwork syntax to make the full Figure 23.1.
# Save the threshold values for each panel
thresholds_1 <- 1:6 + 0.5
thresholds_2 <- thresholds_1
thresholds_3 <- c(1.5, 3.1, 3.7, 4.3, 4.9, 6.5)
thresholds_4 <- c(1.5, 2.25, 3, 5, 5.75, 6.5)

# Save the x-axis values for the bars in each panel
x_1 <- 1:7
x_2 <- x_1
x_3 <- c(1, 2.2, 3.4, 4, 4.6, 5.7, 7)
x_4 <- c(1, 1.875, 2.625, 4, 5.375, 6.125, 7)
# Combine and display
plot_bar_den(mu = 4, sigma = 1.5, thresholds = thresholds_1, x = x_1) /
plot_bar_den(mu = 1, sigma = 2.5, thresholds = thresholds_2, x = x_2) /
plot_bar_den(mu = 4, sigma = 1.0, thresholds = thresholds_3, x = x_3) /
plot_bar_den(mu = 4, sigma = 3.0, thresholds = thresholds_4, x = x_4)
Oh mamma. “The crucial concept in Figure 23.1 is that the probability of a particular ordinal outcome is the area under the normal curve between the thresholds of that outcome” (p. 672, emphasis in the original). In each of the subplots, we used six thresholds to discretize the continuous data into seven categories. More generally, we need $K - 1$ thresholds to discretize a continuous variable into $K$ ordered categories. Continuing with the text,

the idea is that we consider the cumulative area under the normal up to the high-side threshold, and subtract away the cumulative area under the normal up to the low-side threshold. Recall that the cumulative area under the standardized normal is denoted $\Phi(z)$, as was illustrated in Figure 15.8 [which we remade at the top of this chapter]. Thus, the area under the normal to the left of $\theta_k$ is $\Phi((\theta_k - \mu) / \sigma)$, and the area under the normal to the left of $\theta_{k-1}$ is $\Phi((\theta_{k-1} - \mu) / \sigma)$. Therefore, the area under the normal curve between the two thresholds, which is the probability of outcome $k$, is

$$p(y = k) = \Phi\!\left(\frac{\theta_k - \mu}{\sigma}\right) - \Phi\!\left(\frac{\theta_{k-1} - \mu}{\sigma}\right).$$

[This equation] applies even to the least and greatest ordinal values if we append two “virtual” thresholds at $-\infty$ and $+\infty$… Thus, a normally distributed underlying metric value can yield a clearly non-normal distribution of discrete ordinal values. This result does not imply that the ordinal values can be treated as if they were themselves metric and normally distributed; in fact it implies the opposite: We might be able to model a distribution of ordinal values as consecutive intervals of a normal distribution on an underlying metric scale with appropriately positioned thresholds. (pp. 674–675)
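To make the equation concrete, here is a minimal sketch applying it to the top-panel settings from above ($\mu = 4$, $\sigma = 1.5$, thresholds at 1.5 through 6.5). The probabilities should approximate the simulated proportions we computed earlier.

# Append the two "virtual" thresholds at -Inf and Inf
theta <- c(-Inf, 1:6 + 0.5, Inf)
mu    <- 4
sigma <- 1.5

# p(y = k) is the area between consecutive thresholds
p_k <- pnorm((theta[2:8] - mu) / sigma) - pnorm((theta[1:7] - mu) / sigma)
round(p_k, digits = 3)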
23.2 The case of a single group
Given a model with no predictors, if there are $K$ ordinal values, there are $K - 1$ thresholds, plus the mean $\mu$ and standard deviation $\sigma$ of the underlying metric variable. The data cannot identify all of those parameters at once, so two of them must be fixed. Kruschke’s convention is to fix the lowest and highest thresholds and estimate $\mu$, $\sigma$, and the interior thresholds. For example, all four subplots from Figure 23.1 had $\theta_1 = 1.5$ and $\theta_6 = 6.5$. As we’ll see, there are other ways to parameterize these models.
23.2.1 Implementation in JAGS brms.
The syntax to fit a basic ordered probit model with brms::brm()
is pretty simple.
fit <- brm(
  data = my_data,
  family = cumulative(probit),
  y ~ 1,
  prior(normal(0, 4), class = Intercept))
The family = cumulative(probit) argument tells brms you’d like to use the probit link for the ordered-categorical data. It’s important to specify probit because the brms default is to use the logit link, instead. We’ll talk more about that approach at the end of this chapter.
Remember how, at the end of the last section, we said there are other ways to parameterize the ordered probit model? As it turns out, brms does not follow Kruschke’s approach of fixing the thresholds on the ends. Rather, brms freely estimates all $K - 1$ thresholds and instead fixes the latent mean and standard deviation; brms::brm() uses the standard normal cumulative distribution function $\Phi$, for which $\mu = 0$ and $\sigma = 1$. This all probably seems abstract. We’ll get a lot of practice comparing the two approaches as we go along. Each has its strengths and weaknesses. At this point, the thing to get is that when fitting a single-group ordered-probit model with the brm() function, there will be no priors for $\mu$ or $\sigma$; the only priors are for the thresholds, for which prior(normal(0, 4), class = Intercept) might be a good starting place. Do feel free to experiment with different settings.
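If you’d like to see which parameters brms will assign priors to before fitting, the get_prior() function can help. As a sketch (assuming a data frame my_data with an ordinal variable y), the output should list one Intercept row per threshold, with no rows for a latent mean or standard deviation.

# Inspect the parameters and their default priors before fitting
get_prior(data = my_data,
          family = cumulative(probit),
          y ~ 1)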
23.2.2 Examples: Bayesian estimation recovers true parameter values.
The data for Kruschke’s first example come from his OrdinalProbitData-1grp-1.csv
file. Load the data.
my_data_1 <- read_csv("data.R/OrdinalProbitData-1grp-1.csv")
glimpse(my_data_1)
Rows: 100
Columns: 1
$ Y <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
Plot the distribution for Y
.
my_data_1 |>
  mutate(Y = factor(Y)) |>
  ggplot(aes(x = Y)) +
  geom_bar(fill = sl[5]) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.05)))
It looks a lot like the distribution of the data from one of the panels from Figure 23.1.
Now we fit the first cumulative-probit model.
fit23.1 <- brm(
  data = my_data_1,
  family = cumulative(probit),
  Y ~ 1,
  prior(normal(0, 4), class = Intercept),
  iter = 3000, warmup = 1000, chains = 4, cores = 4,
  seed = 23,
  file = "fits/fit23.01")
Examine the model summary.
print(fit23.1)
Family: cumulative
Links: mu = probit; disc = identity
Formula: Y ~ 1
Data: my_data_1 (Number of observations: 100)
Draws: 4 chains, each with iter = 3000; warmup = 1000; thin = 1;
total post-warmup draws = 8000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept[1] 0.18 0.13 -0.07 0.43 1.00 9765 6602
Intercept[2] 0.60 0.13 0.34 0.87 1.00 10580 6148
Intercept[3] 1.04 0.15 0.75 1.34 1.00 11813 6972
Intercept[4] 1.50 0.19 1.15 1.89 1.00 11818 7489
Intercept[5] 1.97 0.25 1.50 2.50 1.00 11836 6559
Intercept[6] 2.58 0.40 1.91 3.45 1.00 11454 7239
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
disc 1.00 0.00 1.00 1.00 NA NA NA
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
The brms output for these kinds of models names the thresholds Intercept[i]. Again, whereas Kruschke identified his model by fixing the first and last thresholds to 1.5 and 6.5 and estimating $\mu$ and $\sigma$, brms fixed $\mu = 0$ and $\sigma = 1$ and estimated all six thresholds.
Let’s extract the posterior draws.
draws <- as_draws_df(fit23.1)
glimpse(draws)
Rows: 8,000
Columns: 18
$ `b_Intercept[1]` <dbl> 0.22219600, 0.17617457, 0.12889934, 0.31154831, 0.200…
$ `b_Intercept[2]` <dbl> 0.6401313, 0.5768805, 0.4810705, 0.7504918, 0.6286168…
$ `b_Intercept[3]` <dbl> 1.0925275, 1.0247167, 0.8647831, 1.1303165, 1.0431510…
$ `b_Intercept[4]` <dbl> 1.538993, 1.768666, 1.274813, 1.712827, 1.458790, 1.4…
$ `b_Intercept[5]` <dbl> 1.707398, 2.093944, 2.053914, 1.958433, 1.802841, 1.8…
$ `b_Intercept[6]` <dbl> 2.581430, 2.581517, 2.395196, 2.801048, 2.387104, 2.2…
$ disc <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ `Intercept[1]` <dbl> 0.22219600, 0.17617457, 0.12889934, 0.31154831, 0.200…
$ `Intercept[2]` <dbl> 0.6401313, 0.5768805, 0.4810705, 0.7504918, 0.6286168…
$ `Intercept[3]` <dbl> 1.0925275, 1.0247167, 0.8647831, 1.1303165, 1.0431510…
$ `Intercept[4]` <dbl> 1.538993, 1.768666, 1.274813, 1.712827, 1.458790, 1.4…
$ `Intercept[5]` <dbl> 1.707398, 2.093944, 2.053914, 1.958433, 1.802841, 1.8…
$ `Intercept[6]` <dbl> 2.581430, 2.581517, 2.395196, 2.801048, 2.387104, 2.2…
$ lprior <dbl> -14.25641, -14.31861, -14.22442, -14.34868, -14.22515…
$ lp__ <dbl> -151.4283, -150.8879, -151.2538, -151.2717, -149.6003…
$ .chain <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ .iteration <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
$ .draw <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
Simplify the draws
object.
draws <- draws |>
  select(.draw, `b_Intercept[1]`:`b_Intercept[6]`)
Here’s our brms version of the bottom plot of Figure 23.2.
# Compute the posterior means for each threshold
means <- draws |>
  summarise_at(vars(`b_Intercept[1]`:`b_Intercept[6]`), mean) |>
  pivot_longer(cols = everything(), values_to = "mean")

# Further wrangle
draws |>
  pivot_longer(cols = -.draw, values_to = "threshold") |>
  group_by(.draw) |>
  mutate(theta_bar = mean(threshold)) |>
  
  # Plot
  ggplot(aes(x = threshold, y = theta_bar, color = name)) +
  geom_vline(data = means,
             aes(xintercept = mean, color = name),
             linetype = 2) +
  geom_point(alpha = 1/10) +
  scale_color_scico_d(palette = "lajolla", end = 0.85, direction = -1) +
  ylab("mean threshold") +
  theme(legend.position = "none")
The initial means data frame at the top contains the values we fed into geom_vline() for the dashed vertical reference lines. Did you see what we did there with those group_by() and mutate() lines? That’s how we computed the mean threshold within each step of the HMC chain, what Kruschke (p. 680) denoted as $\bar\theta$. Perhaps of greater interest, you might have noticed how different our plot is from the one in the text. We might compare the results of our brms parameterization with Kruschke’s parameterization by rescaling the thresholds within each posterior draw so the first and last take on the values 1.5 and 6.5. A handy tool for this is the scales::rescale() function, about which you might learn more here.
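In case rescale() is new, it linearly maps the range of a vector onto new limits. Here’s a toy example with made-up threshold-like values; the smallest maps to 1.5 and the largest to 6.5, with everything between stretched proportionally.

# Linearly rescale so the endpoints become 1.5 and 6.5
scales::rescale(c(0.2, 0.6, 1.0, 1.5, 2.0, 2.6), to = c(1.5, 6.5))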
# Primary data wrangling
p <- bind_rows(
  # brms parameterization
  draws |>
    pivot_longer(cols = -.draw, values_to = "threshold") |>
    group_by(.draw) |>
    mutate(theta_bar = mean(threshold)),
  # Kruschke's parameterization
  draws |>
    pivot_longer(cols = -.draw, values_to = "threshold") |>
    group_by(.draw) |>
    mutate(threshold = scales::rescale(threshold, to = c(1.5, 6.5))) |>
    mutate(theta_bar = mean(threshold))) |>
  # Add a parameterization index
  mutate(model = rep(c("brms parameterization", "Kruschke's parameterization"), each = n() / 2))

# Compute the means by model and threshold for the vertical lines
means <- p |>
  ungroup() |>
  group_by(model, name) |>
  summarise(mean = mean(threshold))
# Plot!
p |>
  ggplot(aes(x = threshold, y = theta_bar)) +
geom_vline(data = means,
aes(xintercept = mean, color = name),
linetype = 2) +
geom_point(aes(color = name),
alpha = 1/10, size = 1/2) +
scale_color_scico_d(palette = "lajolla", end = 0.85, direction = -1) +
ylab("mean threshold") +
theme(legend.position = "none") +
facet_wrap(~ model, ncol = 1, scales = "free")
We can take our rescaling approach further to convert the posterior distributions for the thresholds into posterior distributions for Kruschke’s $\mu$ and $\sigma$. If you’d like to compute $\mu$ and $\sigma$ from the brms thresholds, you can use the formulas

$$\sigma = \frac{y_1 - y_6}{\tau_1 - \tau_6} \quad \text{and} \quad \mu = y_1 - \tau_1 \sigma,$$

where $\tau_1$ and $\tau_6$ are the first and last brms thresholds, and $y_1 = 1.5$ and $y_6 = 6.5$ are the corresponding fixed threshold values in Kruschke’s parameterization. Big shout out to my math-stats savvy friends on academic twitter for the formulas, especially Ph.Demetri, Lukas Neugebauer, and Brenton Wiernik for walking out the formulas (see this twitter thread). For our application, Intercept[1] and Intercept[6] will be our two $\tau$ values.
draws |>
  select(.draw, `b_Intercept[1]`, `b_Intercept[6]`) |>
  mutate(`y[1]` = 1.5,
         `y[6]` = 6.5) |>
  mutate(sigma = (`y[1]` - `y[6]`) / (`b_Intercept[1]` - `b_Intercept[6]`)) |>
  mutate(mu = `y[1]` - `b_Intercept[1]` * sigma) |>
  mutate(`(mu-2)/sigma` = (mu - 2) / sigma) |>
  pivot_longer(cols = c(mu, sigma, `(mu-2)/sigma`)) |>
  mutate(name = factor(name, levels = c("mu", "sigma", "(mu-2)/sigma"))) |>
  ggplot(aes(x = value)) +
  stat_halfeye(point_interval = mode_hdi, .width = 0.95,
               color = sl[8], fill = sl[4],
               normalize = "panels") +
  scale_y_continuous(NULL, breaks = NULL) +
  facet_wrap(~ name, labeller = label_parsed, scales = "free")
Our results are similar to Kruschke’s. Given we used a different algorithm, a different parameterization, and different priors, I’m not terribly surprised they’re a little different. If you have more insight on the matter or have spotted a flaw in this method, please share with the rest of us.

It’s unclear, to me, how we’d interpret the effect size. The difficulty isn’t the arithmetic of Kruschke’s comparison of $\mu$ against the reference value in standard-deviation units; it’s what a standardized difference means on a latent scale we never observe directly.
In the large paragraph on the lower part of page 679, Kruschke discussed why the thresholds tend to have nontrivial covariances. This is what he was trying to convey with the bottom subplot in Figure 23.2. Just for practice, we might also explore the correlation matrix among the thresholds with a customized GGally::ggpairs()
plot.
# Customize
my_upper <- function(data, mapping, ...) {
  ggplot(data = data, mapping = mapping) +
    geom_point(alpha = 1/5, color = sl[7], size = 1/5)
}

my_diag <- function(data, mapping, ...) {
  ggplot(data = data, mapping = mapping) +
    geom_density(fill = sl[3], linewidth = 0) +
    scale_x_continuous(NULL, breaks = NULL) +
    scale_y_continuous(NULL, breaks = NULL)
}

my_lower <- function(data, mapping, ...) {
  
  # Get the x and y data to use the other code
  x <- eval_data_col(data, mapping$x)
  y <- eval_data_col(data, mapping$y)
  
  # Compute the correlations
  corr <- cor(x, y, method = "p", use = "pairwise")
  
  # Plot the cor value
  ggally_text(
    label = formatC(corr, digits = 2, format = "f") |> str_replace("0\\.", "."),
    mapping = aes(),
    color = "black", size = 4) +
    scale_x_continuous(NULL, breaks = NULL) +
    scale_y_continuous(NULL, breaks = NULL)
}
# Wrangle
as_draws_df(fit23.1) |>
select(contains("b_Intercept")) |>
set_names(str_c("theta[", 1:6, "]")) |>
# Plot
ggpairs(upper = list(continuous = my_upper),
diag = list(continuous = my_diag),
lower = list(continuous = my_lower),
labeller = label_parsed) +
ggtitle("The thresholds are highly correlated")
Kruschke didn’t do this in the text, but it might be informative to plot the probability distributions for the seven categories of Y.
draws |>
  select(-.draw) |>
mutate_all(.funs = pnorm) |>
transmute(`p[Y==1]` = `b_Intercept[1]`,
`p[Y==2]` = `b_Intercept[2]` - `b_Intercept[1]`,
`p[Y==3]` = `b_Intercept[3]` - `b_Intercept[2]`,
`p[Y==4]` = `b_Intercept[4]` - `b_Intercept[3]`,
`p[Y==5]` = `b_Intercept[5]` - `b_Intercept[4]`,
`p[Y==6]` = `b_Intercept[6]` - `b_Intercept[5]`,
`p[Y==7]` = 1 - `b_Intercept[6]`) |>
set_names(1:7) |>
pivot_longer(cols = everything(), names_to = "Y") |>
ggplot(aes(x = value, y = Y)) +
stat_halfeye(point_interval = mode_hdi, .width = 0.95,
color = sl[8], fill = sl[4], height = 2.5, size = 1/2) +
scale_x_continuous(expression(italic(p)*"["*Y==italic(i)*"]"),
breaks = 0:5 / 5, expand = c(0, 0), limits = c(0, 1))
Happily, the model produces data that look a lot like those from which it was generated.
set.seed(23)

draws |>
  mutate(z = rnorm(n(), mean = 0, sd = 1)) |>
  mutate(Y = case_when(
    z < `b_Intercept[1]` ~ 1,
    z < `b_Intercept[2]` ~ 2,
    z < `b_Intercept[3]` ~ 3,
    z < `b_Intercept[4]` ~ 4,
    z < `b_Intercept[5]` ~ 5,
    z < `b_Intercept[6]` ~ 6,
    z >= `b_Intercept[6]` ~ 7) |> as.factor()) |>
  ggplot(aes(x = Y)) +
  geom_bar(fill = sl[5]) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.05)))
Along similar lines, we can use the pp_check()
function to make a version of the upper right panel of Figure 23.2. The type = "bars"
argument will allow us to summarize the posterior predictions as a dot (mean) and standard error bars superimposed on a bar plot of the original data. Note how this differs a little from Kruschke’s use of the posterior median and 95% HDIs. The ndraws = 1000
argument controls how many posterior predictions we wanted to summarize over. The rest is just formatting.
bayesplot::color_scheme_set(sl[2:7])
set.seed(23)
pp_check(fit23.1, type = "bars", ndraws = 1000, fatten = 2) +
scale_x_continuous("y", breaks = 1:7) +
scale_y_continuous(NULL, breaks = NULL, expand = expansion(mult = c(0, 0.05))) +
ggtitle("Data with posterior predictions",
subtitle = "N = 100") +
theme(legend.background = element_blank(),
legend.position = c(0.9, 0.8))
Load the data for the next model.
my_data_2 <- read_csv("data.R/OrdinalProbitData-1grp-2.csv")
Since we’re reusing all the specifications from the last model for this one, we can just use update()
.
fit23.2 <- update(
  fit23.1,
  newdata = my_data_2,
  iter = 3000, warmup = 1000, chains = 4, cores = 4,
  seed = 23,
  file = "fits/fit23.02")
print(fit23.2)
Family: cumulative
Links: mu = probit; disc = identity
Formula: Y ~ 1
Data: my_data_2 (Number of observations: 70)
Draws: 4 chains, each with iter = 3000; warmup = 1000; thin = 1;
total post-warmup draws = 8000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept[1] -1.41 0.21 -1.84 -1.00 1.00 5545 5461
Intercept[2] -0.17 0.15 -0.47 0.12 1.00 9587 6805
Intercept[3] 0.17 0.15 -0.12 0.47 1.00 9258 7298
Intercept[4] 0.46 0.16 0.16 0.76 1.00 9223 6339
Intercept[5] 0.83 0.17 0.50 1.16 1.00 9120 7027
Intercept[6] 1.99 0.31 1.42 2.66 1.00 8518 6280
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
disc 1.00 0.00 1.00 1.00 NA NA NA
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Extract and wrangle the posterior draws.
draws <- as_draws_df(fit23.2) |>
  select(.draw, `b_Intercept[1]`:`b_Intercept[6]`)
Now we might compare the brms parameterization of the thresholds with Kruschke’s, as before. Since we’ll be making this kind of plot repeatedly, let’s wrap the relevant code from above into a function, which we’ll call compare_thresholds().
compare_thresholds <- function(data, lb = 1.5, ub = 6.5) {
  
  # `lb` = lower bound
  # `ub` = upper bound
  
  # Primary data wrangling
  p <- bind_rows(
    data |>
      pivot_longer(cols = -.draw, values_to = "threshold") |>
      group_by(.draw) |>
      mutate(theta_bar = mean(threshold)),
    data |>
      pivot_longer(cols = -.draw, values_to = "threshold") |>
      group_by(.draw) |>
      mutate(threshold = scales::rescale(threshold, to = c(lb, ub))) |>
      mutate(theta_bar = mean(threshold))) |>
    mutate(model = rep(c("brms parameterization", "Kruschke's parameterization"), each = n() / 2))
  
  # Compute the means by model and threshold for the vertical lines
  means <- p |>
    ungroup() |>
    group_by(model, name) |>
    summarise(mean = mean(threshold))
  
  # Plot
  p |>
    ggplot(aes(x = threshold, y = theta_bar)) +
    geom_vline(data = means,
               aes(xintercept = mean, color = name),
               linetype = 2) +
    geom_point(aes(color = name),
               alpha = 1/10, size = 1/2) +
    scale_color_scico_d(palette = "lajolla", end = 0.85, direction = -1) +
    ylab("mean threshold") +
    theme(legend.position = "none") +
    facet_wrap(~ model, ncol = 1, scales = "free")
}
Take that puppy for a spin.
draws |>
  compare_thresholds(lb = 1.5, ub = 6.5)
Oh man, that works sweet. Now let’s use the same parameter-transformation approach from before to get our un-standardized posteriors for $\mu$ and $\sigma$, along with the effect size for comparing $\mu$ against the reference value of 4.
draws |>
  select(.draw, `b_Intercept[1]`, `b_Intercept[6]`) |>
  mutate(`y[1]` = 1.5,
         `y[6]` = 6.5) |>
  mutate(sigma = (`y[1]` - `y[6]`) / (`b_Intercept[1]` - `b_Intercept[6]`)) |>
  mutate(mu = `y[1]` - `b_Intercept[1]` * sigma) |>
  mutate(`(mu-4)/sigma` = (mu - 4) / sigma) |>
  pivot_longer(cols = c(mu, sigma, `(mu-4)/sigma`)) |>
  mutate(name = factor(name, levels = c("mu", "sigma", "(mu-4)/sigma"))) |>
  ggplot(aes(x = value)) +
  stat_halfeye(point_interval = mode_hdi, .width = 0.95,
               color = sl[8], fill = sl[4],
               normalize = "panels") +
  scale_y_continuous(NULL, breaks = NULL) +
  facet_wrap(~ name, labeller = label_parsed, scales = "free")
Use pp_check()
to make our version of the upper-right panel of Figure 23.3.
set.seed(23)
pp_check(fit23.2, type = "bars", fatten = 2, ndraws = 1000) +
scale_x_continuous("y", breaks = 1:7) +
scale_y_continuous(NULL, breaks = NULL, expand = expansion(mult = c(0, 0.05))) +
ggtitle("Data with posterior predictions",
subtitle = "N = 70") +
theme(legend.background = element_blank(),
legend.position = c(0.9, 0.8))
Just as in the text, “the posterior predictive distribution in the top-right subpanel accurately describes the bimodal distribution of the outcomes” (p. 680).
Here are the probability distributions for each of the 7 categories of Y
.
draws |>
  select(-.draw) |>
mutate_all(.funs = pnorm) |>
transmute(`p[Y==1]` = `b_Intercept[1]`,
`p[Y==2]` = `b_Intercept[2]` - `b_Intercept[1]`,
`p[Y==3]` = `b_Intercept[3]` - `b_Intercept[2]`,
`p[Y==4]` = `b_Intercept[4]` - `b_Intercept[3]`,
`p[Y==5]` = `b_Intercept[5]` - `b_Intercept[4]`,
`p[Y==6]` = `b_Intercept[6]` - `b_Intercept[5]`,
`p[Y==7]` = 1 - `b_Intercept[6]`) |>
set_names(1:7) |>
pivot_longer(cols = everything(), names_to = "Y") |>
ggplot(aes(x = value, y = Y)) +
stat_halfeye(point_interval = mode_hdi, .width = 0.95,
color = sl[8], fill = sl[4], size = 1/2) +
scale_x_continuous(expression(italic(p)*"["*Y==italic(i)*"]"),
breaks = 0:5 / 5, expand = c(0, 0), limits = c(0, 1))
Before we move on, it might be helpful to nail down what the thresholds mean within the context of our brms parameterization. To keep things simple, we’ll focus on their posterior means.
tibble(x = seq(from = -3.5, to = 3.5, by = 0.01)) |>
mutate(density = dnorm(x)) |>
ggplot(aes(x = x, ymin = 0, ymax = density)) +
geom_ribbon(fill = sl[3]) +
geom_vline(xintercept = fixef(fit23.2)[, 1], color = "white", linetype = 3) +
scale_x_continuous(NULL, breaks = fixef(fit23.2)[, 1],
labels = parse(text = str_c("theta[", 1:6, "]"))) +
scale_y_continuous(NULL, breaks = NULL, expand = expansion(mult = c(0, 0.05))) +
ggtitle("Standard normal distribution underlying the ordinal Y data:",
subtitle = "The dashed vertical lines mark the posterior means for the thresholds.") +
coord_cartesian(xlim = c(-3, 3))
Compare that to Figure 23.1.
23.2.2.1 Not the same results as pretending the data are metric.
“In some conventional approaches to ordinal data, the data are treated as if they were metric and normally distributed” (p. 681). Here’s what that brms::brm()
model might look like using methods from back in Chapter 16. First, we’ll define our stanvars
.
mean_y <- mean(my_data_1$Y)
sd_y <- sd(my_data_1$Y)

stanvars <- stanvar(mean_y, name = "mean_y") +
  stanvar(sd_y, name = "sd_y")
Fit the model.
fit23.3 <- brm(
  data = my_data_1,
  family = gaussian,
  Y ~ 1,
  prior = c(prior(normal(mean_y, sd_y * 100), class = Intercept),
            prior(normal(0, sd_y), class = sigma)),
  chains = 4, cores = 4,
  stanvars = stanvars,
  seed = 23,
  file = "fits/fit23.03")
Check the results.
print(fit23.3)
Family: gaussian
Links: mu = identity; sigma = identity
Formula: Y ~ 1
Data: my_data_1 (Number of observations: 100)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 1.95 0.14 1.67 2.22 1.00 2995 2286
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 1.41 0.10 1.23 1.63 1.00 3529 2816
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
As Kruschke indicated in the text, it yielded a distributional mean of about 1.95 and a standard deviation of about 1.41. Here we’ll use a posterior predictive check to compare histograms of data generated from this model to that of the original data.
pp_check(fit23.3, type = "hist", binwidth = 1, ndraws = 10) +
scale_x_continuous(breaks = seq(from = -3, to = 7, by = 2)) +
theme(legend.position = c(0.9, 0.15))
Yeah, that’s not a good fit. We won’t be conducting a one-sample $t$-test like the one Kruschke discussed in the text. But we can compare the posterior for the intercept against the reference value of 2, and compute the corresponding effect size.
as_draws_df(fit23.3) |>
mutate(`2 - b_Intercept` = 2 - b_Intercept,
`effect size` = (2 - b_Intercept) / sigma) |>
pivot_longer(cols = `2 - b_Intercept`:`effect size`) |>
ggplot(aes(x = value)) +
stat_halfeye(point_interval = mode_hdi, .width = 0.95,
color = sl[8], fill = sl[4], normalize = "panels") +
scale_y_continuous(NULL, breaks = NULL) +
facet_wrap(~ name, scales = "free")
Yes indeed, 2 is a credible value for the intercept. And as reported in the text, we got a very small effect size. Now we prepare to fit the conventional Gaussian model to the second data set by redefining the stanvars and updating the fit.
mean_y <- mean(my_data_2$Y)
sd_y <- sd(my_data_2$Y)

stanvars <- stanvar(mean_y, name = "mean_y") +
  stanvar(sd_y, name = "sd_y")

fit23.4 <- update(
  fit23.3,
  newdata = my_data_2,
  chains = 4, cores = 4,
  stanvars = stanvars,
  seed = 23,
  file = "fits/fit23.04")
Let’s just jump to the plot. This time we’re comparing the b_Intercept
to the value of 4.0.
as_draws_df(fit23.4) |>
  mutate(`4 - b_Intercept` = 4 - b_Intercept,
         `effect size` = (4 - b_Intercept) / sigma) |>
  pivot_longer(cols = `4 - b_Intercept`:`effect size`) |>
ggplot(aes(x = value)) +
stat_halfeye(point_interval = mode_hdi, .width = 0.95,
color = sl[8], fill = sl[4], normalize = "panels") +
scale_y_continuous(NULL, breaks = NULL) +
facet_wrap(~ name, scales = "free")
As in the text, the conclusions from this model differ from those of its cumulative-normal counterpart. Here’s how well fit23.4 summarized these data.
pp_check(fit23.4, type = "hist", binwidth = 1, ndraws = 10) +
scale_x_continuous(breaks = seq(from = -3, to = 7, by = 2)) +
theme(legend.position = c(0.9, 0.15))
The histograms aren’t as awful as the ones from the previous model. But they’re still not great. We might further inspect the model misspecification with a cumulative distribution function overlay, this time comparing fit23.2
directly to fit23.4
.
p1 <- pp_check(fit23.2, type = "ecdf_overlay", ndraws = 50) +
  ggtitle("Cumulative-normal (fit23.2)")

p2 <- pp_check(fit23.4, type = "ecdf_overlay", ndraws = 50) +
  ggtitle("Conventional-normal (fit23.4)")

(p1 + p2 &
   scale_x_continuous(breaks = 0:7, limits = c(0, 7)) &
   scale_y_continuous(expand = c(0, 0)) &
   theme(title = element_text(size = 10.5))) +
  plot_layout(guides = "collect")
“Which of the analyses yields the more trustworthy conclusion? The one that describes the data better. In these cases, there is no doubt that the cumulative-normal model is the better description of the data” than the conventional Gaussian model (p. 682).
23.2.2.2 Ordinal outcomes versus Likert scales.
Just for fun,

rate how much you agree with the statement, “Bayesian estimation is more informative than null-hypothesis significance testing,” by selecting one option from the following: 1 = strongly disagree; 2 = disagree; 3 = undecided; 4 = agree; 5 = strongly agree. This sort of ordinal response interface is often called a Likert-type response (Likert, 1932; pronounced LICK-ert, not LIKE-ert). Sometimes, it is called a Likert “scale,” but the term “scale” in this context is more properly reserved for referring to an underlying metric variable that is indicated by the arithmetic mean of several meaningfully related Likert-type responses (Carifio & Perla, 2008; e.g., Carifio & Perla, 2007; Norman, 2010). (p. 681)
Kruschke then briefly introduced how one might combine several such meaningfully-related Likert-type responses with latent variable methods. He then clarified this text will not explore that approach further. The current version of brms (i.e., 2.12.0) has very limited latent variable capacities. However, they are in the works. Interested modelers can follow Bürkner’s progress in GitHub issue #304. He also has a (2020) paper on how one might use brms to fit item response theory models, which can be viewed as a special family of latent variable models. One can also fit Bayesian latent variable models with the blavaan package.
23.3 The case of two groups
In both examples in the preceding text, the two groups of outcomes were on the same ordinal scale. In the first example, both questionnaire statements were answered on the same disagree–agree scale. In the second example, both groups responded on the same very unhappy–very happy scale. Therefore, we assume that both groups have the same underlying metric variable with the same thresholds. (p. 682)
23.3.1 Implementation in JAGS brms.
The brm() syntax for adding a single categorical predictor to an ordered-probit model is much like that for any other likelihood. We just add the variable name to the right side of the ~ in the formula argument. If you’re like me and like to use the verbose 1 syntax for your model intercepts (thresholds in these models), just use the + operator between them. For example, this is what it’d look like for an ordered-categorical criterion y and a single categorical predictor x.
fit <- brm(
  data = my_data,
  family = cumulative(probit),
  y ~ 1 + x,
  prior = c(prior(normal(0, 4), class = Intercept),
            prior(normal(0, 4), class = b)))
Also of note, we’ve expanded the prior section to include a line for class = b. As with the thresholds, interpret this prior through the context of the underlying standard normal cumulative distribution. Because the latent standard deviation is fixed to 1, the $\beta$ coefficients for the levels of x will be differences on that standard-normal scale; in other words, those will be standardized mean differences, making them a kind of effect size.
Note, the above all presumes you’re only interested in comparing means between groups. Things get more complicated if you want the groups to vary by their latent standard deviations, too.
First, look back at the output from print(fit23.1) or print(fit23.2). The second line for both reads: Links: mu = probit; disc = identity. Hopefully the mu = probit part is no surprise. Probit regression is the primary focus of this chapter. But check out the disc = identity part and notice that nowhere in there is there any mention of sigma = identity like we get when treating the criterion as metric as in conventional Gaussian models (i.e., execute print(fit23.3) or print(fit23.4)).
Yes, there is a relationship between disc and sigma. disc is shorthand for discrimination. The term comes from the item response theory (IRT) literature, and discrimination is the inverse of the standard deviation, $\alpha = 1 / \sigma$. By default, both disc and sigma are modeled using the identity link. If you recall from earlier chapters, we switched to the log link to constrain the values to zero and above when we allowed $\sigma$ to vary across levels of a predictor; brms does the same for disc once we model it.
As with $\sigma$ in those earlier models, we use the bf() syntax when modeling the discrimination parameter in brms. For a general introduction to what Bürkner calls distributional modeling, see his (2022a) vignette, Estimating distributional models with brms. In the case of the discrimination parameter for the cumulative model, we’ll want more focused instructions. Happily, Bürkner & Vuorre (2019) have our backs. We read:
Conceptually, unequal variances are incorporated in the model by specifying an additional regression formula for the variance component of the latent variable. In brms, the parameter related to latent variances is called disc (short for “discrimination”), following conventions in item response theory. Note that disc is not the variance itself, but the inverse of the standard deviation. That is, $\text{disc} = \alpha = 1 / \sigma$. Further, because disc must be strictly positive, it is by default modeled on the log scale. Predicting auxiliary parameters (parameters of the distribution other than the mean/location) in brms is accomplished by passing multiple regression formulas to the
brm()
function. Each formula must first be wrapped in another function,bf()
orlf()
(for “linear formula”)–depending on whether it is a main or an auxiliary formula, respectively. The formulas are then combined and passed to theformula
argument ofbrm()
. Because the standard deviation of the latent variable is fixed to 1 for the baseline [group, disc cannot be estimated for the baseline group]. We must therefore ensure that disc is estimated only for [non-baseline groups]. To do so, we omit the intercept from the model of disc by writing0 + ...
on the right-hand side of the regression formula. By default, R applies cell-mean coding to factors in formulas without an intercept. That would lead to disc being estimated for [all groups], so we must deactivate it via thecmc
argument oflf()
. (pp. 11–12)
Here’s what that might look like.
fit <- brm(
  data = my_data,
  family = cumulative(probit),
  bf(y ~ 1 + x) +
    lf(disc ~ 0 + x, cmc = FALSE),
  prior = c(prior(normal(0, 4), class = Intercept),
            prior(normal(0, 4), class = b),
            prior(normal(0, 4), class = b, dpar = disc)))
Note how when using the disc ~ 0 + ... syntax, the disc parameters are of class = b within the prior() function. If you’d like to assign them priors differing from the other b parameters, you’ll need to specify dpar = disc. Again, though the mean structure for this model is on the probit scale, the discrimination structure is on the log scale. Recalling that $\alpha = 1 / \sigma$, this means the disc coefficients are deviations expressed in terms of $\log \alpha$.
To get a better sense of how one might set a prior on such a scale, we might compare what a few example $\sigma$ values look like when expressed as $\alpha$ and as $\log \alpha$.
tibble(mu = 0,
sigma = c(0.5, 1, 2)) |>
expand_grid(y = seq(from = -5, to = 5, by = 0.1)) |>
mutate(`p(y)` = dnorm(y, mu, sigma),
`Phi(y)` = pnorm(y, mu, sigma)) |>
mutate(alpha = 1 / sigma,
loga = log(1 / sigma)) |>
mutate(label = str_c("list(sigma==", sigma, ",alpha==", alpha, ",log(alpha)==", round(loga, 2), ")")) |>
pivot_longer(cols = `p(y)`:`Phi(y)`) |>
ggplot(aes(x = y, y = value)) +
geom_line(linewidth = 1.5, color = sl[7]) +
scale_y_continuous(NULL, breaks = NULL) +
xlab(NULL) +
coord_cartesian(xlim = c(-4, 4)) +
facet_grid(name ~ label, labeller = label_parsed, switch = "y")
Put another way, here’s how $\alpha$ and $\log \alpha$ behave across a range of $\sigma$ values.
tibble(sigma = seq(from = 0.0001, to = 10, by = 0.01)) |>
mutate(alpha = 1 / sigma,
`log(alpha)` = log(1 / sigma)) |>
pivot_longer(cols = -sigma, names_to = "labels") |>
ggplot(aes(x = sigma, y = value)) +
geom_hline(yintercept = 0, color = sl[3], linetype = 2) +
geom_vline(xintercept = 0, color = sl[3], linetype = 2) +
geom_line(linewidth = 1.5, color = sl[7]) +
coord_cartesian(ylim = c(-2, 10)) +
facet_grid(~ labels, labeller = label_parsed)
When $\sigma = 1$, we get $\alpha = 1$ and $\log \alpha = 0$. Here’s the relation in the opposite direction, mapping $\log \alpha$ back onto $\sigma$.
tibble(`log(alpha)` = seq(from = -3, to = 3, by = 0.01)) |>
mutate(sigma = 1 / exp(`log(alpha)`)) |>
ggplot(aes(x = `log(alpha)`, y = sigma)) +
geom_hline(yintercept = 0, color = sl[3], linetype = 2) +
geom_line(color = sl[7], linewidth = 1.5) +
labs(x = expression(log(alpha)),
y = expression(sigma)) +
coord_cartesian(xlim = c(-2.5, 2.5),
ylim = c(0, 10))
In the context where the underlying distribution for the reference category will be the standard normal, it seems like a normal(0, 1) prior on the $\log \alpha$ deviations is a reasonable starting point.
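One way to vet that choice is to simulate from the prior and inspect the implied $\sigma$ values. Here’s a minimal sketch; the implied distribution has a median near 1 with a 95% range of roughly 0.14 to 7.

# Simulate log(alpha) draws from normal(0, 1) and convert to the sigma scale
set.seed(23)
data.frame(log_alpha = rnorm(1e5, mean = 0, sd = 1)) |>
  mutate(sigma = 1 / exp(log_alpha)) |>
  summarise(`2.5%`  = quantile(sigma, probs = 0.025),
            median  = median(sigma),
            `97.5%` = quantile(sigma, probs = 0.975))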
23.3.2 Examples: Not funny.
Load the data for the next model.
my_data <- read_csv("data.R/OrdinalProbitData1.csv")
glimpse(my_data)
Rows: 88
Columns: 2
$ X <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A"…
$ Y <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
Fit the first two-group ordinal probit model, allowing both the latent mean and the latent standard deviation of Y to vary by group.
fit23.5 <- brm(
  data = my_data,
  family = cumulative(probit),
  bf(Y ~ 1 + X) +
    lf(disc ~ 0 + X, cmc = FALSE),
  prior = c(prior(normal(0, 4), class = Intercept),
            prior(normal(0, 4), class = b),
            prior(normal(0, 1), class = b, dpar = disc)),
  iter = 3000, warmup = 1000, chains = 4, cores = 4,
  seed = 23,
  file = "fits/fit23.05")
Look over the summary.
print(fit23.5)
Family: cumulative
Links: mu = probit; disc = log
Formula: Y ~ 1 + X
disc ~ 0 + X
Data: my_data (Number of observations: 88)
Draws: 4 chains, each with iter = 3000; warmup = 1000; thin = 1;
total post-warmup draws = 8000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept[1] 0.50 0.20 0.12 0.89 1.00 8682 5711
Intercept[2] 1.30 0.24 0.87 1.78 1.00 9614 6223
Intercept[3] 2.20 0.38 1.51 3.00 1.00 5284 5223
Intercept[4] 3.42 0.82 2.14 5.31 1.00 3958 4971
XB 0.43 0.34 -0.31 1.05 1.00 4816 3770
disc_XB -0.32 0.28 -0.86 0.25 1.00 3105 3916
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Because our fancy new parameter disc_XB is on the $\log \alpha$ scale, we’ll want to transform it back to the $\sigma$ scale with the formula $\sigma = 1 / \exp(\log \alpha)$.
1 / (exp(fixef(fit23.5)["disc_XB", 1]))
[1] 1.371802
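That was just a point estimate based on the posterior mean. Since the transformation is nonlinear, a minimal sketch for applying it draw by draw might look like this.

# Convert `b_disc_XB` to the sigma scale within each posterior draw
as_draws_df(fit23.5) |>
  transmute(`sigma[B]` = 1 / exp(b_disc_XB)) |>
  mean_qi(`sigma[B]`)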
Before we follow along with Kruschke, let’s hammer the meaning of these model parameters home. Here is a density plot of the two underlying latent distributions for Y
, given X
. We’ll throw in the thresholds for good measure. To keep things simple, we’ll just express the distributions in terms of the posterior means of each parameter.
tibble(X = LETTERS[1:2],
mu = c(0, fixef(fit23.5)["XB", 1]),
sigma = c(1, 1 / (exp(fixef(fit23.5)["disc_XB", 1])))) |>
expand_grid(y = seq(from = -5, to = 5, by = 0.1)) |>
mutate(density = dnorm(y, mean = mu, sd = sigma)) |>
ggplot(aes(x = y, y = density, fill = X)) +
geom_area(alpha = 2/3, position = "identity") +
geom_vline(xintercept = fixef(fit23.5)[1:4, 1], color = sl[9], linetype = 3) +
scale_fill_scico_d(palette = "lajolla", begin = 0.75, end = 0.4) +
scale_x_continuous(sec.axis = dup_axis(breaks = fixef(fit23.5)[1:4, 1] |> as.double(),
labels = parse(text = str_c("theta[", 1:4, "]")))) +
scale_y_continuous(NULL, breaks = NULL, expand = expansion(mult = c(0, 0.05))) +
labs(x = NULL,
title = "Underlying latent scale for Y, given X") +
theme(axis.ticks.x.top = element_blank())
Returning to our previous workflow, extract the posterior draws and wrangle.
draws <- as_draws_df(fit23.5)
glimpse(draws)
Rows: 8,000
Columns: 15
$ `b_Intercept[1]` <dbl> 0.6225250, 0.3970634, 0.4201213, 0.5717441, 0.6611438…
$ `b_Intercept[2]` <dbl> 1.2447460, 0.9979544, 0.9056600, 1.5148320, 1.6155517…
$ `b_Intercept[3]` <dbl> 2.015155, 1.432478, 1.869619, 2.601692, 2.592201, 1.9…
$ `b_Intercept[4]` <dbl> 3.182607, 1.760798, 2.241776, 3.869627, 3.710775, 3.1…
$ b_XB <dbl> 0.67725443, 0.23475709, 0.35607950, 0.47933554, 0.638…
$ b_disc_XB <dbl> -0.27934313, 0.31402332, 0.02621193, -0.14254980, -0.…
$ `Intercept[1]` <dbl> 0.28389776, 0.27968489, 0.24208154, 0.33207630, 0.342…
$ `Intercept[2]` <dbl> 0.9061188, 0.8805758, 0.7276202, 1.2751642, 1.2964657…
$ `Intercept[3]` <dbl> 1.676528, 1.315099, 1.691579, 2.362024, 2.273115, 1.6…
$ `Intercept[4]` <dbl> 2.843980, 1.643419, 2.063736, 3.629960, 3.391689, 2.8…
$ lprior <dbl> -12.86722, -12.66125, -12.69030, -13.10282, -13.06029…
$ lp__ <dbl> -107.9541, -111.8837, -110.4143, -109.7284, -108.2663…
$ .chain <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ .iteration <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
$ .draw <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
Now, let’s use our handy compare_thresholds()
function to make an expanded version of the lower-left plot of Figure 23.4.
draws |>
  select(`b_Intercept[1]`:`b_Intercept[4]`, .draw) |>
  compare_thresholds(lb = 1.5, ub = 4.5)
It will no longer be straightforward to use the formulas from 23.2.2 to convert the output from our brms parameterization to match the way Kruschke parameterized his conditional means and standard deviations. I will leave the conversion up to the interested reader. Going forward, we will focus on the output from our brms parameterization.
draws |>
  # Simple parameters
  mutate(`mu[A]` = 0,
`mu[B]` = b_XB,
`sigma[A]` = 1,
`sigma[B]` = 1 / exp(b_disc_XB)) |>
# Simple differences
mutate(`mu[B]-mu[A]` = `mu[B]` - `mu[A]`,
`sigma[B]-sigma[A]` = `sigma[B]` - `sigma[A]`) |>
# Effect size
mutate(`(mu[B]-mu[A])/sqrt((sigma[A]^2+sigma[B]^2)/2)` = (`mu[B]-mu[A]`) / sqrt((`sigma[A]`^2 + `sigma[B]`^2) / 2)) |>
# Wrangle
pivot_longer(cols = `mu[A]`:`(mu[B]-mu[A])/sqrt((sigma[A]^2+sigma[B]^2)/2)`) |>
mutate(name = factor(name,
levels = c("mu[A]", "mu[B]", "mu[B]-mu[A]",
"sigma[A]", "sigma[B]", "sigma[B]-sigma[A]",
"(mu[B]-mu[A])/sqrt((sigma[A]^2+sigma[B]^2)/2)"))) |>
# Plot
ggplot(aes(x = value)) +
stat_halfeye(point_interval = mode_hdi, .width = 0.95,
color = sl[8], fill = sl[4], normalize = "panels") +
scale_y_continuous(NULL, breaks = NULL) +
xlab(NULL) +
facet_wrap(~ name, labeller = label_parsed, scales = "free")
Do note the limitations of our stat_halfeye() approach with the scales freed across facets. If these plots really mattered for a scientific presentation or something for industry, you could experiment using either a common scale across all facets, or making the plots individually and then combining them with patchwork syntax. Returning to interpretation, because $\mu_A$ and $\sigma_A$ are fixed at 0 and 1 in this parameterization, the posteriors for $\mu_B$ and $\sigma_B$ can be read directly as differences from group A on the latent standard-normal scale, and the final panel expresses the mean difference as a standardized effect size.
Here we make good use of the type = "bars_grouped"
and group = "X"
arguments to make the posterior predictive plots at the top right of Figure 23.4 with the brms::pp_check()
function.
set.seed(23)
pp_check(fit23.5, type = "bars_grouped", fatten = 2, group = "X", ndraws = 100) +
scale_x_continuous("y", breaks = 1:7) +
scale_y_continuous(NULL, breaks = NULL, expand = expansion(mult = c(0, 0.05))) +
ggtitle("Data with posterior predictions",
subtitle = expression(list(italic(N[A])==44, italic(N[B])==44))) +
theme(legend.background = element_blank(),
legend.position = c(0.9, 0.75))
Using more tricks from back in Chapter 16, here’s the corresponding conventional Gaussian model for metric data.
mean_y <- mean(my_data$Y)
sd_y <- sd(my_data$Y)

stanvars <- stanvar(mean_y, name = "mean_y") +
  stanvar(sd_y, name = "sd_y")
fit23.6 <- brm(
  data = my_data,
  family = gaussian,
  bf(Y ~ 0 + X, sigma ~ 0 + X),
  prior = c(prior(normal(mean_y, sd_y * 100), class = b),
            prior(normal(0, 1), class = b, dpar = sigma)),
  chains = 4, cores = 4,
  stanvars = stanvars,
  seed = 23,
  file = "fits/fit23.06")
Check the summary.
print(fit23.6)
Family: gaussian
Links: mu = identity; sigma = log
Formula: Y ~ 0 + X
sigma ~ 0 + X
Data: my_data (Number of observations: 88)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
XA 1.43 0.12 1.20 1.66 1.00 4042 2503
XB 1.86 0.17 1.53 2.18 1.00 3749 2648
sigma_XA -0.26 0.11 -0.47 -0.03 1.00 4222 3112
sigma_XB 0.08 0.11 -0.12 0.30 1.00 3623 2465
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Here are the marginal posteriors, including the effect size (i.e., a standardized mean difference using the pooled standard deviation formula presuming equal sample sizes, $(\mu_B - \mu_A) \big/ \sqrt{(\sigma_A^2 + \sigma_B^2) / 2}$).
as_draws_df(fit23.6) |>
mutate(`A mean` = b_XA,
`B mean` = b_XB,
`A Std. Dev.` = exp(b_sigma_XA),
`B Std. Dev.` = exp(b_sigma_XB)) |>
mutate(`Difference of Means` = `B mean` - `A mean`,
`Difference of Std. Devs` = `B Std. Dev.` - `A Std. Dev.`) |>
mutate(`Effect Size` = `Difference of Means` / sqrt((`A Std. Dev.`^2 + `B Std. Dev.`^2) / 2)) |>
pivot_longer(cols = `A mean`:`Effect Size`) |>
ggplot(aes(x = value)) +
stat_halfeye(point_interval = mode_hdi, .width = 0.95,
color = sl[8], fill = sl[4], normalize = "panels") +
scale_y_continuous(NULL, breaks = NULL) +
labs(x = "Marginal posterior",
subtitle = "These are based on the conventional Gaussian model, NOT the cumulative-normal\nmodel Kruschke displayed in Figure 23.4") +
facet_wrap(~ name, ncol = 2, scales = "free")
Compare those results to those Kruschke reported from an NHST analysis in the note below Figure 23.4, wherein a $t$-test concluded the means were significantly different, with an accompanying effect size and CI, and wherein

an $F$ test of the variances concludes that the standard deviations are significantly different… Notice in this case that treating the values as metric greatly underestimates their variances, as well as erroneously concludes the variances are different. (p. 684)
As to the data in the analyses Kruschke reported in Figure 23.5 and the prose in the second paragraph on page 685, I’m not aware that Kruschke provided them. From his footnote #2, we read: “Data in Figure 23.5 are from an as-yet unpublished study I conducted with the collaboration of Allison Vollmer as part of her undergraduate honors project.” In place of the real data, I eyeballed the values based on the upper-right panels in Figure 23.5. Here they are.
d <- tibble(x = rep(str_c("joke ", c(1, 6)), each = 177),
            y = c(rep(1:7, times = c(95, 19, 18, 10, 17, 10, 8)),
                  rep(1:7, times = c(53, 33, 31, 22, 23, 14, 1))))

glimpse(d)
glimpse(d)
Rows: 354
Columns: 2
$ x <chr> "joke 1", "joke 1", "joke 1", "joke 1", "joke 1", "joke 1", "joke 1"…
$ y <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
My approximation to Kruschke’s data looks like this.
d |>
  ggplot(aes(x = y)) +
geom_bar(fill = sl[5]) +
scale_x_continuous(breaks = 1:7) +
scale_y_continuous(NULL, breaks = NULL, expand = expansion(mult = c(0, 0.05))) +
facet_wrap(~ x, ncol = 1)
Here we fit the cumulative-normal model based on our version of the data, continuing to allow both the latent mean and standard deviation to vary by group.
fit23.7 <- brm(
  data = d,
  family = cumulative(probit),
  bf(y ~ 1 + x) +
    lf(disc ~ 0 + x, cmc = FALSE),
  prior = c(prior(normal(0, 4), class = Intercept),
            prior(normal(0, 4), class = b),
            prior(normal(0, 1), class = b, dpar = disc)),
  iter = 3000, warmup = 1000, chains = 4, cores = 4,
  seed = 23,
  file = "fits/fit23.07")
Check the model summary.
print(fit23.7)
Family: cumulative
Links: mu = probit; disc = log
Formula: y ~ 1 + x
disc ~ 0 + x
Data: d (Number of observations: 354)
Draws: 4 chains, each with iter = 3000; warmup = 1000; thin = 1;
total post-warmup draws = 8000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept[1] 0.08 0.09 -0.11 0.26 1.00 4586 5209
Intercept[2] 0.37 0.09 0.19 0.54 1.00 6484 5951
Intercept[3] 0.65 0.09 0.47 0.83 1.00 7716 5929
Intercept[4] 0.87 0.10 0.68 1.06 1.00 7589 6623
Intercept[5] 1.26 0.12 1.03 1.49 1.00 5985 5797
Intercept[6] 1.81 0.16 1.51 2.13 1.00 6906 6579
xjoke6 0.39 0.10 0.20 0.58 1.00 6747 5827
disc_xjoke6 0.50 0.12 0.27 0.73 1.00 4011 5214
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Save and wrangle the posterior draws, then use our compare_thresholds() function to compare the brms parameterization of the thresholds with Kruschke’s.

draws <- as_draws_df(fit23.7)

draws |>
  select(`b_Intercept[1]`:`b_Intercept[6]`, .draw) |>
  compare_thresholds(lb = 1.5, ub = 6.5)
Given our data are only approximations of Kruschke’s, I think we did pretty good. Here are our brms versions of the marginal posteriors for the group-level means and standard deviations, their differences, and the effect size.
draws |>
  transmute(`mu[Joke~1]` = 0,
`mu[Joke~6]` = b_xjoke6,
`sigma[Joke~1]` = 1,
`sigma[Joke~6]` = 1 / exp(b_disc_xjoke6)) |>
mutate(`mu[Joke~6]-mu[Joke~1]` = `mu[Joke~6]` - `mu[Joke~1]`,
`sigma[Joke~6]-sigma[Joke~1]` = `sigma[Joke~6]` - `sigma[Joke~1]`) |>
mutate(`(mu[Joke~6]-mu[Joke~1])/sqrt((sigma[Joke~1]^2+sigma[Joke~6]^2)/2)` = (`mu[Joke~6]-mu[Joke~1]`) / sqrt((`sigma[Joke~1]`^2 + `sigma[Joke~6]`^2) / 2)) |>
pivot_longer(cols = everything()) |>
mutate(name = factor(name,
levels = c("mu[Joke~1]", "mu[Joke~6]", "mu[Joke~6]-mu[Joke~1]",
"sigma[Joke~1]", "sigma[Joke~6]", "sigma[Joke~6]-sigma[Joke~1]",
"(mu[Joke~6]-mu[Joke~1])/sqrt((sigma[Joke~1]^2+sigma[Joke~6]^2)/2)"))) |>
ggplot(aes(x = value)) +
stat_halfeye(point_interval = mode_hdi, .width = 0.95,
color = sl[8], fill = sl[4], normalize = "panels") +
scale_y_continuous(NULL, breaks = NULL) +
xlab(NULL) +
facet_wrap(~ name, labeller = label_parsed, scales = "free")
Here are our versions of the two panels in the upper right of Figure 23.5.
set.seed(23)
pp_check(fit23.7, type = "bars_grouped", fatten = 2, group = "x", ndraws = 100) +
scale_x_continuous("y", breaks = 1:7) +
scale_y_continuous(NULL, breaks = NULL, expand = expansion(mult = c(0, 0.05))) +
ggtitle("Data with posterior predictions",
subtitle = expression(list(italic(N["joke "*1])==177, italic(N["joke "*6])==177))) +
theme(legend.position = "none")
Now here’s the corresponding model where we treat the y
data as metric.
mean_y <- mean(d$y)
sd_y <- sd(d$y)

stanvars <- stanvar(mean_y, name = "mean_y") +
  stanvar(sd_y, name = "sd_y")

fit23.8 <- brm(
  data = d,
  family = gaussian,
  bf(y ~ 0 + x, sigma ~ 0 + x),
  prior = c(prior(normal(mean_y, sd_y * 100), class = b),
            prior(normal(0, exp(sd_y)), class = b, dpar = sigma)),
  chains = 4, cores = 4,
  stanvars = stanvars,
  seed = 23,
  file = "fits/fit23.08")
Check the summary.
print(fit23.8)
Family: gaussian
Links: mu = identity; sigma = log
Formula: y ~ 0 + x
sigma ~ 0 + x
Data: d (Number of observations: 354)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
xjoke1 2.41 0.14 2.14 2.69 1.00 3958 3066
xjoke6 2.86 0.12 2.61 3.09 1.00 4498 2809
sigma_xjoke1 0.64 0.05 0.54 0.75 1.00 4485 3035
sigma_xjoke6 0.52 0.05 0.42 0.63 1.00 5132 3413
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Make the marginal posteriors, including the effect size.
as_draws_df(fit23.8) |>
mutate(`Joke 1 Mean` = b_xjoke1,
`Joke 6 Mean` = b_xjoke6,
`Joke 1 Std. Dev.` = exp(b_sigma_xjoke1),
`Joke 6 Std. Dev.` = exp(b_sigma_xjoke6)) |>
mutate(`Difference of Means` = `Joke 6 Mean` - `Joke 1 Mean`,
`Difference of Std. Devs` = `Joke 6 Std. Dev.` - `Joke 1 Std. Dev.`) |>
mutate(`Effect Size` = `Difference of Means` / sqrt((`Joke 1 Std. Dev.`^2 + `Joke 6 Std. Dev.`^2) / 2)) |>
pivot_longer(cols = `Joke 1 Mean`:`Effect Size`) |>
mutate(name = factor(name,
levels = c("Joke 1 Mean", "Joke 1 Std. Dev.",
"Joke 6 Mean", "Joke 6 Std. Dev.",
"Difference of Means", "Difference of Std. Devs",
"Effect Size"))) |>
ggplot(aes(x = value)) +
stat_halfeye(point_interval = mode_hdi, .width = 0.95,
color = sl[8], fill = sl[4], normalize = "panels") +
scale_y_continuous(NULL, breaks = NULL) +
labs(x = "Marginal posterior",
subtitle = "These are based on the conventional Gaussian model, NOT the cumulative-normal\nmodel Kruschke displayed in Figure 23.5") +
facet_wrap(~ name, ncol = 2, scales = "free")
If you think you have a better approximation of Kruschke’s data, please share.
23.4 The case of metric predictors
“This type of model is often referred to as ordinal probit regression or ordered probit regression because the probit function is the link function corresponding to the cumulative-normal inverse-link function” (p. 688, emphasis in the original).
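As a quick refresher in R, the probit link is qnorm() and the cumulative-normal inverse link is pnorm(); the two undo one another.

# The probit maps a probability to a z-score;
# the cumulative normal maps the z-score back to a probability
qnorm(0.75)         # 0.6744898
pnorm(qnorm(0.75))  # 0.75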
23.4.1 Implementation in JAGS brms.
This model is easy to specify in brms. Just make sure to think clearly about your priors.
23.4.2 Example: Happiness and money.
Load the data for the next model.
my_data <- read_csv("data.R/OrdinalProbitData-LinReg-2.csv")
glimpse(my_data)
Rows: 200
Columns: 2
$ X <dbl> 1.386389, 1.223879, 1.454505, 1.112068, 1.222715, 1.545099, 1.360256…
$ Y <dbl> 1, 1, 5, 5, 1, 4, 6, 2, 5, 4, 1, 4, 4, 4, 4, 6, 1, 1, 6, 2, 1, 7, 1,…
Take a quick look at the data.
my_data |>
  ggplot(aes(x = X, y = Y)) +
geom_point(alpha = 1/3, color = sl[9]) +
scale_y_continuous(breaks = 1:7)
Kruschke standardized his predictor within his model code. Here we’ll standardize X
before fitting the model.
my_data <- my_data |>
  mutate(X_s = (X - mean(X)) / sd(X))
Fit the model.
fit23.9 <- brm(
  data = my_data,
  family = cumulative(probit),
  Y ~ 1 + X_s,
  prior = c(prior(normal(0, 4), class = Intercept),
            prior(normal(0, 4), class = b)),
  iter = 3000, warmup = 1000, chains = 4, cores = 4,
  seed = 23,
  file = "fits/fit23.09")
Check the summary.
print(fit23.9)
Family: cumulative
Links: mu = probit; disc = identity
Formula: Y ~ 1 + X_s
Data: my_data (Number of observations: 200)
Draws: 4 chains, each with iter = 3000; warmup = 1000; thin = 1;
total post-warmup draws = 8000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept[1] -1.18 0.12 -1.42 -0.94 1.00 4929 5788
Intercept[2] -0.76 0.11 -0.98 -0.55 1.00 6648 6423
Intercept[3] -0.29 0.10 -0.50 -0.09 1.00 8515 6940
Intercept[4] 0.25 0.11 0.04 0.45 1.00 9773 7244
Intercept[5] 0.73 0.11 0.51 0.95 1.00 8933 6684
Intercept[6] 1.26 0.13 1.01 1.52 1.00 8414 7401
X_s 1.16 0.10 0.97 1.35 1.00 6232 6258
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
disc 1.00 0.00 1.00 1.00 NA NA NA
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Extract the posterior draws and compare the brms parameterization of the thresholds with Kruschke's.
draws <- as_draws_df(fit23.9)

draws |>
  select(`b_Intercept[1]`:`b_Intercept[6]`, .draw) |>
  compare_thresholds(lb = 1.5, ub = 6.5)
Here's the marginal distribution of b_X_s, the effect size for our standardized predictor X_s.
draws |>
  ggplot(aes(x = b_X_s)) +
stat_halfeye(point_interval = mode_hdi, .width = 0.95,
color = sl[8], fill = sl[4]) +
scale_y_continuous(NULL, breaks = NULL)
This next plot differs from Kruschke's workflow. For a quick look at the model, we can use conditional_effects() within plot().
conditional_effects(fit23.9) |>
plot(line_args = list(color = sl[7], fill = sl[3]))
Here’s a more elaborated version of the same plot, this time depicting the model with 100 fitted lines randomly drawn from the posterior.
set.seed(23)
conditional_effects(fit23.9,
ndraws = 100,
spaghetti = TRUE) |>
plot(points = TRUE,
line_args = list(size = 0),
point_args = list(alpha = 1/3, color = sl[9]),
spaghetti_args = list(colour = alpha(sl[6], 0.2)))
Note the warning message. There was a similar one in the first plot, which I suppressed for simplicity's sake. The message suggests that treating the fitted lines as "continuous variables" might lead to a deceptive plot. Here's what happens if we follow the suggestion.
set.seed(23)
conditional_effects(fit23.9, categorical = TRUE)
Recall that our thresholds, the $\theta$'s, carve the latent scale into the probabilities of each Y value, given X_s. That is, at lower values of X_s, lower values of Y are more probable. At higher values of X_s, higher values of Y are more probable.
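To make that concrete, here's a minimal by-hand sketch of how the thresholds slice the latent standard normal into response probabilities, using the posterior means from fit23.9 at X_s = 0. This is the computation fitted() will do for us, draw by draw, in just a bit.

# Posterior means for the six thresholds and the slope
theta <- fixef(fit23.9)[1:6, 1]
mu    <- fixef(fit23.9)[7, 1] * 0  # the latent mean at X_s = 0

# p(Y = k) = Phi(theta_k - mu) - Phi(theta_{k - 1} - mu),
# with theta_0 = -Inf and theta_7 = Inf
diff(c(0, pnorm(theta - mu), 1))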
It might be useful to get more practice with this model. Here we'll use fitted() to make a similar plot, depicting the model with many fitted lines instead of summary statistics.
# How many fitted lines do you want?
n_draws <- 50
# Define the `X_s` values you want to condition on
# because the lines are nonlinear, you'll need many of them
nd <- tibble(X_s = seq(from = -2, to = 2, by = 0.05))

f <- fitted(fit23.9,
            newdata = nd,
            ndraws = n_draws,
            summary = FALSE)
# Inspect the output
str(f)
num [1:50, 1:81, 1:7] 0.8 0.898 0.789 0.869 0.874 ...
- attr(*, "dimnames")=List of 3
..$ : chr [1:50] "1" "2" "3" "4" ...
..$ : NULL
..$ : chr [1:7] "1" "2" "3" "4" ...
Our output came in three dimensions. We have 50 rows, corresponding to n_draws <- 50
(i.e., 50 posterior draws). There are 81 columns, based on how we defined the X_s
values within our nd
data (i.e., seq(from = -2, to = 2, by = 0.05)
). The third dimension has seven levels, one corresponding to each of the seven levels of our criterion variable Y
. Here’s a way to rearrange that output into a useful format for plotting.
# Rearrange the output
rbind(
  f[, , 1],
  f[, , 2],
  f[, , 3],
  f[, , 4],
  f[, , 5],
  f[, , 6],
  f[, , 7]) |>
  # Wrangle
data.frame() |>
set_names(nd |> pull(X_s)) |>
mutate(draw = rep(1:n_draws, times = 7),
rating = rep(1:7, each = n_draws)) |>
pivot_longer(cols = -c(draw, rating), names_to = "X_s", values_to = "probability") |>
mutate(rating = str_c("Y: ", rating),
X_s = X_s |> as.double()) |>
# Plot
ggplot(aes(x = X_s, y = probability,
group = interaction(draw, rating),
color = rating)) +
geom_line(alpha = 1/2, linewidth = 1/4) +
scale_y_continuous(breaks = c(0, 0.5, 1), expand = c(0, 0), limits = c(0, 1)) +
scale_color_scico_d(palette = "lajolla", end = 0.85, direction = -1) +
theme(legend.position = "none") +
facet_wrap(~ rating, ncol = 7)
So far, we've been plotting our model within the context of the default scale = "response" setting within fitted(). Within the context of the response variable Y, our model returns response probabilities. We can also look at the output within the context of scale = "linear". We'll plot these fitted lines across our nd values two different ways. For the first, p1, we'll use summary statistics. For the second, p2, we'll set summary = FALSE and work with individual posterior draws.
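In symbols, what scale = "linear" returns is the latent linear predictor, which is a linear function of X_s; the response probabilities come from slicing the cumulative normal at the thresholds, which is where the nonlinearity enters:

$$p(Y \leq k \mid X_s) = \Phi(\theta_k - \beta X_s), \qquad p(Y = k \mid X_s) = \Phi(\theta_k - \beta X_s) - \Phi(\theta_{k - 1} - \beta X_s).$$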
# Adjust the predictor data
nd <- tibble(X_s = seq(from = -2.5, to = 2.5, by = 0.1))
# Use summary statistics
p1 <- fitted(fit23.9,
             newdata = nd,
             scale = "linear") |>
data.frame() |>
bind_cols(nd) |>
ggplot(aes(x = X_s, y = Estimate, ymin = Q2.5, ymax = Q97.5)) +
geom_smooth(stat = "identity",
alpha = 1/2, color = sl[7], fill = sl[3]) +
labs(y = "underlying standard normal",
title = "summary statistics")
# Set `summary = FALSE`
set.seed(23)
p2 <- fitted(fit23.9,
             newdata = nd,
             ndraws = n_draws,
             scale = "linear",
             summary = FALSE) |>
data.frame() |>
set_names(nd |> pull(X_s)) |>
mutate(draw = 1:n_draws) |>
pivot_longer(cols = -draw, names_to = "X_s") |>
mutate(X_s = X_s |> as.double()) |>
ggplot(aes(x = X_s, y = value, group = draw)) +
geom_line(alpha = 1/2, color = sl[7]) +
scale_y_continuous(NULL, breaks = NULL) +
ggtitle("50 posterior draws")
# Combine, adjust, and display
p1 + p2 &
  coord_cartesian(ylim = c(-2, 2))
Both methods returned the fitted lines in the metric of the underlying standard normal distribution. The fitted lines are nonlinear in the metric of the raw data Y
, but they’re linear in the metric of the presumed underlying distribution. If it helps, we’ll make a marginal plot of the standard normal distribution and tack it onto the right.
# Make Phi
p3 <- tibble(x = seq(from = -3, to = 3, by = 0.01)) |>
  mutate(density = dnorm(x)) |>
ggplot(aes(x = x, y = density)) +
geom_area(fill = sl[5]) +
# Add the thresholds!
geom_vline(xintercept = posterior_summary(fit23.9)[1:6, 1],
color = sl[9], linetype = 3) +
# Mark the thresholds with the axis breaks
scale_x_reverse(NULL, breaks = fixef(fit23.9)[1:6, 1], position = "top",
labels = parse(text = str_c("theta[", 1:6, "]"))) +
scale_y_continuous(NULL, breaks = NULL) +
ggtitle(expression(italic(N)(0,1))) +
coord_flip(xlim = c(-2, 2))
# Combine, adjust, and display
(((p1 | p2) &
    geom_hline(yintercept = posterior_summary(fit23.9)[1:6, 1],
               color = sl[9], linetype = 3) &
    coord_cartesian(ylim = c(-2, 2))) | p3) +
  plot_layout(widths = c(4, 4, 1))
The lines intersecting the plots are the posterior means for the thresholds, $\theta_1$ through $\theta_6$.
But these still aren’t faithful depictions of the top panel of Figure 23.7, you say. Okay, fine. One of the distinctive elements of that panel is the left-tilted bar-and-error plots. If you look closely at the vertical lines at their bases, you’ll see that the leftmost subplot starts at the minimum value of X
and the rightmost subplot starts at the maximum value of X
. Since our plots, so far, have been based on X_s
, we’ll use the minimum and maximum values from that. Here are those values.
(r <- range(my_data$X_s))
[1] -1.774444 1.750168
To my eye, it appears that the three middle subplots are equally distributed between those at the ends. If we proceed under that assumption, here’s how we might use fitted()
to get us rolling on computing the relevant values.
nd <- tibble(X_s = seq(from = r[1], to = r[2], length.out = 5))

f <- fitted(fit23.9, newdata = nd)
# Inspect the output
str(f)
num [1:5, 1:4, 1:7] 0.807183 0.444354 0.124257 0.015638 0.000907 ...
- attr(*, "dimnames")=List of 3
..$ : NULL
..$ : chr [1:4] "Estimate" "Est.Error" "Q2.5" "Q97.5"
..$ : chr [1:7] "P(Y = 1)" "P(Y = 2)" "P(Y = 3)" "P(Y = 4)" ...
Here we’ll rearrange the output to make it useful for plotting.
# Rearrange the output
f <- rbind(
  f[, , 1],
  f[, , 2],
  f[, , 3],
  f[, , 4],
  f[, , 5],
  f[, , 6],
  f[, , 7]) |>
  # Wrangle
data.frame() |>
bind_cols(expand_grid(nd, Y = 1:7) |> arrange(Y, X_s))
head(f)
Estimate Est.Error Q2.5 Q97.5 X_s Y
1 0.8071829772 0.0469968777 0.7067266277 0.889247368 -1.77444380 1
2 0.4443543291 0.0485583395 0.3494432061 0.540381426 -0.89329094 1
3 0.1242573951 0.0248619516 0.0805231604 0.177020747 -0.01213809 1
4 0.0156381288 0.0067075474 0.0057884052 0.032051758 0.86901477 1
5 0.0009066288 0.0007959035 0.0001158735 0.003025388 1.75016762 1
6 0.0925281535 0.0252089786 0.0500434480 0.146541824 -1.77444380 2
Now we can make those bar-and-error plots.
f |>
  mutate(X_s = round(X_s, digits = 3)) |>
ggplot(aes(x = Y, y = Estimate,
ymin = Q2.5, ymax = Q97.5)) +
geom_col(fill = sl[4]) +
geom_pointrange(color = sl[7], fatten = 1.5, linewidth = 1) +
scale_x_continuous(breaks = 1:7) +
scale_y_reverse(NULL, position = "right", breaks = c(1, 0.5, 0), expand = c(0, 0),
labels = c(1, ".5", 0), limits = c(1, 0)) +
coord_flip() +
facet_wrap(~ X_s, ncol = 7, strip.position = "bottom")
The X_s
values are depicted in the panel strips on the bottom. The response probabilities are scaled based on the axis on the top. The points and leftmost sides of the bars are the posterior means. The thin, dark horizontal lines are the percentile-based 95% intervals. Here we reformat f
a little more to put those bar-and-error plots in a format more similar to that of Figure 23.7.
f |>
  select(-Est.Error) |>
# Rescale the probability summaries
mutate_at(.vars = vars(Estimate:Q97.5), .funs = \(x) x / 2) |>
# Plot
ggplot() +
geom_vline(xintercept = seq(from = r[1], to = r[2], length.out = 5),
color = sl[2]) +
# Bar marking the Estimate
geom_segment(aes(x = X_s, xend = X_s - Estimate,
y = Y + 0.1, yend = Y + 0.1),
color = sl[4], linewidth = 8) +
# Bar marking the 95% interval
geom_segment(aes(x = X_s - Q2.5, xend = X_s - Q97.5,
y = Y + 0.2, yend = Y + 0.2),
color = sl[7], linewidth = 1) +
# Data
geom_point(data = my_data,
aes(x = X_s, y = Y),
color = sl[9], shape = 1, size = 2) +
scale_y_continuous("Y", breaks = 1:7, limits = c(0.5, 7.5)) +
coord_cartesian(xlim = c(-2.4, 2.4))
I’m not going to attempt superimposing fitted lines on this plot the way Kruschke did. Given that our ordered-probit model is nonlinear on the scale of the criterion, it seems misleading to present linear fitted lines atop the raw data. If you’d like to do so, you’re on your own.
Now here's the corresponding model if we treat the y data as metric, with tricks from Chapter 17.
sd_x <- sd(my_data$X)
sd_y <- sd(my_data$Y)
m_x <- mean(my_data$X)
m_y <- mean(my_data$Y)

beta_0_sigma <- 10 * abs(m_x * sd_y / sd_x)
beta_1_sigma <- 10 * abs(sd_y / sd_x)

stanvars <- stanvar(beta_0_sigma, name = "beta_0_sigma") +
  stanvar(beta_1_sigma, name = "beta_1_sigma") +
  stanvar(sd_y, name = "sd_y")

fit23.10 <- brm(
  data = my_data,
  family = gaussian,
  Y ~ 1 + X,
  prior = c(prior(normal(0, beta_0_sigma), class = Intercept),
            prior(normal(0, beta_1_sigma), class = b),
            prior(normal(0, sd_y), class = sigma)),
  chains = 4, cores = 4,
  stanvars = stanvars,
  seed = 23,
  file = "fits/fit23.10")
It may not have been readily apparent from Kruschke’s prose in the note for Figure 23.7, but his OLS model was based on the fully unstandardized data (i.e., using X
as the predictor), not the partially standardized data he used in his JAGS code from 23.4.1. We followed the same sensibilities for fit23.10
. Speaking of which, here are the summaries for the marginal posteriors.
posterior_summary(fit23.10)[1:3, ] |> round(digits = 3)
Estimate Est.Error Q2.5 Q97.5
b_Intercept -5.412 0.629 -6.648 -4.199
b_X 6.701 0.440 5.849 7.570
sigma 1.527 0.076 1.385 1.686
These values are very close to those he reported at the bottom of page 690. Here's what the fitted lines from that model look like superimposed on the data, presuming both variables are metric.
set.seed(23)
conditional_effects(fit23.10,
ndraws = 100,
spaghetti = TRUE) |>
plot(points = TRUE,
line_args = list(size = 0),
point_args = list(alpha = 1/3, color = sl[9]),
spaghetti_args = list(colour = alpha(sl[6], 0.2)))
For the next example, we’ll load the HappinessAssetsDebt.csv
data from Shi (2009).
my_data <- read_csv("data.R/HappinessAssetsDebt.csv")
glimpse(my_data)
Rows: 6,759
Columns: 3
$ Happiness <dbl> 3, 3, 3, 3, 1, 3, 2, 2, 4, 2, 3, 5, 3, 3, 4, 3, 3, 2, 4, 3, …
$ Assets <dbl> 0, 10000, 30000, 40000, 21000, 20000, 20000, 0, 0, 20000, 50…
$ Debt <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5000, 0, 0, 0, 0, 0, 0, 0, 0, …
Here's a quick scatter plot of the data. To help with the overplotting, the points have been vertically jittered a bit, though as indicated in the text, Happiness is a discrete variable.
set.seed(23)
my_data |>
  ggplot(aes(x = Assets, y = Happiness)) +
geom_jitter(alpha = 1/4, color = sl[9], height = 0.25) +
scale_x_continuous(expand = expansion(0, 0.05))
Standardize our predictor.
my_data <- my_data |>
  mutate(Assets_s = (Assets - mean(Assets)) / sd(Assets))
Fit the model like before.
fit23.11 <- brm(
  data = my_data,
  family = cumulative(probit),
  Happiness ~ 1 + Assets_s,
  prior = c(prior(normal(0, 4), class = Intercept),
            prior(normal(0, 4), class = b)),
  iter = 3000, warmup = 1000, chains = 4, cores = 4,
  seed = 23,
  file = "fits/fit23.11")
Check the summary.
print(fit23.11)
Family: cumulative
Links: mu = probit; disc = identity
Formula: Happiness ~ 1 + Assets_s
Data: my_data (Number of observations: 6759)
Draws: 4 chains, each with iter = 3000; warmup = 1000; thin = 1;
total post-warmup draws = 8000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept[1] -2.03 0.03 -2.10 -1.96 1.00 5122 4419
Intercept[2] -1.17 0.02 -1.20 -1.13 1.00 7825 6240
Intercept[3] -0.15 0.02 -0.18 -0.12 1.00 8527 6266
Intercept[4] 1.48 0.02 1.44 1.53 1.00 9010 6816
Assets_s 0.15 0.01 0.12 0.17 1.00 8160 4947
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
disc 1.00 0.00 1.00 1.00 NA NA NA
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Extract the posterior draws and compare the brms parameterization of the thresholds with Kruschke's.
draws <- as_draws_df(fit23.11)

draws |>
  select(`b_Intercept[1]`:`b_Intercept[4]`, .draw) |>
  compare_thresholds(lb = 1.5, ub = 4.5)
Behold the marginal distribution of b_Assets_s
, our effect size for Assets
.
draws |>
  ggplot(aes(x = b_Assets_s)) +
stat_halfeye(point_interval = mode_hdi, .width = 0.95,
color = sl[8], fill = sl[4]) +
scale_y_continuous(NULL, breaks = NULL)
Here’s the fitted()
-oriented preparatory work for our version of the top panel of Figure 23.8.
# Define the range for the predictor
r <- range(my_data$Assets_s)

# Re-define the new data
nd <- tibble(Assets_s = seq(from = r[1], to = r[2], length.out = 5))

# Compute the fitted summaries
f <- fitted(fit23.11, newdata = nd)
# Rearrange the output
f <- rbind(
  f[, , 1],
  f[, , 2],
  f[, , 3],
  f[, , 4],
  f[, , 5]) |>
  # Wrangle
data.frame() |>
bind_cols(expand_grid(nd, Happiness = 1:5) |> arrange(Happiness, Assets_s))
# Examine
head(f)
Estimate Est.Error Q2.5 Q97.5 Assets_s Happiness
1 2.611798e-02 2.133404e-03 2.215960e-02 3.047004e-02 -0.5971712 1
2 3.159217e-03 7.290196e-04 1.953182e-03 4.774445e-03 4.7565107 1
3 2.339827e-04 1.329509e-04 6.859301e-05 5.735401e-04 10.1101925 1
4 1.151040e-05 1.296018e-05 1.039798e-06 4.714594e-05 15.4638743 1
5 4.162833e-07 8.774536e-07 6.685249e-09 2.528812e-06 20.8175561 1
6 1.147304e-01 4.207162e-03 1.065825e-01 1.230356e-01 -0.5971712 2
As with the analogous plot from Figure 23.7, we will not be superimposing linear fitted lines. The model is nonlinear on the scale of the data, and I don't want to confuse readers by pretending otherwise.
f |>
  select(-Est.Error) |>
mutate_at(.vars = vars(Estimate:Q97.5), .funs = \(x) x * 2.5) |>
ggplot() +
geom_vline(xintercept = seq(from = r[1], to = r[2], length.out = 5),
color = sl[2]) +
geom_segment(aes(x = Assets_s, xend = Assets_s - Estimate,
y = Happiness + 0.1, yend = Happiness + 0.1),
color = sl[4], linewidth = 8) +
geom_segment(aes(x = Assets_s - Q2.5, xend = Assets_s - Q97.5,
y = Happiness + 0.2, yend = Happiness + 0.2),
color = sl[7], linewidth = 1) +
geom_point(data = my_data,
aes(x = Assets_s, y = Happiness),
color = sl[9], shape = 1, size = 2) +
scale_y_continuous("Happiness", breaks = 1:5, limits = c(0.5, 5.5)) +
coord_cartesian(xlim = c(-4, 24))
Now here's the corresponding model if we treat Happiness as metric. Unlike our method for the corresponding model from Figure 23.7, fit23.10, we will use the standardized version of the predictor, Assets_s. The unstandardized values for Happiness and Assets are on vastly different scales, which can be difficult for HMC with broad priors of the type Kruschke often uses. Standardizing the predictor helps.
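If you'd like to see the mismatch for yourself, here's a quick diagnostic (output omitted; the point is the two raw variables differ by several orders of magnitude):

# Compare the raw scales of the criterion and the predictor
my_data |>
  summarise(sd_Happiness = sd(Happiness),
            sd_Assets    = sd(Assets))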
sd_y <- sd(my_data$Happiness)
stanvars <- stanvar(sd_y, name = "sd_y")

fit23.12 <- brm(
  data = my_data,
  family = gaussian,
  Happiness ~ 1 + Assets_s,
  prior = c(prior(normal(3.5, 5), class = Intercept),
            prior(normal(0, 5), class = b),
            prior(normal(0, sd_y), class = sigma)),
  chains = 4, cores = 4,
  stanvars = stanvars,
  seed = 23,
  file = "fits/fit23.12")
Here are the summaries for the marginal posteriors.
posterior_summary(fit23.12)[1:3, ] |> round(digits = 6)
Estimate Est.Error Q2.5 Q97.5
b_Intercept 3.484393 0.010273 3.464044 3.504634
b_Assets_s 0.115658 0.010255 0.095015 0.136033
sigma 0.847120 0.007128 0.833300 0.860943
They're just a bit different from those produced by OLS. Here's what the fitted lines from that model look like superimposed on the data, presuming both variables are metric.
set.seed(23)
conditional_effects(fit23.12,
ndraws = 100,
spaghetti = TRUE) |>
plot(points = TRUE,
line_args = list(size = 0),
point_args = list(alpha = 1/3, color = sl[9]),
spaghetti_args = list(colour = alpha(sl[6], 0.2)))
23.4.3 Example: Movies–They don’t make ’em like they used to.
For this section, we’ll load the Moore (2006) Movies.csv
data.
my_data <- read_csv("data.R/Movies.csv")
glimpse(my_data)
Rows: 100
Columns: 6
$ Title <chr> "A_Ticklish_Affair", "Action_in_the_North_Atlantic", "And_…
$ Year <dbl> 1963, 1943, 1984, 1978, 1931, 1930, 1950, 1989, 1940, 1947…
$ Length <dbl> 89, 127, 138, 97, 77, 69, 93, 119, 70, 69, 101, 62, 97, 85…
$ Cast <dbl> 5, 7, 7, 5, 6, 8, 5, 8, 9, 9, 9, 6, 10, 10, 9, 6, 10, 6, 1…
$ Rating <dbl> 2.0, 3.0, 3.0, 3.0, 2.5, 2.5, 3.0, 2.5, 2.5, 2.0, 3.0, 2.0…
$ Description <dbl> 7, 9, 15, 11, 7, 10, 8, 15, 8, 8, 11, 10, 12, 13, 9, 7, 10…
In footnote #5 at the bottom of page 693, Kruschke explained that whereas the original Ratings
data ranged from 1.0
to 4.0
in half-unit increments, he recoded them to range from 1
to 7
. Here we recode Ratings
in the same way using dplyr::case_match()
. While we’re at it, we’ll make standardized versions of the predictors, too.
my_data <- my_data |>
  mutate(Rating = case_match(Rating,
1.0 ~ 1,
1.5 ~ 2,
2.0 ~ 3,
2.5 ~ 4,
3.0 ~ 5,
3.5 ~ 6,
4.0 ~ 7),
Year_s = (Year - mean(Year)) / sd(Year),
Length_s = (Length - mean(Length)) / sd(Length))
Here’s a scatter plot of the data, with points colored by Rating
.
my_data |>
  mutate(Rating = factor(Rating)) |>
ggplot(aes(x = Year, y = Length, label = Rating)) +
geom_point(aes(color = Rating),
size = 3) +
geom_text(size = 3) +
scale_color_scico_d(palette = "lajolla", begin = 0.35, end = 0.90, direction = -1)
Fitting the multivariable ordered-probit model with brms is about as simple as fitting any other multivariable model. Just tack on predictors with the +
operator.
fit23.13 <- brm(
  data = my_data,
  family = cumulative(probit),
  Rating ~ 1 + Year_s + Length_s,
  prior = c(prior(normal(0, 4), class = Intercept),
            prior(normal(0, 4), class = b)),
  iter = 3000, warmup = 1000, chains = 4, cores = 4,
  seed = 23,
  file = "fits/fit23.13")
Check the model summary.
print(fit23.13)
Family: cumulative
Links: mu = probit; disc = identity
Formula: Rating ~ 1 + Year_s + Length_s
Data: my_data (Number of observations: 100)
Draws: 4 chains, each with iter = 3000; warmup = 1000; thin = 1;
total post-warmup draws = 8000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept[1] -1.69 0.20 -2.11 -1.30 1.00 6014 5318
Intercept[2] -0.91 0.15 -1.22 -0.62 1.00 8443 5737
Intercept[3] -0.22 0.13 -0.48 0.04 1.00 8669 7066
Intercept[4] 0.61 0.14 0.34 0.88 1.00 8518 7047
Intercept[5] 1.69 0.20 1.31 2.10 1.00 9009 6885
Intercept[6] 2.59 0.37 1.95 3.39 1.00 9103 6786
Year_s -0.49 0.13 -0.74 -0.24 1.00 6982 5936
Length_s 0.62 0.13 0.37 0.88 1.00 6975 6019
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
disc 1.00 0.00 1.00 1.00 NA NA NA
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Extract the posterior draws and compare the brms parameterization of the thresholds with Kruschke's.
draws <- as_draws_df(fit23.13)

draws |>
  select(`b_Intercept[1]`:`b_Intercept[6]`, .draw) |>
  compare_thresholds(lb = 1.5, ub = 6.5)
To sate any curiosity, here are the correlations among the parameters.
# Wrangle
as_draws_df(fit23.13) |>
select(contains("b_Intercept"), b_Year_s, b_Length_s) |>
set_names(str_c("theta[", 1:6, "]"), str_c("beta[", 1:2, "]")) |>
# Plot
ggpairs(upper = list(continuous = my_upper),
diag = list(continuous = my_diag),
lower = list(continuous = my_lower),
labeller = label_parsed) +
ggtitle("These parameters have milder correlations")
Now focus on the marginal distribution of our two effect-size parameters.
draws |>
  pivot_longer(cols = ends_with("_s")) |>
mutate(name = factor(name, levels = c("b_Year_s", "b_Length_s"))) |>
ggplot(aes(x = value)) +
stat_halfeye(point_interval = mode_hdi, .width = 0.95,
color = sl[8], fill = sl[4], normalize = "panels") +
scale_y_continuous(NULL, breaks = NULL) +
xlab("effect size") +
facet_wrap(~ name, scales = "free")
Before we make the top panel from Figure 23.9, I’d like to wander a bit and look at something related. We’ll use fitted()
.
# Define the new data
nd <- crossing(Year_s = seq(from = -3, to = 3, by = 0.25),
               Length_s = seq(from = -3, to = 3, by = 0.25))
# Compute the `Response` probabilities
f <- fitted(fit23.13, newdata = nd)

# Rearrange the output
f <- rbind(
  f[, , 1],
  f[, , 2],
  f[, , 3],
  f[, , 4],
  f[, , 5],
  f[, , 6],
  f[, , 7]) |>
  # Wrangle
  data.frame() |>
  bind_cols(
    nd |>
      expand_grid(Rating = 1:7) |>
      arrange(Rating, Year_s))
# What did we do?
head(f)
Estimate Est.Error Q2.5 Q97.5 Year_s Length_s Rating
1 0.11652439 0.07893031 0.018596666 0.31399252 -3 -3.00 1
2 0.08997773 0.06485363 0.013122143 0.25771735 -3 -2.75 1
3 0.06826646 0.05244825 0.008879778 0.20764293 -3 -2.50 1
4 0.05092608 0.04181886 0.005860147 0.16105489 -3 -2.25 1
5 0.03738891 0.03293782 0.003661057 0.12565676 -3 -2.00 1
6 0.02704712 0.02568048 0.002199393 0.09533809 -3 -1.75 1
We just computed the response probabilities across the two-dimensional grid of the predictor values. Now plot using the posterior means.
f |>
  mutate(strip = str_c("Rating: ", Rating)) |>
ggplot(aes(x = Year_s, y = Length_s)) +
geom_raster(aes(fill = Estimate),
interpolate = TRUE) +
geom_text(data = my_data |> mutate(strip = str_c("Rating: ", Rating)),
aes(label = Rating),
color = "white", size = 2.5) +
scale_x_continuous(breaks = seq(from = -2, to = 2, by = 2),
expand = c(0, 0)) +
scale_y_continuous(breaks = seq(from = -2, to = 2, by = 2),
expand = c(0, 0)) +
scale_fill_scico(palette = 'lajolla', breaks = c(0, 0.5, 1),
labels = c(0, ".5", 1), limits = c(0, 1)) +
theme(legend.position = "bottom") +
facet_wrap(~ strip, nrow = 1)
This model didn't do a great job capturing the Rating probabilities. If you're curious, you'll find you can do a little better if you allow the two predictors to interact (i.e., add + Year_s:Length_s to the formula line). Even then, the model isn't great. I leave that as an exercise for the interested reader, though there's a sketch just below to get you started.
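Here's what that interaction model might look like. The fit23.13b name and the file path are mine, and I haven't fit or summarized this model:

# A sketch only: add an interaction between the standardized predictors
fit23.13b <- brm(
  data = my_data,
  family = cumulative(probit),
  Rating ~ 1 + Year_s + Length_s + Year_s:Length_s,
  prior = c(prior(normal(0, 4), class = Intercept),
            prior(normal(0, 4), class = b)),
  iter = 3000, warmup = 1000, chains = 4, cores = 4,
  seed = 23,
  file = "fits/fit23.13b")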
For this model, however, we will follow Kruschke and make a more faithful version of the top panel of Figure 23.9. We’ll need to wrangle our draws
data a bit to get things ready. Here’s the work.
draws <- draws |>
  # We just need the data from three steps in the HMC chain
  slice(1:3) |>
mutate(.draw = .draw |> as.factor()) |>
rename(b1 = b_Year_s,
b2 = b_Length_s) |>
# Because these are straight lines, two extreme x1-values are all we need
expand_grid(x1 = c(-10, 10)) |>
pivot_longer(cols = contains("b_Intercept"), names_to = "theta") |>
# Use Kruschke's Formula 23.5
mutate(x2 = (value / b2) + (-b1 / b2) * x1,
# This just renames our x variables for easy plotting
Year_s = x1,
Length_s = x2)
glimpse(draws)
Rows: 36
Columns: 20
$ b1 <dbl> -0.5650158, -0.5650158, -0.5650158, -0.5650158, -0.5650…
$ b2 <dbl> 0.6424351, 0.6424351, 0.6424351, 0.6424351, 0.6424351, …
$ disc <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ `Intercept[1]` <dbl> -1.650021, -1.650021, -1.650021, -1.650021, -1.650021, …
$ `Intercept[2]` <dbl> -0.8908118, -0.8908118, -0.8908118, -0.8908118, -0.8908…
$ `Intercept[3]` <dbl> -0.10894122, -0.10894122, -0.10894122, -0.10894122, -0.…
$ `Intercept[4]` <dbl> 0.4600140, 0.4600140, 0.4600140, 0.4600140, 0.4600140, …
$ `Intercept[5]` <dbl> 1.469098, 1.469098, 1.469098, 1.469098, 1.469098, 1.469…
$ `Intercept[6]` <dbl> 2.494688, 2.494688, 2.494688, 2.494688, 2.494688, 2.494…
$ lprior <dbl> -18.84353, -18.84353, -18.84353, -18.84353, -18.84353, …
$ lp__ <dbl> -181.6031, -181.6031, -181.6031, -181.6031, -181.6031, …
$ .chain <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ .iteration <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2…
$ .draw <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2…
$ x1 <dbl> -10, -10, -10, -10, -10, -10, 10, 10, 10, 10, 10, 10, -…
$ theta <chr> "b_Intercept[1]", "b_Intercept[2]", "b_Intercept[3]", "…
$ value <dbl> -1.6500213, -0.8908118, -0.1089412, 0.4600140, 1.469098…
$ x2 <dbl> -11.363295, -10.181527, -8.964485, -8.078862, -6.508144…
$ Year_s <dbl> -10, -10, -10, -10, -10, -10, 10, 10, 10, 10, 10, 10, -…
$ Length_s <dbl> -11.363295, -10.181527, -8.964485, -8.078862, -6.508144…
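The x2 line in the block above is Kruschke's Formula 23.5. In this notation, the threshold line for a given $\theta_k$ is the level contour where the latent linear predictor equals that threshold, solved for $x_2$:

$$\beta_1 x_1 + \beta_2 x_2 = \theta_k \quad \implies \quad x_2 = \frac{\theta_k}{\beta_2} - \frac{\beta_1}{\beta_2} x_1.$$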
Now just plot.
draws |>
  ggplot(aes(x = Year_s, y = Length_s)) +
geom_line(aes(group = interaction(.draw, theta), color = theta, linetype = .draw)) +
geom_text(data = my_data,
aes(label = Rating)) +
scale_color_scico_d(expression(theta), palette = "lajolla", end = 0.85, direction = -1, labels = 1:6) +
coord_cartesian(xlim = range(my_data$Year_s),
ylim = range(my_data$Length_s))
It might be easier to see Kruschke’s main point if we facet by .draw
.
draws |>
  ggplot(aes(x = Year_s, y = Length_s)) +
geom_line(aes(group = interaction(.draw, theta), color = theta, linetype = .draw)) +
geom_text(data = my_data,
aes(label = Rating)) +
scale_color_scico_d(expression(theta), palette = "lajolla", end = 0.85, direction = -1, labels = 1:6) +
coord_cartesian(xlim = range(my_data$Year_s),
ylim = range(my_data$Length_s)) +
theme(legend.direction = "horizontal",
legend.position = c(0.75, 0.25)) +
facet_wrap(~ .draw, ncol = 2)
Both our versions of the plot show what Kruschke pointed out in the text:
Threshold lines from the same step in the chain must be parallel because the regression coefficients are constant at that step, but are different at another step. The threshold lines in Figure 23.9 are level contours on the underlying metric planar surface, and the lines reveal that the ratings increase toward the top left, that is, as Year decreases and Length increases. (p. 693)
Before we move on to the next section, what these diagonal two-dimensional threshold lines also hint at is that when we use two predictors to describe ordinal data as having been produced by an underlying unit Gaussian distribution, that underlying distribution is actually bivariate Gaussian. Here we'll use fitted() one more time to depict that bivariate-Gaussian distribution with a little geom_raster().
# Define the new data
nd <- crossing(Year_s = seq(from = -5, to = 5, by = 0.1),
               Length_s = seq(from = -5, to = 5, by = 0.1))
fitted(fit23.13,
newdata = nd,
# This will yield z-scores
scale = "linear") |>
data.frame() |>
bind_cols(nd) |>
# Convert the z-scores to density values
mutate(density = dnorm(Estimate, 0, 1)) |>
ggplot(aes(x = Year_s, y = Length_s)) +
geom_raster(aes(fill = density),
interpolate = TRUE) +
geom_text(data = my_data,
aes(label = Rating),
size = 2.5) +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0)) +
scale_fill_scico("density", palette = 'lajolla',
limits = c(0, 0.4))
As with many of our previous approaches with geom_raster()
, this plot is based on the posterior means in each cell and, therefore, does a poor job depicting the uncertainty in the posterior distribution.
23.4.4 Why are some thresholds outside the data?
Now load Kruschke’s simulated data.
my_data <- read_csv("data.R/OrdinalProbitData-Movies.csv")
glimpse(my_data)
Rows: 400
Columns: 3
$ Year <dbl> 1959, 1946, 1964, 1938, 1946, 1971, 1957, 1970, 1968, 1962, 192…
$ Length <dbl> 88, 117, 130, 85, 111, 105, 93, 119, 78, 77, 138, 107, 70, 60, …
$ Rating <dbl> 4, 5, 5, 4, 5, 4, 3, 4, 3, 3, 7, 4, 2, 2, 5, 5, 3, 4, 3, 2, 2, …
These data imitate the movie ratings, but with two key differences. First and foremost, the artificial data have much smaller noise, with
as opposed to in the real data. Second, the artificial data have points that span the entire range of both predictors, unlike the real data which had points mostly in the central region. (p. 695)
Like with the real movie data, we’ll inspect these data with a colored scatter plot.
my_data |>
  mutate(Rating = factor(Rating)) |>
ggplot(aes(x = Year, y = Length, label = Rating)) +
geom_point(aes(color = Rating),
size = 3) +
geom_text(size = 3) +
scale_color_scico_d(palette = "lajolla", begin = 0.35, end = 0.9, direction = -1)
Unlike in the last section, there appears to be a clear trend in Kruschke's simulated data. Kruschke's simulated critic liked his movies old and long. Time to standardize the predictors.
my_data <- my_data |>
  mutate(Year_s = (Year - mean(Year)) / sd(Year),
         Length_s = (Length - mean(Length)) / sd(Length))
Since the model formula is unchanged, we can simply pass the new data to update().
fit23.14 <- update(
  fit23.13,
  newdata = my_data,
  iter = 3000, warmup = 1000, chains = 4, cores = 4,
  seed = 23,
  file = "fits/fit23.14")
Check the model summary.
print(fit23.14)
Family: cumulative
Links: mu = probit; disc = identity
Formula: Rating ~ 1 + Year_s + Length_s
Data: my_data (Number of observations: 400)
Draws: 4 chains, each with iter = 3000; warmup = 1000; thin = 1;
total post-warmup draws = 8000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept[1] -9.77 0.67 -11.10 -8.51 1.00 1930 2823
Intercept[2] -5.96 0.43 -6.82 -5.14 1.01 2085 3268
Intercept[3] -2.25 0.21 -2.67 -1.84 1.00 3297 4854
Intercept[4] 2.43 0.21 2.02 2.85 1.00 3086 4399
Intercept[5] 7.79 0.55 6.73 8.89 1.00 2045 2886
Intercept[6] 10.88 0.77 9.41 12.43 1.00 2121 3422
Year_s -2.80 0.20 -3.20 -2.42 1.00 2018 3068
Length_s 4.73 0.32 4.12 5.37 1.00 1912 2866
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
disc 1.00 0.00 1.00 1.00 NA NA NA
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Extract the posterior draws and use compare_thresholds()
to make our expanded version of the bottom panel of Figure 23.10.
draws <- as_draws_df(fit23.14)

draws |>
  select(`b_Intercept[1]`:`b_Intercept[6]`, .draw) |>
  compare_thresholds(lb = 1.5, ub = 6.5)
Make the marginal distribution of our two effect-size parameters.
draws |>
  pivot_longer(cols = ends_with("_s")) |>
mutate(name = factor(name, levels = c("b_Year_s", "b_Length_s"))) |>
ggplot(aes(x = value)) +
stat_halfeye(point_interval = mode_hdi, .width = 0.95,
color = sl[8], fill = sl[4], normalize = "panels") +
scale_y_continuous(NULL, breaks = NULL) +
xlab("effect size") +
facet_wrap(~ name, scales = "free")
Make the top panel for Figure 23.10, just like we did for its analogue in Figure 23.9.
as_draws_df(fit23.14) |>
slice(1:3) |>
mutate(.draw = .draw |> as.factor()) |>
rename(b1 = b_Year_s,
b2 = b_Length_s) |>
expand_grid(x1 = c(-10, 10)) |>
pivot_longer(cols = contains("b_Intercept"), names_to = "theta") |>
mutate(x2 = (value / b2) + (-b1 / b2) * x1,
Year_s = x1,
Length_s = x2) |>
ggplot(aes(x = Year_s, y = Length_s)) +
geom_line(aes(group = interaction(.draw, theta), color = theta, linetype = .draw)) +
geom_text(data = my_data,
aes(label = Rating)) +
scale_color_scico_d(expression(theta), palette = "lajolla", end = 0.85, direction = -1, labels = 1:6) +
coord_cartesian(xlim = range(my_data$Year_s),
ylim = range(my_data$Length_s))
Those are some tight thresholds. They “very clearly cleave parallel regions of distinct ordinal values. The example demonstrates that the threshold lines do make intuitive sense when there is low noise and a broad range of data” (p. 695, emphasis in the original).
With our various bonus plots, we've been anticipating Figure 23.11 for some time now. The thresholds from fit23.14
result in beautifully nonlinear probability curves for the Rating
levels. Take a quick look with conditional_effects()
.
ce <- conditional_effects(fit23.14, categorical = TRUE)
plot(ce, plot = FALSE)[[1]] +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0), limits = c(0, 1)) +
scale_fill_scico_d(palette = "lajolla", end = 0.85, direction = -1) +
scale_color_scico_d(palette = "lajolla", end = 0.85, direction = -1)
plot(ce, plot = FALSE)[[2]] +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0), limits = c(0, 1)) +
scale_fill_scico_d(palette = "lajolla", end = 0.85, direction = -1) +
scale_color_scico_d(palette = "lajolla", end = 0.85, direction = -1)
Because the model had two predictors, we got two plots. What brms::conditional_effects() called Probability is the probability of each of the seven Rating levels, conditional on Year_s or Length_s. Whereas Kruschke marked off his different outcomes by line styles, ours were marked by color. Since we don't have the data Kruschke used to make Figure 23.11, we won't be able to reproduce it exactly. However, you'll note that our plot for Length_s corresponded nicely with his subplot on the top. By setting effects = "Length_s", we can use conditional_effects() to make a similar plot based on our model of the real movie data, fit23.13.
ce <- conditional_effects(fit23.13,
                          categorical = TRUE,
                          effects = "Length_s")
plot(ce, plot = FALSE)[[1]] +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = expansion(mult = c(0, 0.05)), limits = c(0, NA)) +
scale_fill_scico_d(palette = "lajolla", end = 0.85, direction = -1) +
scale_color_scico_d(palette = "lajolla", end = 0.85, direction = -1)
“You can see that each outcome has maximum probability within its corresponding interval, but there is considerable smearing of outcomes into adjacent intervals” (p. 695).
Finishing off, Kruschke clarified that his discussion in the text referred to $\sigma$
as “noise” merely for linguistic ease. Calling the outcomes “noisy” does not mean the underlying generator of the outcomes is inherently wildly random. The “noise” is merely variation in the outcome that cannot be accounted for by the particular model we have chosen with the particular predictors we have chosen. A different model and/or different predictors might account for the outcomes well with little residual noise. In this sense, the noise is in the model, not in the data. (p. 698, emphasis in the original)
Through this lens, noisy-looking data are a symptom of weak theory and/or poor data-collection methods.
23.5 Posterior prediction
The cumulative-normal model makes posterior predictions for the probabilities of each level of the ordinal criterion. To practice, let's revisit our happiness model, fit23.11. Here's that model's formula.
fit23.11$formula
Happiness ~ 1 + Assets_s
With brms, you can get this information with fitted(). Let's say we wanted to focus on the response probabilities for a case whose standardized assets are one standard deviation below the mean (i.e., Assets_s = -1). Here's what we get.
fitted(fit23.11,
newdata = tibble(Assets_s = -1))
, , P(Y = 1)
Estimate Est.Error Q2.5 Q97.5
[1,] 0.02997066 0.002483131 0.02533447 0.03505768
, , P(Y = 2)
Estimate Est.Error Q2.5 Q97.5
[1,] 0.1247085 0.004814203 0.1154987 0.1342713
, , P(Y = 3)
Estimate Est.Error Q2.5 Q97.5
[1,] 0.3435083 0.006438852 0.3306628 0.3562814
, , P(Y = 4)
Estimate Est.Error Q2.5 Q97.5
[1,] 0.450502 0.007042199 0.436879 0.4643867
, , P(Y = 5)
Estimate Est.Error Q2.5 Q97.5
[1,] 0.05131061 0.002930612 0.04583096 0.05721674
As is typical of brms, those probability summaries are in terms of the posterior mean and percentile-based 95% intervals. If you're like Kruschke and prefer posterior modes and HDIs, you'll need to set summary = FALSE and wrangle a bit.
f <- fitted(fit23.11,
            newdata = tibble(Assets_s = -1),
            summary = FALSE)

cbind(
  f[, , 1],
  f[, , 2],
  f[, , 3],
  f[, , 4],
  f[, , 5]) |>
  data.frame() |>
set_names(str_c("p(Happiness = ", 1:5, " | Assets_s = -1)")) |>
pivot_longer(cols = everything()) |>
group_by(name) |>
mode_hdi(value) |>
mutate_if(is.double, .funs = round, digits = 4)
# A tibble: 5 × 7
name value .lower .upper .width .point .interval
<chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 p(Happiness = 1 | Assets_s = -1) 0.0294 0.0253 0.035 0.95 mode hdi
2 p(Happiness = 2 | Assets_s = -1) 0.125 0.116 0.134 0.95 mode hdi
3 p(Happiness = 3 | Assets_s = -1) 0.344 0.330 0.356 0.95 mode hdi
4 p(Happiness = 4 | Assets_s = -1) 0.450 0.437 0.464 0.95 mode hdi
5 p(Happiness = 5 | Assets_s = -1) 0.0513 0.0457 0.0571 0.95 mode hdi
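As a final aside, fitted() returns the probabilities of the ordinal levels. If you instead want posterior draws of the ordinal outcomes themselves, a little predict() sketch might look like this:

set.seed(23)

# One ordinal category per posterior draw; tabulating the draws
# approximates the same probabilities as above
predict(fit23.11,
        newdata = tibble(Assets_s = -1),
        summary = FALSE) |>
  table() |>
  proportions()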
23.6 Generalizations and extensions
In this section, Kruschke mentioned extensions of this class of models, such as embedding the thresholded cumulative-normal likelihood within the kinds of more elaborate model structures covered in earlier chapters (e.g., hierarchical models).
In addition, there are other likelihoods one might use to model ordinal data using brms. Your first stop should be Bürkner and Vuorre's (2019) Ordinal regression models in psychology: A Tutorial, where, in addition to the cumulative normal model, they cover the sequential and adjacent category models. You might also check out Chapter 11 of my (2023) ebook recoding McElreath's Statistical rethinking, wherein I show how one might use the logit link (i.e., family = cumulative(logit)
) to fit ordered-categorical models with brms.
If you’d just like more practice with the cumulative probit model, you might check out my blog post called Notes on the Bayesian cumulative probit.
Session info
sessionInfo()
R version 4.4.3 (2025-02-28)
Platform: aarch64-apple-darwin20
Running under: macOS Ventura 13.4
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/Chicago
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] GGally_2.2.1 tidybayes_3.0.7 brms_2.22.0 Rcpp_1.0.14
[5] patchwork_1.3.0 scico_1.5.0 lubridate_1.9.3 forcats_1.0.0
[9] stringr_1.5.1 dplyr_1.1.4 purrr_1.0.4 readr_2.1.5
[13] tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.1 tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] tidyselect_1.2.1 svUnit_1.0.6 farver_2.1.2
[4] loo_2.8.0 fastmap_1.1.1 TH.data_1.1-2
[7] tensorA_0.36.2.1 digest_0.6.37 estimability_1.5.1
[10] timechange_0.3.0 lifecycle_1.0.4 StanHeaders_2.32.10
[13] survival_3.8-3 magrittr_2.0.3 posterior_1.6.0
[16] compiler_4.4.3 rlang_1.1.5 tools_4.4.3
[19] utf8_1.2.4 yaml_2.3.8 knitr_1.49
[22] labeling_0.4.3 bridgesampling_1.1-2 htmlwidgets_1.6.4
[25] curl_6.0.1 pkgbuild_1.4.4 bit_4.0.5
[28] plyr_1.8.9 RColorBrewer_1.1-3 multcomp_1.4-26
[31] abind_1.4-8 withr_3.0.2 stats4_4.4.3
[34] grid_4.4.3 inline_0.3.19 xtable_1.8-4
[37] colorspace_2.1-1 emmeans_1.10.3 scales_1.3.0
[40] MASS_7.3-64 cli_3.6.4 mvtnorm_1.2-5
[43] rmarkdown_2.29 crayon_1.5.3 generics_0.1.3
[46] RcppParallel_5.1.7 rstudioapi_0.16.0 reshape2_1.4.4
[49] tzdb_0.4.0 rstan_2.32.6 splines_4.4.3
[52] bayesplot_1.11.1 parallel_4.4.3 matrixStats_1.4.1
[55] vctrs_0.6.5 V8_4.4.2 Matrix_1.7-2
[58] sandwich_3.1-1 jsonlite_1.8.9 hms_1.1.3
[61] arrayhelpers_1.1-0 bit64_4.0.5 ggdist_3.3.2
[64] glue_1.8.0 ggstats_0.6.0 codetools_0.2-20
[67] distributional_0.5.0 stringi_1.8.4 gtable_0.3.6
[70] QuickJSR_1.1.3 munsell_0.5.1 pillar_1.10.2
[73] htmltools_0.5.8.1 Brobdingnag_1.2-9 R6_2.6.1
[76] vroom_1.6.5 evaluate_1.0.1 lattice_0.22-6
[79] backports_1.5.0 rstantools_2.4.0 gridExtra_2.3
[82] coda_0.19-4.1 nlme_3.1-167 checkmate_2.3.2
[85] xfun_0.49 zoo_1.8-12 pkgconfig_2.0.3