23 Ordinal Predicted Variable
This chapter considers data that have an ordinal predicted variable. For example, we might want to predict people’s happiness ratings on a \(1\)-to-\(7\) scale as a function of their total financial assets. Or we might want to predict ratings of movies as a function of the year they were made.
One traditional treatment of this sort of data structure is called ordinal or ordered probit regression. We will consider a Bayesian approach to this model. As usual, in Bayesian software, it is easy to generalize the traditional model so it is robust to outliers, allows different variances within levels of a nominal predictor, or has hierarchical structure to share information across levels or factors as appropriate.
In the context of the generalized linear model (GLM) introduced in Chapter 15, this chapter’s situation involves an inverse-link function that is a thresholded cumulative normal with a categorical distribution for describing noise in the data, as indicated in the fourth row of Table 15.2 (p. 443). For a reminder of how this chapter’s combination of predicted and predictor variables relates to other combinations, see Table 15.3 (p. 444). (Kruschke, 2015, p. 671, emphasis in the original)
We might follow Kruschke’s advice and rewind a little before tackling this chapter. The cumulative normal function is denoted \(\Phi(x; \mu, \sigma)\), where \(x\) is some real number and \(\mu\) and \(\sigma\) have their typical meaning. The function is cumulative in that it produces values ranging from zero to 1. Figure 15.8 showed an example of the normal distribution atop of the cumulative normal. Here it is, again.
library(tidyverse)
<-
d tibble(z = seq(from = -3, to = 3, by = .1)) %>%
# add the density values
mutate(`p(z)` = dnorm(z, mean = 0, sd = 1),
# add the CDF values
`Phi(z)` = pnorm(z, mean = 0, sd = 1))
head(d)
## # A tibble: 6 × 3
## z `p(z)` `Phi(z)`
## <dbl> <dbl> <dbl>
## 1 -3 0.00443 0.00135
## 2 -2.9 0.00595 0.00187
## 3 -2.8 0.00792 0.00256
## 4 -2.7 0.0104 0.00347
## 5 -2.6 0.0136 0.00466
## 6 -2.5 0.0175 0.00621
It’s time to talk color and theme. For this chapter, we’ll take our color palette from the scico package (Thomas Lin Pedersen & Crameri, 2021), which provides 17 perceptually-uniform and colorblind-safe palettes based on the work of Fabio Crameri. Our palette of interest will be "lajolla"
.
library(scico)
<- scico(palette = "lajolla", n = 9)
sl
::show_col(sl) scales
Our overall plot theme will be based on ggplot2::theme_linedraw()
with just a few adjustments.
theme_set(
theme_linedraw() +
theme(panel.grid = element_blank(),
strip.background = element_rect(color = sl[9], fill = sl[9]),
strip.text = element_text(color = sl[1]))
)
Now plot!
<-
p1 %>%
d ggplot(aes(x = z, y = `p(z)`)) +
geom_area(aes(fill = z <= 1),
show.legend = F) +
geom_line(linewidth = 1, color = sl[3]) +
scale_fill_manual(values = c("transparent", sl[2])) +
scale_y_continuous(expand = expansion(mult = c(0, 0.05))) +
labs(title = "Normal Density",
y = expression(p(italic(z))))
<-
p2 %>%
d ggplot(aes(x = z, y = `Phi(z)`)) +
geom_area(aes(fill = z <= 1),
show.legend = F) +
geom_line(linewidth = 1, color = sl[3]) +
scale_fill_manual(values = c("transparent", sl[2])) +
scale_y_continuous(expand = expansion(mult = c(0, 0)), limits = 0:1) +
labs(title = "Cumulative Normal",
y = expression(Phi(italic(z))))
# combine and adjust with patchwork
library(patchwork)
/ p2 &
p1 scale_x_continuous(breaks = -2:2) &
coord_cartesian(xlim = c(-2.5, 2.5))
For both plots, z
is in a standardized metric (i.e., \(z\)-score). With the cumulative normal function, the cumulative probability \(\Phi(z)\) increases nonlinearly with the \(z\)-scores such that, much like with the logistic curve, the greatest change occurs around \(z = 0\) and tapers off in the tails.
The inverse of \(\Phi(x)\) is the probit function. As indicated in the above block quote, we’ll be making extensive use of the probit function in this chapter for our Bayesian models.
23.1 Modeling ordinal data with an underlying metric variable
You can imagine that the distribution of ordinal values might not resemble a normal distribution, even though the underlying metric values are normally distributed. Figure 23.1 shows some examples of ordinal outcome probabilities generated from an underlying normal distribution. The horizontal axis is the underlying continuous metric value. Thresholds are plotted as vertical dashed lines, labeled \(\theta\). In all examples, the ordinal scale has \(7\) levels, and hence, there are \(6\) thresholds. The lowest threshold is set at \(\theta_1 = 1.5\) (to separate outcomes \(1\) and \(2\)), and the highest threshold is set at \(\theta_1 = 6.5\) (to separate outcomes \(6\) and \(7\)). The normal curve in each panel shows the distribution of underlying continuous values. What differs across panels are the settings of means, standard deviations, and remaining thresholds. (p. 672)
The various Figure 23.1 subplots require a lot of ins and outs. We’ll start with the top panel and build from there. Here is how we might make the values necessary for the density curve.
<-
den # define the parameters for the underlying normal distribution
tibble(mu = 4,
sigma = 1.5) %>%
mutate(strip = str_c("mu==", mu, "~~sigma==", sigma)) %>%
# this will allow us to rescale the density in terms of the bar plot
mutate(multiplier = 26 / dnorm(mu, mu, sigma)) %>%
# we need values for the x-axis
expand_grid(y = seq(from = -1, to = 9, by = .1)) %>%
# compute the density values
mutate(density = dnorm(y, mu, sigma)) %>%
# use that multiplier value from above to rescale the density values
mutate(percent = density * multiplier)
head(den)
## # A tibble: 6 × 7
## mu sigma strip multiplier y density percent
## <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 4 1.5 mu==4~~sigma==1.5 97.8 -1 0.00103 0.101
## 2 4 1.5 mu==4~~sigma==1.5 97.8 -0.9 0.00128 0.125
## 3 4 1.5 mu==4~~sigma==1.5 97.8 -0.8 0.00159 0.155
## 4 4 1.5 mu==4~~sigma==1.5 97.8 -0.7 0.00196 0.192
## 5 4 1.5 mu==4~~sigma==1.5 97.8 -0.6 0.00241 0.236
## 6 4 1.5 mu==4~~sigma==1.5 97.8 -0.5 0.00295 0.289
Before making the data for the bar portion of the plot, we’ll need to define the \(\theta\)-values they’ll be placed between. We also need to define the exact points on the \(x\)-axis from which we’d like those bars to originate. Those points, which we’ll call label_1
, will double as names for the individual bars.
<- seq(from = 1.5, to = 6.5, by = 1)) (theta_1
## [1] 1.5 2.5 3.5 4.5 5.5 6.5
<- 1:7) (label_1
## [1] 1 2 3 4 5 6 7
Now we can define the data for the bars.
<-
bar # define the parameters for the underlying normal distribution
tibble(mu = 4,
sigma = 1.5) %>%
mutate(strip = str_c("mu==", mu, "~~sigma==", sigma)) %>%
# take random draws from the underlying normal distribution
mutate(draw = map2(mu, sigma, ~rnorm(1e4, mean = .x, sd = .y))) %>%
unnest(draw) %>%
# bin those draws into ordinal categories defined by `theta_1`
# and named by `label_1`
mutate(y = case_when(
< theta_1[1] ~ label_1[1],
draw < theta_1[2] ~ label_1[2],
draw < theta_1[3] ~ label_1[3],
draw < theta_1[4] ~ label_1[4],
draw < theta_1[5] ~ label_1[5],
draw < theta_1[6] ~ label_1[6],
draw >= theta_1[6] ~ label_1[7]
draw %>%
)) # summarize
count(y) %>%
mutate(percent = (100 * n / sum(n)) %>% round(0)) %>%
mutate(percent_label = str_c(percent, "%"),
percent_max = max(percent))
head(bar)
## # A tibble: 6 × 5
## y n percent percent_label percent_max
## <int> <int> <dbl> <chr> <dbl>
## 1 1 470 5 5% 27
## 2 2 1104 11 11% 27
## 3 3 2080 21 21% 27
## 4 4 2671 27 27% 27
## 5 5 2160 22 22% 27
## 6 6 1047 10 10% 27
Make the top subplot.
%>%
bar ggplot(aes(x = y, y = percent)) +
geom_area(data = den,
fill = sl[3]) +
geom_vline(xintercept = theta_1, color = "white", linetype = 3) +
geom_col(width = .5, alpha = .85, fill = sl[7]) +
geom_text(aes(y = percent + 2, label = percent_label),
size = 3.5) +
annotate(geom = "text",
x = theta_1, y = -5.55, label = theta_1,
size = 3) +
scale_x_continuous(NULL, expand = c(0, 0),
breaks = theta_1,
labels = parse(text = str_c("theta[", 1:6, "]"))) +
scale_y_continuous(NULL, breaks = NULL, expand = expansion(mult = c(0, 0.06))) +
coord_cartesian(ylim = c(0, 28.5), clip = F) +
theme(plot.margin = margin(5.5, 5.5, 11, 5.5)) +
facet_wrap(~ strip, labeller = label_parsed)
This method works okay for plotting one or two panels. The sheer number of code lines and moving parts seem unwieldy for plotting four. It’d be convenient if we could save the density information for all four panels in one data object. Here’s one way how.
<-
den tibble(panel = 1:4,
mu = c(4, 1, 4, 4),
sigma = c(1.5, 2.5, 1, 3)) %>%
mutate(strip = factor(panel,
labels = str_c("mu==", mu, "~~sigma==", sigma),
ordered = T)) %>%
mutate(multiplier = c(26, 58, 24, 26) / dnorm(mu, mu, sigma)) %>%
expand_grid(y = seq(from = -1, to = 9, by = .1)) %>%
mutate(density = dnorm(y, mu, sigma)) %>%
mutate(percent = density * multiplier)
head(den)
## # A tibble: 6 × 8
## panel mu sigma strip multiplier y density percent
## <int> <dbl> <dbl> <ord> <dbl> <dbl> <dbl> <dbl>
## 1 1 4 1.5 mu==4~~sigma==1.5 97.8 -1 0.00103 0.101
## 2 1 4 1.5 mu==4~~sigma==1.5 97.8 -0.9 0.00128 0.125
## 3 1 4 1.5 mu==4~~sigma==1.5 97.8 -0.8 0.00159 0.155
## 4 1 4 1.5 mu==4~~sigma==1.5 97.8 -0.7 0.00196 0.192
## 5 1 4 1.5 mu==4~~sigma==1.5 97.8 -0.6 0.00241 0.236
## 6 1 4 1.5 mu==4~~sigma==1.5 97.8 -0.5 0.00295 0.289
Notice we added a panel
column for indexing the subplots. Next we’ll need to define theta_[i]
and label_[i]
values for the remaining plots.
<- c(1.5, 3.1, 3.7, 4.3, 4.9, 6.5)
theta_3 <- c(1.5, 2.25, 3, 5, 5.75, 6.5)
theta_4
<- c(1, 2.2, 3.4, 4, 4.6, 5.7, 7)
label_3 <- c(1, 1.875, 2.625, 4, 5.375, 6.125, 7) label_4
Since the values are the same for the top two panels, we didn’t bother defining a theta_2
or label_2
. Now we have all the theta_[i]
and label_[i]
values, we’ll want to make a function that can use them within case_when()
for any of the four panels. Here’s one way to make such a function, which we’ll call make_ordinal()
.
<- function(x, panel) {
make_ordinal
if (panel < 3) {
case_when(
< theta_1[1] ~ label_1[1],
x < theta_1[2] ~ label_1[2],
x < theta_1[3] ~ label_1[3],
x < theta_1[4] ~ label_1[4],
x < theta_1[5] ~ label_1[5],
x < theta_1[6] ~ label_1[6],
x >= theta_1[6] ~ label_1[7]
x
)
else if (panel == 3) {
}
case_when(
< theta_3[1] ~ label_3[1],
x < theta_3[2] ~ label_3[2],
x < theta_3[3] ~ label_3[3],
x < theta_3[4] ~ label_3[4],
x < theta_3[5] ~ label_3[5],
x < theta_3[6] ~ label_3[6],
x >= theta_3[6] ~ label_3[7]
x
)
else {
}
case_when(
< theta_4[1] ~ label_4[1],
x < theta_4[2] ~ label_4[2],
x < theta_4[3] ~ label_4[3],
x < theta_4[4] ~ label_4[4],
x < theta_4[5] ~ label_4[5],
x < theta_4[6] ~ label_4[6],
x >= theta_4[6] ~ label_4[7]
x
)
} }
Now put those values and our make_ordinal()
function to work to make the data for the bar plots.
set.seed(23)
<-
bar tibble(panel = 1:4,
mu = c(4, 1, 4, 4),
sigma = c(1.5, 2.5, 1, 3)) %>%
mutate(strip = factor(panel,
labels = str_c("mu==", mu, "~~sigma==", sigma),
ordered = T)) %>%
mutate(draw = map2(mu, sigma, ~rnorm(1e5, mean = .x, sd = .y))) %>%
unnest(draw) %>%
mutate(y = map2_dbl(draw, panel, make_ordinal)) %>%
group_by(panel, strip) %>%
count(y) %>%
mutate(percent = (100 * n / sum(n)) %>% round(0)) %>%
mutate(percent_label = str_c(percent, "%"),
percent_max = max(percent))
head(bar)
## # A tibble: 6 × 7
## # Groups: panel, strip [1]
## panel strip y n percent percent_label percent_max
## <int> <ord> <dbl> <int> <dbl> <chr> <dbl>
## 1 1 mu==4~~sigma==1.5 1 4763 5 5% 26
## 2 1 mu==4~~sigma==1.5 2 10844 11 11% 26
## 3 1 mu==4~~sigma==1.5 3 21174 21 21% 26
## 4 1 mu==4~~sigma==1.5 4 26256 26 26% 26
## 5 1 mu==4~~sigma==1.5 5 21233 21 21% 26
## 6 1 mu==4~~sigma==1.5 6 10951 11 11% 26
Like before, we added a panel
index. As our final preparatory step, we will make something of a super function with which we’ll plug the desired information into ggplot2, which will then make each subplot. Much of the plotting and data wrangling code will be the same across subplots. As far as I can tell, we only need to vary four parameters. First, we’ll want to be able to subset the data by panel
index. We’ll do that with the panel_n
argument. Second, we’ll want to select which of the theta_[i]
values we’d like to use in geom_vline()
, annotate()
, and scale_x_continuous()
. We’ll do that with the theta
argument. We’ll make a y_second_x
to pin down exactly where below the \(x\)-axis we’d like to put those secondary axis values defined by the theta_[i]
values. Finally, we’ll want an ylim_ub
parameter to set the upper limit of the \(y\)-axis with. The name of our four-parameter super function will be plot_bar_den()
.
<- function(panel_n, theta, y_second_x, ylim_ub) {
plot_bar_den
%>%
bar filter(panel == panel_n) %>%
ggplot(aes(x = y)) +
geom_area(data = den %>% filter(panel == panel_n),
aes(y = percent),
fill = sl[3]) +
geom_vline(xintercept = theta, color = "white", linetype = 3) +
geom_linerange(aes(ymin = 0, ymax = percent),
color = sl[7], alpha = .85, linewidth = 8) +
geom_text(aes(y = percent + (percent_max / 15), label = percent_label),
size = 3.5) +
annotate(geom = "text", x = theta, y = y_second_x,
label = theta, size = 3) +
scale_x_continuous(NULL,
breaks = theta,
labels = parse(text = str_c("theta[", 1:6, "]")),
expand = c(0, 0)) +
scale_y_continuous(NULL, breaks = NULL, expand = expansion(mult = c(0, 0.05))) +
coord_cartesian(ylim = c(0, ylim_ub),
clip = F) +
theme(plot.margin = margin(5.5, 5.5, 11, 5.5)) +
facet_wrap(~ strip, labeller = label_parsed)
}
Finally, make all four subplots and combine them with patchwork syntax!
<- plot_bar_den(panel_n = 1,
p1 theta = theta_1,
# y_second_x = -6.75,
y_second_x = -5.55,
ylim_ub = 28)
<- plot_bar_den(panel_n = 2,
p2 theta = theta_1,
# y_second_x = -15.5 * 3/4,
y_second_x = -12.37,
ylim_ub = 63)
<- plot_bar_den(panel_n = 3,
p3 theta = theta_3,
# y_second_x = -6.25 * 3/4,
y_second_x = -5.12,
ylim_ub = 25.75)
<- plot_bar_den(panel_n = 4,
p4 theta = theta_4,
# y_second_x = -6.75 * 3/4,
y_second_x = -5.55,
ylim_ub = 28)
/ p2 / p3 / p4 p1
Oh mamma. “The crucial concept in Figure 23.1 is that the probability of a particular ordinal outcome is the area under the normal curve between the thresholds of that outcome” (p. 672, emphasis in the original). In each of the subplots, we used six thresholds to discretize the continuous data into seven categories. More generally, we need \(K\) thresholds to make \(K + 1\) ordinal categories. To make this work,
the idea is that we consider the cumulative area under the normal up the high-side threshold, and subtract away the cumulative area under the normal up to the low-side threshold. Recall that the cumulative area under the standardized normal is denoted \(\Phi(z)\), as was illustrated in Figure 15.8 [which we remade at the top of this chapter]. Thus, the area under the normal to the left of \(\theta_k\) is \(\Phi((\theta_k - \mu) / \sigma)\), and the area under the normal to the left of \(\theta_{k - 1}\) is \(\Phi((\theta_{k - 1} - \mu) / \sigma)\). Therefore, the area under the normal curve between the two thresholds, which is the probability of outcome \(k\), is
\[p(y = k | \mu, \sigma, \{ \theta_j \}) = \Phi((\theta_k - \mu) / \sigma) - \Phi((\theta_{k - 1} - \mu) / \sigma)\]
[This equation] applies even to the least and greatest ordinal values if we append two “virtual” thresholds at \(- \infty\) and \(+ \infty\)…
Thus, a normally distributed underlying metric value can yield a clearly non-normal distribution of discrete ordinal values. This result does not imply that the ordinal values can be treated as if they were themselves metric and normally distributed; in fact it implies the opposite: We might be able to model a distribution of ordinal values as consecutive intervals of a normal distribution on an underlying metric scale with appropriately positioned thresholds. (pp. 674–675)
23.2 The case of a single group
Given a model with no predictors, “if there are \(K\) ordinal values, the model has \(K + 1\) parameters: \(\theta_1, \dots,\theta_{K - 1}, \mu\), and \(\sigma\). If you think about it a moment, you’ll realize that the parameter values trade-off and are undetermined” (p. 675). The solution Kruschke took throughout this chapter was to fix the two thresholds at the ends, \(\theta_1\) and \(\theta_{K - 1}\), to the constants
\[\begin{align*} \theta_1 \equiv 1 + 0.5 && \text{and} && \theta_{K - 1} \equiv K - 0.5. \end{align*}\]
For example, all four subplots from Figure 23.1 had \(K = 7\) categories, ranging from 1 to 7. Following Kruschke’s convention would mean setting the endmost thresholds to
\[\begin{align*} \theta_1 \equiv 1.5 && \text{and} && \theta_6 \equiv 6.5. \end{align*}\]
As we’ll see, there are other ways to parameterize these models.
23.2.1 Implementation in JAGS brms.
The syntax to fit a basic ordered probit model with brms::brm()
is pretty simple.
<-
fit brm(data = my_data,
family = cumulative(probit),
~ 1,
y prior(normal(0, 4), class = Intercept))
The family = cumulative(probit)
tells brms you’d like to use the probit link for the ordered-categorical data. It’s important to specify probit
because the brms default is to use the logit
link, instead. We’ll talk more about that approach at the end of this chapter.
Remember how, at the end of the last section, we said there are other ways to parameterize the ordered probit model? As it turns out, brms does not follow Kruschke’s approach for fixing the thresholds on the ends. Rather, brms freely estimates all thresholds, \(\theta_1,...,\theta_{K - 1}\), by fixing \(\mu = 0\) and \(\sigma = 1\). That is, instead of estimating \(\mu\) and \(\sigma\) from the normal cumulative density function \(\Phi(x)\), brms::brm()
uses the standard normal cumulative density function \(\Phi(z)\).
This all probably seems abstract. We’ll get a lot of practice comparing the two approaches as we go along. Each has its strengths and weaknesses. At this point, the thing to get is that when fitting a single-group ordered-probit model with the brm()
function, there will be no priors for \(\mu\) and \(\sigma\). We only have to worry about setting the priors for all \(K - 1\) thresholds. And because those thresholds are conditional on \(\Phi(z)\), we should think about their priors with respect to the scale of standard normal distribution. Thus, to continue on with Kruschke’s minimally-informative prior approach, something like prior(normal(0, 4), class = Intercept)
might be a good starting place. Do feel free to experiment with different settings.
23.2.2 Examples: Bayesian estimation recovers true parameter values.
The data for Kruschke’s first example come from his OrdinalProbitData-1grp-1.csv
file. Load the data.
<- read_csv("data.R/OrdinalProbitData-1grp-1.csv")
my_data_1
glimpse(my_data_1)
## Rows: 100
## Columns: 1
## $ Y <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
Plot the distribution for Y
.
%>%
my_data_1 mutate(Y = factor(Y)) %>%
ggplot(aes(x = Y)) +
geom_bar(fill = sl[5]) +
scale_y_continuous(expand = expansion(mult = c(0, 0.05)))
It looks a lot like the distribution of the data from one of the panels from Figure 23.1. Load brms.
library(brms)
Fit the first cumulative-probit model.
.1 <-
fit23brm(data = my_data_1,
family = cumulative(probit),
~ 1,
Y prior(normal(0, 4), class = Intercept),
iter = 3000, warmup = 1000, chains = 4, cores = 4,
seed = 23,
file = "fits/fit23.01")
Examine the model summary.
print(fit23.1)
## Family: cumulative
## Links: mu = probit; disc = identity
## Formula: Y ~ 1
## Data: my_data_1 (Number of observations: 100)
## Draws: 4 chains, each with iter = 3000; warmup = 1000; thin = 1;
## total post-warmup draws = 8000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept[1] 0.18 0.12 -0.06 0.42 1.00 9787 6646
## Intercept[2] 0.60 0.13 0.35 0.87 1.00 11397 6343
## Intercept[3] 1.04 0.15 0.75 1.34 1.00 11713 6088
## Intercept[4] 1.50 0.19 1.14 1.88 1.00 12416 6754
## Intercept[5] 1.97 0.25 1.51 2.50 1.00 12059 6627
## Intercept[6] 2.57 0.40 1.90 3.44 1.00 11463 6870
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## disc 1.00 0.00 1.00 1.00 NA NA NA
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
The brms output for these kinds of models names the thresholds \(\theta_{[i]}\) as Intercept[i]
. Again, whereas Kruschke identified his model by fixing \(\theta_1 = 1.5\) (i.e., \(1 + 0.5\)) and \(\theta_6 = 5.5\) (i.e., \(6 - 0.5\)), we freely estimated all six thresholds by using the cumulative density function for the standard normal. As a result, our thresholds are in a different metric from Kruschke’s.
Let’s extract the posterior draws.
<- as_draws_df(fit23.1)
draws
glimpse(draws)
## Rows: 8,000
## Columns: 12
## $ `b_Intercept[1]` <dbl> 0.44703738, 0.22443431, 0.30708260, -0.01127637, -0.18681541, 0.37826981, 0.0733989…
## $ `b_Intercept[2]` <dbl> 0.8253956, 0.5598467, 0.6366477, 0.3522155, 0.3877171, 0.7765147, 0.6104502, 0.5962…
## $ `b_Intercept[3]` <dbl> 1.0714275, 1.2367075, 0.9268432, 0.9791826, 1.0762356, 1.1229631, 0.8663489, 1.0003…
## $ `b_Intercept[4]` <dbl> 1.675261, 1.479992, 1.508809, 1.611100, 1.629725, 1.572215, 1.466856, 1.368460, 1.8…
## $ `b_Intercept[5]` <dbl> 2.439194, 1.817869, 2.224741, 2.034097, 1.955867, 1.749030, 2.591565, 1.605434, 2.2…
## $ `b_Intercept[6]` <dbl> 3.195665, 2.412770, 2.777146, 2.669718, 3.094114, 2.040137, 3.561389, 2.170847, 2.5…
## $ disc <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ lprior <dbl> -14.48757, -14.24420, -14.34068, -14.29838, -14.37510, -14.19703, -14.54015, -14.16…
## $ lp__ <dbl> -154.6614, -152.9830, -151.9284, -151.9778, -155.8266, -153.3221, -155.7108, -150.8…
## $ .chain <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ .iteration <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, …
## $ .draw <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, …
Wrangle the draws
a bit.
<-
draws %>%
draws select(.draw, `b_Intercept[1]`:`b_Intercept[6]`)
Here’s our brms version of the bottom plot of Figure 23.2.
# compute the posterior means for each threshold
<-
means %>%
draws summarise_at(vars(`b_Intercept[1]`:`b_Intercept[6]`), mean) %>%
pivot_longer(everything(),
values_to = "mean")
# wrangle
%>%
draws pivot_longer(-.draw, values_to = "threshold") %>%
group_by(.draw) %>%
mutate(theta_bar = mean(threshold)) %>%
# finally we plot
ggplot(aes(x = threshold, y = theta_bar, color = name)) +
geom_vline(data = means,
aes(xintercept = mean, color = name),
linetype = 2) +
geom_point(alpha = 1/10) +
scale_color_scico_d(palette = "lajolla", begin = .25) +
ylab("mean threshold") +
theme(legend.position = "none")
The initial means
data at the top contains the \(\theta_i\)-specific means, which we used to make the dashed vertical lines with geom_vline()
. Did you see what we did there with those group_by()
and mutate()
lines? That’s how we computed the mean threshold within each step of the HMC chain, what Kruschke (p. 680) denoted as \(\bar \theta (s) = \sum_k^{K-1} \theta_k (s) / (K - 1)\), where \(s\) refers to particular steps in the HMC chain.
Perhaps of greater interest, you might have noticed how different our plot is from the one in the text. We might should compare the results of our brms parameterization of \(\theta_{[i]}\) with one based on the parameterization in the text in an expanded version of the bottom plot of Figure 23.2. To convert our brms output to match Kruschke’s, we’ll rescale our \(\theta_{[i]}\) draws with help from the scales::rescale()
function, about which you might learn more here.
# primary data wrangling
<-
p bind_rows(
# brms parameterization
%>%
draws pivot_longer(-.draw, values_to = "threshold") %>%
group_by(.draw) %>%
mutate(theta_bar = mean(threshold)),
# Kruschke's parameterization
%>%
draws pivot_longer(-.draw, values_to = "threshold") %>%
group_by(.draw) %>%
mutate(threshold = scales::rescale(threshold, to = c(1.5, 6.5))) %>%
mutate(theta_bar = mean(threshold))
%>%
) # add an index
mutate(model = rep(c("brms parameterization", "Kruschke's parameterization"), each = n() / 2))
# compute the means by model and threshold for the vertical lines
<-
means %>%
p ungroup() %>%
group_by(model, name) %>%
summarise(mean = mean(threshold))
# plot!
%>%
p ggplot(aes(x = threshold, y = theta_bar)) +
geom_vline(data = means,
aes(xintercept = mean, color = name),
linetype = 2) +
geom_point(aes(color = name),
alpha = 1/10, size = 1/2) +
scale_color_scico_d(palette = "lajolla", begin = .25) +
ylab("mean threshold") +
theme(legend.position = "none") +
facet_wrap(~ model, ncol = 1, scales = "free")
We can take our rescaling approach further to convert the posterior distributions for \(\mu\) and \(\sigma\) from the brms \(\operatorname{Normal}(0, 1)\) constants to the metric from Kruschke’s approach. Say \(y_1\) and \(y_2\) are two draws from some Gaussian and \(z_1\) and \(z_2\) are their corresponding \(z\)-scores. Here’s how to solve for \(\sigma\).
\[\begin{align*} z_1 - z_2 & = \frac{(y_1 - \mu)}{\sigma} - \frac{(y_2 - \mu)}{\sigma} \\ & = \frac{(y_1 - \mu) - (y_2 - \mu)}{\sigma} \\ & = \frac{y_1 - \mu - y_2 + \mu}{\sigma} \\ & = \frac{y_1 - y_2}{\sigma}, \;\; \text{therefore} \\ \sigma & = \frac{y_1 - y_2}{z_1 – z_2}. \end{align*}\]
If you’d like to compute \(\mu\), it’s even simpler.
\[\begin{align*} z_1 & = \frac{y_1 - \mu}{\sigma} \\ z_1 \sigma & = y_1 - \mu \\ z_1 \sigma + \mu & = y_1, \;\; \text{therefore} \\ \mu & = y_1 - z_1 \sigma \end{align*}\]
Big shout out to my math-stats savvy friends academic twitter for the formulas, especially Ph.Demetri, Lukas Neugebauer, and Brenton Wiernik for walking out the formulas (see this twitter thread). For our application, Intercept[1]
and Intercept[6]
will be our two \(z\)-scores and Kruschke’s 1.5 and 6.5 will be their corresponding \(y\)-values.
library(tidybayes)
%>%
draws select(.draw, `b_Intercept[1]`, `b_Intercept[6]`) %>%
mutate(`y[1]` = 1.5,
`y[6]` = 6.5) %>%
mutate(mu = `y[1]` - `b_Intercept[1]` * 1,
sigma = (`y[1]` - `y[6]`) / (`b_Intercept[1]` - `b_Intercept[6]`)) %>%
mutate(`(mu-2)/sigma` = (mu - 2) / sigma) %>%
pivot_longer(mu:`(mu-2)/sigma`) %>%
mutate(name = factor(name, levels = c("mu", "sigma", "(mu-2)/sigma"))) %>%
ggplot(aes(x = value, y = 0)) +
stat_halfeye(point_interval = mode_hdi, .width = .95,
fill = sl[4], color = sl[8],
normalize = "panels") +
scale_y_continuous(NULL, breaks = NULL) +
facet_wrap(~ name, scales = "free", labeller = label_parsed)
Our results are similar to Kruschke’s. Given we used a different algorithm, a different parameterization, and different priors, I’m not terribly surprised they’re a little different. If you have more insight on the matter or have spotted a flaw in this method, please share with the rest of us.
It’s unclear, to me, how we’d interpret the effect size. The difficulty isn’t that Kruschke’s comparison of \(C = 2.0\) is arbitrary, but that we can only interpret the comparison given the model assumption of \(\theta_1 = 1.5\) and \(\theta_6 = 6.5\). If your theory doesn’t allow you to understand the meaning of those constants and why you’d prefer them to slightly different ones, you’d be fooling yourself if you attempted to interpret any effect sizes conditional on those values. Proceed with caution, friends.
In the large paragraph on the lower part of page 679, Kruschke discussed why the thresholds tend to have nontrivial covariances. This is what he was trying to convey with the bottom subplot in Figure 23.2. Just for practice, we might also explore the correlation matrix among the thresholds with a customized GGally::ggpairs()
plot.
# load
library(GGally)
# customize
<- function(data, mapping, ...) {
my_upper ggplot(data = data, mapping = mapping) +
geom_point(size = 1/5, alpha = 1/5, color = sl[7])
}
<- function(data, mapping, ...) {
my_diag ggplot(data = data, mapping = mapping) +
geom_density(linewidth = 0, fill = sl[3]) +
scale_x_continuous(NULL, breaks = NULL) +
scale_y_continuous(NULL, breaks = NULL)
}
<- function(data, mapping, ...) {
my_lower
# get the x and y data to use the other code
<- eval_data_col(data, mapping$x)
x <- eval_data_col(data, mapping$y)
y
# compute the correlations
<- cor(x, y, method = "p", use = "pairwise")
corr
# plot the cor value
ggally_text(
label = formatC(corr, digits = 2, format = "f") %>% str_replace(., "0\\.", "."),
mapping = aes(),
color = "black",
size = 4) +
scale_x_continuous(NULL, breaks = NULL) +
scale_y_continuous(NULL, breaks = NULL)
}
# wrangle
as_draws_df(fit23.1) %>%
select(contains("Intercept")) %>%
set_names(str_c("theta[", 1:6, "]")) %>%
# plot!
ggpairs(upper = list(continuous = my_upper),
diag = list(continuous = my_diag),
lower = list(continuous = my_lower),
labeller = label_parsed) +
ggtitle("The thresholds are highly correlated")
Kruschke didn’t do this in the text, but it might be informative to plot the probability distributions for the seven categories from Y
, \(p(y = k | \mu = 0, \sigma = 1, \{ \theta_i \})\).
library(tidybayes)
%>%
draws select(-.draw) %>%
mutate_all(.funs = ~pnorm(. ,0, 1)) %>%
transmute(`p[Y==1]` = `b_Intercept[1]`,
`p[Y==2]` = `b_Intercept[2]` - `b_Intercept[1]`,
`p[Y==3]` = `b_Intercept[3]` - `b_Intercept[2]`,
`p[Y==4]` = `b_Intercept[4]` - `b_Intercept[3]`,
`p[Y==5]` = `b_Intercept[5]` - `b_Intercept[4]`,
`p[Y==6]` = `b_Intercept[6]` - `b_Intercept[5]`,
`p[Y==7]` = 1 - `b_Intercept[6]`) %>%
set_names(1:7) %>%
pivot_longer(everything(), names_to = "Y") %>%
ggplot(aes(x = value, y = Y)) +
stat_halfeye(point_interval = mode_hdi, .width = .95,
fill = sl[4], color = sl[8], size = 1/2, height = 2.5) +
scale_x_continuous(expression(italic(p)*"["*Y==italic(i)*"]"),
breaks = 0:5 / 5,
expand = c(0, 0), limits = c(0, 1))
Happily, the model produces data that look a lot like those from which it was generated.
set.seed(23)
%>%
draws mutate(z = rnorm(n(), mean = 0, sd = 1)) %>%
mutate(Y = case_when(
< `b_Intercept[1]` ~ 1,
z < `b_Intercept[2]` ~ 2,
z < `b_Intercept[3]` ~ 3,
z < `b_Intercept[4]` ~ 4,
z < `b_Intercept[5]` ~ 5,
z < `b_Intercept[6]` ~ 6,
z >= `b_Intercept[6]` ~ 7
z %>% as.factor(.)) %>%
)
ggplot(aes(x = Y)) +
geom_bar(fill = sl[5]) +
scale_y_continuous(expand = expansion(mult = c(0, 0.05)))
Along similar lines, we can use the pp_check()
function to make a version of the upper right panel of Figure 23.2. The type = "bars"
argument will allow us to summarize the posterior predictions as a dot (mean) and standard error bars superimposed on a bar plot of the original data. Note how this differs a little from Kruschke’s use of the posterior median and 95% HDIs. The ndraws = 1000
argument controls how many posterior predictions we wanted to summarize over. The rest is just formatting.
::color_scheme_set(sl[2:7])
bayesplot
set.seed(23)
pp_check(fit23.1, type = "bars", ndraws = 1000, fatten = 2) +
scale_x_continuous("y", breaks = 1:7) +
scale_y_continuous(NULL, breaks = NULL, expand = expansion(mult = c(0, 0.05))) +
ggtitle("Data with posterior predictions",
subtitle = "N = 100") +
theme(legend.background = element_blank(),
legend.position = c(.9, .8))
Load the data for the next model.
<- read_csv("data.R/OrdinalProbitData-1grp-2.csv") my_data_2
Since we’re reusing all the specifications from the last model for this one, we can just use update()
.
.2 <-
fit23update(fit23.1,
newdata = my_data_2,
iter = 3000, warmup = 1000, chains = 4, cores = 4,
seed = 23,
file = "fits/fit23.02")
print(fit23.2)
## Family: cumulative
## Links: mu = probit; disc = identity
## Formula: Y ~ 1
## Data: my_data_2 (Number of observations: 70)
## Draws: 4 chains, each with iter = 3000; warmup = 1000; thin = 1;
## total post-warmup draws = 8000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept[1] -1.41 0.21 -1.83 -1.01 1.00 4879 4982
## Intercept[2] -0.17 0.15 -0.47 0.12 1.00 9111 6632
## Intercept[3] 0.17 0.15 -0.11 0.47 1.00 9222 7109
## Intercept[4] 0.46 0.15 0.15 0.76 1.00 8706 6590
## Intercept[5] 0.83 0.17 0.51 1.17 1.00 9277 6513
## Intercept[6] 2.00 0.31 1.46 2.67 1.00 8063 6649
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## disc 1.00 0.00 1.00 1.00 NA NA NA
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
Extract and wrangle the posterior draws.
<-
draws as_draws_df(fit23.2) %>%
select(.draw, `b_Intercept[1]`:`b_Intercept[6]`)
Now we might compare the brms parameterization of \(\theta_{[i]}\) with Kruschke’s parameterization in an expanded version of the bottom plot of Figure 23.3. As we’ll be making a lot of these plots throughout this chapter, it might be worthwhile to just make a custom function. We’ll call it compare_thresholds()
.
<- function(data, lb = 1.5, ub = 6.5) {
compare_thresholds
# we have two parameters:
# lb = lower bound
# ub = upper bound
# primary data wrangling
<-
p bind_rows(
%>%
data pivot_longer(-.draw, values_to = "threshold") %>%
group_by(.draw) %>%
mutate(theta_bar = mean(threshold)),
%>%
data pivot_longer(-.draw, values_to = "threshold") %>%
group_by(.draw) %>%
mutate(threshold = scales::rescale(threshold, to = c(lb, ub))) %>%
mutate(theta_bar = mean(threshold))
%>%
) mutate(model = rep(c("brms parameterization", "Kruschke's parameterization"), each = n() / 2))
# compute the means by model and threshold for the vertical lines
<-
means %>%
p ungroup() %>%
group_by(model, name) %>%
summarise(mean = mean(threshold))
# plot!
%>%
p ggplot(aes(x = threshold, y = theta_bar)) +
geom_vline(data = means,
aes(xintercept = mean, color = name),
linetype = 2) +
geom_point(aes(color = name),
alpha = 1/10, size = 1/2) +
scale_color_scico_d(palette = "lajolla", begin = .25) +
ylab("mean threshold") +
theme(legend.position = "none") +
facet_wrap(~ model, ncol = 1, scales = "free")
}
Take that puppy for a spin.
%>%
draws compare_thresholds(lb = 1.5, ub = 6.5)
Oh man, that works sweet. Now let’s use the same parameter-transformation approach from before to get our un-standardized posteriors for \(\mu\), \(\sigma\), and the effect size.
%>%
draws select(.draw, `b_Intercept[1]`, `b_Intercept[6]`) %>%
mutate(`y[1]` = 1.5,
`y[6]` = 6.5) %>%
mutate(mu = `y[1]` - `b_Intercept[1]` * 1,
sigma = (`y[1]` - `y[6]`) / (`b_Intercept[1]` - `b_Intercept[6]`)) %>%
mutate(`(mu-4)/sigma` = (mu - 4) / sigma) %>%
pivot_longer(mu:`(mu-4)/sigma`) %>%
mutate(name = factor(name, levels = c("mu", "sigma", "(mu-4)/sigma"))) %>%
ggplot(aes(x = value, y = 0)) +
stat_halfeye(point_interval = mode_hdi, .width = .95,
fill = sl[4], color = sl[8],
normalize = "panels") +
scale_y_continuous(NULL, breaks = NULL) +
facet_wrap(~ name, scales = "free", labeller = label_parsed)
Use pp_check()
to make our version of the upper-right panel of Figure 23.3.
set.seed(23)
pp_check(fit23.2, type = "bars", ndraws = 1000, fatten = 2) +
scale_x_continuous("y", breaks = 1:7) +
scale_y_continuous(NULL, breaks = NULL, expand = expansion(mult = c(0, 0.05))) +
ggtitle("Data with posterior predictions",
subtitle = "N = 70") +
theme(legend.background = element_blank(),
legend.position = c(.9, .8))
Just as in the text, “the posterior predictive distribution in the top-right subpanel accurately describes the bimodal distribution of the outcomes” (p. 680).
Here are the probability distributions for each of the 7 categories of Y
.
%>%
draws select(-.draw) %>%
mutate_all(.funs = ~pnorm(. ,0, 1)) %>%
transmute(`p[Y==1]` = `b_Intercept[1]`,
`p[Y==2]` = `b_Intercept[2]` - `b_Intercept[1]`,
`p[Y==3]` = `b_Intercept[3]` - `b_Intercept[2]`,
`p[Y==4]` = `b_Intercept[4]` - `b_Intercept[3]`,
`p[Y==5]` = `b_Intercept[5]` - `b_Intercept[4]`,
`p[Y==6]` = `b_Intercept[6]` - `b_Intercept[5]`,
`p[Y==7]` = 1 - `b_Intercept[6]`) %>%
set_names(1:7) %>%
pivot_longer(everything(), names_to = "Y") %>%
ggplot(aes(x = value, y = Y)) +
stat_halfeye(point_interval = mode_hdi, .width = .95,
fill = sl[4], color = sl[8], size = 1/2) +
scale_x_continuous(expression(italic(p)*"["*Y==italic(i)*"]"),
breaks = 0:5 / 5,
expand = c(0, 0), limits = c(0, 1))
Before we move on, it might be helpful to nail down what the thresholds mean within the context of our brms parameterization. To keep things simple, we’ll focus on their posterior means.
tibble(x = seq(from = -3.5, to = 3.5, by = .01)) %>%
mutate(d = dnorm(x)) %>%
ggplot(aes(x = x, ymin = 0, ymax = d)) +
geom_ribbon(fill = sl[3]) +
geom_vline(xintercept = fixef(fit23.2)[, 1], color = "white", linetype = 3) +
scale_x_continuous(NULL, breaks = fixef(fit23.2)[, 1],
labels = parse(text = str_c("theta[", 1:6, "]"))) +
scale_y_continuous(NULL, breaks = NULL, expand = expansion(mult = c(0, 0.05))) +
ggtitle("Standard normal distribution underlying the ordinal Y data:",
subtitle = "The dashed vertical lines mark the posterior means for the thresholds.") +
coord_cartesian(xlim = c(-3, 3))
Compare that to Figure 23.1.
23.2.2.1 Not the same results as pretending the data are metric.
“In some conventional approaches to ordinal data, the data are treated as if they were metric and normally distributed” (p. 681). Here’s what that brms::brm()
model might look like using methods from back in Chapter 16. First, we’ll define our stanvars
.
<- mean(my_data_1$Y)
mean_y <- sd(my_data_1$Y)
sd_y
<-
stanvars stanvar(mean_y, name = "mean_y") +
stanvar(sd_y, name = "sd_y")
Fit the model.
.3 <-
fit23brm(data = my_data_1,
family = gaussian,
~ 1,
Y prior = c(prior(normal(mean_y, sd_y * 100), class = Intercept),
prior(normal(0, sd_y), class = sigma)),
chains = 4, cores = 4,
stanvars = stanvars,
seed = 23,
file = "fits/fit23.03")
Check the results.
print(fit23.3)
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: Y ~ 1
## Data: my_data_1 (Number of observations: 100)
## Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup draws = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 1.95 0.14 1.68 2.23 1.00 3454 2723
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 1.41 0.10 1.23 1.62 1.00 3535 2700
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
As Kruschke indicated in the text, it yielded a distributional mean of about 1.95 and a standard deviation of about 1.41. Here we’ll use a posterior predictive check to compare histograms of data generated from this model to that of the original data.
pp_check(fit23.3, type = "hist", ndraws = 10, binwidth = 1) +
scale_x_continuous(breaks = seq(from = -3, to = 7, by = 2)) +
theme(legend.position = c(.9, .15))
Yeah, that’s not a good fit. We won’t be conducting a \(t\)-test like Kruschke did on page 681. But we might compromise and take a look at the marginal distribution of the intercept (i.e., for \(\mu\)) and its difference from 2, the reference value.
as_draws_df(fit23.3) %>%
mutate(`2 - b_Intercept` = 2 - b_Intercept,
`effect size` = (2 - b_Intercept) / sigma) %>%
pivot_longer(`2 - b_Intercept`:`effect size`) %>%
ggplot(aes(x = value, y = 0)) +
stat_halfeye(point_interval = mode_hdi, .width = .95,
fill = sl[4], color = sl[8], normalize = "panels") +
scale_y_continuous(NULL, breaks = NULL) +
facet_wrap(~ name, scales = "free")
Yes indeed, 2 is a credible value for the intercept. And as reported in the text, we got a very small \(d\) effect size. Now we repeat the process for the second data set.
<- mean(my_data_2$Y)
mean_y <- sd(my_data_2$Y)
sd_y
<-
stanvars stanvar(mean_y, name = "mean_y") +
stanvar(sd_y, name = "sd_y")
.4 <-
fit23update(fit23.3,
newdata = my_data_2,
chains = 4, cores = 4,
stanvars = stanvars,
seed = 23,
file = "fits/fit23.04")
## The desired updates require recompiling the model
Let’s just jump to the plot. This time we’re comparing the b_Intercept
to the value of 4.0.
as_draws_df(fit23.4) %>%
mutate(`2 - b_Intercept` = 4 - b_Intercept,
`effect size` = (4 - b_Intercept) / sigma) %>%
pivot_longer(`2 - b_Intercept`:`effect size`) %>%
ggplot(aes(x = value, y = 0)) +
stat_halfeye(point_interval = mode_hdi, .width = .95,
fill = sl[4], color = sl[8], normalize = "panels") +
scale_y_continuous(NULL, breaks = NULL) +
facet_wrap(~ name, scales = "free")
As in the text, our \(d\) is centered around 0.3. Let’s use a posterior predictive check to see how well fit23.4
summarized these data.
pp_check(fit23.4, type = "hist", ndraws = 10, binwidth = 1) +
scale_x_continuous(breaks = seq(from = -3, to = 7, by = 2)) +
theme(legend.position = c(.9, .15))
The histograms aren’t as awful as the ones from the previous model. But they’re still not great. We might further inspect the model misspecification with a cumulative distribution function overlay, this time comparing fit23.2
directly to fit23.4
.
<-
p1 pp_check(fit23.2, type = "ecdf_overlay", ndraws = 50) +
ggtitle("Cumulative-normal (fit23.2)")
<-
p2 pp_check(fit23.4, type = "ecdf_overlay", ndraws = 50) +
ggtitle("Conventional-normal (fit23.4)")
+ p2 &
(p1 scale_x_continuous(breaks = 0:7, limits = c(0, 7)) &
scale_y_continuous(expand = c(0, 0)) &
theme(title = element_text(size = 10.5))) +
plot_layout(guides = 'collect')
“Which of the analyses yields the more trustworthy conclusion? The one that describes the data better. In these cases, there is no doubt that the cumulative-normal model is the better description of the data” than the conventional Gaussian model (p. 682).
23.2.2.2 Ordinal outcomes versus Likert scales.
Just for fun,
rate how much you agree with the statement, “Bayesian estimation is more informative than null-hypothesis significance testing,” by selecting one option from the following: \(1\) = strongly disagree; \(2\) = disagree; \(3\) = undecided; \(4\) = agree; \(5\) = strongly agree. This sort of ordinal response interface is often called a Likert-type response (Likert, 1932), pronounced LICK-ert not LIKE-ert). Sometimes, it is called a Likert “scale” but the term “scale” in this context is more properly reserved for referring to an underlying metric variable that is indicated by the arithmetic mean of several meaningfully related Likert-type responses (Carifio & Perla, 2008; e.g., Carifio & Perla, 2007; Norman, 2010). (p. 681)
Kruschke then briefly introduced how one might combine several such meaningfully-related Likert-type responses with latent variable methods. He then clarified this text will not explore that approach, further. The current version of brms (i.e., 2.12.0) has very limited latent variable capacities. However, they are in the works. Interested modelers can follow Bürkner’s progress in GitHub issue #304. He also has a (2020) paper on how one might use brms to fit item response theory models, which can be viewed as a special family of latent variable models. One can also fit Bayesian latent variable models with the blavaan package.
23.3 The case of two groups
In both examples in the preceding text, the two groups of outcomes were on the same ordinal scale. In the first example, both questionnaire statements were answered on the same disagree–agree scale. In the second example, both groups responded on the same very unhappy–very happy scale. Therefore, we assume that both groups have the same underlying metric variable with the same thresholds. (p. 682)
23.3.1 Implementation in JAGS brms.
The brm()
syntax for adding a single categorical predictor to an ordered-probit model is much like that for any other likelihood. We just add the variable name to the right side of the ~
in the formula
argument. If you’re like me and like to use the verbose 1
syntax for your model intercepts–thresholds in these models–just use the +
operator between them. For example, this is what it’d look like for an ordered-categorical criterion y
and a single categorical predictor x
.
<-
fit brm(data = my_data,
family = cumulative(probit),
~ 1 + x,
y prior = c(prior(normal(0, 4), class = Intercept),
prior(normal(0, 4), class = b)))
Also of note, we’ve expanded the prior
section to include a line for class = b
. As with the thresholds, interpret this prior through the context of the underlying standard normal cumulative distribution, \(\Phi(z)\). Note the interpretation, though. By brms defaults, the underlying Gaussian for the reference category of x
will be \(\operatorname{Normal}(0, 1)\). Thus whatever parameter value you get for the other categories in x
, those will be standardized mean differences, making them a kind of effect size.
Note, the above all presumes you’re only interested in comparing means between groups. Things get more complicated if you want groups to vary by \(\sigma\), too. Hold on tight!
First, look back at the output from print(fit1)
or print(fit2)
. The second line for both reads: Links: mu = probit; disc = identity
. Hopefully the mu = probit
part is no surprise. Probit regression is the primary focus of this chapter. But check out the disc = identity
part and notice that nowhere in there is there any mention of sigma = identity
like we get when treating the criterion as metric as in conventional Gaussian models (i.e., execute print(fit3)
or print(fit4)
).
Yes, there is a relationship between disc
and sigma
. disc
is shorthand for discrimination. The term comes from the item response theory (IRT) literature and discrimination is the inverse of \(\sigma\) (see Bürkner’s Bayesian item response modelling in R with brms and Stan). In IRT, discrimination is often denoted \(a\) or \(\alpha\). Here I’ll adopt the latter, making \(\sigma = 1 / \alpha\). But focusing back on brms summary output, notice how both disc
and sigma
are modeled using the identity link. If you recall from earlier chapters, we switched to the log link to constrain the values to zero and above when we allowed \(\sigma\) to vary across groups. It’s the same thing for our discrimination parameter, \(\alpha\). Because \(\alpha\) should always be zero or above, brms defaults to the log link when modeling it with predictors.
As with \(\sigma\) in conventional Gaussian models, we’ll be using some version of the bf()
syntax when modeling the discrimination parameter in brms. For a general introduction to what Bürkner calls distributional modeling, see his (2022a) vignette, Estimating distributional models with brms. In the case of the discrimination parameter for the cumulative model, we’ll want more focused instructions. Happily, Bürkner & Vuorre (2019) have our backs. We read:
Conceptually, unequal variances are incorporated in the model by specifying an additional regression formula for the variance component of the latent variable \(\tilde Y\). In brms, the parameter related to latent variances is called disc (short for “discrimination”), following conventions in item response theory. Note that disc is not the variance itself, but the inverse of the standard deviation, \(s.\) That is, \(s = 1/ \text{disc}\). Further, because disc must be strictly positive, it is by default modeled on the log scale.
Predicting auxiliary parameters (parameters of the distribution other than the mean/location) in brms is accomplished by passing multiple regression formulas to the
brm()
function. Each formula must first be wrapped in another function,bf()
orlf()
(for “linear formula”)–depending on whether it is a main or an auxiliary formula, respectively. The formulas are then combined and passed to theformula
argument ofbrm()
. Because the standard deviation of the latent variable is fixed to 1 for the baseline [group, disc cannot be estimated for the baseline group]. We must therefore ensure that disc is estimated only for [non-baseline groups]. To do so, we omit the intercept from the model of disc by writing0 + ...
on the right-hand side of the regression formula. By default, R applies cell-mean coding to factors in formulas without an intercept. That would lead to disc being estimated for [all groups], so we must deactivate it via thecmc
argument oflf()
. (pp. 11–12)
Here’s what that might look like.
<-
fit brm(data = my_data,
family = cumulative(probit),
bf(y ~ 1 + x) +
lf(disc ~ 0 + x, cmc = F),
prior = c(prior(normal(0, 4), class = Intercept),
prior(normal(0, 4), class = b),
prior(normal(0, 4), class = b, dpar = disc)))
Note how when using the disc ~ 0 + ...
syntax, the disc
parameters are of class = b
within the prior()
function. If you’d like to assign them priors differing from the other b
parameters, you’ll need to specify dpar = disc
. Again, though the mean structure for this model is on the probit scale, the discrimination structure is on the log scale. Recalling that \(\sigma = 1/\alpha\), which means \(\alpha = 1/\sigma\), and also that we’re modeling \(\log (\alpha)\), the priors for the standard deviations of the non-reference category groups are on the scale of \(\log (1 / \sigma)\).
To get a better sense of how one might set a prior on such a scale, we might compare \(\sigma\), \(\alpha\), and \(\log (\alpha)\). Here are the density and cumulative density functions for \(\operatorname{Normal}(0, 0.5)\), \(\operatorname{Normal}(0, 1)\), and \(\operatorname{Normal}(0, 2)\).
tibble(mu = 0,
sigma = c(0.5, 1, 2)) %>%
expand_grid(y = seq(from = -5, to = 5, by = 0.1)) %>%
mutate(`p(y)` = dnorm(y, mu, sigma),
`Phi(y)` = pnorm(y, mu, sigma)) %>%
mutate(alpha = 1 / sigma,
loga = log(1 / sigma)) %>%
mutate(label = str_c("list(sigma==", sigma, ",alpha==", alpha, ",log(alpha)==", round(loga, 2), ")")) %>%
pivot_longer(`p(y)`:`Phi(y)`) %>%
ggplot(aes(x = y, y = value)) +
geom_line(linewidth = 1.5, color = sl[7]) +
scale_y_continuous(NULL, breaks = NULL) +
xlab(NULL) +
coord_cartesian(xlim = c(-4, 4)) +
facet_grid(name ~ label, labeller = label_parsed, switch = "y")
Put another way, here’s how \(\alpha\) and \(\log (\alpha)\) scale on values of \(\sigma\) ranging from 0.0001 to 10.
tibble(sigma = seq(from = 0.0001, to = 10, by = 0.01)) %>%
mutate(alpha = 1 / sigma,
`log(alpha)` = log(1 / sigma)) %>%
pivot_longer(-sigma, names_to = "labels") %>%
ggplot(aes(x = sigma, y = value)) +
geom_hline(yintercept = 0, color = sl[3], linetype = 2) +
geom_vline(xintercept = 0, color = sl[3], linetype = 2) +
geom_line(linewidth = 1.5, color = sl[7]) +
coord_cartesian(ylim = c(-2, 10)) +
facet_grid(~ labels, labeller = label_parsed)
When \(\sigma\) goes below 1, both explode upward. As \(\sigma\) increases, \(\alpha\) asymptotes at zero and \(\log (\alpha)\) slowly descends below zero. Put another way, here is how \(\sigma\) scales as a function of \(\log (\alpha)\).
tibble(`log(alpha)` = seq(from = -3, to = 3, by = 0.01)) %>%
mutate(sigma = 1 / exp(`log(alpha)`)) %>%
ggplot(aes(x = `log(alpha)`, y = sigma)) +
geom_hline(yintercept = 0, color = sl[3], linetype = 2) +
geom_line(linewidth = 1.5, color = sl[7]) +
labs(x = expression(log(alpha)),
y = expression(sigma)) +
coord_cartesian(xlim = c(-2.5, 2.5),
ylim = c(0, 10))
In the context where the underlying distribution for the reference category will be the standard normal, it seems like a \(\operatorname{Normal}(0, 1)\) prior would be fairly permissive for \(\log (\alpha)\). This is what I will use going forward. Choose your priors with care.
23.3.2 Examples: Not funny.
Load the data for the next model.
<- read_csv("data.R/OrdinalProbitData1.csv")
my_data
glimpse(my_data)
## Rows: 88
## Columns: 2
## $ X <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A"…
## $ Y <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,…
Fit the first ordinal probit model with group-specific \(\mu\) and \(\sigma\) values for the underlying normal distributions for the ordinal variable Y
.
.5 <-
fit23brm(data = my_data,
family = cumulative(probit),
bf(Y ~ 1 + X) +
lf(disc ~ 0 + X, cmc = FALSE),
prior = c(prior(normal(0, 4), class = Intercept),
prior(normal(0, 4), class = b),
prior(normal(0, 1), class = b, dpar = disc)),
iter = 3000, warmup = 1000, chains = 4, cores = 4,
seed = 23,
file = "fits/fit23.05")
Look over the summary.
print(fit23.5)
## Family: cumulative
## Links: mu = probit; disc = log
## Formula: Y ~ 1 + X
## disc ~ 0 + X
## Data: my_data (Number of observations: 88)
## Draws: 4 chains, each with iter = 3000; warmup = 1000; thin = 1;
## total post-warmup draws = 8000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept[1] 0.49 0.20 0.11 0.88 1.00 9760 6028
## Intercept[2] 1.30 0.24 0.85 1.78 1.00 11289 5956
## Intercept[3] 2.20 0.38 1.52 3.00 1.00 5687 5511
## Intercept[4] 3.43 0.82 2.15 5.38 1.00 4453 5631
## XB 0.43 0.34 -0.31 1.04 1.00 4682 3784
## disc_XB -0.32 0.28 -0.86 0.23 1.00 3221 4543
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
Because our fancy new parameter disc_XB
is on the \(\log (\alpha)\) scale, we can convert it to the \(\sigma\) scale with \(\frac{1}{\exp (\log \alpha)}\). For a quick and dirty example, here it is with the posterior mean.
1 / (exp(fixef(fit23.5)["disc_XB", 1]))
## [1] 1.372627
Before we follow along with Kruschke, let’s hammer the meaning of these model parameters home. Here is a density plot of the two underlying latent distributions for Y
, given X
. We’ll throw in the thresholds for good measure. To keep things simple, we’ll just express the distributions in terms of the posterior means of each parameter.
tibble(X = LETTERS[1:2],
mu = c(0, fixef(fit23.5)["XB", 1]),
sigma = c(1, 1 / (exp(fixef(fit23.5)["disc_XB", 1])))) %>%
expand_grid(y = seq(from = -5, to = 5, by = 0.1)) %>%
mutate(d = dnorm(y, mu, sigma)) %>%
ggplot(aes(x = y, y = d, fill = X)) +
geom_area(alpha = 2/3, position = "identity") +
geom_vline(xintercept = fixef(fit23.5)[1:4, 1], linetype = 3, color = sl[9]) +
scale_fill_scico_d(palette = "lajolla", begin = .33, end = .67) +
scale_x_continuous(sec.axis = dup_axis(breaks = fixef(fit23.5)[1:4, 1] %>% as.double(),
labels = parse(text = str_c("theta[", 1:4, "]")))) +
scale_y_continuous(NULL, breaks = NULL, expand = expansion(mult = c(0, 0.05))) +
labs(title = "Underlying latent scale for Y, given X",
x = NULL) +
theme(axis.ticks.x.top = element_blank())
Returning to our previous workflow, extract the posterior draws and wrangle.
<- as_draws_df(fit23.5)
draws
glimpse(draws)
## Rows: 8,000
## Columns: 11
## $ `b_Intercept[1]` <dbl> 0.4952251, 0.7549364, 0.7251187, 0.5801829, 0.3644839, 0.6963412, 0.2042224, 0.4628…
## $ `b_Intercept[2]` <dbl> 1.0987189, 1.2573930, 1.3004260, 1.2258207, 0.9281913, 1.2340367, 1.3980522, 1.1853…
## $ `b_Intercept[3]` <dbl> 1.701087, 1.806990, 1.651997, 2.159145, 1.776843, 1.795096, 2.845807, 2.064523, 1.9…
## $ `b_Intercept[4]` <dbl> 2.377239, 2.446180, 2.469663, 2.678498, 2.348979, 2.197770, 6.031723, 3.376425, 4.1…
## $ b_XB <dbl> 0.48385430, 0.85921784, 0.67312708, 0.38551309, 0.09678844, 0.45100318, 0.29035289,…
## $ b_disc_XB <dbl> 0.230385189, 0.288865047, 0.302123290, 0.220952201, 0.033662260, -0.001259104, -1.0…
## $ lprior <dbl> -12.71293, -12.72098, -12.73492, -12.82612, -12.73203, -12.68872, -14.37737, -12.90…
## $ lp__ <dbl> -108.1915, -109.1719, -110.3546, -112.2361, -109.8350, -110.3082, -112.2452, -107.9…
## $ .chain <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ .iteration <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, …
## $ .draw <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, …
Now, let’s use our handy compare_thresholds()
function to make an expanded version of the lower-left plot of Figure 23.4.
%>%
draws select(`b_Intercept[1]`:`b_Intercept[4]`, .draw) %>%
compare_thresholds(lb = 1.5, ub = 4.5)
It will no longer be straightforward to use the formulas from 23.2.2 to convert the output from our brms parameterization to match the way Kruschke parameterized his conditional means and standard deviations. I will leave the conversion up to the interested reader. Going forward, we will focus on the output from our brms parameterization.
%>%
draws # simple parameters
mutate(`mu[A]` = 0,
`mu[B]` = b_XB,
`sigma[A]` = 1,
`sigma[B]` = 1 / exp(b_disc_XB)) %>%
# simple differences
mutate(`mu[B]-mu[A]` = `mu[B]` - `mu[A]`,
`sigma[B]-sigma[A]` = `sigma[B]` - `sigma[A]`) %>%
# effect size
mutate(`(mu[B]-mu[A])/sqrt((sigma[A]^2+sigma[B]^2)/2)` = (`mu[B]-mu[A]`) / sqrt((`sigma[A]`^2 + `sigma[B]`^2) / 2)) %>%
# wrangle
pivot_longer(`mu[A]`:`(mu[B]-mu[A])/sqrt((sigma[A]^2+sigma[B]^2)/2)`) %>%
mutate(name = factor(name,
levels = c("mu[A]", "mu[B]", "mu[B]-mu[A]",
"sigma[A]", "sigma[B]", "sigma[B]-sigma[A]",
"(mu[B]-mu[A])/sqrt((sigma[A]^2+sigma[B]^2)/2)"))) %>%
# plot
ggplot(aes(x = value, y = 0)) +
stat_halfeye(point_interval = mode_hdi, .width = .95,
fill = sl[4], color = sl[8], normalize = "panels") +
scale_y_continuous(NULL, breaks = NULL) +
xlab(NULL) +
facet_wrap(~ name, scales = "free", labeller = label_parsed)
\(\mu_A\) and \(\sigma_A\) are both constants, which doesn’t show up well with our stat_halfeye()
approach with the scales freed across facets. If these plots really mattered for a scientific presentation or something for industry, you could experiment using either a common scale across all facets, or making the plots individually and then combining them with patchwork syntax. Returning to interpretation, because \(\mu_A = 0\), it turns out that \(\mu_B - \mu_A = \mu_B\), which is on display on the top row. Because \(\sigma_A = 1\), it turns out that \(\sigma_B - \sigma_A\) is just \(\sigma_B\) moved over one unit to the left, which is hopefully clear in the panels of the second row. Very happily, the effect size formula worked with our brms parameters the same way it did for Kruschke’s. Both yield an effect size of about 0.5, with 95% intervals extending about \(\pm 0.5\).
Here we make good use of the type = "bars_grouped"
and group = "X"
arguments to make the posterior predictive plots at the top right of Figure 23.4 with the brms::pp_check()
function.
set.seed(23)
pp_check(fit23.5, type = "bars_grouped", ndraws = 100, group = "X", fatten = 2) +
scale_x_continuous("y", breaks = 1:7) +
scale_y_continuous(NULL, breaks = NULL, expand = expansion(mult = c(0, 0.05))) +
ggtitle("Data with posterior predictions",
subtitle = expression(list(italic(N[A])==44, italic(N[B])==44))) +
theme(legend.background = element_blank(),
legend.position = c(.9, .75))
Using more tricks from back in Chapter 16, here’s the corresponding conventional Gaussian model for metric data.
<- mean(my_data$Y)
mean_y <- sd(my_data$Y)
sd_y
<-
stanvars stanvar(mean_y, name = "mean_y") +
stanvar(sd_y, name = "sd_y")
.6 <-
fit23brm(data = my_data,
family = gaussian,
bf(Y ~ 0 + X, sigma ~ 0 + X),
prior = c(prior(normal(mean_y, sd_y * 100), class = b),
prior(normal(0, 1), class = b, dpar = sigma)),
chains = 4, cores = 4,
stanvars = stanvars,
seed = 23,
file = "fits/fit23.06")
Check the summary.
print(fit23.6)
## Family: gaussian
## Links: mu = identity; sigma = log
## Formula: Y ~ 0 + X
## sigma ~ 0 + X
## Data: my_data (Number of observations: 88)
## Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup draws = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## XA 1.43 0.12 1.20 1.67 1.00 4176 2850
## XB 1.86 0.16 1.54 2.18 1.00 4514 2958
## sigma_XA -0.26 0.11 -0.46 -0.03 1.00 3919 2563
## sigma_XB 0.08 0.11 -0.12 0.30 1.00 4275 3139
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
Here are the marginal posteriors, including the effect size (i.e., a standardized mean difference using the pooled standard deviation formula presuming equal sample sizes, \((\mu_2 - \mu_1) / \sqrt{(\sigma_1^2 + \sigma_2^2) / 2}\).
as_draws_df(fit23.6) %>%
mutate(`A mean` = b_XA,
`B mean` = b_XB,
`A Std. Dev.` = exp(b_sigma_XA),
`B Std. Dev.` = exp(b_sigma_XB)) %>%
mutate(`Difference of Means` = `B mean` - `A mean`,
`Difference of Std. Devs` = `B Std. Dev.` - `A Std. Dev.`) %>%
mutate(`Effect Size` = `Difference of Means` / sqrt((`A Std. Dev.`^2 + `B Std. Dev.`^2) / 2)) %>%
pivot_longer(`A mean`:`Effect Size`) %>%
ggplot(aes(x = value, y = 0)) +
stat_halfeye(point_interval = mode_hdi, .width = .95,
fill = sl[4], color = sl[8], normalize = "panels") +
scale_y_continuous(NULL, breaks = NULL) +
labs(subtitle = "These are based on the conventional Gaussian model, NOT the cumulative-normal\nmodel Kruschke displayed in Figure 23.4",
x = "Marginal posterior") +
facet_wrap(~ name, scales = "free", ncol = 2)
Compare those results to those Kruschke reported from an NHST analysis in the note below Figure 23.4:
\(M_1 = 1.43, M_2 = 1.86, t = 2.18, p = 0.032\), with effect size \(d = 0.466\) with \(95\%\) CI of \(0.036-0.895\). An \(F\) test of the variances concludes that the standard deviations are significantly different: \(S_1 = 0.76, S_2 = 1.07, p = 0.027\). Notice in this case that treating the values as metric greatly underestimates their variances, as well as erroneously concludes the variances are different. (p. 684)
As to the data in the analyses Kruschke reported in Figure 23.5 and the prose in the second paragraph on page 685, I’m not aware that Kruschke provided them. From his footnote #2, we read: “Data in Figure 23.5 are from an as-yet unpublished study I conducted with the collaboration of Allison Vollmer as part of her undergraduate honors project.” In place of the real data, I eyeballed the values based on the upper-right panels in Figure 23.5. Here they are.
<-
d tibble(x = rep(str_c("joke ", c(1, 6)), each = 177),
y = c(rep(1:7, times = c(95, 19, 18, 10, 17, 10, 8)),
rep(1:7, times = c(53, 33, 31, 22, 23, 14, 1))))
glimpse(d)
## Rows: 354
## Columns: 2
## $ x <chr> "joke 1", "joke 1", "joke 1", "joke 1", "joke 1", "joke 1", "joke 1", "joke 1", "joke 1", "joke 1"…
## $ y <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
My approximation to Kruschke’s data looks like this.
%>%
d ggplot(aes(x = y)) +
geom_bar(fill = sl[5]) +
scale_x_continuous(breaks = 1:7) +
scale_y_continuous(NULL, breaks = NULL, expand = expansion(mult = c(0, 0.05))) +
facet_wrap(~ x, ncol = 1)
Here we fit the cumulative-normal model based on our version of the data, continuing to allow both \(\mu\) and \(\sigma\) (i.e., \(1 / \exp(\log \alpha)\)) to differ across groups.
.7 <-
fit23brm(data = d,
family = cumulative(probit),
bf(y ~ 1 + x) +
lf(disc ~ 0 + x, cmc = FALSE),
prior = c(prior(normal(0, 4), class = Intercept),
prior(normal(0, 4), class = b),
prior(normal(0, 1), class = b, dpar = disc)),
iter = 3000, warmup = 1000, chains = 4, cores = 4,
seed = 23,
file = "fits/fit23.07")
Check the model summary.
print(fit23.7)
## Family: cumulative
## Links: mu = probit; disc = log
## Formula: y ~ 1 + x
## disc ~ 0 + x
## Data: d (Number of observations: 354)
## Draws: 4 chains, each with iter = 3000; warmup = 1000; thin = 1;
## total post-warmup draws = 8000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept[1] 0.08 0.09 -0.11 0.25 1.00 5294 5858
## Intercept[2] 0.37 0.09 0.19 0.54 1.00 7251 6247
## Intercept[3] 0.65 0.09 0.47 0.83 1.00 8814 6164
## Intercept[4] 0.87 0.10 0.68 1.07 1.00 8039 5851
## Intercept[5] 1.26 0.12 1.03 1.49 1.00 6667 5937
## Intercept[6] 1.81 0.16 1.51 2.13 1.00 7075 5922
## xjoke6 0.39 0.10 0.20 0.59 1.00 7813 6439
## disc_xjoke6 0.49 0.12 0.27 0.72 1.00 4440 5751
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
Save and wrangle the posterior draws, then use our compare_thresholds()
function to compare the brms parameterization of \(\theta_{[i]}\) with the parameterization in the text in an expanded version of the lower-left plot of Figure 23.5.
<- as_draws_df(fit23.7)
draws
%>%
draws select(`b_Intercept[1]`:`b_Intercept[6]`, .draw) %>%
compare_thresholds(lb = 1.5, ub = 6.5)
Given our data are only approximations of Kruschke’s, I think we did pretty good. Here are the histograms for our brms versions of the \(\mu\)- and \(\sigma\)-related parameters.
%>%
draws transmute(`mu[Joke~1]` = 0,
`mu[Joke~6]` = b_xjoke6,
`sigma[Joke~1]` = 1,
`sigma[Joke~6]` = 1 / exp(b_disc_xjoke6)) %>%
mutate(`mu[Joke~6]-mu[Joke~1]` = `mu[Joke~6]` - `mu[Joke~1]`,
`sigma[Joke~6]-sigma[Joke~1]` = `sigma[Joke~6]` - `sigma[Joke~1]`) %>%
mutate(`(mu[Joke~6]-mu[Joke~1])/sqrt((sigma[Joke~1]^2+sigma[Joke~6]^2)/2)` = (`mu[Joke~6]-mu[Joke~1]`) / sqrt((`sigma[Joke~1]`^2 + `sigma[Joke~6]`^2) / 2)) %>%
pivot_longer(everything()) %>%
mutate(name = factor(name,
levels = c("mu[Joke~1]", "mu[Joke~6]", "mu[Joke~6]-mu[Joke~1]",
"sigma[Joke~1]", "sigma[Joke~6]", "sigma[Joke~6]-sigma[Joke~1]",
"(mu[Joke~6]-mu[Joke~1])/sqrt((sigma[Joke~1]^2+sigma[Joke~6]^2)/2)"))) %>%
ggplot(aes(x = value, y = 0)) +
stat_halfeye(point_interval = mode_hdi, .width = .95,
fill = sl[4], color = sl[8],
normalize = "panels") +
scale_y_continuous(NULL, breaks = NULL) +
xlab(NULL) +
facet_wrap(~ name, scales = "free", labeller = label_parsed)
Here are our versions of the two panels in the upper right of Figure 23.5.
set.seed(23)
pp_check(fit23.7, type = "bars_grouped", ndraws = 100, group = "x", fatten = 2) +
scale_x_continuous("y", breaks = 1:7) +
scale_y_continuous(NULL, breaks = NULL, expand = expansion(mult = c(0, 0.05))) +
ggtitle("Data with posterior predictions",
subtitle = expression(list(italic(N["joke "*1])==177, italic(N["joke "*6])==177))) +
theme(legend.position = "none")
Now here’s the corresponding model where we treat the y
data as metric.
<- mean(d$y)
mean_y <- sd(d$y)
sd_y
<-
stanvars stanvar(mean_y, name = "mean_y") +
stanvar(sd_y, name = "sd_y")
.8 <-
fit23brm(data = d,
family = gaussian,
bf(y ~ 0 + x, sigma ~ 0 + x),
prior = c(prior(normal(mean_y, sd_y * 100), class = b),
prior(normal(0, exp(sd_y)), class = b, dpar = sigma)),
chains = 4, cores = 4,
stanvars = stanvars,
seed = 23,
file = "fits/fit23.08")
Check the summary.
print(fit23.8)
## Family: gaussian
## Links: mu = identity; sigma = log
## Formula: y ~ 0 + x
## sigma ~ 0 + x
## Data: d (Number of observations: 354)
## Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup draws = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## xjoke1 2.42 0.14 2.15 2.69 1.00 4201 3145
## xjoke6 2.86 0.13 2.60 3.10 1.00 4581 3030
## sigma_xjoke1 0.64 0.05 0.54 0.75 1.00 5039 3027
## sigma_xjoke6 0.52 0.05 0.42 0.63 1.00 4602 3110
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
Make the marginal posteriors, including the effect size.
as_draws_df(fit23.8) %>%
mutate(`Joke 1 Mean` = b_xjoke1,
`Joke 6 Mean` = b_xjoke6,
`Joke 1 Std. Dev.` = exp(b_sigma_xjoke1),
`Joke 6 Std. Dev.` = exp(b_sigma_xjoke6)) %>%
mutate(`Difference of Means` = `Joke 6 Mean` - `Joke 1 Mean`,
`Difference of Std. Devs` = `Joke 6 Std. Dev.` - `Joke 1 Std. Dev.`) %>%
mutate(`Effect Size` = `Difference of Means` / sqrt((`Joke 1 Std. Dev.`^2 + `Joke 6 Std. Dev.`^2) / 2)) %>%
pivot_longer(`Joke 1 Mean`:`Effect Size`) %>%
mutate(name = factor(name,
levels = c("Joke 1 Mean", "Joke 1 Std. Dev.",
"Joke 6 Mean", "Joke 6 Std. Dev.",
"Difference of Means", "Difference of Std. Devs",
"Effect Size"))) %>%
ggplot(aes(x = value, y = 0)) +
stat_halfeye(point_interval = mode_hdi, .width = .95,
fill = sl[4], color = sl[8], normalize = "panels") +
scale_y_continuous(NULL, breaks = NULL) +
labs(subtitle = "These are based on the conventional Gaussian model, NOT the cumulative-normal\nmodel Kruschke displayed in Figure 23.5",
x = "Marginal posterior") +
facet_wrap(~ name, scales = "free", ncol = 2)
If you think you have a better approximation of Kruschke’s data, please share.
23.4 The Case of metric predictors
“This type of model is often referred to as ordinal probit regression or ordered probit regression because the probit function is the link function corresponding to the cumulative-normal inverse-link function” (p. 688, emphasis in the original).
23.4.1 Implementation in JAGS brms.
This model is easy to specify in brms. Just make sure to think clearly about your priors.
23.4.2 Example: Happiness and money.
Load the data for the next model.
<- read_csv("data.R/OrdinalProbitData-LinReg-2.csv")
my_data
glimpse(my_data)
## Rows: 200
## Columns: 2
## $ X <dbl> 1.386389, 1.223879, 1.454505, 1.112068, 1.222715, 1.545099, 1.360256, 1.533071, 1.501657, 1.426755…
## $ Y <dbl> 1, 1, 5, 5, 1, 4, 6, 2, 5, 4, 1, 4, 4, 4, 4, 6, 1, 1, 6, 2, 1, 7, 1, 3, 1, 1, 7, 5, 7, 1, 4, 6, 7,…
Take a quick look at the data.
%>%
my_data ggplot(aes(x = X, y = Y)) +
geom_point(alpha = 1/3, color = sl[9]) +
scale_y_continuous(breaks = 1:7)
Kruschke standardized his predictor within his model code. Here we’ll standardize X
before fitting the model.
<-
my_data %>%
my_data mutate(X_s = (X - mean(X)) / sd(X))
Fit the model.
.9 <-
fit23brm(data = my_data,
family = cumulative(probit),
~ 1 + X_s,
Y prior = c(prior(normal(0, 4), class = Intercept),
prior(normal(0, 4), class = b)),
iter = 3000, warmup = 1000, chains = 4, cores = 4,
seed = 23,
file = "fits/fit23.09")
Check the summary.
print(fit23.9)
## Family: cumulative
## Links: mu = probit; disc = identity
## Formula: Y ~ 1 + X_s
## Data: my_data (Number of observations: 200)
## Draws: 4 chains, each with iter = 3000; warmup = 1000; thin = 1;
## total post-warmup draws = 8000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept[1] -1.18 0.12 -1.42 -0.94 1.00 5504 5515
## Intercept[2] -0.76 0.11 -0.99 -0.54 1.00 7562 6456
## Intercept[3] -0.29 0.11 -0.50 -0.08 1.00 8824 6672
## Intercept[4] 0.24 0.11 0.04 0.45 1.00 9672 6707
## Intercept[5] 0.73 0.11 0.51 0.96 1.00 9080 7114
## Intercept[6] 1.26 0.13 1.02 1.52 1.00 8399 7198
## X_s 1.16 0.10 0.97 1.35 1.00 6487 6073
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## disc 1.00 0.00 1.00 1.00 NA NA NA
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
Extract the posterior draws and compare the brms parameterization of \(\theta_{[i]}\) with the parameterization in the text in an expanded version of the bottom panel of Figure 23.7.
<- as_draws_df(fit23.9)
draws
%>%
draws select(`b_Intercept[1]`:`b_Intercept[6]`, .draw) %>%
compare_thresholds(lb = 1.5, ub = 6.5)
Here’s the marginal distribution of b_X_s
, our effect size for the number of jokes.
%>%
draws ggplot(aes(x = b_X_s, y = 0)) +
stat_halfeye(point_interval = mode_hdi, .width = .95,
fill = sl[4], color = sl[8]) +
scale_y_continuous(NULL, breaks = NULL)
This differs from Kruschks’s \(\beta_1\), which is in an unstandardized metric based on the parameters in his version of the model. But unlike the effect sizes from previous models, this one is not in a Cohen’s-\(d\) metric. Rather, this is a fully-standardized regression coefficient. As to the large subplot at the top of Figure 23.7, we can make something like it by nesting conditional_effects()
within plot()
.
conditional_effects(fit23.9) %>%
plot(line_args = list(color = sl[7], fill = sl[3]))
Here’s a more elaborated version of the same plot, this time depicting the model with 100 fitted lines randomly drawn from the posterior.
set.seed(23)
conditional_effects(fit23.9,
spaghetti = TRUE,
ndraws = 100) %>%
plot(points = T,
line_args = list(size = 0),
point_args = list(alpha = 1/3, color = sl[9]),
spaghetti_args = list(colour = alpha(sl[6], .2)))
## Warning: Predictions are treated as continuous variables in 'conditional_effects' by default which is likely
## invalid for ordinal families. Please set 'categorical' to TRUE.
Note the warning message. There was a similar one in the first plot, which I suppressed for simplicity sake. The message suggests treating the fitted lines as “continuous variables” might lead to a deceptive plot. Here’s what happens if we follow the suggestion.
set.seed(23)
conditional_effects(fit23.9, categorical = T)
Recall that our thresholds, \(\theta_1,...,\theta_{K-1}\), in conjunction with the standard normal density, give us the probability of a given Y
value, given X_s
. That is, \(p(y = k | \mu, \sigma, \{ \theta_j \})\), where \(\mu\) is conditional on \(x\). This plot returned the fitted lines of those conditional probabilities, each depicted by the posterior mean and percentile-based 95% intervals. At lower values of X_s
, lower values of Y
are more probable. At higher values of X_s
, higher values of Y
are more probable.
It might be useful to get more practice in with this model. Here we’ll use fitted()
to make a similar plot, depicting the model with may fitted lines instead of summary statistics.
# how many fitted lines do you want?
<- 50
n_draws
# define the `X_s` values you want to condition on
# because the lines are nonlinear, you'll need many of them
<- tibble(X_s = seq(from = -2, to = 2, by = 0.05))
nd
<-
f fitted(fit23.9,
newdata = nd,
summary = F,
ndraws = n_draws)
# inspect the output
%>%
f str()
## num [1:50, 1:81, 1:7] 0.933 0.854 0.848 0.888 0.91 ...
## - attr(*, "dimnames")=List of 3
## ..$ : chr [1:50] "1" "2" "3" "4" ...
## ..$ : NULL
## ..$ : chr [1:7] "1" "2" "3" "4" ...
Our output came in three dimensions. We have 50 rows, corresponding to n_draws <- 50
(i.e., 50 posterior draws). There are 81 columns, based on how we defined the X_s
values within our nd
data (i.e., seq(from = -2, to = 2, by = 0.05)
). The third dimension has seven levels, one corresponding to each of the seven levels of our criterion variable Y
. Here’s a way to rearrange that output into a useful format for plotting.
# rearrange the output
rbind(
1],
f[, , 2],
f[, , 3],
f[, , 4],
f[, , 5],
f[, , 6],
f[, , 7]
f[, , %>%
) # wrangle
data.frame() %>%
set_names(nd %>% pull(X_s)) %>%
mutate(draw = rep(1:n_draws, times = 7),
rating = rep(1:7, each = n_draws)) %>%
pivot_longer(-c(draw, rating),
names_to = "X_s",
values_to = "probability") %>%
mutate(rating = str_c("Y: ", rating),
X_s = X_s %>% as.double()) %>%
# plot
ggplot(aes(x = X_s, y = probability,
group = interaction(draw, rating),
color = rating)) +
geom_line(linewidth = 1/4, alpha = 1/2) +
scale_color_scico_d(palette = "lajolla", begin = .25) +
scale_y_continuous(breaks = c(0, .5, 1), limits = c(0, 1), expand = c(0, 0)) +
theme(legend.position = "none") +
facet_wrap(~ rating, ncol = 7)
So far, we’ve been plotting our model with the context of the default scale = "response"
setting within fitted()
. Within the context of the response variable Y
, our model returns response probabilities. We can also look at the output within the context of scale = "linear"
. We’ll plot these fitted lines across our nd
values two different ways. For the first, p1
, we’ll use summary statistics. For the second, p2
, we’ll set summary = T
.
# adjust the nd
<- tibble(X_s = seq(from = -2.5, to = 2.5, by = 0.1))
nd
# use summary statistics
<-
p1 fitted(fit23.9,
scale = "linear",
newdata = nd) %>%
data.frame() %>%
bind_cols(nd) %>%
ggplot(aes(x = X_s, y = Estimate, ymin = Q2.5, ymax = Q97.5)) +
geom_smooth(stat = "identity",
alpha = 1/2, color = sl[7], fill = sl[3]) +
labs(title = "summary statistics",
y = "underlying standard normal")
# set `summary = F`
set.seed(23)
<-
p2 fitted(fit23.9,
scale = "linear",
newdata = nd,
summary = F,
ndraws = n_draws) %>%
data.frame() %>%
set_names(nd %>% pull(X_s)) %>%
mutate(draw = 1:n_draws) %>%
pivot_longer(-draw,
names_to = "X_s") %>%
mutate(X_s = X_s %>% as.double()) %>%
ggplot(aes(x = X_s, y = value, group = draw)) +
geom_line(alpha = 1/2, color = sl[7]) +
scale_y_continuous(NULL, breaks = NULL) +
ggtitle("50 posterior draws")
# combine and plot!
+ p2 &
p1 coord_cartesian(ylim = c(-2, 2))
Both methods returned the fitted lines in the metric of the underlying standard normal distribution. The fitted lines are nonlinear in the metric of the raw data Y
, but they’re linear in the metric of the presumed underlying distribution. If it helps, we’ll make a marginal plot of the standard normal distribution and tack it onto the right.
# make Phi
<-
p3 tibble(x = seq(from = -3, to = 3, by = .01)) %>%
mutate(d = dnorm(x)) %>%
ggplot(aes(x = x, y = d)) +
geom_area(fill = sl[5]) +
# add the thresholds!
geom_vline(xintercept = posterior_summary(fit23.9)[1:6, 1],
color = sl[9], linetype = 3) +
# mark the thresholds with the axis breaks
scale_x_reverse(NULL, breaks = fixef(fit23.9)[1:6, 1], position = "top",
labels = parse(text = str_c("theta[", 1:6, "]"))) +
scale_y_continuous(NULL, breaks = NULL) +
ggtitle(expression(italic(N)(0,1))) +
coord_flip(xlim = c(-2, 2))
# combine, format a bit, and plot
(| p2 ) &
((p1 geom_hline(yintercept = posterior_summary(fit23.9)[1:6, 1],
color = sl[9], linetype = 3) &
coord_cartesian(ylim = c(-2, 2)) |
p3)+
) plot_layout(widths = c(4, 4, 1))
The lines intersecting the plots are the posterior means for thresholds, \(\theta_1, \dots ,\theta_6\).
But these still aren’t faithful depictions of the top panel of Figure 23.7, you say. Okay, fine. One of the distinctive elements of that panel is the left-tilted bar-and-error plots. If you look closely at the vertical lines at their bases, you’ll see that the leftmost subplot starts at the minimum value of X
and the rightmost subplot starts at the maximum value of X
. Since our plots, so far, have been based on X_s
, we’ll use the minimum and maximum values from that. Here are those values.
<- range(my_data$X_s)) (r
## [1] -1.774444 1.750168
To my eye, it appears that the three middle subplots are equally distributed between those at the ends. If we proceed under that assumption, here’s how we might use fitted()
to get us rolling on computing the relevant values.
<- tibble(X_s = seq(from = r[1], to = r[2], length.out = 5))
nd
<-
f fitted(fit23.9,
newdata = nd)
# inspect the output
%>%
f str()
## num [1:5, 1:4, 1:7] 0.8078 0.4446 0.124 0.0155 0.0009 ...
## - attr(*, "dimnames")=List of 3
## ..$ : NULL
## ..$ : chr [1:4] "Estimate" "Est.Error" "Q2.5" "Q97.5"
## ..$ : chr [1:7] "P(Y = 1)" "P(Y = 2)" "P(Y = 3)" "P(Y = 4)" ...
Here we’ll rearrange the output to make it useful for plotting.
# rearrange the output
<-
f rbind(
1],
f[, , 2],
f[, , 3],
f[, , 4],
f[, , 5],
f[, , 6],
f[, , 7]
f[, , %>%
) # wrangle
data.frame() %>%
bind_cols(expand_grid(nd, Y = 1:7) %>% arrange(Y, X_s))
head(f)
## Estimate Est.Error Q2.5 Q97.5 X_s Y
## 1 0.8078061176 0.0473570337 0.706100621 0.890845927 -1.77444380 1
## 2 0.4445858032 0.0489066995 0.349544950 0.541241792 -0.89329094 1
## 3 0.1239922037 0.0251036999 0.079693316 0.177207790 -0.01213809 1
## 4 0.0155430683 0.0067798504 0.005795549 0.031534672 0.86901477 1
## 5 0.0009001091 0.0007991004 0.000116457 0.003028196 1.75016762 1
## 6 0.0920486261 0.0253704586 0.049029670 0.146783451 -1.77444380 2
Now we can make those bar-and-error plots.
%>%
f mutate(X_s = round(X_s, digits = 3)) %>%
ggplot(aes(x = Y, y = Estimate,
ymin = Q2.5, ymax = Q97.5)) +
geom_col(fill = sl[4]) +
geom_pointrange(fatten = 1.5, linewidth = 1, color = sl[7]) +
scale_x_continuous(breaks = 1:7) +
scale_y_reverse(NULL, position = "right", limits = c(1, 0), expand = c(0, 0),
breaks = c(1, .5, 0), labels = c("1", ".5", "0")) +
coord_flip() +
facet_wrap(~ X_s, ncol = 7, strip.position = "bottom")
The X_s
values are depicted in the panel strips on the bottom. The response probabilities are scaled based on the axis on the top. The points and leftmost sides of the bars are the posterior means. The thin, dark horizontal lines are the percentile-based 95% intervals. Here we reformat f
a little more to put those bar-and-error plots in a format more similar to that of Figure 23.7.
%>%
f select(-Est.Error) %>%
# rescale the probability summaries
mutate_at(vars(Estimate:Q97.5), ~. / 2) %>%
# plot!
ggplot() +
geom_vline(xintercept = seq(from = r[1], to = r[2], length.out = 5),
color = sl[2]) +
# bar marking the Estimate
geom_segment(aes(x = X_s, xend = X_s - Estimate,
y = Y + 0.1, yend = Y + 0.1),
linewidth = 8, color = sl[4]) +
# bar marking the 95% interval
geom_segment(aes(x = X_s - Q2.5, xend = X_s - Q97.5,
y = Y + 0.2, yend = Y + 0.2),
linewidth = 1, color = sl[7]) +
# data
geom_point(data = my_data,
aes(x = X_s, y = Y),
shape = 1, size = 2, color = sl[9]) +
scale_y_continuous("Y", breaks = 1:7, limits = c(0.5, 7.5)) +
coord_cartesian(xlim = c(-2.4, 2.4))
I’m not going to attempt superimposing fitted lines on this plot the way Kruschke did. Given that our ordered-probit model is nonlinear on the scale of the criterion, it seems misleading to present linear fitted lines atop the raw data. If you’d like to do so, you’re on your own.
Now here’s the corresponding model is we treat the y
data as metric with tricks from Chapter 17.
<- sd(my_data$X)
sd_x <- sd(my_data$Y)
sd_y <- mean(my_data$X)
m_x <- mean(my_data$Y)
m_y
<- 10 * abs(m_x * sd_y / sd_x)
beta_0_sigma <- 10 * abs(sd_y / sd_x)
beta_1_sigma
<-
stanvars stanvar(beta_0_sigma, name = "beta_0_sigma") +
stanvar(beta_1_sigma, name = "beta_1_sigma") +
stanvar(sd_y, name = "sd_y")
.10 <-
fit23brm(data = my_data,
family = gaussian,
~ 1 + X,
Y prior = c(prior(normal(0, beta_0_sigma), class = Intercept),
prior(normal(0, beta_1_sigma), class = b),
prior(normal(0, sd_y), class = sigma)),
chains = 4, cores = 4,
stanvars = stanvars,
seed = 23,
file = "fits/fit23.10")
It may not have been readily apparent from Kruschke’s prose in the note for Figure 23.7, but his OLS model was based on the fully unstandardized data (i.e., using X
as the predictor), not the partially standardized data he used in his JAGS code from 23.4.1. We followed the same sensibilities for fit23.10
. Speaking of which, here are the summaries for the marginal posteriors.
posterior_summary(fit23.10)[1:3, ] %>% round(digits = 3)
## Estimate Est.Error Q2.5 Q97.5
## b_Intercept -5.424 0.623 -6.645 -4.219
## b_X 6.711 0.440 5.857 7.557
## sigma 1.530 0.077 1.389 1.693
These values are very close to those he reported at the bottom of page 690. Here are what the fitted lines from that model look like when superimposed on the data, when presuming both variables are metric.
set.seed(23)
conditional_effects(fit23.10,
spaghetti = TRUE,
ndraws = 100) %>%
plot(points = T,
line_args = list(size = 0),
point_args = list(alpha = 1/3, color = sl[9]),
spaghetti_args = list(colour = alpha(sl[6], .2)))
For the next example, we’ll load the HappinessAssetsDebt.csv
data from Shi (2009).
<- read_csv("data.R/HappinessAssetsDebt.csv")
my_data
glimpse(my_data)
## Rows: 6,759
## Columns: 3
## $ Happiness <dbl> 3, 3, 3, 3, 1, 3, 2, 2, 4, 2, 3, 5, 3, 3, 4, 3, 3, 2, 4, 3, 4, 3, 3, 3, 5, 4, 3, 4, 4, 4, …
## $ Assets <dbl> 0, 10000, 30000, 40000, 21000, 20000, 20000, 0, 0, 20000, 5000, 30000, 40000, 5500, 50000,…
## $ Debt <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5000, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 32000, 0, 0, 0,…
Here’s a quick scatter plot of the data. To help with the overplotting, the points have been horizontally jittered a bit. But as indicated in the text, Happiness
is a discrete variable.
set.seed(23)
%>%
my_data ggplot(aes(x = Assets, y = Happiness)) +
geom_jitter(alpha = 1/4, height = .25, color = sl[9]) +
scale_x_continuous(expand = expansion(0, 0.05))
Standardize our predictor.
<-
my_data %>%
my_data mutate(Assets_s = (Assets - mean(Assets)) / sd(Assets))
Fit the model like before.
.11 <-
fit23brm(data = my_data,
family = cumulative(probit),
~ 1 + Assets_s,
Happiness prior = c(prior(normal(0, 4), class = Intercept),
prior(normal(0, 4), class = b)),
iter = 3000, warmup = 1000, chains = 4, cores = 4,
seed = 23,
file = "fits/fit23.11")
Check the summary.
print(fit23.11)
## Family: cumulative
## Links: mu = probit; disc = identity
## Formula: Happiness ~ 1 + Assets_s
## Data: my_data (Number of observations: 6759)
## Draws: 4 chains, each with iter = 3000; warmup = 1000; thin = 1;
## total post-warmup draws = 8000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept[1] -2.03 0.03 -2.10 -1.96 1.00 5775 5193
## Intercept[2] -1.17 0.02 -1.20 -1.13 1.00 8414 6710
## Intercept[3] -0.15 0.02 -0.18 -0.12 1.00 8713 6906
## Intercept[4] 1.48 0.02 1.44 1.53 1.00 8483 7241
## Assets_s 0.15 0.01 0.12 0.17 1.00 7875 4749
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## disc 1.00 0.00 1.00 1.00 NA NA NA
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
Extract the posterior draws and compare the brms parameterization of \(\theta_{[i]}\) with the parameterization in the text in an expanded version of the bottom panel of Figure 23.8.
<- as_draws_df(fit23.11)
draws
%>%
draws select(`b_Intercept[1]`:`b_Intercept[4]`, .draw) %>%
compare_thresholds(lb = 1.5, ub = 4.5)
Behold the marginal distribution of b_Assets_s
, our effect size for Assets
.
%>%
draws ggplot(aes(x = b_Assets_s, y = 0)) +
stat_halfeye(point_interval = mode_hdi, .width = .95,
fill = sl[4], color = sl[8]) +
scale_y_continuous(NULL, breaks = NULL)
Here’s the fitted()
-oriented preparatory work for our version of the top panel of Figure 23.8.
# define the range for the predictor
<- range(my_data$Assets_s)
r
# re-define the new data
<- tibble(Assets_s = seq(from = r[1], to = r[2], length.out = 5))
nd
# compute the fitted summaries
<-
f fitted(fit23.11,
newdata = nd)
# rearrange the output
<-
f rbind(
1],
f[, , 2],
f[, , 3],
f[, , 4],
f[, , 5]
f[, , %>%
) # wrangle
data.frame() %>%
bind_cols(expand_grid(nd, Happiness = 1:5) %>% arrange(Happiness, Assets_s))
# examine
head(f)
## Estimate Est.Error Q2.5 Q97.5 Assets_s Happiness
## 1 2.615146e-02 2.121289e-03 2.216336e-02 3.046707e-02 -0.5971712 1
## 2 3.166490e-03 7.149115e-04 1.979798e-03 4.776908e-03 4.7565107 1
## 3 2.339116e-04 1.311017e-04 7.097467e-05 5.596200e-04 10.1101925 1
## 4 1.142924e-05 1.324943e-05 1.089479e-06 4.498606e-05 15.4638743 1
## 5 4.138155e-07 1.012777e-06 7.215254e-09 2.390645e-06 20.8175561 1
## 6 1.146798e-01 4.187883e-03 1.066564e-01 1.230192e-01 -0.5971712 2
Like with the same variant from Figure 23.7, we will not be superimposing linear fitted lines. The model is nonlinear on the scale of the data and I don’t want to confuse readers by pretending otherwise.
%>%
f select(-Est.Error) %>%
# rescale the probability summaries
mutate_at(vars(Estimate:Q97.5), ~. * 2.5) %>%
# plot!
ggplot() +
geom_vline(xintercept = seq(from = r[1], to = r[2], length.out = 5),
color = sl[2]) +
# bar marking the Estimate
geom_segment(aes(x = Assets_s, xend = Assets_s - Estimate,
y = Happiness + 0.1, yend = Happiness + 0.1),
linewidth = 8, color = sl[4]) +
# bar marking the 95% interval
geom_segment(aes(x = Assets_s - Q2.5, xend = Assets_s - Q97.5,
y = Happiness + 0.2, yend = Happiness + 0.2),
linewidth = 1, color = sl[7]) +
# data
geom_point(data = my_data,
aes(x = Assets_s, y = Happiness),
shape = 1, size = 2, color = sl[9]) +
scale_y_continuous("Happiness", breaks = 1:5, limits = c(0.5, 5.5)) +
coord_cartesian(xlim = c(-4, 24))
Now here’s the corresponding model is we treat Happiness
as metric. Unlike our method for the corresponding model from Figure 23.7, fit10
, we will use the standardized version of the predictor, Assets_s
. The unstandardized values for Happiness
and Assets
are on vastly different scales, which can be difficulty for HMC with broad priors of the type Kruschke often uses. Standardizing the predictor helps.
<- sd(my_data$Happiness)
sd_y <-
stanvars stanvar(sd_y, name = "sd_y")
.12 <-
fit23brm(data = my_data,
family = gaussian,
~ 1 + Assets_s,
Happiness prior = c(prior(normal(3.5, 5), class = Intercept),
prior(normal(0, 5), class = b),
prior(normal(0, sd_y), class = sigma)),
chains = 4, cores = 4,
stanvars = stanvars,
seed = 23,
file = "fits/fit23.12")
Here are the summaries for the marginal posteriors.
posterior_summary(fit23.12)[1:3, ] %>% round(digits = 6)
## Estimate Est.Error Q2.5 Q97.5
## b_Intercept 3.484241 0.010519 3.463676 3.505171
## b_Assets_s 0.116069 0.010209 0.095848 0.136378
## sigma 0.847126 0.007192 0.833087 0.861646
They’re just a bit different from those produced by OLS. Here are what the fitted lines from that model look like when superimposed on the data, when presuming both variables are metric.
set.seed(23)
conditional_effects(fit23.12,
spaghetti = TRUE,
ndraws = 100) %>%
plot(points = T,
line_args = list(size = 0),
point_args = list(alpha = 1/3, color = sl[9]),
spaghetti_args = list(colour = alpha(sl[6], .2)))
23.4.3 Example: Movies–They don’t make ’em like they used to.
For this section, we’ll load the Moore (2006) Movies.csv
data.
<- read_csv("data.R/Movies.csv")
my_data
glimpse(my_data)
## Rows: 100
## Columns: 6
## $ Title <chr> "A_Ticklish_Affair", "Action_in_the_North_Atlantic", "And_the_Ship_Sails_On", "Autumn_So…
## $ Year <dbl> 1963, 1943, 1984, 1978, 1931, 1930, 1950, 1989, 1940, 1947, 1970, 1940, 1976, 1985, 1945…
## $ Length <dbl> 89, 127, 138, 97, 77, 69, 93, 119, 70, 69, 101, 62, 97, 85, 62, 86, 112, 97, 93, 89, 90,…
## $ Cast <dbl> 5, 7, 7, 5, 6, 8, 5, 8, 9, 9, 9, 6, 10, 10, 9, 6, 10, 6, 12, 7, 5, 9, 6, 7, 6, 6, 12, 11…
## $ Rating <dbl> 2.0, 3.0, 3.0, 3.0, 2.5, 2.5, 3.0, 2.5, 2.5, 2.0, 3.0, 2.0, 2.5, 1.0, 1.5, 2.5, 3.0, 2.0…
## $ Description <dbl> 7, 9, 15, 11, 7, 10, 8, 15, 8, 8, 11, 10, 12, 13, 9, 7, 10, 11, 11, 8, 9, 9, 13, 9, 7, 7…
In footnote #5 at the bottom of page 693, Kruschke explained that whereas the original Ratings
data ranged from 1.0
to 4.0
in half-unit increments, he recoded them to range from 1
to 7
. Here we recode Ratings
in the same way using dplyr::recode()
. While we’re at it, we’ll make standardized versions of the predictors, too.
<-
my_data %>%
my_data mutate(Rating = recode(Rating,
`1.0` = 1,
`1.5` = 2,
`2.0` = 3,
`2.5` = 4,
`3.0` = 5,
`3.5` = 6,
`4.0` = 7),
Year_s = (Year - mean(Year)) / sd(Year),
Length_s = (Length - mean(Length)) / sd(Length))
Here’s a scatter plot of the data, with points colored by Rating
.
%>%
my_data mutate(Rating = factor(Rating)) %>%
ggplot(aes(x = Year, y = Length, label = Rating)) +
geom_point(aes(color = Rating),
size = 3) +
geom_text(size = 3) +
scale_color_scico_d(palette = "lajolla", begin = .15, end = .7)
Fitting the multivariable ordered-probit model with brms is about as simple as fitting any other multivariable model. Just tack on predictors with the +
operator.
.13 <-
fit23brm(data = my_data,
family = cumulative(probit),
~ 1 + Year_s + Length_s,
Rating prior = c(prior(normal(0, 4), class = Intercept),
prior(normal(0, 4), class = b)),
iter = 3000, warmup = 1000, chains = 4, cores = 4,
seed = 23,
file = "fits/fit23.13")
Check the model summary.
print(fit23.13)
## Family: cumulative
## Links: mu = probit; disc = identity
## Formula: Rating ~ 1 + Year_s + Length_s
## Data: my_data (Number of observations: 100)
## Draws: 4 chains, each with iter = 3000; warmup = 1000; thin = 1;
## total post-warmup draws = 8000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept[1] -1.68 0.21 -2.10 -1.30 1.00 6902 6069
## Intercept[2] -0.91 0.15 -1.22 -0.62 1.00 9102 7020
## Intercept[3] -0.22 0.13 -0.47 0.04 1.00 8825 7107
## Intercept[4] 0.61 0.14 0.34 0.88 1.00 8576 7290
## Intercept[5] 1.69 0.20 1.31 2.09 1.00 8454 7078
## Intercept[6] 2.58 0.38 1.93 3.42 1.00 9942 6206
## Year_s -0.48 0.13 -0.73 -0.24 1.00 7125 6109
## Length_s 0.62 0.13 0.36 0.87 1.00 7056 5731
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## disc 1.00 0.00 1.00 1.00 NA NA NA
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
Extract the posterior draws and compare the brms parameterization of \(\theta_{[i]}\) with the parameterization in the text in an expanded version of the bottom panel of Figure 23.9.
<- as_draws_df(fit23.13)
draws
%>%
draws select(`b_Intercept[1]`:`b_Intercept[6]`, .draw) %>%
compare_thresholds(lb = 1.5, ub = 6.5)
To sate any curiosity, here are the correlations among the parameters.
# wrangle
as_draws_df(fit23.13) %>%
select(contains("Intercept"), b_Year_s, b_Length_s) %>%
set_names(str_c("theta[", 1:6, "]"), str_c("beta[", 1:2, "]")) %>%
# plot!
ggpairs(upper = list(continuous = my_upper),
diag = list(continuous = my_diag),
lower = list(continuous = my_lower),
labeller = label_parsed) +
ggtitle("These parameters have milder correlations")
Now focus on the marginal distribution of our two effect-size parameters.
%>%
draws pivot_longer(ends_with("_s")) %>%
mutate(name = factor(name,
levels = c("b_Year_s", "b_Length_s"))) %>%
ggplot(aes(x = value, y = 0)) +
stat_halfeye(point_interval = mode_hdi, .width = .95,
fill = sl[4], color = sl[8], normalize = "panels") +
scale_y_continuous(NULL, breaks = NULL) +
xlab("effect size") +
facet_wrap(~ name, scales = "free")
Before we make the top panel from Figure 23.9, I’d like to wander a bit and look at something related. We’ll use fitted()
.
# define the new data
<- crossing(Year_s = seq(from = -3, to = 3, by = 0.25),
nd Length_s = seq(from = -3, to = 3, by = 0.25))
# compute the `Response` probabilities
<-
f fitted(fit23.13,
newdata = nd)
# rearrange the output
<-
f rbind(
1],
f[, , 2],
f[, , 3],
f[, , 4],
f[, , 5],
f[, , 6],
f[, , 7]
f[, , %>%
) # wrangle
data.frame() %>%
bind_cols(
%>%
nd expand_grid(Rating = 1:7) %>%
arrange(Rating, Year_s)
)
# what did we do?
head(f)
## Estimate Est.Error Q2.5 Q97.5 Year_s Length_s Rating
## 1 0.11672604 0.07879391 0.018835945 0.31606704 -3 -3.00 1
## 2 0.09038339 0.06469596 0.012824934 0.25784313 -3 -2.75 1
## 3 0.06877638 0.05226785 0.008791097 0.20515430 -3 -2.50 1
## 4 0.05146467 0.04162030 0.005882207 0.16194457 -3 -2.25 1
## 5 0.03790478 0.03273187 0.003797599 0.12640909 -3 -2.00 1
## 6 0.02751017 0.02548193 0.002330570 0.09644921 -3 -1.75 1
We just computed the response probabilities across the two-dimensional grid of the predictor values. Now plot using the posterior means.
%>%
f mutate(strip = str_c("Rating: ", Rating)) %>%
ggplot(aes(x = Year_s, y = Length_s)) +
geom_raster(aes(fill = Estimate),
interpolate = T) +
geom_text(data = my_data %>% mutate(strip = str_c("Rating: ", Rating)),
aes(label = Rating),
color = "white", size = 2.5) +
scale_fill_scico(palette = 'lajolla', direction = -1, limits = c(0, 1),
breaks = c(0, .5, 1), labels = c("0", ".5", "1")) +
scale_x_continuous(breaks = seq(from = -2, to = 2, by = 2),
expand = c(0, 0)) +
scale_y_continuous(breaks = seq(from = -2, to = 2, by = 2),
expand = c(0, 0)) +
theme(legend.position = "bottom") +
facet_wrap(~ strip, nrow = 1)
This model didn’t do a great job capturing the Response
probabilities. If you’re curious, you’ll find you can do a little bit better if you allow the two predictors to interact (i.e., add + Year_s:Length_s
to the formula
line). Even then, the model isn’t great. I leave that as an exercise for the interested reader.
For this model, however, we will follow Kruschke and make a more faithful version of the top panel of Figure 23.9. We’ll need to wrangle our draws
data a bit to get things ready. Here’s the work.
<-
draws %>%
draws # we just need the data from three steps in the HMC chain
slice(1:3) %>%
mutate(.draw = .draw %>% as.factor()) %>%
rename(b1 = b_Year_s,
b2 = b_Length_s) %>%
# because these are straight lines, two extreme x1-values are all we need
expand_grid(x1 = c(-10, 10)) %>%
pivot_longer(contains("["),
names_to = "theta") %>%
# use Kruschke's Formula 23.5
mutate(x2 = (value / b2) + (-b1 / b2) * x1,
# this just renames our x variables for easy plotting
Year_s = x1,
Length_s = x2)
glimpse(draws)
## Rows: 36
## Columns: 14
## $ b1 <dbl> -0.5207952, -0.5207952, -0.5207952, -0.5207952, -0.5207952, -0.5207952, -0.5207952, -0.52…
## $ b2 <dbl> 0.6693850, 0.6693850, 0.6693850, 0.6693850, 0.6693850, 0.6693850, 0.6693850, 0.6693850, 0…
## $ disc <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ lprior <dbl> -18.89365, -18.89365, -18.89365, -18.89365, -18.89365, -18.89365, -18.89365, -18.89365, -…
## $ lp__ <dbl> -181.3174, -181.3174, -181.3174, -181.3174, -181.3174, -181.3174, -181.3174, -181.3174, -…
## $ .chain <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ .iteration <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3,…
## $ .draw <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3,…
## $ x1 <dbl> -10, -10, -10, -10, -10, -10, 10, 10, 10, 10, 10, 10, -10, -10, -10, -10, -10, -10, 10, 1…
## $ theta <chr> "b_Intercept[1]", "b_Intercept[2]", "b_Intercept[3]", "b_Intercept[4]", "b_Intercept[5]",…
## $ value <dbl> -1.4117774, -0.9005644, -0.3276923, 0.6500697, 1.7471252, 2.7113437, -1.4117774, -0.90056…
## $ x2 <dbl> -9.889270, -9.125564, -8.269746, -6.809059, -5.170158, -3.729704, 5.671137, 6.434843, 7.2…
## $ Year_s <dbl> -10, -10, -10, -10, -10, -10, 10, 10, 10, 10, 10, 10, -10, -10, -10, -10, -10, -10, 10, 1…
## $ Length_s <dbl> -9.889270, -9.125564, -8.269746, -6.809059, -5.170158, -3.729704, 5.671137, 6.434843, 7.2…
Now just plot.
%>%
draws ggplot(aes(x = Year_s, y = Length_s)) +
geom_line(aes(group = interaction(.draw, theta), color = theta, linetype = .draw)) +
geom_text(data = my_data,
aes(label = Rating)) +
scale_color_scico_d(expression(theta), palette = "lajolla", begin = .25, labels = 1:6) +
coord_cartesian(xlim = range(my_data$Year_s),
ylim = range(my_data$Length_s))
It might be easier to see Kruschke’s main point if we facet by .draw
.
%>%
draws ggplot(aes(x = Year_s, y = Length_s)) +
geom_line(aes(group = interaction(.draw, theta), color = theta, linetype = .draw)) +
geom_text(data = my_data,
aes(label = Rating)) +
scale_color_scico_d(expression(theta), palette = "lajolla", begin = .25, labels = 1:6) +
coord_cartesian(xlim = range(my_data$Year_s),
ylim = range(my_data$Length_s)) +
theme(legend.direction = "horizontal",
legend.position = c(.75, .25)) +
facet_wrap(~ .draw, ncol = 2)
Both our versions of the plot show what Kruschke pointed out in the text:
Threshold lines from the same step in the chain must be parallel because the regression coefficients are constant at that step, but are different at another step. The threshold lines in Figure 23.9 are level contours on the underlying metric planar surface, and the lines reveal that the ratings increase toward the top left, that is, as \(x_1\) decreases and \(x_2\) increases. (p. 693)
Before we move on to the next section, what these diagonal 2-dimensional threshold lines also hint at is that when we use two predictors to describe ordinal data as having been produces by an underlying unit Gaussian distribution, that underlying distribution is actually bivariate Gaussian. Here we’ll use fitted()
one more time to depict that bivariate-Gaussian distribution with a little geom_raster()
.
# define the new data
<- crossing(Year_s = seq(from = -5, to = 5, by = 0.1),
nd Length_s = seq(from = -5, to = 5, by = 0.1))
fitted(fit23.13,
newdata = nd,
# this will yield z-scores
scale = "linear") %>%
data.frame() %>%
bind_cols(nd) %>%
# convert the z-scores to density values
mutate(density = dnorm(Estimate, 0, 1)) %>%
ggplot(aes(x = Year_s, y = Length_s)) +
geom_raster(aes(fill = density),
interpolate = T) +
geom_text(data = my_data,
aes(label = Rating),
size = 2.5) +
scale_fill_scico("density", palette = 'lajolla', direction = -1,
limits = c(0, 0.4)) +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0))
As with many of our previous approaches with geom_raster()
, this plot is based on the posterior means in each cell and, therefore, does a poor job depicting the uncertainty in the posterior distribution.
23.4.4 Why are some thresholds outside the data?
Now load Kruschke’s simulated data.
<- read_csv("data.R/OrdinalProbitData-Movies.csv")
my_data
glimpse(my_data)
## Rows: 400
## Columns: 3
## $ Year <dbl> 1959, 1946, 1964, 1938, 1946, 1971, 1957, 1970, 1968, 1962, 1929, 1962, 1978, 1972, 1964, 195…
## $ Length <dbl> 88, 117, 130, 85, 111, 105, 93, 119, 78, 77, 138, 107, 70, 60, 138, 122, 72, 109, 71, 61, 60,…
## $ Rating <dbl> 4, 5, 5, 4, 5, 4, 3, 4, 3, 3, 7, 4, 2, 2, 5, 5, 3, 4, 3, 2, 2, 3, 5, 3, 5, 5, 5, 4, 5, 4, 4, …
These data imitate the movie ratings, but with two key differences. First and foremost, the artificial data have much smaller noise, with \(\sigma = 0.20\) as opposed to \(\sigma \approx 1.25\) in the real data. Second, the artificial data have points that span the entire range of both predictors, unlike the real data which had points mostly in the central region. (p. 695)
Like with the real movie data, we’ll inspect these data with a colored scatter plot.
%>%
my_data mutate(Rating = factor(Rating)) %>%
ggplot(aes(x = Year, y = Length, label = Rating)) +
geom_point(aes(color = Rating),
size = 3) +
geom_text(size = 3) +
scale_color_scico_d(palette = "lajolla", begin = .15, end = .7)
Unlike in last section, there appears to be a clear trend in Kruschke’s simulated data. Kruschke’s simulated critic liked his movies old and long. Time to standardize the predictors.
<-
my_data %>%
my_data mutate(Year_s = (Year - mean(Year)) / sd(Year),
Length_s = (Length - mean(Length)) / sd(Length))
Fitting the multivariable ordered-probit model with brms is about as simple as fitting any other multivariable model. Just tack on predictors with the +
operator.
.14 <-
fit23update(fit23.13,
newdata = my_data,
iter = 3000, warmup = 1000, chains = 4, cores = 4,
seed = 23,
file = "fits/fit23.14")
## The desired updates require recompiling the model
Check the model summary.
print(fit23.14)
## Family: cumulative
## Links: mu = probit; disc = identity
## Formula: Rating ~ 1 + Year_s + Length_s
## Data: my_data (Number of observations: 400)
## Draws: 4 chains, each with iter = 3000; warmup = 1000; thin = 1;
## total post-warmup draws = 8000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept[1] -9.75 0.66 -11.06 -8.50 1.00 1658 2444
## Intercept[2] -5.95 0.42 -6.79 -5.16 1.00 1989 2629
## Intercept[3] -2.24 0.21 -2.67 -1.83 1.00 3268 4870
## Intercept[4] 2.42 0.21 2.01 2.83 1.00 2655 4703
## Intercept[5] 7.77 0.54 6.73 8.88 1.00 1797 2757
## Intercept[6] 10.85 0.76 9.42 12.42 1.00 1789 2496
## Year_s -2.79 0.20 -3.19 -2.42 1.00 1826 2438
## Length_s 4.72 0.31 4.14 5.35 1.00 1595 2515
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## disc 1.00 0.00 1.00 1.00 NA NA NA
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
Extract the posterior draws and use compare_thresholds()
to make our expanded version of the bottom panel of Figure 23.10.
<- as_draws_df(fit23.14)
draws
%>%
draws select(`b_Intercept[1]`:`b_Intercept[6]`, .draw) %>%
compare_thresholds(lb = 1.5, ub = 6.5)
Make the marginal distribution of our two effect-size parameters.
%>%
draws pivot_longer(ends_with("_s")) %>%
mutate(name = factor(name, levels = c("b_Year_s", "b_Length_s"))) %>%
ggplot(aes(x = value, y = 0)) +
stat_halfeye(point_interval = mode_hdi, .width = .95,
fill = sl[4], color = sl[8], normalize = "panels") +
scale_y_continuous(NULL, breaks = NULL) +
xlab("effect size") +
facet_wrap(~ name, scales = "free")
Make the top panel for Figure 23.10 just like we did for its analogue in Figure 23.9.
# extract the posterior draws and wrangle
as_draws_df(fit23.14) %>%
slice(1:3) %>%
mutate(.draw = .draw %>% as.factor()) %>%
rename(b1 = b_Year_s,
b2 = b_Length_s) %>%
expand_grid(x1 = c(-10, 10)) %>%
pivot_longer(contains("["),
names_to = "theta") %>%
mutate(x2 = (value / b2) + (-b1 / b2) * x1,
Year_s = x1,
Length_s = x2) %>%
# plot!
ggplot(aes(x = Year_s, y = Length_s)) +
geom_line(aes(group = interaction(.draw, theta), color = theta, linetype = .draw)) +
geom_text(data = my_data,
aes(label = Rating)) +
scale_color_scico_d(expression(theta), palette = "lajolla", begin = .25, labels = 1:6) +
coord_cartesian(xlim = range(my_data$Year_s),
ylim = range(my_data$Length_s))
Those are some tight thresholds. They “very clearly cleave parallel regions of distinct ordinal values. The example demonstrates that the threshold lines do make intuitive sense when there is low noise and a broad range of data” (p. 695, emphasis in the original).
With our various bonus plots, we’ve been anticipating Figure 23.11 for some time, now. The thresholds from fit23.14
result in beautifully nonlinear probability curves for the Rating
levels. Take a quick look with conditional_effects()
.
<- conditional_effects(fit23.14, categorical = T)
ce
plot(ce, plot = FALSE)[[1]] +
scale_fill_scico_d(palette = "lajolla", begin = .25) +
scale_color_scico_d(palette = "lajolla", begin = .25) +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0), limits = c(0, 1))
plot(ce, plot = FALSE)[[2]] +
scale_fill_scico_d(palette = "lajolla", begin = .25) +
scale_color_scico_d(palette = "lajolla", begin = .25) +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0), limits = c(0, 1))
Because the model had two predictors, we got two plots. What brms::conditional_effects()
called Probability
on the \(y\)-axis is the same as what Kruschke called \(p(y)\) on his. Rather than generic predictors \(x\) on the \(x\)-axis, our plots had either Year_s
or Length_s
. Whereas Kruschke marked off his different outcomes by line styles, ours were marked by color. Since we don’t have the data Kruschke used to make Figure 23.11, we won’t be able to reproduce it exactly. However, you’ll note that our plot for Length_s
corresponded nicely with his subplot on the top (i.e., the one for which \(\sigma = 0.1\)). If we set effects = "Length_s"
, we can use conditional_effects()
to make a similar plot to Kruschke’s subplot for which \(\sigma = 1\).
<- conditional_effects(fit23.13,
ce categorical = T,
effects = "Length_s")
plot(ce, plot = FALSE)[[1]] +
scale_fill_scico_d(palette = "lajolla", begin = .25) +
scale_color_scico_d(palette = "lajolla", begin = .25) +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = expansion(mult = c(0, 0.05)), limits = c(0, NA))
“You can see that each outcome has maximum probability within its corresponding interval, but there is considerable smearing of outcomes into adjacent intervals” (p. 695).
Finishing off, Kruschke’s discussion in the text
referred to \(\sigma\) as “noise” merely for linguistic ease. Calling the outcomes “noisy” does not mean the underlying generator of the outcomes is inherently wildly random. The “noise” is merely variation in the outcome that cannot be accounted for by the particular model we have chosen with the particular predictors we have chosen. A different model and/or different predictors might account for the outcomes well with little residual noise. In this sense, the noise is in the model, not in the data. (p. 698, emphasis in the original)
Through this lens, noisy-looking data are a symptom of weak theory and/or poor data-collection methods.
23.5 Posterior prediction
The cumulative-normal model makes posterior predictions for the probabilities of the \(K\) categories in the criterion variable by computing \(p (y | \mu (x), \sigma, \{ \theta_k \} )\) in each step in the HMC chain. In this equation, \(\mu (x) = \beta_0 + \sum_j \beta_j x_j\). Though recall that with our brms parameterization, we have \(\beta_0\) fixed at 0. Kruschke framed part of his discussion in this chapter in terms of a single-predictor model, such as was entertained in Figure 23.8. Recall that corresponds to our fit11
. Here’s that formula
.
.11$formula fit23
## Happiness ~ 1 + Assets_s
With brms, you can get this information with fitted()
. Let’s say we wanted to focus on response probabilities for Assets_s = -1)
. Here’s what we get.
fitted(fit23.11,
newdata = tibble(Assets_s = -1))
## , , P(Y = 1)
##
## Estimate Est.Error Q2.5 Q97.5
## [1,] 0.03000513 0.002463134 0.02536485 0.0349928
##
## , , P(Y = 2)
##
## Estimate Est.Error Q2.5 Q97.5
## [1,] 0.1246466 0.00481312 0.1154048 0.1341927
##
## , , P(Y = 3)
##
## Estimate Est.Error Q2.5 Q97.5
## [1,] 0.3434908 0.006255248 0.331182 0.3557538
##
## , , P(Y = 4)
##
## Estimate Est.Error Q2.5 Q97.5
## [1,] 0.450522 0.006947555 0.4367891 0.4641612
##
## , , P(Y = 5)
##
## Estimate Est.Error Q2.5 Q97.5
## [1,] 0.0513355 0.002927854 0.04576673 0.05707253
As is typical of brms, those probability summaries were in terms of the posterior mean and percentile-based 95% intervals. If you’re like Kruschke and prefer posterior modes and HDIs, you’ll need to set summary = F
and wrangle a bit.
<-
f fitted(fit23.11,
newdata = tibble(Assets_s = -1),
summary = F)
cbind(
1],
f[, , 2],
f[, , 3],
f[, , 4],
f[, , 5]
f[, , %>%
) data.frame() %>%
set_names(str_c("p(Happiness = ", 1:5, " | Assets_s = -1)")) %>%
pivot_longer(everything()) %>%
group_by(name) %>%
mode_hdi(value) %>%
mutate_if(is.double, round, digits = 4)
## # A tibble: 5 × 7
## name value .lower .upper .width .point .interval
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 p(Happiness = 1 | Assets_s = -1) 0.03 0.0254 0.035 0.95 mode hdi
## 2 p(Happiness = 2 | Assets_s = -1) 0.124 0.116 0.134 0.95 mode hdi
## 3 p(Happiness = 3 | Assets_s = -1) 0.343 0.331 0.356 0.95 mode hdi
## 4 p(Happiness = 4 | Assets_s = -1) 0.450 0.437 0.464 0.95 mode hdi
## 5 p(Happiness = 5 | Assets_s = -1) 0.0514 0.0456 0.0568 0.95 mode hdi
23.6 Generalizations and extensions
In this section, Kruschke mentioned extensions of this class of models might include using the cumulative \(t\) function to handle outliers or adding a guessing parameter. Full disclosure: I have not fit models like these. Based on my knowledge of brms, I suspect they’re possible. For insights how, you might review Bürkner’s (2022d) Define custom response distributions with brms and his (2022e) Estimating non-linear models with brms vignettes.
In addition, there are other likelihoods one might use to model ordinal data using brms. Your first stop should be Bürkner and Vourre’s (2019) Ordinal regression models in psychology: A Tutorial, where, in addition to the cumulative normal model, they cover the sequential and adjacent category models. You might also check out Chapter 11 of my (2023a) ebook recoding McElreath’s Statistical rethinking, wherein I show how one might use the logit link (i.e., family = cumulative(logit)
) to fit ordered-categorical models with brms.
If you’d just like more practice with the cumulative probit model, you might check out my blog post called Notes on the Bayesian cumulative probit.
Session info
sessionInfo()
## R version 4.2.2 (2022-10-31)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur ... 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] GGally_2.1.2 tidybayes_3.0.2 brms_2.18.0 Rcpp_1.0.9 patchwork_1.1.2 scico_1.3.1
## [7] forcats_0.5.1 stringr_1.4.1 dplyr_1.0.10 purrr_1.0.1 readr_2.1.2 tidyr_1.2.1
## [13] tibble_3.1.8 ggplot2_3.4.0 tidyverse_1.3.2
##
## loaded via a namespace (and not attached):
## [1] readxl_1.4.1 backports_1.4.1 plyr_1.8.7 igraph_1.3.4 svUnit_1.0.6
## [6] splines_4.2.2 crosstalk_1.2.0 TH.data_1.1-1 rstantools_2.2.0 inline_0.3.19
## [11] digest_0.6.31 htmltools_0.5.3 fansi_1.0.3 magrittr_2.0.3 checkmate_2.1.0
## [16] googlesheets4_1.0.1 tzdb_0.3.0 modelr_0.1.8 RcppParallel_5.1.5 matrixStats_0.63.0
## [21] vroom_1.5.7 xts_0.12.1 sandwich_3.0-2 prettyunits_1.1.1 colorspace_2.0-3
## [26] rvest_1.0.2 ggdist_3.2.1 haven_2.5.1 xfun_0.35 callr_3.7.3
## [31] crayon_1.5.2 jsonlite_1.8.4 lme4_1.1-31 survival_3.4-0 zoo_1.8-10
## [36] glue_1.6.2 gtable_0.3.1 gargle_1.2.0 emmeans_1.8.0 distributional_0.3.1
## [41] pkgbuild_1.3.1 rstan_2.21.8 abind_1.4-5 scales_1.2.1 mvtnorm_1.1-3
## [46] DBI_1.1.3 miniUI_0.1.1.1 xtable_1.8-4 HDInterval_0.2.4 bit_4.0.4
## [51] stats4_4.2.2 StanHeaders_2.21.0-7 DT_0.24 htmlwidgets_1.5.4 httr_1.4.4
## [56] threejs_0.3.3 RColorBrewer_1.1-3 arrayhelpers_1.1-0 posterior_1.3.1 ellipsis_0.3.2
## [61] reshape_0.8.9 pkgconfig_2.0.3 loo_2.5.1 farver_2.1.1 sass_0.4.2
## [66] dbplyr_2.2.1 utf8_1.2.2 tidyselect_1.2.0 labeling_0.4.2 rlang_1.0.6
## [71] reshape2_1.4.4 later_1.3.0 munsell_0.5.0 cellranger_1.1.0 tools_4.2.2
## [76] cachem_1.0.6 cli_3.6.0 generics_0.1.3 broom_1.0.2 evaluate_0.18
## [81] fastmap_1.1.0 processx_3.8.0 knitr_1.40 bit64_4.0.5 fs_1.5.2
## [86] nlme_3.1-160 projpred_2.2.1 mime_0.12 xml2_1.3.3 compiler_4.2.2
## [91] bayesplot_1.10.0 shinythemes_1.2.0 rstudioapi_0.13 gamm4_0.2-6 reprex_2.0.2
## [96] bslib_0.4.0 stringi_1.7.8 highr_0.9 ps_1.7.2 Brobdingnag_1.2-8
## [101] lattice_0.20-45 Matrix_1.5-1 nloptr_2.0.3 markdown_1.1 shinyjs_2.1.0
## [106] tensorA_0.36.2 vctrs_0.5.1 pillar_1.8.1 lifecycle_1.0.3 jquerylib_0.1.4
## [111] bridgesampling_1.1-2 estimability_1.4.1 httpuv_1.6.5 R6_2.5.1 bookdown_0.28
## [116] promises_1.2.0.1 gridExtra_2.3 codetools_0.2-18 boot_1.3-28 MASS_7.3-58.1
## [121] colourpicker_1.1.1 gtools_3.9.4 assertthat_0.2.1 withr_2.5.0 shinystan_2.6.0
## [126] multcomp_1.4-20 mgcv_1.8-41 parallel_4.2.2 hms_1.1.1 grid_4.2.2
## [131] minqa_1.2.5 coda_0.19-4 rmarkdown_2.16 googledrive_2.0.0 shiny_1.7.2
## [136] lubridate_1.8.0 base64enc_0.1-3 dygraphs_1.1.1.6