4 Geocentric Models

Linear regression is the geocentric model of applied statistics. By “linear regression,” we will mean a family of simple statistical golems that attempt to learn about the mean and variance of some measurement, using an additive combination of other measurements. Like geocentrism, linear regression can usefully describe a very large variety of natural phenomena. Like geocentrism, linear regression is a descriptive model that corresponds to many different process models. If we read its structure too literally, we’re likely to make mistakes. But used wisely, these little linear golems continue to be useful. (McElreath, 2020a, p. 71, emphasis, in the original)

4.1 Why normal distributions are normal

After laying out his soccer field coin toss shuffle premise, McElreath wrote:

It’s hard to say where any individual person will end up, but you can say with great confidence what the collection of positions will be. The distances will be distributed in approximately normal, or Gaussian, fashion. This is true even though the underlying distribution is binomial. It does this because there are so many more possible ways to realize a sequence of left-right steps that sums to zero. There are slightly fewer ways to realize a sequence that ends up one step left or right of zero, and so on, with the number of possible sequences declining in the characteristic bell curve of the normal distribution. (p. 72)

4.1.1 Normal by addition.

Here’s a way to do the simulation necessary for the plot in the top panel of Figure 4.2.

library(tidyverse)

# we set the seed to make the results of `runif()` reproducible.
set.seed(4)

pos <- 
  # make data with 100 people, 16 steps each with a starting point of `step == 0` (i.e., 17 rows per person)
  crossing(person = 1:100,
           step   = 0:16) %>% 
  # for all steps above `step == 0` simulate a `deviation`
  mutate(deviation = map_dbl(step, ~if_else(. == 0, 0, runif(1, -1, 1)))) %>% 
  # after grouping by `person`, compute the cumulative sum of the deviations, then `ungroup()`
  group_by(person) %>%
  mutate(position = cumsum(deviation)) %>% 
  ungroup() 

That map_dbl() code within the first mutate() line might look odd. Go here to learn more about iterating with purrr::map_dbl().

We might glimpse() at the data.

glimpse(pos)
## Rows: 1,700
## Columns: 4
## $ person    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ step      <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 0, 1, 2, 3, 4, 5, 6, 7…
## $ deviation <dbl> 0.00000000, -0.98210841, -0.41252078, -0.44525008, 0.62714843, -0.47914446, 0.44…
## $ position  <dbl> 0.0000000, -0.9821084, -1.3946292, -1.8398793, -1.2127308, -1.6918753, -1.243063…

Here’s the code to make the top panel of Figure 4.2.

ggplot(data = pos, 
       aes(x = step, y = position, group = person)) +
  geom_vline(xintercept = c(4, 8, 16), linetype = 2) +
  geom_line(aes(color = person < 2, alpha  = person < 2)) +
  scale_color_manual(values = c("skyblue4", "black")) +
  scale_alpha_manual(values = c(1/5, 1)) +
  scale_x_continuous("step number", breaks = c(0, 4, 8, 12, 16)) +
  theme(legend.position = "none")

Here’s the code for the bottom three plots of Figure 4.2.

# Figure 4.2.a.
p1 <-
  pos %>%
  filter(step == 4) %>%
  ggplot(aes(x = position)) +
  geom_line(stat = "density", color = "dodgerblue1") +
  labs(title = "4 steps")

# Figure 4.2.b.
p2 <-
  pos %>%
  filter(step == 8) %>%
  ggplot(aes(x = position)) +
  geom_density(color = "dodgerblue2", outline.type = "full") +
  labs(title = "8 steps")

# this is an intermediary step to get an SD value
sd <-
  pos %>%
  filter(step == 16) %>%
  summarise(sd = sd(position)) %>% 
  pull(sd)

# Figure 4.2.c.
p3 <-
  pos %>%
  filter(step == 16) %>%
  ggplot(aes(x = position)) +
  stat_function(fun = dnorm, 
                args = list(mean = 0, sd = sd),
                linetype = 2) +
  geom_density(color = "transparent", fill = "dodgerblue3", alpha = 1/2) +
  labs(title = "16 steps",
       y = "density")

library(patchwork)

# combine the ggplots
(p1 | p2 | p3) & coord_cartesian(xlim = c(-6, 6))

While we were at it, we explored a few ways to express densities. The main action was with the geom_line(), geom_density(), and stat_function() functions, respectively.

Any process that adds together random values from the same distribution converges to a normal. But it’s not easy to grasp why addition should result in a bell curve of sums. Here’s a conceptual way to think of the process. Whatever the average value of the source distribution, each sample from it can be thought of as a fluctuation from that average value. When we begin to add these fluctuations together, they also begin to cancel one another out. A large positive fluctuation will cancel a large negative one. (p. 73)

4.1.2 Normal by multiplication.

Here’s McElreath’s simple random growth rate.

set.seed(4)

prod(1 + runif(12, 0, 0.1))
## [1] 1.774719

In the runif() part of that code, we generated 12 random draws from the uniform distribution with bounds \([0, 0.1]\). Within the prod() function, we first added 1 to each of those values and then computed their product. Consider a more explicit variant of the code.

set.seed(4)

tibble(a = 1,
       b = runif(12, 0, 0.1)) %>% 
  mutate(c = a + b) %>% 
  summarise(p = prod(c))
## # A tibble: 1 x 1
##       p
##   <dbl>
## 1  1.77

Same result. Rather than using base R replicate() to do this many times, let’s practice with purrr::map_dbl() like before.

set.seed(4)

growth <- 
  tibble(growth = map_dbl(1:10000, ~ prod(1 + runif(12, 0, 0.1))))

ggplot(data = growth, aes(x = growth)) +
  geom_density()

“The smaller the effect of each locus, the better this additive approximation will be” (p. 74). Let’s compare big and small.

# simulate
set.seed(4)

samples <-
  tibble(big   = map_dbl(1:10000, ~ prod(1 + runif(12, 0, 0.5))),
         small = map_dbl(1:10000, ~ prod(1 + runif(12, 0, 0.01))))

# wrangle
samples %>% 
  pivot_longer(everything(), values_to = "samples") %>% 
  
  # plot
  ggplot(aes(x = samples)) +
  geom_density(fill = "black") +
  facet_wrap(~ name, scales = "free") 

Yep, the small samples were more Gaussian. “The interacting growth deviations, as long as they are sufficiently small, converge to a Gaussian distribution. In this way, the range of causal forces that tend towards Gaussian distributions extends well beyond purely additive interactions” (p. 74).

4.1.3 Normal by log-multiplication.

Instead of saving our tibble, we’ll just feed it directly into our plot.

samples %>% 
  mutate(log_big = log(big)) %>% 
  
  ggplot(aes(x = log_big)) +
  geom_density(fill = "gray33") +
  xlab("the log of the big")

Yet another Gaussian distribution. We get the Gaussian distribution back, because adding logs is equivalent to multiplying the original numbers. So even multiplicative interactions of large deviations can produce Gaussian distributions, once we measure the outcomes on the log scale. (p. 75)

4.1.4 Using Gaussian distributions.

“The justifications for using the Gaussian distribution fall into two broad categories: (1) ontological and (2) epistemological” (p. 75). I’m a fan of the justifications to follow.

4.1.4.1 Ontological justification.

The Gaussian is

a widespread pattern, appearing again and again at different scales and in different domains. Measurement errors, variations in growth, and the velocities of molecules all tend towards Gaussian distributions. These processes do this because at their heart, these processes add together fluctuations. And repeatedly adding finite fluctuations results in a distribution of sums that have shed all information about the underlying process, aside from mean and spread.

[However,] one consequence of this is that statistical models based on Gaussian distributions cannot reliably identify micro-process. (p. 75)

But like Ptolemy’s circles within circles, the Gaussian can still be useful even if it sheds information.

4.1.4.2 Epistemological justification.

By the epistemological justification, the Gaussian represents a particular state of ignorance. When all we know or are willing to say about a distribution of measures (measures are continuous values on the real number line) is their mean and variance, then the Gaussian distribution arises as the most consistent with our assumptions.

That is to say that the Gaussian distribution is the most natural expression of our state of ignorance, because if all we are willing to assume is that a measure has finite variance, the Gaussian distribution is the shape that can be realized in the largest number of ways and does not introduce any new assumptions. It is the least surprising and least informative assumption to make. In this way, the Gaussian is the distribution most consistent with our assumptions. Or rather, it is the most consistent with our golem’s assumptions. If you don’t think the distribution should be Gaussian, then that implies that you know something else that you should tell your golem about, something that would improve inference. (p. 75)

We’ll dive deeper into why the Gaussian is such a natural expression of ignorance in these contexts when we cover maximum entropy in Chapter 7.

4.1.4.3 Rethinking: Heavy tails.

The Gaussian distribution is common in nature and has some nice properties. But there are some risks in using it as a default data model. The extreme ends of a distribution are known as its tails. And the Gaussian distribution has some very thin tails–there is very little probability in them. Instead most of the mass in the Gaussian lies within one standard deviation of the mean. Many natural (and unnatural) processes have much heavier tails. (p. 76)

You have no idea how excited I am that we’ll be covering some of these heavy-tailed alternatives!

4.1.4.4 Overthinking: Gaussian distribution.

In this section McElreath gave the formula for the Gaussian probability density function. Let \(y\) be the criterion, \(\mu\) be the mean, and \(\sigma\) be the standard deviation. Then the probability density of some Gaussian value \(y\) is

\[p(y|\mu, \sigma) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left (- \frac{(y - \mu)^2}{2 \sigma^2} \right).\]

McElreath’s right. “This looks monstrous” (p. 76). Why not demystify that monster with a little R code? For simplicity, we’ll compute \(p(y|\mu, \sigma)\) for a series of \(y\)-values ranging from -1 to 1, holding \(\mu = 0\) and \(\sigma = 0.1\). Then we’ll plot.

tibble(y     = seq(from = -1, to = 1, by = .01),
       mu    = 0,
       sigma = 0.1) %>% 
  # compute p(y) with a hand-made Gaussian probability density function
  mutate(p = (1 / sqrt(2 * pi * sigma^2)) * exp(-((y - mu)^2 / (2 * sigma^2)))) %>% 
  
  ggplot(aes(x = y, y = p)) +
  geom_line() + 
  ylab(expression(italic(p)(italic("y|")*mu==0*","~sigma==0.1)))

Notice how \(p(y|\mu, \sigma)\) peaks around 4 when \(y = 0\). We can also get that value with the dnorm() function, which will return the \(p(y)\) value for a given combination of \(y\), \(\mu\), and \(\sigma\).

dnorm(0, mean = 0, sd = 0.1)
## [1] 3.989423

The answer, about 4, is no mistake. Probability density is the rate of change in cumulative probability. So where cumulative probability is increasing rapidly, density can easily exceed 1. But if we calculate the area under the density function, it will never exceed 1. Such areas are also called probability mass. You can usually ignore these density/mass details while doing computational work. But it’s good to be aware of the distinction. (p. 76, emphasis in the original)

4.2 A language for describing models

Our mathy ways of summarizing models will be something like

\[\begin{align*} \text{criterion}_i & \sim \operatorname{Normal}(\mu_i, \sigma) \\ \mu_i & = \beta \times \text{predictor}_i \\ \beta & \sim \operatorname{Normal}(0, 10) \\ \sigma & \sim \operatorname{Exponential}(1) \\ x_i & \sim \operatorname{Normal}(0, 1). \end{align*}\]

“If that doesn’t make much sense, good. That indicates that you are holding the right textbook” (p. 77). Welcome applied statistics! 🤓

4.2.1 Re-describing the globe tossing model.

For the globe tossing model, the probability \(p\) of a count of water \(w\) based on \(n\) trials was

\[\begin{align*} w & \sim \operatorname{Binomial}(n, p) \\ p & \sim \operatorname{Uniform}(0, 1), \end{align*}\]

where the top line indicates we’re using the Binomial likelihood function to model \(w\) given unique combinations of \(n\) and \(p\). In our example, \(w\) and \(n\) were already defined in the data, so all we need to do is compute \(p\). Since \(p\) is the only parameter, it’s the only element that gets a prior, which is what that second line was.

4.2.1.1 Overthinking: From model definition to Bayes’ theorem.

We can use grid approximation to work through our globe tossing model.

# how many `p_grid` points would you like?
n_points <- 100

d <-
  tibble(p_grid = seq(from = 0, to = 1, length.out = n_points),
         w      = 6, 
         n      = 9) %>% 
  mutate(prior      = dunif(p_grid, 0, 1),
         likelihood = dbinom(w, n, p_grid)) %>% 
  mutate(posterior = likelihood * prior / sum(likelihood * prior))

head(d)
## # A tibble: 6 x 6
##   p_grid     w     n prior likelihood posterior
##    <dbl> <dbl> <dbl> <dbl>      <dbl>     <dbl>
## 1 0          6     9     1   0.        0.      
## 2 0.0101     6     9     1   8.65e-11  8.74e-12
## 3 0.0202     6     9     1   5.37e- 9  5.43e-10
## 4 0.0303     6     9     1   5.93e- 8  5.99e- 9
## 5 0.0404     6     9     1   3.23e- 7  3.26e- 8
## 6 0.0505     6     9     1   1.19e- 6  1.21e- 7

In case you were curious, here’s what they look like.

d %>% 
  pivot_longer(prior:posterior) %>% 
  # this line allows us to dictate the order in which the panels will appear
  mutate(name = factor(name, levels = c("prior", "likelihood", "posterior"))) %>% 
  
  ggplot(aes(x = p_grid, y = value, fill = name)) +
  geom_area() +
  scale_fill_manual(values = c("blue", "red", "purple")) +
  scale_y_continuous(NULL, breaks = NULL) +
  theme(legend.position = "none") +
  facet_wrap(~ name, scales = "free")

The posterior is a combination of the prior and the likelihood. When the prior is flat across the parameter space, the posterior is just the likelihood re-expressed as a probability. As we go along, you’ll see we almost never use flat priors in practice. Be warned that eschewing flat priors is a recent development, however. You only have to look at the literature from a couple decades ago to see mounds and mounds of flat priors.

4.3 A Gaussian model of height

There are an infinite number of possible Gaussian distributions. Some have small means. Others have large means. Some are wide, with a large \(\sigma\). Others are narrow. We want our Bayesian machine to consider every possible distribution, each defined by a combination of \(\mu\) and \(\sigma\), and rank them by posterior plausibility. Posterior plausibility provides a measure of the logical compatibility of each possible distribution with the data and model. (p. 79)

4.3.1 The data.

Let’s load the Howell (2001, 2010) data from McElreath’s (2020b) rethinking package.

library(rethinking)
data(Howell1)
d <- Howell1

Here we open our focal statistical package, Bürkner’s brms. But before we do, we’ll want to detach the rethinking package. R will not allow us to use a function from one package that shares the same name as a different function from another package if both packages are open at the same time. The rethinking and brms packages are designed for similar purposes and, unsurprisingly, overlap in some of their function names. To prevent problems, we will always make sure rethinking is detached before using brms. To learn more on the topic, see this R-bloggers post.

rm(Howell1)
detach(package:rethinking, unload = T)
library(brms)

Go ahead and investigate the data with str(), the tidyverse analogue for which is glimpse().

d %>%
  str()
## 'data.frame':    544 obs. of  4 variables:
##  $ height: num  152 140 137 157 145 ...
##  $ weight: num  47.8 36.5 31.9 53 41.3 ...
##  $ age   : num  63 63 65 41 51 35 32 27 19 54 ...
##  $ male  : int  1 0 0 1 0 1 0 1 0 1 ...

The brms package does not have a function that works like rethinking::precis() for providing numeric and graphical summaries of variables, as see in McElreath’s R code 4.9. We can get some of that information with summary().

d %>%
  summary()
##      height           weight            age             male       
##  Min.   : 53.98   Min.   : 4.252   Min.   : 0.00   Min.   :0.0000  
##  1st Qu.:125.09   1st Qu.:22.008   1st Qu.:12.00   1st Qu.:0.0000  
##  Median :148.59   Median :40.058   Median :27.00   Median :0.0000  
##  Mean   :138.26   Mean   :35.611   Mean   :29.34   Mean   :0.4724  
##  3rd Qu.:157.48   3rd Qu.:47.209   3rd Qu.:43.00   3rd Qu.:1.0000  
##  Max.   :179.07   Max.   :62.993   Max.   :88.00   Max.   :1.0000

We might make the histograms like this.

d %>%
  pivot_longer(everything()) %>% 
  mutate(name = factor(name, levels = c("height", "weight", "age", "male"))) %>% 
  
  ggplot(aes(x = value)) +
  geom_histogram(bins = 10) +
  facet_wrap(~ name, scales = "free", ncol = 1)

If you’re curious, McElreath made those tiny histograms with help from Wickham’s histospark() function. Here’s the code.

sparks <- c("\u2581", "\u2582", "\u2583", "\u2585", "\u2587")

histospark <- function(x, width = 10) {
  bins <- graphics::hist(x, breaks = width, plot = FALSE)

  factor <- cut(
    bins$counts / max(bins$counts),
    breaks = seq(0, 1, length = length(sparks) + 1),
    labels = sparks,
    include.lowest = TRUE
  )

  paste0(factor, collapse = "")
}

Here’s how it works.

histospark(d$weight)
## [1] "▁▂▃▂▂▂▂▅▇▇▃▂▁"

One of the neat things about the histospark() function is you can insert the output right in your R Markdown prose. For example, we can use is to casually show how left skewed the our height variable is: ▁▁▁▁▁▁▁▂▁▇▇▅▁. But it’s time to get back on track. You can isolate height values with the dplyr::select() function.

d %>%
  select(height) %>% 
  glimpse()
## Rows: 544
## Columns: 1
## $ height <dbl> 151.7650, 139.7000, 136.5250, 156.8450, 145.4150, 163.8300, 149.2250, 168.9100, 147…

If you want the values in a numeric vector rather than in a data fame, try pull(d, height).

We can use the dplyr::filter() function to make an adults-only data frame.

d2 <- 
  d %>%
  filter(age >= 18)

Our reduced d2 does indeed have \(n = 352\) cases.

d2 %>% 
  count()
##     n
## 1 352

4.3.1.1 Overthinking: Data frames and indexes.

For more on indexing, check out Chapter 9 of Peng’s (2019) R programming for data science, or even the Subsetting subsection in R4DS.

This probably reflects my training history, but the structure of a data frame seems natural and inherently appealing. If you’re in the other camp, do check out either of these two data wrangling talks (here and here) by the ineffable Jenny Bryan.

4.3.2 The model.

Please heed McElreath’s warnings to

be careful about choosing the Gaussian distribution only when the plotted outcome variable looks Gaussian to you. Gawking at the raw data, to try to decide how to model them, is usually not a good idea. The data could be a mixture of different Gaussian distributions, for example, and in that case you won’t be able to detect the underlying normality just by eyeballing the outcome distribution. Furthermore, as mentioned earlier in this chapter, the empirical distribution needn’t be actually Gaussian in order to justify using a Gaussian probability distribution. (p. 81)

Anyway, the likelihood for our model is

\[\text{heights}_i \sim \operatorname{Normal}(\mu, \sigma),\]

where the \(i\) subscript indexes the individual cases in the data. Our two parameters are \(\mu\) and \(\sigma\), which we will estimate using Bayes’ formula. Our prior for \(\mu\) will be

\[\mu \sim \operatorname{Normal}(178, 20)\]

and our prior for \(\sigma\) will be

\[\sigma \sim \operatorname{Uniform}(0, 50).\]

Here’s the shape of the prior for \(\mu\), \(\mathcal N(178, 20)\).

p1 <-
  tibble(x = seq(from = 100, to = 250, by = .1)) %>% 
  
  ggplot(aes(x = x, y = dnorm(x, mean = 178, sd = 20))) +
  geom_line() +
  scale_x_continuous(breaks = seq(from = 100, to = 250, by = 75)) +
  labs(title = "mu ~ dnorm(178, 20)",
       y = "density")

p1

And here’s the ggplot2 code for our prior for \(\sigma\), a uniform distribution with a minimum value of 0 and a maximum value of 50. We don’t really need the \(y\)-axis when looking at the shapes of a density, so we’ll just remove it with scale_y_continuous().

p2 <-
  tibble(x = seq(from = -10, to = 60, by = .1)) %>%
  
  ggplot(aes(x = x, y = dunif(x, min = 0, max = 50))) +
  geom_line() +
  scale_x_continuous(breaks = c(0, 50)) +
  scale_y_continuous(NULL, breaks = NULL) +
  ggtitle("sigma ~ dunif(0, 50)")

p2

We can simulate from both priors at once to get a prior probability distribution of heights.

n <- 1e4

set.seed(4)

sim <-
  tibble(sample_mu    = rnorm(n, mean = 178, sd  = 20),
         sample_sigma = runif(n, min = 0, max = 50)) %>% 
  mutate(height = rnorm(n, mean = sample_mu, sd = sample_sigma))
  
p3 <- sim %>% 
  ggplot(aes(x = height)) +
  geom_density(fill = "grey33") +
  scale_x_continuous(breaks = c(0, 73, 178, 283)) +
  scale_y_continuous(NULL, breaks = NULL) +
  ggtitle("height ~ dnorm(mu, sigma)") +
  theme(panel.grid = element_blank())

p3

If you look at the \(x\)-axis breaks on the plot in McElreath’s lower left panel in Figure 4.3, you’ll notice they’re intentional. To compute the mean and 3 standard deviations above and below, you might do this.

sim %>% 
  summarise(ll   = mean(height) - sd(height) * 3,
            mean = mean(height),
            ul   = mean(height) + sd(height) * 3) %>% 
  mutate_all(round, digits = 1)
## # A tibble: 1 x 3
##      ll  mean    ul
##   <dbl> <dbl> <dbl>
## 1  73.9  177.  281.

Our values are very close to his, but are off by just a bit due to simulation variation.

Here’s the work to make the lower right panel of Figure 4.3. Watch out; we’re starting to get fancy.

# simulate
set.seed(4)

sim <-
  tibble(sample_mu    = rnorm(n, mean = 178, sd = 100),
         sample_sigma = runif(n, min = 0, max = 50)) %>% 
  mutate(height = rnorm(n, mean = sample_mu, sd = sample_sigma))

# compute the values we'll use to break on our x axis
breaks <-
  c(mean(sim$height) - 3 * sd(sim$height), 0, mean(sim$height), mean(sim$height) + 3 * sd(sim$height)) %>% 
  round(digits = 0)

# this is just for aesthetics
text <-
  tibble(height = 272 - 25,
         y      = .0013,
         label  = "tallest man",
         angle  = 90)

# plot
p4 <-
  sim %>% 
  ggplot(aes(x = height)) +
  geom_density(fill = "black", size = 0) +
  geom_vline(xintercept = 0, color = "grey92") +
  geom_vline(xintercept = 272, color = "grey92", linetype = 3) +
  geom_text(data = text,
            aes(y = y, label = label, angle = angle),
            color = "grey92") +
  scale_x_continuous(breaks = breaks) +
  scale_y_continuous(NULL, breaks = NULL) +
  ggtitle("height ~ dnorm(mu, sigma)\nmu ~ dnorm(178, 100)") +
  theme(panel.grid = element_blank())

p4

You may have noticed how we were saving each of the four last plots as p1 through p4. Let’s combine the four to make our version of McElreath’s Figure 4.3.

(p1 + xlab("mu") | p2 + xlab("sigma")) / (p3 | p4)

On page 84, McElreath said his prior simulation indicated 4% of the heights would be below zero. Here’s how we might determe that percentage for our simulation.

sim %>% 
  count(height < 0) %>% 
  mutate(percent = 100 * n / sum(n))
## # A tibble: 2 x 3
##   `height < 0`     n percent
##   <lgl>        <int>   <dbl>
## 1 FALSE         9571   95.7 
## 2 TRUE           429    4.29

Here’s the break down compared to the tallest man on record, Robert Pershing Wadlow (1918–1940).

sim %>% 
  count(height < 272) %>% 
  mutate(percent = 100 * n / sum(n))
## # A tibble: 2 x 3
##   `height < 272`     n percent
##   <lgl>          <int>   <dbl>
## 1 FALSE           1761    17.6
## 2 TRUE            8239    82.4

Does this matter? In this case, we have so much data that the silly prior is harmless. But that won’t always be the case. There are plenty of inference problems for which the data alone are not sufficient, no matter how numerous. Bayes lets us proceed in these cases. But only if we use our scientific knowledge to construct sensible priors. Using scientific knowledge to build priors is not cheating. The important thing is that your prior not be based on the values in the data, but only on what you know about the data before you see it. (p. 84)

4.3.2.1 Rethinking: A farewell to epsilon.

Some readers will have already met an alternative notation for a Gaussian linear model:

\[\begin{align*} h_i & = \mu + \epsilon_i \\ \epsilon_i & \sim \operatorname{Normal}(0, \sigma) \end{align*}\]

This is equivalent to the \(h_i \sim \operatorname{Normal}(\mu, \sigma)\) form, with the \(\epsilon\) standing in for the Gaussian density. But this \(\epsilon\) form is poor form. The reason is that it does not usually generalize to other types of models. This means it won’t be possible to express non-Gaussian models using tricks like \(\epsilon\). Better to learn one system that does generalize. (p. 84)

Agreed.

4.3.3 Grid approximation of the posterior distribution.

As McElreath explained, you’ll never use this for practical data analysis. But I found this helped me better understanding what exactly we’re doing with Bayesian estimation. So let’s play along.

n <- 200

d_grid <-
  # we'll accomplish with `tidyr::crossing()` what McElreath did with base R `expand.grid()`
  crossing(mu    = seq(from = 140, to = 160, length.out = n),
           sigma = seq(from = 4, to = 9, length.out = n))

glimpse(d_grid)
## Rows: 40,000
## Columns: 2
## $ mu    <dbl> 140, 140, 140, 140, 140, 140, 140, 140, 140, 140, 140, 140, 140, 140, 140, 140, 140,…
## $ sigma <dbl> 4.000000, 4.025126, 4.050251, 4.075377, 4.100503, 4.125628, 4.150754, 4.175879, 4.20…

d_grid contains every combination of mu and sigma across their specified values. Instead of base R sapply(), we’ll do the computations by making a custom function which we’ll then plug into purrr::map2().

grid_function <- function(mu, sigma) {
  
  dnorm(d2$height, mean = mu, sd = sigma, log = T) %>% 
    sum()
  
}

Now we’re ready to complete the tibble.

d_grid <-
  d_grid %>% 
  mutate(log_likelihood = map2(mu, sigma, grid_function)) %>%
  unnest(log_likelihood) %>% 
  mutate(prior_mu    = dnorm(mu, mean = 178, sd = 20, log = T),
         prior_sigma = dunif(sigma, min = 0, max = 50, log = T)) %>% 
  mutate(product = log_likelihood + prior_mu + prior_sigma) %>% 
  mutate(probability = exp(product - max(product)))
  
head(d_grid)
## # A tibble: 6 x 7
##      mu sigma log_likelihood prior_mu prior_sigma product probability
##   <dbl> <dbl>          <dbl>    <dbl>       <dbl>   <dbl>       <dbl>
## 1   140  4            -3813.    -5.72       -3.91  -3822.           0
## 2   140  4.03         -3778.    -5.72       -3.91  -3787.           0
## 3   140  4.05         -3743.    -5.72       -3.91  -3753.           0
## 4   140  4.08         -3709.    -5.72       -3.91  -3719.           0
## 5   140  4.10         -3676.    -5.72       -3.91  -3686.           0
## 6   140  4.13         -3644.    -5.72       -3.91  -3653.           0

In the final d_grid, the probability vector contains the posterior probabilities across values of mu and sigma. We can make a contour plot with geom_contour().

d_grid %>% 
  ggplot(aes(x = mu, y = sigma, z = probability)) + 
  geom_contour() +
  labs(x = expression(mu),
       y = expression(sigma)) +
  coord_cartesian(xlim = range(d_grid$mu),
                  ylim = range(d_grid$sigma)) +
  theme(panel.grid = element_blank())

We’ll make our heat map with geom_raster().

d_grid %>% 
  ggplot(aes(x = mu, y = sigma, fill = probability)) + 
  geom_raster(interpolate = T) +
  scale_fill_viridis_c(option = "B") +
  labs(x = expression(mu),
       y = expression(sigma)) +
  theme(panel.grid = element_blank())

4.3.4 Sampling from the posterior.

We can use dplyr::sample_n() to sample rows, with replacement, from d_grid.

set.seed(4)

d_grid_samples <- 
  d_grid %>% 
  sample_n(size = 1e4, replace = T, weight = probability)

d_grid_samples %>% 
  ggplot(aes(x = mu, y = sigma)) + 
  geom_point(size = .9, alpha = 1/15) +
  scale_fill_viridis_c() +
  labs(x = expression(mu[samples]),
       y = expression(sigma[samples])) +
  theme(panel.grid = element_blank())

We can use pivot_longer() and then facet_wrap() to plot the densities for both mu and sigma at once.

d_grid_samples %>% 
  pivot_longer(mu:sigma) %>% 

  ggplot(aes(x = value)) + 
  geom_density(fill = "grey33") +
  scale_y_continuous(NULL, breaks = NULL) +
  xlab(NULL) +
  theme(panel.grid = element_blank()) +
  facet_wrap(~ name, scales = "free", labeller = label_parsed)

We’ll use the tidybayes package to compute their posterior modes and 95% HDIs.

library(tidybayes)

d_grid_samples %>% 
  pivot_longer(mu:sigma) %>% 
  group_by(name) %>% 
  mode_hdi(value)
## # A tibble: 2 x 7
##   name   value .lower .upper .width .point .interval
##   <chr>  <dbl>  <dbl>  <dbl>  <dbl> <chr>  <chr>    
## 1 mu    155.   154.   155.     0.95 mode   hdi      
## 2 sigma   7.82   7.14   8.30   0.95 mode   hdi

Let’s say you wanted their posterior medians and 50% quantile-based intervals, instead. Just switch out the last line for median_qi(value, .width = .5).

4.3.4.1 Overthinking: Sample size and the normality of \(\sigma\)’s posterior.

Since we’ll be fitting models with brms almost exclusively from here on out, this section is largely mute. But we’ll do it anyway for the sake of practice. I’m going to break the steps up like before rather than compress the code together. Here’s d3.

set.seed(4)
(d3 <- sample(d2$height, size = 20))
##  [1] 147.3200 154.9400 168.9100 156.8450 165.7350 151.7650 165.7350 156.2100 144.7800 154.9400
## [11] 151.1300 147.9550 149.8600 162.5600 161.9250 164.4650 160.9852 151.7650 163.8300 149.8600

For our first step using d3, we’ll redefine d_grid.

n <- 200

# note we've redefined the ranges of `mu` and `sigma`
d_grid <-
  crossing(mu    = seq(from = 150, to = 170, length.out = n),
           sigma = seq(from = 4, to = 20, length.out = n))

Second, we’ll redefine our custom grid_function() function to operate over the height values of d3.

grid_function <- function(mu, sigma) {
  
  dnorm(d3, mean = mu, sd = sigma, log = T) %>% 
    sum()
  
}

Now we’ll use the amended grid_function() to make the posterior.

d_grid <-
  d_grid %>% 
  mutate(log_likelihood = map2_dbl(mu, sigma, grid_function)) %>% 
  mutate(prior_mu    = dnorm(mu, mean = 178, sd = 20, log = T),
         prior_sigma = dunif(sigma, min = 0, max = 50, log = T)) %>% 
  mutate(product = log_likelihood + prior_mu + prior_sigma) %>% 
  mutate(probability = exp(product - max(product)))

Did you catch our use of purrr::map2_dbl(), there, in place of purrr::map2()? It turns out that purrr::map() and purrr::map2() always return a list (see here and here). But we can add the _dbl suffix to those functions, which will instruct the purrr package to return a double vector (i.e., a common kind of numeric vector). The advantage of that approach is we no longer need to follow our map() or map2() lines with unnest(). To learn more about the ins and outs of the map() family, check out this section from R4DS or Jenny Bryan’s purrr tutorial.

Next we’ll sample_n() and plot.

set.seed(4)

d_grid_samples <- 
  d_grid %>% 
  sample_n(size = 1e4, replace = T, weight = probability)

d_grid_samples %>% 
  ggplot(aes(x = mu, y = sigma)) + 
  geom_point(size = .9, alpha = 1/15) +
  labs(x = expression(mu[samples]),
       y = expression(sigma[samples])) +
  theme(panel.grid = element_blank())

Behold the updated densities.

d_grid_samples %>% 
  pivot_longer(mu:sigma) %>% 

  ggplot(aes(x = value)) + 
  geom_density(fill = "grey33", size = 0) +
  scale_y_continuous(NULL, breaks = NULL) +
  xlab(NULL) +
  theme(panel.grid = element_blank()) +
  facet_wrap(~ name, scales = "free", labeller = label_parsed)

That labeller = label_parsed bit in the facet_wrap() function is what converted our subplot strip labels into Greek. You can learn more about labeller here. Anyway, our posterior for \(\sigma\) isn’t so Gaussian with that small \(n\).

This is the point in the project where we hop off the grid-approximation train. On the one hand, I think this is a great idea. Most of y’all reading this will never use grid approximation in a real-world applied data analysis. On the other hand, there is some pedagogical utility in practicing with it. It can help you grasp what it is we’re dong when we apply Bayes’ theorem. If you’d like more practice, check out the first several chapters in John Kruschke’s (2015) textbook and the corresponding chapters in my (2020c) ebook translating it into brms and tidyverse.

4.3.5 Finding the posterior distribution with quap brm().

Here we rewrite the statistical model, this time using font color to help differentiate the likelihood from the prior(s).

\[\begin{align*} \color{red}{\text{heights}_i} & \color{red}\sim \color{red}{\operatorname{Normal}(\mu, \sigma)} && \color{red}{\text{likelihood}} \\ \color{blue}\mu & \color{blue}\sim \color{blue}{\operatorname{Normal}(178, 20)} && \color{blue}{\text{prior}} \\ \color{blue}\sigma & \color{blue}\sim \color{blue}{\operatorname{Uniform}(0, 50)} \end{align*}\]

We won’t actually use rethinking::quap(). It’s time to jump straight to the primary brms modeling function, brm(). In the text, McElreath indexed his models with names like m4.1. I will largely follow that convention, but will replace the m with a b to stand for the brms package. Here’s how to fit the first model for this chapter.

b4.1 <- 
  brm(data = d2, 
      family = gaussian,
      height ~ 1,
      prior = c(prior(normal(178, 20), class = Intercept),
                prior(uniform(0, 50), class = sigma)),
      iter = 31000, warmup = 30000, chains = 4, cores = 4,
      seed = 4,
      file = "fits/b04.01")

McElreath’s uniform prior for \(\sigma\) was rough on brms. It took an unusually-large number of warmup iterations before the chains sampled properly. As McElreath covered in Chapter 9, Hamiltonian Monte Carlo (HMC) tends to work better when you default to an exponential or half Cauchy for \(\sigma\). Here’s how to do so with the half Cauchy.1

b4.1_hc <- 
  brm(data = d2, 
      family = gaussian,
      height ~ 1,
      prior = c(prior(normal(178, 20), class = Intercept),
                # the magic lives here
                prior(cauchy(0, 1), class = sigma)),
      iter = 2000, warmup = 1000, chains = 4, cores = 4,
      seed = 4,
      file = "fits/b04.01_hc")

This leads to an important point. After running a model fit with HMC, it’s a good idea to inspect the chains. As we’ll see, McElreath covered visual chain diagnostics in Chapter 9. Here’s a typical way to do so with brms.

plot(b4.1_hc)

If you want detailed diagnostics for the HMC chains, call launch_shinystan(b4.1). That’ll keep you busy for a while. But anyway, the chains look good. We can reasonably trust the results. Here’s how to get the model summary of our brm() object.

print(b4.1_hc)
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: height ~ 1 
##    Data: d2 (Number of observations: 352) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept   154.61      0.41   153.81   155.42 1.00     3833     2967
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     7.74      0.29     7.20     8.35 1.00     3511     2853
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

The summary() function works in a similar way. You can also get a Stan-like summary [see the RStan: the R interface to Stan vignette; Stan Development Team (2020)] with a little indexing.

b4.1_hc$fit
## Inference for Stan model: dd78ed3cf429c248161e09fa2f6f34e3.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##                 mean se_mean   sd     2.5%      25%      50%      75%    97.5% n_eff Rhat
## b_Intercept   154.61    0.01 0.41   153.81   154.34   154.61   154.89   155.42  3796    1
## sigma           7.74    0.00 0.29     7.20     7.54     7.73     7.93     8.35  3553    1
## lp__        -1227.52    0.02 1.00 -1230.16 -1227.91 -1227.22 -1226.79 -1226.54  1984    1
## 
## Samples were drawn using NUTS(diag_e) at Sun Mar 14 16:17:25 2021.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

Whereas rethinking defaults to 89% intervals, using print() or summary() with brms models defaults to 95% intervals. Unless otherwise specified, I will stick with 95% intervals throughout. However, if you really want those 89% intervals, an easy way is with the prob argument within brms::summary() or brms::print().

summary(b4.1_hc, prob = .89)
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: height ~ 1 
##    Data: d2 (Number of observations: 352) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##           Estimate Est.Error l-89% CI u-89% CI Rhat Bulk_ESS Tail_ESS
## Intercept   154.61      0.41   153.96   155.27 1.00     3833     2967
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-89% CI u-89% CI Rhat Bulk_ESS Tail_ESS
## sigma     7.74      0.29     7.29     8.23 1.00     3511     2853
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

Anyways, here’s the brms::brm() code for the model with the very-narrow-\(\mu\)-prior corresponding to the rethinking::quap() code in McElreath’s R code 4.31.

b4.2 <- 
  brm(data = d2, 
      family = gaussian,
      height ~ 1,
      prior = c(prior(normal(178, 0.1), class = Intercept),
                prior(uniform(0, 50), class = sigma)),
      iter = 2000, warmup = 1000, chains = 4, cores = 4,
      seed = 4,
      file = "fits/b04.02")
plot(b4.2)

I had to increase the warmup due to convergence issues. After doing so, everything looks to be on the up and up. The chains look great. Here’s the model summary().

summary(b4.2)
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: height ~ 1 
##    Data: d2 (Number of observations: 352) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept   177.86      0.10   177.66   178.06 1.00     3467     2724
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma    24.59      0.91    22.95    26.54 1.01      483      422
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

Subsetting the summary() output with $fixed provides a convenient way to compare the Intercept summaries between b4.1_hc and b4.2.

rbind(summary(b4.1_hc)$fixed,
      summary(b4.2)$fixed)
##           Estimate Est.Error l-95% CI u-95% CI     Rhat Bulk_ESS Tail_ESS
## Intercept 154.6133 0.4100253 153.8146 155.4213 1.000225     3833     2967
## Intercept 177.8637 0.1014405 177.6609 178.0614 1.000384     3467     2724

4.3.6 Sampling from a quap() brm() fit.

brms doesn’t seem to have a convenience function that works the way vcov() does for rethinking. For example:

vcov(b4.1_hc)
##           Intercept
## Intercept 0.1681208

This only returns the first element in the matrix it did for rethinking. That is, it appears brms::vcov() only returns the variance/covariance matrix for the single-level \(\beta\) parameters. However, if you really wanted this information, you could get it after putting the HMC chains in a data frame. We do that with the posterior_samples(), which we’ll be using a lot of as we go along.

post <- posterior_samples(b4.1_hc)

head(post)
##   b_Intercept    sigma      lp__
## 1    154.2645 7.992651 -1227.255
## 2    154.6744 7.427472 -1227.075
## 3    154.7472 7.410110 -1227.193
## 4    154.7948 8.470115 -1229.452
## 5    154.2154 7.940177 -1227.219
## 6    154.7131 7.744806 -1226.553

Now select() the columns containing the draws from the desired parameters and feed them into cov().

select(post, b_Intercept:sigma) %>% 
  cov()
##              b_Intercept        sigma
## b_Intercept  0.168120762 -0.005250413
## sigma       -0.005250413  0.086979181

That was “(1) a vector of variances for the parameters and (2) a correlation matrix” for them (p. 90). Here are just the variances (i.e., the diagonal elements) and the correlation matrix.

# variances
select(post, b_Intercept:sigma) %>%
  cov() %>%
  diag()
## b_Intercept       sigma 
##  0.16812076  0.08697918
# correlation
post %>%
select(b_Intercept, sigma) %>%
  cor()
##             b_Intercept       sigma
## b_Intercept  1.00000000 -0.04341854
## sigma       -0.04341854  1.00000000

With our post <- posterior_samples(b4.1_hc) code from a few lines above, we’ve already produced the brms version of what McElreath achieved with extract.samples() on page 90. However, what happened under the hood was different. Whereas rethinking used the mvnorm() function from the MASS package (Ripley, 2019; Venables & Ripley, 2002), with brms we just extracted the iterations of the HMC chains and put them in a data frame.

str(post)
## 'data.frame':    4000 obs. of  3 variables:
##  $ b_Intercept: num  154 155 155 155 154 ...
##  $ sigma      : num  7.99 7.43 7.41 8.47 7.94 ...
##  $ lp__       : num  -1227 -1227 -1227 -1229 -1227 ...

Notice how our data frame, post, includes a third vector named lp__. That’s the log posterior. For details, see the brms reference manual (Bürkner, 2021i), the “The Log-Posterior (function and gradient)” section of the Stan Development Team’s (2020) vignette, RStan: the R interface to Stan, or Stephen Martin’s nice explanation of the log posterior on the Stan Forums. The log posterior will largely be outside of our focus in this ebook.

The summary() function doesn’t work for brms posterior data frames quite the way precis() does for posterior data frames from the rethinking package. Behold the results.

summary(post[, 1:2])
##   b_Intercept        sigma      
##  Min.   :153.0   Min.   :6.900  
##  1st Qu.:154.3   1st Qu.:7.541  
##  Median :154.6   Median :7.733  
##  Mean   :154.6   Mean   :7.744  
##  3rd Qu.:154.9   3rd Qu.:7.934  
##  Max.   :156.2   Max.   :8.950

Here’s one option using the transpose of a quantile() call nested within apply(), which is a very general function you can learn more about here or here.

t(apply(post[, 1:2], 2, quantile, probs = c(.5, .025, .75)))
##                    50%       2.5%        75%
## b_Intercept 154.612809 153.814609 154.887202
## sigma         7.733021   7.199217   7.934425

The base R code is compact, but somewhat opaque. Here’s how to do something similar with more explicit tidyverse code.

post %>%
  pivot_longer(-lp__) %>% 
  group_by(name) %>%
  summarise(mean = mean(value),
            sd   = sd(value),
            `2.5%`  = quantile(value, probs = .025),
            `97.5%` = quantile(value, probs = .975)) %>%
  mutate_if(is.numeric, round, digits = 2)
## # A tibble: 2 x 5
##   name          mean    sd `2.5%` `97.5%`
##   <chr>        <dbl> <dbl>  <dbl>   <dbl>
## 1 b_Intercept 155.   0.41   154.   155.  
## 2 sigma         7.74 0.290    7.2    8.35

You can always get pretty similar information by just putting the brm() fit object into posterior_summary().

posterior_summary(b4.1_hc)
##                 Estimate Est.Error         Q2.5        Q97.5
## b_Intercept   154.613285 0.4100253   153.814609   155.421274
## sigma           7.744364 0.2949223     7.199217     8.346968
## lp__        -1227.517317 0.9973747 -1230.155469 -1226.541248

And if you’re willing to drop the posterior \(SD\)s, you can use tidybayes::mean_hdi(), too.

post %>% 
  pivot_longer(-lp__) %>% 
  group_by(name) %>%
  mean_qi(value)
## # A tibble: 2 x 7
##   name         value .lower .upper .width .point .interval
##   <chr>        <dbl>  <dbl>  <dbl>  <dbl> <chr>  <chr>    
## 1 b_Intercept 155.   154.   155.     0.95 mean   qi       
## 2 sigma         7.74   7.20   8.35   0.95 mean   qi

Though none of these solutions get you those sweet little histograms, you can always make those for your HMC models by inserting the desired posterior draws into histospark().

rbind(histospark(post$b_Intercept),
      histospark(post$sigma))
##      [,1]         
## [1,] "▁▁▅▇▂▁▁"    
## [2,] "▁▁▂▅▇▇▃▁▁▁▁"

Hell, you can even tack those onto the output from our verbose tidyverse code from a few blocks up.

post %>%
  pivot_longer(-lp__) %>% 
  group_by(name) %>%
  summarise(mean = mean(value),
            sd   = sd(value),
            `2.5%`  = quantile(value, probs = .025),
            `97.5%` = quantile(value, probs = .975)) %>%
  mutate_if(is.numeric, round, digits = 2) %>% 
  mutate(histospark = c(histospark(post$b_Intercept), histospark(post$sigma)))
## # A tibble: 2 x 6
##   name          mean    sd `2.5%` `97.5%` histospark 
##   <chr>        <dbl> <dbl>  <dbl>   <dbl> <chr>      
## 1 b_Intercept 155.   0.41   154.   155.   ▁▁▅▇▂▁▁    
## 2 sigma         7.74 0.290    7.2    8.35 ▁▁▂▅▇▇▃▁▁▁▁

4.3.6.1 Overthinking: Start values for quap() brm().

We won’t be emphasizing start values in this ebook. But, yes, you can set start values for the HMC chains from brms, too. Within the brm() function, you do so with the inits argument. From the brm section within the brms reference manual, we read:

Either "random" or "0". If inits is "random" (the default), Stan will randomly generate initial values for parameters. If it is "0", all parameters are initialized to zero. This option is sometimes useful for certain families, as it happens that default ("random") inits cause samples to be essentially constant. Generally, setting inits = "0" is worth a try, if chains do not behave well. Alternatively, inits can be a list of lists containing the initial values, or a function (or function name) generating initial values. The latter options are mainly implemented for internal testing but are available to users if necessary. If specifying initial values using a list or a function then currently the parameter names must correspond to the names used in the generated Stan code (not the names used in R).

4.3.6.2 Overthinking: Under the hood with multivariate sampling.

Again, brms::posterior_samples() is not the same as rethinking::extract.samples(). Rather than use the MASS::mvnorm(), brms takes the draws from the HMC chains. McElreath covered all of this in Chapter 9 and we will too. You might also look at the brms reference manual or GitHub page for details. To get documentation in a hurry, you could also just execute ?posterior_samples.

4.4 Linear prediction

Here’s our scatter plot of weight and height.

ggplot(data = d2, 
       aes(x = weight, y = height)) +
  geom_point(shape = 1, size = 2) +
  theme_bw() +
  theme(panel.grid = element_blank())

There’s obviously a relationship: Knowing a person’s weight helps you predict height.

To make this vague observation into a more precise quantitative model that relates values of weight to plausible values of height, we need some more technology. How do we take our Gaussian model from the previous section and incorporate predictor variables? (p. 92)

4.4.1 The linear model strategy.

The strategy is to make the parameter for the mean of a Gaussian distribution, \(\mu\), into a linear function of the predictor variable and other, new parameters that we invent. This strategy is often simply called the linear model. The linear model strategy instructs the golem to assume that the predictor variable has a constant and additive relationship to the mean of the outcome. The golem then computes the posterior distribution of this constant relationship. (p. 92, emphasis in the original)

Like we did for our first model without a predictor, we’ll use font color to help differentiate between the likelihood and prior(s) of our new univariable model,

\[\begin{align*} \color{red}{\text{height}_i} & \color{red}\sim \color{red}{\operatorname{Normal}(\mu_i, \sigma)} && \color{red}{\text{likelihood}} \\ \color{red}{\mu_i} & \color{red}= \color{red}{\alpha + \beta (\text{weight}_i - \overline{\text{weight}})} && \color{red}{\text{\{the linear model is just a special part of the likelihood\}} } \\ \color{blue}\alpha & \color{blue}\sim \color{blue}{\operatorname{Normal}(178, 20)} && \color{blue}{\text{prior(s)}} \\ \color{blue}\beta & \color{blue}\sim \color{blue}{\operatorname{Normal}(0, 10)} \\ \color{blue}\sigma & \color{blue}\sim \color{blue}{\operatorname{Uniform}(0, 50)}. \end{align*}\]

Do note that \((\text{weight}_i - \overline{\text{weight}})\) part. As we’ll see, it’s often advantageous to mean center our predictors.

4.4.1.1 Probability of the data.

Let’s begin with just the probability of the observed height, the first line of the model. This is nearly identical to before, except now there is a little index \(i\) on the \(\mu\) as well as the [\(\text{height}\)]. You can read [\(\text{height}_i\)] as “each [\(\text{height}\)]” and \(\mu_i\) as “each \(\mu\).” The mean \(\mu\) now depends upon unique values on each row \(i\). So the little \(i\) on \(\mu_i\) indicates that the mean depends upon the row. (p. 93, emphasis in the original)

4.4.1.2 Linear model.

The mean \(\mu\) is no longer a parameter to be estimated. Rather, as seen in the second line of the model, \(\mu_i\) is constructed from other parameters, \(\alpha\) and \(\beta\), and the observed variable [\(\text{weight}\)]. This line is not a stochastic relationship–there is no \(\sim\) in it, but rather an \(=\) in it–because the definition of \(\mu_i\) is deterministic. That is to say that, once we know \(\alpha\) and \(\beta\) and [\(\text{weight}_i\)], we know \(\mu_i\) with certainty.

The value [\(\text{weight}_i\)] is just the weight value on row \(i\). It refers to the same individual as the height value, [\(\text{height}_i\)], on the same row. The parameters \(\alpha\) and \(\beta\) are more mysterious. Where did they come from? We made them up….

You’ll be making up all manner of parameters as your skills improve. (p. 93)

4.4.1.2.1 Rethinking: Nothing special or natural about linear models.

Note that there’s nothing special about the linear model, really. You can choose a different relationship between \(\alpha\) and \(\beta\) and \(\mu\). For example, the following is a perfectly legitimate definition for \(\mu_i\):

\[\mu_i = \alpha \exp(- \beta x_i)\]

This does not define a linear regression, but it does define a regression model. The linear relationship we are using instead is conventional, but nothing requires that you use it. (p. 94)

4.4.1.3 Priors.

The remaining lines in the model define distributions for the unobserved variables. These variables are commonly known as parameters, and their distributions as priors. There are three parameters: \(\alpha\), \(\beta\), and \(\sigma\). You’ve seen priors for \(\alpha\) and \(\sigma\) before, although \(\alpha\) was called \(\mu\) back then.

The prior for \(\beta\) deserves explanation. Why have a Gaussian prior with mean zero? (p. 94)

We’ll simulate to find out. Instead of using a loop to make our data for Figure 4.5, we’ll stay within the tidyverse.

set.seed(2971)
# how many lines would you like?
n_lines <- 100

lines <-
  tibble(n = 1:n_lines,
         a = rnorm(n_lines, mean = 178, sd = 20),
         b = rnorm(n_lines, mean = 0, sd = 10)) %>% 
  expand(nesting(n, a, b), weight = range(d2$weight)) %>% 
  mutate(height = a + b * (weight - mean(d2$weight)))

head(lines)
## # A tibble: 6 x 5
##       n     a      b weight height
##   <int> <dbl>  <dbl>  <dbl>  <dbl>
## 1     1  191. -7.06    31.1  289. 
## 2     1  191. -7.06    63.0   63.5
## 3     2  199.  0.839   31.1  187. 
## 4     2  199.  0.839   63.0  214. 
## 5     3  202.  3.93    31.1  147. 
## 6     3  202.  3.93    63.0  272.

Now we’ll plot the left panel from Figure 4.5.

lines %>% 
  ggplot(aes(x = weight, y = height, group = n)) +
  geom_hline(yintercept = c(0, 272), linetype = 2:1, size = 1/3) +
  geom_line(alpha = 1/10) +
  coord_cartesian(ylim = c(-100, 400)) +
  ggtitle("b ~ dnorm(0, 10)") +
  theme_classic()

The pattern doesn’t look like any human population at all. It essentially says that the relationship between weight and height could be absurdly positive or negative. Before we’ve even seen the data, this is a bad model. Can we do better?

We can do better immediately. (pp. 95–96)

One thing we know from the outset is that the correlation between human height and weight is positive. We might not be sure of the magnitude, but it’s definitely the case that, on average, taller people are heavier people. Within the univariable regression context, this implies that the regression coefficient for weight predicting height will be positive. It might be unclear how large that coefficient will be, but it will certainly be above zero. One way we might encode this information in our data is by using the log-normal distribution for our \(\beta\) prior. Here’s what \(\operatorname{Log-Normal}(0, 1)\) looks like.

set.seed(4)

tibble(b = rlnorm(1e4, mean = 0, sd = 1)) %>% 
  ggplot(aes(x = b)) +
  geom_density(fill = "grey92") +
  coord_cartesian(xlim = c(0, 5)) +
  theme_classic()

If you’re unfamiliar with the log-normal distribution, it is the distribution whose logarithm is normally distributed. For example, here’s what happens when we compare \(\operatorname{Normal}(0, 1)\) with \(\log \big ( \operatorname{Log-Normal}(0, 1) \big)\).

set.seed(4)

tibble(rnorm           = rnorm(1e5, mean = 0, sd = 1),
       `log(rlognorm)` = log(rlnorm(1e5, mean = 0, sd = 1))) %>% 
  pivot_longer(everything()) %>% 

  ggplot(aes(x = value)) +
  geom_density(fill = "grey92") +
  coord_cartesian(xlim = c(-3, 3)) +
  theme_classic() +
  facet_wrap(~ name, nrow = 2)

They are the same within simulation variance. Also, did you notice how we simulated those log-normal data with mean = 0, sd = 1? Those values are what the mean and standard deviation of the output from the rlnorm() function after they are log transformed. The formulas for the actual mean and standard deviation for the log-normal distribution itself are complicated (see here). They are

\[\begin{align*} \text{mean} & = \exp \left (\mu + \frac{\sigma^2}{2} \right) & \text{and} \\ \text{standard deviation} & = \sqrt{[\exp(\sigma ^{2})-1] \; \exp(2\mu +\sigma ^{2})}. \end{align*}\]

If we follow those formulas through, there are the mean and standard deviation for \(\operatorname{Log-Normal}(0, 1)\).

mu    <- 0
sigma <- 1

# mean
exp(mu + (sigma^2) / 2)
## [1] 1.648721
# sd
sqrt((exp(sigma^2) - 1) * exp(2 * mu + sigma^2))
## [1] 2.161197

Let’s confirm with simulated draws from rlnorm().

set.seed(4)

tibble(x = rlnorm(1e7, mean = 0, sd = 1)) %>% 
  summarise(mean = mean(x),
            sd   = sd(x))
## # A tibble: 1 x 2
##    mean    sd
##   <dbl> <dbl>
## 1  1.65  2.17

But okay, “so what [do all these complications] earn us? Do the prior predictive simulation again, now with the Log-Normal prior:” (p. 96).

# make a tibble to annotate the plot
text <-
  tibble(weight = c(34, 43),
         height = c(0 - 25, 272 + 25),
         label  = c("Embryo", "World's tallest person (272 cm)"))

# simulate
set.seed(2971)

tibble(n = 1:n_lines,
       a = rnorm(n_lines, mean = 178, sd = 20),
       b = rlnorm(n_lines, mean = 0, sd = 1)) %>% 
  expand(nesting(n, a, b), weight = range(d2$weight)) %>% 
  mutate(height = a + b * (weight - mean(d2$weight))) %>%
  
  # plot
  ggplot(aes(x = weight, y = height, group = n)) +
  geom_hline(yintercept = c(0, 272), linetype = 2:1, size = 1/3) +
  geom_line(alpha = 1/10) +
  geom_text(data = text,
            aes(label = label),
            size = 3) +
  coord_cartesian(ylim = c(-100, 400)) +
  ggtitle("log(b) ~ dnorm(0, 1)") +
  theme_classic()

“This is much more sensible. There is still a rare impossible relationship. But nearly all lines in the joint prior for \(\alpha\) and \(\beta\) are now within human reason” (p. 96)

4.4.1.3.1 Rethinking: What’s the correct prior?

Good luck with that question. Hang around on academic Twitter long enough and you’ll see folks debating this.

This is a mistake. There is no more a uniquely correct prior than there is a uniquely correct likelihood. Statistical models are machines for inference. Many machines will work, but some work better than others. Priors can be wrong, but only in the same sense that a kind of hammer can be wrong for building a table. (p. 96)

4.4.1.3.2 Rethinking: Prior predictive simulation and \(p\)-hacking.

“We don’t pay any attention to \(p\)-values in this book” (p. 97). Off hand, I’m not sure of the exact origin of the term \(p\)-hacking. But the paper by Simmons, Nelson and Simonsohn (2011), False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant, is often cited as an introduction to the problem.

4.4.2 Finding the posterior distribution.

Unlike with McElreath’s quap() formula syntax, I’m not aware that you can just specify something like weight – xbar in the formula argument in brm(). However, the alternative is easy: Just make a new variable in the data that is equivalent to weight – mean(weight). We’ll call it weight_c.

d2 <-
  d2 %>% 
  mutate(weight_c = weight - mean(weight))

Unlike with McElreath’s rethinking package, the brms::brm() syntax doesn’t mirror the statistical notation. But here are the analogues to the exposition at the bottom of page 97:

  • \(\text{height}_i \sim \operatorname{Normal}(\mu_i, \sigma)\): family = gaussian,
  • \(\mu_i = \alpha + \beta \text{weight}_i\): height ~ 1 + weight_c,
  • \(\alpha \sim \operatorname{Normal}(178, 20)\): prior(normal(178, 20), class = Intercept,
  • \(\beta \sim \operatorname{Log-Normal}(0, 1)\): prior(lognormal(0, 1), class = b), and
  • \(\sigma \sim \operatorname{Uniform}(0, 50)\): prior(uniform(0, 50), class = sigma).

Thus, to add a predictor you just the + operator in the model formula.

b4.3 <- 
  brm(data = d2, 
      family = gaussian,
      height ~ 1 + weight_c,
      prior = c(prior(normal(178, 20), class = Intercept),
                prior(lognormal(0, 1), class = b),
                prior(uniform(0, 50), class = sigma)),
      iter = 28000, warmup = 27000, chains = 4, cores = 4,
      seed = 4,
      file = "fits/b04.03")

This was another example of how using a uniform prior for \(\sigma\) was a major pain. As you’ll see in later chapters, you usually only need to set iter = 2000, warmup = 1000 when using even moderately better priors. But for now, we’ll continue to find ways to stick close to the text. Here are the trace plots.

plot(b4.3)

4.4.2.1 Overthinking: Logs and exps, oh my.

brms does not allow users to insert coefficients into functions like \(\exp()\) within the conventional formula syntax. We can fit a brms model like McElreath’s m4.3b if we adopt what’s called the non-linear syntax (Bürkner, 2021e). The non-linear syntax is a lot like the syntax McElreath uses in rethinking in that it typically includes both predictor and variable names in the formula. Since this is so early in the book and we’re just working through a problem in an Overthinking tangent, I won’t go into a full-blow explanation, here. There will be many more opportunities to practice with the non-linear syntax in the chapters to come (e.g., Section 5.3.2, Section 6.2.1). For now, here’s how we might fit the model.

b4.3b <- 
  brm(data = d2, 
      family = gaussian,
      bf(height ~ a + exp(lb) * weight_c,
         a ~ 1,
         lb ~ 1,
         nl = TRUE),
      prior = c(prior(normal(178, 20), class = b, nlpar = a),
                prior(normal(0, 1), class = b, nlpar = lb),
                prior(uniform(0, 50), class = sigma)),
      iter = 31000, warmup = 30000, chains = 4, cores = 4,
      seed = 4,
      file = "fits/b04.03b")

We continue to need unusually large numbers of warmup iterations due to that pesky \(\operatorname{Uniform}(0, 50)\) prior on \(\sigma\). If you execute summary(b4.3b), you’ll see the intercept and \(\sigma\) summaries for this model are about the same as those for b4.3, above. The difference is for the \(\beta\) parameter, which we called lb in the b4.3b model. If we term that parameter from b4.3 as \(\beta^\text{b4.3}\) and the one from our new model \(\beta^\text{b4.3b}\), it turns out that \(\beta^\text{b4.3} = \exp \left (\beta^\text{b4.3b} \right )\).

fixef(b4.3)["weight_c", "Estimate"]
## [1] 0.9040527
fixef(b4.3b)["lb_Intercept", "Estimate"] %>% exp()
## [1] 0.9036588

They’re the same within simulation variance.

4.4.3 Interpreting the posterior distribution.

“One trouble with statistical models is that they are hard to understand” (p. 99). Welcome to the world of applied statistics, friends. 😅

4.4.3.0.1 Rethinking: What do parameters mean?

A basic issue with interpreting model-based estimates is in knowing the meaning of parameters. There is no consensus about what a parameter means, however, because different people take different philosophical stances towards models, probability, and prediction. The perspective in this book is a common Bayesian perspective: Posterior probabilities of parameter values describe the relative compatibility of different states of the world with the data, according to the model. (p. 99, emphasis in the original)

4.4.3.1 Tables of marginal distributions.

With a little [] subsetting we can exclude the log posterior from our summary for b4.3.

posterior_summary(b4.3)[1:3, ] %>% 
  round(digits = 2)
##             Estimate Est.Error   Q2.5  Q97.5
## b_Intercept    154.6      0.27 154.08 155.14
## b_weight_c       0.9      0.04   0.83   0.99
## sigma            5.1      0.19   4.75   5.50

If we put our brms fit into the vcov() function, we’ll get the variance/covariance matrix of the intercept and weight_c coefficient.

vcov(b4.3) %>% 
  round(3)
##           Intercept weight_c
## Intercept     0.076    0.000
## weight_c      0.000    0.002

No \(\sigma\), however. To get that, we’ll have to extract the posterior draws and use the cov() function, instead.

posterior_samples(b4.3) %>%
  select(-lp__) %>%
  cov() %>%
  round(digits = 3)
##             b_Intercept b_weight_c sigma
## b_Intercept       0.076      0.000 0.000
## b_weight_c        0.000      0.002 0.000
## sigma             0.000      0.000 0.036

The pairs() function will work for a brms fit much like it would one from rethinking. It will show “both the marginal posteriors and the covariance” (p. 100).

pairs(b4.3)

4.4.3.2 Plotting posterior inference against the data.

“It’s almost always much more useful to plot the posterior inference against the data. Not only does plotting help in interpreting the posterior, but it also provides an informal check on model assumptions” (p. 100).

Here is the code for Figure 4.6. Note our use of the fixef() function.

d2 %>%
  ggplot(aes(x = weight_c, y = height)) +
  geom_abline(intercept = fixef(b4.3)[1], 
              slope     = fixef(b4.3)[2]) +
  geom_point(shape = 1, size = 2, color = "royalblue") +
  theme_classic()

Note how the breaks on our \(x\)-axis look off. That’s because we fit the model with weight_c and we plotted the points in that metric, too. Since we computed weight_c by subtracting the mean of weight from the data, we can adjust the \(x\)-axis break point labels by simply adding that value back.

labels <-
  c(-10, 0, 10) + mean(d2$weight) %>% 
  round(digits = 0)

d2 %>%
  ggplot(aes(x = weight_c, y = height)) +
  geom_abline(intercept = fixef(b4.3)[1], 
              slope     = fixef(b4.3)[2]) +
  geom_point(shape = 1, size = 2, color = "royalblue") +
  scale_x_continuous("weight",
                     breaks = c(-10, 0, 10),
                     labels = labels) +
  theme_bw() +
  theme(panel.grid = element_blank())

4.4.3.3 Adding uncertainty around the mean.

Be default, we extract all the posterior iterations with posterior_samples().

post <- posterior_samples(b4.3)

post %>%
  slice(1:5)  # this serves a similar function as `head()`
##   b_Intercept b_weight_c    sigma      lp__
## 1    154.0395  0.8883175 5.019815 -1081.040
## 2    155.0064  0.8813894 5.149540 -1080.008
## 3    154.2648  0.8641695 5.200915 -1080.066
## 4    154.4480  0.9149853 4.956529 -1079.143
## 5    154.7072  0.9176580 5.190356 -1079.014

Here are the four models leading up to McElreath’s Figure 4.7.

N <- 10

b4.3_010 <- 
  brm(data = d2 %>%
        slice(1:N),  # note our tricky use of `N` and `slice()`
      family = gaussian,
      height ~ 1 + weight_c,
      prior = c(prior(normal(178, 20), class = Intercept),
                prior(lognormal(0, 1), class = b),
                prior(uniform(0, 50), class = sigma)),
      iter = 11000, warmup = 10000, chains = 4, cores = 4,
      seed = 4,
      file = "fits/b04.03_010")

N <- 50

b4.3_050 <- 
  brm(data = d2 %>%
        slice(1:N), 
      family = gaussian,
      height ~ 1 + weight_c,
      prior = c(prior(normal(178, 20), class = Intercept),
                prior(lognormal(0, 1), class = b),
                prior(uniform(0, 50), class = sigma)),
      iter = 11000, warmup = 10000, chains = 4, cores = 4,
      seed = 4,
      file = "fits/b04.03_050")

N <- 150

b4.3_150 <- 
  brm(data = d2 %>%
        slice(1:N), 
      family = gaussian,
      height ~ 1 + weight_c,
      prior = c(prior(normal(178, 20), class = Intercept),
                prior(lognormal(0, 1), class = b),
                prior(uniform(0, 50), class = sigma)),
      iter = 21000, warmup = 20000, chains = 4, cores = 4,
      seed = 4,
      file = "fits/b04.03_150")

N <- 352

b4.3_352 <- 
  brm(data = d2 %>%
        slice(1:N), 
      family = gaussian,
      height ~ 1 + weight_c,
      prior = c(prior(normal(178, 20), class = Intercept),
                prior(lognormal(0, 1), class = b),
                prior(uniform(0, 50), class = sigma)),
      iter = 28000, warmup = 27000, chains = 4, cores = 4,
      seed = 4,
      file = "fits/b04.03_352")

I’m not going to clutter up the document with all the trace plots and coefficient summaries from these four models. But here’s how to get that information.

plot(b4.3_010)
print(b4.3_010)

plot(b4.3_050)
print(b4.3_050)

plot(b4.3_150)
print(b4.3_150)

plot(b4.3_352)
print(b4.3_352)

We’ll need to put the chains of each model into data frames.

post010 <- posterior_samples(b4.3_010)
post050 <- posterior_samples(b4.3_050)
post150 <- posterior_samples(b4.3_150)
post352 <- posterior_samples(b4.3_352)

Here is the code for the four individual plots.

p1 <- 
  ggplot(data =  d2[1:10, ], 
         aes(x = weight_c, y = height)) +
  geom_abline(intercept = post010[1:20, 1], 
              slope     = post010[1:20, 2],
              size = 1/3, alpha = .3) +
  geom_point(shape = 1, size = 2, color = "royalblue") +
  coord_cartesian(xlim = range(d2$weight_c),
                  ylim = range(d2$height)) +
  labs(subtitle = "N = 10")

p2 <-
  ggplot(data =  d2[1:50, ], 
         aes(x = weight_c, y = height)) +
  geom_abline(intercept = post050[1:20, 1], 
              slope     = post050[1:20, 2],
              size = 1/3, alpha = .3) +
  geom_point(shape = 1, size = 2, color = "royalblue") +
  coord_cartesian(xlim = range(d2$weight_c),
                  ylim = range(d2$height)) +
  labs(subtitle = "N = 50")

p3 <-
  ggplot(data =  d2[1:150, ], 
         aes(x = weight_c, y = height)) +
  geom_abline(intercept = post150[1:20, 1], 
              slope     = post150[1:20, 2],
              size = 1/3, alpha = .3) +
  geom_point(shape = 1, size = 2, color = "royalblue") +
  coord_cartesian(xlim = range(d2$weight_c),
                  ylim = range(d2$height)) +
  labs(subtitle = "N = 150")

p4 <- 
  ggplot(data =  d2[1:352, ], 
         aes(x = weight_c, y = height)) +
  geom_abline(intercept = post352[1:20, 1], 
              slope     = post352[1:20, 2],
              size = 1/3, alpha = .3) +
  geom_point(shape = 1, size = 2, color = "royalblue") +
  coord_cartesian(xlim = range(d2$weight_c),
                  ylim = range(d2$height)) +
  labs(subtitle = "N = 352")

Note how we used the good old bracket syntax (e.g., d2[1:10 , ]) to index rows from our d2 data. With tidyverse-style syntax, we could have done slice(d2, 1:10) or d2 %>% slice(1:10) instead.

Now we can combine the ggplots with patchwork syntax to make the full version of Figure 4.7.

(p1 + p2 + p3 + p4) &
  scale_x_continuous("weight",
                     breaks = c(-10, 0, 10),
                     labels = labels) &
  theme_classic()

“Notice that the cloud of regression lines grows more compact as the sample size increases. This is a result of the model growing more confident about the location of the mean” (p. 102).

4.4.3.4 Plotting regression intervals and contours.

Since we used weight_c to fit our model, we might first want to understand what exactly the mean value is for weight.

mean(d2$weight)
## [1] 44.99049

Just a hair under 45. If we’re interested in \(\mu\) at weight = 50, that implies we’re also interested in \(\mu\) at weight_c + 5.01.

mu_at_50 <- 
  post %>% 
  transmute(mu_at_50 = b_Intercept + b_weight_c + 5.01)
 
head(mu_at_50)
##   mu_at_50
## 1 159.9379
## 2 160.8978
## 3 160.1390
## 4 160.3730
## 5 160.6349
## 6 160.3570

And here is a version McElreath’s Figure 4.8 density plot.

mu_at_50 %>%
  ggplot(aes(x = mu_at_50)) +
  geom_density(size = 0, fill = "royalblue") +
  scale_y_continuous(NULL, breaks = NULL) +
  xlab(expression(mu["height | weight = 50"])) +
  theme_classic()

We’ll use mean_hdi() to get both 89% and 95% HPDIs along with the mean.

mean_hdi(mu_at_50[, 1], .width = c(.89, .95))
##          y     ymin     ymax .width .point .interval
## 1 160.5127 160.0827 160.9675   0.89   mean       hdi
## 2 160.5127 159.9828 161.0691   0.95   mean       hdi

If you wanted to express those sweet 95% HPDIs on your density plot, you might use tidybayes::stat_halfeye(). Since stat_halfeye() also returns a point estimate, we’ll throw in the mode.

mu_at_50 %>%
  ggplot(aes(x = mu_at_50, y = 0)) +
  stat_halfeye(point_interval = mode_hdi, .width = .95,
               fill = "royalblue") +
  scale_y_continuous(NULL, breaks = NULL) +
  xlab(expression(mu["height | weight = 50"])) +
  theme_classic()

With brms, you would use fitted() to do what McElreath accomplished with link().

mu <- fitted(b4.3, summary = F)

str(mu)
##  num [1:4000, 1:352] 157 158 157 157 157 ...

When you specify summary = F, fitted() returns a matrix of values with as many rows as there were post-warmup iterations across your HMC chains and as many columns as there were cases in your analysis. Because we had 4,000 post-warmup iterations and \(n = 352\), fitted() returned a matrix of 4,000 rows and 352 vectors. If you omitted the summary = F argument, the default is TRUE and fitted() will return summary information instead.

Much like rethinking’s link(), brms::fitted() can accommodate custom predictor values with its newdata argument.

weight_seq <- 
  tibble(weight = 25:70) %>% 
  mutate(weight_c = weight - mean(d2$weight))

mu <-
  fitted(b4.3,
         summary = F,
         newdata = weight_seq) %>%
  data.frame() %>%
  # here we name the columns after the `weight` values from which they were computed
  set_names(25:70) %>% 
  mutate(iter = 1:n())

Anticipating ggplot2, we went ahead and converted the output to a data frame. But we might do a little more data processing with the aid of tidyr::pivot_longer(), which will convert the data from the wide format to the long format. If you are new to the distinction between wide and long data, you can learn more from the Pivot data from wide to long vignette from the tidyverse team (2020); Simon Ejdemyr’s blog post, Wide & long data; or Karen Grace-Martin’s blog post, The wide and long data format for repeated measures data.

mu <- 
  mu %>%
  pivot_longer(-iter,
               names_to = "weight",
               values_to = "height") %>% 
  # we might reformat `weight` to numerals
  mutate(weight = as.numeric(weight))

head(mu)
## # A tibble: 6 x 3
##    iter weight height
##   <int>  <dbl>  <dbl>
## 1     1     25   136.
## 2     1     26   137.
## 3     1     27   138.
## 4     1     28   139.
## 5     1     29   140.
## 6     1     30   141.

Now our data processing is done, here we reproduce McElreath’s Figure 4.9.a.

d2 %>%
  ggplot(aes(x = weight, y = height)) +
  geom_point(data = mu %>% filter(iter < 101), 
             color = "navyblue", alpha = .05) +
  coord_cartesian(xlim = c(30, 65)) +
  theme(panel.grid = element_blank())

With fitted(), it’s quite easy to plot a regression line and its intervals. Just omit the summary = T argument.

mu_summary <-
  fitted(b4.3, 
         newdata = weight_seq) %>%
  data.frame() %>%
  bind_cols(weight_seq)

head(mu_summary)
##   Estimate Est.Error     Q2.5    Q97.5 weight  weight_c
## 1 136.5262 0.8833715 134.7405 138.1822     25 -19.99049
## 2 137.4303 0.8437461 135.7240 139.0135     26 -18.99049
## 3 138.3343 0.8043407 136.7118 139.8447     27 -17.99049
## 4 139.2384 0.7651894 137.7003 140.6785     28 -16.99049
## 5 140.1424 0.7263331 138.6846 141.5160     29 -15.99049
## 6 141.0465 0.6878220 139.6690 142.3502     30 -14.99049

Here it is, our analogue to Figure 4.9.b.

d2 %>%
  ggplot(aes(x = weight, y = height)) +
  geom_smooth(data = mu_summary,
              aes(y = Estimate, ymin = Q2.5, ymax = Q97.5),
              stat = "identity",
              fill = "grey70", color = "black", alpha = 1, size = 1/2) +
  geom_point(color = "navyblue", shape = 1, size = 1.5, alpha = 2/3) +
  coord_cartesian(xlim = range(d2$weight)) +
  theme(text = element_text(family = "Times"),
        panel.grid = element_blank())

if you wanted to use intervals other than the default 95% ones, you’d enter a probs argument like this: fitted(b4.3, newdata = weight.seq, probs = c(.25, .75)). The resulting third and fourth vectors from the fitted() object would be named Q25 and Q75 instead of the default Q2.5 and Q97.5. The Q prefix stands for quantile.

4.4.3.4.1 Rethinking: Overconfident intervals.

The compatibility interval for the regression line in Figure 4.9 clings tightly to the MAP line. Thus there is very little uncertainty about the average height as a function of average weight. But you have to keep in mind that these inferences are always conditional on the model. Even a very bad model can have very tight compatibility intervals. It may help if you think of the regression line in Figure 4.9 as saying: Conditional on the assumption that height and weight are related by a straight line, then this is the most plausible line, and these are its plausible bounds. (p. 107, emphasis in the original)

4.4.3.5 Prediction intervals.

Even though our statistical model (omitting priors for the sake of simplicity) is

\[\text{height}_i \sim \operatorname{Normal}(\mu_i = \alpha + \beta x_, \sigma),\]

we’ve only been plotting the \(\mu\) part. In order to bring in the variability expressed by \(\sigma\), we’ll have to switch to the predict() function. Much as brms::fitted() was our analogue to rethinking::link(), brms::predict() is our analogue to rethinking::sim().

We can reuse our weight_seq data from before. But in case you forgot, here’s that code again.

weight_seq <- 
  tibble(weight = 25:70) %>% 
  mutate(weight_c = weight - mean(d2$weight))

The predict() code looks a lot like what we used for fitted().

pred_height <-
  predict(b4.3,
          newdata = weight_seq) %>%
  data.frame() %>%
  bind_cols(weight_seq)
  
pred_height %>%
  slice(1:6)
##   Estimate Est.Error     Q2.5    Q97.5 weight  weight_c
## 1 136.5890  5.204946 126.6381 146.9790     25 -19.99049
## 2 137.2489  5.166086 126.9521 147.3029     26 -18.99049
## 3 138.3838  5.246919 128.1976 148.8397     27 -17.99049
## 4 139.2259  5.143560 129.2588 149.1741     28 -16.99049
## 5 139.9700  5.170863 129.7009 149.9936     29 -15.99049
## 6 141.0105  5.101771 130.6494 151.2377     30 -14.99049

This time the summary information in our data frame is for, as McElreath put it, “simulated heights, not distributions of plausible average height, \(\mu\)” (p. 108). Another way of saying that is that these simulations are the joint consequence of both \(\mu\) and \(\sigma\), unlike the results of fitted(), which only reflect \(\mu\). Figure 4.10 shows how you might visualize them.

d2 %>%
  ggplot(aes(x = weight)) +
  geom_ribbon(data = pred_height, 
              aes(ymin = Q2.5, ymax = Q97.5),
              fill = "grey83") +
  geom_smooth(data = mu_summary,
              aes(y = Estimate, ymin = Q2.5, ymax = Q97.5),
              stat = "identity",
              fill = "grey70", color = "black", alpha = 1, size = 1/2) +
  geom_point(aes(y = height),
             color = "navyblue", shape = 1, size = 1.5, alpha = 2/3) +
  coord_cartesian(xlim = range(d2$weight),
                  ylim = range(d2$height)) +
  theme(text = element_text(family = "Times"),
        panel.grid = element_blank())

Notice that the outline for the wide shaded interval is a little rough. This is the simulation variance in the tails of the sampled Gaussian values. If it really bothers you, increase the number of samples you take from the posterior distribution. (p. 109)

With our brms model fitting approach, that would mean we’d have to refit b4.3 after specifying a larger number of post-warmup iterations with alterations to the iter and warmup parameters.

4.4.3.5.1 Overthinking: Rolling your own sim predict().

Here we follow McElreath’s example and do our model-based predictions by hand. Instead of relying on base R apply() and sapply(), here the main action is in expand(), the second mutate() line and the group_by() + summarise() combination.

# `predict()` by hand
post %>% 
  expand(nesting(b_Intercept, b_weight_c, sigma),
         weight = 25:70) %>% 
  mutate(weight_c = weight - mean(d2$weight)) %>% 
  mutate(sim_height = rnorm(n(),
                            mean = b_Intercept + b_weight_c * weight_c,
                            sd   = sigma)) %>% 
  group_by(weight) %>% 
  summarise(mean = mean(sim_height),
            ll   = quantile(sim_height, prob = .025),
            ul   = quantile(sim_height, prob = .975)) %>% 
  
  # plot
  ggplot(aes(x = weight)) +
  geom_smooth(aes(y = mean, ymin = ll, ymax = ul),
              stat = "identity",
              fill = "grey83", color = "black", alpha = 1, size = 1/2) +
  geom_point(data = d2,
             aes(y = height),
             color = "navyblue", shape = 1, size = 1.5, alpha = 2/3) +
  coord_cartesian(xlim = range(d2$weight),
                  ylim = range(d2$height)) +
  theme(text = element_text(family = "Times"),
        panel.grid = element_blank())

We specifically left out the fitted() intervals to make it more apparent what we were simulating. You might also note that we could have easily replaced that three-line summarise() code with a single line of tidybayes::mean_qi(sim_height), or whatever combination of central tendency and interval type you wanted (e.g., mode_hdi(sim_height, .width = .89)).

4.5 Curves from lines

“The models so far all assume that a straight line describes the relationship. But there’s nothing special about straight lines, aside from their simplicity.” (p. 110).

4.5.1 Polynomial regression.

Remember d?

d %>%
  glimpse()
## Rows: 544
## Columns: 4
## $ height <dbl> 151.7650, 139.7000, 136.5250, 156.8450, 145.4150, 163.8300, 149.2250, 168.9100, 147…
## $ weight <dbl> 47.82561, 36.48581, 31.86484, 53.04191, 41.27687, 62.99259, 38.24348, 55.47997, 34.…
## $ age    <dbl> 63.0, 63.0, 65.0, 41.0, 51.0, 35.0, 32.0, 27.0, 19.0, 54.0, 47.0, 66.0, 73.0, 20.0,…
## $ male   <int> 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0,…

McElreath suggested we plot height against weight using the full sample.

d %>% 
  ggplot(aes(x = weight, y = height)) +
  geom_point(color = "navyblue", shape = 1, size = 1.5, alpha = 2/3) +
  annotate(geom = "text",
           x = 42, y = 115,
           label = "This relation is\nvisibly curved.",
           family = "Times") +
  theme(text = element_text(family = "Times"),
        panel.grid = element_blank())

Those variables two appear to follow an orderly relation, but whatever it is, it’s clearly not a simple straight line. The quadratic model is probably the most commonly used polynomial regression model. It follows the generic form

\[\mu = \alpha + \beta_1 x_i + \color{navy}{\beta_2 x_i^2}.\]

McElreath warned: “Fitting these models to data is easy. Interpreting them can be hard” (p. 111). Standardizing will help brm() fit the model. We might standardize our weight variable like so.

d <-
  d %>%
  mutate(weight_s = (weight - mean(weight)) / sd(weight)) %>% 
  mutate(weight_s2 = weight_s^2)

While we were at it, we just went ahead and computed the weight_s2 variable. We can express our statistical model as

\[\begin{align*} \text{height}_i & \sim \operatorname{Normal}(\mu_i, \sigma) \\ \mu_i & = \alpha + \beta_1 \text{weight_s}_i + \color{navy}{\beta_2 \text{weight_s}^2_i} \\ \alpha & \sim \operatorname{Normal}(178, 20) \\ \beta_1 & \sim \operatorname{Log-Normal}(0, 1) \\ \color{navy}{\beta_2} & \color{navy}\sim \color{navy}{\operatorname{Normal}(0, 1)} \\ \sigma & \sim \operatorname{Uniform}(0, 50). \end{align*}\]

Here’s how we might fit the quadratic model with brms.

b4.5 <- 
  brm(data = d, 
      family = gaussian,
      height ~ 1 + weight_s + weight_s2,
      prior = c(prior(normal(178, 20), class = Intercept),
                prior(lognormal(0, 1), class = b, coef = "weight_s"),
                prior(normal(0, 1), class = b, coef = "weight_s2"),
                prior(uniform(0, 50), class = sigma)),
      iter = 30000, warmup = 29000, chains = 4, cores = 4,
      seed = 4,
      file = "fits/b04.05")

Note our use of the coef argument within our prior statements. Since \(\beta_1\) and \(\beta_2\) are both parameters of class = b within the brms set-up, we need to use the coef argument in cases when we want their priors to differ.

plot(b4.5)

print(b4.5)
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: height ~ 1 + weight_s + weight_s2 
##    Data: d (Number of observations: 544) 
## Samples: 4 chains, each with iter = 30000; warmup = 29000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept   146.03      0.38   145.29   146.78 1.01     1369     2019
## weight_s     21.75      0.29    21.17    22.33 1.00     1773     2031
## weight_s2    -7.77      0.28    -8.34    -7.25 1.01      901      935
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     5.81      0.18     5.48     6.18 1.00     2472     2150
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

Our quadratic plot requires new fitted()- and predict()-oriented wrangling.

weight_seq <- 
  tibble(weight_s = seq(from = -2.5, to = 2.5, length.out = 30)) %>% 
  mutate(weight_s2 = weight_s^2)

fitd_quad <-
  fitted(b4.5, 
         newdata = weight_seq) %>%
  data.frame() %>%
  bind_cols(weight_seq)

pred_quad <-
  predict(b4.5, 
          newdata = weight_seq) %>%
  data.frame() %>%
  bind_cols(weight_seq)  

Behold the code for our version of Figure 4.11.b.

p2 <-
  ggplot(data = d, 
       aes(x = weight_s)) +
  geom_ribbon(data = pred_quad, 
              aes(ymin = Q2.5, ymax = Q97.5),
              fill = "grey83") +
  geom_smooth(data = fitd_quad,
              aes(y = Estimate, ymin = Q2.5, ymax = Q97.5),
              stat = "identity",
              fill = "grey70", color = "black", alpha = 1, size = 1/2) +
  geom_point(aes(y = height),
             color = "navyblue", shape = 1, size = 1.5, alpha = 1/3) +
  labs(subtitle = "quadratic",
       y = "height") +
  coord_cartesian(xlim = range(d$weight_s),
                  ylim = range(d$height)) +
  theme(text = element_text(family = "Times"),
        panel.grid = element_blank())

p2

From a formula perspective, the cubic model is a simple extension of the quadratic:

\[\mu = \alpha + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3.\]

Before we fit the model, we need to wrangle the data again.

d <-
  d %>% 
  mutate(weight_s3 = weight_s^3)

Now fit the model like so.

b4.6 <- 
  brm(data = d, 
      family = gaussian,
      height ~ 1 + weight_s + weight_s2 + weight_s3,
      prior = c(prior(normal(178, 20), class = Intercept),
                prior(lognormal(0, 1), class = b, coef = "weight_s"),
                prior(normal(0, 1), class = b, coef = "weight_s2"),
                prior(normal(0, 1), class = b, coef = "weight_s3"),
                prior(uniform(0, 50), class = sigma)),
      iter = 40000, warmup = 39000, chains = 4, cores = 4,
      seed = 4,
      file = "fits/b04.06")

And now we’ll fit the good old linear model.

b4.7 <- 
  brm(data = d, 
      family = gaussian,
      height ~ 1 + weight_s,
      prior = c(prior(normal(178, 20), class = Intercept),
                prior(lognormal(0, 1), class = b, coef = "weight_s"),
                prior(uniform(0, 50), class = sigma)),
      iter = 40000, warmup = 39000, chains = 4, cores = 4,
      seed = 4,
      file = "fits/b04.07")

Here’s the fitted(), predict(), and ggplot2 code for Figure 4.11.c, the cubic model.

weight_seq <- 
  weight_seq %>% 
  mutate(weight_s3 = weight_s^3)

fitd_cub <-
  fitted(b4.6, 
         newdata = weight_seq) %>%
  as_tibble() %>%
  bind_cols(weight_seq)

pred_cub <-
  predict(b4.6, 
          newdata = weight_seq) %>%
  as_tibble() %>%
  bind_cols(weight_seq) 

p3 <-
  ggplot(data = d, 
       aes(x = weight_s)) +
  geom_ribbon(data = pred_cub, 
              aes(ymin = Q2.5, ymax = Q97.5),
              fill = "grey83") +
  geom_smooth(data = fitd_cub,
              aes(y = Estimate, ymin = Q2.5, ymax = Q97.5),
              stat = "identity",
              fill = "grey70", color = "black", alpha = 1, size = 1/4) +
  geom_point(aes(y = height),
             color = "navyblue", shape = 1, size = 1.5, alpha = 1/3) +
  labs(subtitle = "cubic",
       y = "height") +
  coord_cartesian(xlim = range(d$weight_s),
                  ylim = range(d$height)) +
  theme(text = element_text(family = "Times"),
        panel.grid = element_blank())

p3

And here’s the fitted(), predict(), and ggplot2 code for Figure 4.11.a, the linear model.

fitd_line <-
  fitted(b4.7, 
         newdata = weight_seq) %>%
  as_tibble() %>%
  bind_cols(weight_seq)

pred_line <-
  predict(b4.7, 
          newdata = weight_seq) %>%
  as_tibble() %>%
  bind_cols(weight_seq) 

p1 <-
  ggplot(data = d, 
       aes(x = weight_s)) +
  geom_ribbon(data = pred_line, 
              aes(ymin = Q2.5, ymax = Q97.5),
              fill = "grey83") +
  geom_smooth(data = fitd_line,
              aes(y = Estimate, ymin = Q2.5, ymax = Q97.5),
              stat = "identity",
              fill = "grey70", color = "black", alpha = 1, size = 1/4) +
  geom_point(aes(y = height),
             color = "navyblue", shape = 1, size = 1.5, alpha = 1/3) +
  labs(subtitle = "linear",
       y = "height") +
  coord_cartesian(xlim = range(d$weight_s),
                  ylim = range(d$height)) +
  theme(text = element_text(family = "Times"),
        panel.grid = element_blank())

p1

Did you notice how we labeled each of the past three plots as p1, p2, and p3? Here we use those names to plot them all together with patchwork syntax.

p1 | p2 | p3

As fun as this all has been,

it’s not clear that any of these models make a lot of sense. They are good geocentric descriptions of the sample, yes. But there are two problems. First, a better fit to the sample might not actually be a better model. That’s the subject of Chapter 7. Second, the model contains no biological information. We aren’t learning any causal relationship between height and weight. We’ll deal with this second problem much later, in Chapter 16. (p. 113)

4.5.1.0.1 Overthinking: Converting back to natural scale.

You can apply McElreath’s conversion trick within the ggplot2 environment, too. Here it is with the cubic model.

at <- c(-2, -1, 0, 1, 2)

ggplot(data = d, 
       aes(x = weight_s)) +
  geom_ribbon(data = pred_cub, 
              aes(ymin = Q2.5, ymax = Q97.5),
              fill = "grey83") +
  geom_point(aes(y = height),
             color = "navyblue", shape = 1, size = 1.5, alpha = 1/3) +
  coord_cartesian(xlim = range(d$weight_s)) +
  theme(text = element_text(family = "Times"),
        panel.grid = element_blank()) +
  
  # here it is!
  scale_x_continuous("standardized weight converted back",
                     breaks = at,
                     labels = round(at*sd(d$weight) + mean(d$weight), 1))

4.5.2 Splines.

Load the cherry_blossoms data (Aono, 2012; Aono & Kazui, 2008; Aono & Saito, 2010).

library(rethinking)

data(cherry_blossoms)
d <- cherry_blossoms
rm(cherry_blossoms)
detach(package:rethinking, unload = T)

Minus the mini histograms, here is our ground-up tidyverse way to summarize our new d data the way McElreath did with his precis().

d %>% 
  gather() %>% 
  group_by(key) %>% 
  summarise(mean = mean(value, na.rm = T),
            sd   = sd(value, na.rm = T),
            ll   = quantile(value, prob = .055, na.rm = T),
            ul   = quantile(value, prob = .945, na.rm = T)) %>% 
  mutate_if(is.double, round, digits = 2)
## # A tibble: 5 x 5
##   key           mean     sd     ll      ul
##   <chr>        <dbl>  <dbl>  <dbl>   <dbl>
## 1 doy         105.     6.41  94.4   115   
## 2 temp          6.14   0.66   5.15    7.29
## 3 temp_lower    5.1    0.85   3.79    6.37
## 4 temp_upper    7.19   0.99   5.9     8.9 
## 5 year       1408    351.   868.   1948.

McElreath encouraged us to plot doy against year.

d %>% 
  ggplot(aes(x = year, y = doy)) +
  # color from here: https://www.colorhexa.com/ffb7c5
  geom_point(color = "#ffb7c5", alpha = 1/2) +
  theme_bw() +
  theme(panel.grid = element_blank(),
        # color from here: https://www.colordic.org/w/, inspired by https://chichacha.netlify.com/2018/11/29/plotting-traditional-colours-of-japan/
        panel.background = element_rect(fill = "#4f455c"))

It looks like there are some wiggly trends, but it’s hard to tell with a scatter plot.

Our goal is to approximate the blossom trend with a wiggly function. With B-splines, just like with polynomial regression, we do this by generating new predictor variables and using those in the linear model, \(\mu_i\). Unlike polynomial regression, B-splines do not directly transform the predictor by squaring or cubing it. Instead they invent a series of entirely new, synthetic predictor variables. Each of these synthetic variables exists only to gradually turn a specific parameter on and off within a specific range of the real predictor variable. Each of the synthetic variables is called a basis function. The linear model ends up looking very familiar:

\[\mu_i = \alpha + w_1 B_{i, 1} + w_2 B_{i, 2} + w_3 B_{i, 3} + \dots\]

where \(B_{i,n}\) is the \(n\)-th basis function’s value on row \(i\), and the \(w\) parameters are corresponding weights for each. The parameters act like slopes, adjusting the influence of each basis function on the mean \(\mu_i\). So really this is just another linear regression, but with some fancy, synthetic predictor variables. (p. 115, emphasis in the original)

It turns out there are cases with missing data for the doy variable.

d %>% 
  count(is.na(doy)) %>% 
  mutate(percent = 100 * n / sum(n))
##   is.na(doy)   n  percent
## 1      FALSE 827 68.06584
## 2       TRUE 388 31.93416

Let’s follow McElreath and make a subset of the data that excludes cases with missing data in doy. Within the tidyverse, we might do so with the tidyr::drop_na() function.

d2 <-
  d %>% 
  drop_na(doy)

On page 117 in the text, McElreath indirectly explained how to make Figure 4.12 by walking through the workflow for making Figure 4.13. Here we mimic that ordering.

First, we choose the knots. Remember, the knots are just values of year that serve as pivots for our spline. Where should the knots go? There are different ways to answer this question. You can, in principle, put the knots wherever you like. Their locations are part of the model, and you are responsible for them. Let’s do what we did in the simple example above, place the knots at different evenlyspaced quantiles of the predictor variable. This gives you more knots where there are more observations. We used only 5 knots in the first example. Now let’s go for 15:

num_knots <- 15
knot_list <- quantile(d2$year, probs = seq(from = 0, to = 1, length.out = num_knots))

Our knot_list contains 15 year values.

knot_list
##        0% 7.142857% 14.28571% 21.42857% 28.57143% 35.71429% 42.85714%       50% 57.14286% 64.28571% 
##       812      1036      1174      1269      1377      1454      1518      1583      1650      1714 
## 71.42857% 78.57143% 85.71429% 92.85714%      100% 
##      1774      1833      1893      1956      2015

Here’s what it looks like if we use those knot_list values to chop up our year/doy scatter plot, from above.

d %>% 
  ggplot(aes(x = year, y = doy)) +
  geom_vline(xintercept = knot_list, color = "white", alpha = 1/2) +
  geom_point(color = "#ffb7c5", alpha = 1/2) +
  theme_bw() +
  theme(panel.background = element_rect(fill = "#4f455c"),
        panel.grid = element_blank())

The next choice is polynomial degree. This determines how basis functions combine, which determines how the parameters interact to produce the spline. For degree 1, as in Figure 4.12, two basis functions combine at each point. For degree 2, three functions combine at each point. For degree 3, four combine. R already has a nice function that will build basis functions for any list of knots and degree. This code will construct the necessary basis functions for a degree 3 (cubic) spline: (p. 117)

library(splines)

B <- bs(d2$year,
        knots = knot_list[-c(1, num_knots)], 
        degree = 3, 
        intercept = TRUE)

Look closely at McElreath’s tricky knot_list[-c(1, num_knots)] code. Whereas knot_list contains 15 ordered year values, McElreath shaved off the first and last year values with knot_list[-c(1, num_knots)], leaving 13. This is because, by default, the bs() function places knots at the boundaries. Since the first and 15th values in knot_list were boundary values for year, we removed them to avoid redundancies. We can confirm this with the code, below.

B %>% str()
##  'bs' num [1:827, 1:17] 1 0.96 0.767 0.563 0.545 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:17] "1" "2" "3" "4" ...
##  - attr(*, "degree")= int 3
##  - attr(*, "knots")= Named num [1:13] 1036 1174 1269 1377 1454 ...
##   ..- attr(*, "names")= chr [1:13] "7.142857%" "14.28571%" "21.42857%" "28.57143%" ...
##  - attr(*, "Boundary.knots")= int [1:2] 812 2015
##  - attr(*, "intercept")= logi TRUE

Look at the second to last line, - attr(*, "Boundary.knots")= int [1:2] 812 2015. Those default "Boundary.knots" are the same as knot_list[c(1, num_knots)]. Let’s confirm.

knot_list[c(1, num_knots)]
##   0% 100% 
##  812 2015

By the degree = 3 argument, we indicated we wanted a cubic spline. McElreath used degree = 1 for Figure 4.12. For reasons I’m not prepared to get into, here, splines don’t always include intercept parameters. Indeed, the bs() default is intercept = FALSE. McElreath’s code indicated he wanted to fit a B-spline that included an intercept. Thus: intercept = TRUE.

Here’s how we might make our version of the top panel of Figure 4.13.

# wrangle a bit
b <-
  B %>% 
  data.frame() %>% 
  set_names(str_c(0, 1:9), 10:17) %>%  
  bind_cols(select(d2, year)) %>% 
  pivot_longer(-year,
               names_to = "bias_function",
               values_to = "bias")

# plot
b %>% 
  ggplot(aes(x = year, y = bias, group = bias_function)) +
  geom_vline(xintercept = knot_list, color = "white", alpha = 1/2) +
  geom_line(color = "#ffb7c5", alpha = 1/2, size = 1.5) +
  ylab("bias value") +
  theme_bw() +
  theme(panel.background = element_rect(fill = "#4f455c"),
        panel.grid = element_blank())

To elucidate what’s going on in that plot, we might break it up with facet_wrap().

b %>% 
  mutate(bias_function = str_c("bias function ", bias_function)) %>% 
  
  ggplot(aes(x = year, y = bias)) +
  geom_vline(xintercept = knot_list, color = "white", alpha = 1/2) +
  geom_line(color = "#ffb7c5", size = 1.5) +
  ylab("bias value") +
  theme_bw() +
  theme(panel.background = element_rect(fill = "#4f455c"),
        panel.grid = element_blank(),
        strip.background = element_rect(fill = scales::alpha("#ffb7c5", .25), color = "transparent"),
        strip.text = element_text(size = 8, margin = margin(0.1, 0, 0.1, 0, "cm"))) +
  facet_wrap(~ bias_function, ncol = 1)

Now to get the parameter weights for each basis function, we need to actually define the model and make it run. The model is just a linear regression. The synthetic basis functions do all the work. We’ll use each column of the matrix B as a variable. We’ll also have an intercept to capture the average blossom day. This will make it easier to define priors on the basis weights, because then we can just conceive of each as a deviation from the intercept. (p. 117)

That last line is another indication for why we set intercept = TRUE. Our model will follow the form

\[\begin{align*} \text{day_in_year}_i & \sim \operatorname{Normal} (\mu_i, \sigma) \\ \mu_i & = \alpha + \color{#4f455c}{\sum_{k=1}^K w_k B_{k, i}} \\ \alpha & \sim \operatorname{Normal}(100, 10) \\ \color{#4f455c}{w_j} & \color{#4f455c}\sim \color{#4f455c}{\operatorname{Normal}(0, 10)} \\ \sigma & \sim \operatorname{Exponential}(1), \end{align*}\]

where \(\alpha\) is the intercept, \(B_{k, i}\) is the value of the \(k^\text{th}\) bias function on the \(i^\text{th}\) row of the data, and \(w_k\) is the estimated regression weight for the corresponding \(k^\text{th}\) bias function.

Throughout this chapter, I’ve griped a bit about using the uniform prior for \(\sigma\). Now that McElreath has introduced the exponential distribution as an alternative, those gripes are coming to an end. The exponential distribution is controlled by a single parameter, \(\lambda\), which is also called the rate. As it turns out, the mean of the exponential distribution is the inverse of the rate, \(1 / \lambda\). Here we use the dexp() function to get a sense of what that prior looks like.

tibble(x = seq(from = 0, to = 10, by = 0.1)) %>% 
  mutate(d = dexp(x, rate = 1)) %>% 
  
  ggplot(aes(x = x, y = d)) +
  geom_area(fill = "#ffb7c5") +
  scale_y_continuous(NULL, breaks = NULL) +
  theme_bw() +
  theme(panel.background = element_rect(fill = "#4f455c"),
        panel.grid = element_blank())

“We’ll use exponential priors for the rest of the book, in place of uniform priors. It is much more common to have a sense of the average deviation than of the maximum” (p. 119). 🎉

Acknowledgment: The workflow to follow is heavily influenced by the helpful contributions from Stephen Wild. My first pass through the material in this section was a mess. Wild’s insights knocked it out of the park and I couldn’t be more grateful. 🍻

Before fitting this model in brms, well take a minor detour on the data structure. In his R code 4.76, McElreath defined his data in a list, list( D=d2$doy , B=B ). Our approach will be a little different. Here, we’ll add the B matrix to our d2 data frame and name the results as d3.

d3 <-
  d2 %>% 
  mutate(B = B) 

# take a look at the structure of `d3
d3 %>% glimpse()
## Rows: 827
## Columns: 6
## $ year       <int> 812, 815, 831, 851, 853, 864, 866, 869, 889, 891, 892, 894, 895, 896, 902, 908,…
## $ doy        <int> 92, 105, 96, 108, 104, 100, 106, 95, 104, 109, 108, 106, 104, 104, 102, 98, 95,…
## $ temp       <dbl> NA, NA, NA, 7.38, NA, 6.42, 6.44, NA, 6.83, 6.98, 7.11, 6.98, 7.08, 7.20, 7.50,…
## $ temp_upper <dbl> NA, NA, NA, 12.10, NA, 8.69, 8.11, NA, 8.48, 8.96, 9.11, 8.40, 8.57, 8.69, 8.95…
## $ temp_lower <dbl> NA, NA, NA, 2.66, NA, 4.14, 4.77, NA, 5.19, 5.00, 5.11, 5.55, 5.58, 5.72, 6.06,…
## $ B          <dbl[,17]> <matrix[33 x 17]>

In our d3 data, columns year through temp_lower are all standard data columns. The B column is a matrix column, which contains the same number of rows as the others, but also smuggled in 17 columns within that column. Each of those 17 columns corresponds to one of our synthetic \(B_k\) variables. The advantage of such a data structure is we can simply define our formula argument as doy ~ 1 + B, where B is a stand-in for B.1 + B.2 + ... + B.17.

Here’s how to fit the model.

b4.8 <- 
  brm(data = d3,
      family = gaussian,
      doy ~ 1 + B,
      prior = c(prior(normal(100, 10), class = Intercept),
                prior(normal(0, 10), class = b),
                prior(exponential(1), class = sigma)),
      iter = 2000, warmup = 1000, chains = 4, cores = 4,
      seed = 4,
      file = "fits/b04.08")

Here’s the model summary.

print(b4.8)
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: doy ~ 1 + B 
##    Data: d3 (Number of observations: 827) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept   103.59      2.49    98.58   108.45 1.01      761     1071
## B1           -3.19      3.88   -10.80     4.21 1.00     1591     2435
## B2           -1.11      3.92    -8.82     6.62 1.00     1463     1859
## B3           -1.27      3.68    -8.58     6.07 1.00     1332     1833
## B4            4.56      2.96    -1.13    10.54 1.00     1046     1648
## B5           -1.08      2.99    -6.89     4.78 1.00      954     1399
## B6            4.03      3.02    -1.86    10.07 1.00     1056     1515
## B7           -5.55      2.91   -11.23     0.30 1.00      981     1672
## B8            7.57      2.92     1.92    13.29 1.00     1005     1583
## B9           -1.22      2.97    -6.90     4.58 1.01     1001     1458
## B10           2.79      3.03    -3.17     9.02 1.00     1019     1527
## B11           4.42      3.00    -1.49    10.27 1.00     1056     1846
## B12          -0.37      2.99    -5.98     5.70 1.00     1013     1551
## B13           5.28      2.99    -0.52    11.07 1.00     1024     1768
## B14           0.46      3.09    -5.44     6.74 1.00     1036     1712
## B15          -1.05      3.37    -7.77     5.59 1.00     1215     2147
## B16          -7.25      3.44   -14.01    -0.49 1.00     1287     1936
## B17          -7.88      3.26   -14.24    -1.29 1.00     1246     1824
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     5.94      0.14     5.67     6.23 1.00     4173     2921
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

Look at that. Each of the 17 columns in our B matrix was assigned its own parameter. If you fit this model using McElreath’s rethinking code, you’ll see the results are very similar. Anyway, McElreath’s comments are in line with the general consensus on spline modes: the parameter estimates are very difficult to interpret directly. It’s often easier to just plot the results. First we’ll use posterior_samples().

post <- posterior_samples(b4.8)

glimpse(post)
## Rows: 4,000
## Columns: 20
## $ b_Intercept <dbl> 103.74086, 104.25796, 103.13392, 104.66690, 105.20821, 105.41704, 105.65525, 1…
## $ b_B1        <dbl> -6.4826048, -5.4990824, -2.1646605, 0.4607746, -9.0040921, -7.3966727, -11.508…
## $ b_B2        <dbl> 3.74875270, -0.53550232, -0.42138745, -8.06386888, 0.66505588, 0.72880418, 2.9…
## $ b_B3        <dbl> -4.22042929, -2.26081740, -0.95183959, -1.78965585, -8.02914674, -9.16372510, …
## $ b_B4        <dbl> 5.8833689, 4.7218957, 3.5397449, 3.2324508, 5.1640256, 5.9370012, 2.3368581, 3…
## $ b_B5        <dbl> -4.6338971, -2.0490845, -1.7740736, -1.4582729, -4.3913120, -3.1619283, -4.307…
## $ b_B6        <dbl> 6.00593380, 2.37093832, 6.60788828, 0.95493974, 2.78863329, 2.14199241, 1.6839…
## $ b_B7        <dbl> -6.9436490, -3.8421888, -7.9974119, -6.7766133, -5.6201809, -6.7678482, -8.336…
## $ b_B8        <dbl> 7.316514, 5.119854, 8.393274, 5.799375, 6.329872, 5.731784, 5.708979, 6.760733…
## $ b_B9        <dbl> -1.9490036, -0.2646558, -1.4861210, -1.3906578, -3.7449285, -3.6035334, -4.008…
## $ b_B10       <dbl> 3.9720418, 2.9088482, 4.1508398, 1.3121326, 2.3465406, 3.1685000, 1.6009567, 0…
## $ b_B11       <dbl> 3.6933735, 3.2552081, 2.3290076, 3.5780345, 1.4853387, 1.4431854, 1.8636113, 5…
## $ b_B12       <dbl> -0.2419564, -2.2181982, -0.6924766, -1.5380999, -1.3345390, -1.7848732, -1.199…
## $ b_B13       <dbl> 5.416313, 3.645593, 6.208850, 4.199717, 4.373958, 3.727896, 1.780963, 4.056961…
## $ b_B14       <dbl> 1.5023931, 2.6503143, 1.1680142, 1.9840733, -3.6880151, -4.3624263, -1.3456437…
## $ b_B15       <dbl> -1.51102386, -4.81636227, 1.75411291, -3.42256204, -1.68102337, -2.74340145, -…
## $ b_B16       <dbl> -8.4839941, -7.0335783, -8.7948696, -4.4579658, -10.4273168, -10.3023468, -13.…
## $ b_B17       <dbl> -9.947067, -10.641582, -8.199229, -8.768240, -7.976792, -8.981154, -7.529841, …
## $ sigma       <dbl> 6.077514, 5.949900, 6.119600, 5.906621, 5.937513, 5.884939, 5.901013, 5.992085…
## $ lp__        <dbl> -2711.191, -2710.897, -2714.579, -2714.185, -2711.390, -2713.697, -2712.118, -…

With a little wrangling, we can use summary information from post to make our version of the middle panel of Figure 4.13.

post %>% 
  select(b_B1:b_B17) %>% 
  set_names(c(str_c(0, 1:9), 10:17)) %>% 
  pivot_longer(everything(), names_to = "bias_function") %>% 
  group_by(bias_function) %>% 
  summarise(weight = mean(value)) %>% 
  full_join(b, by = "bias_function") %>% 
  
  # plot
  ggplot(aes(x = year, y = bias * weight, group = bias_function)) +
  geom_vline(xintercept = knot_list, color = "white", alpha = 1/2) +
  geom_line(color = "#ffb7c5", alpha = 1/2, size = 1.5) +
  theme_bw() +
  theme(panel.background = element_rect(fill = "#4f455c"),
        panel.grid = element_blank()) 

In case you missed it, the main action in the ggplot2 code was y = bias * weight, where we defined the \(y\)-axis as the product of bias and weight. This is fulfillment of the \(w_k B_{k, i}\) parts of the model. Now here’s how we might use brms::fitted() to make the lower plot of Figure 4.13.

f <- fitted(b4.8)

f %>% 
  data.frame() %>% 
  bind_cols(d2) %>% 
  
  ggplot(aes(x = year, y = doy, ymin = Q2.5, ymax = Q97.5)) + 
  geom_vline(xintercept = knot_list, color = "white", alpha = 1/2) +
  geom_hline(yintercept = fixef(b4.8)[1, 1], color = "white", linetype = 2) +
  geom_point(color = "#ffb7c5", alpha = 1/2) +
  geom_ribbon(fill = "white", alpha = 2/3) +
  labs(x = "year",
       y = "day in year") +
  theme_bw() +
  theme(panel.background = element_rect(fill = "#4f455c"),
        panel.grid = element_blank())

If it wasn’t clear, the dashed horizontal line intersecting a little above 100 on the \(y\)-axis is the poster mean for the intercept. Now let’s use our skills to remake the simpler model expressed in Figure 4.12. This model, recall, is based on 5 knots.

# redo the `B` splines
num_knots <- 5
knot_list <- quantile(d2$year, probs = seq(from = 0, to = 1, length.out = num_knots))

B <- bs(d2$year,
        knots = knot_list[-c(1, num_knots)], 
        # this makes the splines liner rater than cubic
        degree = 1, 
        intercept = TRUE)

# define a new `d4` data
d4 <- 
  d2 %>% 
  mutate(B = B)

b4.9 <- 
  brm(data = d4,
      family = gaussian,
      formula = doy ~ 1 + B,
      prior = c(prior(normal(100, 10), class = Intercept),
                prior(normal(0, 10), class = b),
                prior(exponential(1), class = sigma)),
      iter = 2000, warmup = 1000, chains = 4, cores = 4,
      seed = 4,
      file = "fits/b04.09")

Review the new model summary.

print(b4.9)
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: doy ~ 1 + B 
##    Data: d4 (Number of observations: 827) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept   103.46      4.57    94.27   112.27 1.01      673     1077
## B1           -0.32      4.65    -9.48     9.16 1.01      694     1124
## B2            1.49      4.60    -7.52    10.84 1.01      692     1048
## B3            1.57      4.60    -7.40    10.90 1.01      681     1117
## B4            3.99      4.59    -5.00    13.02 1.01      689     1140
## B5           -5.56      4.63   -14.64     3.76 1.01      685     1175
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     6.10      0.15     5.82     6.41 1.00     2537     1993
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

Here we do all the work in bulk to make and save the three subplots for Figure 4.12.

## top
# wrangle a bit
b <-
  invoke(data.frame, d4) %>% 
  pivot_longer(starts_with("B"),
               names_to = "bias_function",
               values_to = "bias")

# plot
p1 <- 
  b %>% 
  ggplot(aes(x = year, y = bias, group = bias_function)) +
  geom_vline(xintercept = knot_list, color = "white", alpha = 1/2) +
  geom_line(color = "#ffb7c5", alpha = 1/2, size = 1.5) +
  scale_x_continuous(NULL, breaks = NULL) +
  ylab("bias value")

## middle
# wrangle
p2 <-
  posterior_samples(b4.9) %>% 
  select(b_B1:b_B5) %>% 
  set_names(str_c("B.", 1:5)) %>% 
  pivot_longer(everything(), names_to = "bias_function") %>% 
  group_by(bias_function) %>% 
  summarise(weight = mean(value)) %>% 
  full_join(b, by = "bias_function") %>% 
  
  # plot
  ggplot(aes(x = year, y = bias * weight, group = bias_function)) +
  geom_vline(xintercept = knot_list, color = "white", alpha = 1/2) +
  geom_line(color = "#ffb7c5", alpha = 1/2, size = 1.5) +
  scale_x_continuous(NULL, breaks = NULL)

## bottom
# wrangle
f <- fitted(b4.9)

p3 <-
  f %>% 
  data.frame() %>% 
  bind_cols(d2) %>% 
  
  # plot
  ggplot(aes(x = year, y = doy, ymin = Q2.5, ymax = Q97.5)) + 
  geom_vline(xintercept = knot_list, color = "white", alpha = 1/2) +
  geom_hline(yintercept = fixef(b4.9)[1, 1], color = "white", linetype = 2) +
  geom_point(color = "#ffb7c5", alpha = 1/2) +
  geom_ribbon(fill = "white", alpha = 2/3) +
  labs(x = "year",
       y = "day in year")

Now combine the subplots with patchwork syntax and behold their glory.

(p1 / p2 / p3) &
  theme_bw() &
  theme(panel.background = element_rect(fill = "#4f455c"),
        panel.grid = element_blank())

4.5.3 Smooth functions for a rough world.

The splines in the previous section are just the beginning. A entire class of models, generalized additive models (GAMs), focuses on predicting an outcome variable using smooth functions of some predictor variables. The topic is deep enough to deserve its own book. (p. 120, emphasis in the original)

McElreath ended that block quote with a reference to his endnote #78. On page 562, we read: “A very popular and comprehensive text is Wood (2017a).”

4.6 Summary First bonus: Smooth functions with brms::s()

It’s convenient for us how McElreath ended that last section with a reference to Simon Wood’s work because brms allows for a variety of non-linear models by borrowing functions from Woods’s mgcv package (Wood, 2003, 2004, 2011, 2017b, 2019; Wood et al., 2016). The two smooth functions brms imports from mgcv are s() and t2(). We’ll be exploring s(). We might use the brms::get_prior() function to get a sense of how to set up the priors when using s().

get_prior(data = d2,
          family = gaussian,
          doy ~ 1 + s(year))
##                   prior     class    coef group resp dpar nlpar bound       source
##                  (flat)         b                                          default
##                  (flat)         b syear_1                             (vectorized)
##  student_t(3, 105, 5.9) Intercept                                          default
##    student_t(3, 0, 5.9)       sds                                          default
##    student_t(3, 0, 5.9)       sds s(year)                             (vectorized)
##    student_t(3, 0, 5.9)     sigma                                          default

We have an overall intercept (class = Intercept), a single \(\beta\) parameter for year (class = b), a \(\sigma\) parameter (class = sigma), and an unfamiliar parameter of class = sds. I’m not going to go into that last parameter in any detail, here. We’ll need to work our way up through Chapter 13 and the multilevel model to get a full picture of what it means. The important thing to note here is that the priors for our s()-based alternative to the B-spline models, above, are going to look a little different. Here’s how we might fit an alternative to b4.8

b4.10 <-
  brm(data = d2,
      family = gaussian,
      doy ~ 1 + s(year),
      prior = c(prior(normal(100, 10), class = Intercept),
                prior(normal(0, 10), class = b),
                prior(student_t(3, 0, 5.9), class = sds),
                prior(exponential(1), class = sigma)),
      iter = 2000, warmup = 1000, chains = 4, cores = 4,
      seed = 4,
      control = list(adapt_delta = .99),
      file = "fits/b04.10")

Check out the model summary.

print(b4.10)
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: doy ~ 1 + s(year) 
##    Data: d2 (Number of observations: 827) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Smooth Terms: 
##              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sds(syear_1)    22.36      7.56    11.87    40.48 1.00     1115     1863
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept   104.54      0.22   104.13   104.98 1.00     4095     2264
## syear_1      -9.87      8.94   -27.42     7.43 1.00     3123     2818
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     6.04      0.15     5.75     6.36 1.00     4305     2900
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

Our intercept and \(\sigma\) summaries are similar to those we got from b4.8. The rest looks different. Here’s what happens when we use brms::fitted() to plot the implications of the model.

fitted(b4.10) %>% 
  data.frame() %>% 
  bind_cols(select(d2, year, doy)) %>% 
  
  ggplot(aes(x = year, y = doy, ymin = Q2.5, ymax = Q97.5)) +
  geom_hline(yintercept = fixef(b4.10)[1, 1], color = "white", linetype = 2) +
  geom_point(color = "#ffb7c5", alpha = 1/2) +
  geom_ribbon(fill = "white", alpha = 2/3) +
  labs(subtitle = "b4.7 using s(year)",
       y = "day in year") +
  theme_bw() +
  theme(panel.background = element_rect(fill = "#4f455c"), 
        panel.grid = element_blank())

That smooth doesn’t look quite the same. Hopefully this isn’t terribly surprising. We used a function from a different package and ended up with a different underlying statistical model. In fact, we didn’t even use a B-spline. The default for s() is to use what’s called a thin plate regression spline. If we’d like to fit a B-spline, we have to set bs = "bs". Here’s an example.

b4.11 <-
  brm(data = d2,
      family = gaussian,
      doy ~ 1 + s(year, bs = "bs", k = 19),
      prior = c(prior(normal(100, 10), class = Intercept),
                prior(normal(0, 10), class = b),
                prior(student_t(3, 0, 5.9), class = sds),
                prior(exponential(1), class = sigma)),
      iter = 2000, warmup = 1000, chains = 4, cores = 4,
      seed = 4,
      control = list(adapt_delta = .99),
      file = "fits/b04.11")
print(b4.11)
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: doy ~ 1 + s(year, bs = "bs", k = 19) 
##    Data: d2 (Number of observations: 827) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Smooth Terms: 
##              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sds(syear_1)     1.31      0.59     0.54     2.80 1.01      622      988
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept   104.54      0.20   104.14   104.95 1.00     5150     2533
## syear_1      -0.09      0.33    -0.75     0.57 1.00     1873     2216
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     5.98      0.15     5.70     6.28 1.00     4065     3050
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

Now here’s the depiction of our s()-based B-spline model.

fitted(b4.11) %>% 
  data.frame() %>% 
  bind_cols(select(d2, year, doy)) %>% 
  
  ggplot(aes(x = year, y = doy, ymin = Q2.5, ymax = Q97.5)) +
  geom_hline(yintercept = fixef(b4.11)[1, 1], color = "white", linetype = 2) +
  geom_point(color = "#ffb7c5", alpha = 1/2) +
  geom_ribbon(fill = "white", alpha = 2/3) +
  labs(subtitle = 'b4.7_bs using s(year, bs = "bs")',
       y = "day in year") +
  theme_bw() +
  theme(panel.background = element_rect(fill = "#4f455c"), 
        panel.grid = element_blank())

There are still other important differences between the underlying statistical model for b4.11 and the earlier b4.8 that I’m just not going to go into, here.

For more on the B-splines and smooths, more generally, check out the blog post by the great Gavin Simpson, Extrapolating with B splines and GAMs. For a high-level introduction to the models you can fit with mgcv, check out the nice talk by Noam Ross, Nonlinear models in R: The wonderful world of mgcv, or the equally-nice presentation by Simpson, Introduction to generalized additive models with R and mgcv. Ross offers a free online course covering mgcv, called GAMS in R, and he maintains a GitHub repo cataloguing other GAM-related resources, called Resources for learning about and using GAMs in R. For specific examples of fitting various GAMS with brms, check out Simpson’s blog post, Fitting GAMs with brms: part 1. Finally, Tristan Mahr has a nice blog post called Random effects and penalized splines are the same thing, where he outlined the connections between penalized smooths, such as you might fit with mgcv, with the multilevel model, which we’ll learn all about starting in Chapter 13, which helps explain what’s going on with the s() function in our last two models, b4.10 and b4.11.

4.7 Second bonus: Group predictors with matrix columns

When we fit b4.8, our direct brms analogue to McElreath’s m4.72, we used a compact syntax to pass a matrix column of predictors into the formula. If memory serves, this is one of the only places in the text where we see this. It would be easy for the casual reader to think this was only appropriate for something like a spline model. But that’s not the case. One could use the matrix-column trick as a general approach. In this bonus section, we’ll explore how.

In Section 11.2.6 of their (2020) text, Gelman, Hill, and Vehtari worked through an example of a multiple regression model,

\[ \begin{align*} y_i & \sim \operatorname{Normal}(\mu_i, \sigma) \\ \mu_i & = \alpha + \theta z_i + \sum_{k = 1}^K b_k x_{k, i} \\ & \text{<priors>}, \end{align*} \]

where \(y_i\) was some continuous variable collected across participants, \(i\). The \(\alpha\) term was the intercept and the \(\theta\) term was the regression slope for a binary variable \(z\)–we’ll practice with binary predictors in Section 5.3. More to our interest, the last portion of the equation is a compact way to convey there are \(K\) additional predictors and their associated regression coefficients, which we might more explicitly express as \(\beta_1 x_{1, i} + \cdots + \beta_k x_{k, i}\), where \(K \geq 1\). In this particular example, \(K = 10\), meaning there were ten \(x_{k, i}\) predictors, making this an example of a model with 11 total predictor variables.

Riffing off of Gelman and colleagues, here’s how you might simulate data of this kind.

# how many cases would you like?
n <- 100

# how many continuous x predictor variables would you like?
k <- 10

# simulate a dichotomous dummy variable for z
# simulate an n by k array for X
set.seed(4)

d <- 
  tibble(z = sample(0:1, size = n, replace = T),
         X = array(runif(n * k, min = 0, max = 1), dim = c(n, k)))

# set the data-generating parameter values
a     <- 1
theta <- 5
b     <- 1:k
sigma <- 2

# simulate the criterion
d <-
  d %>% 
  mutate(y = as.vector(a + X %*% b + theta * z + rnorm(n, mean = 0, sd = sigma)))

# check the data structure
d %>% 
  glimpse()
## Rows: 100
## Columns: 3
## $ z <int> 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1…
## $ X <dbl[,10]> <matrix[33 x 10]>
## $ y <dbl> 19.51231, 26.75217, 28.92495, 20.63144, 29.18219, 43.67198, 36.98107, 36.68241, 28.…

Although our d tibble has only three columns, the X column is a matrix column into which we’ve smuggled ten columns more. Here’s how we might access them more directly.

d %>% 
  pull(X) %>% 
  glimpse()
##  num [1:100, 1:10] 0.253 0.63 0.266 0.532 0.468 ...

See? There’s an \(100 \times 10\) data matrix in there. Tricky. Here’s how to fit the full model with brms where we use the compact matrix-column syntax in the formula argument.

b4.12 <-
  brm(data = d,
      family = gaussian,
      y ~ 1 + z + X,
      prior = c(prior(normal(0, 2), class = Intercept),
                prior(normal(0, 10), class = b),
                prior(exponential(1), class = sigma)),
      iter = 2000, warmup = 1000, chains = 4, cores = 4,
      seed = 4,
      file = "fits/b04.12")

Check the parameter summary.

print(b4.12)
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: y ~ 1 + z + X 
##    Data: d (Number of observations: 100) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept     0.95      1.21    -1.42     3.24 1.00     7096     3172
## z             4.74      0.42     3.92     5.57 1.00     7132     2651
## X1            0.56      0.74    -0.88     2.04 1.00     7515     3646
## X2            0.88      0.68    -0.47     2.24 1.00     7180     3427
## X3            3.40      0.72     1.99     4.82 1.00     7592     3281
## X4            2.82      0.73     1.38     4.26 1.00     7795     3134
## X5            5.75      0.71     4.35     7.15 1.00     6154     3273
## X6            6.41      0.74     4.98     7.85 1.00     6828     3383
## X7            8.49      0.73     7.06     9.92 1.00     7494     3063
## X8            8.40      0.71     7.02     9.78 1.00     7389     3078
## X9            8.83      0.80     7.28    10.39 1.00     6807     3026
## X10           9.33      0.70     7.94    10.69 1.00     6668     2701
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     1.99      0.16     1.71     2.32 1.00     5356     3113
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

brms automatically numbered our \(K = 10\) X variables as X1 through X10. As far as applications go, I’m not sure where I’d use this way of storing and modeling data in real life. But maybe some of y’all work in domains where this is just the right way to approach your data needs. If so, good luck and happy modeling.

Session info

sessionInfo()
## R version 4.0.4 (2021-02-15)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] splines   parallel  stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] tidybayes_2.3.1      brms_2.15.0          Rcpp_1.0.6           rstan_2.21.2        
##  [5] StanHeaders_2.21.0-7 patchwork_1.1.1      forcats_0.5.1        stringr_1.4.0       
##  [9] dplyr_1.0.5          purrr_0.3.4          readr_1.4.0          tidyr_1.1.3         
## [13] tibble_3.1.0         ggplot2_3.3.3        tidyverse_1.3.0     
## 
## loaded via a namespace (and not attached):
##   [1] readxl_1.3.1         backports_1.2.1      plyr_1.8.6           igraph_1.2.6        
##   [5] svUnit_1.0.3         crosstalk_1.1.0.1    TH.data_1.0-10       rstantools_2.1.1    
##   [9] inline_0.3.17        digest_0.6.27        htmltools_0.5.1.1    rsconnect_0.8.16    
##  [13] fansi_0.4.2          magrittr_2.0.1       modelr_0.1.8         RcppParallel_5.0.2  
##  [17] matrixStats_0.57.0   xts_0.12.1           sandwich_3.0-0       prettyunits_1.1.1   
##  [21] colorspace_2.0-0     rvest_0.3.6          ggdist_2.4.0.9000    haven_2.3.1         
##  [25] xfun_0.22            callr_3.5.1          crayon_1.4.1         jsonlite_1.7.2      
##  [29] lme4_1.1-25          survival_3.2-7       zoo_1.8-8            glue_1.4.2          
##  [33] gtable_0.3.0         emmeans_1.5.2-1      V8_3.4.0             distributional_0.2.2
##  [37] pkgbuild_1.2.0       shape_1.4.5          abind_1.4-5          scales_1.1.1        
##  [41] mvtnorm_1.1-1        emo_0.0.0.9000       DBI_1.1.0            miniUI_0.1.1.1      
##  [45] isoband_0.2.3        viridisLite_0.3.0    xtable_1.8-4         HDInterval_0.2.2    
##  [49] stats4_4.0.4         DT_0.16              htmlwidgets_1.5.2    httr_1.4.2          
##  [53] threejs_0.3.3        arrayhelpers_1.1-0   ellipsis_0.3.1       pkgconfig_2.0.3     
##  [57] loo_2.4.1            farver_2.0.3         dbplyr_2.0.0         utf8_1.1.4          
##  [61] tidyselect_1.1.0     labeling_0.4.2       rlang_0.4.10         reshape2_1.4.4      
##  [65] later_1.1.0.1        munsell_0.5.0        cellranger_1.1.0     tools_4.0.4         
##  [69] cli_2.3.1            generics_0.1.0       broom_0.7.5          ggridges_0.5.2      
##  [73] evaluate_0.14        fastmap_1.0.1        processx_3.4.5       knitr_1.31          
##  [77] fs_1.5.0             nlme_3.1-152         mime_0.10            projpred_2.0.2      
##  [81] xml2_1.3.2           compiler_4.0.4       bayesplot_1.8.0      shinythemes_1.1.2   
##  [85] rstudioapi_0.13      curl_4.3             gamm4_0.2-6          reprex_0.3.0        
##  [89] statmod_1.4.35       stringi_1.5.3        highr_0.8            ps_1.6.0            
##  [93] Brobdingnag_1.2-6    lattice_0.20-41      Matrix_1.3-2         nloptr_1.2.2.2      
##  [97] markdown_1.1         shinyjs_2.0.0        vctrs_0.3.6          pillar_1.5.1        
## [101] lifecycle_1.0.0      bridgesampling_1.0-0 estimability_1.3     httpuv_1.5.4        
## [105] R6_2.5.0             bookdown_0.21        promises_1.1.1       gridExtra_2.3       
## [109] codetools_0.2-18     boot_1.3-26          colourpicker_1.1.0   MASS_7.3-53         
## [113] gtools_3.8.2         assertthat_0.2.1     withr_2.4.1          shinystan_2.5.0     
## [117] multcomp_1.4-16      mgcv_1.8-33          hms_0.5.3            grid_4.0.4          
## [121] coda_0.19-4          minqa_1.2.4          rmarkdown_2.7        shiny_1.5.0         
## [125] lubridate_1.7.9.2    base64enc_0.1-3      dygraphs_1.1.1.6

  1. In the first edition of his text, McElreath (2015) started out with the uniform prior for \(\sigma\) and transitioned to the half Cauchy when he introduced HMC in Chapter 8. In this edition, he largely replaced the half Cauchy with the exponential distribution. Both are fine. If you’d like to learn more about the half Cauchy, you might check out Section 8.4.3.1 of my (2020a) translation of his first edition. We’ll learn about the exponential distribution just a little later in this chapter.↩︎

  2. I know. I’m sorry our model numbering diverged from the numbering in the text. The problem popped up in Section 4.5.1. where McElreath fit a linear model with the weight_s predictor, but didn’t show the code in the text and, as a consequence, didn’t assign that model a number. Since we explicitly fit that linear model, we assigned it the next number in the sequence. So it goes. But yes, I am correct in comparing our b4.8 to McElreath’s m4.7.↩︎