# 4 Geocentric Models

Linear regressionis the geocentric model of applied statistics. By “linear regression,” we will mean a family of simple statistical golems that attempt to learn about the mean and variance of some measurement, using an additive combination of other measurements. Like geocentrism, linear regression can usefully describe a very large variety of natural phenomena. Like geocentrism, linear regression is a descriptive model that corresponds to many different process models. If we read its structure too literally, we’re likely to make mistakes. But used wisely, these little linear golems continue to be useful. (McElreath, 2020a, p. 71,emphasis, in the original)

## 4.1 Why normal distributions are normal

After laying out his soccer field coin toss shuffle premise, McElreath wrote:

It’s hard to say where any individual person will end up, but you can say with great confidence what the collection of positions will be. The distances will be distributed in approximately normal, or Gaussian, fashion. This is true even though the underlying distribution is binomial. It does this because there are so many more possible ways to realize a sequence of left-right steps that sums to zero. There are slightly fewer ways to realize a sequence that ends up one step left or right of zero, and so on, with the number of possible sequences declining in the characteristic bell curve of the normal distribution. (p. 72)

### 4.1.1 Normal by addition.

Here’s a way to do the simulation necessary for the plot in the top panel of Figure 4.2.

```
library(tidyverse)
# we set the seed to make the results of `runif()` reproducible.
set.seed(4)
<-
pos # make data with 100 people, 16 steps each with a starting point of `step == 0` (i.e., 17 rows per person)
crossing(person = 1:100,
step = 0:16) %>%
# for all steps above `step == 0` simulate a `deviation`
mutate(deviation = map_dbl(step, ~if_else(. == 0, 0, runif(1, -1, 1)))) %>%
# after grouping by `person`, compute the cumulative sum of the deviations, then `ungroup()`
group_by(person) %>%
mutate(position = cumsum(deviation)) %>%
ungroup()
```

That `map_dbl()`

code within the first `mutate()`

line might look odd. Go here to learn more about iterating with `purrr::map_dbl()`

.

We might `glimpse()`

at the data.

`glimpse(pos)`

```
## Rows: 1,700
## Columns: 4
## $ person <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ step <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 0, 1, 2, 3, 4, 5, 6, 7…
## $ deviation <dbl> 0.00000000, -0.98210841, -0.41252078, -0.44525008, 0.62714843, -0.47914446, 0.44…
## $ position <dbl> 0.0000000, -0.9821084, -1.3946292, -1.8398793, -1.2127308, -1.6918753, -1.243063…
```

Here’s the code to make the top panel of Figure 4.2.

```
ggplot(data = pos,
aes(x = step, y = position, group = person)) +
geom_vline(xintercept = c(4, 8, 16), linetype = 2) +
geom_line(aes(color = person < 2, alpha = person < 2)) +
scale_color_manual(values = c("skyblue4", "black")) +
scale_alpha_manual(values = c(1/5, 1)) +
scale_x_continuous("step number", breaks = c(0, 4, 8, 12, 16)) +
theme(legend.position = "none")
```

Here’s the code for the bottom three plots of Figure 4.2.

```
# Figure 4.2.a.
<-
p1 %>%
pos filter(step == 4) %>%
ggplot(aes(x = position)) +
geom_line(stat = "density", color = "dodgerblue1") +
labs(title = "4 steps")
# Figure 4.2.b.
<-
p2 %>%
pos filter(step == 8) %>%
ggplot(aes(x = position)) +
geom_density(color = "dodgerblue2", outline.type = "full") +
labs(title = "8 steps")
# this is an intermediary step to get an SD value
<-
sd %>%
pos filter(step == 16) %>%
summarise(sd = sd(position)) %>%
pull(sd)
# Figure 4.2.c.
<-
p3 %>%
pos filter(step == 16) %>%
ggplot(aes(x = position)) +
stat_function(fun = dnorm,
args = list(mean = 0, sd = sd),
linetype = 2) +
geom_density(color = "transparent", fill = "dodgerblue3", alpha = 1/2) +
labs(title = "16 steps",
y = "density")
library(patchwork)
# combine the ggplots
| p2 | p3) & coord_cartesian(xlim = c(-6, 6)) (p1
```

While we were at it, we explored a few ways to express densities. The main action was with the `geom_line()`

, `geom_density()`

, and `stat_function()`

functions, respectively.

Any process that adds together random values from the same distribution converges to a normal. But it’s not easy to grasp why addition should result in a bell curve of sums. Here’s a conceptual way to think of the process. Whatever the average value of the source distribution, each sample from it can be thought of as a fluctuation from that average value. When we begin to add these fluctuations together, they also begin to cancel one another out. A large positive fluctuation will cancel a large negative one. (p. 73)

### 4.1.2 Normal by multiplication.

Here’s McElreath’s simple random growth rate.

```
set.seed(4)
prod(1 + runif(12, 0, 0.1))
```

`## [1] 1.774719`

In the `runif()`

part of that code, we generated 12 random draws from the uniform distribution with bounds \([0, 0.1]\). Within the `prod()`

function, we first added `1`

to each of those values and then computed their product. Consider a more explicit variant of the code.

```
set.seed(4)
tibble(a = 1,
b = runif(12, 0, 0.1)) %>%
mutate(c = a + b) %>%
summarise(p = prod(c))
```

```
## # A tibble: 1 x 1
## p
## <dbl>
## 1 1.77
```

Same result. Rather than using base **R** `replicate()`

to do this many times, let’s practice with `purrr::map_dbl()`

like before.

```
set.seed(4)
<-
growth tibble(growth = map_dbl(1:10000, ~ prod(1 + runif(12, 0, 0.1))))
ggplot(data = growth, aes(x = growth)) +
geom_density()
```

“The smaller the effect of each locus, the better this additive approximation will be” (p. 74). Let’s compare big and small.

```
# simulate
set.seed(4)
<-
samples tibble(big = map_dbl(1:10000, ~ prod(1 + runif(12, 0, 0.5))),
small = map_dbl(1:10000, ~ prod(1 + runif(12, 0, 0.01))))
# wrangle
%>%
samples pivot_longer(everything(), values_to = "samples") %>%
# plot
ggplot(aes(x = samples)) +
geom_density(fill = "black") +
facet_wrap(~ name, scales = "free")
```

Yep, the `small`

samples were more Gaussian. “The interacting growth deviations, as long as they are sufficiently small, converge to a Gaussian distribution. In this way, the range of causal forces that tend towards Gaussian distributions extends well beyond purely additive interactions” (p. 74).

### 4.1.3 Normal by log-multiplication.

Instead of saving our tibble, we’ll just feed it directly into our plot.

```
%>%
samples mutate(log_big = log(big)) %>%
ggplot(aes(x = log_big)) +
geom_density(fill = "gray33") +
xlab("the log of the big")
```

Yet another Gaussian distribution. We get the Gaussian distribution back, because adding logs is equivalent to multiplying the original numbers. So even multiplicative interactions of large deviations can produce Gaussian distributions, once we measure the outcomes on the log scale. (p. 75)

### 4.1.4 Using Gaussian distributions.

“The justifications for using the Gaussian distribution fall into two broad categories: (1) ontological and (2) epistemological” (p. 75). I’m a fan of the justifications to follow.

#### 4.1.4.1 Ontological justification.

The Gaussian is

a widespread pattern, appearing again and again at different scales and in different domains. Measurement errors, variations in growth, and the velocities of molecules all tend towards Gaussian distributions. These processes do this because at their heart, these processes add together fluctuations. And repeatedly adding finite fluctuations results in a distribution of sums that have shed all information about the underlying process, aside from mean and spread.

[However,] one consequence of this is that statistical models based on Gaussian distributions cannot reliably identify micro-process. (p. 75)

But like Ptolemy’s circles within circles, the Gaussian can still be useful even if it sheds information.

#### 4.1.4.2 Epistemological justification.

By the epistemological justification, the Gaussian represents a particular state of ignorance. When all we know or are willing to say about a distribution of measures (measures are continuous values on the real number line) is their mean and variance, then the Gaussian distribution arises as the most consistent with our assumptions.

That is to say that the Gaussian distribution is the most natural expression of our state of ignorance, because if all we are willing to assume is that a measure has finite variance, the Gaussian distribution is the shape that can be realized in the largest number of ways and does not introduce any new assumptions. It is the least surprising and least informative assumption to make. In this way, the Gaussian is the distribution most consistent with our assumptions. Or rather, it is the most consistent with our golem’s assumptions. If you don’t think the distribution should be Gaussian, then that implies that you know something else that you should tell your golem about, something that would improve inference. (p. 75)

We’ll dive deeper into why the Gaussian is such a natural expression of ignorance in these contexts when we cover maximum entropy in Chapter 7.

#### 4.1.4.3 Rethinking: Heavy tails.

The Gaussian distribution is common in nature and has some nice properties. But there are some risks in using it as a default data model. The extreme ends of a distribution are known as its tails. And the Gaussian distribution has some very thin tails–there is very little probability in them. Instead most of the mass in the Gaussian lies within one standard deviation of the mean. Many natural (and unnatural) processes have much heavier tails. (p. 76)

You have no idea how excited I am that we’ll be covering some of these heavy-tailed alternatives!

#### 4.1.4.4 Overthinking: Gaussian distribution.

In this section McElreath gave the formula for the Gaussian probability density function. Let \(y\) be the criterion, \(\mu\) be the mean, and \(\sigma\) be the standard deviation. Then the probability density of some Gaussian value \(y\) is

\[p(y|\mu, \sigma) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left (- \frac{(y - \mu)^2}{2 \sigma^2} \right).\]

McElreath’s right. “This looks monstrous” (p. 76). Why not demystify that monster with a little **R** code? For simplicity, we’ll compute \(p(y|\mu, \sigma)\) for a series of \(y\)-values ranging from -1 to 1, holding \(\mu = 0\) and \(\sigma = 0.1\). Then we’ll plot.

```
tibble(y = seq(from = -1, to = 1, by = .01),
mu = 0,
sigma = 0.1) %>%
# compute p(y) with a hand-made Gaussian probability density function
mutate(p = (1 / sqrt(2 * pi * sigma^2)) * exp(-((y - mu)^2 / (2 * sigma^2)))) %>%
ggplot(aes(x = y, y = p)) +
geom_line() +
ylab(expression(italic(p)(italic("y|")*mu==0*","~sigma==0.1)))
```

Notice how \(p(y|\mu, \sigma)\) peaks around 4 when \(y = 0\). We can also get that value with the `dnorm()`

function, which will return the \(p(y)\) value for a given combination of \(y\), \(\mu\), and \(\sigma\).

`dnorm(0, mean = 0, sd = 0.1)`

`## [1] 3.989423`

The answer, about 4, is no mistake. Probability

densityis the rate of change in cumulative probability. So where cumulative probability is increasing rapidly, density can easily exceed 1. But if we calculate the area under the density function, it will never exceed 1. Such areas are also called probability mass. You can usually ignore these density/mass details while doing computational work. But it’s good to be aware of the distinction. (p. 76,emphasisin the original)

## 4.2 A language for describing models

Our mathy ways of summarizing models will be something like

\[\begin{align*} \text{criterion}_i & \sim \operatorname{Normal}(\mu_i, \sigma) \\ \mu_i & = \beta \times \text{predictor}_i \\ \beta & \sim \operatorname{Normal}(0, 10) \\ \sigma & \sim \operatorname{Exponential}(1) \\ x_i & \sim \operatorname{Normal}(0, 1). \end{align*}\]

“If that doesn’t make much sense, good. That indicates that you are holding the right textbook” (p. 77). Welcome applied statistics! 🤓

### 4.2.1 Re-describing the globe tossing model.

For the globe tossing model, the probability \(p\) of a count of water \(w\) based on \(n\) trials was

\[\begin{align*} w & \sim \operatorname{Binomial}(n, p) \\ p & \sim \operatorname{Uniform}(0, 1), \end{align*}\]

where the top line indicates we’re using the Binomial likelihood function to model \(w\) given unique combinations of \(n\) and \(p\). In our example, \(w\) and \(n\) were already defined in the data, so all we need to do is compute \(p\). Since \(p\) is the only parameter, it’s the only element that gets a prior, which is what that second line was.

#### 4.2.1.1 Overthinking: From model definition to Bayes’ theorem.

We can use grid approximation to work through our globe tossing model.

```
# how many `p_grid` points would you like?
<- 100
n_points
<-
d tibble(p_grid = seq(from = 0, to = 1, length.out = n_points),
w = 6,
n = 9) %>%
mutate(prior = dunif(p_grid, 0, 1),
likelihood = dbinom(w, n, p_grid)) %>%
mutate(posterior = likelihood * prior / sum(likelihood * prior))
head(d)
```

```
## # A tibble: 6 x 6
## p_grid w n prior likelihood posterior
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 6 9 1 0. 0.
## 2 0.0101 6 9 1 8.65e-11 8.74e-12
## 3 0.0202 6 9 1 5.37e- 9 5.43e-10
## 4 0.0303 6 9 1 5.93e- 8 5.99e- 9
## 5 0.0404 6 9 1 3.23e- 7 3.26e- 8
## 6 0.0505 6 9 1 1.19e- 6 1.21e- 7
```

In case you were curious, here’s what they look like.

```
%>%
d pivot_longer(prior:posterior) %>%
# this line allows us to dictate the order in which the panels will appear
mutate(name = factor(name, levels = c("prior", "likelihood", "posterior"))) %>%
ggplot(aes(x = p_grid, y = value, fill = name)) +
geom_area() +
scale_fill_manual(values = c("blue", "red", "purple")) +
scale_y_continuous(NULL, breaks = NULL) +
theme(legend.position = "none") +
facet_wrap(~ name, scales = "free")
```

The posterior is a combination of the prior and the likelihood. When the prior is flat across the parameter space, the posterior is just the likelihood re-expressed as a probability. As we go along, you’ll see we almost never use flat priors in practice. Be warned that eschewing flat priors is a recent development, however. You only have to look at the literature from a couple decades ago to see mounds and mounds of flat priors.

## 4.3 A Gaussian model of height

There are an infinite number of possible Gaussian distributions. Some have small means. Others have large means. Some are wide, with a large \(\sigma\). Others are narrow. We want our Bayesian machine to consider every possible distribution, each defined by a combination of \(\mu\) and \(\sigma\), and rank them by posterior plausibility. Posterior plausibility provides a measure of the logical compatibility of each possible distribution with the data and model. (p. 79)

### 4.3.1 The data.

Let’s load the Howell (2001, 2010) data from McElreath’s (2020b) **rethinking** package.

```
library(rethinking)
data(Howell1)
<- Howell1 d
```

Here we open our focal statistical package, Bürkner’s **brms**. But before we do, we’ll want to detach the **rethinking** package. **R** will not allow us to use a function from one package that shares the same name as a different function from another package if both packages are open at the same time. The **rethinking** and **brms** packages are designed for similar purposes and, unsurprisingly, overlap in some of their function names. To prevent problems, we will always make sure **rethinking** is detached before using **brms**. To learn more on the topic, see this R-bloggers post.

```
rm(Howell1)
detach(package:rethinking, unload = T)
library(brms)
```

Go ahead and investigate the data with `str()`

, the **tidyverse** analogue for which is `glimpse()`

.

```
%>%
d str()
```

```
## 'data.frame': 544 obs. of 4 variables:
## $ height: num 152 140 137 157 145 ...
## $ weight: num 47.8 36.5 31.9 53 41.3 ...
## $ age : num 63 63 65 41 51 35 32 27 19 54 ...
## $ male : int 1 0 0 1 0 1 0 1 0 1 ...
```

The **brms** package does not have a function that works like `rethinking::precis()`

for providing numeric and graphical summaries of variables, as see in McElreath’s **R** code 4.9. We can get some of that information with `summary()`

.

```
%>%
d summary()
```

```
## height weight age male
## Min. : 53.98 Min. : 4.252 Min. : 0.00 Min. :0.0000
## 1st Qu.:125.09 1st Qu.:22.008 1st Qu.:12.00 1st Qu.:0.0000
## Median :148.59 Median :40.058 Median :27.00 Median :0.0000
## Mean :138.26 Mean :35.611 Mean :29.34 Mean :0.4724
## 3rd Qu.:157.48 3rd Qu.:47.209 3rd Qu.:43.00 3rd Qu.:1.0000
## Max. :179.07 Max. :62.993 Max. :88.00 Max. :1.0000
```

We might make the histograms like this.

```
%>%
d pivot_longer(everything()) %>%
mutate(name = factor(name, levels = c("height", "weight", "age", "male"))) %>%
ggplot(aes(x = value)) +
geom_histogram(bins = 10) +
facet_wrap(~ name, scales = "free", ncol = 1)
```

If you’re curious, McElreath made those tiny histograms with help from Wickham’s `histospark()`

function. Here’s the code.

```
<- c("\u2581", "\u2582", "\u2583", "\u2585", "\u2587")
sparks
<- function(x, width = 10) {
histospark <- graphics::hist(x, breaks = width, plot = FALSE)
bins
<- cut(
factor $counts / max(bins$counts),
binsbreaks = seq(0, 1, length = length(sparks) + 1),
labels = sparks,
include.lowest = TRUE
)
paste0(factor, collapse = "")
}
```

Here’s how it works.

`histospark(d$weight)`

`## [1] "▁▂▃▂▂▂▂▅▇▇▃▂▁"`

One of the neat things about the `histospark()`

function is you can insert the output right in your R Markdown prose. For example, we can use is to casually show how left skewed the our `height`

variable is: ▁▁▁▁▁▁▁▂▁▇▇▅▁. But it’s time to get back on track. You can isolate `height`

values with the `dplyr::select()`

function.

```
%>%
d select(height) %>%
glimpse()
```

```
## Rows: 544
## Columns: 1
## $ height <dbl> 151.7650, 139.7000, 136.5250, 156.8450, 145.4150, 163.8300, 149.2250, 168.9100, 147…
```

If you want the values in a numeric vector rather than in a data fame, try `pull(d, height)`

.

We can use the `dplyr::filter()`

function to make an adults-only data frame.

```
<-
d2 %>%
d filter(age >= 18)
```

Our reduced `d2`

does indeed have \(n = 352\) cases.

```
%>%
d2 count()
```

```
## n
## 1 352
```

#### 4.3.1.1 Overthinking: Data frames and indexes.

For more on indexing, check out Chapter 9 of Peng’s (2019) *R programming for data science*, or even the Subsetting subsection in *R4DS*.

This probably reflects my training history, but the structure of a data frame seems natural and inherently appealing. If you’re in the other camp, do check out either of these two data wrangling talks (here and here) by the ineffable Jenny Bryan.

### 4.3.2 The model.

Please heed McElreath’s warnings to

be careful about choosing the Gaussian distribution only when the plotted outcome variable looks Gaussian to you. Gawking at the raw data, to try to decide how to model them, is usually not a good idea. The data could be a mixture of different Gaussian distributions, for example, and in that case you won’t be able to detect the underlying normality just by eyeballing the outcome distribution. Furthermore, as mentioned earlier in this chapter, the empirical distribution needn’t be actually Gaussian in order to justify using a Gaussian probability distribution. (p. 81)

Anyway, the likelihood for our model is

\[\text{heights}_i \sim \operatorname{Normal}(\mu, \sigma),\]

where the \(i\) subscript indexes the individual cases in the data. Our two parameters are \(\mu\) and \(\sigma\), which we will estimate using Bayes’ formula. Our prior for \(\mu\) will be

\[\mu \sim \operatorname{Normal}(178, 20)\]

and our prior for \(\sigma\) will be

\[\sigma \sim \operatorname{Uniform}(0, 50).\]

Here’s the shape of the prior for \(\mu\), \(\mathcal N(178, 20)\).

```
<-
p1 tibble(x = seq(from = 100, to = 250, by = .1)) %>%
ggplot(aes(x = x, y = dnorm(x, mean = 178, sd = 20))) +
geom_line() +
scale_x_continuous(breaks = seq(from = 100, to = 250, by = 75)) +
labs(title = "mu ~ dnorm(178, 20)",
y = "density")
p1
```

And here’s the **ggplot2** code for our prior for \(\sigma\), a uniform distribution with a minimum value of 0 and a maximum value of 50. We don’t really need the \(y\)-axis when looking at the shapes of a density, so we’ll just remove it with `scale_y_continuous()`

.

```
<-
p2 tibble(x = seq(from = -10, to = 60, by = .1)) %>%
ggplot(aes(x = x, y = dunif(x, min = 0, max = 50))) +
geom_line() +
scale_x_continuous(breaks = c(0, 50)) +
scale_y_continuous(NULL, breaks = NULL) +
ggtitle("sigma ~ dunif(0, 50)")
p2
```

We can simulate from both priors at once to get a prior probability distribution of `heights`

.

```
<- 1e4
n
set.seed(4)
<-
sim tibble(sample_mu = rnorm(n, mean = 178, sd = 20),
sample_sigma = runif(n, min = 0, max = 50)) %>%
mutate(height = rnorm(n, mean = sample_mu, sd = sample_sigma))
<- sim %>%
p3 ggplot(aes(x = height)) +
geom_density(fill = "grey33") +
scale_x_continuous(breaks = c(0, 73, 178, 283)) +
scale_y_continuous(NULL, breaks = NULL) +
ggtitle("height ~ dnorm(mu, sigma)") +
theme(panel.grid = element_blank())
p3
```

If you look at the \(x\)-axis breaks on the plot in McElreath’s lower left panel in Figure 4.3, you’ll notice they’re intentional. To compute the mean and 3 standard deviations above and below, you might do this.

```
%>%
sim summarise(ll = mean(height) - sd(height) * 3,
mean = mean(height),
ul = mean(height) + sd(height) * 3) %>%
mutate_all(round, digits = 1)
```

```
## # A tibble: 1 x 3
## ll mean ul
## <dbl> <dbl> <dbl>
## 1 73.9 177. 281.
```

Our values are very close to his, but are off by just a bit due to simulation variation.

Here’s the work to make the lower right panel of Figure 4.3. Watch out; we’re starting to get fancy.

```
# simulate
set.seed(4)
<-
sim tibble(sample_mu = rnorm(n, mean = 178, sd = 100),
sample_sigma = runif(n, min = 0, max = 50)) %>%
mutate(height = rnorm(n, mean = sample_mu, sd = sample_sigma))
# compute the values we'll use to break on our x axis
<-
breaks c(mean(sim$height) - 3 * sd(sim$height), 0, mean(sim$height), mean(sim$height) + 3 * sd(sim$height)) %>%
round(digits = 0)
# this is just for aesthetics
<-
text tibble(height = 272 - 25,
y = .0013,
label = "tallest man",
angle = 90)
# plot
<-
p4 %>%
sim ggplot(aes(x = height)) +
geom_density(fill = "black", size = 0) +
geom_vline(xintercept = 0, color = "grey92") +
geom_vline(xintercept = 272, color = "grey92", linetype = 3) +
geom_text(data = text,
aes(y = y, label = label, angle = angle),
color = "grey92") +
scale_x_continuous(breaks = breaks) +
scale_y_continuous(NULL, breaks = NULL) +
ggtitle("height ~ dnorm(mu, sigma)\nmu ~ dnorm(178, 100)") +
theme(panel.grid = element_blank())
p4
```

You may have noticed how we were saving each of the four last plots as `p1`

through `p4`

. Let’s combine the four to make our version of McElreath’s Figure 4.3.

`+ xlab("mu") | p2 + xlab("sigma")) / (p3 | p4) (p1 `

On page 84, McElreath said his prior simulation indicated 4% of the heights would be below zero. Here’s how we might determe that percentage for our simulation.

```
%>%
sim count(height < 0) %>%
mutate(percent = 100 * n / sum(n))
```

```
## # A tibble: 2 x 3
## `height < 0` n percent
## <lgl> <int> <dbl>
## 1 FALSE 9571 95.7
## 2 TRUE 429 4.29
```

Here’s the break down compared to the tallest man on record, Robert Pershing Wadlow (1918–1940).

```
%>%
sim count(height < 272) %>%
mutate(percent = 100 * n / sum(n))
```

```
## # A tibble: 2 x 3
## `height < 272` n percent
## <lgl> <int> <dbl>
## 1 FALSE 1761 17.6
## 2 TRUE 8239 82.4
```

Does this matter? In this case, we have so much data that the silly prior is harmless. But that won’t always be the case. There are plenty of inference problems for which the data alone are not sufficient, no matter how numerous. Bayes lets us proceed in these cases. But only if we use our scientific knowledge to construct sensible priors. Using scientific knowledge to build priors is not cheating. The important thing is that your prior not be based on the values in the data, but only on what you know about the data before you see it. (p. 84)

#### 4.3.2.1 Rethinking: A farewell to epsilon.

Some readers will have already met an alternative notation for a Gaussian linear model:

\[\begin{align*} h_i & = \mu + \epsilon_i \\ \epsilon_i & \sim \operatorname{Normal}(0, \sigma) \end{align*}\]

This is equivalent to the \(h_i \sim \operatorname{Normal}(\mu, \sigma)\) form, with the \(\epsilon\) standing in for the Gaussian density. But this \(\epsilon\) form is poor form. The reason is that it does not usually generalize to other types of models. This means it won’t be possible to express non-Gaussian models using tricks like \(\epsilon\). Better to learn one system that does generalize. (p. 84)

Agreed.

### 4.3.3 Grid approximation of the posterior distribution.

As McElreath explained, you’ll never use this for practical data analysis. But I found this helped me better understanding what exactly we’re doing with Bayesian estimation. So let’s play along.

```
<- 200
n
<-
d_grid # we'll accomplish with `tidyr::crossing()` what McElreath did with base R `expand.grid()`
crossing(mu = seq(from = 140, to = 160, length.out = n),
sigma = seq(from = 4, to = 9, length.out = n))
glimpse(d_grid)
```

```
## Rows: 40,000
## Columns: 2
## $ mu <dbl> 140, 140, 140, 140, 140, 140, 140, 140, 140, 140, 140, 140, 140, 140, 140, 140, 140,…
## $ sigma <dbl> 4.000000, 4.025126, 4.050251, 4.075377, 4.100503, 4.125628, 4.150754, 4.175879, 4.20…
```

`d_grid`

contains every combination of `mu`

and `sigma`

across their specified values. Instead of base **R** `sapply()`

, we’ll do the computations by making a custom function which we’ll then plug into `purrr::map2()`

.

```
<- function(mu, sigma) {
grid_function
dnorm(d2$height, mean = mu, sd = sigma, log = T) %>%
sum()
}
```

Now we’re ready to complete the tibble.

```
<-
d_grid %>%
d_grid mutate(log_likelihood = map2(mu, sigma, grid_function)) %>%
unnest(log_likelihood) %>%
mutate(prior_mu = dnorm(mu, mean = 178, sd = 20, log = T),
prior_sigma = dunif(sigma, min = 0, max = 50, log = T)) %>%
mutate(product = log_likelihood + prior_mu + prior_sigma) %>%
mutate(probability = exp(product - max(product)))
head(d_grid)
```

```
## # A tibble: 6 x 7
## mu sigma log_likelihood prior_mu prior_sigma product probability
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 140 4 -3813. -5.72 -3.91 -3822. 0
## 2 140 4.03 -3778. -5.72 -3.91 -3787. 0
## 3 140 4.05 -3743. -5.72 -3.91 -3753. 0
## 4 140 4.08 -3709. -5.72 -3.91 -3719. 0
## 5 140 4.10 -3676. -5.72 -3.91 -3686. 0
## 6 140 4.13 -3644. -5.72 -3.91 -3653. 0
```

In the final `d_grid`

, the `probability`

vector contains the posterior probabilities across values of `mu`

and `sigma`

. We can make a contour plot with `geom_contour()`

.

```
%>%
d_grid ggplot(aes(x = mu, y = sigma, z = probability)) +
geom_contour() +
labs(x = expression(mu),
y = expression(sigma)) +
coord_cartesian(xlim = range(d_grid$mu),
ylim = range(d_grid$sigma)) +
theme(panel.grid = element_blank())
```

We’ll make our heat map with `geom_raster()`

.

```
%>%
d_grid ggplot(aes(x = mu, y = sigma, fill = probability)) +
geom_raster(interpolate = T) +
scale_fill_viridis_c(option = "B") +
labs(x = expression(mu),
y = expression(sigma)) +
theme(panel.grid = element_blank())
```

### 4.3.4 Sampling from the posterior.

We can use `dplyr::sample_n()`

to sample rows, with replacement, from `d_grid`

.

```
set.seed(4)
<-
d_grid_samples %>%
d_grid sample_n(size = 1e4, replace = T, weight = probability)
%>%
d_grid_samples ggplot(aes(x = mu, y = sigma)) +
geom_point(size = .9, alpha = 1/15) +
scale_fill_viridis_c() +
labs(x = expression(mu[samples]),
y = expression(sigma[samples])) +
theme(panel.grid = element_blank())
```

We can use `pivot_longer()`

and then `facet_wrap()`

to plot the densities for both `mu`

and `sigma`

at once.

```
%>%
d_grid_samples pivot_longer(mu:sigma) %>%
ggplot(aes(x = value)) +
geom_density(fill = "grey33") +
scale_y_continuous(NULL, breaks = NULL) +
xlab(NULL) +
theme(panel.grid = element_blank()) +
facet_wrap(~ name, scales = "free", labeller = label_parsed)
```

We’ll use the **tidybayes** package to compute their posterior modes and 95% HDIs.

```
library(tidybayes)
%>%
d_grid_samples pivot_longer(mu:sigma) %>%
group_by(name) %>%
mode_hdi(value)
```

```
## # A tibble: 2 x 7
## name value .lower .upper .width .point .interval
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 mu 155. 154. 155. 0.95 mode hdi
## 2 sigma 7.82 7.14 8.30 0.95 mode hdi
```

Let’s say you wanted their posterior medians and 50% quantile-based intervals, instead. Just switch out the last line for `median_qi(value, .width = .5)`

.

#### 4.3.4.1 Overthinking: Sample size and the normality of \(\sigma\)’s posterior.

Since we’ll be fitting models with **brms** almost exclusively from here on out, this section is largely mute. But we’ll do it anyway for the sake of practice. I’m going to break the steps up like before rather than compress the code together. Here’s `d3`

.

```
set.seed(4)
<- sample(d2$height, size = 20)) (d3
```

```
## [1] 147.3200 154.9400 168.9100 156.8450 165.7350 151.7650 165.7350 156.2100 144.7800 154.9400
## [11] 151.1300 147.9550 149.8600 162.5600 161.9250 164.4650 160.9852 151.7650 163.8300 149.8600
```

For our first step using `d3`

, we’ll redefine `d_grid`

.

```
<- 200
n
# note we've redefined the ranges of `mu` and `sigma`
<-
d_grid crossing(mu = seq(from = 150, to = 170, length.out = n),
sigma = seq(from = 4, to = 20, length.out = n))
```

Second, we’ll redefine our custom `grid_function()`

function to operate over the `height`

values of `d3`

.

```
<- function(mu, sigma) {
grid_function
dnorm(d3, mean = mu, sd = sigma, log = T) %>%
sum()
}
```

Now we’ll use the amended `grid_function()`

to make the posterior.

```
<-
d_grid %>%
d_grid mutate(log_likelihood = map2_dbl(mu, sigma, grid_function)) %>%
mutate(prior_mu = dnorm(mu, mean = 178, sd = 20, log = T),
prior_sigma = dunif(sigma, min = 0, max = 50, log = T)) %>%
mutate(product = log_likelihood + prior_mu + prior_sigma) %>%
mutate(probability = exp(product - max(product)))
```

Did you catch our use of `purrr::map2_dbl()`

, there, in place of `purrr::map2()`

? It turns out that `purrr::map()`

and `purrr::map2()`

always return a list (see here and here). But we can add the `_dbl`

suffix to those functions, which will instruct the **purrr** package to return a double vector (i.e., a common kind of numeric vector). The advantage of that approach is we no longer need to follow our `map()`

or `map2()`

lines with `unnest()`

. To learn more about the ins and outs of the `map()`

family, check out this section from *R4DS* or Jenny Bryan’s *purrr tutorial*.

Next we’ll `sample_n()`

and plot.

```
set.seed(4)
<-
d_grid_samples %>%
d_grid sample_n(size = 1e4, replace = T, weight = probability)
%>%
d_grid_samples ggplot(aes(x = mu, y = sigma)) +
geom_point(size = .9, alpha = 1/15) +
labs(x = expression(mu[samples]),
y = expression(sigma[samples])) +
theme(panel.grid = element_blank())
```

Behold the updated densities.

```
%>%
d_grid_samples pivot_longer(mu:sigma) %>%
ggplot(aes(x = value)) +
geom_density(fill = "grey33", size = 0) +
scale_y_continuous(NULL, breaks = NULL) +
xlab(NULL) +
theme(panel.grid = element_blank()) +
facet_wrap(~ name, scales = "free", labeller = label_parsed)
```

That `labeller = label_parsed`

bit in the `facet_wrap()`

function is what converted our subplot strip labels into Greek. You can learn more about `labeller`

here. Anyway, our posterior for \(\sigma\) isn’t so Gaussian with that small \(n\).

This is the point in the project where we hop off the grid-approximation train. On the one hand, I think this is a great idea. Most of y’all reading this will never use grid approximation in a real-world applied data analysis. On the other hand, there is some pedagogical utility in practicing with it. It can help you grasp what it is we’re dong when we apply Bayes’ theorem. If you’d like more practice, check out the first several chapters in John Kruschke’s (2015) textbook and the corresponding chapters in my (2020c) ebook translating it into **brms** and **tidyverse**.

### 4.3.5 Finding the posterior distribution with `quap`

`brm()`

.

`quap`

Here we rewrite the statistical model, this time using font color to help differentiate the likelihood from the prior(s).

\[\begin{align*} \color{red}{\text{heights}_i} & \color{red}\sim \color{red}{\operatorname{Normal}(\mu, \sigma)} && \color{red}{\text{likelihood}} \\ \color{blue}\mu & \color{blue}\sim \color{blue}{\operatorname{Normal}(178, 20)} && \color{blue}{\text{prior}} \\ \color{blue}\sigma & \color{blue}\sim \color{blue}{\operatorname{Uniform}(0, 50)} \end{align*}\]

We won’t actually use `rethinking::quap()`

. It’s time to jump straight to the primary **brms** modeling function, `brm()`

. In the text, McElreath indexed his models with names like `m4.1`

. I will largely follow that convention, but will replace the *m* with a *b* to stand for the **brms** package. Here’s how to fit the first model for this chapter.

```
.1 <-
b4brm(data = d2,
family = gaussian,
~ 1,
height prior = c(prior(normal(178, 20), class = Intercept),
prior(uniform(0, 50), class = sigma)),
iter = 31000, warmup = 30000, chains = 4, cores = 4,
seed = 4,
file = "fits/b04.01")
```

McElreath’s uniform prior for \(\sigma\) was rough on **brms**. It took an unusually-large number of warmup iterations before the chains sampled properly. As McElreath covered in Chapter 9, Hamiltonian Monte Carlo (HMC) tends to work better when you default to an exponential or half Cauchy for \(\sigma\). Here’s how to do so with the half Cauchy.^{1}

```
.1_hc <-
b4brm(data = d2,
family = gaussian,
~ 1,
height prior = c(prior(normal(178, 20), class = Intercept),
# the magic lives here
prior(cauchy(0, 1), class = sigma)),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 4,
file = "fits/b04.01_hc")
```

This leads to an important point. After running a model fit with HMC, it’s a good idea to inspect the chains. As we’ll see, McElreath covered visual chain diagnostics in Chapter 9. Here’s a typical way to do so with **brms**.

`plot(b4.1_hc)`

If you want detailed diagnostics for the HMC chains, call `launch_shinystan(b4.1)`

. That’ll keep you busy for a while. But anyway, the chains look good. We can reasonably trust the results. Here’s how to get the model summary of our `brm()`

object.

`print(b4.1_hc)`

```
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: height ~ 1
## Data: d2 (Number of observations: 352)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 154.61 0.41 153.81 155.42 1.00 3833 2967
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 7.74 0.29 7.20 8.35 1.00 3511 2853
##
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```

The `summary()`

function works in a similar way. You can also get a Stan-like summary [see the *RStan: the R interface to Stan* vignette; Stan Development Team (2020)] with a little indexing.

`.1_hc$fit b4`

```
## Inference for Stan model: dd78ed3cf429c248161e09fa2f6f34e3.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## b_Intercept 154.61 0.01 0.41 153.81 154.34 154.61 154.89 155.42 3796 1
## sigma 7.74 0.00 0.29 7.20 7.54 7.73 7.93 8.35 3553 1
## lp__ -1227.52 0.02 1.00 -1230.16 -1227.91 -1227.22 -1226.79 -1226.54 1984 1
##
## Samples were drawn using NUTS(diag_e) at Sun Mar 14 16:17:25 2021.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
```

Whereas **rethinking** defaults to 89% intervals, using `print()`

or `summary()`

with **brms** models defaults to 95% intervals. Unless otherwise specified, I will stick with 95% intervals throughout. However, if you really want those 89% intervals, an easy way is with the `prob`

argument within `brms::summary()`

or `brms::print()`

.

`summary(b4.1_hc, prob = .89)`

```
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: height ~ 1
## Data: d2 (Number of observations: 352)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-89% CI u-89% CI Rhat Bulk_ESS Tail_ESS
## Intercept 154.61 0.41 153.96 155.27 1.00 3833 2967
##
## Family Specific Parameters:
## Estimate Est.Error l-89% CI u-89% CI Rhat Bulk_ESS Tail_ESS
## sigma 7.74 0.29 7.29 8.23 1.00 3511 2853
##
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```

Anyways, here’s the `brms::brm()`

code for the model with the very-narrow-\(\mu\)-prior corresponding to the `rethinking::quap()`

code in McElreath’s **R** code 4.31.

```
.2 <-
b4brm(data = d2,
family = gaussian,
~ 1,
height prior = c(prior(normal(178, 0.1), class = Intercept),
prior(uniform(0, 50), class = sigma)),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 4,
file = "fits/b04.02")
```

`plot(b4.2)`

I had to increase the `warmup`

due to convergence issues. After doing so, everything looks to be on the up and up. The chains look great. Here’s the model `summary()`

.

`summary(b4.2)`

```
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: height ~ 1
## Data: d2 (Number of observations: 352)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 177.86 0.10 177.66 178.06 1.00 3467 2724
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 24.59 0.91 22.95 26.54 1.01 483 422
##
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```

Subsetting the `summary()`

output with `$fixed`

provides a convenient way to compare the `Intercept`

summaries between `b4.1_hc`

and `b4.2`

.

```
rbind(summary(b4.1_hc)$fixed,
summary(b4.2)$fixed)
```

```
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 154.6133 0.4100253 153.8146 155.4213 1.000225 3833 2967
## Intercept 177.8637 0.1014405 177.6609 178.0614 1.000384 3467 2724
```

### 4.3.6 Sampling from a `quap()`

`brm()`

fit.

`quap()`

**brms** doesn’t seem to have a convenience function that works the way `vcov()`

does for **rethinking**. For example:

`vcov(b4.1_hc)`

```
## Intercept
## Intercept 0.1681208
```

This only returns the first element in the matrix it did for **rethinking**. That is, it appears `brms::vcov()`

only returns the variance/covariance matrix for the single-level \(\beta\) parameters. However, if you really wanted this information, you could get it after putting the HMC chains in a data frame. We do that with the `posterior_samples()`

, which we’ll be using a lot of as we go along.

```
<- posterior_samples(b4.1_hc)
post
head(post)
```

```
## b_Intercept sigma lp__
## 1 154.2645 7.992651 -1227.255
## 2 154.6744 7.427472 -1227.075
## 3 154.7472 7.410110 -1227.193
## 4 154.7948 8.470115 -1229.452
## 5 154.2154 7.940177 -1227.219
## 6 154.7131 7.744806 -1226.553
```

Now `select()`

the columns containing the draws from the desired parameters and feed them into `cov()`

.

```
select(post, b_Intercept:sigma) %>%
cov()
```

```
## b_Intercept sigma
## b_Intercept 0.168120762 -0.005250413
## sigma -0.005250413 0.086979181
```

That was “(1) a vector of variances for the parameters and (2) a correlation matrix” for them (p. 90). Here are just the variances (i.e., the diagonal elements) and the correlation matrix.

```
# variances
select(post, b_Intercept:sigma) %>%
cov() %>%
diag()
```

```
## b_Intercept sigma
## 0.16812076 0.08697918
```

```
# correlation
%>%
post select(b_Intercept, sigma) %>%
cor()
```

```
## b_Intercept sigma
## b_Intercept 1.00000000 -0.04341854
## sigma -0.04341854 1.00000000
```

With our `post <- posterior_samples(b4.1_hc)`

code from a few lines above, we’ve already produced the **brms** version of what McElreath achieved with `extract.samples()`

on page 90. However, what happened under the hood was different. Whereas **rethinking** used the `mvnorm()`

function from the **MASS** package (Ripley, 2019; Venables & Ripley, 2002), with **brms** we just extracted the iterations of the HMC chains and put them in a data frame.

`str(post)`

```
## 'data.frame': 4000 obs. of 3 variables:
## $ b_Intercept: num 154 155 155 155 154 ...
## $ sigma : num 7.99 7.43 7.41 8.47 7.94 ...
## $ lp__ : num -1227 -1227 -1227 -1229 -1227 ...
```

Notice how our data frame, `post`

, includes a third vector named `lp__`

. That’s the log posterior. For details, see the **brms** reference manual (Bürkner, 2021i), the “The Log-Posterior (function and gradient)” section of the Stan Development Team’s (2020) vignette, *RStan: the R interface to Stan*, or Stephen Martin’s nice explanation of the log posterior on the Stan Forums. The log posterior will largely be outside of our focus in this ebook.

The `summary()`

function doesn’t work for **brms** posterior data frames quite the way `precis()`

does for posterior data frames from the **rethinking package**. Behold the results.

`summary(post[, 1:2])`

```
## b_Intercept sigma
## Min. :153.0 Min. :6.900
## 1st Qu.:154.3 1st Qu.:7.541
## Median :154.6 Median :7.733
## Mean :154.6 Mean :7.744
## 3rd Qu.:154.9 3rd Qu.:7.934
## Max. :156.2 Max. :8.950
```

Here’s one option using the transpose of a `quantile()`

call nested within `apply()`

, which is a very general function you can learn more about here or here.

`t(apply(post[, 1:2], 2, quantile, probs = c(.5, .025, .75)))`

```
## 50% 2.5% 75%
## b_Intercept 154.612809 153.814609 154.887202
## sigma 7.733021 7.199217 7.934425
```

The base **R** code is compact, but somewhat opaque. Here’s how to do something similar with more explicit **tidyverse** code.

```
%>%
post pivot_longer(-lp__) %>%
group_by(name) %>%
summarise(mean = mean(value),
sd = sd(value),
`2.5%` = quantile(value, probs = .025),
`97.5%` = quantile(value, probs = .975)) %>%
mutate_if(is.numeric, round, digits = 2)
```

```
## # A tibble: 2 x 5
## name mean sd `2.5%` `97.5%`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 b_Intercept 155. 0.41 154. 155.
## 2 sigma 7.74 0.290 7.2 8.35
```

You can always get pretty similar information by just putting the `brm()`

fit object into `posterior_summary()`

.

`posterior_summary(b4.1_hc)`

```
## Estimate Est.Error Q2.5 Q97.5
## b_Intercept 154.613285 0.4100253 153.814609 155.421274
## sigma 7.744364 0.2949223 7.199217 8.346968
## lp__ -1227.517317 0.9973747 -1230.155469 -1226.541248
```

And if you’re willing to drop the posterior \(SD\)s, you can use `tidybayes::mean_hdi()`

, too.

```
%>%
post pivot_longer(-lp__) %>%
group_by(name) %>%
mean_qi(value)
```

```
## # A tibble: 2 x 7
## name value .lower .upper .width .point .interval
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 b_Intercept 155. 154. 155. 0.95 mean qi
## 2 sigma 7.74 7.20 8.35 0.95 mean qi
```

Though none of these solutions get you those sweet little histograms, you can always make those for your HMC models by inserting the desired posterior draws into `histospark()`

.

```
rbind(histospark(post$b_Intercept),
histospark(post$sigma))
```

```
## [,1]
## [1,] "▁▁▅▇▂▁▁"
## [2,] "▁▁▂▅▇▇▃▁▁▁▁"
```

Hell, you can even tack those onto the output from our verbose **tidyverse** code from a few blocks up.

```
%>%
post pivot_longer(-lp__) %>%
group_by(name) %>%
summarise(mean = mean(value),
sd = sd(value),
`2.5%` = quantile(value, probs = .025),
`97.5%` = quantile(value, probs = .975)) %>%
mutate_if(is.numeric, round, digits = 2) %>%
mutate(histospark = c(histospark(post$b_Intercept), histospark(post$sigma)))
```

```
## # A tibble: 2 x 6
## name mean sd `2.5%` `97.5%` histospark
## <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 b_Intercept 155. 0.41 154. 155. ▁▁▅▇▂▁▁
## 2 sigma 7.74 0.290 7.2 8.35 ▁▁▂▅▇▇▃▁▁▁▁
```

#### 4.3.6.1 Overthinking: Start values for `quap()`

`brm()`

.

`quap()`

We won’t be emphasizing start values in this ebook. But, yes, you can set start values for the HMC chains from **brms**, too. Within the `brm()`

function, you do so with the `inits`

argument. From the `brm`

section within the **brms** reference manual, we read:

Either

`"random"`

or`"0"`

. If inits is`"random"`

(the default), Stan will randomly generate initial values for parameters. If it is`"0"`

, all parameters are initialized to zero. This option is sometimes useful for certain families, as it happens that default (`"random"`

) inits cause samples to be essentially constant. Generally, setting`inits = "0"`

is worth a try, if chains do not behave well. Alternatively,`inits`

can be a list of lists containing the initial values, or a function (or function name) generating initial values. The latter options are mainly implemented for internal testing but are available to users if necessary. If specifying initial values using a list or a function then currently the parameter names must correspond to the names used in the generated Stan code (not the names used in`R`

).

#### 4.3.6.2 Overthinking: Under the hood with multivariate sampling.

Again, `brms::posterior_samples()`

is not the same as `rethinking::extract.samples()`

. Rather than use the `MASS::mvnorm()`

, **brms** takes the draws from the HMC chains. McElreath covered all of this in Chapter 9 and we will too. You might also look at the **brms** reference manual or GitHub page for details. To get documentation in a hurry, you could also just execute `?posterior_samples`

.

## 4.4 Linear prediction

Here’s our scatter plot of `weight`

and `height`

.

```
ggplot(data = d2,
aes(x = weight, y = height)) +
geom_point(shape = 1, size = 2) +
theme_bw() +
theme(panel.grid = element_blank())
```

There’s obviously a relationship: Knowing a person’s weight helps you predict height.

To make this vague observation into a more precise quantitative model that relates values of

`weight`

to plausible values of`height`

, we need some more technology. How do we take our Gaussian model from the previous section and incorporate predictor variables? (p. 92)

### 4.4.1 The linear model strategy.

The strategy is to make the parameter for the mean of a Gaussian distribution, \(\mu\), into a linear function of the predictor variable and other, new parameters that we invent. This strategy is often simply called the

linear model. The linear model strategy instructs the golem to assume that the predictor variable has a constant and additive relationship to the mean of the outcome. The golem then computes the posterior distribution of this constant relationship. (p. 92,emphasisin the original)

Like we did for our first model without a predictor, we’ll use font color to help differentiate between the likelihood and prior(s) of our new univariable model,

\[\begin{align*} \color{red}{\text{height}_i} & \color{red}\sim \color{red}{\operatorname{Normal}(\mu_i, \sigma)} && \color{red}{\text{likelihood}} \\ \color{red}{\mu_i} & \color{red}= \color{red}{\alpha + \beta (\text{weight}_i - \overline{\text{weight}})} && \color{red}{\text{\{the linear model is just a special part of the likelihood\}} } \\ \color{blue}\alpha & \color{blue}\sim \color{blue}{\operatorname{Normal}(178, 20)} && \color{blue}{\text{prior(s)}} \\ \color{blue}\beta & \color{blue}\sim \color{blue}{\operatorname{Normal}(0, 10)} \\ \color{blue}\sigma & \color{blue}\sim \color{blue}{\operatorname{Uniform}(0, 50)}. \end{align*}\]

Do note that \((\text{weight}_i - \overline{\text{weight}})\) part. As we’ll see, it’s often advantageous to mean center our predictors.

#### 4.4.1.1 Probability of the data.

Let’s begin with just the probability of the observed height, the first line of the model. This is nearly identical to before, except now there is a little index \(i\) on the \(\mu\) as well as the [\(\text{height}\)]. You can read [\(\text{height}_i\)] as “each [\(\text{height}\)]” and \(\mu_i\) as “each \(\mu\).” The mean \(\mu\) now depends upon unique values on each row \(i\). So the little \(i\) on \(\mu_i\) indicates that

the mean depends upon the row. (p. 93,emphasisin the original)

#### 4.4.1.2 Linear model.

The mean \(\mu\) is no longer a parameter to be estimated. Rather, as seen in the second line of the model, \(\mu_i\) is constructed from other parameters, \(\alpha\) and \(\beta\), and the observed variable [\(\text{weight}\)]. This line is not a stochastic relationship–there is no \(\sim\) in it, but rather an \(=\) in it–because the definition of \(\mu_i\) is deterministic. That is to say that, once we know \(\alpha\) and \(\beta\) and [\(\text{weight}_i\)], we know \(\mu_i\) with certainty.

The value [\(\text{weight}_i\)] is just the weight value on row \(i\). It refers to the same individual as the height value, [\(\text{height}_i\)], on the same row. The parameters \(\alpha\) and \(\beta\) are more mysterious. Where did they come from? We made them up….

You’ll be making up all manner of parameters as your skills improve. (p. 93)

##### 4.4.1.2.1 Rethinking: Nothing special or natural about linear models.

Note that there’s nothing special about the linear model, really. You can choose a different relationship between \(\alpha\) and \(\beta\) and \(\mu\). For example, the following is a perfectly legitimate definition for \(\mu_i\):

\[\mu_i = \alpha \exp(- \beta x_i)\]

This does not define a linear regression, but it does define a regression model. The linear relationship we are using instead is conventional, but nothing requires that you use it. (p. 94)

#### 4.4.1.3 Priors.

The remaining lines in the model define distributions for the unobserved variables. These variables are commonly known as parameters, and their distributions as priors. There are three parameters: \(\alpha\), \(\beta\), and \(\sigma\). You’ve seen priors for \(\alpha\) and \(\sigma\) before, although \(\alpha\) was called \(\mu\) back then.

The prior for \(\beta\) deserves explanation. Why have a Gaussian prior with mean zero? (p. 94)

We’ll simulate to find out. Instead of using a loop to make our data for Figure 4.5, we’ll stay within the **tidyverse**.

```
set.seed(2971)
# how many lines would you like?
<- 100
n_lines
<-
lines tibble(n = 1:n_lines,
a = rnorm(n_lines, mean = 178, sd = 20),
b = rnorm(n_lines, mean = 0, sd = 10)) %>%
expand(nesting(n, a, b), weight = range(d2$weight)) %>%
mutate(height = a + b * (weight - mean(d2$weight)))
head(lines)
```

```
## # A tibble: 6 x 5
## n a b weight height
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 191. -7.06 31.1 289.
## 2 1 191. -7.06 63.0 63.5
## 3 2 199. 0.839 31.1 187.
## 4 2 199. 0.839 63.0 214.
## 5 3 202. 3.93 31.1 147.
## 6 3 202. 3.93 63.0 272.
```

Now we’ll plot the left panel from Figure 4.5.

```
%>%
lines ggplot(aes(x = weight, y = height, group = n)) +
geom_hline(yintercept = c(0, 272), linetype = 2:1, size = 1/3) +
geom_line(alpha = 1/10) +
coord_cartesian(ylim = c(-100, 400)) +
ggtitle("b ~ dnorm(0, 10)") +
theme_classic()
```

The pattern doesn’t look like any human population at all. It essentially says that the relationship between weight and height could be absurdly positive or negative. Before we’ve even seen the data, this is a bad model. Can we do better?

We can do better immediately. (pp. 95–96)

One thing we know from the outset is that the correlation between human height and weight is positive. We might not be sure of the magnitude, but it’s definitely the case that, on average, taller people are heavier people. Within the univariable regression context, this implies that the regression coefficient for `weight`

predicting `height`

will be positive. It might be unclear how large that coefficient will be, but it will certainly be above zero. One way we might encode this information in our data is by using the log-normal distribution for our \(\beta\) prior. Here’s what \(\operatorname{Log-Normal}(0, 1)\) looks like.

```
set.seed(4)
tibble(b = rlnorm(1e4, mean = 0, sd = 1)) %>%
ggplot(aes(x = b)) +
geom_density(fill = "grey92") +
coord_cartesian(xlim = c(0, 5)) +
theme_classic()
```

If you’re unfamiliar with the log-normal distribution, it is the distribution whose logarithm is normally distributed. For example, here’s what happens when we compare \(\operatorname{Normal}(0, 1)\) with \(\log \big ( \operatorname{Log-Normal}(0, 1) \big)\).

```
set.seed(4)
tibble(rnorm = rnorm(1e5, mean = 0, sd = 1),
`log(rlognorm)` = log(rlnorm(1e5, mean = 0, sd = 1))) %>%
pivot_longer(everything()) %>%
ggplot(aes(x = value)) +
geom_density(fill = "grey92") +
coord_cartesian(xlim = c(-3, 3)) +
theme_classic() +
facet_wrap(~ name, nrow = 2)
```

They are the same within simulation variance. Also, did you notice how we simulated those log-normal data with `mean = 0, sd = 1`

? Those values are what the mean and standard deviation of the output from the `rlnorm()`

function **after** they are log transformed. The formulas for the actual mean and standard deviation for the log-normal distribution itself are complicated (see here). They are

\[\begin{align*} \text{mean} & = \exp \left (\mu + \frac{\sigma^2}{2} \right) & \text{and} \\ \text{standard deviation} & = \sqrt{[\exp(\sigma ^{2})-1] \; \exp(2\mu +\sigma ^{2})}. \end{align*}\]

If we follow those formulas through, there are the mean and standard deviation for \(\operatorname{Log-Normal}(0, 1)\).

```
<- 0
mu <- 1
sigma
# mean
exp(mu + (sigma^2) / 2)
```

`## [1] 1.648721`

```
# sd
sqrt((exp(sigma^2) - 1) * exp(2 * mu + sigma^2))
```

`## [1] 2.161197`

Let’s confirm with simulated draws from `rlnorm()`

.

```
set.seed(4)
tibble(x = rlnorm(1e7, mean = 0, sd = 1)) %>%
summarise(mean = mean(x),
sd = sd(x))
```

```
## # A tibble: 1 x 2
## mean sd
## <dbl> <dbl>
## 1 1.65 2.17
```

But okay, “so what [do all these complications] earn us? Do the prior predictive simulation again, now with the Log-Normal prior:” (p. 96).

```
# make a tibble to annotate the plot
<-
text tibble(weight = c(34, 43),
height = c(0 - 25, 272 + 25),
label = c("Embryo", "World's tallest person (272 cm)"))
# simulate
set.seed(2971)
tibble(n = 1:n_lines,
a = rnorm(n_lines, mean = 178, sd = 20),
b = rlnorm(n_lines, mean = 0, sd = 1)) %>%
expand(nesting(n, a, b), weight = range(d2$weight)) %>%
mutate(height = a + b * (weight - mean(d2$weight))) %>%
# plot
ggplot(aes(x = weight, y = height, group = n)) +
geom_hline(yintercept = c(0, 272), linetype = 2:1, size = 1/3) +
geom_line(alpha = 1/10) +
geom_text(data = text,
aes(label = label),
size = 3) +
coord_cartesian(ylim = c(-100, 400)) +
ggtitle("log(b) ~ dnorm(0, 1)") +
theme_classic()
```

“This is much more sensible. There is still a rare impossible relationship. But nearly all lines in the joint prior for \(\alpha\) and \(\beta\) are now within human reason” (p. 96)

##### 4.4.1.3.1 Rethinking: What’s the correct prior?

Good luck with that question. Hang around on academic Twitter long enough and you’ll see folks debating this.

This is a mistake. There is no more a uniquely correct prior than there is a uniquely correct likelihood. Statistical models are machines for inference. Many machines will work, but some work better than others. Priors can be wrong, but only in the same sense that a kind of hammer can be wrong for building a table. (p. 96)

##### 4.4.1.3.2 Rethinking: Prior predictive simulation and \(p\)-hacking.

“We don’t pay any attention to \(p\)-values in this book” (p. 97). Off hand, I’m not sure of the exact origin of the term \(p\)-hacking. But the paper by Simmons, Nelson and Simonsohn (2011), *False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant*, is often cited as an introduction to the problem.

### 4.4.2 Finding the posterior distribution.

Unlike with McElreath’s `quap()`

formula syntax, I’m not aware that you can just specify something like `weight – xbar`

in the `formula`

argument in `brm()`

. However, the alternative is easy: Just make a new variable in the data that is equivalent to `weight – mean(weight)`

. We’ll call it `weight_c`

.

```
<-
d2 %>%
d2 mutate(weight_c = weight - mean(weight))
```

Unlike with McElreath’s **rethinking** package, the `brms::brm()`

syntax doesn’t mirror the statistical notation. But here are the analogues to the exposition at the bottom of page 97:

- \(\text{height}_i \sim \operatorname{Normal}(\mu_i, \sigma)\):
`family = gaussian`

, - \(\mu_i = \alpha + \beta \text{weight}_i\):
`height ~ 1 + weight_c`

, - \(\alpha \sim \operatorname{Normal}(178, 20)\):
`prior(normal(178, 20), class = Intercept`

, - \(\beta \sim \operatorname{Log-Normal}(0, 1)\):
`prior(lognormal(0, 1), class = b)`

, and - \(\sigma \sim \operatorname{Uniform}(0, 50)\):
`prior(uniform(0, 50), class = sigma)`

.

Thus, to add a predictor you just the `+`

operator in the model `formula`

.

```
.3 <-
b4brm(data = d2,
family = gaussian,
~ 1 + weight_c,
height prior = c(prior(normal(178, 20), class = Intercept),
prior(lognormal(0, 1), class = b),
prior(uniform(0, 50), class = sigma)),
iter = 28000, warmup = 27000, chains = 4, cores = 4,
seed = 4,
file = "fits/b04.03")
```

This was another example of how using a uniform prior for \(\sigma\) was a major pain. As you’ll see in later chapters, you usually only need to set `iter = 2000, warmup = 1000`

when using even moderately better priors. But for now, we’ll continue to find ways to stick close to the text. Here are the trace plots.

`plot(b4.3)`

#### 4.4.2.1 Overthinking: Logs and exps, oh my.

**brms** does not allow users to insert coefficients into functions like \(\exp()\) within the conventional `formula`

syntax. We can fit a **brms** model like McElreath’s `m4.3b`

if we adopt what’s called the non-linear syntax (Bürkner, 2021e). The non-linear syntax is a lot like the syntax McElreath uses in **rethinking** in that it typically includes both predictor and variable names in the `formula`

. Since this is so early in the book and we’re just working through a problem in an **Overthinking** tangent, I won’t go into a full-blow explanation, here. There will be many more opportunities to practice with the non-linear syntax in the chapters to come (e.g., Section 5.3.2, Section 6.2.1). For now, here’s how we might fit the model.

```
.3b <-
b4brm(data = d2,
family = gaussian,
bf(height ~ a + exp(lb) * weight_c,
~ 1,
a ~ 1,
lb nl = TRUE),
prior = c(prior(normal(178, 20), class = b, nlpar = a),
prior(normal(0, 1), class = b, nlpar = lb),
prior(uniform(0, 50), class = sigma)),
iter = 31000, warmup = 30000, chains = 4, cores = 4,
seed = 4,
file = "fits/b04.03b")
```

We continue to need unusually large numbers of warmup iterations due to that pesky \(\operatorname{Uniform}(0, 50)\) prior on \(\sigma\). If you execute `summary(b4.3b)`

, you’ll see the intercept and \(\sigma\) summaries for this model are about the same as those for `b4.3`

, above. The difference is for the \(\beta\) parameter, which we called `lb`

in the `b4.3b`

model. If we term that parameter from `b4.3`

as \(\beta^\text{b4.3}\) and the one from our new model \(\beta^\text{b4.3b}\), it turns out that \(\beta^\text{b4.3} = \exp \left (\beta^\text{b4.3b} \right )\).

`fixef(b4.3)["weight_c", "Estimate"]`

`## [1] 0.9040527`

`fixef(b4.3b)["lb_Intercept", "Estimate"] %>% exp()`

`## [1] 0.9036588`

They’re the same within simulation variance.

### 4.4.3 Interpreting the posterior distribution.

“One trouble with statistical models is that they are hard to understand” (p. 99). Welcome to the world of applied statistics, friends. 😅

##### 4.4.3.0.1 Rethinking: What do parameters mean?

A basic issue with interpreting model-based estimates is in knowing the meaning of parameters. There is no consensus about what a parameter means, however, because different people take different philosophical stances towards models, probability, and prediction. The perspective in this book is a common Bayesian perspective:

Posterior probabilities of parameter values describe the relative compatibility of different states of the world with the data, according to the model. (p. 99,emphasisin the original)

#### 4.4.3.1 Tables of marginal distributions.

With a little `[]`

subsetting we can exclude the log posterior from our summary for `b4.3`

.

```
posterior_summary(b4.3)[1:3, ] %>%
round(digits = 2)
```

```
## Estimate Est.Error Q2.5 Q97.5
## b_Intercept 154.6 0.27 154.08 155.14
## b_weight_c 0.9 0.04 0.83 0.99
## sigma 5.1 0.19 4.75 5.50
```

If we put our **brms** fit into the `vcov()`

function, we’ll get the variance/covariance matrix of the intercept and `weight_c`

coefficient.

```
vcov(b4.3) %>%
round(3)
```

```
## Intercept weight_c
## Intercept 0.076 0.000
## weight_c 0.000 0.002
```

No \(\sigma\), however. To get that, we’ll have to extract the posterior draws and use the `cov()`

function, instead.

```
posterior_samples(b4.3) %>%
select(-lp__) %>%
cov() %>%
round(digits = 3)
```

```
## b_Intercept b_weight_c sigma
## b_Intercept 0.076 0.000 0.000
## b_weight_c 0.000 0.002 0.000
## sigma 0.000 0.000 0.036
```

The `pairs()`

function will work for a **brms** fit much like it would one from **rethinking**. It will show “both the marginal posteriors and the covariance” (p. 100).

`pairs(b4.3)`

#### 4.4.3.2 Plotting posterior inference against the data.

“It’s almost always much more useful to plot the posterior inference against the data. Not only does plotting help in interpreting the posterior, but it also provides an informal check on model assumptions” (p. 100).

Here is the code for Figure 4.6. Note our use of the `fixef()`

function.

```
%>%
d2 ggplot(aes(x = weight_c, y = height)) +
geom_abline(intercept = fixef(b4.3)[1],
slope = fixef(b4.3)[2]) +
geom_point(shape = 1, size = 2, color = "royalblue") +
theme_classic()
```

Note how the breaks on our \(x\)-axis look off. That’s because we fit the model with `weight_c`

and we plotted the points in that metric, too. Since we computed `weight_c`

by subtracting the mean of `weight`

from the data, we can adjust the \(x\)-axis break point labels by simply adding that value back.

```
<-
labels c(-10, 0, 10) + mean(d2$weight) %>%
round(digits = 0)
%>%
d2 ggplot(aes(x = weight_c, y = height)) +
geom_abline(intercept = fixef(b4.3)[1],
slope = fixef(b4.3)[2]) +
geom_point(shape = 1, size = 2, color = "royalblue") +
scale_x_continuous("weight",
breaks = c(-10, 0, 10),
labels = labels) +
theme_bw() +
theme(panel.grid = element_blank())
```

#### 4.4.3.3 Adding uncertainty around the mean.

Be default, we extract all the posterior iterations with `posterior_samples()`

.

```
<- posterior_samples(b4.3)
post
%>%
post slice(1:5) # this serves a similar function as `head()`
```

```
## b_Intercept b_weight_c sigma lp__
## 1 154.0395 0.8883175 5.019815 -1081.040
## 2 155.0064 0.8813894 5.149540 -1080.008
## 3 154.2648 0.8641695 5.200915 -1080.066
## 4 154.4480 0.9149853 4.956529 -1079.143
## 5 154.7072 0.9176580 5.190356 -1079.014
```

Here are the four models leading up to McElreath’s Figure 4.7.

```
<- 10
N
.3_010 <-
b4brm(data = d2 %>%
slice(1:N), # note our tricky use of `N` and `slice()`
family = gaussian,
~ 1 + weight_c,
height prior = c(prior(normal(178, 20), class = Intercept),
prior(lognormal(0, 1), class = b),
prior(uniform(0, 50), class = sigma)),
iter = 11000, warmup = 10000, chains = 4, cores = 4,
seed = 4,
file = "fits/b04.03_010")
<- 50
N
.3_050 <-
b4brm(data = d2 %>%
slice(1:N),
family = gaussian,
~ 1 + weight_c,
height prior = c(prior(normal(178, 20), class = Intercept),
prior(lognormal(0, 1), class = b),
prior(uniform(0, 50), class = sigma)),
iter = 11000, warmup = 10000, chains = 4, cores = 4,
seed = 4,
file = "fits/b04.03_050")
<- 150
N
.3_150 <-
b4brm(data = d2 %>%
slice(1:N),
family = gaussian,
~ 1 + weight_c,
height prior = c(prior(normal(178, 20), class = Intercept),
prior(lognormal(0, 1), class = b),
prior(uniform(0, 50), class = sigma)),
iter = 21000, warmup = 20000, chains = 4, cores = 4,
seed = 4,
file = "fits/b04.03_150")
<- 352
N
.3_352 <-
b4brm(data = d2 %>%
slice(1:N),
family = gaussian,
~ 1 + weight_c,
height prior = c(prior(normal(178, 20), class = Intercept),
prior(lognormal(0, 1), class = b),
prior(uniform(0, 50), class = sigma)),
iter = 28000, warmup = 27000, chains = 4, cores = 4,
seed = 4,
file = "fits/b04.03_352")
```

I’m not going to clutter up the document with all the trace plots and coefficient summaries from these four models. But here’s how to get that information.

```
plot(b4.3_010)
print(b4.3_010)
plot(b4.3_050)
print(b4.3_050)
plot(b4.3_150)
print(b4.3_150)
plot(b4.3_352)
print(b4.3_352)
```

We’ll need to put the chains of each model into data frames.

```
<- posterior_samples(b4.3_010)
post010 <- posterior_samples(b4.3_050)
post050 <- posterior_samples(b4.3_150)
post150 <- posterior_samples(b4.3_352) post352
```

Here is the code for the four individual plots.

```
<-
p1 ggplot(data = d2[1:10, ],
aes(x = weight_c, y = height)) +
geom_abline(intercept = post010[1:20, 1],
slope = post010[1:20, 2],
size = 1/3, alpha = .3) +
geom_point(shape = 1, size = 2, color = "royalblue") +
coord_cartesian(xlim = range(d2$weight_c),
ylim = range(d2$height)) +
labs(subtitle = "N = 10")
<-
p2 ggplot(data = d2[1:50, ],
aes(x = weight_c, y = height)) +
geom_abline(intercept = post050[1:20, 1],
slope = post050[1:20, 2],
size = 1/3, alpha = .3) +
geom_point(shape = 1, size = 2, color = "royalblue") +
coord_cartesian(xlim = range(d2$weight_c),
ylim = range(d2$height)) +
labs(subtitle = "N = 50")
<-
p3 ggplot(data = d2[1:150, ],
aes(x = weight_c, y = height)) +
geom_abline(intercept = post150[1:20, 1],
slope = post150[1:20, 2],
size = 1/3, alpha = .3) +
geom_point(shape = 1, size = 2, color = "royalblue") +
coord_cartesian(xlim = range(d2$weight_c),
ylim = range(d2$height)) +
labs(subtitle = "N = 150")
<-
p4 ggplot(data = d2[1:352, ],
aes(x = weight_c, y = height)) +
geom_abline(intercept = post352[1:20, 1],
slope = post352[1:20, 2],
size = 1/3, alpha = .3) +
geom_point(shape = 1, size = 2, color = "royalblue") +
coord_cartesian(xlim = range(d2$weight_c),
ylim = range(d2$height)) +
labs(subtitle = "N = 352")
```

Note how we used the good old bracket syntax (e.g., `d2[1:10 , ]`

) to index rows from our `d2`

data. With **tidyverse**-style syntax, we could have done `slice(d2, 1:10)`

or `d2 %>% slice(1:10)`

instead.

Now we can combine the ggplots with **patchwork** syntax to make the full version of Figure 4.7.

```
+ p2 + p3 + p4) &
(p1 scale_x_continuous("weight",
breaks = c(-10, 0, 10),
labels = labels) &
theme_classic()
```

“Notice that the cloud of regression lines grows more compact as the sample size increases. This is a result of the model growing more confident about the location of the mean” (p. 102).

#### 4.4.3.4 Plotting regression intervals and contours.

Since we used `weight_c`

to fit our model, we might first want to understand what exactly the mean value is for `weight`

.

`mean(d2$weight)`

`## [1] 44.99049`

Just a hair under 45. If we’re interested in \(\mu\) at `weight`

= 50, that implies we’re also interested in \(\mu\) at `weight_c`

+ 5.01.

```
<-
mu_at_50 %>%
post transmute(mu_at_50 = b_Intercept + b_weight_c + 5.01)
head(mu_at_50)
```

```
## mu_at_50
## 1 159.9379
## 2 160.8978
## 3 160.1390
## 4 160.3730
## 5 160.6349
## 6 160.3570
```

And here is a version McElreath’s Figure 4.8 density plot.

```
%>%
mu_at_50 ggplot(aes(x = mu_at_50)) +
geom_density(size = 0, fill = "royalblue") +
scale_y_continuous(NULL, breaks = NULL) +
xlab(expression(mu["height | weight = 50"])) +
theme_classic()
```

We’ll use `mean_hdi()`

to get both 89% and 95% HPDIs along with the mean.

`mean_hdi(mu_at_50[, 1], .width = c(.89, .95))`

```
## y ymin ymax .width .point .interval
## 1 160.5127 160.0827 160.9675 0.89 mean hdi
## 2 160.5127 159.9828 161.0691 0.95 mean hdi
```

If you wanted to express those sweet 95% HPDIs on your density plot, you might use `tidybayes::stat_halfeye()`

. Since `stat_halfeye()`

also returns a point estimate, we’ll throw in the mode.

```
%>%
mu_at_50 ggplot(aes(x = mu_at_50, y = 0)) +
stat_halfeye(point_interval = mode_hdi, .width = .95,
fill = "royalblue") +
scale_y_continuous(NULL, breaks = NULL) +
xlab(expression(mu["height | weight = 50"])) +
theme_classic()
```

With **brms**, you would use `fitted()`

to do what McElreath accomplished with `link()`

.

```
<- fitted(b4.3, summary = F)
mu
str(mu)
```

`## num [1:4000, 1:352] 157 158 157 157 157 ...`

When you specify `summary = F`

, `fitted()`

returns a matrix of values with as many rows as there were post-warmup iterations across your HMC chains and as many columns as there were cases in your analysis. Because we had 4,000 post-warmup iterations and \(n = 352\), `fitted()`

returned a matrix of 4,000 rows and 352 vectors. If you omitted the `summary = F`

argument, the default is `TRUE`

and `fitted()`

will return summary information instead.

Much like **rethinking**’s `link()`

, `brms::fitted()`

can accommodate custom predictor values with its `newdata`

argument.

```
<-
weight_seq tibble(weight = 25:70) %>%
mutate(weight_c = weight - mean(d2$weight))
<-
mu fitted(b4.3,
summary = F,
newdata = weight_seq) %>%
data.frame() %>%
# here we name the columns after the `weight` values from which they were computed
set_names(25:70) %>%
mutate(iter = 1:n())
```

Anticipating **ggplot2**, we went ahead and converted the output to a data frame. But we might do a little more data processing with the aid of `tidyr::pivot_longer()`

, which will convert the data from the wide format to the long format. If you are new to the distinction between wide and long data, you can learn more from the *Pivot data from wide to long* vignette from the **tidyverse** team (2020); Simon Ejdemyr’s blog post, *Wide & long data*; or Karen Grace-Martin’s blog post, *The wide and long data format for repeated measures data*.

```
<-
mu %>%
mu pivot_longer(-iter,
names_to = "weight",
values_to = "height") %>%
# we might reformat `weight` to numerals
mutate(weight = as.numeric(weight))
head(mu)
```

```
## # A tibble: 6 x 3
## iter weight height
## <int> <dbl> <dbl>
## 1 1 25 136.
## 2 1 26 137.
## 3 1 27 138.
## 4 1 28 139.
## 5 1 29 140.
## 6 1 30 141.
```

Now our data processing is done, here we reproduce McElreath’s Figure 4.9.a.

```
%>%
d2 ggplot(aes(x = weight, y = height)) +
geom_point(data = mu %>% filter(iter < 101),
color = "navyblue", alpha = .05) +
coord_cartesian(xlim = c(30, 65)) +
theme(panel.grid = element_blank())
```

With `fitted()`

, it’s quite easy to plot a regression line and its intervals. Just omit the `summary = T`

argument.

```
<-
mu_summary fitted(b4.3,
newdata = weight_seq) %>%
data.frame() %>%
bind_cols(weight_seq)
head(mu_summary)
```

```
## Estimate Est.Error Q2.5 Q97.5 weight weight_c
## 1 136.5262 0.8833715 134.7405 138.1822 25 -19.99049
## 2 137.4303 0.8437461 135.7240 139.0135 26 -18.99049
## 3 138.3343 0.8043407 136.7118 139.8447 27 -17.99049
## 4 139.2384 0.7651894 137.7003 140.6785 28 -16.99049
## 5 140.1424 0.7263331 138.6846 141.5160 29 -15.99049
## 6 141.0465 0.6878220 139.6690 142.3502 30 -14.99049
```

Here it is, our analogue to Figure 4.9.b.

```
%>%
d2 ggplot(aes(x = weight, y = height)) +
geom_smooth(data = mu_summary,
aes(y = Estimate, ymin = Q2.5, ymax = Q97.5),
stat = "identity",
fill = "grey70", color = "black", alpha = 1, size = 1/2) +
geom_point(color = "navyblue", shape = 1, size = 1.5, alpha = 2/3) +
coord_cartesian(xlim = range(d2$weight)) +
theme(text = element_text(family = "Times"),
panel.grid = element_blank())
```

if you wanted to use intervals other than the default 95% ones, you’d enter a `probs`

argument like this: `fitted(b4.3, newdata = weight.seq, probs = c(.25, .75))`

. The resulting third and fourth vectors from the `fitted()`

object would be named `Q25`

and `Q75`

instead of the default `Q2.5`

and `Q97.5`

. The `Q`

prefix stands for quantile.

##### 4.4.3.4.1 Rethinking: Overconfident intervals.

The compatibility interval for the regression line in Figure 4.9 clings tightly to the MAP line. Thus there is very little uncertainty about the average height as a function of average weight. But you have to keep in mind that these inferences are always conditional on the model. Even a very bad model can have very tight compatibility intervals. It may help if you think of the regression line in Figure 4.9 as saying:

Conditional on the assumption that height and weight are related by a straight line, then this is the most plausible line, and these are its plausible bounds. (p. 107,emphasisin the original)

##### 4.4.3.4.2 Overthinking: How ~~link~~ `fitted()`

works.

Similar to `rethinking::link()`

, `brms::fitted()`

uses the formula from your model to compute the model expectations for a given set of predictor values. I use it a lot in this project. If you follow along, you’ll get a good handle on it. But to dive deeper, you can go here for the documentation. Though we won’t be using it in this project, **brms** users might want to know that `fitted()`

is also an alias for the `posterior_epred()`

function, about which you might learn more here. Users can always learn more about them and other functions in the **brms** reference manual.

#### 4.4.3.5 Prediction intervals.

Even though our statistical model (omitting priors for the sake of simplicity) is

\[\text{height}_i \sim \operatorname{Normal}(\mu_i = \alpha + \beta x_, \sigma),\]

we’ve only been plotting the \(\mu\) part. In order to bring in the variability expressed by \(\sigma\), we’ll have to switch to the `predict()`

function. Much as `brms::fitted()`

was our analogue to `rethinking::link()`

, `brms::predict()`

is our analogue to `rethinking::sim()`

.

We can reuse our `weight_seq`

data from before. But in case you forgot, here’s that code again.

```
<-
weight_seq tibble(weight = 25:70) %>%
mutate(weight_c = weight - mean(d2$weight))
```

The `predict()`

code looks a lot like what we used for `fitted()`

.

```
<-
pred_height predict(b4.3,
newdata = weight_seq) %>%
data.frame() %>%
bind_cols(weight_seq)
%>%
pred_height slice(1:6)
```

```
## Estimate Est.Error Q2.5 Q97.5 weight weight_c
## 1 136.5890 5.204946 126.6381 146.9790 25 -19.99049
## 2 137.2489 5.166086 126.9521 147.3029 26 -18.99049
## 3 138.3838 5.246919 128.1976 148.8397 27 -17.99049
## 4 139.2259 5.143560 129.2588 149.1741 28 -16.99049
## 5 139.9700 5.170863 129.7009 149.9936 29 -15.99049
## 6 141.0105 5.101771 130.6494 151.2377 30 -14.99049
```

This time the summary information in our data frame is for, as McElreath put it, “simulated heights, not distributions of plausible average height, \(\mu\)” (p. 108). Another way of saying that is that these simulations are the joint consequence of both \(\mu\) and \(\sigma\), unlike the results of `fitted()`

, which only reflect \(\mu\). Figure 4.10 shows how you might visualize them.

```
%>%
d2 ggplot(aes(x = weight)) +
geom_ribbon(data = pred_height,
aes(ymin = Q2.5, ymax = Q97.5),
fill = "grey83") +
geom_smooth(data = mu_summary,
aes(y = Estimate, ymin = Q2.5, ymax = Q97.5),
stat = "identity",
fill = "grey70", color = "black", alpha = 1, size = 1/2) +
geom_point(aes(y = height),
color = "navyblue", shape = 1, size = 1.5, alpha = 2/3) +
coord_cartesian(xlim = range(d2$weight),
ylim = range(d2$height)) +
theme(text = element_text(family = "Times"),
panel.grid = element_blank())
```

Notice that the outline for the wide shaded interval is a little rough. This is the simulation variance in the tails of the sampled Gaussian values. If it really bothers you, increase the number of samples you take from the posterior distribution. (p. 109)

With our **brms** model fitting approach, that would mean we’d have to refit `b4.3`

after specifying a larger number of post-warmup iterations with alterations to the `iter`

and `warmup`

parameters.

##### 4.4.3.5.1 Overthinking: Rolling your own ~~sim~~ `predict()`

.

Here we follow McElreath’s example and do our model-based predictions by hand. Instead of relying on base **R** `apply()`

and `sapply()`

, here the main action is in `expand()`

, the second `mutate()`

line and the `group_by()`

+ `summarise()`

combination.

```
# `predict()` by hand
%>%
post expand(nesting(b_Intercept, b_weight_c, sigma),
weight = 25:70) %>%
mutate(weight_c = weight - mean(d2$weight)) %>%
mutate(sim_height = rnorm(n(),
mean = b_Intercept + b_weight_c * weight_c,
sd = sigma)) %>%
group_by(weight) %>%
summarise(mean = mean(sim_height),
ll = quantile(sim_height, prob = .025),
ul = quantile(sim_height, prob = .975)) %>%
# plot
ggplot(aes(x = weight)) +
geom_smooth(aes(y = mean, ymin = ll, ymax = ul),
stat = "identity",
fill = "grey83", color = "black", alpha = 1, size = 1/2) +
geom_point(data = d2,
aes(y = height),
color = "navyblue", shape = 1, size = 1.5, alpha = 2/3) +
coord_cartesian(xlim = range(d2$weight),
ylim = range(d2$height)) +
theme(text = element_text(family = "Times"),
panel.grid = element_blank())
```

We specifically left out the `fitted()`

intervals to make it more apparent what we were simulating. You might also note that we could have easily replaced that three-line `summarise()`

code with a single line of `tidybayes::mean_qi(sim_height)`

, or whatever combination of central tendency and interval type you wanted (e.g., `mode_hdi(sim_height, .width = .89)`

).

## 4.5 Curves from lines

“The models so far all assume that a straight line describes the relationship. But there’s nothing special about straight lines, aside from their simplicity.” (p. 110).

### 4.5.1 Polynomial regression.

Remember `d`

?

```
%>%
d glimpse()
```

```
## Rows: 544
## Columns: 4
## $ height <dbl> 151.7650, 139.7000, 136.5250, 156.8450, 145.4150, 163.8300, 149.2250, 168.9100, 147…
## $ weight <dbl> 47.82561, 36.48581, 31.86484, 53.04191, 41.27687, 62.99259, 38.24348, 55.47997, 34.…
## $ age <dbl> 63.0, 63.0, 65.0, 41.0, 51.0, 35.0, 32.0, 27.0, 19.0, 54.0, 47.0, 66.0, 73.0, 20.0,…
## $ male <int> 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0,…
```

McElreath suggested we plot `height`

against `weight`

using the full sample.

```
%>%
d ggplot(aes(x = weight, y = height)) +
geom_point(color = "navyblue", shape = 1, size = 1.5, alpha = 2/3) +
annotate(geom = "text",
x = 42, y = 115,
label = "This relation is\nvisibly curved.",
family = "Times") +
theme(text = element_text(family = "Times"),
panel.grid = element_blank())
```

Those variables two appear to follow an orderly relation, but whatever it is, it’s clearly not a simple straight line. The quadratic model is probably the most commonly used polynomial regression model. It follows the generic form

\[\mu = \alpha + \beta_1 x_i + \color{navy}{\beta_2 x_i^2}.\]

McElreath warned: “Fitting these models to data is easy. Interpreting them can be hard” (p. 111). Standardizing will help `brm()`

fit the model. We might standardize our `weight`

variable like so.

```
<-
d %>%
d mutate(weight_s = (weight - mean(weight)) / sd(weight)) %>%
mutate(weight_s2 = weight_s^2)
```

While we were at it, we just went ahead and computed the `weight_s2`

variable. We can express our statistical model as

\[\begin{align*} \text{height}_i & \sim \operatorname{Normal}(\mu_i, \sigma) \\ \mu_i & = \alpha + \beta_1 \text{weight_s}_i + \color{navy}{\beta_2 \text{weight_s}^2_i} \\ \alpha & \sim \operatorname{Normal}(178, 20) \\ \beta_1 & \sim \operatorname{Log-Normal}(0, 1) \\ \color{navy}{\beta_2} & \color{navy}\sim \color{navy}{\operatorname{Normal}(0, 1)} \\ \sigma & \sim \operatorname{Uniform}(0, 50). \end{align*}\]

Here’s how we might fit the quadratic model with **brms**.

```
.5 <-
b4brm(data = d,
family = gaussian,
~ 1 + weight_s + weight_s2,
height prior = c(prior(normal(178, 20), class = Intercept),
prior(lognormal(0, 1), class = b, coef = "weight_s"),
prior(normal(0, 1), class = b, coef = "weight_s2"),
prior(uniform(0, 50), class = sigma)),
iter = 30000, warmup = 29000, chains = 4, cores = 4,
seed = 4,
file = "fits/b04.05")
```

Note our use of the `coef`

argument within our prior statements. Since \(\beta_1\) and \(\beta_2\) are both parameters of `class = b`

within the **brms** set-up, we need to use the `coef`

argument in cases when we want their priors to differ.

`plot(b4.5)`

`print(b4.5)`

```
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: height ~ 1 + weight_s + weight_s2
## Data: d (Number of observations: 544)
## Samples: 4 chains, each with iter = 30000; warmup = 29000; thin = 1;
## total post-warmup samples = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 146.03 0.38 145.29 146.78 1.01 1369 2019
## weight_s 21.75 0.29 21.17 22.33 1.00 1773 2031
## weight_s2 -7.77 0.28 -8.34 -7.25 1.01 901 935
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 5.81 0.18 5.48 6.18 1.00 2472 2150
##
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```

Our quadratic plot requires new `fitted()`

- and `predict()`

-oriented wrangling.

```
<-
weight_seq tibble(weight_s = seq(from = -2.5, to = 2.5, length.out = 30)) %>%
mutate(weight_s2 = weight_s^2)
<-
fitd_quad fitted(b4.5,
newdata = weight_seq) %>%
data.frame() %>%
bind_cols(weight_seq)
<-
pred_quad predict(b4.5,
newdata = weight_seq) %>%
data.frame() %>%
bind_cols(weight_seq)
```

Behold the code for our version of Figure 4.11.b.

```
<-
p2 ggplot(data = d,
aes(x = weight_s)) +
geom_ribbon(data = pred_quad,
aes(ymin = Q2.5, ymax = Q97.5),
fill = "grey83") +
geom_smooth(data = fitd_quad,
aes(y = Estimate, ymin = Q2.5, ymax = Q97.5),
stat = "identity",
fill = "grey70", color = "black", alpha = 1, size = 1/2) +
geom_point(aes(y = height),
color = "navyblue", shape = 1, size = 1.5, alpha = 1/3) +
labs(subtitle = "quadratic",
y = "height") +
coord_cartesian(xlim = range(d$weight_s),
ylim = range(d$height)) +
theme(text = element_text(family = "Times"),
panel.grid = element_blank())
p2
```

From a formula perspective, the cubic model is a simple extension of the quadratic:

\[\mu = \alpha + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3.\]

Before we fit the model, we need to wrangle the data again.

```
<-
d %>%
d mutate(weight_s3 = weight_s^3)
```

Now fit the model like so.

```
.6 <-
b4brm(data = d,
family = gaussian,
~ 1 + weight_s + weight_s2 + weight_s3,
height prior = c(prior(normal(178, 20), class = Intercept),
prior(lognormal(0, 1), class = b, coef = "weight_s"),
prior(normal(0, 1), class = b, coef = "weight_s2"),
prior(normal(0, 1), class = b, coef = "weight_s3"),
prior(uniform(0, 50), class = sigma)),
iter = 40000, warmup = 39000, chains = 4, cores = 4,
seed = 4,
file = "fits/b04.06")
```

And now we’ll fit the good old linear model.

```
.7 <-
b4brm(data = d,
family = gaussian,
~ 1 + weight_s,
height prior = c(prior(normal(178, 20), class = Intercept),
prior(lognormal(0, 1), class = b, coef = "weight_s"),
prior(uniform(0, 50), class = sigma)),
iter = 40000, warmup = 39000, chains = 4, cores = 4,
seed = 4,
file = "fits/b04.07")
```

Here’s the `fitted()`

, `predict()`

, and **ggplot2** code for Figure 4.11.c, the cubic model.

```
<-
weight_seq %>%
weight_seq mutate(weight_s3 = weight_s^3)
<-
fitd_cub fitted(b4.6,
newdata = weight_seq) %>%
as_tibble() %>%
bind_cols(weight_seq)
<-
pred_cub predict(b4.6,
newdata = weight_seq) %>%
as_tibble() %>%
bind_cols(weight_seq)
<-
p3 ggplot(data = d,
aes(x = weight_s)) +
geom_ribbon(data = pred_cub,
aes(ymin = Q2.5, ymax = Q97.5),
fill = "grey83") +
geom_smooth(data = fitd_cub,
aes(y = Estimate, ymin = Q2.5, ymax = Q97.5),
stat = "identity",
fill = "grey70", color = "black", alpha = 1, size = 1/4) +
geom_point(aes(y = height),
color = "navyblue", shape = 1, size = 1.5, alpha = 1/3) +
labs(subtitle = "cubic",
y = "height") +
coord_cartesian(xlim = range(d$weight_s),
ylim = range(d$height)) +
theme(text = element_text(family = "Times"),
panel.grid = element_blank())
p3
```

And here’s the `fitted()`

, `predict()`

, and **ggplot2** code for Figure 4.11.a, the linear model.

```
<-
fitd_line fitted(b4.7,
newdata = weight_seq) %>%
as_tibble() %>%
bind_cols(weight_seq)
<-
pred_line predict(b4.7,
newdata = weight_seq) %>%
as_tibble() %>%
bind_cols(weight_seq)
<-
p1 ggplot(data = d,
aes(x = weight_s)) +
geom_ribbon(data = pred_line,
aes(ymin = Q2.5, ymax = Q97.5),
fill = "grey83") +
geom_smooth(data = fitd_line,
aes(y = Estimate, ymin = Q2.5, ymax = Q97.5),
stat = "identity",
fill = "grey70", color = "black", alpha = 1, size = 1/4) +
geom_point(aes(y = height),
color = "navyblue", shape = 1, size = 1.5, alpha = 1/3) +
labs(subtitle = "linear",
y = "height") +
coord_cartesian(xlim = range(d$weight_s),
ylim = range(d$height)) +
theme(text = element_text(family = "Times"),
panel.grid = element_blank())
p1
```

Did you notice how we labeled each of the past three plots as `p1`

, `p2`

, and `p3`

? Here we use those names to plot them all together with **patchwork** syntax.

`| p2 | p3 p1 `

As fun as this all has been,

it’s not clear that any of these models make a lot of sense. They are good geocentric descriptions of the sample, yes. But there are two problems. First, a better fit to the sample might not actually be a better model. That’s the subject of Chapter 7. Second, the model contains no biological information. We aren’t learning any causal relationship between height and weight. We’ll deal with this second problem much later, in Chapter 16. (p. 113)

##### 4.5.1.0.1 Overthinking: Converting back to natural scale.

You can apply McElreath’s conversion trick within the **ggplot2** environment, too. Here it is with the cubic model.

```
<- c(-2, -1, 0, 1, 2)
at
ggplot(data = d,
aes(x = weight_s)) +
geom_ribbon(data = pred_cub,
aes(ymin = Q2.5, ymax = Q97.5),
fill = "grey83") +
geom_point(aes(y = height),
color = "navyblue", shape = 1, size = 1.5, alpha = 1/3) +
coord_cartesian(xlim = range(d$weight_s)) +
theme(text = element_text(family = "Times"),
panel.grid = element_blank()) +
# here it is!
scale_x_continuous("standardized weight converted back",
breaks = at,
labels = round(at*sd(d$weight) + mean(d$weight), 1))
```

### 4.5.2 Splines.

Load the `cherry_blossoms`

data (Aono, 2012; Aono & Kazui, 2008; Aono & Saito, 2010).

```
library(rethinking)
data(cherry_blossoms)
<- cherry_blossoms
d rm(cherry_blossoms)
detach(package:rethinking, unload = T)
```

Minus the mini histograms, here is our ground-up **tidyverse** way to summarize our new `d`

data the way McElreath did with his `precis()`

.

```
%>%
d gather() %>%
group_by(key) %>%
summarise(mean = mean(value, na.rm = T),
sd = sd(value, na.rm = T),
ll = quantile(value, prob = .055, na.rm = T),
ul = quantile(value, prob = .945, na.rm = T)) %>%
mutate_if(is.double, round, digits = 2)
```

```
## # A tibble: 5 x 5
## key mean sd ll ul
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 doy 105. 6.41 94.4 115
## 2 temp 6.14 0.66 5.15 7.29
## 3 temp_lower 5.1 0.85 3.79 6.37
## 4 temp_upper 7.19 0.99 5.9 8.9
## 5 year 1408 351. 868. 1948.
```

McElreath encouraged us to plot `doy`

against `year`

.

```
%>%
d ggplot(aes(x = year, y = doy)) +
# color from here: https://www.colorhexa.com/ffb7c5
geom_point(color = "#ffb7c5", alpha = 1/2) +
theme_bw() +
theme(panel.grid = element_blank(),
# color from here: https://www.colordic.org/w/, inspired by https://chichacha.netlify.com/2018/11/29/plotting-traditional-colours-of-japan/
panel.background = element_rect(fill = "#4f455c"))
```

It looks like there are some wiggly trends, but it’s hard to tell with a scatter plot.

Our goal is to approximate the blossom trend with a wiggly function. With B-splines, just like with polynomial regression, we do this by generating new predictor variables and using those in the linear model, \(\mu_i\). Unlike polynomial regression, B-splines do not directly transform the predictor by squaring or cubing it. Instead they invent a series of entirely new, synthetic predictor variables. Each of these synthetic variables exists only to gradually turn a specific parameter on and off within a specific range of the real predictor variable. Each of the synthetic variables is called a

basis function. The linear model ends up looking very familiar:\[\mu_i = \alpha + w_1 B_{i, 1} + w_2 B_{i, 2} + w_3 B_{i, 3} + \dots\]

where \(B_{i,n}\) is the \(n\)-th basis function’s value on row \(i\), and the \(w\) parameters are corresponding weights for each. The parameters act like slopes, adjusting the influence of each basis function on the mean \(\mu_i\). So really this is just another linear regression, but with some fancy, synthetic predictor variables. (p. 115,

emphasisin the original)

It turns out there are cases with missing data for the `doy`

variable.

```
%>%
d count(is.na(doy)) %>%
mutate(percent = 100 * n / sum(n))
```

```
## is.na(doy) n percent
## 1 FALSE 827 68.06584
## 2 TRUE 388 31.93416
```

Let’s follow McElreath and make a subset of the data that excludes cases with missing data in `doy`

. Within the **tidyverse**, we might do so with the `tidyr::drop_na()`

function.

```
<-
d2 %>%
d drop_na(doy)
```

On page 117 in the text, McElreath indirectly explained how to make Figure 4.12 by walking through the workflow for making Figure 4.13. Here we mimic that ordering.

First, we choose the knots. Remember, the knots are just values of year that serve as pivots for our spline. Where should the knots go? There are different ways to answer this question. You can, in principle, put the knots wherever you like. Their locations are part of the model, and you are responsible for them. Let’s do what we did in the simple example above, place the knots at different evenlyspaced quantiles of the predictor variable. This gives you more knots where there are more observations. We used only 5 knots in the first example. Now let’s go for 15:

```
<- 15
num_knots <- quantile(d2$year, probs = seq(from = 0, to = 1, length.out = num_knots)) knot_list
```

Our `knot_list`

contains 15 `year`

values.

` knot_list`

```
## 0% 7.142857% 14.28571% 21.42857% 28.57143% 35.71429% 42.85714% 50% 57.14286% 64.28571%
## 812 1036 1174 1269 1377 1454 1518 1583 1650 1714
## 71.42857% 78.57143% 85.71429% 92.85714% 100%
## 1774 1833 1893 1956 2015
```

Here’s what it looks like if we use those `knot_list`

values to chop up our `year`

/`doy`

scatter plot, from above.

```
%>%
d ggplot(aes(x = year, y = doy)) +
geom_vline(xintercept = knot_list, color = "white", alpha = 1/2) +
geom_point(color = "#ffb7c5", alpha = 1/2) +
theme_bw() +
theme(panel.background = element_rect(fill = "#4f455c"),
panel.grid = element_blank())
```

The next choice is polynomial degree. This determines how basis functions combine, which determines how the parameters interact to produce the spline. For degree 1, as in Figure 4.12, two basis functions combine at each point. For degree 2, three functions combine at each point. For degree 3, four combine. R already has a nice function that will build basis functions for any list of knots and degree. This code will construct the necessary basis functions for a degree 3 (cubic) spline: (p. 117)

```
library(splines)
<- bs(d2$year,
B knots = knot_list[-c(1, num_knots)],
degree = 3,
intercept = TRUE)
```

Look closely at McElreath’s tricky `knot_list[-c(1, num_knots)]`

code. Whereas `knot_list`

contains 15 ordered `year`

values, McElreath shaved off the first and last `year`

values with `knot_list[-c(1, num_knots)]`

, leaving 13. This is because, by default, the `bs()`

function places knots at the boundaries. Since the first and 15^{th} values in `knot_list`

were boundary values for `year`

, we removed them to avoid redundancies. We can confirm this with the code, below.

`%>% str() B `

```
## 'bs' num [1:827, 1:17] 1 0.96 0.767 0.563 0.545 ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr [1:17] "1" "2" "3" "4" ...
## - attr(*, "degree")= int 3
## - attr(*, "knots")= Named num [1:13] 1036 1174 1269 1377 1454 ...
## ..- attr(*, "names")= chr [1:13] "7.142857%" "14.28571%" "21.42857%" "28.57143%" ...
## - attr(*, "Boundary.knots")= int [1:2] 812 2015
## - attr(*, "intercept")= logi TRUE
```

Look at the second to last line, `- attr(*, "Boundary.knots")= int [1:2] 812 2015`

. Those default `"Boundary.knots"`

are the same as `knot_list[c(1, num_knots)]`

. Let’s confirm.

`c(1, num_knots)] knot_list[`

```
## 0% 100%
## 812 2015
```

By the `degree = 3`

argument, we indicated we wanted a cubic spline. McElreath used `degree = 1`

for Figure 4.12. For reasons I’m not prepared to get into, here, splines don’t always include intercept parameters. Indeed, the `bs()`

default is `intercept = FALSE`

. McElreath’s code indicated he wanted to fit a B-spline that included an intercept. Thus: `intercept = TRUE`

.

Here’s how we might make our version of the top panel of Figure 4.13.

```
# wrangle a bit
<-
b %>%
B data.frame() %>%
set_names(str_c(0, 1:9), 10:17) %>%
bind_cols(select(d2, year)) %>%
pivot_longer(-year,
names_to = "bias_function",
values_to = "bias")
# plot
%>%
b ggplot(aes(x = year, y = bias, group = bias_function)) +
geom_vline(xintercept = knot_list, color = "white", alpha = 1/2) +
geom_line(color = "#ffb7c5", alpha = 1/2, size = 1.5) +
ylab("bias value") +
theme_bw() +
theme(panel.background = element_rect(fill = "#4f455c"),
panel.grid = element_blank())
```

To elucidate what’s going on in that plot, we might break it up with `facet_wrap()`

.

```
%>%
b mutate(bias_function = str_c("bias function ", bias_function)) %>%
ggplot(aes(x = year, y = bias)) +
geom_vline(xintercept = knot_list, color = "white", alpha = 1/2) +
geom_line(color = "#ffb7c5", size = 1.5) +
ylab("bias value") +
theme_bw() +
theme(panel.background = element_rect(fill = "#4f455c"),
panel.grid = element_blank(),
strip.background = element_rect(fill = scales::alpha("#ffb7c5", .25), color = "transparent"),
strip.text = element_text(size = 8, margin = margin(0.1, 0, 0.1, 0, "cm"))) +
facet_wrap(~ bias_function, ncol = 1)
```

Now to get the parameter weights for each basis function, we need to actually define the model and make it run. The model is just a linear regression. The synthetic basis functions do all the work. We’ll use each column of the matrix

`B`

as a variable. We’ll also have an intercept to capture the average blossom day. This will make it easier to define priors on the basis weights, because then we can just conceive of each as a deviation from the intercept. (p. 117)

That last line is another indication for why we set `intercept = TRUE`

. Our model will follow the form

\[\begin{align*} \text{day_in_year}_i & \sim \operatorname{Normal} (\mu_i, \sigma) \\ \mu_i & = \alpha + \color{#4f455c}{\sum_{k=1}^K w_k B_{k, i}} \\ \alpha & \sim \operatorname{Normal}(100, 10) \\ \color{#4f455c}{w_j} & \color{#4f455c}\sim \color{#4f455c}{\operatorname{Normal}(0, 10)} \\ \sigma & \sim \operatorname{Exponential}(1), \end{align*}\]

where \(\alpha\) is the intercept, \(B_{k, i}\) is the value of the \(k^\text{th}\) bias function on the \(i^\text{th}\) row of the data, and \(w_k\) is the estimated regression weight for the corresponding \(k^\text{th}\) bias function.

Throughout this chapter, I’ve griped a bit about using the uniform prior for \(\sigma\). Now that McElreath has introduced the exponential distribution as an alternative, those gripes are coming to an end. The exponential distribution is controlled by a single parameter, \(\lambda\), which is also called the rate. As it turns out, the mean of the exponential distribution is the inverse of the rate, \(1 / \lambda\). Here we use the `dexp()`

function to get a sense of what that prior looks like.

```
tibble(x = seq(from = 0, to = 10, by = 0.1)) %>%
mutate(d = dexp(x, rate = 1)) %>%
ggplot(aes(x = x, y = d)) +
geom_area(fill = "#ffb7c5") +
scale_y_continuous(NULL, breaks = NULL) +
theme_bw() +
theme(panel.background = element_rect(fill = "#4f455c"),
panel.grid = element_blank())
```

“We’ll use exponential priors for the rest of the book, in place of uniform priors. It is much more common to have a sense of the average deviation than of the maximum” (p. 119). 🎉

**Acknowledgment**: The workflow to follow is heavily influenced by the helpful contributions from Stephen Wild. My first pass through the material in this section was a mess. Wild’s insights knocked it out of the park and I couldn’t be more grateful. 🍻

Before fitting this model in **brms**, well take a minor detour on the data structure. In his **R** code 4.76, McElreath defined his data in a list, `list( D=d2$doy , B=B )`

. Our approach will be a little different. Here, we’ll add the `B`

matrix to our `d2`

data frame and name the results as `d3`

.

```
<-
d3 %>%
d2 mutate(B = B)
# take a look at the structure of `d3
%>% glimpse() d3
```

```
## Rows: 827
## Columns: 6
## $ year <int> 812, 815, 831, 851, 853, 864, 866, 869, 889, 891, 892, 894, 895, 896, 902, 908,…
## $ doy <int> 92, 105, 96, 108, 104, 100, 106, 95, 104, 109, 108, 106, 104, 104, 102, 98, 95,…
## $ temp <dbl> NA, NA, NA, 7.38, NA, 6.42, 6.44, NA, 6.83, 6.98, 7.11, 6.98, 7.08, 7.20, 7.50,…
## $ temp_upper <dbl> NA, NA, NA, 12.10, NA, 8.69, 8.11, NA, 8.48, 8.96, 9.11, 8.40, 8.57, 8.69, 8.95…
## $ temp_lower <dbl> NA, NA, NA, 2.66, NA, 4.14, 4.77, NA, 5.19, 5.00, 5.11, 5.55, 5.58, 5.72, 6.06,…
## $ B <dbl[,17]> <matrix[33 x 17]>
```

In our `d3`

data, columns `year`

through `temp_lower`

are all standard data columns. The `B`

column is a *matrix column*, which contains the same number of rows as the others, but also smuggled in 17 columns *within* that column. Each of those 17 columns corresponds to one of our synthetic \(B_k\) variables. The advantage of such a data structure is we can simply define our `formula`

argument as `doy ~ 1 + B`

, where `B`

is a stand-in for `B.1 + B.2 + ... + B.17`

.

Here’s how to fit the model.

```
.8 <-
b4brm(data = d3,
family = gaussian,
~ 1 + B,
doy prior = c(prior(normal(100, 10), class = Intercept),
prior(normal(0, 10), class = b),
prior(exponential(1), class = sigma)),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 4,
file = "fits/b04.08")
```

Here’s the model summary.

`print(b4.8)`

```
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: doy ~ 1 + B
## Data: d3 (Number of observations: 827)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 103.59 2.49 98.58 108.45 1.01 761 1071
## B1 -3.19 3.88 -10.80 4.21 1.00 1591 2435
## B2 -1.11 3.92 -8.82 6.62 1.00 1463 1859
## B3 -1.27 3.68 -8.58 6.07 1.00 1332 1833
## B4 4.56 2.96 -1.13 10.54 1.00 1046 1648
## B5 -1.08 2.99 -6.89 4.78 1.00 954 1399
## B6 4.03 3.02 -1.86 10.07 1.00 1056 1515
## B7 -5.55 2.91 -11.23 0.30 1.00 981 1672
## B8 7.57 2.92 1.92 13.29 1.00 1005 1583
## B9 -1.22 2.97 -6.90 4.58 1.01 1001 1458
## B10 2.79 3.03 -3.17 9.02 1.00 1019 1527
## B11 4.42 3.00 -1.49 10.27 1.00 1056 1846
## B12 -0.37 2.99 -5.98 5.70 1.00 1013 1551
## B13 5.28 2.99 -0.52 11.07 1.00 1024 1768
## B14 0.46 3.09 -5.44 6.74 1.00 1036 1712
## B15 -1.05 3.37 -7.77 5.59 1.00 1215 2147
## B16 -7.25 3.44 -14.01 -0.49 1.00 1287 1936
## B17 -7.88 3.26 -14.24 -1.29 1.00 1246 1824
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 5.94 0.14 5.67 6.23 1.00 4173 2921
##
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```

Look at that. Each of the 17 columns in our `B`

matrix was assigned its own parameter. If you fit this model using McElreath’s **rethinking** code, you’ll see the results are very similar. Anyway, McElreath’s comments are in line with the general consensus on spline modes: the parameter estimates are very difficult to interpret directly. It’s often easier to just plot the results. First we’ll use `posterior_samples()`

.

```
<- posterior_samples(b4.8)
post
glimpse(post)
```

```
## Rows: 4,000
## Columns: 20
## $ b_Intercept <dbl> 103.74086, 104.25796, 103.13392, 104.66690, 105.20821, 105.41704, 105.65525, 1…
## $ b_B1 <dbl> -6.4826048, -5.4990824, -2.1646605, 0.4607746, -9.0040921, -7.3966727, -11.508…
## $ b_B2 <dbl> 3.74875270, -0.53550232, -0.42138745, -8.06386888, 0.66505588, 0.72880418, 2.9…
## $ b_B3 <dbl> -4.22042929, -2.26081740, -0.95183959, -1.78965585, -8.02914674, -9.16372510, …
## $ b_B4 <dbl> 5.8833689, 4.7218957, 3.5397449, 3.2324508, 5.1640256, 5.9370012, 2.3368581, 3…
## $ b_B5 <dbl> -4.6338971, -2.0490845, -1.7740736, -1.4582729, -4.3913120, -3.1619283, -4.307…
## $ b_B6 <dbl> 6.00593380, 2.37093832, 6.60788828, 0.95493974, 2.78863329, 2.14199241, 1.6839…
## $ b_B7 <dbl> -6.9436490, -3.8421888, -7.9974119, -6.7766133, -5.6201809, -6.7678482, -8.336…
## $ b_B8 <dbl> 7.316514, 5.119854, 8.393274, 5.799375, 6.329872, 5.731784, 5.708979, 6.760733…
## $ b_B9 <dbl> -1.9490036, -0.2646558, -1.4861210, -1.3906578, -3.7449285, -3.6035334, -4.008…
## $ b_B10 <dbl> 3.9720418, 2.9088482, 4.1508398, 1.3121326, 2.3465406, 3.1685000, 1.6009567, 0…
## $ b_B11 <dbl> 3.6933735, 3.2552081, 2.3290076, 3.5780345, 1.4853387, 1.4431854, 1.8636113, 5…
## $ b_B12 <dbl> -0.2419564, -2.2181982, -0.6924766, -1.5380999, -1.3345390, -1.7848732, -1.199…
## $ b_B13 <dbl> 5.416313, 3.645593, 6.208850, 4.199717, 4.373958, 3.727896, 1.780963, 4.056961…
## $ b_B14 <dbl> 1.5023931, 2.6503143, 1.1680142, 1.9840733, -3.6880151, -4.3624263, -1.3456437…
## $ b_B15 <dbl> -1.51102386, -4.81636227, 1.75411291, -3.42256204, -1.68102337, -2.74340145, -…
## $ b_B16 <dbl> -8.4839941, -7.0335783, -8.7948696, -4.4579658, -10.4273168, -10.3023468, -13.…
## $ b_B17 <dbl> -9.947067, -10.641582, -8.199229, -8.768240, -7.976792, -8.981154, -7.529841, …
## $ sigma <dbl> 6.077514, 5.949900, 6.119600, 5.906621, 5.937513, 5.884939, 5.901013, 5.992085…
## $ lp__ <dbl> -2711.191, -2710.897, -2714.579, -2714.185, -2711.390, -2713.697, -2712.118, -…
```

With a little wrangling, we can use summary information from `post`

to make our version of the middle panel of Figure 4.13.

```
%>%
post select(b_B1:b_B17) %>%
set_names(c(str_c(0, 1:9), 10:17)) %>%
pivot_longer(everything(), names_to = "bias_function") %>%
group_by(bias_function) %>%
summarise(weight = mean(value)) %>%
full_join(b, by = "bias_function") %>%
# plot
ggplot(aes(x = year, y = bias * weight, group = bias_function)) +
geom_vline(xintercept = knot_list, color = "white", alpha = 1/2) +
geom_line(color = "#ffb7c5", alpha = 1/2, size = 1.5) +
theme_bw() +
theme(panel.background = element_rect(fill = "#4f455c"),
panel.grid = element_blank())
```

In case you missed it, the main action in the **ggplot2** code was `y = bias * weight`

, where we defined the \(y\)-axis as the product of `bias`

and `weight`

. This is fulfillment of the \(w_k B_{k, i}\) parts of the model. Now here’s how we might use `brms::fitted()`

to make the lower plot of Figure 4.13.

```
<- fitted(b4.8)
f
%>%
f data.frame() %>%
bind_cols(d2) %>%
ggplot(aes(x = year, y = doy, ymin = Q2.5, ymax = Q97.5)) +
geom_vline(xintercept = knot_list, color = "white", alpha = 1/2) +
geom_hline(yintercept = fixef(b4.8)[1, 1], color = "white", linetype = 2) +
geom_point(color = "#ffb7c5", alpha = 1/2) +
geom_ribbon(fill = "white", alpha = 2/3) +
labs(x = "year",
y = "day in year") +
theme_bw() +
theme(panel.background = element_rect(fill = "#4f455c"),
panel.grid = element_blank())
```

If it wasn’t clear, the dashed horizontal line intersecting a little above 100 on the \(y\)-axis is the poster mean for the intercept. Now let’s use our skills to remake the simpler model expressed in Figure 4.12. This model, recall, is based on 5 knots.

```
# redo the `B` splines
<- 5
num_knots <- quantile(d2$year, probs = seq(from = 0, to = 1, length.out = num_knots))
knot_list
<- bs(d2$year,
B knots = knot_list[-c(1, num_knots)],
# this makes the splines liner rater than cubic
degree = 1,
intercept = TRUE)
# define a new `d4` data
<-
d4 %>%
d2 mutate(B = B)
.9 <-
b4brm(data = d4,
family = gaussian,
formula = doy ~ 1 + B,
prior = c(prior(normal(100, 10), class = Intercept),
prior(normal(0, 10), class = b),
prior(exponential(1), class = sigma)),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 4,
file = "fits/b04.09")
```

Review the new model summary.

`print(b4.9)`

```
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: doy ~ 1 + B
## Data: d4 (Number of observations: 827)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 103.46 4.57 94.27 112.27 1.01 673 1077
## B1 -0.32 4.65 -9.48 9.16 1.01 694 1124
## B2 1.49 4.60 -7.52 10.84 1.01 692 1048
## B3 1.57 4.60 -7.40 10.90 1.01 681 1117
## B4 3.99 4.59 -5.00 13.02 1.01 689 1140
## B5 -5.56 4.63 -14.64 3.76 1.01 685 1175
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 6.10 0.15 5.82 6.41 1.00 2537 1993
##
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```

Here we do all the work in bulk to make and save the three subplots for Figure 4.12.

```
## top
# wrangle a bit
<-
b invoke(data.frame, d4) %>%
pivot_longer(starts_with("B"),
names_to = "bias_function",
values_to = "bias")
# plot
<-
p1 %>%
b ggplot(aes(x = year, y = bias, group = bias_function)) +
geom_vline(xintercept = knot_list, color = "white", alpha = 1/2) +
geom_line(color = "#ffb7c5", alpha = 1/2, size = 1.5) +
scale_x_continuous(NULL, breaks = NULL) +
ylab("bias value")
## middle
# wrangle
<-
p2 posterior_samples(b4.9) %>%
select(b_B1:b_B5) %>%
set_names(str_c("B.", 1:5)) %>%
pivot_longer(everything(), names_to = "bias_function") %>%
group_by(bias_function) %>%
summarise(weight = mean(value)) %>%
full_join(b, by = "bias_function") %>%
# plot
ggplot(aes(x = year, y = bias * weight, group = bias_function)) +
geom_vline(xintercept = knot_list, color = "white", alpha = 1/2) +
geom_line(color = "#ffb7c5", alpha = 1/2, size = 1.5) +
scale_x_continuous(NULL, breaks = NULL)
## bottom
# wrangle
<- fitted(b4.9)
f
<-
p3 %>%
f data.frame() %>%
bind_cols(d2) %>%
# plot
ggplot(aes(x = year, y = doy, ymin = Q2.5, ymax = Q97.5)) +
geom_vline(xintercept = knot_list, color = "white", alpha = 1/2) +
geom_hline(yintercept = fixef(b4.9)[1, 1], color = "white", linetype = 2) +
geom_point(color = "#ffb7c5", alpha = 1/2) +
geom_ribbon(fill = "white", alpha = 2/3) +
labs(x = "year",
y = "day in year")
```

Now combine the subplots with **patchwork** syntax and behold their glory.

```
/ p2 / p3) &
(p1 theme_bw() &
theme(panel.background = element_rect(fill = "#4f455c"),
panel.grid = element_blank())
```

### 4.5.3 Smooth functions for a rough world.

The splines in the previous section are just the beginning. A entire class of models,

generalized additive models(GAMs), focuses on predicting an outcome variable using smooth functions of some predictor variables. The topic is deep enough to deserve its own book. (p. 120,emphasisin the original)

McElreath ended that block quote with a reference to his endnote #78. On page 562, we read: “A very popular and comprehensive text is Wood (2017a).”

## 4.6 ~~Summary~~ First bonus: Smooth functions with `brms::s()`

It’s convenient for us how McElreath ended that last section with a reference to Simon Wood’s work because **brms** allows for a variety of non-linear models by borrowing functions from Woods’s **mgcv** package (Wood, 2003, 2004, 2011, 2017b, 2019; Wood et al., 2016). The two smooth functions **brms** imports from **mgcv** are `s()`

and `t2()`

. We’ll be exploring `s()`

. We might use the `brms::get_prior()`

function to get a sense of how to set up the priors when using `s()`

.

```
get_prior(data = d2,
family = gaussian,
~ 1 + s(year)) doy
```

```
## prior class coef group resp dpar nlpar bound source
## (flat) b default
## (flat) b syear_1 (vectorized)
## student_t(3, 105, 5.9) Intercept default
## student_t(3, 0, 5.9) sds default
## student_t(3, 0, 5.9) sds s(year) (vectorized)
## student_t(3, 0, 5.9) sigma default
```

We have an overall intercept (`class = Intercept`

), a single \(\beta\) parameter for `year`

(`class = b`

), a \(\sigma\) parameter (`class = sigma`

), and an unfamiliar parameter of `class = sds`

. I’m not going to go into that last parameter in any detail, here. We’ll need to work our way up through Chapter 13 and the multilevel model to get a full picture of what it means. The important thing to note here is that the priors for our `s()`

-based alternative to the B-spline models, above, are going to look a little different. Here’s how we might fit an alternative to `b4.8`

```
.10 <-
b4brm(data = d2,
family = gaussian,
~ 1 + s(year),
doy prior = c(prior(normal(100, 10), class = Intercept),
prior(normal(0, 10), class = b),
prior(student_t(3, 0, 5.9), class = sds),
prior(exponential(1), class = sigma)),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 4,
control = list(adapt_delta = .99),
file = "fits/b04.10")
```

Check out the model summary.

`print(b4.10)`

```
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: doy ~ 1 + s(year)
## Data: d2 (Number of observations: 827)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
##
## Smooth Terms:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sds(syear_1) 22.36 7.56 11.87 40.48 1.00 1115 1863
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 104.54 0.22 104.13 104.98 1.00 4095 2264
## syear_1 -9.87 8.94 -27.42 7.43 1.00 3123 2818
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 6.04 0.15 5.75 6.36 1.00 4305 2900
##
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```

Our intercept and \(\sigma\) summaries are similar to those we got from `b4.8`

. The rest looks different. Here’s what happens when we use `brms::fitted()`

to plot the implications of the model.

```
fitted(b4.10) %>%
data.frame() %>%
bind_cols(select(d2, year, doy)) %>%
ggplot(aes(x = year, y = doy, ymin = Q2.5, ymax = Q97.5)) +
geom_hline(yintercept = fixef(b4.10)[1, 1], color = "white", linetype = 2) +
geom_point(color = "#ffb7c5", alpha = 1/2) +
geom_ribbon(fill = "white", alpha = 2/3) +
labs(subtitle = "b4.7 using s(year)",
y = "day in year") +
theme_bw() +
theme(panel.background = element_rect(fill = "#4f455c"),
panel.grid = element_blank())
```

That smooth doesn’t look quite the same. Hopefully this isn’t terribly surprising. We used a function from a different package and ended up with a different underlying statistical model. In fact, we didn’t even use a B-spline. The default for `s()`

is to use what’s called a *thin plate* regression spline. If we’d like to fit a B-spline, we have to set `bs = "bs"`

. Here’s an example.

```
.11 <-
b4brm(data = d2,
family = gaussian,
~ 1 + s(year, bs = "bs", k = 19),
doy prior = c(prior(normal(100, 10), class = Intercept),
prior(normal(0, 10), class = b),
prior(student_t(3, 0, 5.9), class = sds),
prior(exponential(1), class = sigma)),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 4,
control = list(adapt_delta = .99),
file = "fits/b04.11")
```

`print(b4.11)`

```
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: doy ~ 1 + s(year, bs = "bs", k = 19)
## Data: d2 (Number of observations: 827)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
##
## Smooth Terms:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sds(syear_1) 1.31 0.59 0.54 2.80 1.01 622 988
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 104.54 0.20 104.14 104.95 1.00 5150 2533
## syear_1 -0.09 0.33 -0.75 0.57 1.00 1873 2216
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 5.98 0.15 5.70 6.28 1.00 4065 3050
##
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```

Now here’s the depiction of our `s()`

-based B-spline model.

```
fitted(b4.11) %>%
data.frame() %>%
bind_cols(select(d2, year, doy)) %>%
ggplot(aes(x = year, y = doy, ymin = Q2.5, ymax = Q97.5)) +
geom_hline(yintercept = fixef(b4.11)[1, 1], color = "white", linetype = 2) +
geom_point(color = "#ffb7c5", alpha = 1/2) +
geom_ribbon(fill = "white", alpha = 2/3) +
labs(subtitle = 'b4.7_bs using s(year, bs = "bs")',
y = "day in year") +
theme_bw() +
theme(panel.background = element_rect(fill = "#4f455c"),
panel.grid = element_blank())
```

There are still other important differences between the underlying statistical model for `b4.11`

and the earlier `b4.8`

that I’m just not going to go into, here.

For more on the B-splines and smooths, more generally, check out the blog post by the great Gavin Simpson, *Extrapolating with B splines and GAMs*. For a high-level introduction to the models you can fit with **mgcv**, check out the nice talk by Noam Ross, *Nonlinear models in R: The wonderful world of mgcv*, or the equally-nice presentation by Simpson, *Introduction to generalized additive models with R and mgcv*. Ross offers a free online course covering **mgcv**, called GAMS in R, and he maintains a GitHub repo cataloguing other GAM-related resources, called *Resources for learning about and using GAMs in R*. For specific examples of fitting various GAMS with **brms**, check out Simpson’s blog post, **Fitting GAMs with brms: part 1**. Finally, Tristan Mahr has a nice blog post called *Random effects and penalized splines are the same thing*, where he outlined the connections between penalized smooths, such as you might fit with **mgcv**, with the multilevel model, which we’ll learn all about starting in Chapter 13, which helps explain what’s going on with the `s()`

function in our last two models, `b4.10`

and `b4.11`

.

## 4.7 Second bonus: Group predictors with matrix columns

When we fit `b4.8`

, our direct **brms** analogue to McElreath’s `m4.7`

^{2}, we used a compact syntax to pass a matrix column of predictors into the `formula`

. If memory serves, this is one of the only places in the text where we see this. It would be easy for the casual reader to think this was only appropriate for something like a spline model. But that’s not the case. One could use the matrix-column trick as a general approach. In this bonus section, we’ll explore how.

In Section 11.2.6 of their (2020) text, Gelman, Hill, and Vehtari worked through an example of a multiple regression model,

\[ \begin{align*} y_i & \sim \operatorname{Normal}(\mu_i, \sigma) \\ \mu_i & = \alpha + \theta z_i + \sum_{k = 1}^K b_k x_{k, i} \\ & \text{<priors>}, \end{align*} \]

where \(y_i\) was some continuous variable collected across participants, \(i\). The \(\alpha\) term was the intercept and the \(\theta\) term was the regression slope for a binary variable \(z\)–we’ll practice with binary predictors in Section 5.3. More to our interest, the last portion of the equation is a compact way to convey there are \(K\) additional predictors and their associated regression coefficients, which we might more explicitly express as \(\beta_1 x_{1, i} + \cdots + \beta_k x_{k, i}\), where \(K \geq 1\). In this particular example, \(K = 10\), meaning there were ten \(x_{k, i}\) predictors, making this an example of a model with 11 total predictor variables.

Riffing off of Gelman and colleagues, here’s how you might simulate data of this kind.

```
# how many cases would you like?
<- 100
n
# how many continuous x predictor variables would you like?
<- 10
k
# simulate a dichotomous dummy variable for z
# simulate an n by k array for X
set.seed(4)
<-
d tibble(z = sample(0:1, size = n, replace = T),
X = array(runif(n * k, min = 0, max = 1), dim = c(n, k)))
# set the data-generating parameter values
<- 1
a <- 5
theta <- 1:k
b <- 2
sigma
# simulate the criterion
<-
d %>%
d mutate(y = as.vector(a + X %*% b + theta * z + rnorm(n, mean = 0, sd = sigma)))
# check the data structure
%>%
d glimpse()
```

```
## Rows: 100
## Columns: 3
## $ z <int> 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1…
## $ X <dbl[,10]> <matrix[33 x 10]>
## $ y <dbl> 19.51231, 26.75217, 28.92495, 20.63144, 29.18219, 43.67198, 36.98107, 36.68241, 28.…
```

Although our `d`

tibble has only three columns, the `X`

column is a matrix column into which we’ve smuggled ten columns more. Here’s how we might access them more directly.

```
%>%
d pull(X) %>%
glimpse()
```

`## num [1:100, 1:10] 0.253 0.63 0.266 0.532 0.468 ...`

See? There’s an \(100 \times 10\) data matrix in there. *Tricky*. Here’s how to fit the full model with **brms** where we use the compact matrix-column syntax in the `formula`

argument.

```
.12 <-
b4brm(data = d,
family = gaussian,
~ 1 + z + X,
y prior = c(prior(normal(0, 2), class = Intercept),
prior(normal(0, 10), class = b),
prior(exponential(1), class = sigma)),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 4,
file = "fits/b04.12")
```

Check the parameter summary.

`print(b4.12)`

```
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: y ~ 1 + z + X
## Data: d (Number of observations: 100)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 0.95 1.21 -1.42 3.24 1.00 7096 3172
## z 4.74 0.42 3.92 5.57 1.00 7132 2651
## X1 0.56 0.74 -0.88 2.04 1.00 7515 3646
## X2 0.88 0.68 -0.47 2.24 1.00 7180 3427
## X3 3.40 0.72 1.99 4.82 1.00 7592 3281
## X4 2.82 0.73 1.38 4.26 1.00 7795 3134
## X5 5.75 0.71 4.35 7.15 1.00 6154 3273
## X6 6.41 0.74 4.98 7.85 1.00 6828 3383
## X7 8.49 0.73 7.06 9.92 1.00 7494 3063
## X8 8.40 0.71 7.02 9.78 1.00 7389 3078
## X9 8.83 0.80 7.28 10.39 1.00 6807 3026
## X10 9.33 0.70 7.94 10.69 1.00 6668 2701
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 1.99 0.16 1.71 2.32 1.00 5356 3113
##
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```

**brms** automatically numbered our \(K = 10\) `X`

variables as `X1`

through `X10`

. As far as applications go, I’m not sure where I’d use this way of storing and modeling data in real life. But maybe some of y’all work in domains where this is just the right way to approach your data needs. If so, good luck and happy modeling.

## Session info

`sessionInfo()`

```
## R version 4.0.4 (2021-02-15)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] splines parallel stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] tidybayes_2.3.1 brms_2.15.0 Rcpp_1.0.6 rstan_2.21.2
## [5] StanHeaders_2.21.0-7 patchwork_1.1.1 forcats_0.5.1 stringr_1.4.0
## [9] dplyr_1.0.5 purrr_0.3.4 readr_1.4.0 tidyr_1.1.3
## [13] tibble_3.1.0 ggplot2_3.3.3 tidyverse_1.3.0
##
## loaded via a namespace (and not attached):
## [1] readxl_1.3.1 backports_1.2.1 plyr_1.8.6 igraph_1.2.6
## [5] svUnit_1.0.3 crosstalk_1.1.0.1 TH.data_1.0-10 rstantools_2.1.1
## [9] inline_0.3.17 digest_0.6.27 htmltools_0.5.1.1 rsconnect_0.8.16
## [13] fansi_0.4.2 magrittr_2.0.1 modelr_0.1.8 RcppParallel_5.0.2
## [17] matrixStats_0.57.0 xts_0.12.1 sandwich_3.0-0 prettyunits_1.1.1
## [21] colorspace_2.0-0 rvest_0.3.6 ggdist_2.4.0.9000 haven_2.3.1
## [25] xfun_0.22 callr_3.5.1 crayon_1.4.1 jsonlite_1.7.2
## [29] lme4_1.1-25 survival_3.2-7 zoo_1.8-8 glue_1.4.2
## [33] gtable_0.3.0 emmeans_1.5.2-1 V8_3.4.0 distributional_0.2.2
## [37] pkgbuild_1.2.0 shape_1.4.5 abind_1.4-5 scales_1.1.1
## [41] mvtnorm_1.1-1 emo_0.0.0.9000 DBI_1.1.0 miniUI_0.1.1.1
## [45] isoband_0.2.3 viridisLite_0.3.0 xtable_1.8-4 HDInterval_0.2.2
## [49] stats4_4.0.4 DT_0.16 htmlwidgets_1.5.2 httr_1.4.2
## [53] threejs_0.3.3 arrayhelpers_1.1-0 ellipsis_0.3.1 pkgconfig_2.0.3
## [57] loo_2.4.1 farver_2.0.3 dbplyr_2.0.0 utf8_1.1.4
## [61] tidyselect_1.1.0 labeling_0.4.2 rlang_0.4.10 reshape2_1.4.4
## [65] later_1.1.0.1 munsell_0.5.0 cellranger_1.1.0 tools_4.0.4
## [69] cli_2.3.1 generics_0.1.0 broom_0.7.5 ggridges_0.5.2
## [73] evaluate_0.14 fastmap_1.0.1 processx_3.4.5 knitr_1.31
## [77] fs_1.5.0 nlme_3.1-152 mime_0.10 projpred_2.0.2
## [81] xml2_1.3.2 compiler_4.0.4 bayesplot_1.8.0 shinythemes_1.1.2
## [85] rstudioapi_0.13 curl_4.3 gamm4_0.2-6 reprex_0.3.0
## [89] statmod_1.4.35 stringi_1.5.3 highr_0.8 ps_1.6.0
## [93] Brobdingnag_1.2-6 lattice_0.20-41 Matrix_1.3-2 nloptr_1.2.2.2
## [97] markdown_1.1 shinyjs_2.0.0 vctrs_0.3.6 pillar_1.5.1
## [101] lifecycle_1.0.0 bridgesampling_1.0-0 estimability_1.3 httpuv_1.5.4
## [105] R6_2.5.0 bookdown_0.21 promises_1.1.1 gridExtra_2.3
## [109] codetools_0.2-18 boot_1.3-26 colourpicker_1.1.0 MASS_7.3-53
## [113] gtools_3.8.2 assertthat_0.2.1 withr_2.4.1 shinystan_2.5.0
## [117] multcomp_1.4-16 mgcv_1.8-33 hms_0.5.3 grid_4.0.4
## [121] coda_0.19-4 minqa_1.2.4 rmarkdown_2.7 shiny_1.5.0
## [125] lubridate_1.7.9.2 base64enc_0.1-3 dygraphs_1.1.1.6
```

In the first edition of his text, McElreath (2015) started out with the uniform prior for \(\sigma\) and transitioned to the half Cauchy when he introduced HMC in Chapter 8. In this edition, he largely replaced the half Cauchy with the exponential distribution. Both are fine. If you’d like to learn more about the half Cauchy, you might check out Section 8.4.3.1 of my (2020a) translation of his first edition. We’ll learn about the exponential distribution just a little later in this chapter.↩︎

I know. I’m sorry our model numbering diverged from the numbering in the text. The problem popped up in Section 4.5.1. where McElreath fit a linear model with the

`weight_s`

predictor, but didn’t show the code in the text and, as a consequence, didn’t assign that model a number. Since we explicitly fit that linear model, we assigned it the next number in the sequence. So it goes. But yes, I am correct in comparing our`b4.8`

to McElreath’s`m4.7`

.↩︎