4  Geocentric Models

Learning Objectives:

“This chapter introduces linear regression as a Bayesian procedure. Under a probability interpretation, which is necessary for Bayesian work, linear regression uses a Gaussian (normal) distribution to describe our golem’s uncertainty about some measurement of interest.” (McElreath, 2020, p. 71) (pdf)

4.1 Why normal distributions?

Why are so many distributions approximately normal, i.e., shaped like a Gaussian curve? Because there are many more combinations of outcomes that sum to a “central” value than combinations that sum to some extreme value.

Why are normal distributions normal?

“Any process that adds together random values from the same distribution converges to a normal.” (McElreath, 2020, p. 73) (pdf)

4.1.1 Normal by addition

“Whatever the average value of the source distribution, each sample from it can be thought of as a fluctuation from that average value. When we begin to add these fluctuations together, they also begin to cancel one another out. A large positive fluctuation will cancel a large negative one. The more terms in the sum, the more chances for each fluctuation to be canceled by another, or by a series of smaller ones in the opposite direction. So eventually the most likely sum, in the sense that there are the most ways to realize it, will be a sum in which every fluctuation is canceled by another, a sum of zero (relative to the mean).” (McElreath, 2020, p. 73) (pdf)

“It doesn’t matter what shape the underlying distribution possesses. It could be uniform, like in our example above, or it could be (nearly) anything else. Depending upon the underlying distribution, the convergence might be slow, but it will be inevitable. Often, as in this example, convergence is rapid.” (McElreath, 2020, p. 74) (pdf)
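
A minimal sketch of this convergence, along the lines of the book’s random-walk example: sum 16 uniform fluctuations many times and compare the result with a normal curve (the 16 steps and 1,000 replicates are arbitrary choices).

Code
pos <- replicate(1e3, sum(runif(16, -1, 1)))  # 1,000 sums of 16 random fluctuations
rethinking::dens(pos, norm.comp = TRUE)       # density of the sums vs. a normal curve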

Resources: Why normal distributions?

See the excellent article Why is normal distribution so ubiquitous?, which also explains the random-walk example from SR2. See also the paper Why are normal distributions normal? in The British Journal for the Philosophy of Science.

4.1.2 Normal by multiplication

This holds not only for addition but also for the multiplication of small values: multiplying numbers that deviate only slightly from 1 is approximately the same as adding their deviations.
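
A small sketch in the spirit of the book’s growth-rate example: twelve multiplicative effects of at most 10% each yield an approximately Gaussian distribution of products, because for small deviations multiplying is close to adding.

Code
growth <- replicate(1e4, prod(1 + runif(12, 0, 0.1)))  # products of 12 small growth factors
rethinking::dens(growth, norm.comp = TRUE)             # approximately Gaussian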

4.1.3 Normal by log-multiplication

But even the multiplication of large deviates tends to produce a Gaussian distribution on the log scale.
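
A small sketch with larger multiplicative effects (up to 50%, chosen arbitrarily): the products are skewed on the raw scale but approximately Gaussian on the log scale, because the logarithm turns the product into a sum.

Code
big <- replicate(1e4, prod(1 + runif(12, 0, 0.5)))  # products of 12 large growth factors
par(mfrow = c(1, 2))
rethinking::dens(big, norm.comp = TRUE)             # skewed on the raw scale
rethinking::dens(log(big), norm.comp = TRUE)        # approximately Gaussian on the log scale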

4.1.4 Using Gaussian distributions

The justifications for using the Gaussian distribution fall into two broad categories:

  1. Ontological justification: The Gaussian distribution is a widespread pattern, appearing again and again at different scales and in different domains.
  2. Epistemological justification: When all we know is the mean and variance of a distribution then the Gaussian distribution arises as the most consistent with these assumptions.

WATCH OUT! Many processes have heavier tails than the Gaussian distribution

“… the Gaussian distribution has some very thin tails—there is very little probability in them. Instead most of the mass in the Gaussian lies within one standard deviation of the mean. Many natural (and unnatural) processes have much heavier tails.” (McElreath, 2020, p. 76) (pdf)

Definition 4.1 : Probability mass and probability density

  • Probability distributions with only discrete outcomes, like the binomial, are called probability mass functions and denoted Pr.

  • Continuous ones like the Gaussian are called probability density functions, denoted with \(p\) or just plain old \(f\), depending upon author and tradition.

“Probability density is the rate of change in cumulative probability. So where cumulative probability is increasing rapidly, density can easily exceed 1. But if we calculate the area under the density function, it will never exceed 1. Such areas are also called probability mass.” (McElreath, 2020, p. 76) (pdf)

For example dnorm(0, 0, 0.1), which is the way to make R calculate \(p(0 \mid 0, 0.1)\), results in 3.9894228, a density greater than 1.
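
A short sketch of this point: the density at a single value can exceed 1, but the area under the density curve (the probability mass) is still 1.

Code
dnorm(0, mean = 0, sd = 0.1)                     # 3.989423: a density, not a probability
integrate(dnorm, -Inf, Inf, mean = 0, sd = 0.1)  # total area under the curve: 1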

“The Gaussian distribution is routinely seen without σ but with another parameter, \(\tau\) . The parameter \(\tau\) in this context is usually called precision and defined as \(\tau = 1/σ^2\). When \(\sigma\) is large, \(\tau\) is small.” (McElreath, 2020, p. 76) (pdf)

“This form is common in Bayesian data analysis, and Bayesian model fitting software, such as BUGS or JAGS, sometimes requires using \(\tau\) rather than \(\sigma\).” (McElreath, 2020, p. 76) (pdf)
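
A tiny sketch of the conversion between \(\sigma\) and \(\tau\), with \(\sigma = 8\) chosen arbitrarily:

Code
sigma <- 8
tau <- 1 / sigma^2  # precision: a large sigma means a small tau
tau                 # 0.015625
1 / sqrt(tau)       # back to sigma: 8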

4.2 Model describing language

Procedure 4.1 : General recipe for describing models

  1. First, we recognize a set of variables to work with. Some of these variables are observable. We call these data. Others are unobservable things like rates and averages. We call these parameters.
  2. We define each variable either in terms of the other variables or in terms of a probability distribution.
  3. The combination of variables and their probability distributions defines a joint generative model that can be …” (McElreath, 2020, p. 77) (pdf)

Definition 4.2 : What are statistical models?

Models are “mappings of one set of variables through a probability distribution onto another set of variables.” (McElreath, 2020, p. 77) (pdf)

4.2.1 Re-describing the globe tossing model

Example 4.1 : Describe the globe tossing model from Chapter 3

\[ \begin{align*} W \sim \operatorname{Binomial}(N, p) \space \space (1)\\ p \sim \operatorname{Uniform}(0, 1) \space \space (2) \end{align*} \tag{4.1}\]

  • W: observed count of water
  • N: total number of tosses
  • p: proportion of water on the globe

The first line in these kind of models always defines the likelihood function used in Bayes’ theorem. The other lines define priors.

Read the above statement as:

  1. First line: The count W is distributed binomially with sample size N and probability p.
  2. Second line: The prior for p is assumed to be uniform between zero and one.

Both of the lines in the model of Equation 4.1 are stochastic, as indicated by the ~ symbol. A stochastic relationship is just a mapping of a variable or parameter onto a distribution. It is stochastic because no single instance of the variable on the left is known with certainty. Instead, the mapping is probabilistic: Some values are more plausible than others, but very many different values are plausible under any model. Later, we’ll have models with deterministic definitions in them.
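
As a reminder of how such a model description maps onto computation, here is a minimal grid-approximation sketch for the globe tossing model, assuming the familiar counts of W = 6 water in N = 9 tosses; the grid size of 1,000 points is an arbitrary choice.

Code
p_grid <- seq(from = 0, to = 1, length.out = 1000)
likelihood <- dbinom(6, size = 9, prob = p_grid)  # line (1): W ~ Binomial(N, p)
prior <- dunif(p_grid, min = 0, max = 1)          # line (2): p ~ Uniform(0, 1)
posterior <- likelihood * prior                   # Bayes' theorem, unstandardized
posterior <- posterior / sum(posterior)           # standardize so it sums to 1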

4.3 Gaussian model of height

In this section we model a single measurement variable as a Gaussian distribution. This prepares for the linear regression model in Section 4.4, where we will construct and add a predictor variable to the model.

“There will be two parameters describing the distribution’s shape, the mean μ and the standard deviation σ. Bayesian updating will allow us to consider every possible combination of values for μ and σ and to score each combination by its relative plausibility, in light of the data. These relative plausibilities are the posterior probabilities of each combination of values μ, σ.” (McElreath, 2020, p. 78/79) (pdf)

4.3.1 The data

Resource: Nancy Howell data

  • Howell, N. (2001). Demography of the Dobe !Kung (2nd ed.). Routledge.
  • Howell, N. (2010). Life Histories of the Dobe !Kung: Food, Fatness, and Well-Being over the Life Span (Vol. 4). University of California Press.

The data contained in data(Howell1) are partial census data for the Dobe area !Kung San, compiled from interviews conducted by Nancy Howell in the late 1960s.

Much more raw data is available for download from the University of Toronto Library

“For the non-anthropologists reading along, the !Kung San are the most famous foraging population of the twentieth century, largely because of detailed quantitative studies by people like Howell.” (McElreath, 2020, p. 79) (pdf)

WATCH OUT! Loading data without attaching the package with library()

Loading data from a package with the data() function is only possible if you have already loaded the package.

R Code 4.1 : Data loading from a package – Standard procedure

Code
## R code 4.7 not executed! #######################
# library(rethinking)
# data(Howell1)
# d_a <- Howell1

The standard loading of data from packages with

`library(rethinking)`
`data(Howell1)`

is not executed in this book: I want to prevent clashes from loading {rethinking} and {brms} at the same time, because of their many similarly named functions.

Because of the many function name conflicts with {brms} I do not want to load {rethinking}; instead I call the functions of these conflicting packages with <package name>::<function name>(). Therefore I have to use another, less usual strategy for loading the data set.

R Code 4.2 a: Data loading from a package – Unusual procedure (Original)

Code
data(package = "rethinking", list = "Howell1")
d_a <- Howell1
head(d_a)
#>    height   weight age male
#> 1 151.765 47.82561  63    1
#> 2 139.700 36.48581  63    0
#> 3 136.525 31.86484  65    0
#> 4 156.845 53.04191  41    1
#> 5 145.415 41.27687  51    0
#> 6 163.830 62.99259  35    1

The advantage of this unusual strategy is that I do not always have to detach the {rethinking} package and make sure it is detached before using {brms}, as is necessary in Kurz’s {tidyverse} / {brms} version.


R Code 4.3 b: Data loading from a package – Unusual procedure (Tidyverse)

Code
data(package = "rethinking", list = "Howell1")
d_b <- tibble::as_tibble(Howell1)
head(d_b)
#> # A tibble: 6 × 4
#>   height weight   age  male
#>    <dbl>  <dbl> <dbl> <int>
#> 1   152.   47.8    63     1
#> 2   140.   36.5    63     0
#> 3   137.   31.9    65     0
#> 4   157.   53.0    41     1
#> 5   145.   41.3    51     0
#> 6   164.   63.0    35     1


4.3.1.1 Show the data

Example 4.2 : Show and inspect the data

R Code 4.4 a: Compactly Display the Structure of an Arbitrary R Object (Original)

Code
## R code 4.8 ####################
str(d_a)
#> 'data.frame':    544 obs. of  4 variables:
#>  $ height: num  152 140 137 157 145 ...
#>  $ weight: num  47.8 36.5 31.9 53 41.3 ...
#>  $ age   : num  63 63 65 41 51 35 32 27 19 54 ...
#>  $ male  : int  1 0 0 1 0 1 0 1 0 1 ...

utils::str() displays compactly the internal structure of any reasonable R object.

Our Howell1 data contains four columns. Each column has 544 entries, so there are 544 individuals in these data. Each individual has a recorded height (centimeters), weight (kilograms), age (years), and “maleness” (0 indicating female and 1 indicating male).

R Code 4.5 a: Displays concise parameter estimate information for an existing model fit (Original)

Code
## R code 4.9 ###################
rethinking::precis(d_a)
#>               mean         sd      5.5%     94.5%     histogram
#> height 138.2635963 27.6024476 81.108550 165.73500 ▁▁▁▁▁▁▁▂▁▇▇▅▁
#> weight  35.6106176 14.7191782  9.360721  54.50289 ▁▂▃▂▂▂▂▅▇▇▃▂▁
#> age     29.3443934 20.7468882  1.000000  66.13500     ▇▅▅▃▅▂▂▁▁
#> male     0.4724265  0.4996986  0.000000   1.00000    ▇▁▁▁▁▁▁▁▁▇

rethinking::precis() creates a table of estimates and standard errors, with optional confidence intervals and parameter correlations.

In this case we see the mean, the standard deviation, the boundaries of an 89% percentile interval, and a small histogram for each of the four variables: height (centimeters), weight (kilograms), age (years), and “maleness” (0 indicating female and 1 indicating male).

Additionally there is a line of console output, in our case: 'data.frame': 544 obs. of 4 variables:.

R Code 4.6 b: Get a glimpse of your data (Tidyverse)

Code
d_b |>
    dplyr::glimpse()
#> Rows: 544
#> Columns: 4
#> $ height <dbl> 151.7650, 139.7000, 136.5250, 156.8450, 145.4150, 163.8300, 149…
#> $ weight <dbl> 47.82561, 36.48581, 31.86484, 53.04191, 41.27687, 62.99259, 38.…
#> $ age    <dbl> 63.0, 63.0, 65.0, 41.0, 51.0, 35.0, 32.0, 27.0, 19.0, 54.0, 47.…
#> $ male   <int> 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, …

pillar::glimpse() is re-exported by {dplyr} and is the tidyverse analogue for str(). It works like a transposed version of print(): columns run down the page, and data runs across.

dplyr::glimpse() shows that the Howell1 data contains four columns. Each column has 544 entries, so there are 544 individuals in these data. Each individual has a recorded height (centimeters), weight (kilograms), age (years), and “maleness” (0 indicating female and 1 indicating male).

R Code 4.7 : Object summaries (Tidyverse)

Code
d_b |> 
    base::summary()
#>      height           weight            age             male       
#>  Min.   : 53.98   Min.   : 4.252   Min.   : 0.00   Min.   :0.0000  
#>  1st Qu.:125.09   1st Qu.:22.008   1st Qu.:12.00   1st Qu.:0.0000  
#>  Median :148.59   Median :40.058   Median :27.00   Median :0.0000  
#>  Mean   :138.26   Mean   :35.611   Mean   :29.34   Mean   :0.4724  
#>  3rd Qu.:157.48   3rd Qu.:47.209   3rd Qu.:43.00   3rd Qu.:1.0000  
#>  Max.   :179.07   Max.   :62.993   Max.   :88.00   Max.   :1.0000

Kurz tells us that the {brms} package does not have a function that works like rethinking::precis() for providing numeric and graphical summaries of variables, as seen in R Code 4.5. He therefore suggests using base::summary() to get some of the information that rethinking::precis() provides.

R Code 4.8 b: Skim a data frame, getting useful summary statistics

I think skimr::skim() is a better alternative to rethinking::precis() than base::summary(), because it also provides a graphical summary of the variables. {skimr} has many other useful functions and is very adaptable. I propose installing it and trying it out.

Code
d_b |>            
    skimr::skim() 
Data summary

Name: d_b
Number of rows: 544
Number of columns: 4
Column type frequency: numeric (4)
Group variables: None

Variable type: numeric

skim_variable n_missing complete_rate   mean    sd    p0    p25    p50    p75   p100 hist
height                0             1 138.26 27.60 53.98 125.10 148.59 157.48 179.07 ▁▂▂▇▇
weight                0             1  35.61 14.72  4.25  22.01  40.06  47.21  62.99 ▃▂▃▇▂
age                   0             1  29.34 20.75  0.00  12.00  27.00  43.00  88.00 ▇▆▅▂▁
male                  0             1   0.47  0.50  0.00   0.00   0.00   1.00   1.00 ▇▁▁▁▇

R Code 4.9 : Show a random number of data records

Code
set.seed(4)
d_b |> 
  dplyr::slice_sample(n = 6)
#> # A tibble: 6 × 4
#>   height weight   age  male
#>    <dbl>  <dbl> <dbl> <int>
#> 1  124.   21.5  11        1
#> 2  130.   24.6  13        1
#> 3   83.8   9.21  1        0
#> 4  131.   25.3  15        0
#> 5  112.   17.8   8.90     1
#> 6  162.   51.6  36        1

R Code 4.10 b: Randomly select a small number of observations and put it into knitr::kable()

Code
set.seed(4)
d_b |>            
    bayr::as_tbl_obs() 
Data set with 5 variables, showing 8 of 544 observations.
Obs height weight age male
62 164.4650 45.897841 50.0 1
71 129.5400 24.550667 13.0 1
130 149.2250 42.155707 27.0 0
307 130.6068 25.259404 15.0 0
312 111.7600 17.831836 8.9 1
371 83.8200 9.213587 1.0 0
414 161.9250 51.596090 36.0 1
504 123.8250 21.545620 11.0 1

I just learned another method to print variables from a data frame. Base R offers utils::head() and utils::tail(), with the disadvantage that the start resp. the end of a data file may be atypical for the variable values. The standard tibble printing method has the same problem. In contrast, bayr::as_tbl_obs() prints a random selection of at most 8 rows as a compact, nicely formatted output that works in both the console and {knitr} output.

Although bayr::as_tbl_obs() does not give a data summary as discussed in Example 4.2, I wanted to mention this printing method because I have long looked for an easy way to display a representative sample of values from a data frame.

R Code 4.11 b: Show data with the internal printing method of tibbles

Code
print(d_b, n = 10)
#> # A tibble: 544 × 4
#>    height weight   age  male
#>     <dbl>  <dbl> <dbl> <int>
#>  1   152.   47.8    63     1
#>  2   140.   36.5    63     0
#>  3   137.   31.9    65     0
#>  4   157.   53.0    41     1
#>  5   145.   41.3    51     0
#>  6   164.   63.0    35     1
#>  7   149.   38.2    32     0
#>  8   169.   55.5    27     1
#>  9   148.   34.9    19     0
#> 10   165.   54.5    54     1
#> # ℹ 534 more rows

Another possibility is to use the tbl_df internal printing method, one of the main features of tibbles. Printing can be tweaked for a one-off call by calling print() explicitly and setting arguments like n and width. More persistent control is available by setting the options described in pillar::pillar_options.

Again this printing method does not give a data summary as is featured in Example 4.2. But it is an easy method – especially as you are already working with tibbles – and sometimes this method is enough to get a sense of the data.

4.3.1.2 Select the height data of adults

“All we want for now are heights of adults in the sample. The reason to filter out nonadults for now is that height is strongly correlated with age, before adulthood.” (McElreath, 2020, p. 80) (pdf)

Example 4.3 : Select the height data of adults (individuals aged 18 years or older)

R Code 4.12 a: Select individuals aged 18 years or older (Original)

Code
## R code 4.11a ###################
d2_a <- d_a[d_a$age >= 18, ]
str(d2_a)
#> 'data.frame':    352 obs. of  4 variables:
#>  $ height: num  152 140 137 157 145 ...
#>  $ weight: num  47.8 36.5 31.9 53 41.3 ...
#>  $ age   : num  63 63 65 41 51 35 32 27 19 54 ...
#>  $ male  : int  1 0 0 1 0 1 0 1 0 1 ...

R Code 4.13 b: Select individuals aged 18 years or older (Tidyverse)

Code
## R code 4.11b ###################
d2_b <- 
  d_b |> 
  dplyr::filter(age >= 18) 

dplyr::glimpse(d2_b)
#> Rows: 352
#> Columns: 4
#> $ height <dbl> 151.7650, 139.7000, 136.5250, 156.8450, 145.4150, 163.8300, 149…
#> $ weight <dbl> 47.82561, 36.48581, 31.86484, 53.04191, 41.27687, 62.99259, 38.…
#> $ age    <dbl> 63.0, 63.0, 65.0, 41.0, 51.0, 35.0, 32.0, 27.0, 19.0, 54.0, 47.…
#> $ male   <int> 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, …

4.3.2 The model

Our goal is to model the data using a Gaussian distribution.

4.3.2.1 Plot the distribution of heights

Example 4.4 : Plotting the distribution of height

R Code 4.14 a: Plot the distribution of the heights of adults, overlaid by an ideal Gaussian distribution (Original)

Code
rethinking::dens(d2_a$height, adj = 1, norm.comp = TRUE)

Graph 4.1: The distribution of the heights data, overlaid by an ideal Gaussian distribution (Original)

With the option norm.comp = TRUE I have overlaid a Gaussian distribution to see the differences from the actual data. There are some local differences, especially at the peak of the distribution. But the tails look fine and the overall impression of the curve is Gaussian.

R Code 4.15 b: Plot the distribution of the heights of adults, overlaid by an ideal Gaussian distribution (Tidyverse)

Code
d2_b |> 
    ggplot2::ggplot(ggplot2::aes(height)) +
    ggplot2::geom_density() +

    ggplot2::stat_function(
        fun = dnorm,
        args = with(d2_b, c(mean = mean(height), sd = sd(height)))
        ) +
    ggplot2::labs(
      x = "Height in cm",
      y = "Density"
    ) +
    ggplot2::theme_bw()

Graph 4.2: The distribution of the heights data, overlaid by an ideal Gaussian distribution: tidyverse version

The plot of the heights distribution compared with an ideal Gaussian distribution is missing in Kurz’s version. I added this plot after an internet search, using the last example of How to Plot a Normal Distribution in R. It uses ggplot2::stat_function() to compute and draw a function as a continuous curve, which makes it easy to superimpose a function on top of an existing plot.

WATCH OUT! Looking at the raw data is not enough for a model decision

“Gawking at the raw data, to try to decide how to model them, is usually not a good idea. The data could be a mixture of different Gaussian distributions, for example, and in that case you won’t be able to detect the underlying normality just by eyeballing the outcome distribution.” (McElreath, 2020, p. 81) (pdf)

Theorem 4.1 : Define the heights as normally distributed with a mean \(\mu\) and standard deviation \(\sigma\)

\[ h_{i} \sim \operatorname{Normal}(\mu, \sigma) \tag{4.2}\]

  • Symbol h: refers to the list of heights
  • Subscript i: refers to each individual element of the list. It is conventional to use \(i\) because it stands for index. The index \(i\) takes on row numbers, and so in this example can take any value from 1 to 352 (the number of heights in d2_a$height). As such, the model above is saying that all the golem knows about each height measurement is defined by the same normal distribution, with mean \(\mu\) and standard deviation \(\sigma\).

Equation 4.2 assumes that the values \(h_{i}\) are i.i.d. (independent and identically distributed)

“The i.i.d. assumption doesn’t have to seem awkward, as long as you remember that probability is inside the golem, not outside in the world. The i.i.d. assumption is about how the golem represents its uncertainty. It is an epistemological assumption. It is not a physical assumption about the world, an ontological one. E. T. Jaynes (1922–1998) called this the mind projection fallacy, the mistake of confusing epistemological claims with ontological claims.” (McElreath, 2020, p. 81) (pdf)

“To complete the model, we’re going to need some priors. The parameters to be estimated are both \(\mu\) and \(\sigma\), so we need a prior \(Pr(\mu, \sigma)\), the joint prior probability for all parameters. In most cases, priors are specified independently for each parameter, which amounts to assuming \(Pr(\mu, \sigma) = Pr(\mu)Pr(\sigma)\).” (McElreath, 2020, p. 82) (pdf)

Theorem 4.2 : Define the linear heights model

\[ \begin{align*} h_{i} \sim \operatorname{Normal}(\mu, \sigma) \space \space (1) \\ \mu \sim \operatorname{Normal}(178, 20) \space \space (2) \\ \sigma \sim \operatorname{Uniform}(0, 50) \space \space (3) \end{align*} \tag{4.3}\]


  1. First line represents the likelihood.
  2. Second line is the chosen \(\mu\) (mu, mean) prior. It is a broad Gaussian prior, centered on 178 cm, with 95% of probability between 178 ± 40 cm.
  3. Third line is the chosen \(\sigma\) (sigma, standard deviation) prior.

Let’s think about the chosen value for the priors more in detail:

1. Choosing the mean prior

  • Why normal distribution?: As we have stated before in Theorem 4.1 the heights distribution of adults is a Gaussian distribution.
  • Why 178cm?: “Your author is 178 cm tall. And the range from 138 cm to 218 cm encompasses a huge range of plausible mean heights for human populations. So domain-specific information has gone into this prior.” (McElreath, 2020, p. 82) (pdf)
  • Why 95% and 40cm?: 40 cm is twice sigma (2 × 20 cm), and 95% is a reference to the 68–95–99.7 rule, which helps to remember what percentage of values in a normal distribution lies within one, two, or three standard deviations of the mean (a quick numerical check follows after the formula):

\[ \begin{align} \Pr(\mu-1\sigma \le X \le \mu+1\sigma) & \approx 68.27\% \\ \Pr(\mu-2\sigma \le X \le \mu+2\sigma) & \approx 95.45\% \\ \Pr(\mu-3\sigma \le X \le \mu+3\sigma) & \approx 99.73\% \end{align} \]
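
A quick numerical check of the rule with stats::pnorm() (using the standard normal, so the result holds for any \(\mu\) and \(\sigma\)):

Code
pnorm(1) - pnorm(-1)  # ~0.6827
pnorm(2) - pnorm(-2)  # ~0.9545
pnorm(3) - pnorm(-3)  # ~0.9973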

2. Choosing the sigma prior

  • Why a uniform distribution?: It is a flat prior: within the chosen bounds, every value of \(\sigma\) is treated as equally plausible before seeing the data.
  • Why 0 as lower limit?: “A standard deviation like σ must be positive, so bounding it at zero makes sense.” (McElreath, 2020, p. 82) (pdf)
  • Why 50 as upper limit? “… a standard deviation of 50 cm would imply that 95% of individual heights lie within 100 cm of the average height.” (McElreath, 2020, p. 82) (pdf) That is a range large enough to include the variation of human heights.

Plot the chosen priors!

It is important to plot the priors to get an idea about the assumptions they build into your model.

Example 4.5 : Plot the chosen priors

R Code 4.16 a: Plot the chosen mean prior (Original)

Code
## R code 4.12 ###############################
graphics::curve(stats::dnorm(x, 178, 20), from = 100, to = 250)

Graph 4.3: Plot of the chosen mean prior (Original)

You can see that the golem is assuming that the average height (not each individual height) is almost certainly between 140 cm and 220 cm. So this prior carries a little information, but not a lot.

R Code 4.17 a: Plot chosen prior for the standard deviation (Original)

Code
## R code 4.13 ###########################
graphics::curve(stats::dunif(x, 0, 50), from = -10, to = 60)

Graph 4.4: Plot the chosen prior for the standard deviation (Original)

b: Plot the chosen mean prior (Tidyverse)

Code
tibble::tibble(x = base::seq(from = 100, to = 250, by = .1)) |> 
    
  ggplot2::ggplot(ggplot2::aes(x = x, y = stats::dnorm(x, mean = 178, sd = 20))) +
  ggplot2::geom_line() +
  ggplot2::scale_x_continuous(breaks = base::seq(from = 100, to = 250, by = 25)) +
  ggplot2::labs(title = "mu ~ dnorm(178, 20)",
       y = "density") +
  ggplot2::theme_bw()

Graph 4.5: Plot of the chosen mean prior (Tidyverse)

As there is only one variable \(y\) (= dnorm(x, mean = 178, sd = 20)), we need to specify \(x\) as a sequence of 1501 points to provide an \(x\) and a \(y\) aesthetic for the plot.

b: Plot the chosen prior for the standard deviation (Tidyverse)

Code
tibble::tibble(x = base::seq(from = -10, to = 60, by = .1)) |>

  ggplot2::ggplot(ggplot2::aes(x = x, y = stats::dunif(x, min = 0, max = 50))) +
  ggplot2::geom_line() +
  ggplot2::scale_x_continuous(breaks = c(0, 50)) +
  ggplot2::scale_y_continuous(NULL, breaks = NULL) +
  ggplot2::ggtitle("sigma ~ dunif(0, 50)") +
  ggplot2::theme_bw()

Graph 4.6: Plot the chosen prior for the standard deviation (Tidyverse)

We don’t really need the \(y\)-axis when looking at the shapes of a density, so we’ll just remove it with scale_y_continuous(NULL, breaks = NULL).

4.3.2.2 Prior predictive simulation

Simulate the prior predictive distribution!

“Prior predictive simulation is very useful for assigning sensible priors, because it can be quite hard to anticipate how priors influence the observable variables.” (McElreath, 2020, p. 83) (pdf)

To see the difference we will look at two prior predictive distributions:

  1. The first one with our reflections and data from Equation 4.3.
  2. For the second predictive distribution we will choose a much flatter and less informative prior for \(\mu\), like \(\mu \sim Normal(178, 100)\). Priors with such large standard deviations are quite common in Bayesian models, but they are hardly ever sensible.

“Okay, so how to do this? You can quickly simulate heights by sampling from the prior, like you sampled from the posterior back in Chapter 3. Remember, every posterior is also potentially a prior for a subsequent analysis, so you can process priors just like posteriors.” (McElreath, 2020, p. 82) (pdf)

Example 4.6 : Prior Predictive Simulation

R Code 4.18 a: Simulate heights by sampling from the priors with \(\mu \sim Normal(178, 20)\) (Original)

Code
N_sim_height_a <- 1e4
set.seed(4) # to make example reproducible

## R code 4.14a adapted #######################################
sample_mu_a <- rnorm(N_sim_height_a, 178, 20)
sample_sigma_a <- runif(N_sim_height_a, 0, 50)
priors_height_a <- rnorm(N_sim_height_a, sample_mu_a, sample_sigma_a)
rethinking::dens(priors_height_a, 
                 adj = 1, 
                 norm.comp = TRUE,
                 show.zero = TRUE,
                 col = "red")
graphics::abline(v = 272, lty = 2)

Graph 4.7: Simulate heights by sampling from the priors with µ ~ Normal(178,20) (Original)

The prior predictive simulation generates a plausible distribution: there are no negative values (left dashed vertical line at 0), and one of the tallest people in recorded history, Robert Pershing Wadlow (1918–1940) with 272 cm (right dashed vertical line), has only a small probability.

The prior predictive distribution of height is not itself Gaussian: compared with the overlaid normal curve its peak is too narrow and too high, and its tails are too thick. But this is ok.

“The distribution you see is not an empirical expectation, but rather the distribution of relative plausibilities of different heights, before seeing the data.” (McElreath, 2020, p. 83) (pdf)

R Code 4.19 a: Simulate heights by sampling from the priors with \(\mu \sim Normal(178, 100)\) (Original)

Code
N_sim_height_a <- 1e4
set.seed(4) # to make example reproducible

## R code 4.14a adapted #######################################
sample_mu2_a <- rnorm(N_sim_height_a, 178, 100)
sample_sigma2_a <- runif(N_sim_height_a, 0, 50)
priors_height2_a <- rnorm(N_sim_height_a, sample_mu2_a, sample_sigma2_a)
rethinking::dens(priors_height2_a, 
                 adj = 1, 
                 norm.comp = TRUE,
                 show.zero = TRUE,
                 col = "red")
graphics::abline(v = 272, lty = 2)

Graph 4.8: Simulate heights by sampling from the priors with µ ~ Normal(178,100) (Original)

The results in Graph 4.8 contradict our scientific knowledge — but also our common sense — about possible height values of humans. Now the model, before seeing the data, expects people to have negative height. It also expects some giants. One of the tallest people in recorded history, Robert Pershing Wadlow (1918–1940), stood 272 cm tall. In our prior predictive simulation many people are taller than this.

“Does this matter? In this case, we have so much data that the silly prior is harmless. But that won’t always be the case. There are plenty of inference problems for which the data alone are not sufficient, no matter how numerous. Bayes lets us proceed in these cases. But only if we use our scientific knowledge to construct sensible priors. Using scientific knowledge to build priors is not cheating. The important thing is that your prior not be based on the values in the data, but only on what you know about the data before you see it.” (McElreath, 2020, p. 84) (pdf)

R Code 4.20 b: Simulate heights by sampling from the priors with \(\mu \sim Normal(178, 20)\) (Tidyverse)

Code
N_sim_height_b <- 1e4
set.seed(4) # to make example reproducible

## R code 4.14b #######################################

sim_height_b <-
  tibble::tibble(sample_mu_b = stats::rnorm(N_sim_height_b, mean = 178, sd  = 20),
                 sample_sigma_b = stats::runif(N_sim_height_b, min = 0, max = 50)) |> 
  dplyr::mutate(priors_height_b = stats::rnorm(N_sim_height_b, 
                                               mean = sample_mu_b, 
                                               sd = sample_sigma_b))
  
sim_height_b |> 
  ggplot2::ggplot(ggplot2::aes(x = priors_height_b)) +
  ggplot2::geom_density(color = "red") +
  ggplot2::stat_function(
        fun = dnorm,
        args = with(sim_height_b, c(mean = mean(priors_height_b), sd = sd(priors_height_b)))
        ) +
  ggplot2::geom_vline(xintercept = c(0, 272), linetype = "dashed") +
  ggplot2::ggtitle("height ~ dnorm(178, 20)") +
  ggplot2::labs(x = "Height in cm", y = "Density") +
  ggplot2::theme_bw()

Graph 4.9: Simulate heights by sampling from the priors with µ ~ Normal(178,20) (Tidyverse)

ggplot2::geom_density() computes and draws kernel density estimates, which is a smoothed version of the histogram. Note that there is no data mentioned explicitly in the call of ggplot2::geom_density(). When this is the case (data = NULL) then the data will be inherited from the plot data as specified in the call to ggplot2::ggplot(). Otherwise the function needs a data frame or a function with a single argument to override the plot data. (geom_density() help file).


R Code 4.21 b: Simulate heights by sampling from the priors with \(\mu \sim Normal(178, 100)\) (Tidyverse)

Code
N_sim_height_b <- 1e4
set.seed(4) # to make example reproducible

## R code 4.14b #######################################

sim_height2_b <-
  tibble::tibble(sample_mu2_b = stats::rnorm(N_sim_height_b, mean = 178, sd  = 100),
                 sample_sigma2_b = stats::runif(N_sim_height_b, min = 0, max = 50)) |> 
  dplyr::mutate(priors_height2_b = stats::rnorm(N_sim_height_b, 
                                               mean = sample_mu2_b, 
                                               sd = sample_sigma2_b))
  
sim_height2_b |> 
  ggplot2::ggplot(ggplot2::aes(x = priors_height2_b)) +
  ggplot2::geom_density(color = "red") +
  ggplot2::stat_function(
        fun = dnorm,
        args = with(sim_height2_b, c(mean = mean(priors_height2_b), sd = sd(priors_height2_b)))
        ) +
  ggplot2::geom_vline(xintercept = c(0, 272), linetype = "dashed") +
  ggplot2::ggtitle("height ~ dnorm(178, 100)") +
  ggplot2::labs(x = "Height in cm", y = "Density") +
  ggplot2::theme_bw()

Graph 4.10: Simulate heights by sampling from the priors with µ ~ Normal(178,100) (Tidyverse)


4.3.3 Grid approximation of the posterior distribution

We are going to map out the posterior distribution through brute force calculations.

This is not recommended because it is

  • laborious and computationally expensive
  • usually so impractical as to be essentially impossible.

Therefore the grid approximation technique has limited relevance. Later on we will use the quadratic approximation with rethinking::quap().

The strategy is the same grid approximation strategy as before in Section 2.4.3. But now there are two dimensions, and so there is a geometric (literally) increase in bother.

Procedure 4.2 : Grid approximation as R code (compare with Procedure 2.3)

  1. Define the grid - the first two code lines: Establish the range of \(\mu\) and \(\sigma\) values that you want to calculate over and decide how many points to calculate in-between.
  2. Compute the values of all combinations of the priors on the grid - third line: Expands the \(\mu\) and \(\sigma\) values into a matrix of all combinations of \(\mu\) and \(\sigma\). The matrix is stored in the data frame post_a.
  3. Compute the log-likelihood at each parameter value - fourth line: Because the product of many small likelihoods would round to zero, we do all calculations on the log scale. base::sapply() passes the unique combination of \(\mu\) and \(\sigma\) on each row of post_a to a function that computes the log-likelihood of each observed height and adds all of these log-likelihoods together with base::sum().
  4. Compute the unstandardized posterior at each parameter value - fifth line: Multiply prior by likelihood by adding them, because we are using the log scale.
  5. Return to probability scale - sixth line: To prevent rounding to zero we can’t use the standard approach with base::exp(post_a$prod) but have to scale all log-products by the maximum log-product.

Use log-probability to prevent rounding to zero!

“Remember, in large samples, all unique samples are unlikely. This is why you have to work with log-probability.” (McElreath, 2020, p. 562) (pdf)
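
A small sketch of the underflow problem the quote warns about, using the adult heights d2_a and two arbitrary parameter combinations: on the raw probability scale both joint likelihoods round to zero, while on the log scale their relative plausibility is preserved by subtracting the maximum before exponentiating.

Code
ll_154 <- sum(dnorm(d2_a$height, mean = 154, sd = 8, log = TRUE))  # log-likelihoods
ll_150 <- sum(dnorm(d2_a$height, mean = 150, sd = 8, log = TRUE))
exp(c(ll_154, ll_150))                        # both underflow to 0 on the raw scale
exp(c(ll_154, ll_150) - max(ll_154, ll_150))  # relative plausibility preserved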

: Grid approximation of the posterior distribution

R Code 4.22 a: Grid approximation of the posterior distribution (Original)

Code
## R code 4.16a Grid approx. ##################################

## 1. Define the grid ##########
mu.list_a <- seq(from = 150, to = 160, length.out = 100)  
sigma.list_a <- seq(from = 7, to = 9, length.out = 100)   

## 2. All Combinations of μ & σ ##########
post_a <- expand.grid(mu_a = mu.list_a, sigma_a = sigma.list_a) 

## 3. Compute log-likelihood #######
post_a$LL <- sapply(1:nrow(post_a), function(i) {
  sum(
    dnorm(d2_a$height, post_a$mu_a[i], post_a$sigma_a[i], log = TRUE)
  )
})

## 4. Multiply prior by likelihood ##########
## as the priors are on the log scale adding = multiplying
post_a$prod <- post_a$LL + dnorm(post_a$mu_a, 178, 20, TRUE) +    
  dunif(post_a$sigma_a, 0, 50, TRUE)                              

## 5. Back to probability scale #########
## without rounding error 
post_a$prob <- exp(post_a$prod - max(post_a$prod))       

## define plotting area as one row and two columns
par(mfrow = c(1, 2))

## R code 4.17a Contour plot ##################################
rethinking::contour_xyz(post_a$mu_a, post_a$sigma_a, post_a$prob)

## R code 4.18a Heat map ##################################
rethinking::image_xyz(post_a$mu_a, post_a$sigma_a, post_a$prob)

Contour plot (left) and heat map (right)

b: Grid approximation of the posterior distribution (Tidyverse)

Code
## R code 4.16b Grid approx ##################################

## 1./2. Define grid & combinations ##########
d_grid_b <-
  tidyr::expand_grid(mu_b = base::seq(from = 150, to = 160, length.out = 100),
                  sigma_b = base::seq(from = 7, to = 9, length.out = 100))

## 3a. Compute log-likelihood #######
grid_function <- function(mu, sigma) {
  stats::dnorm(d2_b$height, mean = mu, sd = sigma, log = TRUE) |> 
    sum()
}

d_grid2_b <-
  d_grid_b |> 
  
  ## 3b. Compute log-likelihood #######
  dplyr::mutate(log_likelihood_b = purrr::map2(mu_b, sigma_b, grid_function)) |>
  tidyr::unnest(log_likelihood_b) |> 
  dplyr::mutate(prior_mu_b  = stats::dnorm(mu_b, mean = 178, sd = 20, log = T),
         prior_sigma_b      = stats::dunif(sigma_b, min = 0, max = 50, log = T)) |> 
  
  ## 4. Multiply prior by likelihood ##########
  ## as the priors are on the log scale adding = multiplying
  dplyr::mutate(product_b = log_likelihood_b + prior_mu_b + prior_sigma_b) |> 
  
  ## 5. Back to probability scale #########
  dplyr::mutate(probability_b = exp(product_b - max(product_b)))


## R code 4.17b Contour plot ##################################
p1_b <- 
  d_grid2_b |> 
    ggplot2::ggplot(ggplot2::aes(x = mu_b, y = sigma_b, z = probability_b)) + 
    ggplot2::geom_contour() +
    ggplot2::labs(x = base::expression(mu),
                  y = base::expression(sigma)) +
    ggplot2::coord_cartesian(xlim = c(153.5, 155.7),
                         ylim = c(7, 8.5)) +
    ggplot2::theme_bw()

## R code 4.18b Heat map ##################################
p2_b <- 
  d_grid2_b |> 
    ggplot2::ggplot(ggplot2::aes(x = mu_b, y = sigma_b, fill = probability_b)) + 
    ggplot2::geom_raster(interpolate = TRUE) +
    ggplot2::scale_fill_viridis_c(option = "B") +
    ggplot2::labs(x = base::expression(mu),
         y = base::expression(sigma)) +
    ggplot2::coord_cartesian(xlim = c(153.5, 155.7),
                         ylim = c(7, 8.5)) +
    ggplot2::theme_bw()

library(patchwork)
p1_b + p2_b

Contour plot (left) and heat map (right)
  • The produced tibble contains a list-column (the result of purrr::map2()), so we have to use the tidyr::unnest() function to expand this list-column into a regular column.
  • With ggplot2::coord_cartesian() I zoomed into the graph to concentrate on the most important x/y ranges.
  • The axes uses the symbols of \(\mu\) and \(\sigma\) provided by unevaluated expressions through base::expression().
Table 4.1: Function equivalence between Original and Tidyverse
PURPOSE ORIGINAL TIDYVERSE
All combinations base::expand.grid() tidyr::expand_grid()
Apply function to each element base::sapply() purrr::map2()
Contour plot rethinking::contour_xyz() ggplot2::geom_contour()
Heat map rethinking::image_xyz() ggplot2::geom_raster()

Creating data frame resp. tibble of all combinations

There are several related function for base::expand.grid() in {tidyverse}:

  • tidyr::expand_grid(): Creates a tibble from all combinations of inputs. This is the function most similar to base::expand.grid(), as its inputs are vectors rather than a data frame. But it differs in five respects:
    1. Produces sorted output (by varying the first column the slowest, rather than the fastest).
    2. Returns a tibble, not a data frame.
    3. Never converts strings to factors.
    4. Does not add any additional attributes.
    5. Can expand any generalised vector, including data frames.
  • tidyr::expand(): Generates all combinations of variables found in a dataset. It is paired with
    1. tidyr::crossing(): A wrapper around tidyr::expand_grid() that de-duplicates and sorts the inputs.
    2. tidyr::nesting(): Finds only combinations already present in the data.

4.3.4 Sampling from the posterior

“To study this posterior distribution in more detail, again I’ll push the flexible approach of sampling parameter values from it. This works just like it did in R Code 3.6, when you sampled values of \(p\) from the posterior distribution for the globe tossing example.” (McElreath, 2020, p. 85) (pdf)

Procedure 4.3 : Sampling from the posterior

Since there are two parameters, and we want to sample combinations of them:

  1. Randomly sample row numbers in post_a in proportion to the values in post_a$prob.
  2. Pull out the parameter values on those randomly sampled rows.

Example 4.7 : Sampling from the posterior

R Code 4.23 a: Samples from the posterior distribution for the heights data (Original)

Code
## R code 4.19a ###########################

# 1. Sample row numbers #########
# randomly sample row numbers in post_a 
# in proportion to the values in post_a$prob. 
sample.rows <- sample(1:nrow(post_a),
  size = 1e4, replace = TRUE,
  prob = post_a$prob
)

# 2. pull out parameter values ########
sample.mu_a <- post_a$mu_a[sample.rows]
sample.sigma_a <- post_a$sigma_a[sample.rows]

## R code 4.20a ###########################
plot(sample.mu_a, sample.sigma_a, 
     cex = 0.8, pch = 21, 
     col = rethinking::col.alpha(rethinking:::rangi2, 0.1)
     )

Graph 4.11: Samples from the posterior distribution for the heights data. (Original)

Adjust the plot to your tastes by playing around with cex (character expansion, the size of the points), pch (plot character), and the \(0.1\) transparency value.

The density of points is highest in the center, reflecting the most plausible combinations of \(\mu\) and \(\sigma\). There are many more ways for these parameter values to produce the data, conditional on the model.

R Code 4.24 b: Samples from the posterior distribution for the heights data (Tidyverse)

Code
set.seed(4)

d_grid_samples_b <- 
  d_grid2_b |> 
  dplyr::slice_sample(n = 1e4, 
               replace = TRUE, 
               weight_by = probability_b
               )

d_grid_samples_b |> 
  ggplot2::ggplot(ggplot2::aes(x = mu_b, y = sigma_b)) + 
  ggplot2::geom_point(size = 1.8, alpha = 1/15, color = "#8080FF") +
  ggplot2::scale_fill_viridis_c() +
  ggplot2::labs(x = expression(mu[samples]),
       y = expression(sigma[samples])) +
  ggplot2::theme_bw()

Graph 4.12: Samples from the posterior distribution for the heights data. (Tidyverse)

Kurz used the superseded dplyr::sample_n() to sample rows with replacement from d_grid2_b. I instead used the newer dplyr::slice_sample(), which is now the recommended function for sampling rows.


4.3.4.1 Marginal posterior densities of μ and σ

The jargon marginal here means “averaging over the other parameters.”

We describe the distribution of confidence in each combination of \(\mu\) and \(\sigma\) by summarizing the samples. Think of them like data and describe them, just as in Section 3.2. For example, to characterize the shapes of the marginal posterior densities of \(\mu\) and \(\sigma\), all we need to do is call rethinking::dens() with the appropriate vector, sample.mu_a or sample.sigma_a.
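
Before looking at the sample-based densities in Example 4.8, here is a small sketch of what “averaging over the other parameters” means on the grid itself: for each value of \(\mu\), sum the joint posterior probability over all values of \(\sigma\) (column names as in post_a from R Code 4.22).

Code
marg_mu_a <- aggregate(prob ~ mu_a, data = post_a, FUN = sum)  # sum out sigma
plot(marg_mu_a$mu_a, marg_mu_a$prob, type = "l",
     xlab = "mu", ylab = "unstandardized marginal posterior")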

Example 4.8 : Marginal posterior densities of \(\mu\) and \(\sigma\) for the heights data

R Code 4.25 a: Marginal posterior densities of \(\mu\) and \(\sigma\) for the heights data (Original)

Code
## define plotting area as one row and two columns
par(mfrow = c(1, 2))

## R code 4.21a adapted #########################
rethinking::dens(sample.mu_a, adj = 1, show.HPDI = 0.89, 
                 norm.comp = TRUE, col = "red")
rethinking::dens(sample.sigma_a, adj = 1, show.HPDI = 0.89,
                 norm.comp = TRUE, col = "red")

Graph 4.13: Marginal posterior densities of μ and σ for the heights data (Original)

For comparison I have overlaid the normal distribution and shown the 89% HPDI. Compare the shaded area with the values calculated in Example 4.9.

R Code 4.26 b: Marginal posterior densities of \(\mu\) and \(\sigma\) for the heights data (Tidyverse)

Code
d_grid_samples_b |> 
  tidyr::pivot_longer(mu_b:sigma_b) |> 

  ggplot2::ggplot(ggplot2::aes(x = value)) + 
  ggplot2::geom_density(color = "red") +
  # ggplot2::scale_y_continuous(NULL, breaks = NULL) +
  # ggplot2::xlab(NULL) +
  ggplot2::stat_function(
      fun = dnorm,
      args = with(d_grid_samples_b, c(
        mean = mean(mu_b),
        sd = sd(mu_b)))
      ) +
  ggplot2::stat_function(
    fun = dnorm,
    args = with(d_grid_samples_b, c(
      mean = mean(sigma_b),
      sd = sd(sigma_b)))
    ) +
  ggplot2::labs(x = "mu (left), sigma (right)",
                y = "Density") +
  ggplot2::theme_bw() +
  ggplot2::facet_wrap(~ name, scales = "free",
                      labeller = ggplot2::label_parsed)

Graph 4.14: Marginal posterior densities of μ and σ for the heights data (Tidyverse)

Kurz used tidyr::pivot_longer() and then ggplot2::facet_wrap() to plot the densities for both mu and sigma at once. For comparison I have overlaid the normal distributions. But I do not know how to prevent the baseline at density = 0. See Tidyverse 2 (Graph 4.15), where I have constructed the plots of the two distributions separately.

R Code 4.27 b: Marginal posterior densities of \(\mu\) and \(\sigma\) for the heights data (Tidyverse)

Code
plot_mu_b <-
  d_grid_samples_b |> 
    ggplot2::ggplot(ggplot2::aes(x = mu_b)) + 
    ggplot2::geom_density(color = "red") +
    ggplot2::stat_function(
        fun = dnorm,
        args = with(d_grid_samples_b, c(
          mean = mean(mu_b), 
          sd = sd(mu_b)))
        ) +
    ggplot2::labs(x = expression(mu),
                  y = "Density") +
    ggplot2::theme_bw()

plot_sigma_b <- 
  d_grid_samples_b |> 
    ggplot2::ggplot(ggplot2::aes(x = sigma_b)) + 
    ggplot2::geom_density(color = "red") +
    ggplot2::stat_function(
        fun = dnorm,
        args = with(d_grid_samples_b, c(
          mean = mean(sigma_b), 
          sd = sd(sigma_b)))
        ) +
    ggplot2::labs(x = expression(sigma),
                  y = "Density") +
    ggplot2::theme_bw()

library(patchwork)
plot_mu_b + plot_sigma_b

Graph 4.15: Marginal posterior densities of μ and σ for the heights data (Tidyverse)

“These densities are very close to being normal distributions. And this is quite typical. As sample size increases, posterior densities approach the normal distribution. If you look closely, though, you’ll notice that the density for \(\sigma\) has a longer right-hand tail. I’ll exaggerate this tendency a bit later, to show you that this condition is very common for standard deviation parameters.” (McElreath, 2020, p. 86) (pdf)

4.3.4.2 Posterior compatibility intervals (PIs)

Since the drawn samples of Example 4.7 are just vectors of numbers, you can compute any statistic from them that you could from ordinary data: mean, median, or quantile, for example.

As examples we will compute PI and HPDI/HDCI.
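
Because sample.mu_a and sample.sigma_a (from R Code 4.23) are ordinary numeric vectors, base R summaries work directly on them; a quick sketch:

Code
mean(sample.mu_a)                       # posterior mean of mu
median(sample.sigma_a)                  # posterior median of sigma
quantile(sample.mu_a, c(0.055, 0.945))  # central 89% percentile interval "by hand"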

We’ll use the {tidybayes} resp. {ggdist} package to compute posterior modes and 89% HDIs (rather than the conventional 95% intervals), as recommended by McElreath.

{tidybayes} has a companion package {ggdist}

There is a companion package {ggdist}, which is imported completely by {tidybayes}. Whenever you cannot find a function in {tidybayes}, look at the documentation of {ggdist}. This is also the case for the tidybayes::mode_hdi() function: in the help files of {tidybayes} you will only find notes about a deprecated tidybayes::mode_hdih() function, but not the arguments of its new version without the trailing h (for horizontal), tidybayes::mode_hdi(). You can look up these details in the {ggdist} documentation. This observation is valid for many families of deprecated functions.

There is a division of functionality between {tidybayes} and {ggdist}:

  • {tidybayes}: Tidy Data and ‘Geoms’ for Bayesian Models: Compose data for and extract, manipulate, and visualize posterior draws from Bayesian models in a tidy data format. Functions are provided to help extract tidy data frames of draws from Bayesian models and that generate point summaries and intervals in a tidy format.
  • {ggdist}: Visualizations of Distributions and Uncertainty: Provides primitives for visualizing distributions using {ggplot2} that are particularly tuned for visualizing uncertainty in either a frequentist or Bayesian mode. Both analytical distributions (such as frequentist confidence distributions or Bayesian priors) and distributions represented as samples (such as bootstrap distributions or Bayesian posterior samples) are easily visualized.

Example 4.9 : Summarize the widths with posterior compatibility intervals

R Code 4.28 a: Posterior compatibility interval (PI) (Original)

Code
## R code 4.22a ####################
rethinking::PI(sample.mu_a)
rethinking::PI(sample.sigma_a)
#>       5%      94% 
#> 153.9394 155.2525 
#>       5%      94% 
#> 7.323232 8.252525

The first two output lines refer to the heights \(\mu\) samples, the last two lines to the heights \(\sigma\) samples.

R Code 4.29 a: Highest Posterior Density Interval (HPDI) (Original)

Code
## R code 4.22a ####################
rethinking::HPDI(sample.mu_a)
rethinking::HPDI(sample.sigma_a)
#>    |0.89    0.89| 
#> 153.8384 155.1515 
#>    |0.89    0.89| 
#> 7.303030 8.212121

The first two output lines refer to the heights \(\mu\) samples, the last two lines to the heights \(\sigma\) samples.

R Code 4.30 b: Posterior compatibility interval (PI) (Tidyverse)

Code
d_grid_samples_b |> 
  tidyr::pivot_longer(mu_b:sigma_b) |> 
  dplyr::group_by(name) |> 
  ggdist::mode_hdi(value, .width = 0.89) 
#> # A tibble: 2 × 7
#>   name     value .lower .upper .width .point .interval
#>   <chr>    <dbl>  <dbl>  <dbl>  <dbl> <chr>  <chr>    
#> 1 mu_b    155.   154.   155.     0.89 mode   hdi      
#> 2 sigma_b   7.75   7.30   8.23   0.89 mode   hdi

R Code 4.31 b: Highest Density Continuous Interval (HDCI) (Tidyverse)

Code
d_grid_samples_b |> 
  tidyr::pivot_longer(mu_b:sigma_b) |> 
  dplyr::group_by(name) |> 
  ggdist::mode_hdci(value, .width = 0.89) 
#> # A tibble: 2 × 7
#>   name     value .lower .upper .width .point .interval
#>   <chr>    <dbl>  <dbl>  <dbl>  <dbl> <chr>  <chr>    
#> 1 mu_b    155.   154.   155.     0.89 mode   hdci     
#> 2 sigma_b   7.75   7.30   8.23   0.89 mode   hdci

In {tidybayes} resp. {ggdist} the shortest probability interval (= Highest Posterior Density Interval (HPDI)) is called Highest Density Continuous Interval (HDCI).

{rethinking} versus {tidybayes/ggdist}

There are small differences between the results of the two packages, {rethinking} and {ggdist}/{tidybayes}, but they are not important.

“Before moving on to using quadratic approximation (quap) as shortcut to all of this inference, it is worth repeating the analysis of the height data above, but now with only a fraction of the original data. The reason to do this is to demonstrate that, in principle, the posterior is not always so Gaussian in shape. There’s no trouble with the mean, μ. For a Gaussian likelihood and a Gaussian prior on μ, the posterior distribution is always Gaussian as well, regardless of sample size. It is the standard deviation σ that causes problems. So if you care about σ—often people do not—you do need to be careful of abusing the quadratic approximation.” (McElreath, 2020, p. 86) (pdf)

Example 4.10 : Sample size and the normality of \(\sigma\)’s posterior

R Code 4.32 a: Sample size and the normality of \(\sigma\)’s posterior (Original)

Code
## R code 4.23a ######################################
d3_a <- sample(d2_a$height, size = 20)

## R code 4.24a ######################################
mu2_a.list <- seq(from = 150, to = 170, length.out = 200)
sigma2_a.list <- seq(from = 4, to = 20, length.out = 200)
post2_a <- expand.grid(mu = mu2_a.list, sigma = sigma2_a.list)
post2_a$LL <- sapply(1:nrow(post2_a), function(i) {
  sum(dnorm(d3_a,
    mean = post2_a$mu[i], sd = post2_a$sigma[i],
    log = TRUE
  ))
})
post2_a$prod <- post2_a$LL + dnorm(post2_a$mu, 178, 20, TRUE) +
  dunif(post2_a$sigma, 0, 50, TRUE)
post2_a$prob <- exp(post2_a$prod - max(post2_a$prod))
sample2_a.rows <- sample(1:nrow(post2_a),
  size = 1e4, replace = TRUE,
  prob = post2_a$prob
)
sample2_a.mu <- post2_a$mu[sample2_a.rows]
sample2_a.sigma <- post2_a$sigma[sample2_a.rows]

## define plotting area as one row and two columns
par(mfrow = c(1, 2))
plot(sample2_a.mu, sample2_a.sigma,
  cex = 0.5,
  col = rethinking::col.alpha(rethinking:::rangi2, 0.1),
  xlab = "mu", ylab = "sigma", pch = 16
)

## R code 4.25a ############
rethinking::dens(sample2_a.sigma, 
                 norm.comp = TRUE,
                 col = "red")

Graph 4.16: Sample 20 heights to see that sigma is not Gaussian anymore (Original)

R Code 4.33 b: Sample size and the normality of \(\sigma\)’s posterior (Tidyverse)

Code
set.seed(4)
d3_b <- sample(d2_b$height, size = 20)

n <- 200

# note we've redefined the ranges of `mu` and `sigma`
d3_grid_b <-
  tidyr::crossing(mu3_b    = seq(from = 150, to = 170, length.out = n),
           sigma3_b = seq(from = 4, to = 20, length.out = n))

grid_function3_b <- function(mu, sigma) {
  dnorm(d3_b, mean = mu, sd = sigma, log = T) |> 
    sum()
}

d3_grid_b <-
  d3_grid_b |>  
  dplyr::mutate(log_likelihood3_b = 
         purrr::map2_dbl(mu3_b, sigma3_b, grid_function3_b))  |>  
  dplyr::mutate(prior3_mu_b    = stats::dnorm(mu3_b, mean = 178, sd = 20, log = T),
         prior3_sigma_b = stats::dunif(sigma3_b, min = 0, max = 50, log = T)) |> 
  dplyr::mutate(product3_b = log_likelihood3_b + prior3_mu_b + prior3_sigma_b) |> 
  dplyr::mutate(probability3_b = base::exp(product3_b - base::max(product3_b)))

set.seed(4)

d3_grid_samples_b <- 
  d3_grid_b |> 
  dplyr::slice_sample(n = 1e4, 
           replace = T, 
           weight_by = probability3_b)

plot3_d3_scatterplot <- 
  d3_grid_samples_b |> 
    ggplot2::ggplot(ggplot2::aes(x = mu3_b, y = sigma3_b)) + 
    ggplot2::geom_point(size = 1.8, alpha = 1/15, color = "#8080FF") +
    ggplot2::labs(x = base::expression(mu[samples]),
         y = base::expression(sigma[samples])) +
    ggplot2::theme_bw()

plot3_d3_sigma3_b <- 
  d3_grid_samples_b |> 
    ggplot2::ggplot(ggplot2::aes(x = sigma3_b)) + 
    ggplot2::geom_density(color = "red") +
    ggplot2::stat_function(
        fun = dnorm,
        args = with(d3_grid_samples_b, c(
          mean = mean(sigma3_b), 
          sd = sd(sigma3_b)))
        ) +
    ggplot2::labs(x = expression(sigma),
                  y = "Density") +
    ggplot2::theme_bw()

library(patchwork)
plot3_d3_scatterplot + plot3_d3_sigma3_b

Graph 4.17: Sample 20 heights to see that sigma is not Gaussian anymore (Tidyverse)

Compare the left panel with Graph 4.12 and the right panel with the right panel of Graph 4.15 to see that now \(\sigma\) has a long right tail and does not follow a Gaussian distribution.

“The deep reasons for the posterior of σ tending to have a long right-hand tail are complex. But a useful way to conceive of the problem is that variances must be positive. As a result, there must be more uncertainty about how big the variance (or standard deviation) is than about how small it is.” (McElreath, 2020, p. 86) (pdf)

“For example, if the variance is estimated to be near zero, then you know for sure that it can’t be much smaller. But it could be a lot bigger.” (McElreath, 2020, p. 87) (pdf)

4.3.5 Finding the posterior distribution with quadratic approximation

“To build the quadratic approximation, we’ll use quap(), a command in the {rethinking} package. The quap() function works by using the model definition you were introduced to earlier in this chapter. Each line in the definition has a corresponding definition in the form of R code. The engine inside quap then uses these definitions to define the posterior probability at each combination of parameter values. Then it can climb the posterior distribution and find the peak, its MAP (Maximum A Posteriori estimate). Finally, it estimates the quadratic curvature at the MAP to produce an approximation of the posterior distribution. Remember: This procedure is very similar to what many non-Bayesian procedures do, just without any priors.” (McElreath, 2020, p. 87, parenthesis and emphasis are mine) (pdf)

Procedure 4.4 : Finding the posterior distribution

  1. We start with the Howell1 data frame for adults (age >= 18) (Code 4.26).
  2. We place the R code equivalents of Equation 4.3 into base::alist() (Code 4.27).
  3. We can add some additional options like start values (Code 4.30)
  4. We fit the model to the data of our data frame and store the fitted model (Code 4.28).
  5. Now we can have a look at the posterior distribution (Code 4.29).

Example 4.11 : Finding the posterior distribution

R Code 4.34 a: Finding the posterior distribution with rethinking::quap() (Original)

Code
## R code 4.26a ######################
data(package = "rethinking", list = "Howell1")
d_a <- Howell1
d2_a <- d_a[d_a$age >= 18, ]

## R code 4.27a ######################
flist <- alist(
  height ~ dnorm(mu, sigma),
  mu ~ dnorm(178, 20),
  sigma ~ dunif(0, 50)
)

## R code 4.30a #####################
start <- list(
  mu = mean(d2_a$height),
  sigma = sd(d2_a$height)
)

## R code 4.28a ######################
m4.1a <- rethinking::quap(flist, data = d2_a, start = start)
m4.1a
#> 
#> Quadratic approximate posterior distribution
#> 
#> Formula:
#> height ~ dnorm(mu, sigma)
#> mu ~ dnorm(178, 20)
#> sigma ~ dunif(0, 50)
#> 
#> Posterior means:
#>         mu      sigma 
#> 154.607024   7.731333 
#> 
#> Log-likelihood: -1219.41

The rethinking::quap() function returns a “map” object.

Sometimes I got an error message when computing this code chunk. The reason was that quap() had chosen an inconvenient starting value for its estimation of the posterior. I believe one could visualize the problem with a metaphor: instead of climbing up the hill, quap() started at a value where it got trapped in a narrow valley.

The three parts of rethinking::quap()

  1. A formula or base::alist() of formulas that define the likelihood and priors.
  2. A data frame or list containing the data.
  3. Some options like start values or the method for search optimization (not used here in this example). Note that the list of start values is a regular list(), not an alist() like the formula list is.

How to use formulae in statistical models in R

To learn more about using formulae read “Statistical Models in R” online or chapter 11 in the PDF.

R Code 4.35 a: Printing with rethinking::precis (Original)

Code
## R code 4.29a ######################
(precis_m4.1a <- rethinking::precis(m4.1a))
#>             mean        sd       5.5%      94.5%
#> mu    154.607024 0.4119947 153.948576 155.265471
#> sigma   7.731333 0.2913860   7.265642   8.197024

“These numbers provide Gaussian approximations for each parameter’s marginal distribution. This means the plausibility of each value of μ, after averaging over the plausibilities of each value of σ, is given by a Gaussian distribution with mean 154.6 and standard deviation 0.4. The 5.5% and 94.5% quantiles are percentile interval boundaries, corresponding to an 89% compatibility interval. Why 89%? It’s just the default. It displays a quite wide interval, so it shows a high-probability range of parameter values. If you want another interval, such as the conventional and mindless 95%, you can use precis(m4.1,prob=0.95). But I don’t recommend 95% intervals, because readers will have a hard time not viewing them as significance tests. 89 is also a prime number, so if someone asks you to justify it, you can stare at them meaningfully and incant,”Because it is prime.” That’s no worse justification than the conventional justification for 95%.” (McElreath, 2020, p. 88) (pdf)

When you compare the 89% boundaries with the result of the grid approximation in Example 4.9 you will see that they are almost identical as the posterior is approximately Gaussian.
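If you want to reproduce the 95% interval McElreath mentions in the quote, a minimal sketch (assuming m4.1a from above is still in the workspace) is to pass the prob argument to precis():

Code
## 95% compatibility intervals instead of the 89% default
rethinking::precis(m4.1a, prob = 0.95)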

R Code 4.36 b: Finding the posterior distribution with brms::brm()

Code
## R code 4.26b ######################
data(package = "rethinking", list = "Howell1")
d_b <- Howell1
d2_b <- 
  d_b |> 
  dplyr::filter(age >= 18) 


## R code 4.27b ######################
## R code 4.28b ######################
m4.1b <- 
  brms::brm(
      formula = height ~ 1,                                           # 1
      data = d2_b,                                                    # 2
      family = gaussian(),                                            # 3
      prior = c(brms::prior(normal(178, 20), class = Intercept),      # 4
                brms::prior(uniform(0, 50), class = sigma, ub = 50)), # 4
      iter = 2000,               # 5
      warmup = 1000,             # 6
      chains = 4,                # 7
      cores = 4,                 # 8 
      seed = 4,                  # 9
      file = "brm_fits/m04.01b") # 10
m4.1b
#>  Family: gaussian 
#>   Links: mu = identity; sigma = identity 
#> Formula: height ~ 1 
#>    Data: d2_b (Number of observations: 352) 
#>   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
#>          total post-warmup draws = 4000
#> 
#> Population-Level Effects: 
#>           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> Intercept   154.60      0.41   153.81   155.42 1.00     2763     2635
#> 
#> Family Specific Parameters: 
#>       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> sigma     7.77      0.29     7.21     8.39 1.00     3400     2561
#> 
#> Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
#> and Tail_ESS are effective sample size measures, and Rhat is the potential
#> scale reduction factor on split chains (at convergence, Rhat = 1).

The brms::brm() function returns a “brmsfit” object.

If you don’t want to specify the result in more detail (for instance to change the PI), then you get the same result with brms:::print.brmsfit(m4.1b) and brms:::summary.brmsfit(m4.1b).

Description of the convergence diagnostics Rhat, Bulk_ESS, and Tail_ESS

  • The convergence diagnostics Rhat, Bulk_ESS, and Tail_ESS are described in detail in (Vehtari et al. 2021).
  • The code for the paper is available on Github (https://github.com/avehtari/rhat_ess).
  • Examples and even a larger variety of numerical experiments are available in the online appendix at https://avehtari.github.io/rhat_ess/rhat_ess.html.

brms::brm() has more than 40 arguments, but only three of them (formula, data and prior) are mandatory for our case. All the others have sensible default values. The correspondence of these three arguments to the {rethinking} version is obvious.

In the simple demonstration of brms::brm() in the toy globe example (R Code 2.25), I have just used Kurz’ code lines without any explanation. This time I will explain all 10 arguments from Kurz’ example.


  1. formula describes the relation between dependent and independent variables in the form of a linear model. The left-hand side contains the dependent variable(s), the right-hand side the independent variables. The independent variables are used to calculate the trend component of the linear model; the residuals are then assumed to follow some kind of distribution. When the right-hand side is just ~ 1, the trend component is a single value, e.g. the mean value of the data, i.e. the linear model only has an intercept (StackOverflow). In other words, it is the value the dependent variable is expected to have when the independent variables are zero or have no influence (StackOverflow). The formula y ~ 1 is therefore just a model with a constant (intercept) and no regressor (StackOverflow). More helpful for our case is Kurz’ explanation: “… the intercept of a typical regression model with no predictors is the same as its mean. In the special case of a model using the binomial likelihood, the mean is the probability of a 1 in a given trial, \(\theta\).” (Kurz)
  2. data: A data frame that contains all the variables used in the model.
  3. family: A description of the response distribution and link function to be used in the model. This can be a family function, a call to a family function or a character string naming the family. By default a linear gaussian model is applied. So this line would not have been necessary. There are standard family functions stats::family()that will work with {brms}, but there are also special family functions brms::brmsfamily() that work only for {brms} models. Additionally you can specify custom families for use in brms with the brms::custom_family() function.
  4. prior: The next two lines specify priors for the normal and the uniform distribution. As you can see, this is another place where parts of the model formula are provided for brms::brm() to work. class specifies the parameter class; it defaults to "b" (i.e. population-level, ‘fixed’, effects). (There is also the argument group for grouping factors of group-level effects, not used in this code example.) Besides the "b" class there are classes for the "Intercept" and the population-level standard deviation "sigma". There is also an "sd" class for the standard deviation of group-level effects, and the special case class = "cor" sets the same prior on every correlation matrix. ub = 50 sets the upper bound to 50; there is also lb (lower bound). Both bounds are for parameter restriction, so that population-level effects must fall within a certain interval. lb and ub default to NULL, i.e. there is no restriction.
  5. iter: Number of total iterations per chain (including warmup; defaults to 2000).
  6. warmup: A positive integer specifying number of warmup iterations. This also specifies the number of iterations used for stepsize adaptation, so warmup draws should not be used for inference. The number of warmup should not be larger than iter and the default is \(iter/2\).
  7. chains: Number of Markov chains (defaults to 4).
  8. cores: Number of cores to use when executing the chains in parallel, which defaults to 1 but we recommend setting the mc.cores option to be as many processors as the hardware and RAM allow (up to the number of chains).
  9. seed: The seed for random number generation to make results reproducible. Kurz has always used for set.seed() in other code chunks the chapter number. If NA (the default), Stan will set the seed randomly.
  10. file: Either NULL or a character string. In the latter case, the fitted model object is saved via base::saveRDS() in a file named after the string supplied in file. The .rds extension is added automatically. If the file already exists, brms::brm() will load and return the saved model object instead of refitting the model. Unless you specify the file_refit argument as well, the existing files won’t be overwritten; you have to manually remove the file in order to refit and save the model under an existing file name. The file name is stored in the brmsfit object for later usage. (See the short sketch after this list.)
  • init: Not used here. Within the brm() function, you use the init argument for the start values of the sampler. “If NULL (the default) or "random", Stan will randomly generate initial values for parameters in a reasonable range. If 0, all parameters are initialized to zero on the unconstrained space. This option is sometimes useful for certain families, as it happens that default random initial values cause draws to be essentially constant. Generally, setting init = 0 is worth a try, if chains do not initialize or behave well. Alternatively, init can be a list of lists containing the initial values …” (Help file)
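To illustrate the interplay of file and file_refit mentioned in item 10, here is a minimal sketch (not from Kurz; it simply reuses the model specification and file path from above). With the default file_refit = "never" an existing file is loaded instead of refitting; "on_change" refits only when data or model changed; "always" always refits:

Code
## Sketch: control whether an existing fit file is reused or refitted
m4.1b_refit <- 
  brms::brm(
      formula = height ~ 1,
      data = d2_b,
      family = gaussian(),
      prior = c(brms::prior(normal(178, 20), class = Intercept),
                brms::prior(uniform(0, 50), class = sigma, ub = 50)),
      seed = 4,
      file = "brm_fits/m04.01b",
      file_refit = "on_change")  # or "never" (default) / "always"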

R Code 4.37 b: Print the specified results of the brmsfit object

Code
## R code 4.29b print ######################
brms:::print.brmsfit(m4.1b, prob = .89)

## brms:::summary.brmsfit(m4.1b, prob = .89) # (same result)
#>  Family: gaussian 
#>   Links: mu = identity; sigma = identity 
#> Formula: height ~ 1 
#>    Data: d2_b (Number of observations: 352) 
#>   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
#>          total post-warmup draws = 4000
#> 
#> Population-Level Effects: 
#>           Estimate Est.Error l-89% CI u-89% CI Rhat Bulk_ESS Tail_ESS
#> Intercept   154.60      0.41   153.94   155.25 1.00     2763     2635
#> 
#> Family Specific Parameters: 
#>       Estimate Est.Error l-89% CI u-89% CI Rhat Bulk_ESS Tail_ESS
#> sigma     7.77      0.29     7.33     8.26 1.00     3400     2561
#> 
#> Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
#> and Tail_ESS are effective sample size measures, and Rhat is the potential
#> scale reduction factor on split chains (at convergence, Rhat = 1).

Description of the convergence diagnostics Rhat, Bulk_ESS, and Tail_ESS

  • The convergence diagnostics Rhat, Bulk_ESS, and Tail_ESS are described in detail in (Vehtari et al. 2021).
  • The code for the paper is available on Github (https://github.com/avehtari/rhat_ess).
  • Examples and even a larger variety of numerical experiments are available in the online appendix at https://avehtari.github.io/rhat_ess/rhat_ess.html.

brms:::summary.brmsfit() results in the same output as brms:::print.brmsfit(). Both return a brmssummary object. But there are some internal differences:

There is also a print() method that prints the same summary stats but removes the extra formatting used for printing tibbles and returns the fitted model object itself. The print() method may also be faster than summary() because it is designed to only compute the summary statistics for the variables that will actually fit in the printed output, whereas summary() will compute them for all of the specified variables in order to be able to return them to the user.

Using print.brmsfit() or summary.brmsfit() defaults to 95% intervals. As {rethinking} defaults to 89% intervals, I have changed the prob parameter of the print method also to 89%.

WATCH OUT! Using three colons to address generic functions of S3 classes

As I have recently learned: print() and summary() are generic functions to which one can add new printing or summary methods for new classes. In this case class(m4.1b) is brmsfit. This means I do not need to add brms:: to make sure I get the {brms} printing or summary method, even though I didn’t load the {brms} package. Quite the contrary: Adding brms:: would result in the message: “Error: ‘summary’ is not an exported object from ‘namespace:brms’”.

As I really want to specify explicitly the method these generic functions should use, I need to use the syntax with three colons, like brms:::print.brmsfit() or brms:::summary.brmsfit() respectively.
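A minimal toy sketch of S3 dispatch (the class name myclass and the object obj below are made up for illustration): print() is a generic, and R picks the method based on the class of its first argument.

Code
## print() is an S3 generic; R dispatches on the class of the object
obj <- structure(list(x = 1), class = "myclass")
print.myclass <- function(x, ...) cat("custom print method for 'myclass'\n")
print(obj)       # dispatches to print.myclass()
class(m4.1b)     # "brmsfit", so print(m4.1b) dispatches to print.brmsfit()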

Learning more about S3 classes in R

In this respect I have to learn more about S3 classes. There are many important web resources about this subject that I have found with the search string “r what is s3 class”. Maybe I should start with the S3 chapter in Advanced R.

For the interpretation of this output I am going to use the explanation in the How to use brms section of the {brms} GitHub page.

  1. Top: On the top of the output, some general information on the model is given, such as family, formula, number of iterations and chains.
  2. Upper Middle: If the data were grouped the next part would display group-level effects separately for each grouping factor in terms of standard deviations and (in case of more than one group-level effect per grouping factor) correlations between group-level effects. (This part is absent above as there are no grouping factors.)
  3. Lower Middle: Next follows the display of the population-level effects (i.e. regression coefficients). If incorporated, autocorrelation effects and family specific parameters (e.g., the residual standard deviation ‘sigma’ in normal models) are also given. In general, every parameter is summarized using the mean (Estimate) and the standard deviation (Est.Error) of the posterior distribution as well as two-sided 95% credible intervals (l-95% CI and u-95% CI) based on quantiles. The last three values (Bulk_ESS, Tail_ESS, and Rhat) provide information on how well the algorithm could estimate the posterior distribution of this parameter. If Rhat is considerably greater than 1, the algorithm has not yet converged and it is necessary to run more iterations and / or set stronger priors.
  4. Bottom: The last part is a short explanation of the sampling procedure. NUTS stands for No-U-Turn Sampler and is a Hamiltonian Monte Carlo (HMC) method. It is still a Markov chain Monte Carlo algorithm, but it avoids the random-walk behavior of simpler samplers, which is often deemed inefficient and slow to converge. Instead of taking random-walk steps, NUTS simulates a trajectory whose length doubles repeatedly as the algorithm continues to run. This happens until the trajectory reaches a point where it would start to turn back toward its starting point.

R Code 4.38 b: Print a Stan like summary

Code
## R code 4.29b stan like summary ######################
m4.1b$fit
#> Inference for Stan model: anon_model.
#> 4 chains, each with iter=2000; warmup=1000; thin=1; 
#> post-warmup draws per chain=1000, total post-warmup draws=4000.
#> 
#>                 mean se_mean   sd     2.5%      25%      50%      75%    97.5%
#> b_Intercept   154.60    0.01 0.41   153.81   154.31   154.59   154.88   155.42
#> sigma           7.77    0.01 0.29     7.21     7.57     7.76     7.97     8.39
#> lprior         -8.51    0.00 0.02    -8.56    -8.53    -8.51    -8.50    -8.46
#> lp__        -1227.04    0.02 0.97 -1229.63 -1227.44 -1226.75 -1226.33 -1226.06
#>             n_eff Rhat
#> b_Intercept  2778    1
#> sigma        3404    1
#> lprior       2773    1
#> lp__         1798    1
#> 
#> Samples were drawn using NUTS(diag_e) at Mon Dec  4 22:27:38 2023.
#> For each parameter, n_eff is a crude measure of effective sample size,
#> and Rhat is the potential scale reduction factor on split chains (at 
#> convergence, Rhat=1).

Kurz refers to RStan: the R interface to Stan for a detailed description. I didn’t study the extensive documentation but I found out two items:

  • lp__ is the logarithm of the (unnormalized) posterior density as calculated by Stan. This log density can be used in various ways for model evaluation and comparison.
  • Rhat estimates the degree of convergence of a Markov chain based on the stability of outcomes between and within chains of the same length. Values close to one indicate convergence to the underlying distribution. Values greater than 1.1 indicate inadequate convergence.

R Code 4.39 b: Plot the visual chain diagnostics of the brmsfit object

Code
## R code 4.29b plot ######################
brms:::plot.brmsfit(m4.1b)

Graph 4.18: Plot model m4.1b

After running a model fit with HMC, it’s a good idea to inspect the chains. As we’ll see, McElreath covers visual chain diagnostics in Chapter 9. … If you want detailed diagnostics for the HMC chains, call launch_shinystan(m4.1b). (Kurz)

Just a quick preview: It is important that, over the whole length of the samples (not counting the warmup, in our case 1,000 iterations), the width of the trace stays roughly constant and that all (four) chains reach both the top and the bottom of the plot. Unfortunately, looking at the (small) graph is not always conclusive. (See also the tab “trace plot” in Example 4.19.)

Learn more about {shinystan}

I haven’t applied launch_shinystan(m4.1b) as it takes a lot of time and I do not (yet) understand the detailed report anyway. To learn to work with {shinystan} see the ShinyStan website as well as the vignettes of the R package (Deploying to shinyapps.io, Getting Started) and its documentation.

Package documentation of Stan and friends

Ooops, this opens up Pandora’s box!

I do not even understand completely what the different packages do. What follows is a first try where I copied from the documentation pages:

{rstanarm} is an R package that emulates other R model-fitting functions but uses Stan (via the {rstan} package) for the back-end estimation. The primary target audience is people who would be open to Bayesian inference if using Bayesian software were easier but would use frequentist software otherwise.

RStan is the R interface to Stan. It is distributed on CRAN as the {rstan} package and its source code is hosted on GitHub.

Stan is a state-of-the-art platform for statistical modeling and high-performance statistical computation.

The next example shows the effect of a very narrow \(\mu\) prior. Instead of a standard deviation of 20 we provide only a standard deviation of 0.1.

Example 4.12 : The same model but with a more informative, e.g., very narrow \(\mu\) prior

R Code 4.40 a: Model with a very narrow \(\mu\) prior (Original)

Code
## R code 4.27a ######################
flist <- alist(
  height ~ dnorm(mu, sigma),
  mu ~ dnorm(178, .1),
  sigma ~ dunif(0, 50)
)

## R code 4.30a #####################
start <- list(
  mu = mean(d2_a$height),
  sigma = sd(d2_a$height)
)

## R code 4.28a ######################
m4.2a <- rethinking::quap(flist, data = d2_a, start = start)
rethinking::precis(m4.2a)
#>            mean        sd      5.5%    94.5%
#> mu    177.86382 0.1002354 177.70363 178.0240
#> sigma  24.51934 0.9290875  23.03448  26.0042

“Notice that the estimate for μ has hardly moved off the prior. The prior was very concentrated around 178. So this is not surprising. But also notice that the estimate for σ has changed quite a lot, even though we didn’t change its prior at all. Once the golem is certain that the mean is near 178—as the prior insists—then the golem has to estimate σ conditional on that fact. This results in a different posterior for σ, even though all we changed is prior information about the other parameter.” (McElreath, 2020, p. 89) (pdf)
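To see McElreath’s point about \(\sigma\) directly, a small sketch (assuming m4.1a and m4.2a are both still in the workspace) is to print the two precis() summaries and compare their sigma rows:

Code
## Compare sigma under the broad mu prior (m4.1a) and the very narrow
## mu prior (m4.2a); only the prior for mu changed between the models
rethinking::precis(m4.1a)
rethinking::precis(m4.2a)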

R Code 4.41 b: Model with a very narrow \(\mu\) prior (brms)

Code
m4.2b <- 
  brms::brm(
      formula = height ~ 1,                               
      data = d2_b,                                                   
      family = gaussian(),      
      prior = c(brms::prior(normal(178, 0.1), class = Intercept),
                brms::prior(uniform(0, 50), class = sigma, ub = 50)),
      iter = 2000, warmup = 1000, chains = 4, cores = 4,
      seed = 4,
      file = "brm_fits/m04.02b")

base::rbind(
      brms:::summary.brmsfit(m4.1b, prob = .89)$fixed,
      brms:::summary.brmsfit(m4.2b, prob = .89)$fixed
      )
#>            Estimate Est.Error l-89% CI u-89% CI     Rhat Bulk_ESS Tail_ESS
#> Intercept  154.5959 0.4144632 153.9365 155.2540 1.000708 2762.982 2635.159
#> Intercept1 177.8647 0.1052528 177.6973 178.0278 1.001224 2720.796 2586.491

Subsetting the brmssummary object returned by summary.brmsfit() with $fixed provides a convenient way to compare the Intercept summaries of m4.1b and m4.2b.

4.3.6 Sampling

4.3.6.1 Sampling from rethinking::quap()

How do we get samples from the quadratic approximate posterior distribution?

“… a quadratic approximation to a posterior distribution with more than one parameter dimension—μ and σ each contribute one dimension—is just a multi-dimensional Gaussian distribution.” (McElreath, 2020, p. 90) (pdf)

“As a consequence, when R constructs a quadratic approximation, it calculates not only standard deviations for all parameters, but also the covariances among all pairs of parameters. Just like a mean and standard deviation (or its square, a variance) are sufficient to describe a one-dimensional Gaussian distribution, a list of means and a matrix of variances and covariances are sufficient to describe a multi-dimensional Gaussian distribution.” (McElreath, 2020, p. 90) (pdf)

WATCH OUT! Rethinking and Tidyverse (brms) Codes in separate examples

As there are quite big differences in the calculation of the variance-covariance matrix, I will explain the appropriate steps in different examples.

Example 4.13 a: Sampling from a quap() (Original)

R Code 4.42 a: Variance-covariance matrix (Original)

Code
## R code 4.32a vcov #########
rethinking::vcov(m4.1a)
#>                 mu        sigma
#> mu    0.1697396109 0.0002180307
#> sigma 0.0002180307 0.0849058224

“The above is a variance-covariance matrix. It is the multi-dimensional glue of a quadratic approximation, because it tells us how each parameter relates to every other parameter in the posterior distribution. A variance-covariance matrix can be factored into two elements: (1) a vector of variances for the parameters and (2) a correlation matrix that tells us how changes in any parameter lead to correlated changes in the others. This decomposition is usually easier to understand.” (McElreath, 2020, p. 90) (pdf)

vcov() returns the variance-covariance matrix of the main parameters of a fitted model object. The {rethinking} package provides its own methods for this generic, both for class map2stan (fitted Stan models) and for class map, which is the class of m4.1a.

WATCH OUT! Two different vcov() functions

I am explicitly using the package {rethinking} for the vcov() function. There is also the base R generic stats::vcov(). But calling it generates an error because it knows no method for an object of class map from the {rethinking} package: the help file for stats::vcov() only lists S3 methods for the classes lm, glm, mlm and aov, but not for map.

Error in UseMethod(“vcov”) : no applicable method for ‘vcov’ applied to an object of class “map”

I could have used plain vcov(). But this only works when the {rethinking} package is already loaded. In that case R knows, from the class of the object, which vcov() method to use; here class(m4.1a) returns map.

R Code 4.43 a: List of variances (Original)

Code
## R code 4.33a var adapted ##########
(var_list <- base::diag(rethinking::vcov(m4.1a)))
#>         mu      sigma 
#> 0.16973961 0.08490582

“The two-element vector in the output is the list of variances. If you take the square root of this vector, you get the standard deviations that are shown in rethinking::precis() (Example 4.11) output.” (McElreath, 2020, p. 90) (pdf)

Let’s check this out:

Convert the list of variances to standard deviations and compare with the precis() result:

Result                          mu sd       sigma sd
sqrt(base::unname(var_list))    0.4119947   0.291386
precis_m4.1a[["sd"]]            0.4119947   0.291386
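A short sketch of that check in code (var_list and precis_m4.1a were created above):

Code
## Square roots of the variances should match the sd column of precis()
sqrt(base::unname(var_list))
precis_m4.1a[["sd"]]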

R Code 4.44 a: Correlation matrix (Original)

Code
## R code 4.33a cor #############
stats::cov2cor(rethinking::vcov(m4.1a))
#>                mu       sigma
#> mu    1.000000000 0.001816174
#> sigma 0.001816174 1.000000000

“The two-by-two matrix in the output is the correlation matrix. Each entry shows the correlation, bounded between −1 and +1, for each pair of parameters. The 1’s indicate a parameter’s correlation with itself. If these values were anything except 1, we would be worried. The other entries are typically closer to zero, and they are very close to zero in this example. This indicates that learning μ tells us nothing about σ and likewise that learning σ tells us nothing about μ. This is typical of simple Gaussian models of this kind. But it is quite rare more generally, as you’ll see in later chapters.” (McElreath, 2020, p. 90) (pdf)

a: Compute covariance matrix from the correlation using base::sweep() (Original)

Code
R <- stats::cov2cor(rethinking::vcov(m4.1a))
S <- sqrt(base::diag(rethinking::vcov(m4.1a)))

sweep(sweep(R, 1, S, "*"), 2, S, "*")
#>                 mu        sigma
#> mu    0.1697396109 0.0002180307
#> sigma 0.0002180307 0.0849058224

I wondered how to compute the correlation matrix by hand from the variance-covariance matrix. I thought I had to use sqrt(), but that alone didn’t work. After I inspected the code of the cov2cor() function I noticed that it uses the expression sqrt(1/diag(V)).

From the stats::cor() help file:

Scaling a covariance matrix into a correlation one can be achieved in many ways, mathematically most appealing by multiplication with a diagonal matrix from left and right, or more efficiently by using base::sweep(.., FUN = "/") twice. The stats::cov2cor() function is even a bit more efficient, and provided mostly for didactical reasons.

For computing the covariance matrix with base::sweep() see the answer in StackOverflow.
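Going the other way round, here is a minimal sketch of computing the correlation matrix by hand from the variance-covariance matrix, mirroring what stats::cov2cor() does internally:

Code
## Correlation = D %*% V %*% D, where D is the diagonal matrix with
## entries 1 / standard deviation (i.e. sqrt(1 / diag(V)))
V <- rethinking::vcov(m4.1a)
D <- base::diag(base::sqrt(1 / base::diag(V)))
D %*% V %*% D    # should reproduce stats::cov2cor(V)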

R Code 4.45 a: Samples from the multi-dimensional posterior (Original)

Code
## R code 4.34a ########################
set.seed(4)
post3_a <- rethinking::extract.samples(m4.1a, n = 1e4)

bayr::as_tbl_obs(post3_a)
Data set with 3 variables, showing 8 of 10000 observations.
Obs mu sigma
1349 153.8802 7.432261
2281 154.8268 7.526618
3240 154.1099 7.540981
3418 154.1668 7.634100
4467 155.0349 7.321992
6128 154.0472 7.421992
7317 154.7729 8.097497
7892 154.1290 7.631920

“You end up with a data frame, post, with 10,000 (1e4) rows and two columns, one column for μ and one for σ. Each value is a sample from the posterior, so the mean and standard deviation of each column will be very close to the MAP values from before.” (McElreath, 2020, p. 91) (pdf)

R Code 4.46 a: Summary from the samples of the multi-dimensional posterior (Original)

Code
## R code 4.35a precis ##################
rethinking::precis(post3_a)
#>            mean        sd       5.5%      94.5%     histogram
#> mu    154.61196 0.4106276 153.960323 155.272265       ▁▁▅▇▂▁▁
#> sigma   7.73298 0.2936393   7.267091   8.199897 ▁▁▁▁▂▅▇▇▃▁▁▁▁

Compare these values to the output from the summaries with rethinking::precis() in Example 4.11.

R Code 4.47 a: Plot the samples distribution from the multi-dimensional posterior (Original)

Code
## plot sample posterior a  ##################
plot(post3_a, col = rethinking::col.alpha(rethinking:::rangi2, 0.1))

Graph 4.19: Samples distribution from the multi-dimensional posterior

“see how much they resemble the samples from the grid approximation in Graph 4.11. These samples also preserve the covariance between \(\mu\) and \(\sigma\). This hardly matters right now, because \(\mu\) and \(\sigma\) don’t covary at all in this model. But once you add a predictor variable to your model, covariance will matter a lot.” (McElreath, 2020, p. 91) (pdf)

R Code 4.48 a: Extract samples from the vectors of values from a multi-dimensional Gaussian distribution with MASS::mvrnorm() and plot the result (Original)

Code
## R code 4.36a ######################
post4_a <- MASS::mvrnorm(n = 1e4, mu = rethinking::coef(m4.1a), 
                      Sigma = rethinking::vcov(m4.1a))

plot(post4_a, col = rethinking::col.alpha(rethinking:::rangi2, 0.1))

Graph 4.20: Samples distribution from a multi-dimensional Gaussian distribution with MASS::mvrnorm()

The function rethinking::extract.samples() in the “Sample1” tab is for convenience. It is just running a simple simulation of the sort you conducted near the end of Chapter 3 with R Code 3.52.

Under the hood the work of rethinking::extract.samples() is done by a multi-dimensional version of stats::rnorm(), MASS::mvrnorm(). The function stats::rnorm() simulates random Gaussian values, while MASS::mvrnorm() simulates random vectors of multivariate Gaussian values.

How to interpret covariances?

A large covariance can mean a strong relationship between variables. However, you can’t compare covariances across data sets with different scales (like pounds and inches). A weak covariance in one data set may be a strong one in a different data set with different scales.

The main problem with interpretation is that the wide range of values the covariance can take makes it hard to interpret. For example, your data set could return a value of 3, or 3,000. This wide range of values is caused by a simple fact: The larger the X and Y values, the larger the covariance. A value of 300 tells us that the variables are correlated, but unlike the correlation coefficient, that number doesn’t tell us exactly how strong that relationship is. The problem can be fixed by dividing the covariance by the product of the two standard deviations to get the correlation coefficient. (Statistics How To)
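A quick sketch of that last point with the posterior samples post3_a from above: dividing the covariance by the product of the two standard deviations gives the correlation coefficient.

Code
## Covariance scaled by both standard deviations equals the correlation
with(post3_a, cov(mu, sigma) / (sd(mu) * sd(sigma)))
with(post3_a, cor(mu, sigma))   # same value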

WATCH OUT! Confusing m4.1a with m4.3a

Frankly speaking, I had trouble understanding why the correlation in m4.1a is almost 0. It turned out that I unconsciously had a correlation between height and weight in mind, an issue that is raised later in the chapter with m4.3a.

Although a multi-dimensional Gaussian distribution is under discussion here, the only correlation in question is the one between \(\mu\) and \(\sigma\) of the height distribution. \(\mu\) of the height distribution does not help you in the estimation of \(\sigma\) of this distribution – and vice versa.

4.3.6.2 Sampling from a brms::brm() fit

In contrast to the {rethinking} approach, the {brms} package doesn’t seem to have the same convenience functions, and therefore we have to use different workarounds to get the same results. The best strategy for equivalent output is to put the Hamiltonian Monte Carlo (HMC) chains in a data frame and then apply the appropriate functions.

Example 4.14 b: Sampling with as_draw() functions from a brms::brm() fit

R Code 4.49 b: Variance-covariance matrix (Tidyverse)

Code
brms:::vcov.brmsfit(m4.1b)
#>           Intercept
#> Intercept 0.1717797

The vcov() method for brmsfit objects returns only the first element of the matrix we got with rethinking::vcov(). That is, it appears brms:::vcov.brmsfit() only returns the variance/covariance matrix for the single-level \(\beta\) parameters.

If we want the same information as with rethinking::vcov(), we have to put the HMC chains in a data frame with the brms::as_draws_df() function as shown in the next tab “draws”.

R Code 4.50 b: Extract the iterations of the Hamiltonian Monte Carlo (HMC) chains into a data frame (Tidyverse)

Code
set.seed(4)
post_b <- brms::as_draws_df(m4.1b)

bayr::as_tbl_obs(post_b)
Data set with 8 variables, showing 8 of 4000 observations.
Obs b_Intercept sigma lprior lp__ .chain .iteration .draw
71 154.9925 7.868132 -8.488375 -1226.553 1 71 71
587 154.3334 7.626869 -8.526828 -1226.342 1 587 587
684 154.8032 8.290861 -8.499311 -1227.720 1 684 684
1528 154.4008 7.591280 -8.522846 -1226.302 2 528 1528
1795 155.7369 7.777169 -8.446249 -1229.762 2 795 1795
2038 154.7231 7.355423 -8.503959 -1227.029 3 38 2038
2419 154.5437 7.982696 -8.514439 -1226.377 3 419 2419
2867 154.8770 7.750134 -8.495038 -1226.252 3 867 2867

Family of as_draws() functions and the {posterior} package

The functions of the as_draws() family transform brmsfit objects to draws objects, a format supported by the {posterior} package. {brms} currently imports the family of as_draws() functions from the {posterior} package, a tool for working with posterior distributions, i.e. with the output of Bayesian models. (See as an introduction The posterior R package.)


It is also noteworthy that as_draws_df() is part of a larger class of as_draws() functions that {brms} currently imports from the {posterior} package.

R Code 4.51 b: Class of as_draws() functions {brms}

Code
class(post_b)
#> [1] "draws_df"   "draws"      "tbl_df"     "tbl"        "data.frame"

Besides the classes tbl_df and tbl, subclasses of data.frame with different behavior, the as_draws_df() function has created the draws_df class and the draws class, the parent class of all supported draws formats.

b: Vector of variances for b_Intercept and \(\sigma\) (Tidyverse)

Code
dplyr::select(post_b, b_Intercept:sigma) |>
  stats::cov() |>
  base::diag()
#> Warning: Dropping 'draws_df' class as required metadata was removed.
#> b_Intercept       sigma 
#>  0.17177971  0.08681616

This result is the equivalent of the list of variances base::diag(rethinking::vcov(m4.1a)) in Example 4.13.

R Code 4.52 b: Covariance matrix (Tidyverse)

Code
brms:::vcov.brmsfit(m4.1b)
#>           Intercept
#> Intercept 0.1717797

R Code 4.53 b: Correlation matrix (Tidyverse)

Code
post_b |>
  dplyr::select(b_Intercept, sigma) |>
  stats::cor()
#> Warning: Dropping 'draws_df' class as required metadata was removed.
#>              b_Intercept        sigma
#> b_Intercept  1.000000000 -0.004462902
#> sigma       -0.004462902  1.000000000

R Code 4.54 b: Structure of the extracted HMC draws (Tidyverse)

Code
utils::str(post_b)
#> draws_df [4,000 × 7] (S3: draws_df/draws/tbl_df/tbl/data.frame)
#>  $ b_Intercept: num [1:4000] 155 155 154 154 155 ...
#>  $ sigma      : num [1:4000] 7.55 7.02 7.58 7.95 7.63 ...
#>  $ lprior     : num [1:4000] -8.49 -8.49 -8.52 -8.52 -8.52 ...
#>  $ lp__       : num [1:4000] -1227 -1230 -1226 -1226 -1226 ...
#>  $ .chain     : int [1:4000] 1 1 1 1 1 1 1 1 1 1 ...
#>  $ .iteration : int [1:4000] 1 2 3 4 5 6 7 8 9 10 ...
#>  $ .draw      : int [1:4000] 1 2 3 4 5 6 7 8 9 10 ...

The post_b object is not just a data frame, but also of class draws_df, which means it contains three metadata variables (.chain, .iteration, and .draw) which are often hidden from view, but are there in the background when needed. As you’ll see, we’ll make good use of the .draw variable in the future. Notice how our post data frame also includes a vector named lp__. That’s the log posterior. (Kurz, Sec. 4.3.6)

For details, see:

  • the Log-Posterior (function and gradient) section of the Stan Development Team’s (2023) vignette RStan: the R interface to Stan, and
  • Stephen Martin’s explanation of the log posterior on the Stan Forums.

R Code 4.55 b: Summarize the extracted iterations of the HMC chains: base() version

Code
base::summary(post_b[, 1:2])
#> Warning: Dropping 'draws_df' class as required metadata was removed.
#>   b_Intercept        sigma      
#>  Min.   :153.3   Min.   :6.841  
#>  1st Qu.:154.3   1st Qu.:7.567  
#>  Median :154.6   Median :7.757  
#>  Mean   :154.6   Mean   :7.771  
#>  3rd Qu.:154.9   3rd Qu.:7.971  
#>  Max.   :156.2   Max.   :8.864

b: Summarize the extracted iterations of the HMC chains: {posterior} version

Code
posterior:::summary.draws(post_b[, 1:2])
#> Warning: Dropping 'draws_df' class as required metadata was removed.
#> # A tibble: 2 × 10
#>   variable      mean median    sd   mad     q5    q95  rhat ess_bulk ess_tail
#>   <chr>        <dbl>  <dbl> <dbl> <dbl>  <dbl>  <dbl> <dbl>    <dbl>    <dbl>
#> 1 b_Intercept 155.   155.   0.414 0.421 154.   155.    1.00    2722.    2624.
#> 2 sigma         7.77   7.76 0.295 0.301   7.31   8.27  1.00    3369.    2537.

R Code 4.56 b: Summarize with skimr::skim()

Code
skimr::skim(post_b[, 1:2])
#> Warning: Dropping 'draws_df' class as required metadata was removed.
Data summary
Name post_b[, 1:2]
Number of rows 4000
Number of columns 2
_______________________
Column type frequency:
numeric 2
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
b_Intercept 0 1 154.60 0.41 153.29 154.31 154.59 154.88 156.22 ▁▆▇▂▁
sigma 0 1 7.77 0.29 6.84 7.57 7.76 7.97 8.86 ▁▆▇▂▁

Create summaries of samples that include tiny histograms

Kurz didn’t mention my suggestion to use skimr::skim() to get tiny histograms as part of the summary but proposes several other methods:

  • A base R approach by using the transpose of a stats::quantile() call nested within base::apply()
  • A {tidyverse} approach
  • A {brms} approach by just putting the brm() fit object into posterior_summary() (see the short sketch after this list)
  • A {tidybayes} approach using tidybayes::mean_hdi() if you’re willing to drop the posterior sd and
  • Using additionally the function histospark() (from Hadley Wickham’s unfinished {precis} package, intended to replace base::summary()) to get the tiny histograms and add them to the tidyverse approach.
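As one illustration of the approaches listed above, the {brms} route is simply to pass the fit object to brms::posterior_summary() (a compact summary without tiny histograms):

Code
## Estimate, Est.Error and 2.5% / 97.5% quantiles for all parameters
brms::posterior_summary(m4.1b)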

4.4 Linear prediction

“What we’ve done above is a Gaussian model of height in a population of adults. But it doesn’t really have the usual feel of “regression” to it. Typically, we are interested in modeling how an outcome is related to some other variable, a predictor variable. If the predictor variable has any statistical association with the outcome variable, then we can use it to predict // the outcome. When the predictor variable is built inside the model in a particular way, we’ll have linear regression.” (McElreath, 2020, p. 91/92) (pdf)

Example 4.15 : Scatterplot of adult height versus weight

R Code 4.57 a: Scatterplot of adult height versus weight (Original)

Code
## R code 4.37a #####################
plot(d2_a$height ~ d2_a$weight)

Graph 4.21: Adult height and weight against one another

R Code 4.58 b: Scatterplot of adult height versus weight (Tidyverse)

Code
## R code 4.37b #####################
d2_b |> 
    ggplot2::ggplot(ggplot2::aes(x = weight, y = height)) + 
    ggplot2::geom_point() +
    ggplot2::theme_bw()

Graph 4.22: Adult height and weight against one another

There’s obviously a relationship: Knowing a person’s weight helps to predict height.

“To make this vague observation into a more precise quantitative model that relates values of weight to plausible values of height, we need some more technology. How do we take our Gaussian model from Section 4.3 and incorporate predictor variables?” (McElreath, 2020, p. 92) (pdf)

4.4.1 The linear model strategy

4.4.1.1 Model definition

“The linear model strategy instructs the golem to assume that the predictor variable has a constant and additive relationship to the mean of the outcome. The golem then computes the posterior distribution of this constant relationship.” (McElreath, 2020, p. 92) (pdf)

“For each combination of values, the machine computes the posterior probability, which is a measure of relative plausibility, given the model and data. So the posterior distribution ranks the infinite possible combinations of parameter values by their logical plausibility.” (McElreath, 2020, p. 92) (pdf)

Example 4.16 : Linear model definition: Height against weight

: Define the linear heights model

\[ \begin{align*} h_{i} \sim \operatorname{Normal}(\mu, \sigma) \space \space (1) \\ \mu \sim \operatorname{Normal}(178, 20) \space \space (2) \\ \sigma \sim \operatorname{Uniform}(0, 50) \space \space (3) \end{align*} \tag{4.4}\]

Remember Equation 4.3:

  1. Likelihood: Represented by the first line.
  2. Mean prior: Second line is the chosen \(\mu\) (mu, mean) prior. It is a broad Gaussian prior, centered on 178 cm, with 95% of probability between 178 ± 40 cm.
  3. Standard deviation prior: Third line is the chosen \(\sigma\) (sigma, standard deviation) prior.

: Define the linear model heights against weights (V1)

\[ \begin{align*} h_{i} \sim \operatorname{Normal}(μ_{i}, σ) \space \space (1) \\ μ_{i} = \alpha + \beta(x_{i}-\overline{x}) \space \space (2) \\ \alpha \sim \operatorname{Normal}(178, 20) \space \space (3) \\ \beta \sim \operatorname{Normal}(0,10) \space \space (4) \\ \sigma \sim \operatorname{Uniform}(0, 50) \space \space (5) \end{align*} \tag{4.5}\]

Compare the differences with definition inside the Tab “only height”.

  1. Likelihood (Probability of the data): The first line is nearly identical to before, except now there is a little index \(i\) on the \(\mu\) as well as on the \(h\). You can read \(h_{i}\) as “each height” and \(\mu_{i}\) as “each \(\mu\)”. The mean \(\mu\) now depends upon unique values on each row \(i\). So the little \(i\) on \(\mu_{i}\) indicates that the mean depends upon the row.

  2. Linear model: The mean \(\mu\) is no longer a parameter to be estimated. Rather, as seen in the second line of the model, \(\mu_{i}\) is constructed from other parameters, \(\alpha\) and \(\beta\), and the observed variable \(x\). This line is not a stochastic relationship (there is no ~ in it, but rather an =), because the definition of \(\mu_{i}\) is deterministic. That is to say that, once we know \(\alpha\) and \(\beta\) and \(x_{i}\), we know \(\mu_{i}\) with certainty. (More details follow in Section 4.4.1.2.)

  3. Priors: Lines (3), (4) and (5) define the \(\alpha, \beta, \sigma\) priors. The remaining lines in the model define distributions for the unobserved variables. These variables are commonly known as parameters, and their distributions as priors. There are three parameters: \(\alpha, \beta, \sigma\). You’ve seen priors for \(\alpha\) and \(\sigma\) before, although \(\alpha\) was called \(\mu\) back then. (More details in Section 4.4.1.3)

4.4.1.2 Linear model

“The value \(x_{i}\) in the second line of 4.5 is just the weight value on row \(i\). It refers to the same individual as the height value, \(h_{i}\), on the same row. The parameters \(\alpha\) and \(\beta\) are more mysterious. Where did they come from? We made them up. The parameters \(\mu\) and \(\sigma\) are necessary and sufficient to describe a Gaussian distribution. But \(\alpha\) and \(\beta\) are instead devices we invent for manipulating \(\mu\), allowing it to vary systematically across cases in the data.” (McElreath, 2020, p. 93) (pdf)

The second line \(μ_{i} = \alpha + \beta(x_{i}-\overline{x})\) of the model definition in Equation 4.5 tells us

“that you are asking two questions about the mean of the outcome:

  1. What is the expected height when \(x_{i} = \overline{x}\)? The parameter \(\alpha\) answers this question, because when \(x_{i} = \overline{x}\), \(\mu_{i} = \alpha\). For this reason, \(\alpha\) is often called the intercept. But we should think not in terms of some abstract line, but rather in terms of the meaning with respect to the observable variables.
  2. What is the change in expected height, when \(x_{i}\) changes by 1 unit? The parameter \(\beta\) answers this question. It is often called a slope, again because of the abstract line. Better to think of it as a rate of change in expectation.

Jointly these two parameters ask the golem to find a line that relates x to h, a line that passes through α when \(x_{i} = \overline{x}\) and has slope \(\beta\).” (McElreath, 2020, p. 94) (pdf)

4.4.1.3 Priors

“The prior for \(\beta\) in Equation 4.5 deserves explanation. Why have a Gaussian prior with mean zero? This prior places just as much probability below zero as it does above zero, and when \(\beta = 0\), // weight has no relationship to height. To figure out what this prior implies, we have to simulate the prior predictive distribution.” (McElreath, 2020, p. 94/95) (pdf)

The goal in Example 4.17 is to simulate heights from the model, using only the priors.

Example 4.17 : Simulating heights from the model, using only the priors

R Code 4.59 a: Simulating heights from the model, using only the priors (Original)

Code
## range of weight values to simulate 
## R code 4.38a #####################
N_100_a <- 100 # 100 lines

## set seed for exact reproduction
set.seed(2971)

## simulate lines implied by the priors for alpha and beta
a <- rnorm(N_100_a, 178, 20)
b <- rnorm(N_100_a, 0, 10)


## R code 4.39a #####################
plot(NULL,
  xlim = range(d2_a$weight), ylim = c(-100, 400),
  xlab = "weight", ylab = "height"
)

## added reference line for 0 and biggest man ever
abline(h = 0, lty = 2)
abline(h = 272, lty = 1, lwd = 0.5)
mtext("b ~ dnorm(0,10)")
xbar <- mean(d2_a$weight)
for (i in 1:N_100_a) {
  curve(a[i] + b[i] * (x - xbar),
    from = min(d2_a$weight), to = max(d2_a$weight), add = TRUE,
    col = rethinking::col.alpha("black", 0.2)
  )
}

Graph 4.23: Simulating heights from the model, using only the priors (Original)

R Code 4.60 b: Simulating heights from the model, using only the priors (Tidyverse)

Code
set.seed(2971)
# how many lines would you like?
n_lines <- 100

lines <-
  tibble::tibble(n = 1:n_lines,
         a = stats::rnorm(n_lines, mean = 178, sd = 20),
         b = stats::rnorm(n_lines, mean = 0, sd = 10)) |> 
  tidyr::expand_grid(weight = base::range(d2_b$weight)) |> 
  dplyr::mutate(height = a + b * (weight - base::mean(d2_b$weight)))


lines |> 
  ggplot2::ggplot(ggplot2::aes(x = weight, y = height, group = n)) +
  ggplot2::geom_hline(yintercept = c(0, 272), linetype = 2:1, linewidth = 1/3) +
  ggplot2::geom_line(alpha = 1/10) +
  ggplot2::coord_cartesian(ylim = c(-100, 400)) +
  ggplot2::ggtitle("b ~ dnorm(0, 10)") +
  ggplot2::theme_classic()

Graph 4.24: Simulating heights from the model, using only the priors (Tidyverse)

The dashed line are reference lines. One at zero—no one is shorter than zero—and one at 272 cm for Robert Wadlow the world’s tallest person.

“The pattern doesn’t look like any human population at all. It essentially says that the relationship // between weight and height could be absurdly positive or negative. Before we’ve even seen the data, this is a bad model.” (McElreath, 2020, p. 94/95) (pdf)

Theorem 4.3 : Defining the \(\beta\) prior as a Log-Normal distribution

\[ \beta \sim \operatorname{Log-Normal}(0,1) \tag{4.6}\]


“Defining \(\beta\) as \(\operatorname{Log-Normal}(0,1)\) means to claim that the logarithm of \(\beta\) has a \(\operatorname{Normal}(0,1)\) distribution.” (McElreath, 2020, p. 96) (pdf)

: Log-Normal distribution

R Code 4.61 a: Log-Normal distribution (Original)

Code
set.seed(4) # to reproduce with tidyverse version
## R code 4.40a ####################
b <- stats::rlnorm(1e4, 0, 1)
rethinking::dens(b, xlim = c(0, 5), adj = 0.1)

Graph 4.25: Log-Normal distributiom (Original)

R Code 4.62 b: Log-Normal distribution (Tidyverse)

Code
set.seed(4)

tibble::tibble(b = stats::rlnorm(1e4, meanlog = 0, sdlog = 1)) |> 
  ggplot2::ggplot(ggplot2::aes(x = b)) +
  ggplot2::geom_density() +
  ggplot2::coord_cartesian(xlim = c(0, 5)) +
  ggplot2::theme_bw()

Graph 4.26: Log-Normal distribution (Tidyverse)

WATCH OUT! Argument matching

Kurz wrote just mean and sd instead of meanlog and sdlog. These shorter argument names work because of the partial matching of argument names in R function calls. But for educational reasons (risk of misunderstanding, clashes with other partially matching arguments, and less readable code) I apply this technique only occasionally in interactive use.

Base R provides the dlnorm() and rlnorm() densities for working with log-normal distributions.

Using the Log-Normal distribution for the \(\beta\) prior prohibits negative values. This is an important constraint, because the relationship between weight and height should not be negative.

“The reason is that exp(x) is greater than zero for any real number \(x\). This is the reason that Log-Normal priors are commonplace. They are an easy way to enforce positive relationships.” (McElreath, 2020, p. 96) (pdf)
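A one-line sanity check of this claim, reusing the b samples drawn in R Code 4.61:

Code
## Every Log-Normal draw is strictly positive
min(b)        # > 0
all(b > 0)    # TRUE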

R Code 4.63 b: Compare Normal(0,1) with log(Log-Normal(0,1))

If you’re unfamiliar with the log-normal distribution, it is the distribution whose logarithm is normally distributed. For example, here’s what happens when we compare Normal(0,1) with log(Log-Normal(0,1)). (Kurz)

Code
set.seed(4)

tibble::tibble(rnorm             = stats::rnorm(1e5, mean = 0, sd = 1),
       `log(rlognorm)` = base::log(stats::rlnorm(1e5, meanlog = 0, sdlog = 1))) |> 
  tidyr::pivot_longer(tidyr::everything()) |> 

  ggplot2::ggplot(ggplot2::aes(x = value)) +
  ggplot2::geom_density(fill = "grey92") +
  ggplot2::coord_cartesian(xlim = c(-3, 3)) +
  ggplot2::theme_bw() +
  ggplot2::facet_wrap(~ name, nrow = 2)

Graph 4.27: Compare Normal(0,1) with log(Log-Normal(0,1))

Those values are the mean and standard deviation of the output from the rlnorm() function after it has been log transformed. The formulas for the actual mean and standard deviation of the log-normal distribution itself are complicated (see Wikipedia). (Kurz)
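To make this concrete, here is a small sketch comparing the simulated mean of a Log-Normal(0, 1) with the analytic formula \(\exp(\mu + \sigma^2/2)\) from the Wikipedia page Kurz refers to (b_sim is a new helper variable, so it does not overwrite the b objects used elsewhere):

Code
set.seed(4)
b_sim <- stats::rlnorm(1e6, meanlog = 0, sdlog = 1)

## mean of the log-normal itself vs. the analytic value exp(0 + 1^2 / 2)
c(simulated = base::mean(b_sim), analytic = base::exp(0 + 1^2 / 2))

## mean of the log-transformed draws is close to the meanlog parameter 0
base::mean(base::log(b_sim))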

: Prior predictive simulation again, now with the Log-Normal prior (Original)

Code
## R code 4.41a ###################
set.seed(2971)
N_100_a <- 100 # 100 lines
a <- rnorm(N_100_a, 178, 20)
b <- rlnorm(N_100_a, 0, 1)

## R code 4.39a ###################
plot(NULL,
  xlim = range(d2_a$weight), ylim = c(-100, 400),
  xlab = "weight", ylab = "height"
)
abline(h = 0, lty = 2)
abline(h = 272, lty = 1, lwd = 0.5)
mtext("log(b) ~ dnorm(0,1)")
xbar <- mean(d2_a$weight)
for (i in 1:N_100_a) {
  curve(a[i] + b[i] * (x - xbar),
    from = min(d2_a$weight), to = max(d2_a$weight), add = TRUE,
    col = rethinking::col.alpha("black", 0.2)
  )
}

Graph 4.28: Prior predictive simulation again, now with the Log-Normal prior: rethinking version

This is much more sensible. There is still a rare impossible relationship. But nearly all lines in the joint prior for \(\alpha\) and \(\beta\) are now within human reason.

R Code 4.64 b: Prior predictive simulation again, now with the Log-Normal prior (Tidyverse)

Code
# make a tibble to annotate the plot
text <-
  tibble::tibble(weight = c(34, 43),
         height = c(0 - 25, 272 + 25),
         label  = c("Embryo", "World's tallest person (272 cm)"))

# simulate
base::set.seed(2971)

tibble::tibble(n = 1:n_lines,
       a = stats::rnorm(n_lines, mean = 178, sd = 20),
       b = stats::rlnorm(n_lines, mean = 0, sd = 1)) |> 
  tidyr::expand_grid(weight = base::range(d2_b$weight)) |> 
  dplyr::mutate(height = a + b * (weight - base::mean(d2_b$weight))) |>
  
  # plot
  ggplot2::ggplot(ggplot2::aes(x = weight, y = height, group = n)) +
  ggplot2::geom_hline(yintercept = c(0, 272), linetype = 2:1, linewidth = 1/3) +
  ggplot2::geom_line(alpha = 1/10) +
  ggplot2::geom_text(data = text,
            ggplot2::aes(label = label),
            size = 3) +
  ggplot2::coord_cartesian(ylim = c(-100, 400)) +
  ggplot2::ggtitle("log(b) ~ dnorm(0, 1)") +
  ggplot2::theme_bw()

Graph 4.29: Prior predictive simulation again, now with the Log-Normal prior (Tidyverse)

What is the correct prior?

“There is no more a uniquely correct prior than there is a uniquely correct likelihood. …

In choosing priors, there are simple guidelines to get you started. Priors encode states of information before seeing data. So priors allow us to explore the consequences of beginning with different information. In cases in which we have good prior information that discounts the plausibility of some parameter values, like negative associations between height and weight, we can encode that information directly into priors. When we don’t have such information, we still usually know enough about the plausible range of values. And you can vary the priors and repeat the analysis in order to study // how different states of initial information influence inference. Frequently, there are many reasonable choices for a prior, and all of them produce the same inference.” (McElreath, 2020, p. 95/96) (pdf)

WATCH OUT! Prior predictive simulation and p-hacking

“A serious problem in contemporary applied statistics is p-hacking, the practice of adjusting the model and the data to achieve a desired result. The desired result is usually a p-value less then 5%. The problem is that when the model is adjusted in light of the observed data, then p-values no longer retain their original meaning. False results are to be expected.” (McElreath, 2020, p. 97) (pdf)

Prior predictive simulation and p-hacking

The paper by Simmons, Nelson and Simonsohn (2011), False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant, is often cited as an introduction to the problem.

Another more recent publication is: MacCoun, R. J. (2022). P-hacking: A Strategic Analysis. In L. Jussim, J. A. Krosnick, & S. T. Stevens (Eds.), Research Integrity: Best Practices for the Social and Behavioral Sciences (pp. 295–315). Oxford University Press. https://doi.org/10.1093/oso/9780190938550.003.0011.

A book wide treatment is: Chambers, C. (2017). Seven Deadly Sins of Psychology: A Manifesto for Reforming the Culture of Scientific Practice (Illustrated Edition). Princeton University Press.

4.4.2 Finding the posterior distribution

In Example 4.16 we have compared the linear heights model (Equation 4.4) with the linear model heights against weights (Equation 4.5). Now we repeat the linear model heights against weights and compare it with the corresponding R Code.

Example 4.18 : Compare formula and R code for linear model heights against weights (V2)

Theorem 4.4 : Formula of the linear model heights against weights (V2)

\[ \begin{align*} h_{i} \sim \operatorname{Normal}(\mu_{i}, \sigma) \space \space (1) \\ \mu_{i} = \alpha + \beta(x_{i}-\overline{x}) \space \space (2) \\ \alpha \sim \operatorname{Normal}(178, 20) \space \space (3) \\ \beta \sim \operatorname{Log-Normal}(0, 1) \space \space (4) \\ \sigma \sim \operatorname{Uniform}(0, 50) \space \space (5) \end{align*} \tag{4.7}\]

R Code 4.65 : R Code of the linear model heights against weights (V2)

height ~ dnorm(mu, sigma)     # (1)
mu <- a + b * (weight - xbar) # (2)
a ~ dnorm(178, 20)            # (3)        
b ~ dlnorm(0, 1)              # (4)        
sigma ~ dunif(0, 50)          # (5)       

Notice that the linear model, line (2) of the R code, uses the R assignment operator <- instead of the = symbol used in the mathematical formula (Equation 4.7), because mu is defined deterministically from the other parameters.

Example 4.19 : Find the posterior distribution of the linear height-weight model

R Code 4.66 a: Find the posterior distribution of the linear height-weight model (Original)

Code
## R code 4.42a #############################

# define the average weight, x-bar
xbar_a <- mean(d2_a$weight)

# fit model
m4.3a <- rethinking::quap(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a + b * (weight - xbar_a),
    a ~ dnorm(178, 20),
    b ~ dlnorm(0, 1),
    sigma ~ dunif(0, 50)
  ),
  data = d2_a
)

# summary result
## R code 4.44a ############################
rethinking::precis(m4.3a)
#>              mean         sd        5.5%       94.5%
#> a     154.6012591 0.27030995 154.1692516 155.0332666
#> b       0.9033071 0.04192398   0.8363045   0.9703097
#> sigma   5.0719240 0.19115884   4.7664153   5.3774327

R Code 4.67 a: Find the posterior distribution of the linear height-weight model: Log version (Original)

Code
## R code 4.43a ############################
m4.3a_2 <- rethinking::quap(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a + exp(log_b) * (weight - xbar_a),
    a ~ dnorm(178, 20),
    log_b ~ dnorm(0, 1),
    sigma ~ dunif(0, 50)
  ),
  data = d2_a
)

rethinking::precis(m4.3a_2)
#>               mean         sd        5.5%        94.5%
#> a     154.60136683 0.27030697 154.1693641 155.03336957
#> log_b  -0.09957465 0.04626322  -0.1735122  -0.02563709
#> sigma   5.07186793 0.19115333   4.7663680   5.37736786

Note the exp(log_b) in the definition of mu. This is the same model as m4.3a and it will make the same predictions. But instead of \(\beta\), the posterior distribution now contains \(\log(\beta)\). It is easy to translate between the two, because \(\beta = \exp(\log(\beta))\). In code form: b <- exp(log_b).
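As a quick check (my own sketch, not from the book or from Kurz; the object names post_b_m4.3a and post_log_m4.3a_2 are hypothetical), we can translate the posterior of log_b back to the b scale and compare it with the posterior of b from m4.3a:

Code
post_b_m4.3a     <- rethinking::extract.samples(m4.3a)
post_log_m4.3a_2 <- rethinking::extract.samples(m4.3a_2)

# back-transform log_b and compare the posterior means of the slope
base::round(c(mean_b         = base::mean(post_b_m4.3a$b),
              mean_exp_log_b = base::mean(base::exp(post_log_m4.3a_2$log_b))), 3)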

R Code 4.68 b: Find the posterior distribution of the linear height-weight model (Tidyverse)

Code
d2_b <-
  d2_b |> 
  dplyr::mutate(weight_c = weight - base::mean(weight))

m4.3b <- 
  brms::brm(data = d2_b, 
      family = gaussian,
      height ~ 1 + weight_c,
      prior = c(brms::prior(normal(178, 20), class = Intercept),
                brms::prior(lognormal(0, 1), class = b, lb = 0),
                brms::prior(uniform(0, 50), class = sigma, ub = 50)),
      iter = 2000, warmup = 1000, chains = 4, cores = 4,
      seed = 4,
      file = "brm_fits/m04.03b")

brms:::summary.brmsfit(m4.3b)
#>  Family: gaussian 
#>   Links: mu = identity; sigma = identity 
#> Formula: height ~ 1 + weight_c 
#>    Data: d2_b (Number of observations: 352) 
#>   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
#>          total post-warmup draws = 4000
#> 
#> Population-Level Effects: 
#>           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> Intercept   154.61      0.27   154.09   155.14 1.00     4546     2895
#> weight_c      0.90      0.04     0.82     0.99 1.00     4278     2978
#> 
#> Family Specific Parameters: 
#>       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> sigma     5.11      0.20     4.72     5.51 1.00     4439     2773
#> 
#> Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
#> and Tail_ESS are effective sample size measures, and Rhat is the potential
#> scale reduction factor on split chains (at convergence, Rhat = 1).

Remember: The detailed explication of the syntax for the following brms::brm() function is in the “Tidyverse” tab of Example 4.11.

Unlike with McElreath’s rethinking::quap() formula syntax, Kurz is not aware of a way to specify something like weight - xbar directly in the formula argument of brms::brm().

However, the alternative is easy: just make a new variable in the data that is equivalent to weight - mean(weight). We’ll call it weight_c.
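A small sketch of what centering buys us (my own addition, not from Kurz; it assumes m4.3b has been fit as above, and the object names a_centered and b_slope are hypothetical): with weight_c as predictor, the intercept is the expected height at the average weight, and the intercept on the original weight scale can be recovered by subtracting b times the mean weight.

Code
a_centered <- brms::fixef(m4.3b)["Intercept", "Estimate"]   # expected height at the mean weight
b_slope    <- brms::fixef(m4.3b)["weight_c", "Estimate"]

# intercept on the uncentered weight scale
# (the extrapolated height at weight = 0; not meaningful physically,
#  but it shows how the two parameterizations relate)
a_centered - b_slope * base::mean(d2_b$weight)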

R Code 4.69 Find the posterior distribution of the linear height-weight model (log version) (Tidyverse)

Code
m4.3b_log <- 
  brms::brm(data = d2_b, 
      family = gaussian,
      brms::bf(height ~ a + exp(lb) * weight_c,
         a ~ 1,
         lb ~ 1,
         nl = TRUE),
      prior = c(brms::prior(normal(178, 20), class = b, nlpar = a),
                brms::prior(normal(0, 1), class = b, nlpar = lb),
                brms::prior(uniform(0, 50), class = sigma, ub = 50)),
      iter = 2000, warmup = 1000, chains = 4, cores = 4,
      seed = 4,
      file = "brm_fits/m04.03b_log")

brms:::summary.brmsfit(m4.3b_log)
#>  Family: gaussian 
#>   Links: mu = identity; sigma = identity 
#> Formula: height ~ a + exp(lb) * weight_c 
#>          a ~ 1
#>          lb ~ 1
#>    Data: d2_b (Number of observations: 352) 
#>   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
#>          total post-warmup draws = 4000
#> 
#> Population-Level Effects: 
#>              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> a_Intercept    154.61      0.27   154.08   155.12 1.00     3524     2646
#> lb_Intercept    -0.10      0.05    -0.20    -0.01 1.00     4412     2848
#> 
#> Family Specific Parameters: 
#>       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> sigma     5.11      0.19     4.75     5.51 1.00     3697     2629
#> 
#> Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
#> and Tail_ESS are effective sample size measures, and Rhat is the potential
#> scale reduction factor on split chains (at convergence, Rhat = 1).

Remember: The detailed explication of the syntax for the following brms::brm() function is in the “Tidyverse” tab of Example 4.11.

The difference is in the \(\beta\) parameter, which we called lb in the m4.3b_log model. If we call that parameter from m4.3b \(\beta^{m4.3b}\) and the one from our new log model \(\beta^{m4.3b\_log}\), it turns out that \(\beta^{m4.3b} = \exp(\beta^{m4.3b\_log})\).

Compare the result with the previous tab “Tidyverse” in R Code 4.68.

R Code 4.70 b: Extract and compare the population-level (fixed) effects from the object m4.3b and from the log version m4.3b_log

Code
brms::fixef(m4.3b)["weight_c", "Estimate"]
brms::fixef(m4.3b_log)["lb_Intercept", "Estimate"] |> exp()
#> [1] 0.9031995
#> [1] 0.9037418

They’re the same within simulation variance.
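Because exp() is monotone, the equivalence is not restricted to the point estimates. A small sketch (my own addition) compares the 95% interval bounds as well:

Code
# interval bounds of b in m4.3b vs. back-transformed bounds of lb in m4.3b_log
brms::fixef(m4.3b)["weight_c", c("Q2.5", "Q97.5")]
brms::fixef(m4.3b_log)["lb_Intercept", c("Q2.5", "Q97.5")] |> exp()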

R Code 4.71 Display trace plots (Tidyverse)

Code
brms:::plot.brmsfit(m4.3b, widths = c(1, 2))

Checking MCMC chains with trace plots and trank plots

I have not yet learned how to interpret these types of graphical output in detail. McElreath explains in later video lectures that the trace plots of the different chains should cover the same region of the vertical axis (i.e., they should not drift or jump around) and that the chains should keep alternating in their (top) positions. I mention the top positions because that is where irregularities are easiest to detect.

In the example above it is difficult to judge this, because the color differences between the chains are weak. Trace plots are generally hard to inspect, which is why McElreath proposes trace rank plots, or trank plots (his term), in ?sec-chap09.
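As a preview (my own sketch, not part of the book at this point; it assumes the {bayesplot} package, on which the {brms} graphics build, is installed), a trank plot for m4.3b can be produced from the posterior draws like this:

Code
# trace rank ("trank") plots: histograms of the rank of each draw, by chain;
# healthy chains produce largely overlapping, roughly uniform rank histograms
brms::as_draws_array(m4.3b) |>
  bayesplot::mcmc_rank_overlay(pars = c("b_Intercept", "b_weight_c", "sigma")) +
  ggplot2::theme_bw()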

To understand the differences in syntax between {rethinking} and {brms} it is also quite revealing to look at Table 4.2, which I have adapted from Kurz’s explanation of model b4.3.

Table 4.2: Compare statistical notation of rethinking with brms package

| {rethinking} package | {brms} package |
|----------------------|----------------|
| \(\text{height}_i \sim \operatorname{Normal}(\mu_i, \sigma)\) | family = gaussian |
| \(\mu_i = \alpha + \beta \text{weight}_i\) | height ~ 1 + weight_c |
| \(\alpha \sim \operatorname{Normal}(178, 20)\) | prior(normal(178, 20), class = Intercept) |
| \(\beta \sim \operatorname{Log-Normal}(0, 1)\) | prior(lognormal(0, 1), class = b) |
| \(\sigma \sim \operatorname{Uniform}(0, 50)\) | prior(uniform(0, 50), class = sigma) |

4.4.3 Interpreting the posterior distribution

What do parameters mean?

“Posterior probabilities of parameter values describe the relative compatibility of different states of the world with the data, according to the model. These are small world (Chapter 2) numbers.” (McElreath, 2020, p. 99) (pdf)

Statistical models are hard to interpret. Plotting posterior distributions and posterior predictions is better than attempting to understand a table.

4.4.3.1 Table of marginal distributions

There are many different options in the tidyverse approach, i.e. with {brms}. The most important ones are shown in the tabs of the following example.

Example 4.20 Inspect the marginal posterior distributions of the parameters

R Code 4.72 a: Inspect the marginal posterior distributions of the parameters (Original)

Code
## R code 4.44 ##########
rethinking::precis(m4.3a)
#>              mean         sd        5.5%       94.5%
#> a     154.6012591 0.27030995 154.1692516 155.0332666
#> b       0.9033071 0.04192398   0.8363045   0.9703097
#> sigma   5.0719240 0.19115884   4.7664153   5.3774327

  1. First row: quadratic approximation for \(\alpha\)
  2. Second row: quadratic approximation for \(\beta\)
  3. Third row: quadratic approximation for \(\sigma\)

Let’s focus on b (\(\beta\)), because it’s the new parameter. Since \(\beta\) is a slope, the value 0.90 can be read as: a person 1 kg heavier is expected to be 0.90 cm taller. 89% of the posterior probability (\(94.5 - 5.5\)) lies between 0.84 and 0.97. That suggests that \(\beta\) values close to zero or greatly above one are highly incompatible with these data and this model. It is most certainly not evidence that the relationship between weight and height is linear, because the model only considered lines. It just says that, if you are committed to a line, then lines with a slope around 0.9 are plausible ones.
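To read the slope on a more practical scale, here is a small sketch (my own addition, assuming m4.3a as fit in R Code 4.66; the object names are hypothetical): the expected height difference between two people who differ by 10 kg, with its 89% interval.

Code
post_m4.3a_tmp <- rethinking::extract.samples(m4.3a)
diff_10kg <- post_m4.3a_tmp$b * 10   # expected extra height (cm) for +10 kg

base::round(c(mean = base::mean(diff_10kg),
              rethinking::PI(diff_10kg, prob = 0.89)), 2)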

R Code 4.73 a: Display variance-covariance matrix (Original)

The numbers in the default rethinking::precis() output aren’t sufficient to describe the quadratic posterior completely. Therefore we also need to inspect the variance-covariance matrix.

Code
## R code 4.45a ################
round(rethinking::vcov(m4.3a), 3)
#>           a     b sigma
#> a     0.073 0.000 0.000
#> b     0.000 0.002 0.000
#> sigma 0.000 0.000 0.037

There is very little covariation among the parameters in this case.
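A covariance matrix is easier to read as a correlation matrix. As a small sketch (my own addition), stats::cov2cor() converts the matrix above:

Code
# correlations among the quadratic-approximation parameters of m4.3a
base::round(stats::cov2cor(rethinking::vcov(m4.3a)), 2)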

R Code 4.74 a: Display the marginal posteriors and the covariance with a pairs plot (Original)

Code
rethinking::pairs(m4.3a)

The graphic shows both the marginal posteriors and the covariance.

R Code 4.75 b: Display the marginal posterior distributions of the parameters (brms)

Code
brms::posterior_summary(m4.3b, probs = c(0.055, 0.945))[1:3, ] |> 
  round(digits = 2)
#>             Estimate Est.Error   Q5.5  Q94.5
#> b_Intercept   154.61      0.27 154.18 155.05
#> b_weight_c      0.90      0.04   0.84   0.97
#> sigma           5.11      0.20   4.80   5.43

Looking up brms::posterior_summary() I learned that the function “mainly exists to retain backwards compatibility” and “will eventually be replaced by functions of the {posterior} package”. (brms)

WATCH OUT! How to insert coefficients into functions with {brms}?

{brms} does not allow users to insert coefficients into functions like exp() within the conventional formula syntax. We can fit a {brms} model like McElreath’s m4.3a if we adopt what’s called the non-linear syntax. The non-linear syntax is a lot like the syntax McElreath uses in {rethinking} in that it typically includes both predictor and variable names in the formula. (Kurz)

Kurz promises to explain later in his ebook what non-linear syntax exactly means.

R Code 4.76 b: Display the marginal posterior distributions of the parameters with summarize_draws() from {posterior}

Code
posterior::summarize_draws(m4.3b, "mean", "median", "sd", 
                           ~quantile(., probs = c(0.055, 0.945)),
                           .num_args = list(sigfig = 2))[1:3, ]
#> # A tibble: 3 × 6
#>   variable      mean median    sd `5.5%` `94.5%`
#>   <chr>        <dbl>  <dbl> <dbl>  <dbl>   <dbl>
#> 1 b_Intercept 155.   155.   0.27  154.    155.  
#> 2 b_weight_c    0.90   0.90 0.042   0.84    0.97
#> 3 sigma         5.1    5.1  0.20    4.8     5.4

R Code 4.77 b: Display the marginal posterior distributions of the parameters of brms::as_draws_array() objects

Code
## summary for draws object
summary(brms::as_draws_array(m4.3b))[1:3, ]
#> # A tibble: 3 × 10
#>   variable    mean  median     sd    mad      q5     q95  rhat ess_bulk ess_tail
#>   <chr>      <dbl>   <dbl>  <dbl>  <dbl>   <dbl>   <dbl> <dbl>    <dbl>    <dbl>
#> 1 b_Inter… 155.    155.    0.272  0.275  154.    155.     1.00    4546.    2895.
#> 2 b_weigh…   0.903   0.903 0.0424 0.0414   0.834   0.973  1.00    4278.    2978.
#> 3 sigma      5.11    5.10  0.198  0.190    4.79    5.44   1.00    4439.    2773.

R Code 4.78 b: Display variance-covariance matrix (brms)

Code
brms:::vcov.brmsfit(m4.3b) |> 
  round(3)
#>           Intercept weight_c
#> Intercept     0.074    0.000
#> weight_c      0.000    0.002

We got the variance/covariance matrix of the intercept and the weight_c coefficient, but not of \(\sigma\). To get that, we’ll have to extract the posterior draws and use the cov() function instead. (See next tab “vcov2 (T)”)

R Code 4.79 b: Variance-covariance matrix with {posterior} draws objects (brms)

Code
brms::as_draws_df(m4.3b) |>
  dplyr::select(b_Intercept:sigma) |>
  stats::cov() |>
  base::round(digits = 3)
#>             b_Intercept b_weight_c sigma
#> b_Intercept       0.074      0.000 0.000
#> b_weight_c        0.000      0.002 0.000
#> sigma             0.000      0.000 0.039

R Code 4.80 b: Display the marginal posteriors and the covariance with a pairs plot (brms)

Code
brms:::pairs.brmsfit(m4.3b)

The graphic shows both the marginal posteriors and the covariance.

4.4.3.2 Plotting posterior inference against the data

Example 4.21 : Height plotted against weight with linear regression (line at the posterior mean)

R Code 4.81 a: Height in centimeters (vertical) plotted against weight in kilograms (horizontal), with the line at the posterior mean plotted in black (Original)

Code
## R code 4.46a ############################################
plot(height ~ weight, data = d2_a, col = rethinking::rangi2)
post_m4.3a <- rethinking::extract.samples(m4.3a)
a_map <- mean(post_m4.3a$a)
b_map <- mean(post_m4.3a$b)
curve(a_map + b_map * (x - xbar_a), add = TRUE)

Graph 4.30: Height in centimeters (vertical) plotted against weight in kilograms (horizontal), with the line at the posterior mean plotted in black: rethinking version

Each point in this plot is a single individual. The black line is defined by the mean slope \(\beta\) and mean intercept \(\alpha\). This is not a bad line. It certainly looks highly plausible. But there are an infinite number of other highly plausible lines near it. See next tab

R Code 4.82 b: Height in centimeters (vertical) plotted against weight_c (horizontal), with the line at the posterior mean plotted in black, using centered weight values (Tidyverse)

Code
d2_b |>
  ggplot2::ggplot(ggplot2::aes(x = weight_c, y = height)) +
  ggplot2::geom_abline(intercept = brms:::fixef.brmsfit(m4.3b)[1], 
              slope     = brms:::fixef.brmsfit(m4.3b)[2]) +
  ggplot2::geom_point(shape = 1, size = 2, color = "royalblue") +
  ggplot2::theme_bw()

Graph 4.31: Height in centimeters (vertical) plotted against weight_c (horizontal), with the line at the posterior mean plotted in black, using centered weight values (Tidyverse)

Note how the breaks on our x-axis look off. That’s because we fit the model with weight_c and we plotted the points in that metric, too. Since we computed weight_c by subtracting the mean of weight from the data, we can adjust the x-axis break point labels by simply adding that value back. (See next tab “Tidyverse2”)

Further note the use of the brms:::fixef.brmsfit() function within ggplot2::geom_abline(). The function extracts the population-level (‘fixed’) effects from a brmsfit object.

R Code 4.83 b: Height in centimeters (vertical) plotted against weight_c (horizontal), with the line at the posterior mean plotted in black with centered weights in kg (Tidyverse)

Code
labels <-
  c(-10, 0, 10) + base::mean(d2_b$weight) |> 
  base::round(digits = 0)

d2_b |>
  ggplot2::ggplot(ggplot2::aes(x = weight_c, y = height)) +
  ggplot2::geom_abline(intercept = brms::fixef(m4.3b, probs = c(0.055, 0.945))[[1]], 
              slope     = brms::fixef(m4.3b, probs = c(0.055, 0.945))[[2]]) +
  ggplot2::geom_point(shape = 1, size = 2, color = "royalblue") +
  ggplot2::scale_x_continuous("weight",
                     breaks = c(-10, 0, 10),
                     labels = labels) +
  ggplot2::theme_bw()

Graph 4.32: Height in centimeters (vertical) plotted against weight in kilograms (horizontal), with the line at the posterior mean plotted in black with centered weights in kg (Tidyverse)

Note the use of the brms:::fixef.brmsfit() function within ggplot2::geom_abline(). The function extracts the population-level (‘fixed’) effects from a brmsfit object.

What are population-level (“fixed”) effects?

I didn’t know what it meant exactly that brms::fixef() extracts the population-level (‘fixed’) effects from a brmsfit object. After reading other books I now understand that this means effects for the whole population, in contrast to effects on the individual level (so-called “random effects”). (The terms “fixed” and “random” effects are misleading, as both kinds of effects are random.)

“Generally, these are the three types of parameters in multi-level models: the population-level estimate (commonly called fixed effects), the participant-level estimates (random effects) and the participant-level variation.” (Schmettow, 2022, p. 268) (pdf)

“There is a lot of confusion about the type of models that we deal with in this chapter. They have also been called hierarchical models or mixed effects models. The “mixed” stands for a mixture of so called fixed effects and random effects. The problem is: if you start by understanding what fixed effects and random effects are, confusion is programmed, not only because there exist several very different definitions. In fact, it does not matter so much whether an estimate is a fixed effect or random effect. As we will see, you can construct a multi-level model by using just plain descriptive summaries. What matters is that a model contains estimates on population level and on participant level. The benefit is, that a multi-level model can answer the same question for the population as a whole and for every single participant.” (Schmettow, 2022, p. 278) (pdf)

Fixed and random effects

After some googling it turned out that there are great discussions about Fixed Effects in Linear Regression and, more generally, about fixed, random and mixed models (see Cross Validated here and here).

There is also a glossary entry in statistics.com.

And last but not least there are academic papers on this topic.

4.4.3.3 Adding uncertainty around the mean

“Plots of the average line, like Graph 4.30, are useful for getting an impression of the magnitude of the estimated influence of a variable. But they do a poor job of communicating uncertainty. Remember, the posterior distribution considers every possible regression line connecting height to weight. It assigns a relative plausibility to each. This means that each combination of \(\alpha\) and \(\beta\) has a posterior probability. It could be that there are many lines with nearly the same posterior probability as the average line. Or it could be instead that the posterior distribution is rather narrow near the average line.” (McElreath, 2020, p. 101) (pdf)

Example 4.22 : Inspect the uncertainty around the mean

R Code 4.84 a: Show some posterior data rows (Original)

Code
## R code 4.47a ################
# post_m4.3a <- rethinking::extract.samples(m4.3a) # already in previous listing
set.seed(4)
bayr::as_tbl_obs(post_m4.3a)
Data set with 4 variables, showing 8 of 10000 observations.

| Obs  | a        | b         | sigma    |
|------|----------|-----------|----------|
| 307  | 154.2976 | 0.9186361 | 4.841837 |
| 587  | 154.9650 | 0.8467550 | 4.994881 |
| 684  | 154.2021 | 0.9094732 | 5.415900 |
| 1795 | 155.1948 | 0.8750700 | 5.593173 |
| 2867 | 154.8796 | 0.8997763 | 5.146971 |
| 4167 | 153.9071 | 0.9294830 | 5.028728 |
| 4794 | 154.6078 | 0.8424997 | 4.918431 |
| 5624 | 154.7940 | 0.9329589 | 5.006664 |

“Each row is a correlated random sample from the joint posterior of all three parameters, using the covariances provided by vcov(m4.3a) in tab “vcov (O)” in Example 4.20. The paired values of a and b on each row define a line. The average of very many of these lines is the posterior mean line. But the scatter around that average is meaningful, because it alters our confidence in the relationship between the predictor and the outcome” (McElreath, 2020, p. 101) (pdf)

R Code 4.85 a: Plot 20 sampled lines of 10 data points to show the uncertainty around the mean (Original)

Code
## R code 4.48a ##########################
N10_a <- 10
dN10_a <- d2_a[1:N10_a, ]
mN10_a <- rethinking::quap(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a + b * (weight - mean(weight)),
    a ~ dnorm(178, 20),
    b ~ dlnorm(0, 1),
    sigma ~ dunif(0, 50)
  ),
  data = dN10_a
)

## R code 4.49a ##############################
# extract 20 samples from the posterior
set.seed(4)
post_20_m4.3a <- rethinking::extract.samples(mN10_a, n = 20)

# display raw data and sample size
plot(dN10_a$weight, dN10_a$height,
  xlim = range(d2_a$weight), ylim = range(d2_a$height),
  col = rethinking::rangi2, xlab = "weight", ylab = "height"
)
mtext(rethinking:::concat("N = ", N10_a))

# plot the lines, with transparency
for (i in 1:20) {
  curve(post_20_m4.3a$a[i] + post_20_m4.3a$b[i] * (x - mean(dN10_a$weight)),
    col = rethinking::col.alpha("black", 0.3), add = TRUE
  )
}

Graph 4.33: Samples from the quadratic approximate posterior distribution for the height/weight model, m4.3a. 20 lines sampled from 10 data points of the posterior distribution, showing the uncertainty in the regression relationship (Original)

“By plotting multiple regression lines, sampled from the posterior, it is easy to see both the highly confident aspects of the relationship and the less confident aspects. The cloud of regression lines displays greater uncertainty at extreme values for weight.” (McElreath, 2020, p. 102) (pdf)

R Code 4.86 a: Plot 20 sampled lines of 50 data points to show the uncertainty around the mean (Original)

Code
## R code 4.48a ######################
N50_a <- 50
dN50_a <- d2_a[1:N50_a, ]
mN50_a <- rethinking::quap(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a + b * (weight - mean(weight)),
    a ~ dnorm(178, 20),
    b ~ dlnorm(0, 1),
    sigma ~ dunif(0, 50)
  ),
  data = dN50_a
)

## R code 4.49a ######################
# extract 20 samples from the posterior
set.seed(4)
post_50_m4.3a <- rethinking::extract.samples(mN50_a, n = 20)

# display raw data and sample size
plot(dN50_a$weight, dN50_a$height,
  xlim = range(d2_a$weight), ylim = range(d2_a$height),
  col = rethinking::rangi2, xlab = "weight", ylab = "height"
)
mtext(rethinking:::concat("N = ", N50_a))

# plot the lines, with transparency
for (i in 1:20) {
  curve(post_50_m4.3a$a[i] + post_50_m4.3a$b[i] * (x - mean(dN50_a$weight)),
    col = rethinking::col.alpha("black", 0.3), add = TRUE
  )
}

Graph 4.34: Samples from the quadratic approximate posterior distribution for the height/weight model, m4.3a. 20 lines sampled from 50 data points of the posterior distribution, showing the uncertainty in the regression relationship (Original)

“Notice that the cloud of regression lines grows more compact as the sample size increases. This is a result of the model growing more confident about the location of the mean.” (McElreath, 2020, p. 102) (pdf)

R Code 4.87 a: Plot 20 sampled lines of 352 data points to show the uncertainty around the mean (Original)

Code
## R code 4.48, 4.49 ###########################
N352_a <- 352
dN352_a <- d2_a[1:N352_a, ]
mN352_a <- rethinking::quap(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a + b * (weight - mean(weight)),
    a ~ dnorm(178, 20),
    b ~ dlnorm(0, 1),
    sigma ~ dunif(0, 50)
  ),
  data = dN352_a
)

# extract 20 samples from the posterior
post_352_m4.3a <- rethinking::extract.samples(mN352_a, n = 20)

# display raw data and sample size
plot(dN352_a$weight, dN352_a$height,
  xlim = range(d2_a$weight), ylim = range(d2_a$height),
  col = rethinking::rangi2, xlab = "weight", ylab = "height"
)
mtext(rethinking:::concat("N = ", N352_a))

# plot the lines, with transparency
for (i in 1:20) {
  curve(post_352_m4.3a$a[i] + post_352_m4.3a$b[i] * (x - mean(dN352_a$weight)),
    col = rethinking::col.alpha("black", 0.3), add = TRUE
  )
}

Graph 4.35: Samples from the quadratic approximate posterior distribution for the height/weight model, m4.3a. 20 lines sampled from all 352 data points of the posterior distribution, showing the uncertainty in the regression relationship.

“Notice that the cloud of regression lines grows more compact as the sample size increases. This is a result of the model growing more confident about the location of the mean.” (McElreath, 2020, p. 102) (pdf)

R Code 4.88 b: Extract the iterations of the Hamiltonian Monte Carlo (HMC) chains into a data frame and show 8 rows (Tidyverse)

Code
post_m4.3b <- brms::as_draws_df(m4.3b)
set.seed(4)
bayr::as_tbl_obs(post_m4.3b)
Data set with 9 variables, showing 8 of 4000 observations.

| Obs  | b_Intercept | b_weight_c | sigma    | lprior    | lp__      | .chain | .iteration | .draw |
|------|-------------|------------|----------|-----------|-----------|--------|------------|-------|
| 71   | 154.4093    | 0.9211252  | 4.990186 | -9.362502 | -1079.373 | 1      | 71         | 71    |
| 587  | 154.5356    | 0.8923309  | 5.097617 | -9.326425 | -1079.011 | 1      | 587        | 587   |
| 684  | 154.9519    | 0.8758742  | 5.187928 | -9.285901 | -1080.125 | 1      | 684        | 684   |
| 1528 | 154.0199    | 0.8707605  | 4.943669 | -9.335626 | -1081.977 | 2      | 528        | 1528  |
| 1795 | 155.0464    | 0.8917125  | 5.051194 | -9.296172 | -1080.359 | 2      | 795        | 1795  |
| 2038 | 154.6438    | 0.8601186  | 5.262894 | -9.288191 | -1079.917 | 3      | 38         | 2038  |
| 2419 | 154.6285    | 1.0054228  | 5.048592 | -9.433840 | -1081.835 | 3      | 419        | 2419  |
| 2867 | 154.4709    | 0.8487887  | 5.262154 | -9.287149 | -1080.317 | 3      | 867        | 2867  |

Printout interpreted

  • b_Intercept represents a in the rethinking version.
  • b_weight_c represents b in the rethinking version; it carries the suffix _c because the model was fit on the centered predictor weight_c.
  • sigma is the standard deviation of the height distribution (here estimated by HMC sampling rather than by a quadratic approximation).
  • lprior is, I assume, the joint log prior density of the parameters.
  • lp__ is the (unnormalized) log posterior density that Stan reports for each draw.

Instead of rethinking::extract.samples(), the {brms} package extracts all the posterior draws with brms::as_draws_df(). We have already done this with model m4.1b (Example 4.14 in tab “draws1”).

b: Plot 20 sampled lines of 10, 50, 150 and 352 data points to show the uncertainty around the mean (Tidyverse)

Code
## 1. Calculate all four models ################
dN10_b <- 10

m4.3b_010 <- 
  brms::brm(data = d2_b |>
      dplyr::slice(1:dN10_b),  # note the use of `dN10_b` and `slice()` to subset the data
      family = gaussian,
      height ~ 1 + weight_c,
      prior = c(brms::prior(normal(178, 20), class = Intercept),
                brms::prior(lognormal(0, 1), class = b),
                brms::prior(uniform(0, 50), class = sigma, ub = 50)),
      iter = 2000, warmup = 1000, chains = 4, cores = 4,
      seed = 4,
      file = "brm_fits/m04.03b_010")

dN50_b <- 50

m4.3b_050 <- 
  brms::brm(data = d2_b |>
      dplyr::slice(1:dN50_b), 
      family = gaussian,
      height ~ 1 + weight_c,
      prior = c(brms::prior(normal(178, 20), class = Intercept),
                brms::prior(lognormal(0, 1), class = b),
                brms::prior(uniform(0, 50), class = sigma, ub = 50)),
      iter = 2000, warmup = 1000, chains = 4, cores = 4,
      seed = 4,
      file = "brm_fits/m04.03b_050")

dN150_b <- 150

m4.3b_150 <- 
  brms::brm(data = d2_b |>
      dplyr::slice(1:dN150_b), 
      family = gaussian,
      height ~ 1 + weight_c,
      prior = c(brms::prior(normal(178, 20), class = Intercept),
                brms::prior(lognormal(0, 1), class = b),
                brms::prior(uniform(0, 50), class = sigma, ub = 50)),
      iter = 2000, warmup = 1000, chains = 4, cores = 4,
      seed = 4,
      file = "brm_fits/m04.03b_150")

dN352_b <- 352

m4.3b_352 <- 
  brms::brm(data = d2_b |>
      dplyr::slice(1:dN352_b), 
      family = gaussian,
      height ~ 1 + weight_c,
      prior = c(brms::prior(normal(178, 20), class = Intercept),
                brms::prior(lognormal(0, 1), class = b),
                brms::prior(uniform(0, 50), class = sigma, ub = 50)),
      iter = 2000, warmup = 1000, chains = 4, cores = 4,
      seed = 4,
      file = "brm_fits/m04.03b_352")


## 2. put model chains into dfs ##########
post010_m4.3b <- brms::as_draws_df(m4.3b_010)
post050_m4.3b <- brms::as_draws_df(m4.3b_050)
post150_m4.3b <- brms::as_draws_df(m4.3b_150)
post352_m4.3b <- brms::as_draws_df(m4.3b_352)


## 3. prepare plots ##########
p10 <- 
  ggplot2::ggplot(data =  d2_b[1:10, ], 
         ggplot2::aes(x = weight_c, y = height)) +
  ggplot2::geom_abline(data = post010_m4.3b |> dplyr::slice(1:20),
              ggplot2::aes(intercept = b_Intercept, slope = b_weight_c),
              linewidth = 1/3, alpha = .3) +
  ggplot2::geom_point(shape = 1, size = 2, color = "royalblue") +
  ggplot2::coord_cartesian(xlim = base::range(d2_b$weight_c),
                  ylim = base::range(d2_b$height)) +
  ggplot2::labs(subtitle = "N = 10")

p50 <-
  ggplot2::ggplot(data =  d2_b[1:50, ], 
         ggplot2::aes(x = weight_c, y = height)) +
  ggplot2::geom_abline(data = post050_m4.3b |> dplyr::slice(1:20),
              ggplot2::aes(intercept = b_Intercept, slope = b_weight_c),
              linewidth = 1/3, alpha = .3) +
  ggplot2::geom_point(shape = 1, size = 2, color = "royalblue") +
  ggplot2::coord_cartesian(xlim = base::range(d2_b$weight_c),
                  ylim = base::range(d2_b$height)) +
  ggplot2::labs(subtitle = "N = 50")

p150 <-
  ggplot2::ggplot(data =  d2_b[1:150, ], 
         ggplot2::aes(x = weight_c, y = height)) +
  ggplot2::geom_abline(data = post150_m4.3b |> dplyr::slice(1:20),
              ggplot2::aes(intercept = b_Intercept, slope = b_weight_c),
              linewidth = 1/3, alpha = .3) +
  ggplot2::geom_point(shape = 1, size = 2, color = "royalblue") +
  ggplot2::coord_cartesian(xlim = base::range(d2_b$weight_c),
                  ylim = base::range(d2_b$height)) +
  ggplot2::labs(subtitle = "N = 150")

p352 <- 
  ggplot2::ggplot(data =  d2_b[1:352, ], 
         ggplot2::aes(x = weight_c, y = height)) +
  ggplot2::geom_abline(data = post352_m4.3b |> dplyr::slice(1:20),
              ggplot2::aes(intercept = b_Intercept, slope = b_weight_c),
              linewidth = 1/3, alpha = .3) +
  ggplot2::geom_point(shape = 1, size = 2, color = "royalblue") +
  ggplot2::coord_cartesian(xlim = base::range(d2_b$weight_c),
                  ylim = base::range(d2_b$height)) +
  ggplot2::labs(subtitle = "N = 352")

## 4. display plots {patchwork} ########
library(patchwork)
(p10 + p50 + p150 + p352) &
  ggplot2::scale_x_continuous("weight",
                     breaks = c(-10, 0, 10),
                     labels = labels) &
  ggplot2::theme_bw()

“Notice that the cloud of regression lines grows more compact as the sample size increases. This is a result of the model growing more confident about the location of the mean.” (McElreath, 2020, p. 102) (pdf)

4.4.3.4 Plotting regression intervals and contours

“The cloud of regression lines in Example 4.22 is an appealing display, because it communicates uncertainty about the relationship in a way that many people find intuitive. But it’s more common, and often much clearer, to see the uncertainty displayed by plotting an interval or contour around the average regression line.” (McElreath, 2020, p. 102) (pdf)

Procedure 4.5 : Generating predictions and intervals from the posterior of a fit model

  1. Use link to generate distributions of posterior values for \(\mu\). The default behavior of link is to use the original data, so you have to pass it a list of new horizontal axis values you want to plot posterior predictions across.
  2. Use summary functions like base::mean() or rethinking::PI() to find averages and lower and upper bounds of \(\mu\) for each value of the predictor variable.
  3. Finally, use plotting functions like graphics::lines() and rethinking::shade() to draw the lines and intervals. Or you might plot the distributions of the predictions, or do further numerical calculations with them. (McElreath, 2020, p. 107) (pdf)

To understand the first task of Procedure 4.5, McElreath explains several preparation steps. Since the whole procedure involves nine different steps, I will separate the original and the tidyverse approach into Example 4.23 (Original) and Example 4.24 (Tidyverse).

Example 4.23 a: Plotting regression intervals and contours (Original)

R Code 4.89 a: Calculate the uncertainty around the average regression line at weight = 50 kg (Original)

Code
## R code 4.50a ##########################

set.seed(4)                                             # 1
post_m4.3a <- rethinking::extract.samples(m4.3a)        # 2
mu_at_50_a <- post_m4.3a$a + post_m4.3a$b * (50 - xbar) # 3
str(mu_at_50_a)                                         # 4
#>  num [1:10000] 159 159 160 159 160 ...

Comments for the different R code lines:

  1. Set seed for exact reproducibility
  2. Repeating the code for drawing (extracting and collecting) from the fitted model m4.3a (already done in Graph 4.30). extract.samples() returns from a map object a data.frame containing samples for each parameter in the posterior distribution. These samples are cleaned of dimension attributes and the lp__, dev, and log_lik traces that are used internally. For map and other types, it uses the variance-covariance matrix and coefficients to define a multivariate Gaussian posterior to draw \(n\) samples from.
  3. The code to the right of the <- takes its form from the equation for \(\mu_{i} = \alpha + \beta(x_{i} - \overline{x})\). The value of \(x_{i}\) in this case is \(50\).
  4. The result is a vector of predicted means, one for each random sample from the posterior. Since jointly sampled values of a and b went into computing each mean, the variation across those means incorporates the uncertainty in, and the correlation between, both parameters. (McElreath, 2020, p. 103 and the help file of extract.samples())

R Code 4.90 a: The quadratic approximate posterior distribution of the mean height, \(\mu\), when weight is \(50\) kg. This distribution represents the relative plausibility of different values of the mean (Original)

Code
## R code 4.51a ##################
rethinking::dens(mu_at_50_a, col = rethinking::rangi2, lwd = 2, xlab = "mu|weight=50")

Graph 4.36: The quadratic approximate posterior distribution of the mean height, μ, when weight is 50 kg. This distribution represents the relative plausibility of different values of the mean (Original)

R Code 4.91 a: 89% compatibility interval of μ at 50 kg (Original)

Code
## R code 4.52a ##############
rethinking::PI(mu_at_50_a, prob = 0.89)
#>       5%      94% 
#> 158.5758 159.6682

“What these numbers mean is that the central 89% of the ways for the model to produce the data place the average height between about 159 cm and 160 cm (conditional on the model and data), assuming the weight is 50 kg. That’s good so far, but we need to repeat the above calculation for every weight value on the horizontal axis, not just when it is 50 kg.” (McElreath, 2020, p. 104) (pdf)
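Before using rethinking::link() for this, here is a small sketch of the idea (my own addition, assuming post_m4.3a and the mean weight xbar used in R Code 4.89): repeat the same calculation by hand for a few more weight values.

Code
# 89% interval of mu for a handful of weights, computed manually
base::sapply(c(30, 40, 50, 60), function(w)
  rethinking::PI(post_m4.3a$a + post_m4.3a$b * (w - xbar), prob = 0.89))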

R Code 4.92 a: Calculate \(\mu\) for each case in the data and sample from the posterior distribution (Original)

Code
## R code 4.53a ##############
mu_352_m4.3a <- rethinking::link(m4.3a)
str(mu_352_m4.3a)
#>  num [1:1000, 1:352] 157 157 157 158 157 ...

“You end up with a big matrix of values of \(\mu\). Each row is a sample from the posterior distribution. The default is 1000 samples, but you can use as many or as few as you like. Each column is a case (row) in the data. There are 352 rows in d2_a, corresponding to 352 individuals. So there are 352 columns in the matrix mu_352_m4.3a above.” (McElreath, 2020, p. 105) (pdf)

“The function link() provides a posterior distribution of \(\mu\) for each case we feed it. So above we have a distribution of \(\mu\) for each individual in the original data. We actually want something slightly different: a distribution of \(\mu\) for each unique weight value on the horizontal axis. It’s only slightly harder to compute that, by just passing link() some new data.” (McElreath, 2020, p. 105) (pdf) (See next tab “link2”.)

R Code 4.93 a: Calculate a distribution of \(\mu\) for each unique weight value on the horizontal axis (Original)

Code
## R code 4.54a ###############################
# define sequence of weights to compute predictions for
# these values will be on the horizontal axis
weight.seq <- seq(from = 25, to = 70, by = 1)

# use link to compute mu
# for each sample from posterior
# and for each weight in weight.seq
mu_46_m4.3a <- rethinking::link(m4.3a, data = data.frame(weight = weight.seq))
str(mu_46_m4.3a)
#>  num [1:1000, 1:46] 136 136 137 136 136 ...

And now there are only 46 columns in \(\mu\), because we fed it 46 different values for weight.

R Code 4.94 a: The first 100 values in the distribution of \(\mu\) at each weight value (Original)

Code
## R code 4.55a ##################
# use type="n" to hide raw data
base::plot(height ~ weight, d2_a, type = "n")

# loop over samples and plot each mu value
for (i in 1:100) {
  graphics::points(weight.seq, 
                   mu_46_m4.3a[i, ], 
                   pch = 16, 
                   col = rethinking::col.alpha(rethinking::rangi2, 0.1))
}

Graph 4.37: The first 100 values in the distribution of μ at each weight value. (Original)

At each weight value in weight.seq, a pile of computed \(\mu\) values are shown. Each of these piles is a Gaussian distribution, like that in tab “dens” of Example 4.23. You can see now that the amount of uncertainty in \(\mu\) depends upon the value of weight. And this is the same fact you saw in Example 4.22.

R Code 4.95 a: Summary of the distribution for each weight value (Original)

Code
## R code 4.56a #####################
# summarize the distribution of mu
mu_46.mean_m4.3a <- apply(mu_46_m4.3a, 2, mean)         # 1
mu_46.PI_m4.3a <- apply(mu_46_m4.3a, 2, 
                      rethinking::PI, prob = 0.89)      # 2
str(mu_46_m4.3a)                                        # 3
head(mu_46.mean_m4.3a)                                  # 4
head(mu_46.PI_m4.3a)[ , 1:6]                            # 5
#>  num [1:1000, 1:46] 136 136 137 136 136 ...
#> [1] 136.5276 137.4315 138.3354 139.2393 140.1433 141.0472
#>         [,1]     [,2]     [,3]     [,4]     [,5]     [,6]
#> 5%  135.1206 136.0954 137.0654 138.0321 139.0101 139.9782
#> 94% 137.9998 138.8414 139.6767 140.5174 141.3442 142.1896
  1. Read apply(mu_46_m4.3a, 2, mean) as “compute the mean of each column (dimension ‘2’) of the matrix mu_46_m4.3a”. Now mu_46.mean_m4.3a contains the average \(\mu\) at each weight value.
  2. mu_46.PI_m4.3a contains the 89% lower and upper bounds for each weight value.
  3. Displaying the structure shows that there are only 46 columns in mu_46_m4.3a, because we fed it 46 different values for weight.
  4. Display the means of the first six columns of the matrix mu_46_m4.3a (= mu_46.mean_m4.3a).
  5. Display the first six 89% PI lower and upper bounds for each weight value.

mu_46.mean_m4.3a and mu_46.PI_m4.3a are just different kinds of summaries of the distributions in mu_46_m4.3a, with each column being for a different weight value. These summaries are only summaries. The “estimate” is the entire distribution.

R Code 4.96 a: The !Kung height data with 89% compatibility interval of the mean indicated by the shaded region (Original)

Code
## R code 4.57a ###########################
# plot raw data
# fading out points to make line and interval more visible
plot(height ~ weight, data = d2_a, col = rethinking::col.alpha(rethinking::rangi2, 0.5))

# plot the MAP line, aka the mean mu for each weight
graphics::lines(weight.seq, mu_46.mean_m4.3a)

# plot a shaded region for 89% PI
rethinking::shade(mu_46.PI_m4.3a, weight.seq)

Graph 4.38: The !Kung height data with 89% compatibility interval of the mean indicated by the shaded region. Compare with tab ‘dist1’
The following chunk (the book’s R code 4.58) shows how rethinking::link() works under the hood, by computing the distribution of \(\mu\) manually from posterior samples:

Code
## R code 4.58a ################################
post_m4.3a <- rethinking::extract.samples(m4.3a)
mu.link_m4.3a <- function(weight) post_m4.3a$a + post_m4.3a$b * (weight - xbar)
weight.seq <- seq(from = 25, to = 70, by = 1)
mu3_m4.3a <- sapply(weight.seq, mu.link_m4.3a)
mu3.mean_m4.3a <- apply(mu3_m4.3a, 2, mean)
mu3.CI_m4.3a <- apply(mu3_m4.3a, 2, rethinking::PI, prob = 0.89)
head(mu3.mean_m4.3a)
head(mu3.CI_m4.3a)[ , 1:6]
#> [1] 136.5589 137.4615 138.3641 139.2666 140.1692 141.0718
#>         [,1]     [,2]     [,3]     [,4]     [,5]     [,6]
#> 5%  135.1421 136.1088 137.0737 138.0387 139.0046 139.9712
#> 94% 137.9686 138.8058 139.6454 140.4852 141.3257 142.1673

And the values in mu3.mean_m4.3a and mu3.CI_m4.3a should be very similar (allowing for simulation variance) to what you got the automated way, using rethinking::link() in tab ‘sum’.

What follows now is the tidyverse procedure. Since we used weight_c to fit our model, we might first want to understand what exactly the mean value is for weight: mean(d2_b$weight) = 44.9904855. “Just a hair under 45. If we’re interested in \(\mu\) at weight = 50, that implies we’re also interested in \(\mu\) at weight_c = 5.01. Within the context of our model, we compute this with \(\alpha + \beta \cdot 5.01\).” (Kurz)

Example 4.24 b: Plotting regression intervals and contours (Tidyverse)

R Code 4.98 b: Calculate the uncertainty around the average regression line at weight = 50 kg (Tidyverse)

Code
mu_at_50_b <- 
  post_m4.3b |> 
  dplyr::mutate(mu_at_50_b = b_Intercept + b_weight_c * 5.01, .keep = "none")
 
head(mu_at_50_b)
#> # A tibble: 6 × 1
#>   mu_at_50_b
#>        <dbl>
#> 1       159.
#> 2       159.
#> 3       159.
#> 4       159.
#> 5       159.
#> 6       159.

R Code 4.99 b: The quadratic approximate posterior distribution of the mean height, \(\mu\), when weight is \(50\) kg. This distribution represents the relative plausibility of different values of the mean (Tidyverse)

Code
mu_at_50_b |>
  ggplot2::ggplot(ggplot2::aes(x = mu_at_50_b)) +
  tidybayes::stat_halfeye(point_interval = tidybayes::mode_hdi, .width = .95,
               fill = "deepskyblue") +
  ggplot2::scale_y_continuous(NULL, breaks = NULL) +
  ggplot2::xlab(expression(mu["height | weight = 50"])) +
  ggplot2::theme_bw()

Graph 4.39: The quadratic approximate posterior distribution of the mean height, μ, when weight is 50 kg. This distribution represents the relative plausibility of different values of the mean. Tidyverse version

Here we expressed the 95% HPDIs on the density plot with tidybayes::stat_halfeye(). Since tidybayes::stat_halfeye() also returns a point estimate, we’ll throw in the mode.

R Code 4.100 b: 89% and 95% Highest Posterior Density Intervals (HPDIs) of \(\mu\) at 50 kg (Tidyverse)

Code
tidybayes::mean_hdi(mu_at_50_b[, 1], .width = c(.89, .95))
#> # A tibble: 2 × 6
#>   mu_at_50_b .lower .upper .width .point .interval
#>        <dbl>  <dbl>  <dbl>  <dbl> <chr>  <chr>    
#> 1       159.   159.   160.   0.89 mean   hdi      
#> 2       159.   158.   160.   0.95 mean   hdi

R Code 4.101 b: Calculate \(\mu\) for each case in the data and sample from the posterior distribution (Tidyverse)

Code
mu2_m4.3b <- brms:::fitted.brmsfit(m4.3b, summary = F)
str(mu2_m4.3b)
#>  num [1:4000, 1:352] 157 157 157 157 157 ...

With {brms} the equivalent of rethinking::link() (tab “link1” in Example 4.23) is the fitted() function.

With brms:::fitted.brmsfit() it is quite easy to plot a regression line and its intervals: just use the default summary = TRUE, i.e. omit the summary = F argument.

“When you specify summary = F, brms:::fitted.brmsfit() returns a matrix of values with as many rows as there were post-warmup draws across your Hamiltonian Monte Carlo (HMC) chains and as many columns as there were cases in your analysis. Because we had 4,000 post-warmup draws and \(n=352\), brms:::fitted.brmsfit() returned a matrix of 4,000 rows and 352 vectors. If you omitted the summary = F argument, the default is TRUE and brms:::fitted.brmsfit() will return summary information instead.” (Kurz)

WATCH OUT! How to apply fitted()?

Kurz applies the function fitted() in the code, but in the text he twice refers to brms::fitted(), which doesn’t exist. I used brms:::fitted.brmsfit().

But with stats::fitted() you will get the same result!

The object m4.3b is of class brmsfit but in the help file of stats::fitted() you can read: “fitted is a generic function which extracts fitted values from objects returned by modeling functions. All object classes which are returned by model fitting functions should provide a fitted method. (Extract Model Fitted Values, emphasis is mine)

My interpretation therefore is that stats::fitted() dispatches to the method brms:::fitted.brmsfit(). That’s why the results are identical.
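A quick way to check this S3-dispatch interpretation (my own sketch, not from Kurz): both calls compute the same deterministic summary from the stored posterior draws, so the results should match exactly.

Code
# should return TRUE if stats::fitted() dispatches to the brmsfit method
base::all.equal(stats::fitted(m4.3b), brms:::fitted.brmsfit(m4.3b))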

R Code 4.102 b: Calculate a distribution of \(\mu\) for each unique weight value on the horizontal axis (Tidyverse)

Code
weight_seq <- 
  tibble::tibble(weight = 25:70) |> 
  dplyr::mutate(weight_c = weight - base::mean(d2_b$weight))

mu3_m4.3b <-
  brms:::fitted.brmsfit(m4.3b,
         summary = F,
         newdata = weight_seq) |>
  tibble::as_tibble() |>
  # here we name the columns after the `weight` values from which they were computed
  rlang::set_names(25:70) |> 
  dplyr::mutate(iter = 1:dplyr::n())
#> Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
#> `.name_repair` is omitted as of tibble 2.0.0.
#> ℹ Using compatibility `.name_repair`.
Code
mu3_m4.3b[1:6, 1:6]
#> # A tibble: 6 × 6
#>    `25`  `26`  `27`  `28`  `29`  `30`
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  137.  138.  139.  140.  141.  142.
#> 2  137.  138.  139.  140.  141.  142.
#> 3  136.  137.  138.  139.  140.  141.
#> 4  137.  138.  139.  139.  140.  141.
#> 5  137.  138.  139.  140.  141.  141.
#> 6  136.  137.  138.  139.  140.  141.

Much like rethinking::link(), brms:::fitted.brmsfit() can accommodate custom predictor values with its newdata argument.

To differentiate: The {rethinking} version used the variable weight.seq whereas here I am using weight_seq.

R Code 4.103 b: The first 100 values in the distribution of \(\mu\) at each weight value (Tidyverse)

Code
mu4_m4.3b <- 
  mu3_m4.3b |>
  tidyr::pivot_longer(-iter,
               names_to = "weight",
               values_to = "height") |> 
  # we might reformat `weight` to numerals
  dplyr::mutate(weight = base::as.numeric(weight))

d2_b |>
  ggplot2::ggplot(ggplot2::aes(x = weight, y = height)) +
  ggplot2::geom_point(data = mu4_m4.3b |> dplyr::filter(iter < 101), 
             color = "navyblue", alpha = .05) +
  ggplot2::coord_cartesian(xlim = c(30, 65)) +
  ggplot2::theme_bw()

Graph 4.40: The first 100 values in the distribution of μ at each weight value (Tidyverse)

But we did a little more data processing with the aid of tidyr::pivot_longer(), which converts the data from the wide format to the long format.

Wide and long data

If you are new to the distinction between wide and long data, you can learn more from the {tidyr} pivoting vignette (vignette("pivot", package = "tidyr")).

R Code 4.104 b: Draw summaries of the predicted mean response values (the expected values of the posterior predictive distribution) for new weight values (Tidyverse)

Code
mu_summary_m4.3b <-
  brms:::fitted.brmsfit(m4.3b, 
         newdata = weight_seq,
         probs = c(0.055, 0.945)) |>
  tibble::as_tibble() |>
  dplyr::bind_cols(weight_seq)

set.seed(4)
mu_summary_m4.3b |> 
  dplyr::slice_sample(n = 6)
#> # A tibble: 6 × 6
#>   Estimate Est.Error  Q5.5 Q94.5 weight weight_c
#>      <dbl>     <dbl> <dbl> <dbl>  <int>    <dbl>
#> 1     146.     0.506  145.  146.     35    -9.99
#> 2     138.     0.813  137.  140.     27   -18.0 
#> 3     142.     0.655  141.  143.     31   -14.0 
#> 4     163.     0.467  162.  163.     54     9.01
#> 5     170.     0.769  169.  171.     62    17.0 
#> 6     137.     0.853  136.  139.     26   -19.0

How to change the {brms} default CI value from 95% to 89%?

In {brms} I had to set the probs argument to c(0.055, 0.945); the resulting third and fourth vectors of the fitted() object are then named Q5.5 and Q94.5. This was necessary to get the same 89% interval as in the book version. The default probs value for brms:::fitted.brmsfit() would have been c(0.025, 0.975), resulting in the quantiles Q2.5 and Q97.5. The Q prefix stands for quantile. See Rename summary columns of predict() and related methods.

R Code 4.105 a: The !Kung height data with 89% compatibility interval of the mean indicated by the shaded region (Tidyverse)

Code
d2_b |>
  ggplot2::ggplot(ggplot2::aes(x = weight, y = height)) +
  ggplot2::geom_smooth(data = mu_summary_m4.3b,
              ggplot2::aes(y = Estimate, ymin = Q5.5, ymax = Q94.5),
              stat = "identity",
              fill = "grey70", color = "black", alpha = 1, linewidth = 1/2) +
  ggplot2::geom_point(color = "navyblue", shape = 1, size = 1.5, alpha = 2/3) +
  ggplot2::coord_cartesian(xlim = range(d2_b$weight)) +
  ggplot2::theme_bw()

Graph 4.41: Plot of the summaries on top of the !Kung height data again, now with 89% compatibility interval of the mean indicated by the shaded region. Compare this region to the distributions of blue points in tab ‘plot1’

If you wanted intervals other than the 89% ones used here, you would set the probs argument accordingly, e.g. brms:::fitted.brmsfit(m4.3b, newdata = weight_seq, probs = c(.25, .75)). The resulting third and fourth vectors from the fitted() object would then be named Q25 and Q75 instead of Q5.5 and Q94.5. The Q prefix stands for quantile.

Similar to rethinking::link(), brms:::fitted.brmsfit() uses the formula from your model to compute the model expectations for a given set of predictor values. I used it a lot in this project. If you follow along, you’ll get a good handle on it.

How to use brms:::fitted.brmsfit()

To dive deeper into the fitted() function, you can consult its documentation. Though Kurz won’t be using it in this project, he informs users that fitted() is also an alias for the brms::posterior_epred() function, about which you might learn more here. Users can always learn more about these and other functions in the {brms} reference manual.

4.4.3.5 Prediction intervals

“What you’ve done so far is just use samples from the posterior to visualize the uncertainty in \(μ_{i}\), the linear model of the mean.” (McElreath, 2020, p. 107) (pdf)

“Now let’s walk through generating an 89% prediction interval for actual heights, not just the average height, \(\mu\). This means we’ll incorporate the standard deviation \(\sigma\) and its uncertainty as well.” (McElreath, 2020, p. 107) (pdf)

Procedure 4.6 : Simulating heights

  1. For any unique weight value, you sample from a Gaussian distribution
    • with the correct mean \(\mu\) for that weight,
    • using the correct value of \(\sigma\) sampled from the posterior distribution.
  2. Do this
    • for every sample from the posterior,
    • for every weight value.
  3. You will end up with a collection of simulated heights
    • that embody the uncertainty in the posterior
    • as well as the uncertainty in the Gaussian distribution of heights.

Example 4.25 : Generate an 89% prediction interval for actual heights

R Code 4.106 a: Simulation of the posterior observations (Original)

Code
## R code 4.59a #######################
sim.height_m4.3a <- rethinking::sim(m4.3a, data = list(weight = weight.seq))
str(sim.height_m4.3a)
#>  num [1:1000, 1:46] 138 133 137 136 134 ...

This matrix is much like the earlier one in tab ‘sum’ of Example 4.23, but it contains simulated heights, not distributions of plausible average height, \(\mu\).

R Code 4.107 a: Summarize simulated heights of the 89% posterior prediction interval of observable heights (Original)

Code
## R code 4.60a ###################
height.PI_m4.3a <- apply(sim.height_m4.3a, 2, rethinking::PI, prob = 0.89)

height.PI_m4.3a contains the 89% posterior prediction interval of observable (according to the model) heights, across the values of weight in weight.seq.

R Code 4.108 a: 89% prediction interval for height, as a function of weight by sampling 1e3 times (Original)

Code
## R code 4.61a ##################
# plot raw data
plot(height ~ weight, d2_a, col = rethinking::col.alpha(rethinking::rangi2, 0.5))

# draw MAP line
graphics::lines(weight.seq, mu_46.mean_m4.3a)

# draw PI region for line
# rethinking::shade(mu_46.PI_m4.3a, weight.seq)

# draw HPDI region for line
mu_46.HPDI_m4.3a <- apply(mu_46_m4.3a, 2, 
                      rethinking::HPDI, prob = .89)
rethinking::shade(mu_46.HPDI_m4.3a, weight.seq)

# draw PI region for simulated heights
rethinking::shade(height.PI_m4.3a, weight.seq)

Graph 4.42: 89% prediction interval for height, as a function of weight. The solid line is the average line for the mean height at each weight. The two shaded regions show different 89% plausible regions. The narrow shaded interval around the line is the distribution of μ. The wider shaded region represents the region within which the model expects to find 89% of actual heights in the population, at each weight.

“Notice that the outline for the wide shaded interval is a little rough. This is the simulation variance in the tails of the sampled Gaussian values. If it really bothers you, increase the number of samples you take from the posterior distribution. The optional \(n\) parameter for rethinking::sim() controls how many samples are used.” (McElreath, 2020, p. 109) (pdf)

Try for example \(1e4\) samples.

“Run the plotting code again, and you’ll see the shaded boundary smooth out some.” (McElreath, 2020, p. 109) (pdf)

See the next tab ‘plot2’.

R Code 4.109 a: 89% prediction interval for height, as a function of weight by sampling 1e4 times (Original)

Code
## R code 4.59a adapted ################
sim2.height_m4.3a <- rethinking::sim(m4.3a, data = list(weight = weight.seq), n = 1e4)

## R code 4.60a adapted ################
height2.PI_m4.3a <- apply(sim2.height_m4.3a, 2, rethinking::PI, prob = 0.89)

## R code 4.61a ##################

# plot raw data
plot(height ~ weight, d2_a, col = rethinking::col.alpha(rethinking::rangi2, 0.5))

mu_46.HPDI_m4.3a <- apply(mu_46_m4.3a, 2,
                      rethinking::HPDI, prob = 0.89)

# draw MAP line
graphics::lines(weight.seq, mu_46.mean_m4.3a)

# draw HPDI region for line
rethinking::shade(mu_46.HPDI_m4.3a, weight.seq)

# draw PI region for simulated heights
rethinking::shade(height2.PI_m4.3a, weight.seq)

Graph 4.43: 89% prediction interval for height, as a function of weight. Shaded boundary smoothed out by sampling 1e4 times instead of the standard value of 1e3

“With extreme percentiles, it can be very hard to get out all of the roughness. Luckily, it hardly matters, except for aesthetics. Moreover, it serves to remind us that all statistical inference is approximate. The fact that we can compute an expected value to the 10th decimal place does not imply that our inferences are precise to the 10th decimal place” (McElreath, 2020, p. 109) (pdf)

R Code 4.110 a: Writing your own rethinking::sim() function

Code
## R code 4.63a ########################################

# post <- extract.samples(m4.3)
# weight.seq <- 25:70
post_m4.3a <- rethinking::extract.samples(m4.3a)
sim.height2_m4.3a <- sapply(weight.seq, function(weight) {
  rnorm(
    n = nrow(post_m4.3a),
    mean = post_m4.3a$a + post_m4.3a$b * (weight - xbar),
    sd = post_m4.3a$sigma
  )
})
height2.PI_m4.3a <- apply(sim.height2_m4.3a, 2, rethinking::PI, prob = 0.89)
head(height.PI_m4.3a)[ , 1:6]
head(height2.PI_m4.3a)[ , 1:6]
#>         [,1]     [,2]     [,3]     [,4]     [,5]     [,6]
#> 5%  128.3733 129.3550 129.9023 131.2129 132.5578 132.3555
#> 94% 144.7671 145.5569 146.4105 147.6139 147.9741 148.8141
#>         [,1]     [,2]     [,3]     [,4]     [,5]     [,6]
#> 5%  128.4286 129.1754 130.1301 131.2844 131.9124 132.9920
#> 94% 144.7992 145.5424 146.7206 147.4119 148.2944 149.4395

The small differences are the result of the randomized sampling process.

R Code 4.111 b: Predict the posterior observations (Tidyverse)

Code
pred_height_m4.3b <-
  brms:::predict.brmsfit(m4.3b,
          newdata = weight_seq,
          probs = c(0.055, 0.945)) |>
  tibble::as_tibble() |>
  dplyr::bind_cols(weight_seq)

set.seed(4)
pred_height_m4.3b |>
  dplyr::slice_sample(n = 6)
#> # A tibble: 6 × 6
#>   Estimate Est.Error  Q5.5 Q94.5 weight weight_c
#>      <dbl>     <dbl> <dbl> <dbl>  <int>    <dbl>
#> 1     145.      5.08  137.  153.     35    -9.99
#> 2     138.      5.27  130.  147.     27   -18.0 
#> 3     142.      5.16  134.  150.     31   -14.0 
#> 4     163.      4.98  155.  171.     54     9.01
#> 5     170.      5.04  162.  178.     62    17.0 
#> 6     137.      5.19  129.  146.     26   -19.0

The predict() code looks a lot like what we used in the tab ‘fitted3’ of Example 4.24.

{brms} equivalence of rethinking::link() and rethinking::sim()

Much as brms:::fitted.brmsfit() was our analogue to rethinking::link(), brms:::predict.brmsfit() is our analogue to rethinking::sim(). (Kurz)


  • link: Computes the value of each linear model at each sample for each case in the data.
  • sim: Uses the model definition from a map or map2stan fit to simulate outcomes that average over the posterior distribution.
  • fitted.brmsfit: Compute posterior draws of the expected value of the posterior predictive distribution. Returns an array of predicted mean response values.
  • predict.brmsfit: Compute posterior draws of the posterior predictive distribution. Returns an array of predicted response values.
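
As a minimal side-by-side sketch (assuming the models m4.3a / m4.3b and the prediction grids weight.seq / weight_seq from the chunks above are still in the workspace; the object names mu_a, mu_b, sim_a, sim_b are only for illustration):

mu_a  <- rethinking::link(m4.3a, data = list(weight = weight.seq))   # posterior of mu (rethinking)
mu_b  <- brms:::fitted.brmsfit(m4.3b, newdata = weight_seq)          # {brms} analogue

sim_a <- rethinking::sim(m4.3a, data = list(weight = weight.seq))    # posterior predictive draws (rethinking)
sim_b <- brms:::predict.brmsfit(m4.3b, newdata = weight_seq)         # {brms} analogue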

R Code 4.112 b: Plot 89% prediction interval for height, as a function of weight with 2e3 iterations (Tidyverse)

Code
d2_b |>
  ggplot2::ggplot(ggplot2::aes(x = weight)) +
  ggplot2::geom_ribbon(data = pred_height_m4.3b, 
              ggplot2::aes(ymin = Q5.5, ymax = Q94.5),
              fill = "grey83") +
  ggplot2::geom_smooth(data = mu_summary_m4.3b,
              ggplot2::aes(y = Estimate, ymin = Q5.5, ymax = Q94.5),
              stat = "identity",
              fill = "grey70", color = "black", alpha = 1, linewidth = 1/2) +
  ggplot2::geom_point(ggplot2::aes(y = height),
             color = "navyblue", shape = 1, size = 1.5, alpha = 2/3) +
  ggplot2::coord_cartesian(xlim = base::range(d2_b$weight),
                  ylim = base::range(d2_b$height)) +
  ggplot2::theme_bw()

Graph 4.44: 89% prediction interval for height, as a function of weight. The solid line is the average line for the mean height at each weight. The two shaded regions show different 89% plausible regions. The narrow shaded interval around the line is the distribution of μ. The wider shaded region represents the region within which the model expects to find 89% of actual heights in the population, at each weight. (Tidyverse)

To smooth out the rough shaded interval in the {brms} model fitting approach, we would have to refit m4.3b as m4.3b_smooth, specifying a larger number of post-warmup iterations via the iter and warmup parameters.

R Code 4.113 b: Plot 89% prediction interval for height, as a function of weight with 2e4 iterations (Tidyverse)

Code
m4.3b_smooth <- 
  brms::brm(data = d2_b, 
      family = gaussian,
      height ~ 1 + weight_c,
      prior = c(brms::prior(normal(178, 20), class = Intercept),
                brms::prior(lognormal(0, 1), class = b, lb = 0),
                brms::prior(uniform(0, 50), class = sigma, ub = 50)),
      iter = 20000, warmup = 1000, chains = 4, cores = 4,
      seed = 4,
      file = "brm_fits/m04.03b_smooth")

mu_summary_m4.3b_smooth <-
  brms:::fitted.brmsfit(m4.3b_smooth, 
         newdata = weight_seq,
         probs = c(0.055, 0.945)) |>
  tibble::as_tibble() |>
  dplyr::bind_cols(weight_seq)

pred_height_m4.3b_smooth <-
  brms:::predict.brmsfit(m4.3b_smooth,
          newdata = weight_seq,
          probs = c(0.055, 0.945)) |>
  tibble::as_tibble() |>
  dplyr::bind_cols(weight_seq)


d2_b |>
  ggplot2::ggplot(ggplot2::aes(x = weight)) +
  ggplot2::geom_ribbon(data = pred_height_m4.3b_smooth, 
              ggplot2::aes(ymin = Q5.5, ymax = Q94.5),
              fill = "grey83") +
  ggplot2::geom_smooth(data = mu_summary_m4.3b_smooth,
              ggplot2::aes(y = Estimate, ymin = Q5.5, ymax = Q94.5),
              stat = "identity",
              fill = "grey70", color = "black", alpha = 1, linewidth = 1/2) +
  ggplot2::geom_point(ggplot2::aes(y = height),
             color = "navyblue", shape = 1, size = 1.5, alpha = 2/3) +
  ggplot2::coord_cartesian(xlim = base::range(d2_b$weight),
                  ylim = base::range(d2_b$height)) +
  ggplot2::theme_bw()

Graph 4.45: 89% prediction interval for height, as a function of weight. The solid line is the average line for the mean height at each weight. The two shaded regions show different 89% plausible regions. The narrow shaded interval around the line is the distribution of μ. The wider shaded region represents the region within which the model expects to find 89% of actual heights in the population, at each weight. (Tidyverse)

This is the same graphic as in tab ‘plot1’ but with a factor of 10 more iterations to smooth out the rough shaded interval.

R Code 4.114 b: Write your own predict for the posterior observations: Model-based predictions without {brms} and predict(): mean with quantiles of 0.055 and 0.945 (Tidyverse)

Code
set.seed(4)

post_m4.3b |> 
  tidyr::expand_grid(weight = 25:70) |> 
  dplyr::mutate(weight_c = weight - base::mean(d2_b$weight)) |> 
  dplyr::mutate(sim_height = stats::rnorm(dplyr::n(),
                            mean = b_Intercept + b_weight_c * weight_c,
                            sd   = sigma)) |> 
  dplyr::group_by(weight) |> 
  dplyr::summarise(mean = base::mean(sim_height),
            ll   = stats::quantile(sim_height, prob = .055),
            ul   = stats::quantile(sim_height, prob = .945)) |> 
  
  # plot
  ggplot2::ggplot(ggplot2::aes(x = weight)) +
  ggplot2::geom_smooth(ggplot2::aes(y = mean, ymin = ll, ymax = ul),
              stat = "identity",
              fill = "grey83", color = "black", alpha = 1, linewidth = 1/2) +
  ggplot2::geom_point(data = d2_b,
             ggplot2::aes(y = height),
             color = "navyblue", shape = 1, size = 1.5, alpha = 2/3) +
  ggplot2::coord_cartesian(xlim = base::range(d2_b$weight),
                           ylim = base::range(d2_b$height)) +
  ggplot2::theme_bw()

Graph 4.46: Model-based predictions without {brms} and predict(): mean with quantiles of 0.055 and 0.945

Here we followed McElreath’s example and calculated our model-based predictions “by hand”. Instead of relying on base R apply() and sapply(), the main action in the {tidyverse} approach is in tidyr::expand_grid(), the second dplyr::mutate() line with stats::rnorm(), and the dplyr::group_by() + dplyr::summarise() combination.

We specifically left out the fitted() intervals to make it more apparent what we were simulating.

R Code 4.115 b: Write your own predict for the posterior observations, now using {tidybayes}: mode with an 89% HDI (Tidyverse)

Code
set.seed(4)

post_m4.3b |> 
  tidyr::expand_grid(weight = 25:70) |> 
  dplyr::mutate(weight_c = weight - base::mean(d2_b$weight)) |> 
  dplyr::mutate(sim_height = stats::rnorm(dplyr::n(),
                            mean = b_Intercept + b_weight_c * weight_c,
                            sd   = sigma)) |> 
  # dplyr::group_by(weight) |> 
  # dplyr::summarise(mean = base::mean(sim_height),
  #           ll   = stats::quantile(sim_height, prob = .055),
  #           ul   = stats::quantile(sim_height, prob = .945)) |> 
  
  dplyr::group_by(weight) |> 
  tidybayes::mode_hdi(sim_height, .width = .89) |> 
  
  # plot
  ggplot2::ggplot(ggplot2::aes(x = weight)) +
  ## instead of "aes(y = mean, ymin = ll, ymax = ul)"
  ggplot2::geom_smooth(ggplot2::aes(y = .point, ymin = .lower, ymax = .upper),
              stat = "identity",
              fill = "grey83", color = "black", alpha = 1, linewidth = 1/2) +
  ggplot2::geom_point(data = d2_b,
             ggplot2::aes(y = height),
             color = "navyblue", shape = 1, size = 1.5, alpha = 2/3) +
  ggplot2::coord_cartesian(xlim = base::range(d2_b$weight),
                           ylim = base::range(d2_b$height)) +
  ggplot2::labs(y = "mode") +
  ggplot2::theme_bw()

Graph 4.47: Model-based predictions without {brms} and predict(): mode with an 89% HDI from {tidybayes}

Here we replaced that three-line summarise() code with a single {tidybayes} line, using whatever combination of central tendency and interval type we want: tidybayes::mean_qi(sim_height, .width = .89) would reproduce the previous tab, while here we used tidybayes::mode_hdi(sim_height, .width = .89).

4.5 Curves from lines

“We’ll consider two commonplace methods that use linear regression to build curves. The first is polynomial regression. The second is b-splines.” (McElreath, 2020, p. 110) (pdf)

4.5.1 Polynomial regression

“Polynomial regression uses powers of a variable—squares and cubes—as extra predictors.” (McElreath, 2020, p. 110) (pdf)

4.5.1.1 Looking at the full !Kung data (Scatterplot)

Example 4.26 : Scatterplot of height against weight

R Code 4.116 a: Height in centimeters (vertical) plotted against weight in kilograms (horizontal) (Original)

Code
plot(height ~ weight, data = d_a, col = rethinking::rangi2)

Graph 4.48: Height in centimeters (vertical) plotted against weight in kilograms (horizontal). This time with the full data, i.e., including the non-adults.

R Code 4.117 b: Height in centimeters (vertical) plotted against weight in kilograms (horizontal) (Tidyverse)

Code
d_b |> 
  ggplot2::ggplot(ggplot2::aes(x = weight, y = height)) +
  ggplot2::geom_point(color = "navyblue", shape = 1, size = 1.5, alpha = 2/3) +
  ggplot2::annotate(geom = "text",
           x = 42, y = 115,
           label = "This relation is\nvisibly curved.",
           family = "Times") +
  ggplot2::theme_bw()

Graph 4.49: Height in centimeters (vertical) plotted against weight in kilograms (horizontal). This time with the full data, i.e., including the non-adults. (Tidyverse)

The relationship is visibly curved, now that we’ve included the non-adult individuals. (Compare with adult data in Graph 4.30).

What is a quadratic polynomial?

A quadratic function (also called a quadratic, a quadratic polynomial, or a polynomial of degree 2) is a special type of polynomial function where the highest-degree term is second degree. … The graph of a quadratic function is a parabola, a 2-dimensional curve that looks like either a cup (∪) or a cap (∩). (Statistics How To)
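
A quick base R sketch of that cup/cap behavior (the coefficients are invented purely for illustration): the sign of the squared term decides whether the parabola opens upward or downward.

curve( 2 + 1.5 * x + 0.8 * x^2, from = -4, to = 4, ylim = c(-10, 25), ylab = "y")  # positive x^2 term: cup (∪)
curve(10 + 1.5 * x - 0.8 * x^2, from = -4, to = 4, add = TRUE, lty = 2)            # negative x^2 term: cap (∩)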

Example 4.27 : Parabolic and cubic equation for the mean height

Theorem 4.5 : Parabolic equation for the mean height

“The most common polynomial regression is a parabolic model of the mean. Let x be standardized body weight.” (McElreath, 2020, p. 110) (pdf)

\[ \mu_{i} = \alpha + \beta_{1}x_{i} + \beta_{2}x_{i}^2 \tag{4.8}\]

“The above is a parabolic (second order) polynomial. The \(\alpha + \beta_{1}x_{i}\) part is the same linear function of \(x\) in a linear regression, just with a little ‘1’ subscript added to the parameter name, so we can tell it apart from the new parameter. The additional term uses the square of \(x_{i}\) to construct a parabola, rather than a perfectly straight line. The new parameter \(\beta_{2}\) measures the curvature of the relationship.” (McElreath, 2020, p. 110) (pdf)

\[ \begin{align*} h_{i} \sim \operatorname{Normal}(μ_{i}, σ) \space \space (1) \\ μ_{i} = \alpha + \beta_{1}x_{i} + \beta_{2}x_{i}^2 \space \space (2) \\ \alpha \sim \operatorname{Normal}(178, 20) \space \space (3) \\ \beta_{1} \sim \operatorname{Log-Normal}(0,10) \space \space (4) \\ \beta_{2} \sim \operatorname{Normal}(0,10) \space \space (5) \\ \sigma \sim \operatorname{Uniform}(0, 50) \space \space (6) \end{align*} \tag{4.9}\]

height ~ dnorm(mu, sigma)                  # (1)
mu <- a + b1 * weight.s + b2 * weight.s^2  # (2)
a ~ dnorm(178, 20)                         # (3)
b1 ~ dlnorm(0, 10)                         # (4)
b2 ~ dnorm(0, 10)                          # (5)
sigma ~ dunif(0, 50)                       # (6)

“The confusing issue here is assigning a prior for \(\beta_{2}\), the parameter on the squared value of \(x\). Unlike \(\beta_{1}\), we don’t want a positive constraint.” (McElreath, 2020, p. 111) (pdf)

Theorem 4.6 : Cubic equation for the mean height

\[ \begin{align*} h_{i} \sim \operatorname{Normal}(μ_{i}, σ) \space \space (1) \\ μ_{i} = \alpha + \beta_{1}x_{i} + \beta_{2}x_{i}^2 + \beta_{3}x_{i}^3 \space \space (2) \\ \alpha \sim \operatorname{Normal}(178, 20) \space \space (3) \\ \beta_{1} \sim \operatorname{Log-Normal}(0,10) \space \space (4) \\ \beta_{2} \sim \operatorname{Normal}(0,10) \space \space (5) \\ \beta_{3} \sim \operatorname{Normal}(0,10) \space \space (6) \\ \sigma \sim \operatorname{Uniform}(0, 50) \space \space (7) \end{align*} \tag{4.10}\]

height ~ dnorm(mu, sigma)                                      # (1)
mu <- a + b1 * weight.s + b2 * weight.s^2 + b3 * weight.s^3    # (2)
a ~ dnorm(178, 20)                                             # (3)
b1 ~ dlnorm(0, 10)                                             # (4)
b2 ~ dnorm(0, 10)                                              # (5)
b3 ~ dnorm(0, 10)                                              # (6)
sigma ~ dunif(0, 50)                                           # (7)

4.5.1.2 Standardizing the predictor variable

“The first thing to do is to standardize the predictor variable. We’ve done this in previous examples. But this is especially helpful for working with polynomial models. When predictor variables have very large values in them, there are sometimes numerical glitches. Even well-known statistical software can suffer from these glitches, leading to mistaken estimates. These problems are very common for polynomial regression, because the square or cube of a large number can be truly massive. Standardizing largely resolves this issue. It should be your default behavior.” (McElreath, 2020, p. 111) (pdf)
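
A minimal sketch of what standardizing means here and why it tames polynomial terms (it only assumes the full !Kung data frame d_a used above is loaded; standardize_z is an illustrative helper, not from the book):

## z-score: subtract the mean, divide by the standard deviation
standardize_z <- function(x) (x - base::mean(x)) / stats::sd(x)

## cubes of raw weights become very large numbers; cubes of z-scores stay near zero
base::range(d_a$weight^3)
base::range(standardize_z(d_a$weight)^3)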

Example 4.28 : Posterior distribution of a parabolic model of height on weight

R Code 4.118 a: Finding the posterior distribution of a parabolic model of height on weight (Original)

Code
## R code 4.65a #######################

## standardization ########
d_a$weight_s <- (d_a$weight - mean(d_a$weight)) / sd(d_a$weight)
d_a$weight_s2 <- d_a$weight_s^2

## find posterior distribution ########
m4.5a <- rethinking::quap(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a + b1 * weight_s + b2 * weight_s2,
    a ~ dnorm(178, 20),
    b1 ~ dlnorm(0, 1),
    b2 ~ dnorm(0, 1),
    sigma ~ dunif(0, 50)
  ),
  data = d_a
)

## R code 4.66a ################
## precis results ###########
rethinking::precis(m4.5a)
#>             mean        sd       5.5%      94.5%
#> a     146.057415 0.3689756 145.467720 146.647109
#> b1     21.733063 0.2888890  21.271363  22.194764
#> b2     -7.803268 0.2741839  -8.241467  -7.365069
#> sigma   5.774474 0.1764651   5.492449   6.056500

\(\alpha\) (a): Intercept. Expected value of height when weight is at its mean value.

“But it is no longer equal to the mean height in the sample, since there is no guarantee it should in a polynomial regression.” (McElreath, 2020, p. 112) (pdf)

“The implied definition of α in a parabolic model is \(\alpha = Ey_{i} − \beta_{1}Ex_{i} − \beta_{2} Ex_{i}^2\). Now even when the average \(x_{i}\) is zero, \(Ex_{i} = 0\), the average square will likely not be zero. So \(\alpha\) becomes hard to directly interpret again.” (McElreath, 2020, p. 562) (pdf)
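
As a hedged plausibility check of that implied definition (not part of McElreath’s code; it reuses the standardized columns created above and the rounded posterior means from the precis() table):

## with standardized weight, E[x] is ~0 and E[x^2] is ~1,
## so alpha is not the sample mean of height
base::mean(d_a$height) - 21.73 * base::mean(d_a$weight_s) - (-7.80) * base::mean(d_a$weight_s2)
## compare with the precis() estimate for a above (~146.06)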

\(\beta_{1}\) and \(\beta_{2}\) (b1 and b2): Linear and square components of the curve. But that doesn’t make them transparent.

“Now, since the relationship between the outcome height and the predictor weight depends upon two slopes, \(\beta_{1}\) and \(\beta_{2}\), it isn’t so easy to read the relationship off a table of coefficients.” (McElreath, 2020, p. 111) (pdf)

“You have to plot these model fits to understand what they are saying.” (McElreath, 2020, p. 112) (pdf)

R Code 4.119 b: Finding the posterior distribution of a parabolic model of height on weight (Tidyverse)

Code
## R code 4.65b #######################
## standardization ###########
d3_m4.5b <-
  d_b |>
  dplyr::mutate(weight_s = (weight - base::mean(weight)) / stats::sd(weight)) |> 
  dplyr::mutate(weight_s2 = weight_s^2)

## fitting model ############
m4.5b <- 
  brms::brm(data = d3_m4.5b, 
      family = gaussian,
      height ~ 1 + weight_s + weight_s2,
      prior = c(brms::prior(normal(178, 20), class = Intercept),
                brms::prior(lognormal(0, 1), class = b, coef = "weight_s"),
                brms::prior(normal(0, 1), class = b, coef = "weight_s2"),
                brms::prior(uniform(0, 50), class = sigma, ub = 50)),
      iter = 2000, warmup = 1000, chains = 4, cores = 4,
      seed = 4,
      file = "brm_fits/m04.05b")

## R code 4.66a ################
## print.brmsfit results ###############
brms:::print.brmsfit(m4.5b)
#>  Family: gaussian 
#>   Links: mu = identity; sigma = identity 
#> Formula: height ~ 1 + weight_s + weight_s2 
#>    Data: d3_m4.5b (Number of observations: 544) 
#>   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
#>          total post-warmup draws = 4000
#> 
#> Population-Level Effects: 
#>           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> Intercept   146.05      0.38   145.30   146.77 1.00     3544     2516
#> weight_s     21.74      0.29    21.15    22.32 1.00     3106     2604
#> weight_s2    -7.79      0.28    -8.33    -7.25 1.00     3298     2631
#> 
#> Family Specific Parameters: 
#>       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> sigma     5.81      0.18     5.47     6.18 1.00     3788     2881
#> 
#> Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
#> and Tail_ESS are effective sample size measures, and Rhat is the potential
#> scale reduction factor on split chains (at convergence, Rhat = 1).

Note our use of the coef argument within our prior statements. Since \(\beta_{1}\) and \(\beta_{2}\) are both parameters of class = b within the {brms} set-up, we need to use the coef argument when we want their priors to differ.

R Code 4.120 b: Display trace plot (Tidyverse)

Code
brms:::plot.brmsfit(m4.5b, widths = c(1, 2))

4.5.1.3 Fit linear, parabolic and cubic model

Example 4.29 : Different regressions (linear, parabolic and cubic) of height on weight (standardized), for the full !Kung data

R Code 4.121 a: Posterior distribution of the linear height-weight model with the full !Kung data set (Original)

Code
## R code 4.42a #############################
# define the average weight, x-bar
xbar_m4.4a <- mean(d_a$weight_s)

# fit model
m4.4a <- rethinking::quap(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a + b * (weight_s - xbar_m4.4a),
    a ~ dnorm(178, 20),
    b ~ dlnorm(0, 1),
    sigma ~ dunif(0, 50)
  ),
  data = d_a
)


## R code 4.46a ############################################
plot(height ~ weight_s, data = d_a, col = rethinking::rangi2)
post_m4.4a <- rethinking::extract.samples(m4.4a)
a_map_m4.4a <- mean(post_m4.4a$a)
b_map_m4.4a <- mean(post_m4.4a$b)
curve(a_map_m4.4a + b_map_m4.4a * (x - xbar_m4.4a), add = TRUE)

## R code 4.58a ################################
mu.link_m4.4a <- function(weight) post_m4.4a$a + post_m4.4a$b * (weight - xbar_m4.4a)
weight.seq_m4.4a <- seq(from = -3, to = 3, by = 1)
mu_m4.4a <- sapply(weight.seq_m4.4a, mu.link_m4.4a)
mu.mean_m4.4a <- apply(mu_m4.4a, 2, mean)
mu.CI_m4.4a <- apply(mu_m4.4a, 2, rethinking::PI, prob = 0.89)


## R code 4.59a #######################
sim.height_m4.4a <- rethinking::sim(m4.4a, data = list(weight_s = weight.seq_m4.4a))

## R code 4.60a ###################
height.PI_m4.4a <- apply(sim.height_m4.4a, 2, rethinking::PI, prob = 0.89)

## R code 4.61a ##################
# plot raw data
# plot(height ~ weight, d_a, col = rethinking::col.alpha(rethinking::rangi2, 0.5))

# draw MAP line
graphics::lines(weight.seq_m4.4a, mu.mean_m4.4a)

# draw HPDI region for line
rethinking::shade(mu.CI_m4.4a, weight.seq_m4.4a)

# draw PI region for simulated heights
rethinking::shade(height.PI_m4.4a, weight.seq_m4.4a)

Graph 4.50: Posterior distribution of the linear height-weight model with the full !Kung data set (Original)

“[The graphic] shows the familiar linear regression from earlier in the chapter, but now with the standardized predictor and full data with both adults and non-adults. The linear model makes some spectacularly poor predictions, at both very low and middle weights.” (McElreath, 2020, p. 113) (pdf)

R Code 4.122 a: Parabolic regression of height on weight (standardized), for the full !Kung data (Original)

Code
## R code 4.67a ################
weight.seq_m4.5a <- seq(from = -2.2, to = 2, length.out = 30)
pred_dat_m4.5a <- list(weight_s = weight.seq_m4.5a, weight_s2 = weight.seq_m4.5a^2)
mu_m4.5a <- rethinking::link(m4.5a, data = pred_dat_m4.5a)
mu.mean_m4.5a <- apply(mu_m4.5a, 2, mean)
mu.PI_m4.5a <- apply(mu_m4.5a, 2, rethinking::PI, prob = 0.89)
sim.height_m4.5a <- rethinking::sim(m4.5a, data = pred_dat_m4.5a)
height.PI_m4.5a <- apply(sim.height_m4.5a, 2, rethinking::PI, prob = 0.89)


## R code 4.68a #################
plot(height ~ weight_s, d_a, col = rethinking::col.alpha(rethinking::rangi2, 0.5))
graphics::lines(weight.seq_m4.5a, mu.mean_m4.5a)
rethinking::shade(mu.PI_m4.5a, weight.seq_m4.5a)
rethinking::shade(height.PI_m4.5a, weight.seq_m4.5a)

Graph 4.51: A second order polynomial, a parabolic or quadratic regression of height on weight (standardized), for the full !Kung data (Original)

The quadratic regression does a pretty good job. It is much better than a linear regression for the full Howell1 data set.

R Code 4.123 a: Cubic regressions of height on weight (standardized), for the full !Kung data (Original)

Code
## R code 4.69a ####################
d_a$weight_s3 <- d_a$weight_s^3
m4.6a <- rethinking::quap(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a + b1 * weight_s + b2 * weight_s2 + b3 * weight_s3,
    a ~ dnorm(178, 20),
    b1 ~ dlnorm(0, 1),
    b2 ~ dnorm(0, 10),
    b3 ~ dnorm(0, 10),
    sigma ~ dunif(0, 50)
  ),
  data = d_a
)

## R code 4.67a ################
weight.seq_m4.6a <- seq(from = -2.2, to = 2, length.out = 30)
pred_dat_m4.6a <- list(weight_s = weight.seq_m4.6a, weight_s2 = weight.seq_m4.6a^2,
                    weight_s3 = weight.seq_m4.6a^3)
mu_m4.6a <- rethinking::link(m4.6a, data = pred_dat_m4.6a)
mu.mean_m4.6a <- apply(mu_m4.6a, 2, mean)
mu.PI_m4.6a <- apply(mu_m4.6a, 2, rethinking::PI, prob = 0.89)
sim.height_m4.6a <- rethinking::sim(m4.6a, data = pred_dat_m4.6a)
height.PI_m4.6a <- apply(sim.height_m4.6a, 2, rethinking::PI, prob = 0.89)


## R code 4.68a #################
plot(height ~ weight_s, d_a, col = rethinking::col.alpha(rethinking::rangi2, 0.5))
graphics::lines(weight.seq_m4.6a, mu.mean_m4.6a)
rethinking::shade(mu.PI_m4.6a, weight.seq_m4.6a)
rethinking::shade(height.PI_m4.6a, weight.seq_m4.6a)

Graph 4.52: A third order polynomial, a cubic regression of height on weight (standardized), for the full !Kung data (Original)

“This cubic curve is even more flexible than the parabola, so it fits the data even better.” (McElreath, 2020, p. 113) (pdf)

R Code 4.124 b: Fit a linear regression model of height on weight (standardized), for the full !Kung data (Tidyverse)

Code
d_m4.4b <-
  d_b |>
  dplyr::mutate(weight_s = (weight - base::mean(weight)) / stats::sd(weight))


## fit model m4.4b #############
m4.4b <- 
  brms::brm(data = d_m4.4b, 
      family = gaussian,
      height ~ 1 + weight_s,
      prior = c(brms::prior(normal(178, 20), class = Intercept),
                brms::prior(lognormal(0, 1), class = b, coef = "weight_s"),
                brms::prior(uniform(0, 50), class = sigma, ub = 50)),
      iter = 2000, warmup = 1000, chains = 4, cores = 4,
      seed = 4,
      file = "brm_fits/m04.04b")

weight_seq_m4.4b <- 
  tibble::tibble(weight_s = base::seq(from = -2.5, to = 2.5, length.out = 30))


## data wrangling: fitted, predict #########
fitd_m4.4b <-
  brms:::fitted.brmsfit(m4.4b, 
         newdata = weight_seq_m4.4b,
         probs = c(0.055, 0.945)) |>
  tibble::as_tibble() |>
  dplyr::bind_cols(weight_seq_m4.4b)

pred_m4.4b <-
  brms:::predict.brmsfit(m4.4b, 
          newdata = weight_seq_m4.4b,
         probs = c(0.055, 0.945)) |>
  tibble::as_tibble() |>
  dplyr::bind_cols(weight_seq_m4.4b) 


## plot linear model ############
ggplot2::ggplot(data = d_m4.4b, 
     ggplot2::aes(x = weight_s)) +
ggplot2::geom_ribbon(data = pred_m4.4b, 
            ggplot2::aes(ymin = Q5.5, ymax = Q94.5),
            fill = "grey83") +
ggplot2::geom_smooth(data = fitd_m4.4b,
            ggplot2::aes(y = Estimate, ymin = Q5.5, ymax = Q94.5),
            stat = "identity",
            fill = "grey70", color = "black", alpha = 1, linewidth = 1/4) +
ggplot2::geom_point(ggplot2::aes(y = height),
           color = "navyblue", shape = 1, size = 1.5, alpha = 1/3) +
ggplot2::labs(subtitle = "linear",
     y = "height") +
ggplot2::coord_cartesian(xlim = base::range(d_m4.4b$weight_s),
                         ylim = base::range(d_m4.4b$height)) +
ggplot2::theme_bw()

Graph 4.53: Fit a linear regression model of height on weight (standardized), for the full !Kung data. The raw data are shown by the circles. The solid curves show the path of μ in each model, and the shaded regions show the 89% interval of the mean (close to the solid curve) and the 89% interval of predictions (wider) (Tidyverse)

“[The graphic] shows the familiar linear regression from earlier in the chapter, but now with the standardized predictor and full data with both adults and non-adults. The linear model makes some spectacularly poor predictions, at both very low and middle weights.” (McElreath, 2020, p. 113) (pdf)

R Code 4.125 b: Parabolic regression of height on weight (standardized), for the full !Kung data (Tidyverse)

Code
## data wrangling: fitted, predict ##############
weight_seq_m4.5b <- 
  tibble::tibble(weight_s = base::seq(from = -2.5, to = 2.5, length.out = 30)) |> 
  dplyr::mutate(weight_s2 = weight_s^2)

fitd_m4.5b <-
  brms:::fitted.brmsfit(m4.5b, 
         newdata = weight_seq_m4.5b,
         probs = c(0.055, 0.945)) |>
  tibble::as_tibble() |>
  dplyr::bind_cols(weight_seq_m4.5b)

pred_m4.5b <-
  brms:::predict.brmsfit(m4.5b, 
          newdata = weight_seq_m4.5b,
          probs = c(0.055, 0.945)) |>
  tibble::as_tibble() |>
  dplyr::bind_cols(weight_seq_m4.5b)  

## plot quadratic model #########
ggplot2::ggplot(data = d3_m4.5b, 
       ggplot2::aes(x = weight_s)) +
ggplot2::geom_ribbon(data = pred_m4.5b, 
            ggplot2::aes(ymin = Q5.5, ymax = Q94.5),
            fill = "grey83") +
ggplot2::geom_smooth(data = fitd_m4.5b,
            ggplot2::aes(y = Estimate, ymin = Q5.5, ymax = Q94.5),
            stat = "identity",
            fill = "grey70", color = "black", alpha = 1, linewidth = 1/2) +
ggplot2::geom_point(ggplot2::aes(y = height),
           color = "navyblue", shape = 1, size = 1.5, alpha = 1/3) +
ggplot2::labs(subtitle = "quadratic",
     y = "height") +
ggplot2::coord_cartesian(xlim = base::range(d3_m4.5b$weight_s),
                         ylim = base::range(d3_m4.5b$height)) +
ggplot2::theme_bw()

Graph 4.54: Polynomial regressions of height on weight (standardized), for the full !Kung data. The raw data are shown by the circles. The solid curves show the path of μ in each model, and the shaded regions show the 89% interval of the mean (close to the solid curve) and the 89% interval of predictions (wider) (Tidyverse)

The quadratic regression does a pretty good job. It is much better than a linear regression for the full Howell1 data set.

R Code 4.126 b: Cubic regressions of height on weight (standardized), for the full !Kung data (Tidyverse)

Code
## data wrangling: cubic ##############
d3_m4.6b <-
  d3_m4.5b |> 
  dplyr::mutate(weight_s3 = weight_s^3)

m4.6b <- 
  brms::brm(data = d3_m4.6b, 
      family = gaussian,
      height ~ 1 + weight_s + weight_s2 + weight_s3,
      prior = c(brms::prior(normal(178, 20), class = Intercept),
                brms::prior(lognormal(0, 1), class = b, coef = "weight_s"),
                brms::prior(normal(0, 1), class = b, coef = "weight_s2"),
                brms::prior(normal(0, 1), class = b, coef = "weight_s3"),
                brms::prior(uniform(0, 50), class = sigma, ub = 50)),
      iter = 2000, warmup = 1000, chains = 4, cores = 4,
      seed = 4,
      file = "brm_fits/m04.06b")

## data wrangling: fitted, predict ##############
weight_seq_m4.6b <- 
  weight_seq_m4.5b |> 
  dplyr::mutate(weight_s3 = weight_s^3)

fitd_m4.6b <-
  brms:::fitted.brmsfit(m4.6b, 
         newdata = weight_seq_m4.6b,
          probs = c(0.055, 0.945)) |>
  tibble::as_tibble() |>
  dplyr::bind_cols(weight_seq_m4.6b)

pred_m4.6b <-
  brms:::predict.brmsfit(m4.6b, 
          newdata = weight_seq_m4.6b,
          probs = c(0.055, 0.945)) |>
  tibble::as_tibble() |>
  dplyr::bind_cols(weight_seq_m4.6b) 

## plot quadratic model #########
ggplot2::ggplot(data = d3_m4.6b, 
     ggplot2::aes(x = weight_s)) +
ggplot2::geom_ribbon(data = pred_m4.6b, 
            ggplot2::aes(ymin = Q5.5, ymax = Q94.5),
            fill = "grey83") +
ggplot2::geom_smooth(data = fitd_m4.6b,
            ggplot2::aes(y = Estimate, ymin = Q5.5, ymax = Q94.5),
            stat = "identity",
            fill = "grey70", color = "black", alpha = 1, linewidth = 1/4) +
ggplot2::geom_point(aes(y = height),
           color = "navyblue", shape = 1, size = 1.5, alpha = 1/3) +
ggplot2::labs(subtitle = "cubic",
     y = "height") +
ggplot2::coord_cartesian(xlim = base::range(d3_m4.6b$weight_s),
                         ylim = base::range(d3_m4.6b$height)) +
ggplot2::theme_bw()

Graph 4.55: Fit a cubic regression model of height on weight (standardized), for the full !Kung data. The raw data are shown by the circles. The solid curves show the path of μ in each model, and the shaded regions show the 89% interval of the mean (close to the solid curve) and the 89% interval of predictions (wider) (Tidyverse)

“This cubic curve is even more flexible than the parabola, so it fits the data even better.” (McElreath, 2020, p. 113) (pdf)

WATCH OUT! Models are just geocentric descriptions, not more!

“… it’s not clear that any of these models make a lot of sense. They are good geocentric descriptions of the sample, yes. But there are two problems. First, a better fit to the sample might not actually be a better model. That’s the subject of Chapter 7. Second, the model contains no biological information. We aren’t learning any causal relationship between height and weight. We’ll deal with this second problem much later, in Chapter 16.” (McElreath, 2020, p. 113) (pdf)

4.5.1.4 Converting back to natural scale

From Z-scores to natural scale

“The plots in Example 4.29 have standard units on the horizontal axis. These units are sometimes called z-scores. But suppose you fit the model using standardized variables, but want to plot the estimates on the original scale.” (McElreath, 2020, p. 114) (pdf)


  1. Turn off the horizontal axis when plotting the raw data.
  2. Define the location of the labels, in standardized units.
  3. Take standardized units and convert them back to the natural scale.
  4. Explicitly construct and then draw the axis with values from the natural scale.

Example 4.30 : From Z-scores back to natural scale

R Code 4.127 a: From Z-scores back to natural scale (Original)

Code
## R code 4.71a #############
plot(height ~ weight_s, d_a, col = rethinking::col.alpha(rethinking::rangi2, 0.5), xaxt = "n", xlab = "Weight in kg", ylab = "Height in cm")

at_a <- c(-2, -1, 0, 1, 2)                           # 1
labels <- at_a * sd(d_a$weight) + mean(d_a$weight)   # 2
axis(side = 1, at = at_a, labels = round(labels, 1)) # 3

Graph 4.56: Height against weight: standardized but with natural scale (Original)
  1. Defines the location of the labels, in standardized units.
  2. Takes standardized units and converts them back to the original scale.
  3. Draws the axis.

Take a look at the help ?axis for more details.

R Code 4.128 b: From Z-scores back to natural scale (Tidyverse)

Code
at_b <- c(-2, -1, 0, 1, 2)

## plot linear model ############
ggplot2::ggplot(data = d_m4.4b, 
     ggplot2::aes(x = weight_s)) +
ggplot2::geom_ribbon(data = pred_m4.4b, 
            ggplot2::aes(ymin = Q5.5, ymax = Q94.5),
            fill = "grey83") +
ggplot2::geom_smooth(data = fitd_m4.4b,
            ggplot2::aes(y = Estimate, ymin = Q5.5, ymax = Q94.5),
            stat = "identity",
            fill = "grey70", color = "black", alpha = 1, linewidth = 1/4) +
ggplot2::geom_point(ggplot2::aes(y = height),
           color = "navyblue", shape = 1, size = 1.5, alpha = 1/3) +
ggplot2::labs(subtitle = "linear", 
              y = "height in cm", x = "weight in kg") +
ggplot2::coord_cartesian(xlim = base::range(d_m4.4b$weight_s),
                         ylim = base::range(d_m4.4b$height)) +
ggplot2::theme_bw() +
  
# here it is!
ggplot2::scale_x_continuous("standardized weight converted back: Weight in kg",
                   breaks = at_b,
                   labels = base::round(at_b * stats::sd(d_m4.4b$weight) + 
                            base::mean(d_m4.4b$weight), 1))

Graph 4.57: Height against weight: standardized predictor with axis labels converted back to the natural scale (Tidyverse)

4.5.2 Splines

4.5.2.1 How do splines work?

“The second way to introduce a curve is to construct something known as a spline. The word spline originally referred to a long, thin piece of wood or metal that could be anchored in a few places in order to aid drafters or designers in drawing curves. In statistics, a spline is a smooth function built out of smaller, component functions. There are actually many types of splines. The b-spline we’ll look at here is commonplace. The “B” stands for “basis,” which here just means “component.” B-splines build up wiggly functions from simpler less-wiggly components. Those components are called basis functions.” (McElreath, 2020, p. 114) (pdf)

“The short explanation of B-splines is that they divide the full range of some predictor variable, like year, into parts. Then they assign a parameter to each part. These parameters are gradually turned on and off in a way that makes their sum into a fancy, wiggly curve.” (McElreath, 2020, p. 115) (pdf)

“With B-splines, just like with polynomial regression, we do this by generating new predictor variables and using those in the linear model, \(\mu_{i}\). Unlike polynomial regression, B-splines do not directly transform the predictor by squaring or cubing it. Instead they invent a series of entirely new, synthetic predictor variables. Each of these synthetic variables exists only to gradually turn a specific parameter on and off within a specific range of the real predictor variable. Each of the synthetic variables is called a basis function.” (McElreath, 2020, p. 115) (pdf)

Theorem 4.7 : Linear model with B-splines

\[ \mu_{i} = \alpha + w_{1}B_{i,1} + w_{2}B_{i,2} + w_{3}B_{i,3} + ... \]

“… \(B_{i,n}\) is the \(n\)-th basis function’s value on row \(i\), and the \(w\) parameters are corresponding weights for each. The parameters act like slopes, adjusting the influence of each basis function on the mean \(\mu_{i}\). So really this is just another linear regression, but with some fancy, synthetic predictor variables.” (McElreath, 2020, p. 115) (pdf)

Procedure 4.7 : How to generate B-splines?

  1. Choose the number and distribution of knots
  2. Choose the polynomial degree
  3. Get the parameter weights for each basis function (define the model and make it run)
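
A tiny, self-contained sketch of these three steps on made-up data (the predictor, knot positions, and weights below are invented purely for illustration; in the real model the weights are estimated from the data, not drawn at random):

x_toy  <- seq(0, 10, length.out = 200)                # toy predictor
B_toy  <- splines::bs(x_toy,
                      knots = c(2.5, 5, 7.5),         # step 1: knots inside the range
                      degree = 3,                     # step 2: cubic basis functions
                      intercept = TRUE)
w_toy  <- rnorm(ncol(B_toy), 0, 1)                    # step 3: one weight per basis function (here random)
mu_toy <- 5 + B_toy %*% w_toy                         # spline = intercept + weighted sum of the bases
plot(x_toy, mu_toy, type = "l")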

4.5.2.2 Original fit

Example 4.31 : Fit Cherry Blossoms data (Original)

R Code 4.129 a: Load Cherry Blossoms data, filter by complete cases and display summary precis

Code
## R code 4.72 modified ######################
data(package = "rethinking", list = "cherry_blossoms")
d_m4.7a <- cherry_blossoms
d2_m4.7a <- d_m4.7a[ complete.cases(d_m4.7a$doy) , ] # complete cases on doy
rethinking::precis(d2_m4.7a)
#>                   mean          sd      5.5%     94.5%       histogram
#> year       1548.841596 304.1497740 1001.7200 1969.5700   ▁▂▂▃▅▃▇▇▇▇▇▇▁
#> doy         104.540508   6.4070362   94.4300  115.0000        ▁▂▅▇▇▃▁▁
#> temp          6.100356   0.6834104    5.1000    7.3662        ▁▃▅▇▃▁▁▁
#> temp_upper    6.937560   0.8119865    5.8100    8.4177 ▁▃▅▇▅▃▂▁▁▁▁▁▁▁▁
#> temp_lower    5.263545   0.7621943    4.1923    6.6277     ▁▁▁▂▅▇▃▁▁▁▁

The original data set has 1,215 observations, but only 827 records have a value in the day-of-year column (doy, day of the year of first blossom).

See ?cherry_blossoms for details and sources.

R Code 4.130 a: Display raw data for doy (Day of the year of first blossom) against the year

Code
plot(doy ~ year, data = d2_m4.7a, col = rethinking::rangi2)

Graph 4.58: Display raw data for doy (Day of the year of first blossom) against the year (Original)

“We’re going to work with the historical record of first day of blossom, doy, for now. It ranges from 86 (late March) to 124 (early May). The years with recorded blossom dates run from 812 CE to 2015 CE.” (McElreath, 2020, p. 114) (pdf)

You can see that the doy data are sparse in the early years. Their number increases steadily approaching the year 2000.

R Code 4.131 a: Choose number of knots and distribute them over the data points of the year variable. (Original)

Code
## R code 4.73a complete cases on doy ##########
num_knots <- 15                                  # (1) number of cutpoints
knot_list <- quantile(d2_m4.7a$year,             # (2) divide regions for fitting
      probs = seq(0, 1, length.out = num_knots))
print(tibble::enframe(knot_list), n = 15)        # (3) display knot_list
#> # A tibble: 15 × 2
#>    name      value
#>    <chr>     <dbl>
#>  1 0%          812
#>  2 7.142857%  1036
#>  3 14.28571%  1174
#>  4 21.42857%  1269
#>  5 28.57143%  1377
#>  6 35.71429%  1454
#>  7 42.85714%  1518
#>  8 50%        1583
#>  9 57.14286%  1650
#> 10 64.28571%  1714
#> 11 71.42857%  1774
#> 12 78.57143%  1833
#> 13 85.71429%  1893
#> 14 92.85714%  1956
#> 15 100%       2015

The locations of the knots are part of the model. Therefore you are responsible for them. We placed the knots at different evenly-spaced quantiles of the predictor variable.

Explanation of code lines

  1. num_knots: Specify the number of cutpoints that define different regions (or partitions) for a variable. Here 15 knots were chosen.
  2. knot_list: Vector that divides the available rows of variable year into 15 parts named after the percentiles. It is important to understand that not the range of the variable year was divided but the available data points.
  3. I wrapped knot_list into a tibble and printed it with print(tibble::enframe(knot_list), n = 15), so that one can inspect the content of the vector more easily.

Again you can see that the doy data are sparse in the early years. Starting with the 16th century we get similar intervals for the distances between years. This can be inspected better graphically in the next “parts” tab.

R Code 4.132 a: Plot data with equally number of doy data points for each segment against year

Code
plot(doy ~ year, data = d2_m4.7a, col = rethinking::rangi2)
abline(v = knot_list)

Graph 4.59: Display raw data for doy (Day of the year of first blossom) against the year with vertical lines at the knots positions (Original)

Starting with the 16th century we get similar intervals between the knots (59–67 years), because each segment contains approximately the same number of doy data points.

Code
knot_list2 <- tibble::enframe(knot_list) |> 
  dplyr::mutate(diff = round(value - dplyr::lag(value), 0))
print(knot_list2, n = 15)
#> # A tibble: 15 × 3
#>    name      value  diff
#>    <chr>     <dbl> <dbl>
#>  1 0%          812    NA
#>  2 7.142857%  1036   224
#>  3 14.28571%  1174   138
#>  4 21.42857%  1269    95
#>  5 28.57143%  1377   108
#>  6 35.71429%  1454    77
#>  7 42.85714%  1518    64
#>  8 50%        1583    65
#>  9 57.14286%  1650    67
#> 10 64.28571%  1714    64
#> 11 71.42857%  1774    60
#> 12 78.57143%  1833    59
#> 13 85.71429%  1893    60
#> 14 92.85714%  1956    63
#> 15 100%       2015    59

R Code 4.133 a: Calculate basis functions of a cubic spline (degree = 3) with 15 areas (knots) to fit (Original)

Code
## R code 4.74a ################
B_m4.7a <- splines::bs(d2_m4.7a$year,            # (1) generate B-splines
  knots = knot_list[-c(1, num_knots)],           # (2) knots without first & last
  degree = 3,                                    # (3) polynomial (cubic) degree
  intercept = TRUE)                              # (4) intercept
str(B_m4.7a)                                     # (5) show data structure                                   
#>  'bs' num [1:827, 1:17] 1 0.96 0.767 0.563 0.545 ...
#>  - attr(*, "dimnames")=List of 2
#>   ..$ : NULL
#>   ..$ : chr [1:17] "1" "2" "3" "4" ...
#>  - attr(*, "degree")= int 3
#>  - attr(*, "knots")= Named num [1:13] 1036 1174 1269 1377 1454 ...
#>   ..- attr(*, "names")= chr [1:13] "7.142857%" "14.28571%" "21.42857%" "28.57143%" ...
#>  - attr(*, "Boundary.knots")= int [1:2] 812 2015
#>  - attr(*, "intercept")= logi TRUE

Explanation of code lines

  1. The function splines::bs() generates the B-spline basis matrix for a polynomial spline:
  2. knots is generated without the two boundary knots (first and last knot) that are placed at the minimum and maximum of the variable. These two knots are excluded with the tricky code knot_list[-c(1, num_knots)] to prevent redundancies, as splines::bs() by default places knots at the boundaries. So we have 13 internal knots.
  3. With degree = 3 a cubic B-spline is chosen. The polynomial degree determines how basis functions combine, which determines how the parameters interact to produce the spline. For degree 1, (= line), two basis functions combine at each point. For degree 2 (= quadratic), three functions combine at each point. For degree 3 (= cubic), there are four basis functions combined. This should give enough flexibility for each region to fit.
  4. McElreath chose intercept = TRUE: > “We’ll also have an intercept to capture the average blossom day. This will make it easier to define priors on the basis weights, because then we can just conceive of each as a deviation from the intercept.” (McElreath, 2020, p. 117) (pdf)
    Kurz mentioned ominously: “For reasons I’m not prepared to get into, here, splines don’t always include intercept parameters. Indeed, the bs() default is intercept = FALSE.” (Kurz).
    I do not know exactly what the effect of choosing intercept = TRUE is, as the difference from FALSE is marginal. The first B-spline (third hill) starts at the left edge of the graph at \(1.0\) instead of \(0\).
  5. In the data structure you can see in the second to last line, that there are two “Boundary.knots” at year \(812\) and \(2015\). These two years are in fact the first and last value of the year variable:
  • dplyr::first(d2_m4.7a$year): 812
  • dplyr::last(d2_m4.7a$year) : 2015

R Code 4.134 a: Draw raw basis functions for the year variable for 15 areas (knots) and degree 3 (= cubic polynomial) (Original)

Code
## R code 4.75a #############
plot(NULL, xlim = range(d2_m4.7a$year),     
  ylim = c(0, 1), xlab = "year", ylab = "basis")  
for (i in 1:ncol(B_m4.7a)) 
  lines(d2_m4.7a$year, B_m4.7a[, i])

Graph 4.60: Draw raw basis functions B-splines for the year variable for 15 areas (knots) and degree 3 (cubic polynomial) (Original)

Theorem 4.8 a: Linear model for Cherry Blossoms data

\[ \begin{align*} \text{day of year}_{i} \sim \operatorname{Normal}(\mu_{i}, \sigma) \\ \mu_{i} = \alpha + {\sum_{k=1}^K w_{k} B_{k, i}} \\ \alpha \sim \operatorname{Normal}(100, 10) \\ w_{j} \sim \operatorname{Normal}(0, 10) \\ \sigma \sim \operatorname{Exponential}(1) \end{align*} \tag{4.11}\]


where \(\alpha\) is the intercept, \(B_{k, i}\) is the value of the \(k^\text{th}\) basis function on the \(i^\text{th}\) row of the data, and \(w_k\) is the estimated regression weight for the corresponding \(k^\text{th}\) basis function. (Kurz, https://bookdown.org/content/4857/geocentric-models.html#splines)

“The model is multiplying each basis value by a corresponding parameter \(w_{k}\) and then adding up all \(K\) of those products. This is just a compact way of writing a linear model. The rest should be familiar. … the \(w\) priors influence how wiggly the spline can be.” (McElreath, 2020, p. 118) (pdf)

“This is also the first time we’ve used an exponential distribution as a prior. Exponential distributions are useful priors for scale parameters, parameters that must be positive. The prior for \(\sigma\) is exponential with a rate of \(1\). The way to read an exponential distribution is to think of it as containing no more information than an average deviation. That average is the inverse of the rate. So in this case it is \(1/1 = 1\). If the rate were \(0.5\), the mean would be \(1/0.5 = 2\). We’ll use exponential priors for the rest of the book, in place of uniform priors. It is much more common to have a sense of the average deviation than of the maximum.” (McElreath, 2020, p. 118/119) (pdf)

Exponential distributions in R

To learn more about the exponential distribution and its single parameter \(\lambda\) (lambda), which is also called the rate, and how to apply it in R, read Exponential distribution in R.
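
A quick base R illustration of the “average deviation equals one over the rate” reading (nothing here depends on the model):

base::mean(stats::rexp(1e5, rate = 1))    # close to 1 / 1   = 1
base::mean(stats::rexp(1e5, rate = 0.5))  # close to 1 / 0.5 = 2
curve(stats::dexp(x, rate = 1), from = 0, to = 5,
      xlab = "sigma", ylab = "density")   # shape of the Exponential(1) prior used for sigma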

R Code 4.135 a: Fit model and show summary (Original)

Code
## R code 4.76a ############
m4.7a <- rethinking::quap(
  alist(
    D ~ dnorm(mu, sigma),
    mu <- a + B_m4.7a %*% w,
    a ~ dnorm(100, 10),
    w ~ dnorm(0, 10),
    sigma ~ dexp(1)
  ),
  data = list(D = d2_m4.7a$doy, B_m4.7a = B_m4.7a),
  start = list(w = rep(0, ncol(B_m4.7a)))
)

rethinking::precis(m4.7a, depth = 2)
#>               mean        sd         5.5%       94.5%
#> w[1]   -0.74907825 3.8699046  -6.93393317   5.4357767
#> w[2]    0.46260929 3.0138007  -4.35402625   5.2792448
#> w[3]    5.75074086 2.6352108   1.53916511   9.9623166
#> w[4]    0.29257647 2.4850428  -3.67900190   4.2641548
#> w[5]    5.33551430 2.6157766   1.15499802   9.5160306
#> w[6]   -4.24900191 2.4432842  -8.15384200  -0.3441618
#> w[7]    8.89031186 2.4660377   4.94910731  12.8315164
#> w[8]    0.05414883 2.5463000  -4.01533036   4.1236280
#> w[9]    4.08605901 2.5841481  -0.04390883   8.2160269
#> w[10]   5.71821727 2.5609005   1.62540361   9.8110309
#> w[11]   0.90260492 2.5360789  -3.15053899   4.9557488
#> w[12]   6.60934377 2.5575674   2.52185710  10.6968304
#> w[13]   1.77695187 2.6804345  -2.50690017   6.0608039
#> w[14]   0.23168278 3.0115420  -4.58134299   5.0447085
#> w[15]  -5.89934783 3.0966563 -10.84840275  -0.9502929
#> w[16]  -6.64967713 2.9471305 -11.35976093  -1.9395933
#> a     102.29008683 1.9474030  99.17776076 105.4024129
#> sigma   5.87829936 0.1438198   5.64844752   6.1081512

The model conforms to the content in tab “formula”.

“To build this model in quap, we just need a way to do that sum. The easiest way is to use matrix multiplication.” (McElreath, 2020, p. 119) (pdf)

“Matrix algebra is just a new way to represent ordinary algebra. It is often much more compact. So to make model m4.7a easier to program, we used a matrix multiplication of the basis matrix B_m4.7a by the vector of parameters \(w\): B %*% w. This notation is just linear algebra shorthand for (1) multiplying each element of the vector w by each value in the corresponding row of B and then (2) summing up each result.” (McElreath, 2020, p. 120) (pdf)

The difference between matrix multiplication and the traditional style can be seen in the line that defines mu with the basis matrix B_m4.7a:

mu <- a + B %*% w                                     # (1) matrix multiplication
mu <- a + sapply(1:827, function(i) sum(B[i, ] * w))  # (2) less elegant code, same result
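
A hedged toy check that both ways of writing the sum give identical numbers (the 3 × 2 matrix B_demo and the weights w_demo are invented):

B_demo <- matrix(c(1.0, 0.5, 0.0,
                   0.0, 0.5, 1.0), nrow = 3)          # columns play the role of two toy basis functions
w_demo <- c(2, -1)                                    # two toy weights

as.vector(B_demo %*% w_demo)                          # matrix multiplication
sapply(1:3, function(i) sum(B_demo[i, ] * w_demo))    # row-wise sums, same result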

We looked at the posterior means with precis(m4.7a, depth = 2) and see 17 \(w\) parameters. But this doesn’t help much. We need to plot the posterior predictions (see the last tab “plot2”).

R Code 4.136 a: Draw weighted basis functions (B-splines)

Code
## R code 4.77a #############
post_m4.7a <- rethinking::extract.samples(m4.7a)
w_m4.7a <- apply(post_m4.7a$w, 2, mean)
plot(NULL,
  xlim = base::range(d2_m4.7a$year), ylim = c(-6, 6),
  xlab = "year", ylab = "basis * weight"
)
for (i in 1:ncol(B_m4.7a)) 
    lines(d2_m4.7a$year, w_m4.7a[i] * B_m4.7a[, i])

Graph 4.61: Draw weighted basis functions (B-splines) for the year variable for 15 areas (knots) and degree 3 (cubic polynomial) (Original)

To get the parameter weights for each basis function, we had to define the model (tab “formula”) and make it run (tab “fit”).

Compare the weighted basis functions with the raw basis function (tab “basis”) and with the wiggles in the Cherry Blossoms data (tab “plot2”).

R Code 4.137 a: Plot the 97% posterior interval for \(\mu\), at each year

Code
## R code 4.78a ###############
mu_m4.7a <- rethinking::link(m4.7a)
mu_PI_m4.7a <- apply(mu_m4.7a, 2, rethinking::PI, 0.97)
plot(d2_m4.7a$year, d2_m4.7a$doy, 
     col = rethinking::col.alpha(rethinking::rangi2, 0.3), pch = 16)
rethinking::shade(mu_PI_m4.7a, d2_m4.7a$year, 
     col = rethinking::col.alpha("black", 0.5))

Graph 4.62: Posterior prediction: The sum of the weighted basis functions, at each point, produces the spline, displayed as a 97% posterior interval of μ

“The spline is much wigglier now. Something happened around 1500, for example. If you add more knots, you can make this even wigglier. You might wonder how many knots is correct. We’ll be ready to address that question in a few more chapters” (McElreath, 2020, p. 119) (pdf)

4.5.2.3 Tidyverse fit

Example 4.32 b: Fit Cherry Blossoms data (Tidyverse)

R Code 4.138 b: Load Cherry Blossoms data, complete cases and display summary

Code
## R code 4.72b modified ######################
data(package = "rethinking", list = "cherry_blossoms")
d_m4.7b <- cherry_blossoms

d2_m4.7b <- 
  d_m4.7b |> 
  tidyr::drop_na(doy)


# ground-up tidyverse way to summarize
(
    d2_m4.7b |> 
      tidyr::gather() |> 
      dplyr::group_by(key) |> 
      dplyr::summarise(mean = base::mean(value, na.rm = T),
                sd   = stats::sd(value, na.rm = T),
                ll   = stats::quantile(value, prob = .055, na.rm = T),
                ul   = stats::quantile(value, prob = .945, na.rm = T)) |> 
      dplyr::mutate(dplyr::across(where(is.double), 
                                  \(x) round(x, digits = 2)))
)
#> # A tibble: 5 × 5
#>   key           mean     sd      ll      ul
#>   <chr>        <dbl>  <dbl>   <dbl>   <dbl>
#> 1 doy         105.     6.41   94.4   115   
#> 2 temp          6.1    0.68    5.1     7.37
#> 3 temp_lower    5.26   0.76    4.19    6.63
#> 4 temp_upper    6.94   0.81    5.81    8.42
#> 5 year       1549.   304.   1002.   1970.

Within the tidyverse, we can drop NA’s with the tidyr::drop_na() function. This is a much easier way than the dplyr::filter() approach I originally used (see also the different approaches discussed on StackOverflow).

WATCH OUT! across() supersedes the family of “scoped variants” like mutate_if()

I am using Kurz’ “ground-up {tidyverse} way” to summarize the data. But instead of the superseded dplyr::mutate_if() function I used the newer dplyr::across(). At first I wrote dplyr::mutate(dplyr::across(where(is.double), round, digits = 2)), but after a warning I changed it to an anonymous function:

  # Previously
  across(a:b, mean, na.rm = TRUE)

  # Now
  across(a:b, \(x) mean(x, na.rm = TRUE))

R Code 4.139 b: Display summary with skimr::skim()

Code
d2_m4.7b |>
  skimr::skim()
Data summary
Name d2_m4.7b
Number of rows 827
Number of columns 5
_______________________
Column type frequency:
numeric 5
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1.00 1548.84 304.15 812.00 1325.00 1583.00 1803.50 2015.00 ▂▅▆▇▇
doy 0 1.00 104.54 6.41 86.00 100.00 105.00 109.00 124.00 ▁▅▇▅▁
temp 40 0.95 6.10 0.68 4.69 5.62 6.06 6.46 8.30 ▃▇▇▂▁
temp_upper 40 0.95 6.94 0.81 5.45 6.38 6.80 7.38 12.10 ▇▇▂▁▁
temp_lower 40 0.95 5.26 0.76 2.61 4.77 5.25 5.65 7.74 ▁▃▇▂▁

Kurz’s version does not have the mini histograms. I added another summary with skimr::skim() to add tiny graphics.

R Code 4.140 b: Display raw data for doy (Day of the year of first blossom) against the year

Code
d2_m4.7b |> 
  ggplot2::ggplot(ggplot2::aes(x = year, y = doy)) +
  ## color from https://www.colorhexa.com/ffb7c5
  ggplot2::geom_point(color = "#ffb7c5", alpha = 1/2) +
  ggplot2::theme() +
  ggplot2::theme(panel.grid = ggplot2::element_blank(),
    panel.background = ggplot2::element_rect(fill = "#4f455c"))

## color from https://www.colordic.org/w/, 
## inspired by 
## https://chichacha.netlify.com/2018/11/29/plotting-traditional-colours-of-japan/

Graph 4.63: Display raw data for doy (Day of the year of first blossom) against the year (Tidyverse)

By default {ggplot2} removes missing data records with a warning. But I had already removed missing data for the doy variable (see: tab “data”).

Note

This is the same code and text as in tab “knots” of Example 4.31.

R Code 4.141 b: Choose number of knots and distribute them over the data points of the year variable.

Code
## R code 4.73b complete cases on doy ##########
num_knots_b <- 15                                  # (1) number of cutpoints
knot_list_b <- quantile(d2_m4.7b$year,             # (2) divide regions for fitting
      probs = seq(0, 1, length.out = num_knots_b))
print(tibble::enframe(knot_list_b), n = 15)        # (3) display knot_list
#> # A tibble: 15 × 2
#>    name      value
#>    <chr>     <dbl>
#>  1 0%          812
#>  2 7.142857%  1036
#>  3 14.28571%  1174
#>  4 21.42857%  1269
#>  5 28.57143%  1377
#>  6 35.71429%  1454
#>  7 42.85714%  1518
#>  8 50%        1583
#>  9 57.14286%  1650
#> 10 64.28571%  1714
#> 11 71.42857%  1774
#> 12 78.57143%  1833
#> 13 85.71429%  1893
#> 14 92.85714%  1956
#> 15 100%       2015

The locations of the knots are part of the model. Therefore you are responsible for them. We placed the knots at different evenly-spaced quantiles of the predictor variable.

Explanation of code lines

  1. num_knots_b: Specify the number of cutpoints that define different regions (or partitions) for a variable. Here 15 knots were chosen.
  2. knot_list_b: Vector that divides the available rows of variable year into 15 parts named after the percentiles. It is important to understand that not the range of the variable year was divided but the available data points.
  3. I wrapped knot_list_b into a tibble and printed it with print(tibble::enframe(knot_list_b), n = 15), so that one can inspect the content of the vector more easily.

Again you can see that the doy data are sparse in the early years. Starting with the 16th century we get similar intervals for the distances between years. This can be inspected better graphically in the next “parts” tab.

Note

The code is different but the text is the same as in tab “parts” of Example 4.31.

R Code 4.142 b: Plot data with equally number of doy data points for each segment against year

Code
d2_m4.7b |> 
  ggplot2::ggplot(ggplot2::aes(x = year, y = doy)) +
  ggplot2::geom_vline(xintercept = knot_list_b, 
             color = "white", alpha = 1/2) +
  ggplot2::geom_point(color = "#ffb7c5", alpha = 1/2) +
  ggplot2::theme_bw() +
  ggplot2::theme(panel.background = ggplot2::element_rect(fill = "#4f455c"),
        panel.grid = ggplot2::element_blank())

Graph 4.64: Display raw data for doy (Day of the year of first blossom) against the year with vertical lines at the knots positions (Tidyverse)

Starting with the 16th century we get similar intervals between the knots (59–67 years), because each segment contains approximately the same number of doy data points.

Code
knot_list2 <- tibble::enframe(knot_list) |> 
  dplyr::mutate(diff = round(value - dplyr::lag(value), 0))
print(knot_list2, n = 15)
#> # A tibble: 15 × 3
#>    name      value  diff
#>    <chr>     <dbl> <dbl>
#>  1 0%          812    NA
#>  2 7.142857%  1036   224
#>  3 14.28571%  1174   138
#>  4 21.42857%  1269    95
#>  5 28.57143%  1377   108
#>  6 35.71429%  1454    77
#>  7 42.85714%  1518    64
#>  8 50%        1583    65
#>  9 57.14286%  1650    67
#> 10 64.28571%  1714    64
#> 11 71.42857%  1774    60
#> 12 78.57143%  1833    59
#> 13 85.71429%  1893    60
#> 14 92.85714%  1956    63
#> 15 100%       2015    59
Note

This is the same code and text as in tab “calc” of Example 4.31.

R Code 4.143 b: Calculate basis functions of a cubic spline (degree = 3) with 15 areas (knots) to fit

Code
## R code 4.74b ################
B_m4.7b <- splines::bs(d2_m4.7b$year,            # (1) generate B-splines
  knots = knot_list_b[-c(1, num_knots_b)],       # (2) knots without first & last
  degree = 3,                                    # (3) polynomial (cubic) degree
  intercept = TRUE)                              # (4) intercept
str(B_m4.7b)                                     # (5) show data structure
#>  'bs' num [1:827, 1:17] 1 0.96 0.767 0.563 0.545 ...
#>  - attr(*, "dimnames")=List of 2
#>   ..$ : NULL
#>   ..$ : chr [1:17] "1" "2" "3" "4" ...
#>  - attr(*, "degree")= int 3
#>  - attr(*, "knots")= Named num [1:13] 1036 1174 1269 1377 1454 ...
#>   ..- attr(*, "names")= chr [1:13] "7.142857%" "14.28571%" "21.42857%" "28.57143%" ...
#>  - attr(*, "Boundary.knots")= int [1:2] 812 2015
#>  - attr(*, "intercept")= logi TRUE

Explanation of code lines

  1. The function splines::bs() generates the B-spline basis matrix for a polynomial spline.
  2. knots is passed without the two boundary knots (first and last knot) that are placed at the minimum and maximum of the variable. These two knots are excluded with the somewhat tricky code knot_list_b[-c(1, num_knots_b)] to prevent redundancies, as splines::bs() places knots at the boundaries by default. So we have 13 internal knots (a small check on the resulting number of basis functions follows after this list).
  3. With degree = 3 a cubic B-spline is chosen. The polynomial degree determines how basis functions combine, which in turn determines how the parameters interact to produce the spline. For degree 1 (= line), two basis functions combine at each point. For degree 2 (= quadratic), three functions combine at each point. For degree 3 (= cubic), four basis functions combine at each point. This should give enough flexibility to fit each region.
  4. McElreath chose intercept = TRUE: > “We’ll also have an intercept to capture the average blossom day. This will make it easier to define priors on the basis weights, because then we can just conceive of each as a deviation from the intercept.” (McElreath, 2020, p. 117) (pdf)
    Kurz mentioned ominously: “For reasons I’m not prepared to get into, here, splines don’t always include intercept parameters. Indeed, the bs() default is intercept = FALSE.” (Kurz).
    I do not know exactly what the result of choosing intercept = TRUE is, as the difference from FALSE is marginal: the first B-spline (third hill) starts on the left edge of the graph with \(1.0\) instead of \(0\).
  5. In the data structure you can see, in the second to last line, that there are two “Boundary.knots” at year \(812\) and \(2015\). These two years are in fact the first and last values of the year variable:
  • dplyr::first(d2_m4.7b$year): 812
  • dplyr::last(d2_m4.7b$year) : 2015
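A quick way to see why there are 17 basis functions: splines::bs() returns one column per internal knot plus the degree plus one for the intercept. A small check, using the objects defined above:

base::length(knot_list_b[-c(1, num_knots_b)]) + 3 + 1   # 13 internal knots + degree 3 + intercept = 17
base::ncol(B_m4.7b)                                      # 17 basis-function columns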

R Code 4.144 b: Draw raw basis functions for the year variable for 15 areas (knots) and degree 3 (= cubic polynomial)

Code
# wrangle a bit
d_B_m4.7b <-
  B_m4.7b |> 
  base::as.data.frame() |> 
  rlang::set_names(stringr::str_c(0, 1:9), 10:17) |>  
  dplyr::bind_cols(dplyr::select(d2_m4.7b, year)) |> 
  tidyr::pivot_longer(-year,
               names_to = "bias_function",
               values_to = "bias")

# plot
d_B_m4.7b |> 
  ggplot2::ggplot(ggplot2::aes(x = year, y = bias, group = bias_function)) +
  ggplot2::geom_vline(xintercept = knot_list_b, color = "white", alpha = 1/2) +
  ggplot2::geom_line(color = "#ffb7c5", alpha = 1/2, linewidth = 1.5) +
  ggplot2::ylab("bias value") +
  ggplot2::theme_bw() +
  ggplot2::theme(panel.background = ggplot2::element_rect(fill = "#4f455c"),
        panel.grid = ggplot2::element_blank())

Graph 4.65: Draw raw basis functions B-splines for the year variable for 15 areas (knots) and degree 3 (cubic polynomial) (Tidyverse)

WATCH OUT! Warning with tibble::as_tibble() instead of base::as.data.frame()

At first I used tibble::as_tibble(), as always. But this generated a warning message:

Don’t know how to automatically pick scale for object of type . Defaulting to continuous.

I learned that it can have to do with accidentally passing a function name instead of a variable name to aes(), for instance the function mean versus a variable Mean (statology) or sample versus Sample (StackOverflow).

But why this happens with tibble::as_tibble() and not with base::as.data.frame() I do not know. Perhaps it has to do with the somewhat stricter tibble naming rules? (See: tibble overview and the tbl_df class.)

R Code 4.145 b: Basis functions of a cubic spline with 15 knots broken up in different facets

Code
d_B_m4.7b |> 
  dplyr::mutate(bias_function = stringr::str_c("bias function ", 
                               bias_function)) |> 
  ggplot2::ggplot(ggplot2::aes(x = year, y = bias)) +
  ggplot2::geom_vline(xintercept = knot_list_b, 
             color = "white", alpha = 1/2) +
  ggplot2::geom_line(color = "#ffb7c5", linewidth = 1.5) +
  ggplot2::ylab("bias value") +
  ggplot2::theme_bw() +
  ggplot2::theme(panel.background = ggplot2::element_rect(fill = "#4f455c"),
        panel.grid = ggplot2::element_blank(),
        strip.background = ggplot2::element_rect(
          fill = scales::alpha("#ffb7c5", .25), color = "transparent"),
        strip.text = ggplot2::element_text(
          size = 8, margin = ggplot2::margin(0.1, 0, 0.1, 0, "cm"))) +
  ggplot2::facet_wrap(~ bias_function, ncol = 1)

Graph 4.66: Basis functions of a cubic spline with 15 knots broken up in different facets

To see what’s going on in that plot, we took Kurz’s code to break the graphic up, displaying each basis function in its own facet with ggplot2::facet_wrap().

R Code 4.146 b: Using the density of the exponential distribution dexp() to show the prior

Code
tibble::tibble(x = base::seq(from = 0, to = 10, by = 0.1)) |> 
  dplyr::mutate(d = stats::dexp(x, rate = 1)) |> 
  
  ggplot2::ggplot(ggplot2::aes(x = x, y = d)) +
  ggplot2::geom_area(fill = "grey") +
  ggplot2::scale_y_continuous(NULL, breaks = NULL) +
  ggplot2::theme_bw()

Graph 4.67: Using the density of the exponential distribution dexp() to show the prior

We used an exponential prior in the model (see tab “formula” in Example 4.31). To get a sense of what that prior looks like, we plotted its density with dexp().
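A small numeric complement (nothing model-specific, just properties of the Exponential distribution): an Exponential(1) prior has mean \(1 / \text{rate} = 1\) and puts roughly 95% of its mass below 3.

stats::qexp(0.95, rate = 1)   # ~ 3.0: about 95% of the prior mass lies below this value
stats::pexp(3, rate = 1)      # ~ 0.95: the same statement the other way around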

R Code 4.147 b: Add B-splines matrix for 15 knots to the data frame by creating a new data frame

Code
d3_m4.7b <-
  d2_m4.7b |> 
  dplyr::mutate(B_m4.7b = B_m4.7b) 

# take a look at the structure of the new data frame
d3_m4.7b |> 
  dplyr::glimpse()
#> Rows: 827
#> Columns: 6
#> $ year       <int> 812, 815, 831, 851, 853, 864, 866, 869, 889, 891, 892, 894,…
#> $ doy        <int> 92, 105, 96, 108, 104, 100, 106, 95, 104, 109, 108, 106, 10…
#> $ temp       <dbl> NA, NA, NA, 7.38, NA, 6.42, 6.44, NA, 6.83, 6.98, 7.11, 6.9…
#> $ temp_upper <dbl> NA, NA, NA, 12.10, NA, 8.69, 8.11, NA, 8.48, 8.96, 9.11, 8.…
#> $ temp_lower <dbl> NA, NA, NA, 2.66, NA, 4.14, 4.77, NA, 5.19, 5.00, 5.11, 5.5…
#> $ B_m4.7b    <bs[,17]> <bs[26 x 17]>

In R code 4.76a (tab “quap” in Example 4.32), McElreath defined his data in a list (list(D = d2_m4.7a$doy, B_m4.7a = B_m4.7a)). The approach with {brms} will be a little different. We’ll add the B_m4.7b matrix to our d2_m4.7b data frame and name the result d3_m4.7b.

In the d3_m4.7b data, columns year through temp_lower are all standard data columns. The B_m4.7b column is a matrix column, which contains the same number of rows as the others, but also smuggles in 17 columns within that column. Each of those 17 columns corresponds to one of our synthetic \(B_{k}\) variables. The advantage of such a data structure is that we can simply define our formula argument as doy ~ 1 + B_m4.7b, where B_m4.7b is a stand-in for B_m4.7b1 + B_m4.7b2 + ... + B_m4.7b17. (Kurz)
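A quick structural check of the matrix column (a small sketch, not part of the original code, using d3_m4.7b from above):

base::dim(d3_m4.7b$B_m4.7b)        # 827 rows and 17 basis-function columns inside one column
base::colnames(d3_m4.7b$B_m4.7b)   # "1" ... "17"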

See next tab “brm”.

R Code 4.148 b: Fit model m4.7b with brms::brm()

Code
m4.7b <- 
  brms::brm(data = d3_m4.7b,
      family = gaussian,
      doy ~ 1 + B_m4.7b,
      prior = c(brms::prior(normal(100, 10), class = Intercept),
                brms::prior(normal(0, 10), class = b),
                brms::prior(exponential(1), class = sigma)),
      iter = 2000, warmup = 1000, chains = 4, cores = 4,
      seed = 4,
      file = "brm_fits/m04.07b")

brms:::print.brmsfit(m4.7b)
#>  Family: gaussian 
#>   Links: mu = identity; sigma = identity 
#> Formula: doy ~ 1 + B_m4.7b 
#>    Data: d3_m4.7b (Number of observations: 827) 
#>   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
#>          total post-warmup draws = 4000
#> 
#> Population-Level Effects: 
#>           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> Intercept   103.48      2.44    98.62   108.27 1.00      590      932
#> B_m4.7b1     -3.14      3.89   -10.68     4.52 1.00     1780     2472
#> B_m4.7b2     -0.92      4.02    -8.63     6.92 1.00     1331     1862
#> B_m4.7b3     -1.25      3.61    -8.20     6.12 1.00     1233     1842
#> B_m4.7b4      4.73      3.02    -1.27    10.59 1.00      875     1345
#> B_m4.7b5     -0.98      2.93    -6.63     4.90 1.00      792     1394
#> B_m4.7b6      4.18      2.96    -1.66    10.07 1.00      883     1470
#> B_m4.7b7     -5.44      2.84   -11.00     0.20 1.00      820     1325
#> B_m4.7b8      7.71      2.88     2.06    13.19 1.00      719     1430
#> B_m4.7b9     -1.13      2.94    -6.81     4.63 1.00      812     1371
#> B_m4.7b10     2.87      2.94    -2.95     8.58 1.00      851     1393
#> B_m4.7b11     4.53      2.96    -1.15    10.53 1.00      820     1265
#> B_m4.7b12    -0.31      2.95    -6.20     5.45 1.00      805     1411
#> B_m4.7b13     5.43      2.98    -0.34    11.27 1.00      849     1315
#> B_m4.7b14     0.55      3.02    -5.38     6.39 1.00      953     1506
#> B_m4.7b15    -0.90      3.38    -7.52     5.78 1.00     1012     1742
#> B_m4.7b16    -7.08      3.42   -13.99    -0.42 1.00     1116     1982
#> B_m4.7b17    -7.79      3.29   -14.37    -1.20 1.00     1074     1763
#> 
#> Family Specific Parameters: 
#>       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> sigma     5.95      0.15     5.66     6.25 1.00     5900     2573
#> 
#> Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
#> and Tail_ESS are effective sample size measures, and Rhat is the potential
#> scale reduction factor on split chains (at convergence, Rhat = 1).

Each of the 17 columns in our B matrix was assigned its own parameter. If you fit this model using McElreath’s rethinking code, you’ll see the results are very similar. (Kurz, ibid.)
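If you only want the estimated basis weights \(w_{1}, \dots, w_{17}\), e.g. to compare them with a rethinking::quap() fit, they can be pulled out of the fit object. A small sketch, assuming m4.7b as fitted above:

brms::fixef(m4.7b)[-1, "Estimate"]   # drop the intercept row, keep the 17 basis weights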

R Code 4.149 b: Glimpse at transformed data of model m4.7b to draw objects

Code
post_m4.7b <- brms::as_draws_df(m4.7b)

dplyr::glimpse(post_m4.7b)
#> Rows: 4,000
#> Columns: 24
#> $ b_Intercept <dbl> 100.6156, 100.4937, 103.1678, 102.5159, 101.5559, 101.7562…
#> $ b_B_m4.7b1  <dbl> 2.90164141, 3.32435081, -2.70015340, -0.02253457, -1.97050…
#> $ b_B_m4.7b2  <dbl> -2.1067230, -0.5311967, 0.6794133, -2.8772090, 3.4878935, …
#> $ b_B_m4.7b3  <dbl> 2.9992344, 6.5131175, 0.8860770, 1.2321642, -1.3893031, 1.…
#> $ b_B_m4.7b4  <dbl> 6.1257337, 4.9927835, 5.3287564, 4.6657258, 7.8084901, 4.9…
#> $ b_B_m4.7b5  <dbl> 1.16796045, 2.69443262, -1.14483788, -0.87995893, 3.187535…
#> $ b_B_m4.7b6  <dbl> 9.5029937, 5.2401181, 4.1763796, 7.3521813, 3.7853630, 7.5…
#> $ b_B_m4.7b7  <dbl> -3.8974486, -2.5350548, -3.8448680, -1.7500613, -3.5003478…
#> $ b_B_m4.7b8  <dbl> 10.234376, 12.795946, 5.555182, 7.698008, 11.170412, 8.328…
#> $ b_B_m4.7b9  <dbl> -1.516545385, 0.965091340, 3.329924529, 0.450062993, 1.554…
#> $ b_B_m4.7b10 <dbl> 6.48286406, 6.06286271, 0.64513422, 4.83968514, 3.35499895…
#> $ b_B_m4.7b11 <dbl> 7.79660448, 7.56857405, 5.66668708, 3.36413271, 9.02955370…
#> $ b_B_m4.7b12 <dbl> 1.8447173, -0.7395718, 1.8709336, 1.6117838, -0.9294026, 2…
#> $ b_B_m4.7b13 <dbl> 10.26433473, 10.88433479, 4.55267435, 8.00798110, 6.444273…
#> $ b_B_m4.7b14 <dbl> -1.8069079, -0.8779897, 5.5164497, -0.4733169, 7.1190998, …
#> $ b_B_m4.7b15 <dbl> 9.7734694, 8.8519378, -4.2358415, -1.7802159, 0.1933067, 0…
#> $ b_B_m4.7b16 <dbl> -9.2641966, -14.0515369, -3.1279428, 0.1699908, -7.1740999…
#> $ b_B_m4.7b17 <dbl> -0.007993781, -1.342456078, -13.852015496, -9.733215118, -…
#> $ sigma       <dbl> 5.639404, 5.677425, 6.229067, 6.289891, 6.191738, 5.977885…
#> $ lprior      <dbl> -67.06532, -67.63332, -66.43315, -66.13359, -66.75463, -65…
#> $ lp__        <dbl> -2723.712, -2724.698, -2719.803, -2721.983, -2720.532, -27…
#> $ .chain      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ .iteration  <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
#> $ .draw       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…

We used brms::as_draws_df() to transform m4.7b into a draws object so that it can be processed more easily by the {posterior} package.
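For instance, convergence summaries can now be computed directly from the draws object. A minimal sketch with the {posterior} package (already attached, see the session info):

posterior::summarise_draws(post_m4.7b) |> 
  utils::head()   # mean, median, sd, quantiles plus rhat, ess_bulk, ess_tail per parameter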

R Code 4.150 b: Weight each basis function by its corresponding parameter

Code
post_m4.7b |> 
  dplyr::select(b_B_m4.7b1:b_B_m4.7b17) |> 
  rlang::set_names(base::c(stringr::str_c(0, 1:9), 10:17)) |> 
  tidyr::pivot_longer(tidyselect::everything(), 
                      names_to = "bias_function") |> 
  dplyr::group_by(bias_function) |> 
  dplyr::summarise(weight = base::mean(value)) |> 
  ## add weight column to year & bias via "bias_function"
  dplyr::full_join(d_B_m4.7b, by = "bias_function") |> 
  
  # plot
  ggplot2::ggplot(ggplot2::aes(x = year, 
                        y = bias * weight, 
                        group = bias_function)) +
  ggplot2::geom_vline(xintercept = knot_list_b, 
                      color = "white", alpha = 1/2) +
  ggplot2::geom_line(color = "#ffb7c5", 
                     alpha = 1/2, linewidth = 1.5) +
  ggplot2::theme_bw() +
  ggplot2::theme(panel.background = 
                 ggplot2::element_rect(fill = "#4f455c"),
                 panel.grid = ggplot2::element_blank()) 

Graph 4.68: Each basis function weighted by its corresponding parameter

In case you missed it, the main action in the {ggplot2} code was y = bias * weight, where we defined the \(y\)-axis as the product of bias and weight. This is the fulfillment of the \(w_k B_{k, i}\) part of the model. (Kurz, ibid.)
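The same \(w_k B_{k, i}\) logic also gives the posterior-mean spline directly via matrix multiplication. A small cross-check (a sketch, assuming B_m4.7b and m4.7b from above):

w_hat  <- brms::fixef(m4.7b)[-1, "Estimate"]           # posterior means of w_1 ... w_17
a_hat  <- brms::fixef(m4.7b)[1, "Estimate"]            # posterior mean of the intercept
mu_hat <- a_hat + base::as.vector(B_m4.7b %*% w_hat)   # mu_i = a + sum_k w_k * B_k,i
utils::head(mu_hat)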

R Code 4.151 b: Expected values of the posterior predictive distribution

Code
f_m4.7b <- brms:::fitted.brmsfit(m4.7b, probs = c(0.055, 0.945))

f_m4.7b |> 
  base::as.data.frame() |> 
  dplyr::bind_cols(d3_m4.7b) |> 
  
  ggplot2::ggplot(ggplot2::aes(x = year, y = doy, 
                               ymin = Q5.5, ymax = Q94.5)) + 
  ggplot2::geom_vline(xintercept = knot_list_b, 
                      color = "white", alpha = 1/2) +
  ggplot2::geom_hline(yintercept = 
                      brms::fixef(m4.7b, 
                                  probs = c(0.055, 0.945))[1, 1], 
                      color = "white", linetype = 2) +
  ggplot2::geom_point(color = "#ffb7c5", alpha = 1/2) +
  ggplot2::geom_ribbon(fill = "white", alpha = 2/3) +
  ggplot2::labs(x = "year", y = "day in year") +
  ggplot2::theme_bw() +
  ggplot2::theme(panel.background = 
                 ggplot2::element_rect(fill = "#4f455c"),
                 panel.grid = ggplot2::element_blank())

Graph 4.69: Expected values of the posterior predictive distribution

If it wasn’t clear, the dashed horizontal line intersecting a little above \(100\) on the \(y\)-axis is the posterior mean for the intercept. (Kurz, ibid.)

In contrast to Kurz’s code I used probs = c(0.055, 0.945) to get the same 89% intervals (Q5.5 and Q94.5) as in the book.
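The probs argument also determines how the quantile columns of the fitted output are named, which is why the {ggplot2} code above can refer to Q5.5 and Q94.5. A quick check:

base::colnames(f_m4.7b)   # "Estimate" "Est.Error" "Q5.5" "Q94.5"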

4.5.3 Smooth functions for a rough world

“The splines in the previous section are just the beginning. An entire class of models, generalized additive models (GAMs), focuses on predicting an outcome variable using smooth functions of some predictor variables.” (McElreath, 2020, p. 120) (pdf)

Resources for working with Generalized Additive Models (GAMs)



In the bonus sections Kurz added some more references for GAMs.

In his first bonus section Kurz announces that in Chapter 13 we will better understand what is going on with the s() function he describes there. As I didn’t understand this bonus section I skipped it and will come back to it when I have understood more about GAMs and related model-fitting procedures.

In the second bonus section Kurz explained that the compact syntax for passing a matrix column of predictors into the formula, as used in tab “matrix” of Example 4.32, is not an isolated trick but a general approach. He referenced an example of a multiple regression model in Section 11.2.6 of the book “Regression and Other Stories” by Gelman, Hill, and Vehtari. I skip this second bonus section too, as I do not have sufficient knowledge to fully understand the procedure.

4.6 Practice

Problems are labeled Easy (E), Medium (M), and Hard (H).

4.6.1 Easy

4.6.1.1 4E1

Exercise 4E1: In the model definition below, which line is the likelihood?

\[ \begin{align*} y_{i} \sim \operatorname{Normal}(\mu, \sigma) \\ \mu \sim \operatorname{Normal}(0, 10) \\ \sigma \sim \operatorname{Exponential}(1) \end{align*} \tag{4.12}\]


In this kind of model description the first line is always the likelihood, the other lines are the priors.

4.6.1.2 4E2

Exercise 4E2: In the model definition Equation 4.12, how many parameters are in the posterior distribution?

There are two parameters in the posterior distribution: \(\mu\) and \(\sigma\).

4.6.1.3 4E3

Exercise 4E3: Using the model definition Equation 4.12, write down the appropriate form of Bayes’ theorem that includes the proper likelihood and priors.

Remember Equation 2.7:

\[ \text{Posterior} = \frac{\text{Probability of data} \times \text{Prior}}{\text{Average probability of data}} \] \[ \Pr(\mu, \sigma \mid y) = \frac{\prod_{i}\operatorname{Normal}(y_{i} \mid \mu, \sigma) \operatorname{Normal}(\mu \mid 0, 10) \operatorname{Exponential}(\sigma \mid 1)}{\int\int\prod_{i}\operatorname{Normal}(y_{i} \mid \mu, \sigma) \operatorname{Normal}(\mu \mid 0, 10) \operatorname{Exponential}(\sigma \mid 1) \, d\mu \, d\sigma} \tag{4.13}\]

WATCH OUT! I didn’t know the answer and had to look it up in the solutions by Jake Thompson.
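Although the exercise only asks for the formula, Equation 4.13 can also be illustrated numerically with a simple grid approximation. This is only a sketch with made-up data (the exercise provides none, so y is simulated here purely for illustration):

set.seed(4)
y <- stats::rnorm(20, mean = 3, sd = 1.5)            # made-up data, just for illustration

grid_4e3 <- tidyr::expand_grid(
    mu = base::seq(-5, 10, length.out = 200),
    sigma = base::seq(0.1, 5, length.out = 200)) |> 
  dplyr::mutate(
    log_lik  = purrr::map2_dbl(mu, sigma,
                 \(m, s) base::sum(stats::dnorm(y, m, s, log = TRUE))),
    log_post = log_lik +
               stats::dnorm(mu, 0, 10, log = TRUE) +  # prior for mu
               stats::dexp(sigma, 1, log = TRUE),     # prior for sigma
    posterior = base::exp(log_post - base::max(log_post)),
    posterior = posterior / base::sum(posterior))     # the double integral becomes a sum over the grid

grid_4e3 |> dplyr::slice_max(posterior, n = 3)        # grid points with the highest posterior probability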

4.6.1.4 4E4

Exercise 4E4: In the model definition below, which line is the linear model?

\[ \begin{align*} y_{i} \sim \operatorname{Normal}(\mu, \sigma) \\ \mu_{i} = \alpha + \beta{x_{i}} \\ \alpha \sim \operatorname{Normal}(0, 10) \\ \beta \sim \operatorname{Normal}(0, 1) \\ \sigma \sim \operatorname{Exponential}(1) \end{align*} \tag{4.14}\]


The second line (\(\mu_{i} = \alpha + \beta{x_{i}}\)) is the linear model.

4.6.1.5 4E5

Exercise 4E5: In the model definition Equation 4.14, how many parameters are in the posterior distribution?

Distinguish between deterministic (=) and stochastic (~) equations!

There are now three model parameters: \(\alpha\), \(\beta\), and \(\sigma\). The mean \(\mu\) is no longer a parameter, as it is defined deterministically as a function of the other parameters in the model.

WATCH OUT! \(\alpha\) and \(\beta\) are unobserved variables and therefore parameters!

Here comes my completely wrong answer:

There are two parameters in the posterior distribution: \(\mu\) and \(\sigma\).

I did not understand that \(\mu\) is no longer a parameter and thought, exactly the reverse, that \(\alpha\) and \(\beta\) are not parameters but \(\mu\) and \(\sigma\) are.

4.6.2 Medium

4.6.2.1 4M1

Exercise 4M1: For the model definition below, simulate observed \(y\) values from the prior (not the posterior).

\[ \begin{align*} y_{i} \sim \operatorname{Normal}(\mu, \sigma) \\ \mu \sim \operatorname{Normal}(0, 10) \\ \sigma \sim \operatorname{Exponential}(1) \end{align*} \]

See the discussion under Section 4.3.2.2 and the code in Example 4.6.


R Code 4.152 a: Exercise 4M1a: Simulate observed \(y\) values from the prior

Code
k.sim_4m1a <- 1e4
set.seed(4) # to make exercise reproducible

## R code 4.14a adapted #######################################
sample_mu_4m1a <- rnorm(k.sim_4m1a, mean = 0, sd = 10)
sample_sigma_4m1a <- rexp(k.sim_4m1a, rate = 1)
priors_4m1a <- rnorm(k.sim_4m1a, sample_mu_4m1a, sample_sigma_4m1a)
rethinking::dens(priors_4m1a, 
                 adj = 1, 
                 norm.comp = TRUE,
                 show.zero = TRUE,
                 col = "red")

Graph 4.70: Simulate observed y values from the prior

R Code 4.153 b: Exercise 4M1b: Simulate observed \(y\) values from the prior

Code
k.sim_4m1b <- 1e4
set.seed(4) # to make exercise reproducible

## R code 4.14b #######################################

sim_4m1b <-
  tibble::tibble(
    sample_mu_4m1b = stats::rnorm(k.sim_4m1b, mean = 0, sd  = 10),
    sample_sigma_4m1b = stats::rexp(k.sim_4m1b, rate = 1)) |> 
  dplyr::mutate(priors_4m1b = stats::rnorm(k.sim_4m1b, 
                                           mean = sample_mu_4m1b, 
                                           sd = sample_sigma_4m1b))
  
sim_4m1b |> 
  ggplot2::ggplot(ggplot2::aes(x = priors_4m1b)) +
  ggplot2::geom_density(color = "red") +
  ggplot2::stat_function(
        fun = dnorm,
        args = with(sim_4m1b, 
               c(mean = mean(priors_4m1b), 
                 sd = sd(priors_4m1b)))
        ) +
  ggplot2::theme_bw()

Graph 4.71: Simulate observed y values from the prior

4.6.2.2 4M2

Example 4.33 : Translate the model of Section 4.6.2.1 into a quap() and brm() formula

R Code 4.154 a: Translate the model of Section 4.6.2.1 into a quap() formula. (Original)

Code
flist_4m2a <- 
  alist(
    y ~ dnorm(mu, sigma),
    mu ~ dnorm(0,10),
    sigma ~ dexp(1)
  )

flist_4m2a
#> [[1]]
#> y ~ dnorm(mu, sigma)
#> 
#> [[2]]
#> mu ~ dnorm(0, 10)
#> 
#> [[3]]
#> sigma ~ dexp(1)

R Code 4.155 b: Translate the model of Section 4.6.2.1 into a brm() formula. (Tidyverse)

Code
exr_4m2b <- 
  brms::brm(
      ## NOTE: brms::brm() needs data containing a `y` column; d2_b has none,
      ## so we use the y values simulated from the prior in R Code 4.153 b
      data = tibble::tibble(y = sim_4m1b$priors_4m1b),
      family = gaussian(),
      formula = y ~ 1,
      prior = c(brms::prior(normal(0, 10), class = Intercept),
                brms::prior(exponential(1), class = sigma)),
      iter = 2000, warmup = 1000, chains = 4, cores = 4,
      seed = 4,
      file = "brm_fits/m04.4m2b")

4.6.2.3 4M3

Example 4.34 : Translate the quap() model formula below into a mathematical model definition

y ~ dnorm(mu, sigma), 
mu <- a + b * x, 
a ~ dnorm(0, 10), 
b ~ dunif(0, 1),
sigma ~ dexp(1)

R Code 4.156 : Translate the quap() model formula into a mathematical model definition

\[ \begin{align*} y_{i} \sim \operatorname{Normal}(\mu_{i},\sigma) \\ \mu_{i} = \alpha + \beta x_{i} \\ \alpha \sim \operatorname{Normal}(0, 10) \\ \beta \sim \operatorname{Uniform}(0, 1) \\ \sigma \sim \operatorname{Exponential}(1) \end{align*} \]

4.6.2.4 4M4

Exercise 4M4: A sample of students is measured for height each year for 3 years. After the third year, you want to fit a linear regression predicting height using year as a predictor. Write down the mathematical model definition for this regression, using any variable names and priors you choose. Be prepared to defend your choice of priors.

\[ \begin{align*} height_{i} \sim \operatorname{Normal}(\mu_{i}, \sigma) \\ \mu_{i} = \alpha + \beta(x_{i} - \overline{x}) \\ \alpha \sim \operatorname{Normal}(180, 20) \\ \beta \sim \operatorname{Log-Normal}(0, 10) \\ \sigma \sim \operatorname{Exponential}(1) \end{align*} \]

4.6.2.5 4M7

Exercise 4M7: Refit model m4.3 from the chapter, but omit the mean weight xbar this time. Compare the new model’s posterior to that of the original model. In particular, look at the covariance among the parameters. What is different? Then compare the posterior predictions of both models.

Example 4.35 : Example 4M7: Refit model m4.3 without mean weight xbar

R Code 4.157 a: Example 4M7a: Refit model m4.3a without mean weight xbar (Original)

Code
## R code 4.42a adapted #############################

# fit model
exr_4m7a <- rethinking::quap(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a + (b * weight),
    a ~ dnorm(178, 20),
    b ~ dlnorm(0, 1),
    sigma ~ dunif(0, 50)
  ),
  data = d2_a
)

## R code 4.44a ############################
cat("**************** Summary **********************\n")
rethinking::precis(exr_4m7a)

## R code 4.45 ################
cat("\n\n*************** Covariances ********************\n")
round(rethinking::vcov(exr_4m7a), 3)

# rethinking::pairs(exr_4m7a)

## R code 4.46a ############################################
plot(height ~ weight, data = d2_a, col = rethinking::rangi2)
post_exr_4m7a <- rethinking::extract.samples(exr_4m7a)
a_map <- mean(post_exr_4m7a$a)
b_map <- mean(post_exr_4m7a$b)
curve(a_map + (b_map * x) , add = TRUE)
#> **************** Summary **********************
#>              mean         sd        5.5%       94.5%
#> a     114.5351396 1.89774923 111.5021698 117.5681094
#> b       0.8907121 0.04175803   0.8239747   0.9574495
#> sigma   5.0727199 0.19124908   4.7670670   5.3783729
#> 
#> 
#> *************** Covariances ********************
#>            a      b sigma
#> a      3.601 -0.078 0.009
#> b     -0.078  0.002 0.000
#> sigma  0.009  0.000 0.037

Graph 4.72: Plotting posterior inference against the data (Original)

R Code 4.158 b: Example 4M7a: Refit model m4.3b without mean weight xbar (Tidyverse)

Code
exr_4m7b <- 
  brms::brm(data = d2_b, 
      formula = height ~ 1 + weight,
      family = gaussian(),
      prior = c(brms::prior(normal(178, 20), class = Intercept),
                brms::prior(lognormal(0, 1), class = b, lb = 0),
                brms::prior(uniform(0, 50), class = sigma, ub = 50)),
      iter = 2000, warmup = 1000, chains = 4, cores = 4,
      seed = 4,
      file = "brm_fits/exr_4m7b")

cat("**************** Summary **********************\n")
brms:::summary.brmsfit(exr_4m7b)

cat("\n\n*************** Covariances ********************\n")
brms::as_draws_df(exr_4m7b) |>
  dplyr::select(b_Intercept:sigma) |>
  stats::cov() |>
  base::round(digits = 3)

d2_b |>
  ggplot2::ggplot(ggplot2::aes(x = weight, y = height)) +
  ggplot2::geom_abline(
    intercept = brms::fixef(exr_4m7b, probs = c(0.055, 0.945))[[1]], 
    slope     = brms::fixef(exr_4m7b, probs = c(0.055, 0.945))[[2]]) +
  ggplot2::geom_point(shape = 1, size = 2, color = "royalblue") +
  ggplot2::labs(x = "weight", y = "height") +
  ggplot2::theme_bw()
#> **************** Summary **********************
#>  Family: gaussian 
#>   Links: mu = identity; sigma = identity 
#> Formula: height ~ 1 + weight 
#>    Data: d2_b (Number of observations: 352) 
#>   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
#>          total post-warmup draws = 4000
#> 
#> Population-Level Effects: 
#>           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> Intercept   113.95      1.91   110.17   117.71 1.00     3822     2916
#> weight        0.90      0.04     0.82     0.98 1.00     3789     2817
#> 
#> Family Specific Parameters: 
#>       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> sigma     5.10      0.20     4.73     5.51 1.00     4014     2686
#> 
#> Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
#> and Tail_ESS are effective sample size measures, and Rhat is the potential
#> scale reduction factor on split chains (at convergence, Rhat = 1).
#> 
#> 
#> *************** Covariances ********************
#>             b_Intercept b_weight  sigma
#> b_Intercept       3.660   -0.079 -0.012
#> b_weight         -0.079    0.002  0.000
#> sigma            -0.012    0.000  0.039

Graph 4.73: Plotting posterior inference against the data (Tidyverse)

In contrast to the vcov() values of the centered-weight version in Example 4.20, we now have large covariances between the parameters. But the posterior predictions of both models (centered and uncentered version) are nearly the same. Compare with Graph 4.30 (Original) and Graph 4.32 (Tidyverse).
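Because the size of a covariance depends on the scale of the parameters, the corresponding correlations make the contrast with the centered model even clearer. A small additional check on exr_4m7b (a sketch, mirroring the covariance code above):

brms::as_draws_df(exr_4m7b) |> 
  dplyr::select(b_Intercept:sigma) |> 
  stats::cor() |> 
  base::round(digits = 2)   # strong negative correlation between intercept and slope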


4.7 Session Info

R Code 4.159 : Session Info

Code
#> R version 4.3.2 (2023-10-31)
#> Platform: x86_64-apple-darwin20 (64-bit)
#> Running under: macOS Sonoma 14.1.2
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: Europe/Vienna
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] parallel  stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] rethinking_2.40     posterior_1.5.0     cmdstanr_0.5.3     
#>  [4] patchwork_1.1.3     lubridate_1.9.3     forcats_1.0.0      
#>  [7] stringr_1.5.1       dplyr_1.1.4         purrr_1.0.2        
#> [10] readr_2.1.4         tidyr_1.3.0         tibble_3.2.1       
#> [13] ggplot2_3.4.4       tidyverse_2.0.0     glossary_1.0.0.9000
#> 
#> loaded via a namespace (and not attached):
#>   [1] gridExtra_2.3        inline_0.3.19        sandwich_3.1-0      
#>   [4] rlang_1.1.2          magrittr_2.0.3       multcomp_1.4-25     
#>   [7] matrixStats_1.2.0    compiler_4.3.2       loo_2.6.0           
#>  [10] vctrs_0.6.5          reshape2_1.4.4       shape_1.4.6         
#>  [13] pkgconfig_2.0.3      fastmap_1.1.1        backports_1.4.1     
#>  [16] ellipsis_0.3.2       utf8_1.2.4           threejs_0.3.3       
#>  [19] promises_1.2.1       rmarkdown_2.25       tzdb_0.4.0          
#>  [22] markdown_1.12        xfun_0.41            jsonlite_1.8.8      
#>  [25] later_1.3.2          R6_2.5.1             dygraphs_1.1.1.6    
#>  [28] stringi_1.8.3        StanHeaders_2.26.28  estimability_1.4.1  
#>  [31] Rcpp_1.0.11          rstan_2.32.3         knitr_1.45          
#>  [34] zoo_1.8-12           base64enc_0.1-3      bayesplot_1.10.0    
#>  [37] timechange_0.2.0     httpuv_1.6.13        Matrix_1.6-4        
#>  [40] splines_4.3.2        igraph_1.6.0         tidyselect_1.2.0    
#>  [43] rstudioapi_0.15.0    abind_1.4-5          yaml_2.3.8          
#>  [46] timeDate_4032.109    codetools_0.2-19     miniUI_0.1.1.1      
#>  [49] curl_5.2.0           pkgbuild_1.4.3       lattice_0.22-5      
#>  [52] plyr_1.8.9           withr_2.5.2          shiny_1.8.0         
#>  [55] bridgesampling_1.1-2 coda_0.19-4          evaluate_0.23       
#>  [58] survival_3.5-7       RcppParallel_5.1.7   xts_0.13.1          
#>  [61] xml2_1.3.6           pillar_1.9.0         tensorA_0.36.2      
#>  [64] checkmate_2.3.1      DT_0.31              stats4_4.3.2        
#>  [67] shinyjs_2.1.0        distributional_0.3.2 generics_0.1.3      
#>  [70] hms_1.1.3            rstantools_2.3.1.1   munsell_0.5.0       
#>  [73] commonmark_1.9.0     scales_1.3.0         timeSeries_4031.107 
#>  [76] gtools_3.9.5         xtable_1.8-4         glue_1.6.2          
#>  [79] emmeans_1.8.9        tools_4.3.2          shinystan_2.6.0     
#>  [82] colourpicker_1.3.0   mvtnorm_1.2-4        grid_4.3.2          
#>  [85] QuickJSR_1.0.8       crosstalk_1.2.1      colorspace_2.1-0    
#>  [88] nlme_3.1-164         cli_3.6.2            fansi_1.0.6         
#>  [91] Brobdingnag_1.2-9    V8_4.4.1             rversions_2.1.2     
#>  [94] gtable_0.3.4         digest_0.6.33        TH.data_1.1-2       
#>  [97] brms_2.20.4          htmlwidgets_1.6.4    farver_2.1.1        
#> [100] htmltools_0.5.7      lifecycle_1.0.4      mime_0.12           
#> [103] shinythemes_1.2.0    MASS_7.3-60