Personal Notes: Statistical Rethinking (2nd ed)

Author

Peter Baumgartner

Published

2023-09-06 20:15

Preface

This is a work in progress: it includes 118 of the 555 {rethinking} code chunks (\(\approx\) 21.3% of the book content).

Content and Goals of this Book

Text Passages

This book collects my personal notes from reading Statistical Rethinking by Richard McElreath. I am using the second edition, published in 2020 by CRC Press, an imprint of Routledge of the Taylor & Francis Group. Additionally, I am using Statistical Rethinking 2023, the most recent set of free YouTube video lectures.

You can find links to other material on McElreath’s website about the book. Of special interest to me are the brms+tidyverse and the Stan+tidyverse conversions of his code. As I am not very experienced with R and completely new to Bayesian statistics and its tools, this additional material is also very challenging for me. I am planning to read these conversions in parallel with the book (section by section) and will dedicate parallel sections to their approaches. This has the advantage that the section numbers of my files match the section numbers of the second edition of the printed book.

Section headers indicate the source:

  • “Original” refers to the original book
  • “Tidyverse” refers to the {tidyverse} / {brms} conversion
  • “Stan” refers to the {rstan} conversion
  • “Reconsideration” refers to sections with my personal comments.

My text and code consist mostly of quotes from

  • the second edition of the book (2020),
  • the text of Richard McElreath’s video lectures (2023),
  • Solomon Kurz’s tidyverse / brms version, or
  • Vincent Arel-Bundock’s conversion to Stan code.

Often I made minor edits (e.g., shortening the text) or put the content into my own words. But almost all of the text in this Quarto book is not mine; it comes from the resources mentioned above. Therefore, in many places I have not marked these quotes. If you follow the book or the other resources, you will notice the similarities and know which paragraph or section I am referring to. And you will also recognize when a text passage reflects my own thoughts. In any case, I alone am responsible for this text, especially where I have used code from the resources wrongly or misunderstood a quoted passage.

Warning

I wrote this book as a text for others to read because that forces me to become explicit and to explain all my learning outcomes more carefully. Please keep in mind that this text is written not by an expert but by a learner. Although it replicates most of the content, it may contain many mistakes. All these misapprehensions and errors are my responsibility.

Code Chunks

The packages {rethinking} and {brms} serve similar purposes. Therefore they share many identical function names. To prevent name conflicts, Kurz unloaded the {rethinking} package whenever he explained {brms} functions. But this approach is not efficient for the structure of my documents, where I constantly switch between these two packages. So with base::library() I load only the {tidyverse} meta package with its nine attached packages: {dplyr}, {forcats}, {ggplot2}, {lubridate}, {purrr}, {readr}, {stringr}, {tibble}, and {tidyr}.

Whenever I use another package, I call the function with the package name in front, using the syntax <package name>::<function name>().
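
The namespaced call syntax can be tried with any package that is installed but not attached. The sketch below uses {tools} purely as an illustration, because it ships with R; it is not one of the packages used in these notes.

```r
# {tools} ships with R but is not attached at startup, so its functions
# cannot be called by bare name unless it is loaded with library().
# The <package name>::<function name>() syntax reaches them anyway.
ext <- tools::file_ext("R/code.R") # file extension of a path
ext

# The namespace is now loaded, but the package is still not attached,
# so the call added no function names to the search path.
isNamespaceLoaded("tools")
```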

To prevent conflicts between the names of chunks, objects and variables, I append one of the following suffixes to the name:

  • suffix a for the original book version in the main text
  • suffix b for the {tidyverse} / {brms} version in the main text
  • suffix c for the {rstan} version in the main text (not used)
  • suffix r for the {rethinking} version in the synopsis
  • suffix s for the Stan {rstan} version in the synopsis
  • suffix t for the {tidyverse} / {brms} version in the synopsis

I am not using the exact code snippets for my replications, because I replicate the code not only to see how it works but also to change parameter values and observe their influence. Especially when it comes to plotting, I try to use ggplot2 instead of the base plotting system, with which I have no experience at all.

As I already have some experience with the {tidyverse} approach, I do not include all code snippets from Kurz’s version. I am concentrating on learning Bayesian statistics, and if a passage contains nothing conceptually new for me, I am not going to include it.

This is my first book using Quarto instead of bookdown, so I am also using these notes to learn Quarto. As a result you will sometimes find remarks or call-out blocks about my Quarto experiences.

Synopsis

At the end of each chapter I summarize the rationale for the techniques used. It is more than a summary because it goes into the details of code chunks. But I will leave out all supporting or illustrating passages and focus on a holistic point of view. The code chunks are somewhat cleaned up so that the code snippets most important for understanding the technique take center stage.

This synopsis gives me another occasion to digest the main points and try out the most important code chunks of the text. I will use the same names for code chunks, objects and variables as in the main text but with different suffixes (r, s, t instead of a, b, c).

Get Code Examples

Go to the book website and download the R code examples for the book.

dir.create("R") # create the target folder
download.file("http://xcelab.net/rmpubs/sr2/code.txt", "R/code.R") # save code.txt as an R script

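
When the notes are rendered repeatedly, the file would be downloaded again on every run. A minimal sketch of a guard against that follows; the helper name fetch_once() is hypothetical and not part of the book or of any package.

```r
# Hypothetical helper (not from the book): download a file only if it is missing.
fetch_once <- function(url, destfile) {
  if (file.exists(destfile)) {
    message("Already present, skipping download: ", destfile)
    return(invisible(FALSE))
  }
  # Create the target folder if needed; no warning if it already exists
  dir.create(dirname(destfile), recursive = TRUE, showWarnings = FALSE)
  utils::download.file(url, destfile)
  invisible(TRUE)
}

# Usage with the URL from above (requires an internet connection):
# fetch_once("http://xcelab.net/rmpubs/sr2/code.txt", "R/code.R")
```
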
Caution

There are big differences between the code snippets of the 2nd edition collected in code.txt and the new versions prepared for the 3rd edition of the book. These new code snippets can be found in the slides and/or in the videos. I will always refer to the place where they can be found.

Additionally, you will find all the scripts supporting the animations in the lectures at the new 2023 GitHub repo.

The code snippets do not follow the tidyverse style. For instance, the equal sign = (and likewise ==) is not surrounded by spaces, whereas in a list of arguments each comma has a space both before and after it.

sample <- c("W","L","W","W","W","L","W","L","W")
W <- sum(sample=="W") # number of W observed
L <- sum(sample=="L") # number of L observed
p <- c(0,0.25,0.5,0.75,1) # proportions W
ways <- sapply( p , function(q) (q*4)^W * ((1-q)*4)^L )
prob <- ways/sum(ways)
cbind( p , ways , prob )

I converted the original code style to the tidyverse style with the RStudio addin of the {styler} package. Assuming that the default style transformer is styler::tidyverse_style(), I selected the code snippet I wanted to convert and called the addin, which ran styler:::style_selection(). As an example, transforming the above code snippet resulted in the code below:

sample <- c("W", "L", "W", "W", "W", "L", "W", "L", "W")
W <- sum(sample == "W") # number of W observed
L <- sum(sample == "L") # number of L observed
p <- c(0, 0.25, 0.5, 0.75, 1) # proportions W
ways <- sapply(p, function(q) (q * 4)^W * ((1 - q) * 4)^L)
prob <- ways / sum(ways)
cbind(p, ways, prob)
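
The same conversion can presumably also be done from code instead of through the addin. The following sketch assumes that {styler} is installed and uses its exported function styler::style_text(), whose default style is the tidyverse style; it restyles one line of the original snippet passed as a string.

```r
# One line of the original snippet, kept as a string
raw_code <- "ways <- sapply( p , function(q) (q*4)^W * ((1-q)*4)^L )"

if (requireNamespace("styler", quietly = TRUE)) {
  # style_text() applies the tidyverse style by default
  styled <- styler::style_text(raw_code)
  print(styled)
} else {
  message("Package {styler} is not installed; nothing to restyle.")
}
```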

As copy & paste from the slides does not work, I downloaded the PDF of the Speaker Deck slides. Even then it did not always work. In those cases I used TextSniper and formatted the code manually. But these copy & paste problems arise only with new code prepared for the 3rd edition. With the book (2nd ed.) I have no problems copying the code snippets from the ePUB eBook version via calibre.

Setup Chunks

At first I tried to collect all necessary packages and load them with library() in the setup chunk (see Quarto equivalent to RMarkdown setup chunk). The idea was to be able to run individual chunks that call functions from different packages for test purposes, without having to run all code chunks of the very long files. Additionally, I tried to prevent function name conflicts with the conflicts_prefer() function of the {conflicted} package in each setup file.

But it turned out that this was not a feasible approach: I noticed too many conflicts between packages with a similar purpose (in my case between {rethinking} and {brms}). Furthermore, many of these conflicts are hidden because they come from imports from other packages.

I will therefore load just the meta package {tidyverse} in the setup chunk of every file. To prevent conflicts between function names I will call functions of other packages with the syntax <package name>::<function name>(). This has the additional advantage that I learn which package each function comes from. And my aim of being able to run code chunks separately is also met.
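
How a name conflict arises, and why the namespaced call is immune to it, can be illustrated in base R alone. In this sketch a user-defined function stands in for a function from a second attached package; the object names masked and namespaced are chosen only for illustration.

```r
# A user-defined sd() masks stats::sd() on the search path,
# just as a function from one attached package can mask another's.
sd <- function(x) "my own sd()"

masked <- sd(c(1, 2, 3)) # calls the masking function defined above
namespaced <- stats::sd(c(1, 2, 3)) # still reaches the {stats} version

masked
namespaced

rm(sd) # remove the mask again
```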

I will differentiate between “base R” referring to all packages loaded automatically after starting R and the package {base} itself, that is part of the collection of “base R” packages.
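
Which packages belong to each group can be inspected directly. The sketch below assumes a default R configuration; the object names default_pkgs and base_priority are chosen only for illustration.

```r
# Packages that R attaches automatically at startup (in addition to {base})
default_pkgs <- getOption("defaultPackages")
default_pkgs

# {base} itself is always on the search path
"package:base" %in% search()

# All packages shipped with R and marked with priority "base";
# this set also contains packages that are not attached at startup, e.g. {tools}
base_priority <- rownames(utils::installed.packages(priority = "base"))
base_priority
```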

If you find errors, please do not hesitate to open issues or PRs on my GitHub site. I really appreciate learning from more experienced R users! It shortens the learning path of self-directed learners.

Package Installation

In contrast to the sparse and partly outdated remarks in the book, use the installation section of the rethinking package on GitHub.

Step 1

Of the three steps, I had already completed the first one (rstan and the C++ toolchain), so I had no need to follow the detailed instructions for the rstan installation at https://mc-stan.org/users/interfaces/rstan.html.

Step 2

To install the cmdstanr package I visited https://mc-stan.org/cmdstanr/. This is an addition to my previous installation with the older version (2nd ed., 2022). As I was installing the latest beta version of cmdstanr for the first time, I also needed to compile the libraries with cmdstanr::install_cmdstan().

To check the result of my installation I ran cmdstanr::check_cmdstan_toolchain().

install.packages("cmdstanr", repos = c("https://mc-stan.org/r-packages/", getOption("repos")))
cmdstanr::install_cmdstan()
cmdstanr::check_cmdstan_toolchain()
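
After the installation, a quick check from R can confirm that {cmdstanr} actually finds a CmdStan installation. This is a minimal sketch: the helper cmdstan_ready() is hypothetical, and it relies on cmdstanr::cmdstan_version() signalling an error when no installation is found.

```r
# Hypothetical helper: TRUE if {cmdstanr} is installed and finds CmdStan
cmdstan_ready <- function() {
  if (!requireNamespace("cmdstanr", quietly = TRUE)) {
    return(FALSE)
  }
  # Treat an error from cmdstan_version() as "no CmdStan installation found"
  version <- tryCatch(
    cmdstanr::cmdstan_version(),
    error = function(e) NA_character_
  )
  is.character(version) && length(version) == 1 && !is.na(version)
}

cmdstan_ready()
```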

The command I used to install cmdstanr did not install the vignettes, which take a long time to build, but they are always available online at https://mc-stan.org/cmdstanr/articles/.

The vignette Getting started with CmdStanR also recommends loading the bayesplot and posterior packages, which are used later in the CmdStanR examples. But I believe these two packages are not necessary if you just plan to stick with the book.

Step 3

Once the infrastructure is installed, one can install the packages used by the book. With the exception of rethinking, the companion package of the book, they can all be downloaded from CRAN.

I already had devtools installed, so I removed it from the list of packages to install.

# CRAN packages used by the book
install.packages(c("coda", "mvtnorm", "loo", "dagitty", "shape"))
# {rethinking} is not on CRAN and is installed from GitHub
devtools::install_github("rmcelreath/rethinking")
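
Before reinstalling everything, it may be worth checking which of these packages are still missing. The sketch below uses a hypothetical helper missing_packages(); note that requireNamespace() loads the namespace of every package it finds as a side effect.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed yet
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

cran_pkgs <- c("coda", "mvtnorm", "loo", "dagitty", "shape")
to_install <- missing_packages(cran_pkgs)

if (length(to_install) > 0) {
  message("Not yet installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install) # uncomment to install them
} else {
  message("All CRAN packages for the book are already installed.")
}
```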

Course Schedule

The course schedule matches the lectures (2023 videos and slides) with the chapters of the second edition of the book (2020). It was taken as a screenshot from Statistical Rethinking 2023 - 01 - The Golem of Prague (50:09), but can also be found as a slide in Statistical Rethinking 2023 - Lecture 01.

The following HTML table, taken from the README.md file for the 2023 lectures, provides a better overview with links to videos and slides.

| Week | Meeting date | Reading | Lectures |
|------|--------------|---------|----------|
| 01 | 06 January | Chapters 1, 2 and 3 | [1] <Golem of Prague> <Slides> |
| | | | [2] <Garden of Forking Data> <Slides> |
| 02 | 13 January | Chapter 4 | [3] <Geocentric Models> <Slides> |
| | | | [4] <Categories and Curves> <Slides> |
| 03 | 20 January | Chapters 5 and 6 | [5] <Elemental Confounds> <Slides> |
| | | | [6] <Good and Bad Controls> <Slides> |
| 04 | 27 January | Chapters 7, 8 and 9 | [7] <Overfitting> <Slides> |
| | | | [8] <MCMC> <Slides> |
| 05 | 03 February | Chapters 10 and 11 | [9] <Modeling Events> <Slides> |
| | | | [10] <Counts and Confounds> <Slides> |
| 06 | 10 February | Chapters 11 and 12 | [11] <Ordered Categories> <Slides> |
| | | | [12] <Multilevel Models> <Slides> |
| 07 | 17 February | Chapter 13 | [13] <Multilevel Adventures> <Slides> |
| | | | [14] <Correlated Features> <Slides> |
| 08 | 24 February | Chapter 14 | [15] <Social Networks> <Slides> |
| | | | [16] <Gaussian Processes> <Slides> |
| 09 | 03 March | Chapter 15 | [17] <Measurement> <Slides> |
| | | | [18] <Missing Data> <Slides> |
| 10 | 10 March | Chapters 16 and 17 | [19] <Generalized Linear Madness> <Slides> |
| | | | [20] <Horoscopes> <Slides> |