1. Prior and posterior are distributions which sum to 1, so prior and posterior are on the same scale. However, the likelihood does not sum to anything in particular. In order to plot the likelihood on the same scale, it has been rescaled to sum to 1. Only the relative shape of the likelihood matters, not its absolute scale.↩︎

2. In this book, “random” and “uncertain” are synonyms; the opposite of “random” is “certain”. (Later we will encounter random variables; “constant” is an antonym of “random variable”.) The word “random” has many uses in everyday life, which have evolved over time. Unfortunately, some of the everyday meanings of “random”, like “haphazard” or “unexpected”, are contrary to what we mean by “random” in this book. For example, we would consider Steph Curry shooting a free throw to be a random phenomenon because we’re not certain if he’ll make it or miss it; but we would not consider this process to be haphazard or unexpected.↩︎

3. We will refer to as “random” any scenario that involves a reasonable degree of uncertainty. We’re avoiding philosophical questions about what is “true” randomness, like the following. Is a coin flip really random? If all factors that affect the trajectory of the coin were known precisely, then wouldn’t the outcome be determined? Does true randomness only exist in quantum mechanics?↩︎

4. Probabilities are usually defined as decimals, but are often colloquially referred to as percentages. We’re not sticklers; we’ll refer to probabilities both as decimals and as percentages.↩︎

5. The Grand Duke of Tuscany posed this problem to Galileo, who published his solution in 1620. However, unbeknownst to Galileo, the same problem had been solved almost 100 years earlier by Gerolamo Cardano, one of the first mathematicians to study probability.↩︎

6. A natural question is: “how many repetitions are required to represent the long run?” We’ll consider this question when we discuss MCMC methods.↩︎

7. We do not advocate gambling. We merely use gambling contexts to motivate probability concepts.↩︎

8. You could probably get a pretty good idea by searching online, but don’t do that. Instead, answer the questions based on what you already know about me.↩︎

9. This section only covers Bayes’ rule for events. We’ll see Bayes’ rule for distributions of random variables later. But the ideas are analogous.↩︎

10. We’re using “hypothesis” in the sense of a general scientific hypothesis, not necessarily a statistical null or alternative hypothesis.↩︎

11. More formally, $$H_1,\ldots, H_k$$ form a partition: the events are disjoint, $$H_i\cap H_j=\emptyset$$ for $$i\neq j$$, and exhaustive, $$P\left(\cup_{i=1}^k H_i\right)=1$$.↩︎

12. “Posterior is proportional to likelihood times prior” summarizes the whole course in a single sentence.↩︎

13. In Section 3.2.2 of Statistical Rethinking (McElreath (2020)), the author suggests 67%, 89%, and 97%: “a series of nested intervals may be more useful than any one interval. For example, why not present 67%, 89%, and 97% intervals, along with the median? Why these values? No reason. They are prime numbers, which makes them easy to remember. But all that matters is they be spaced enough to illustrate the shape of the posterior. And these values avoid 95%, since conventional 95% intervals encourage many readers to conduct unconscious hypothesis tests.”↩︎

14. $$\theta$$ is used to denote both: (1) the actual parameter (i.e., the random variable) $$\theta$$ itself, and (2) possible values of $$\theta$$.↩︎

15. The expression defines the shape of the Beta density. All that’s missing is the scaling constant which ensures that the total area under the density is 1. The actual Beta density formula, including the normalizing constant, is $f(u) =\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\; u^{\alpha-1}(1-u)^{\beta-1}, \quad 0<u<1,$ where $$\Gamma(\alpha) = \int_0^\infty e^{-v}v^{\alpha-1} dv$$ is the Gamma function. For a positive integer $$k$$, $$\Gamma(k) = (k-1)!$$. Also, $$\Gamma(1/2)=\sqrt{\pi}$$.↩︎
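
As a quick numerical sanity check (a sketch in Python rather than the book's R; the values of $$\alpha$$ and $$\beta$$ are arbitrary), we can confirm that the normalizing constant makes the density integrate to 1, and verify the quoted Gamma function facts:

```python
import math
from scipy import integrate

a, b = 3.0, 2.0  # arbitrary Beta parameters, for illustration only

# Normalizing constant built from the Gamma function
const = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))

# The full density should integrate to 1 over (0, 1)
area, _ = integrate.quad(lambda u: const * u**(a - 1) * (1 - u)**(b - 1), 0, 1)
print(round(area, 6))  # 1.0

# Gamma function facts quoted above
print(math.gamma(5) == math.factorial(4))                 # True: Γ(k) = (k-1)!
print(math.isclose(math.gamma(0.5), math.sqrt(math.pi)))  # True: Γ(1/2) = √π
```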

16. The posterior predictive distribution can be found analytically in the Beta-Binomial situation. If $$\theta\sim$$ Beta$$(\alpha, \beta)$$ and $$(Y|\theta)\sim$$ Binomial$$(n, \theta)$$, then the marginal distribution of $$Y$$ is the Beta-Binomial distribution with $P(Y = y) = \binom{n}{y}\frac{B(\alpha+y,\beta+n-y)}{B(\alpha, \beta)}, \qquad y = 0, 1, \ldots, n,$ where $$B(\alpha, \beta)$$ is the beta function, for which $$B(\alpha,\beta)=\frac{(\alpha-1)!(\beta-1)!}{(\alpha+\beta-1)!}$$ if $$\alpha,\beta$$ are positive integers. (For general $$\alpha,\beta>0$$, $$B(\alpha,\beta)=\int_0^1u^{\alpha-1} (1-u)^{\beta-1}du = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$$.) The mean is $$n\left(\frac{\alpha}{\alpha+\beta}\right)$$. In R: dbbinom, rbbinom, and pbbinom in the extraDistr package.↩︎
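
The pmf above can be checked directly. The following is a Python sketch (the book itself uses the extraDistr R functions) with arbitrary illustrative parameter values; it confirms the probabilities sum to 1 and the mean matches $$n\,\alpha/(\alpha+\beta)$$:

```python
from math import comb
from scipy.special import beta as B  # the beta function B(a, b)

a, b, n = 2.0, 3.0, 10  # arbitrary alpha, beta, n for illustration

# Beta-Binomial pmf from the formula above
pmf = [comb(n, y) * B(a + y, b + n - y) / B(a, b) for y in range(n + 1)]

print(round(sum(pmf), 6))               # 1.0: a valid pmf
mean = sum(y * p for y, p in enumerate(pmf))
print(round(mean, 6))                   # n * a/(a+b) = 4.0
```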

17. This example is motivated by an example in Section 1.1 of Dogucu, Johnson, and Ott (2022).↩︎

18. For some history, and an origin of the use of “Monte Carlo”, see Wikipedia.↩︎

19. If you’ve ever heard of BUGS (or WinBUGS), JAGS is very similar but with a few nicer features.↩︎

20. The examples in this section are motivated by examples in Kruschke (2015).↩︎

21. If $$Y_1$$ has mean $$\theta_1$$ and $$Y_2$$ has mean $$\theta_2$$ then linearity of expected value implies that $$Y_1+Y_2$$ has mean $$\theta_1+\theta_2$$. If $$Y_1$$ has variance $$\theta_1$$ and $$Y_2$$ has variance $$\theta_2$$ then independence of $$Y_1$$ and $$Y_2$$ implies that $$Y_1+Y_2$$ has variance $$\theta_1+\theta_2$$. What Poisson aggregation says is that if component counts are independent and each with a Poisson shape, then the total count also has a Poisson shape.↩︎
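
Poisson aggregation can be verified numerically. This Python sketch (arbitrary rates, chosen only for illustration) convolves the pmfs of two independent Poisson counts and compares the result to the Poisson pmf with the summed rate:

```python
from scipy.stats import poisson

lam1, lam2 = 2.0, 3.0  # arbitrary component rates

# P(Y1 + Y2 = k) by convolving the two pmfs, versus the Poisson(lam1 + lam2) pmf
for k in range(15):
    conv = sum(poisson.pmf(j, lam1) * poisson.pmf(k - j, lam2) for j in range(k + 1))
    assert abs(conv - poisson.pmf(k, lam1 + lam2)) < 1e-12

print("sum of independent Poissons matches Poisson with the summed rate")
```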

22. I keep meaning to mention this: technically, the $$Y$$ values are not independent. Rather, they are conditionally independent given $$\theta$$. This is a somewhat subtle distinction, so I’ve glossed over the details.↩︎

23. Sometimes Gamma densities are parametrized in terms of the scale parameter, the reciprocal of the rate; if the scale parameter is denoted $$\lambda$$, the mean is $$\alpha\lambda$$.↩︎
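
For example, scipy uses the scale convention (much like `rgamma`'s `scale=` argument in R). A minimal Python sketch, with arbitrary parameter values:

```python
from scipy import stats

alpha, scale = 3.0, 2.0  # shape and scale (scale = 1/rate); arbitrary values

# In the scale parametrization, mean = shape * scale and var = shape * scale^2
g = stats.gamma(a=alpha, scale=scale)
print(g.mean())  # alpha * scale = 6.0
print(g.var())   # alpha * scale**2 = 12.0
```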

24. The expression defines the shape of a Gamma density. All that’s missing is the scaling constant which ensures that the total area under the density is 1. The actual Gamma density formula, including the normalizing constant, is $f(u) =\frac{\lambda^\alpha}{\Gamma(\alpha)}\; u^{\alpha-1}e^{-\lambda u}, \quad u>0,$ where $$\Gamma(\alpha) = \int_0^\infty e^{-v}v^{\alpha-1} dv$$ is the Gamma function. For a positive integer $$k$$, $$\Gamma(k) = (k-1)!$$. Also, $$\Gamma(1/2)=\sqrt{\pi}$$.↩︎
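
As a check, the formula can be compared against a reference implementation. This Python sketch (arbitrary shape, rate, and evaluation point) evaluates the density above and matches it to scipy's Gamma pdf, which uses scale $$=1/\lambda$$:

```python
import math
from scipy import stats

alpha, lam, u = 2.5, 1.5, 0.8  # arbitrary shape, rate, and point

# Density from the formula above (rate parametrization)
f = lam**alpha / math.gamma(alpha) * u**(alpha - 1) * math.exp(-lam * u)

# scipy parametrizes by scale = 1/lambda; the two densities should agree
print(math.isclose(f, stats.gamma.pdf(u, a=alpha, scale=1/lam)))  # True
```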

25. I’ve been naming these models in the form “Prior-Likelihood”, e.g. Gamma prior and Poisson likelihood. I would rather do it as “Likelihood-Prior”. In modeling, the likelihood comes first; what is an appropriate distributional model for the observed data? This likelihood depends on some parameters, and then a prior distribution is placed on these parameters. So in modeling the order is likelihood then prior, and it would be nice if the names followed that pattern. But “Beta-Binomial” is the canonical example, and no one calls that “Binomial-Beta”. To be consistent, we’ll stick with the “Prior-Likelihood” naming convention.↩︎

26. Remember, if $$X$$ and $$Y$$ are independent then the correlation is 0, but the converse is not true in general. However, if $$X$$ and $$Y$$ have a Bivariate Normal distribution and their correlation is 0, then $$X$$ and $$Y$$ are independent.↩︎
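
The classic counterexample to the converse can be simulated. In this Python sketch, $$Y=X^2$$ is a deterministic function of $$X$$ (so the two are clearly dependent), yet their correlation is essentially 0 because $$\mathrm{Cov}(X, X^2)=E(X^3)=0$$ for a standard Normal $$X$$:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
y = x**2  # a deterministic function of x, so x and y are dependent

# Cov(X, X^2) = E[X^3] = 0 for a standard Normal, so the sample correlation is ~0
print(abs(np.corrcoef(x, y)[0, 1]) < 0.01)  # True
```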

27. There are about 4 million live births in the U.S. per year. The data is available at the CDC website. We’re only using a random sample to cut down on computation time.↩︎

28. If the observed data has multiple modes or is skewed, then other parameters like median or mode might be more appropriate measures of center than the population mean.↩︎

29. Values from a Multivariate Normal distribution can be simulated using mvrnorm from the MASS package. For Bivariate Normal, the inputs are the mean vector $$[E(\mu_1), E(\mu_2)]$$ and the covariance matrix $\begin{bmatrix} \textrm{Var}(\mu_1) & \textrm{Cov}(\mu_1, \mu_2) \\ \textrm{Cov}(\mu_1, \mu_2) & \textrm{Var}(\mu_2) \end{bmatrix}$ where $$\textrm{Cov}(\mu_1, \mu_2) = \textrm{Corr}(\mu_1, \mu_2)\textrm{SD}(\mu_1)\textrm{SD}(\mu_2)$$. ↩︎
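
The same simulation can be sketched in Python with numpy's `multivariate_normal` in place of `mvrnorm`. The prior moments below are hypothetical values chosen only for illustration:

```python
import numpy as np

# Hypothetical prior moments for (mu1, mu2), for illustration only
mean = [100.0, 50.0]                    # [E(mu1), E(mu2)]
sd1, sd2, corr = 10.0, 5.0, 0.4
cov12 = corr * sd1 * sd2                # Cov(mu1, mu2) = Corr * SD1 * SD2
cov = [[sd1**2, cov12],
       [cov12, sd2**2]]

rng = np.random.default_rng(42)
sims = rng.multivariate_normal(mean, cov, size=100_000)

# Simulated moments should be close to the inputs
print(np.allclose(sims.mean(axis=0), mean, atol=0.2))   # True
print(abs(np.corrcoef(sims.T)[0, 1] - corr) < 0.01)     # True
```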

30. Why briefly? Because we want to focus on the posterior distribution.↩︎

31. For some history, and an origin of the use of the term “Monte Carlo”, see Wikipedia. Monte Carlo methods consist of a broad class of algorithms for obtaining numerical results based on random numbers, even in problems that don’t explicitly involve probability (e.g., Monte Carlo integration).↩︎

32. This island hopping example is inspired by Kruschke, Doing Bayesian Data Analysis.↩︎

33. The algorithm is named after Nicholas Metropolis, the physicist who led the research group (Arianna Rosenbluth, Marshall Rosenbluth, Augusta Teller, and Edward Teller) that first proposed the method in the early 1950s. It is disputed whether Metropolis himself had anything to do with the actual invention of the algorithm.↩︎

34. I’ve seen this description in many references, but I don’t know who first used this terminology.↩︎

35. The coda package in R contains many diagnostic tools for MCMC methods, including the function effectiveSize.↩︎

36. Why the Rockets? At the time I found the data, James Harden had by far the most FT attempts in the NBA, and I wanted to see what effect he would have on the analysis. He’s still near the league leaders in attempts, but not at the top. I just haven’t updated the data.↩︎

37. With $$n$$ attempts and $$y$$ successes, the likelihood is proportional to $$\theta^y(1-\theta)^{n-y}$$. This is true regardless of whether the number of attempts, the number of successes, or neither is fixed in advance. Therefore, it’s not necessary that the number of attempts be fixed in advance; technically we don’t have to assume that $$(y|\theta)\sim$$ Binomial$$(n, \theta)$$, but rather just that the likelihood is proportional to $$\theta^y(1-\theta)^{n-y}$$.↩︎
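
One way to see this is to compare two sampling schemes numerically. In this Python sketch (arbitrary $$n$$ and $$y$$), the Binomial likelihood (fixed number of attempts) and the Negative Binomial likelihood (fixed number of successes) differ only by a constant, so after normalizing over a grid of $$\theta$$ values they have identical shapes:

```python
import numpy as np
from scipy import stats

n, y = 20, 14  # arbitrary attempts and successes
theta = np.linspace(0.01, 0.99, 99)

# Likelihood if the number of attempts n was fixed in advance (Binomial)
lik_binom = stats.binom.pmf(y, n, theta)

# Likelihood if instead the number of successes y was fixed, counting the
# n - y failures before the y-th success (Negative Binomial)
lik_nbinom = stats.nbinom.pmf(n - y, y, theta)

# Both are proportional to theta^y (1 - theta)^(n - y), so the normalized
# shapes -- and hence the posteriors -- coincide
print(np.allclose(lik_binom / lik_binom.sum(), lik_nbinom / lik_nbinom.sum()))  # True
```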

38. For a Beta prior, we often interpret $$\alpha$$ as “prior successes” and $$\beta$$ as “prior failures”, so $$\kappa=\alpha+\beta$$ is the “prior sample size”. But $$\alpha$$ and $$\beta$$, and hence $$\kappa$$, do not need to be integers; the only requirement is that $$\alpha>0$$ and $$\beta>0$$, and hence $$\kappa>0$$. This is one reason we model $$\kappa$$ with a Gamma prior.↩︎

39. Kruschke recommends a prior like this, and I’ve tried to justify the rationale, but I don’t quite buy it myself. Just think of it as a noninformative prior.↩︎

40. “Model parameters” is not the best terminology since the model is everything: likelihood, prior, hyperprior. But we’re using “model parameters” to distinguish the “primary” parameters that show up in the likelihood from the “secondary” hyperparameters that only show up in the prior.↩︎

41. But BMI has its own problems; see http://fivethirtyeight.com/features/bmi-is-a-terrible-measure-of-health/.↩︎

42. This interval isn’t quite correct because the 7.986 doesn’t account for the sample-to-sample variability (SE) of the mean BMI for these men. But that SE (0.3335) is very small in comparison to the variability in individual percent body fats (7.986). Adjusting the prediction interval to reflect the estimation of the mean yields $$17.67\pm 2 \times 7.993$$, where $$7.993=\sqrt{7.986^2 + 0.3335^2}$$.↩︎
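
The adjusted standard deviation combines the two variance components. A one-line check of the arithmetic, using the values from the text:

```python
import math

sd_individual = 7.986  # SD of individual percent body fats (from the text)
se_mean = 0.3335       # SE of the estimated mean (from the text)

# Variances of independent sources of uncertainty add
sd_pred = math.sqrt(sd_individual**2 + se_mean**2)
print(round(sd_pred, 3))  # 7.993
```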

### References

Dogucu, Mine, Alicia Johnson, and Miles Ott. 2022. Bayes Rules! An Introduction to Applied Bayesian Modeling. 1st ed. Boca Raton, Florida: Chapman and Hall/CRC. https://www.bayesrulesbook.com/.
Kruschke, John. 2015. Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan. 2nd ed. Academic Press. https://sites.google.com/site/doingbayesiandataanalysis/.
McElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. 2nd ed. CRC Press. http://xcelab.net/rm/statistical-rethinking/.