12  Beta-Binomial Model

Example 12.1 Recall Example 10.4 which concerned \(\theta\), the population proportion of students in Cal Poly statistics classes who prefer to consider data as a singular noun (as in “the data is”).

  1. Why might we not want to assume a Normal prior distribution for \(\theta\)?




  2. In Example 10.4 we assumed a prior distribution for \(\theta\) which is proportional to \(\theta^2,\; 0<\theta<1\). In that example, we used a grid approximation for the prior distribution, but now we’ll treat the distribution as continuous. Sketch this prior distribution.




  3. Now we’ll consider a few more prior distributions. Sketch each of the following priors. How do they compare?

    1. proportional to \(\theta^2,\; 0<\theta<1\). (from previous)
    2. proportional to \(\theta^5,\; 0<\theta<1\).
    3. proportional to \((1-\theta)^2,\; 0<\theta<1\).
    4. proportional to \(\theta^2(1-\theta)^2,\; 0<\theta<1\).
    5. proportional to \(\theta^5(1-\theta)^2,\; 0<\theta<1\).




Example 12.2 Continuing Example 12.1.

  1. Each of the distributions in the previous example was a Beta distribution. For each distribution, identify the shape parameters and the prior mean and standard deviation.
Prior proportional to Prior \(\alpha\) Prior \(\beta\) Prior Mean Prior SD
\(\theta^2\)
\(\theta^5\)
\((1-\theta)^2\)
\(\theta^2(1-\theta)^2\)
\(\theta^5(1-\theta)^2\)
  1. Assume a Beta(3, 1) prior distribution for \(\theta\). Now suppose that 31 students in a sample of 35 Cal Poly statistics students prefer data as singular. Write the function that defines the shape of the likelihood as a function of \(\theta, 0<\theta<1\).




  2. Write the function that defines the shape of the posterior density as a function of \(\theta, 0<\theta<1.\) Is this a Beta distribution? If so, specify the values of \(\alpha\) and \(\beta\).




  3. Starting with each of the prior distributions from the first part, find the posterior distribution of \(\theta\) based on this sample, and identify it as a Beta distribution by specifying the shape parameters \(\alpha\) and \(\beta\).

Prior Posterior \(\alpha\) Posterior \(\beta\) Posterior Mean Posterior SD
Beta(3, 1)
Beta(6, 1)
Beta(1, 3)
Beta(3, 3)
Beta(6, 3)
  1. For each of the posterior distributions in the previous part, compute the posterior mean and standard deviation. How does each posterior distribution compare to its respective prior distribution?




Prior Data Posterior
Successes \(\alpha\) \(y\) \(\alpha + y\)
Failures \(\beta\) \(n-y\) \(\beta + n - y\)
Total \(\alpha+\beta\) \(n\) \(\alpha + \beta + n\)

Example 12.3 In Example 10.4 we essentially assumed a Beta(3, 1) prior but we used grid approximation to compute the posterior. Now we’ll compare our results from that exercise to those using the continuous Beta(3, 1) prior.

Assume that \(\theta\) has a Beta(3, 1) prior distribution and that 31 students in a sample of 35 Cal Poly statistics students prefer data as singular.

  1. Plot the prior distribution, (scaled) likelihood, and posterior distribution.




  2. Use software to find 50%, 80%, and 98% central posterior credible intervals.




  3. Compare the results to those using the grid approximation.




  4. Express the posterior mean as a weighted average of the prior mean and sample proportion. Describe what the weights are, and explain why they make sense.




Example 12.4 Continuing the previous examples, we’ll now consider predictive distributions. Assume that \(\theta\) has a Beta(3, 1) prior distribution.

  1. Before observing any data, suppose we plan to randomly select a sample of 35 Cal Poly statistics students. Let \(y\) represent the number of students in the selected sample who prefer data as singular. Explain how we could use simulation to approximate the prior predictive distribution of \(y\). Then conduct the simulation and approximate the prior predictive distribution. Is the prior predictive distribution truly discrete or continuous? How does the prior predictive distribution compare to the one from Example 10.4?




  2. Now suppose we observe 31 students in a sample of 35 Cal Poly statistics students prefer data as singular, so that the posterior distribution of \(\theta\) is the Beta(34, 5) distribution. Suppose we plan to randomly select another sample of 35 Cal Poly statistics students. Let \(\tilde{y}\) represent the number of students in the selected sample who prefer data as singular. Explain how we could use simulation to approximate the posterior predictive distribution of \(\tilde{y}\). Then conduct the simulation and approximate the posterior predictive distribution. Is the posterior predictive distribution truly discrete or continuous? How does the posterior predictive distribution compare to the one from Example 10.5?




  3. Use the simulation results to approximate a 95% posterior prediction interval for \(\tilde{y}\). Write a clearly worded sentence interpreting this interval in context.




12.1 Notes

12.1.1 Prior Beta distributions

Distribution \(\alpha\) \(\beta\) Proportional to Mean SD
a Beta(3, 1) 3 1 \(\theta^2, 0<\theta < 1\) 0.750 0.194
b Beta(6, 1) 6 1 \(\theta^5, 0<\theta < 1\) 0.857 0.124
c Beta(1, 3) 1 3 \((1-\theta)^2, 0<\theta < 1\) 0.250 0.194
d Beta(3, 3) 3 3 \(\theta^2(1-\theta)^2, 0<\theta < 1\) 0.500 0.189
e Beta(6, 3) 6 3 \(\theta^5(1-\theta)^2, 0<\theta < 1\) 0.667 0.149

12.1.2 Posterior Beta distributions

Prior Distribution Posterior proportional to Posterior Distribution Posterior Mean Posterior SD
a Beta(3, 1) \(\theta^{2 + 31}(1-\theta)^{0 + 4}, 0 < \theta < 1\) Beta(34, 5) 0.872 0.053
b Beta(6, 1) \(\theta^{5 + 31}(1-\theta)^{0 + 4}, 0 < \theta < 1\) Beta(37, 5) 0.881 0.049
c Beta(1, 3) \(\theta^{0 + 31}(1-\theta)^{2 + 4}, 0 < \theta < 1\) Beta(32, 7) 0.821 0.061
d Beta(3, 3) \(\theta^{2 + 31}(1-\theta)^{2 + 4}, 0 < \theta < 1\) Beta(34, 7) 0.829 0.058
e Beta(6, 3) \(\theta^{5 + 31}(1-\theta)^{2 + 4}, 0 < \theta < 1\) Beta(37, 7) 0.841 0.055

12.1.3 Posterior distribution

The Beta(3, 1) prior is proportional to \[ \pi(\theta) \propto \theta^{3 - 1}(1-\theta)^{1-1}, \qquad 0 < \theta < 1. \]

Given \(\theta\), the number of students in the sample who prefer data as singular, \(y\), follows a Binomial(35, \(\theta\)) distribution. The likelihood is the probability of observing \(y=31\) viewed as a function of \(\theta\). \[\begin{align*} f(31|\theta) & = \binom{35}{31}\theta^{31}(1-\theta)^4, \qquad 0 < \theta <1\\ & \propto \theta^{31}(1-\theta)^4, \qquad 0 < \theta <1 \end{align*}\] The constant \(\binom{35}{31}\) does not affect the shape of the likelihood as a function of \(\theta\).

The posterior density, as a function of \(\theta\), is proportional to \[\begin{align*} \pi(\theta|y = 31) & \propto \left(\theta^2\right)\left(\theta^{31}(1-\theta)^4\right), \qquad 0 <\theta < 1\\ & \propto \theta^{33}(1-\theta)^4, \qquad 0 <\theta < 1\\ & \propto \theta^{34 - 1}(1-\theta)^{5 - 1}, \qquad 0 <\theta < 1\\ \end{align*}\] Therefore, the posterior distribution of \(\theta\) is the Beta(3 + 31, 1 + 35 - 31), that is, the Beta(34, 5) distribution.

We can use stat_fun to plot a function over a range. This works fine for the dbeta prior and posterior, but unfortunately the likeihood, determined by dbinom is not on a density scale.

# Beta prior
alpha_prior = 3
beta_prior = 1

# data
n = 35
y = 31

# Beta posterior
alpha_post = alpha_prior + y
beta_post = beta_prior + n - y

# Plot
ggplot(data.frame(x = c(0, 1)),
       aes(x = x)) +
  # prior
  stat_function(fun = dbeta,
                args = list(shape1 = alpha_prior,
                            shape2 = beta_prior),
                col = bayes_col["prior"],
                lty = bayes_lty["prior"],
                linewidth = 1) +
  # likelihood
  stat_function(fun = dbinom,
                args = list(size = n,
                            x = y),
                col = bayes_col["likelihood"],
                lty = bayes_lty["likelihood"],
                linewidth = 1) +
  # posterior
  stat_function(fun = dbeta,
                args = list(shape1 = alpha_post,
                            shape2 = beta_post),
                col = bayes_col["posterior"],
                lty = bayes_lty["posterior"],
                linewidth = 1) +
  labs(x = "theta",
       y = "") +
  theme_bw()

The following code creates a scaled version of the likelihood function for plotting (and moves a few things around to create a legend).

# data
n = 35
y = 31

# scaled likelihood, depends on data: n, y
likelihood_scaled <- function(theta) {
  likelihood <- function(theta) {
    dbinom(x = y, size = n, prob = theta)
  }
  scaling_constant <- integrate(likelihood, lower = 0, upper = 1)[[1]]
  likelihood(theta) / scaling_constant
}

# Plot
ggplot(data.frame(x = c(0, 1)),
       aes(x = x)) +
  # prior
  stat_function(fun = dbeta,
                args = list(shape1 = alpha_prior,
                            shape2 = beta_prior),
                lty = bayes_lty["prior"],
                linewidth = 1,
                aes(color = "prior", linetype = "prior")) +
  # (scaled) likelihood
  stat_function(fun = likelihood_scaled,
                lty = bayes_lty["likelihood"],
                linewidth = 1,
                aes(color = "likelihood", linetype = "likeihood")) +
  # posterior
  stat_function(fun = dbeta,
                args = list(shape1 = alpha_post,
                            shape2 = beta_post),
                lty = bayes_lty["posterior"],
                linewidth = 1,
                aes(color = "posterior", linetype = "posterior")) +
  # Define color and add a legend
  scale_color_manual(name = "",
                     breaks = c("prior", "likelihood", "posterior"),
                     values = bayes_col[c("prior", "likelihood", "posterior")]) +
  scale_linetype_manual(name = "",
                        breaks = c("prior", "likelihood", "posterior"),
                        values = bayes_lty[c("prior", "likelihood", "posterior")]) +
  labs(x = "theta",
       y = "") +
  theme_bw()
Warning: No shared levels found between `names(values)` of the manual scale and the
data's linetype values.

# 50% posterior credible interval
qbeta(c(0.25, 0.75), alpha_post, beta_post)
[1] 0.8398218 0.9105693
# 80% posterior credible interval
qbeta(c(0.10, 0.90), alpha_post, beta_post)
[1] 0.8005591 0.9345918
# 98% posterior credible interval
qbeta(c(0.01, 0.99), alpha_post, beta_post)
[1] 0.7239139 0.9650609

12.1.4 Prior predictive distribution

  1. Simulate \(\theta\) from the Beta(3, 1) prior distribution using rbeta.
  2. Given \(\theta\), simulate \(y\) from the Binomial(35, \(\theta\)) distribution.
  3. Simulate many \((\theta, y)\) pairs and summarize the simulated \(y\) values to approximate the prior predictive distribution.
n = 35

n_rep = 10000

# simulate theta from Beta prior distribution
theta_sim = rbeta(n_rep, alpha_prior, beta_prior)

# simulate y from Binomial(n, theta)
y_sim = rbinom(n_rep, n, theta_sim)

sim = data.frame(theta_sim, y_sim)

# display a few simulated pairs
sim |> head(10) |> kbl() |> kable_styling()
theta_sim y_sim
0.6280839 25
0.8274895 29
0.4882894 15
0.8900228 31
0.7641595 31
0.8126329 31
0.9561866 34
0.7012563 25
0.8672446 31
0.3422079 13
ggplot(data.frame(y_sim),
       aes(x = y_sim)) +
  geom_bar(aes(y = after_stat(prop)),
           col = bayes_col["posterior_predict"],
           fill = bayes_col["posterior_predict"],
           width = 0.1) +
  labs(x = "Number of successes",
       y = "Simulated relative frequency") +
  theme_bw()

quantile(y_sim, 0.05)
5% 
12 

12.1.5 Posterior predictive distribution

  1. Simulate \(\theta\) from the Beta(34, 5) posterior distribution using rbeta.
  2. Given \(\theta\), simulate \(y\) from the Binomial(35, \(\theta\)) distribution.
  3. Simulate many \((\theta, y)\) pairs and summarize the simulated \(y\) values to approximate the posterior predictive distribution.
n = 35

n_rep = 10000

# simulate theta from Beta posterior distribution
theta_sim = rbeta(n_rep, alpha_post, beta_post)

# simulate y from Binomial(n, theta)
y_sim = rbinom(n_rep, n, theta_sim)

sim = data.frame(theta_sim, y_sim)

# display a few simulated pairs
sim |> head(10) |> kbl() |> kable_styling()
theta_sim y_sim
0.8813914 29
0.9181022 33
0.8875739 30
0.8756327 30
0.9126112 30
0.7890674 27
0.6912154 24
0.8557116 30
0.8235766 30
0.8505959 29
ggplot(data.frame(y_sim),
       aes(x = y_sim)) +
  geom_bar(aes(y = after_stat(prop)),
           col = bayes_col["posterior_predict"],
           fill = bayes_col["posterior_predict"],
           width = 0.1) +
  labs(x = "Number of successes",
       y = "Simulated relative frequency") +
  theme_bw()

quantile(y_sim, c(0.025, 0.975))
 2.5% 97.5% 
   25    35 

  1. \(\theta\) is used to denote both: (1) the actual parameter (i.e., the random variable) \(\theta\) itself, and (2) possible values of \(\theta\).↩︎

  2. When considering the prior predictive distribution \(y\) also denotes potential values of the data. For the posterior predictive distribution \(\tilde{y}\) denotes potential new values of the data.↩︎

  3. The expression defines the shape of the Beta density. All that’s missing is the scaling constant which ensures that the total area under the density is 1. The actual Beta density formula, including the normalizing constant, is \[ p(u) =\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\; u^{\alpha-1}(1-u)^{\beta-1}, \quad 0<u<1, \] where \(\Gamma(\alpha) = \int_0^\infty e^{-v}v^{\alpha-1} dv\) is the Gamma function. For a positive integer \(k\), \(\Gamma(k) = (k-1)!\). Also, \(\Gamma(1/2)=\sqrt{\pi}\).↩︎

  4. When the prior and posterior distribution belong to the same family, that family is called a conjugate prior distribution for the likelihood. So, the Beta distributions form a conjugate prior family for Binomial distributions.↩︎