3.2 Conjugate prior to exponential family
Theorem 4.2.1
The prior distribution \pi(\boldsymbol{\theta})\propto C(\boldsymbol{\theta})^{b_0}\exp\left\{\eta(\boldsymbol{\theta})^{\top}\boldsymbol{a}_0\right\} is conjugate to the exponential family (equation (3.4)).
Proof
\begin{align} \pi(\boldsymbol{\theta}\mid \boldsymbol{y})& \propto C(\boldsymbol{\theta})^{b_0}\exp\left\{\eta(\boldsymbol{\theta})^{\top}\boldsymbol{a}_0\right\} \times h(\boldsymbol{y}) C(\boldsymbol{\theta})^N\exp\left\{\eta(\boldsymbol{\theta})^{\top}\boldsymbol{T}(\boldsymbol{y})\right\}\nonumber\\ & \propto C(\boldsymbol{\theta})^{N+b_0} \exp\left\{\eta(\boldsymbol{\theta})^{\top}(\boldsymbol{T}(\boldsymbol{y})+\boldsymbol{a}_0)\right\}.\nonumber \end{align}
Observe that the posterior has the same functional form as the prior, \pi(\boldsymbol{\theta}\mid \boldsymbol{y})\propto C(\boldsymbol{\theta})^{\beta_n} \exp\left\{\eta(\boldsymbol{\theta})^{\top}\boldsymbol{\alpha}_n\right\}, where \beta_n=N+b_0 and \boldsymbol{\alpha}_n=\boldsymbol{T}(\boldsymbol{y})+\boldsymbol{a}_0, which establishes conjugacy.
Remarks
We observe, by comparing the prior and the likelihood, that b_0 plays the role of a hypothetical sample size, and \boldsymbol{a}_0 plays the role of hypothetical sufficient statistics. This perspective aids the elicitation process, that is, integrating non-sample information into the prior distribution.
We established this result in the standard form of the exponential family; it can also be established in the canonical form. Observe that, given \boldsymbol{\eta} = \boldsymbol{\eta}(\boldsymbol{\theta}), another way to derive a prior for \boldsymbol{\eta} is to use the change-of-variable theorem, provided that \boldsymbol{\eta}(\cdot) is bijective.
In the case where there is a regular conjugate prior, Diaconis and Ylvisaker (1979) show that the posterior expectation of the sufficient statistics is a weighted average between the prior expectation and the maximum likelihood estimate.
3.2.1 Examples: Theorem 4.2.1
- Likelihood functions from discrete distributions
The Poisson-gamma model
Given a random sample \boldsymbol{Y}=[Y_1 \ Y_2 \ \dots \ Y_N]^{\top} from a Poisson distribution, a conjugate prior density for \lambda has the form \begin{align} \pi(\lambda)&\propto \left(\exp(-\lambda)\right)^{b_0} \exp\left\{a_0\log(\lambda)\right\}\nonumber\\ & = \exp(-\lambda b_0) \lambda^{a_0}\nonumber\\ & = \exp(-\lambda \beta_0) \lambda^{\alpha_0-1}.\nonumber \end{align} This is the kernel of a gamma density in the rate parametrization, G(\alpha_0, \beta_0), where \alpha_0 = a_0 + 1 and \beta_0 = b_0.18 Thus, a conjugate prior distribution for the Poisson likelihood is a gamma distribution.
Since \sum_{i=1}^N Y_i is a sufficient statistic for the Poisson distribution, we can interpret a_0 as the number of occurrences in b_0 experiments.
Observe that \begin{align} \pi(\lambda\mid \boldsymbol{y})&\propto \exp(-\lambda \beta_0) \lambda^{\alpha_0-1} \times \exp(-N\lambda)\lambda^{\sum_{i=1}^Ny_i}\nonumber\\ &= \exp(-\lambda(N+\beta_0)) \lambda^{\sum_{i=1}^Ny_i+\alpha_0-1}.\nonumber \end{align} As expected, this is the kernel of a gamma distribution, which means \lambda\mid \boldsymbol{y}\sim G(\alpha_n,\beta_n), \alpha_n=\sum_{i=1}^Ny_i+\alpha_0 and \beta_n=N+\beta_0.
Observe that \alpha_0/\beta_0 is the prior mean, and \alpha_0/\beta_0^2 is the prior variance. Then, \alpha_0\rightarrow 0 and \beta_0\rightarrow 0 imply a non-informative prior such that the posterior mean converges to the maximum likelihood estimate \bar{y}=\frac{\sum_{i=1}^N y_i}{N}, \begin{align} \mathbb{E}\left[\lambda\mid \boldsymbol{y}\right]&=\frac{\alpha_n}{\beta_n}\nonumber\\ &=\frac{\sum_{i=1}^Ny_i+\alpha_0}{N+\beta_0}\nonumber\\ &=\frac{N}{N+\beta_0}\bar{y}+\frac{\beta_0}{N+\beta_0}\frac{\alpha_0}{\beta_0}.\nonumber \end{align} The posterior mean is a weighted average of the sample and prior information. This is a general result for regular conjugate priors (Diaconis and Ylvisaker 1979). Note also that \lim_{N \to \infty}\mathbb{E}[\lambda \mid \boldsymbol{y}] = \bar{y}.
Additionally, \alpha_0 \to 0 and \beta_0 \to 0 corresponds to \pi(\lambda) \propto \frac{1}{\lambda}, which is an improper prior. Improper priors may have undesirable consequences for Bayes factors (hypothesis testing); see below for a discussion of this in the linear regression framework. In this example, we can obtain analytical solutions for the marginal likelihood and the predictive distribution (see the health insurance example and Exercise 3 in Chapter 1).
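To make the updating rule concrete, the following minimal R sketch computes the posterior parameters and checks the weighted-average form of the posterior mean. The simulated data and the hyperparameters (a Poisson rate of 3, \alpha_0 = \beta_0 = 1) are illustrative choices of ours, not values used elsewhere in the text.
# Poisson-gamma updating: minimal sketch with simulated data
set.seed(10)
N <- 50
y <- rpois(N, lambda = 3)
alpha0 <- 1; beta0 <- 1
alphan <- alpha0 + sum(y)   # posterior shape
betan <- beta0 + N          # posterior rate
alphan/betan                # posterior mean
N/(N + beta0)*mean(y) + beta0/(N + beta0)*(alpha0/beta0)  # same value: weighted average
quantile(rgamma(10000, shape = alphan, rate = betan), c(0.025, 0.975))  # 95% credible interval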
The Bernoulli-beta model
Given a random sample \boldsymbol{Y}=[Y_1 \ Y_2 \ \dots \ Y_N]^{\top} from a Bernoulli distribution, a conjugate prior density for \theta has the form \begin{align} \pi(\theta)&\propto (1-\theta)^{b_0} \exp\left\{a_0\log\left(\frac{\theta}{1-\theta}\right)\right\}\nonumber\\ & = (1-\theta)^{b_0-a_0}\theta^{a_0}\nonumber\\ & = \theta^{\alpha_0-1}(1-\theta)^{\beta_0-1}.\nonumber \end{align} This is the kernel of a beta density, B(\alpha_0, \beta_0), where \alpha_0 = a_0 + 1 and \beta_0 = b_0 - a_0 + 1. A conjugate prior distribution for the Bernoulli likelihood is a beta distribution. Given that b_0 is the hypothetical sample size and a_0 is the hypothetical sufficient statistic (the number of successes), b_0 - a_0 represents the number of failures. This implies that \alpha_0 is the number of prior successes plus one, and \beta_0 is the number of prior failures plus one.
Since the mode of a beta-distributed random variable is given by \frac{\alpha_0 - 1}{\alpha_0 + \beta_0 - 2} = \frac{a_0}{b_0}, we can interpret this as the prior probability of success. Setting \alpha_0 = 1 and \beta_0 = 1, which corresponds to a uniform distribution on the interval [0, 1], represents a setting with 0 successes (and 0 failures) in 0 experiments.
Observe that \begin{align} \pi(\theta\mid \boldsymbol{y})&\propto \theta^{\alpha_0-1}(1-\theta)^{\beta_0-1} \times \theta^{\sum_{i=1}^N y_i}(1-\theta)^{N-\sum_{i=1}^Ny_i}\nonumber\\ &= \theta^{\alpha_0+\sum_{i=1}^N y_i-1}(1-\theta)^{\beta_0+N-\sum_{i=1}^Ny_i-1}.\nonumber \end{align} The posterior distribution is beta, \theta\mid \boldsymbol{y}\sim B(\alpha_n,\beta_n), \alpha_n=\alpha_0+\sum_{i=1}^N y_i and \beta_n=\beta_0+N-\sum_{i=1}^Ny_i, where the posterior mean \mathbb{E}[\theta\mid \boldsymbol{y}]=\frac{\alpha_n}{\alpha_n+\beta_n}=\frac{\alpha_0+N\bar{y}}{\alpha_0+\beta_0+N}=\frac{\alpha_0+\beta_0}{\alpha_0+\beta_0+N}\frac{\alpha_0}{\alpha_0+\beta_0}+\frac{N}{\alpha_0+\beta_0+N}\bar{y}. The posterior mean is a weighted average between the prior mean and the maximum likelihood estimate.
The marginal likelihood in this setting is \begin{align} p(\boldsymbol{y})=&\int_{0}^1 \frac{\theta^{\alpha_0-1}(1-\theta)^{\beta_0-1}}{B(\alpha_0,\beta_0)}\times \theta^{\sum_{i=1}^N y_i}(1-\theta)^{N-\sum_{i=1}^N y_i}d\theta\nonumber\\ =& \frac{B(\alpha_n,\beta_n)}{B(\alpha_0,\beta_0)},\nonumber \end{align} where B(\cdot ,\cdot) is the beta function.
In addition, the predictive density is \begin{align} p(y_0\mid \boldsymbol{y})&=\int_0^1 \theta^{y_0}(1-\theta)^{1-y_0}\times \frac{\theta^{\alpha_n-1}(1-\theta)^{\beta_n-1}}{B(\alpha_n,\beta_n)}d\theta\nonumber\\ &=\frac{B(\alpha_n+y_0,\beta_n+1-y_0)}{B(\alpha_n,\beta_n)}\nonumber\\ &=\frac{\Gamma(\alpha_n+\beta_n)\Gamma(\alpha_n+y_0)\Gamma(\beta_n+1-y_0)}{\Gamma(\alpha_n+\beta_n+1)\Gamma(\alpha_n)\Gamma(\beta_n)}\nonumber\\ &=\begin{cases} \frac{\alpha_n}{\alpha_n+\beta_n}, & y_0=1\\ \frac{\beta_n}{\alpha_n+\beta_n}, & y_0=0.\\ \end{cases}\nonumber \end{align}
This is a Bernoulli distribution with probability of success equal to \frac{\alpha_n}{\alpha_n+\beta_n}.
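As a quick check of these expressions, the following R sketch computes the posterior parameters, the log marginal likelihood \log B(\alpha_n,\beta_n)-\log B(\alpha_0,\beta_0), and the predictive probability of a success; the simulated data and hyperparameters are illustrative choices.
# Bernoulli-beta updating: minimal sketch with simulated data
set.seed(10)
N <- 40
y <- rbinom(N, size = 1, prob = 0.3)
alpha0 <- 1; beta0 <- 1     # uniform prior on [0, 1]
alphan <- alpha0 + sum(y)
betan <- beta0 + N - sum(y)
alphan/(alphan + betan)     # posterior mean = predictive probability that y0 = 1
lbeta(alphan, betan) - lbeta(alpha0, beta0)  # log marginal likelihood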
The multinomial-Dirichlet model
Given a random sample \boldsymbol{Y}=[Y_1 \ Y_2 \ \dots \ Y_N]^{\top} from a multinomial distribution, a conjugate prior density for \boldsymbol{\theta}=\left[\theta_1 \ \theta_2 \ \dots \ \theta_m\right] has the form \begin{align} \pi(\boldsymbol{\theta})&\propto \theta_m^{b_0} \exp\left\{\boldsymbol{\eta}(\boldsymbol{\theta})^{\top}\boldsymbol{a}_0\right\}\nonumber\\ & = \prod_{l=1}^{m-1}\theta_l^{a_{0l}}\theta_m^{b_0-\sum_{l=1}^{m-1}a_{0l}}\nonumber\\ & = \prod_{l=1}^{m}\theta_l^{\alpha_{0l}-1},\nonumber \end{align}
where \boldsymbol{\eta}(\boldsymbol{\theta})=\left[\log\left(\frac{\theta_1}{\theta_m}\right) \ \dots \ \log\left(\frac{\theta_{m-1}}{\theta_m}\right)\right], \boldsymbol{a}_0=\left[a_{01} \ \dots \ a_{0m-1}\right]^{\top}, \boldsymbol{\alpha}_0=\left[\alpha_{01} \ \alpha_{02} \ \dots \ \alpha_{0m}\right], \alpha_{0l}=a_{0l}+1, l=1,2,\dots,m-1 and \alpha_{0m}=b_0-\sum_{l=1}^{m-1} a_{0l}+1.
This is the kernel of a Dirichlet distribution, that is, the prior distribution is D(\boldsymbol{\alpha}_0).
Observe that a_{0l} is the hypothetical number of times outcome l is observed over the hypothetical b_0 trials. Setting \alpha_{0l} = 1, which corresponds to a uniform distribution over the open standard simplex, implicitly sets a_{0l} = 0, meaning that there are 0 occurrences of category l in b_0 = 0 experiments.
The posterior distribution of the multinomial-Dirichlet model is given by \begin{align} \pi(\boldsymbol{\theta}\mid \boldsymbol{y})&\propto \prod_{l=1}^m \theta_l^{\alpha_{0l}-1}\times\prod_{l=1}^m \theta_l^{\sum_{i=1}^{N} y_{il}}\nonumber\\ &=\prod_{l=1}^m \theta_l^{\alpha_{0l}+\sum_{i=1}^{N} y_{il}-1}\nonumber. \end{align} This is the kernel of a Dirichlet distribution D(\boldsymbol{\alpha}_n), \boldsymbol{\alpha}_n=\left[\alpha_{n1} \ \alpha_{n2} \ \dots \ \alpha_{nm}\right], \alpha_{nl}=\alpha_{0l}+\sum_{i=1}^{N}y_{il}, l=1,2,\dots,m. Observe that \begin{align} \mathbb{E}[\theta_{j}\mid \boldsymbol{y}]&=\frac{\alpha_{nj}}{\sum_{l=1}^m \left[\alpha_{0l}+\sum_{i=1}^N y_{il}\right]}\nonumber\\ &=\frac{\sum_{l=1}^m \alpha_{0l}}{\sum_{l=1}^m \left[\alpha_{0l}+\sum_{i=1}^N y_{il}\right]}\frac{\alpha_{0j}}{\sum_{l=1}^m \alpha_{0l}}\nonumber\\ &+\frac{\sum_{l=1}^m\sum_{i=1}^N y_{il}}{\sum_{l=1}^m \left[\alpha_{0l}+\sum_{i=1}^N y_{il}\right]}\frac{\sum_{i=1}^N y_{ij}}{\sum_{l=1}^m\sum_{i=1}^N y_{il}}.\nonumber \end{align} We have again that the posterior mean is a weighted average between the prior mean and the maximum likelihood estimate.
The marginal likelihood is \begin{align} p(\boldsymbol{y})&=\int_{\boldsymbol{\Theta}}\frac{\prod_{l=1}^m \theta_l^{\alpha_{0l}-1}}{B(\boldsymbol{\alpha}_0)}\times \prod_{i=1}^N\frac{n!}{\prod_{l=1}^m y_{il}!}\prod_{l=1}^m \theta_{l}^{y_{il}}d\boldsymbol{\theta}\nonumber\\ &=\frac{(n!)^N}{B(\boldsymbol{\alpha}_0)\prod_{i=1}^N\prod_{l=1}^m y_{il}!}\int_{\boldsymbol{\Theta}} \prod_{l=1}^m \theta_l^{\alpha_{0l}+\sum_{i=1}^N y_{il}-1} d\boldsymbol{\theta}\nonumber\\ &=\frac{(n!)^N}{B(\boldsymbol{\alpha}_0)\prod_{i=1}^N\prod_{l=1}^m y_{il}!}B(\boldsymbol{\alpha}_n)\nonumber\\ &=\frac{(n!)^N \Gamma\left(\sum_{l=1}^m \alpha_{0l}\right)}{\Gamma\left(\sum_{l=1}^m \alpha_{0l}+N\times n\right)}\prod_{l=1}^m \frac{\Gamma\left( \alpha_{nl}\right)}{\Gamma\left(\alpha_{0l}\right)\prod_{i=1}^N y_{il}!},\nonumber \end{align} where B(\boldsymbol{\alpha})=\frac{\prod_{l=1}^m\Gamma(\alpha_l)}{\Gamma\left(\sum_{l=1}^m \alpha_l\right)}.
Following similar steps we get the predictive density \begin{align} p(y_0\mid \boldsymbol{y})&=\frac{ n! \Gamma\left(\sum_{l=1}^m \alpha_{nl}\right)}{\Gamma\left(\sum_{l=1}^m \alpha_{nl}+ n\right)}\prod_{l=1}^m \frac{\Gamma\left( \alpha_{nl}+y_{0l}\right)}{\Gamma\left(\alpha_{nl}\right) y_{0l}!}.\nonumber \end{align} This is a Dirichlet-multinomial distribution with parameters \boldsymbol{\alpha}_n.
Example: English Premier League, Liverpool vs Manchester City
Let’s consider an example using data from the English Premier League. In particular, we want to calculate the probability that, in the next five matches between Liverpool and Manchester City, Liverpool wins two games and Manchester City wins three. This calculation is based on historical data from the last five matches where Liverpool played at home between January 14th, 2018, and April 10th, 2022. In those matches, Liverpool secured two wins, there were two draws, and Manchester City won one match. 19
We use two strategies to estimate the hyperparameters. First, we estimate the hyperparameters of the Dirichlet distribution using betting odds from bookmakers at 19:05 on October 6th, 2022 (Colombia time). We obtained data from 24 bookmakers (see file DataOddsLIVvsMAN.csv)20, and we transform these odds into probabilities using a simple standardization approach. Then, we apply maximum likelihood estimation to estimate the hyperparameters.
Second, we use empirical Bayes, where we estimate the hyperparameters by optimizing the marginal likelihood.
# Multinomial-Dirichlet example: Liverpool vs Manchester city
Data <- read.csv("https://raw.githubusercontent.com/besmarter/BSTApp/refs/heads/master/DataApp/DataOddsLIVvsMAN.csv", sep = ",", header = TRUE, quote = "")
attach(Data)
library(dplyr)
Probs <- Data %>%
mutate(pns1 = 1/home, pns2 = 1/draw, pns3 = 1/away)%>%
mutate(SumInvOdds = pns1 + pns2 + pns3) %>%
mutate(p1 = pns1/SumInvOdds, p2 = pns2/SumInvOdds, p3 = pns3/SumInvOdds) %>%
select(p1, p2, p3)
# We get probabilities using simple standardization. There are more technical approaches to do this. See for instance Shin (1993) and Strumbelj (2014).
DirMLE <- sirt::dirichlet.mle(Probs)
# Use maximum likelihood to estimate parameters of the
# Dirichlet distribution
alpha0odds <- DirMLE$alpha
alpha0odds
## p1 p2 p3
## 1599.122 1342.703 2483.129
y <- c(2, 2, 1)
# Historical record of the last five matches:
# Liverpool wins (2), draws (2) and Manchester
# city wins (1)
# Negative log marginal likelihood (up to the multinomial coefficient, which cancels in the Bayes factor)
MarLik <- function(a0){
n <- sum(y)
Res1 <- sum(sapply(1:length(y),
function(l){lgamma(a0[l]+y[l])-lgamma(a0[l])}))
Res <- lgamma(sum(a0))-lgamma(sum(a0)+n)+Res1
return(-Res)
}
EmpBay <- optim(alpha0odds, MarLik, method = "BFGS")
alpha0EB <- EmpBay$par
alpha0EB
## p1 p2 p3
## 2362.622 2660.153 1279.510
# Bayes factor empirical Bayes vs betting odds.
# This is greater than or equal to 1 by construction
BF <- exp(-MarLik(alpha0EB))/exp(-MarLik(alpha0odds))
BF
## [1] 2.085819
# Posterior distribution based on empirical Bayes
alphan <- alpha0EB + y
# Posterior parameters
S <- 100000
# Simulation draws from the Dirichlet distribution
thetas <- MCMCpack::rdirichlet(S, alphan)
colnames(thetas) <- c("Liverpool","Draw","Manchester")
# Predictive distribution based on simulations
y0 <- c(2, 0, 3)
# Liverpool two wins and Manchester city three wins in next five matches
Pred <- apply(thetas, 1, function(p) {rmultinom(1, size = sum(y0), prob = p)})
ProY0 <- sum(sapply(1:S,function(s){sum(Pred[,s]==y0)==3}))/S
ProY0 # Probability of y0
## [1] 0.01224
# Predictive distribution using analytical expression
PredY0 <- function(y0){
n <- sum(y0)
Res1 <- sum(sapply(seq_along(y0), function(l){lgamma(alphan[l]+y0[l]) - lgamma(alphan[l])-lfactorial(y0[l])}))
Res <- lfactorial(n) + lgamma(sum(alphan)) - lgamma(sum(alphan)+n) + Res1
return(exp(Res))
}
PredY0(y0)
## [1] 0.01177531
We observe that the Bayes factor provides evidence in favor of the hyperparameters estimated via empirical Bayes, as these hyperparameters are specifically chosen to maximize the marginal likelihood.
Using the hyperparameters obtained from empirical Bayes, we calculate that the probability of Liverpool winning two out of the next five games, while Manchester City wins three, is 1.2%. The result obtained from the predictive distribution via simulations is similar to the probability derived using the exact predictive distribution.
- Likelihood functions from continuous distributions
The normal-normal/inverse-gamma model
Given a random sample \boldsymbol{Y}=[Y_1 \ Y_2 \ \dots \ Y_N]^{\top} from a normal distribution, then the conjugate prior density has the form \begin{align} \pi(\mu,\sigma^2)&\propto \exp\left\{b_0\left(-\frac{\mu^2}{2\sigma^2}-\frac{\log \sigma^2}{2}\right)\right\}\exp\left\{a_{01}\frac{\mu}{\sigma^2}-a_{02}\frac{1}{\sigma^2}\right\}\nonumber\\ &=\exp\left\{b_0\left(-\frac{\mu^2}{2\sigma^2}-\frac{\log \sigma^2}{2}\right)\right\}\exp\left\{a_{01}\frac{\mu}{\sigma^2}-a_{02}\frac{1}{\sigma^2}\right\}\nonumber\\ &\times \exp\left\{-\frac{a_{01}^2}{2\sigma^2b_0}\right\}\exp\left\{\frac{a_{01}^2}{2\sigma^2b_0}\right\}\nonumber\\ &=\exp\left\{-\frac{b_0}{2\sigma^2}\left(\mu-\frac{a_{01}}{b_0}\right)^2\right\}\left(\frac{1}{\sigma^2}\right)^{\frac{b_0+1-1}{2}}\nonumber\\ &\times \exp\left\{\frac{1}{\sigma^2}\frac{-2b_0a_{02}+a_{01}^2}{2b_0}\right\}\nonumber\\ &=\underbrace{\left(\frac{1}{\sigma^2}\right)^{\frac{1}{2}}\exp\left\{-\frac{b_0}{2\sigma^2}\left(\mu-\frac{a_{01}}{b_0}\right)^2\right\}}_{1}\nonumber\\ &\times\underbrace{\left(\frac{1}{\sigma^2}\right)^{\frac{b_0-1}{2}}\exp\left\{-\frac{1}{\sigma^2}\frac{2b_0a_{02}-a_{01}^2}{2b_0}\right\}}_{2}.\nonumber \end{align} The first part is the kernel of a normal density with mean \mu_0 = \frac{a_{01}}{\beta_0} and variance \frac{\sigma^2}{\beta_0}, where \beta_0 = b_0. That is, \mu \mid \sigma^2 \sim N\left(\mu_0, \frac{\sigma^2}{\beta_0}\right). The second part is the kernel of an inverse gamma density with shape parameter \frac{\alpha_0}{2} = \frac{\beta_0 - 3}{2} and scale parameter \frac{\delta_0}{2} = \frac{2\beta_0 a_{02} - a_{01}^2}{2\beta_0}, so \sigma^2 \sim IG\left(\frac{\alpha_0}{2}, \frac{\delta_0}{2}\right).
Observe that b_0 = \beta_0 represents the hypothetical sample size, and a_{01} is the hypothetical sum of prior observations. Therefore, it makes sense that \frac{a_{01}}{\beta_0} and \frac{\sigma^2}{\beta_0} represent the prior mean and variance, respectively.
Therefore, the posterior distribution is also a normal-inverse gamma distribution, \begin{align} \pi(\mu,\sigma^2\mid \boldsymbol{y})&\propto \left(\frac{1}{\sigma^2}\right)^{1/2}\exp\left\{-\frac{\beta_0}{2\sigma^2}(\mu-\mu_0)^2\right\}\left(\frac{1}{\sigma^2}\right)^{\alpha_0/2+1}\exp\left\{-\frac{\delta_0}{2\sigma^2}\right\}\nonumber\\ &\times(\sigma^2)^{-N/2}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^N (y_i-\mu)^2\right\}\nonumber\\ & = \left(\frac{1}{\sigma^2}\right)^{1/2}\left(\frac{1}{\sigma^2}\right)^{\frac{\alpha_0+N}{2}+1}\exp\left\{-\frac{1}{2\sigma^2}\left(\beta_0(\mu-\mu_0)^2+N(\mu-\bar{y})^2+\sum_{i=1}^N (y_i-\bar{y})^2+\delta_0\right)\right\}\nonumber\\ & = \underbrace{\left(\frac{1}{\sigma^2}\right)^{1/2}\exp\left\{-\frac{\beta_0+N}{2\sigma^2}\left(\mu-\frac{\beta_0\mu_0+N\bar{y}}{\beta_0+N}\right)^2\right\}}_{1}\nonumber\\ & \times \underbrace{\left(\frac{1}{\sigma^2}\right)^{\frac{\alpha_0+N}{2}+1}\exp\left\{-\frac{1}{2\sigma^2}\left(\sum_{i=1}^N (y_i-\bar{y})^2+\delta_0+\frac{\beta_0N}{\beta_0+N}(\bar{y}-\mu_0)^2\right)\right\}}_{2},\nonumber \end{align} where the last step completes the square in \mu, adding and subtracting \frac{(\beta_0\mu_0+N\bar{y})^2}{\beta_0+N}, so that \beta_0(\mu-\mu_0)^2+N(\mu-\bar{y})^2=(\beta_0+N)\left(\mu-\frac{\beta_0\mu_0+N\bar{y}}{\beta_0+N}\right)^2+\frac{\beta_0 N}{\beta_0+N}(\bar{y}-\mu_0)^2.
The first term is the kernel of a normal density, \mu\mid \sigma^2,\boldsymbol{y}\sim N \left(\mu_n, \sigma_n^2\right), where \mu_n=\frac{\beta_0\mu_0+N\bar{y}}{\beta_0+N} and \sigma_n^2=\frac{\sigma^2}{\beta_n}, \beta_n=\beta_0+N. The second term is the kernel of an inverse gamma density, \sigma^2\mid \boldsymbol{y}\sim IG(\alpha_n/2,\delta_n/2) where \alpha_n=\alpha_0+N and \delta_n=\sum_{i=1}^N (y_i-\bar{y})^2+\delta_0+\frac{\beta_0N}{\beta_0+N}(\bar{y}-\mu_0)^2. Observe that the posterior mean is a weighted average between prior and sample information. The weights depend on the sample sizes (\beta_0 and N).
The marginal posterior for \sigma^2 is inverse gamma with shape and scale parameters \alpha_n/2 and \delta_n/2, respectively. The marginal posterior of \mu is \begin{align} \pi(\mu\mid \boldsymbol{y})&\propto \int_{0}^{\infty}\left\{ \left(\frac{1}{\sigma^2}\right)^{\frac{\alpha_n+1}{2}+1}\exp\left\{-\frac{1}{2\sigma^2}(\beta_n(\mu-\mu_n)^2+\delta_n)\right\}\right\}d\sigma^2\nonumber\\ &=\frac{\Gamma\left(\frac{\alpha_n+1}{2}\right)}{\left[\frac{\beta_n(\mu-\mu_n)^2+\delta_n}{2}\right]^{\frac{\alpha_n+1}{2}}}\nonumber\\ &\propto \left[\frac{\beta_n(\mu-\mu_n)^2+\delta_n}{\delta_n}\right]^{-\frac{\alpha_n+1}{2}}\nonumber\\ &= \left[\frac{\alpha_n\beta_n(\mu-\mu_n)^2}{\alpha_n\delta_n}+1\right]^{-\frac{\alpha_n+1}{2}}.\nonumber \end{align} The second line follows because the integrand is the kernel of an inverse gamma density with parameters \frac{\alpha_n + 1}{2} and \frac{1}{2} \left( \beta_n (\mu - \mu_n)^2 + \delta_n \right); the third line multiplies by the constant \left(\delta_n/2\right)^{\frac{\alpha_n+1}{2}}.
This corresponds to the kernel of a Student’s t-distribution: \mu \mid \boldsymbol{y} \sim t\left(\mu_n, \frac{\delta_n}{\beta_n \alpha_n}, \alpha_n\right), where \mathbb{E}[\mu \mid \boldsymbol{y}] = \mu_n and \text{Var}[\mu \mid \boldsymbol{y}] = \frac{\alpha_n}{\alpha_n - 2} \left( \frac{\delta_n}{\beta_n \alpha_n} \right) = \frac{\delta_n}{(\alpha_n - 2) \beta_n}, \quad \alpha_n > 2. Observe that the marginal posterior distribution for \mu has heavier tails than the conditional posterior distribution due to the incorporation of uncertainty regarding \sigma^2.
The marginal likelihood is \begin{align} p(\boldsymbol{y})&=\int_{-\infty}^{\infty}\int_{0}^{\infty}\left\{ (2\pi\sigma^2/\beta_0)^{-1/2}\exp\left\{-\frac{1}{2\sigma^2/\beta_0}(\mu-\mu_0)^2\right\}\frac{(\delta_0/2)^{\alpha_0/2}}{\Gamma(\alpha_0/2)}\left(\frac{1}{\sigma^2}\right)^{\alpha_0/2+1}\right.\nonumber\\ &\times\left.\exp\left\{-\frac{\delta_0}{2\sigma^2}\right\}(2\pi\sigma^2)^{-N/2}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^N(y_i-\mu)^2\right\}\right\}d\sigma^2d\mu\nonumber\\ &=\frac{(\delta_0/2)^{\alpha_0/2}}{\Gamma(\alpha_0/2)}(2\pi)^{-\left(\frac{N+1}{2}\right)}\beta_0^{1/2}\int_{-\infty}^{\infty}\int_{0}^{\infty}\left\{\left(\frac{1}{\sigma^2}\right)^{\frac{\alpha_0+N+1}{2}+1}\right.\nonumber\\ &\times\left.\exp\left\{-\frac{1}{2\sigma^2}(\beta_0(\mu-\mu_0)^2+\sum_{i=1}^N (y_i-\mu)^2+\delta_0)\right\}\right\}d\sigma^2d\mu\nonumber\\ &=\frac{(\delta_0/2)^{\alpha_0/2}}{\Gamma(\alpha_0/2)}(2\pi)^{-\left(\frac{N+1}{2}\right)}\beta_0^{1/2}\Gamma\left(\frac{N+1+\alpha_0}{2}\right)\nonumber\\ &\times \int_{-\infty}^{\infty} \left[\frac{\beta_0(\mu-\mu_0)^2+\sum_{i=1}^N(y_i-\mu)^2+\delta_0}{2}\right]^{-\frac{\alpha_0+N+1}{2}}d\mu\nonumber\\ &=\frac{(\delta_0/2)^{\alpha_0/2}}{\Gamma(\alpha_0/2)}(2\pi)^{-\left(\frac{N+1}{2}\right)}\beta_0^{1/2}\Gamma\left(\frac{N+1+\alpha_0}{2}\right)\nonumber\\ &\times \int_{-\infty}^{\infty} \left[\frac{\beta_n(\mu-\mu_n)^2+\delta_n}{2}\right]^{-\frac{\alpha_n+1}{2}}d\mu\left(\frac{\delta_n/2}{\delta_n/2}\right)^{-\frac{\alpha_n+1}{2}}\nonumber\\ &=\frac{(\delta_0/2)^{\alpha_0/2}}{\Gamma(\alpha_0/2)}(2\pi)^{-\left(\frac{N+1}{2}\right)}\beta_0^{1/2}\Gamma\left(\frac{\alpha_n+1}{2}\right)\left(\frac{\delta_n}{2}\right)^{-\frac{\alpha_n+1}{2}}\frac{\left(\frac{\delta_n\pi}{\beta_n}\right)^{1/2}\Gamma\left(\frac{\alpha_n}{2}\right)}{\Gamma\left(\frac{\alpha_n+1}{2}\right)}\nonumber\\ &=\frac{\Gamma\left(\frac{\alpha_n}{2}\right)}{\Gamma\left(\frac{\alpha_0}{2}\right)}\frac{\delta_0^{\alpha_0/2}}{\delta_n^{\alpha_n/2}}\left(\frac{\beta_0}{\beta_n}\right)^{1/2}\pi^{-N/2},\nonumber \end{align}
where we take into account that \int_{-\infty}^{\infty} \left[\frac{\beta_n(\mu-\mu_n)^2+\delta_n}{2}\right]^{-\frac{\alpha_n+1}{2}}d\mu\left(\frac{\delta_n/2}{\delta_n/2}\right)^{-\frac{\alpha_n+1}{2}}=\int_{-\infty}^{\infty} \left[\frac{\beta_n\alpha_n(\mu-\mu_n)^2}{\delta_n\alpha_n}+1\right]^{-\frac{\alpha_n+1}{2}}d\mu\left(\frac{\delta_n}{2}\right)^{-\frac{\alpha_n+1}{2}}. The term in the integral is the kernel of a Student's t density with \alpha_n degrees of freedom; this means that the integral is equal to \frac{\left(\frac{\delta_n\pi}{\beta_n}\right)^{1/2}\Gamma\left(\frac{\alpha_n}{2}\right)}{\Gamma\left(\frac{\alpha_n+1}{2}\right)}.
The predictive density is \begin{align} \pi(y_0\mid \boldsymbol{y})&\propto\int_{-\infty}^{\infty}\int_0^{\infty}\left\{ \left(\frac{1}{\sigma^2}\right)^{1/2}\exp\left\{-\frac{1}{2\sigma^2}(y_0-\mu)^2\right\}\left(\frac{1}{\sigma^2}\right)^{1/2}\exp\left\{-\frac{\beta_n}{2\sigma^2}(\mu-\mu_n)^2\right\}\right.\nonumber\\ &\times \left.\left(\frac{1}{\sigma^2}\right)^{\alpha_n/2+1}\exp\left\{-\frac{\delta_n}{2\sigma^2}\right\}\right\}d\sigma^2d\mu\nonumber\\ &=\int_{-\infty}^{\infty}\int_0^{\infty}\left\{ \left(\frac{1}{\sigma^2}\right)^{\frac{\alpha_n+2}{2}+1}\exp\left\{-\frac{1}{2\sigma^2}((y_0-\mu)^2+\beta_n(\mu-\mu_n)^2+\delta_n)\right\}\right\}d\sigma^2d\mu\nonumber\\ &\propto\int_{-\infty}^{\infty}\left[\beta_n(\mu-\mu_n)^2+(y_0-\mu)^2+\delta_n\right]^{-\left(\frac{\alpha_n}{2}+1\right)}d\mu\nonumber\\ &=\int_{-\infty}^{\infty}\left[(\beta_n+1)\left(\mu-\left(\frac{\beta_n\mu_n+y_0}{\beta_n+1}\right)\right)^2+\frac{\beta_n(y_0-\mu_n)^2}{\beta_n+1}+\delta_n\right]^{-\left(\frac{\alpha_n}{2}+1\right)}d\mu\nonumber\\ &=\int_{-\infty}^{\infty}\left[1+\frac{(\beta_n+1)^2\left(\mu-\left(\frac{\beta_n\mu_n+y_0}{\beta_n+1}\right)\right)^2}{\beta_n(y_0-\mu_n)^2+(\beta_n+1)\delta_n}\right]^{-\left(\frac{\alpha_n}{2}+1\right)}d\mu\nonumber\\ &\times\left(\frac{\beta_n(y_0-\mu_n)^2+(\beta_n+1)\delta_n}{\beta_n+1}\right)^{-\left(\frac{\alpha_n}{2}+1\right)}\nonumber\\ &\propto\left(\frac{\beta_n(y_0-\mu_n)^2+(\beta_n+1)\delta_n}{(\beta_n+1)^2(\alpha_n+1)}\right)^{\frac{1}{2}}\left(\frac{\beta_n(y_0-\mu_n)^2+(\beta_n+1)\delta_n}{\beta_n+1}\right)^{-\left(\frac{\alpha_n}{2}+1\right)}\nonumber\\ &\propto (\beta_n(y_0-\mu_n)^2+(\beta_n+1)\delta_n)^{-\left(\frac{\alpha_n+1}{2}\right)}\nonumber\\ &\propto\left[1+\frac{\beta_n\alpha_n}{(\beta_n+1)\delta_n\alpha_n}(y_0-\mu_n)^2\right]^{-\left(\frac{\alpha_n+1}{2}\right)},\nonumber \end{align}
where we have that \left[1+\frac{(\beta_n+1)^2\left(\mu-\left(\frac{\beta_n\mu_n+y_0}{\beta_n+1}\right)\right)^2}{\beta_n(y_0-\mu_n)^2+(\beta_n+1)\delta_n}\right]^{-\left(\frac{\alpha_n}{2}+1\right)} is the kernel of a Student’s t density with degrees of freedom \alpha_n+1 and scale \frac{\beta_n(y_0-\mu_n)^2+(\beta_n+1)\delta_n}{(\beta_n+1)^2(\alpha_n+1)}.
The last expression is the kernel of a Student’s t density, that is, Y_0\mid \boldsymbol{y}\sim t\left(\mu_n,\frac{(\beta_n+1)\delta_n}{\beta_n\alpha_n},\alpha_n\right).
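The following R sketch implements these results: it computes the posterior parameters, evaluates the log marginal likelihood from the closed-form expression above, and draws from the Student's t predictive distribution. The simulated data and the hyperparameters \mu_0, \beta_0, \alpha_0, \delta_0 are illustrative choices.
# Normal/inverse-gamma updating: minimal sketch with simulated data
set.seed(10)
N <- 100
y <- rnorm(N, mean = 5, sd = 2)
mu0 <- 0; beta0 <- 1; alpha0 <- 3; delta0 <- 3
ybar <- mean(y)
betan <- beta0 + N
mun <- (beta0*mu0 + N*ybar)/betan
alphan <- alpha0 + N
deltan <- sum((y - ybar)^2) + delta0 + beta0*N/betan*(ybar - mu0)^2
c(PostMeanMu = mun, PostMeanSig2 = deltan/(alphan - 2))  # E[mu | y] and E[sigma^2 | y]
# Log marginal likelihood from the closed-form expression
lgamma(alphan/2) - lgamma(alpha0/2) + (alpha0/2)*log(delta0) - (alphan/2)*log(deltan) +
  0.5*(log(beta0) - log(betan)) - (N/2)*log(pi)
# Draws from the predictive distribution t(mu_n, (beta_n + 1)*delta_n/(beta_n*alpha_n), alpha_n)
scale2 <- (betan + 1)*deltan/(betan*alphan)
y0 <- mun + sqrt(scale2)*rt(10000, df = alphan)
quantile(y0, c(0.025, 0.975))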
The multivariate normal-normal/inverse-Wishart model
We show in subsection 3.1 that the multivariate normal distribution is in the exponential family where \begin{equation*} C(\boldsymbol{\mu},\boldsymbol{\Sigma})=\exp\left\{-\frac{1}{2}\left(tr\left(\boldsymbol{\mu}\boldsymbol{\mu}^{\top}\boldsymbol{\Sigma}^{-1}\right)+\log(|\Sigma|)\right)\right\}, \end{equation*} \begin{equation*} \eta(\boldsymbol{\mu},\boldsymbol{\Sigma})^{\top}=\left[\left(vec\left(\boldsymbol{\Sigma}^{-1}\right)\right)^{\top} \ \ \left(vec\left(\boldsymbol{\mu}^{\top}\boldsymbol{\Sigma}^{-1}\right)\right)^{\top}\right], \end{equation*} \begin{equation*} T(\boldsymbol{y})=\left[-\frac{1}{2}\left(vec\left(\boldsymbol{S}\right)^{\top}+N vec\left(\hat{\boldsymbol{\mu}}\hat{\boldsymbol{\mu}}^{\top}\right)^{\top}\right) \ \ N\hat{\boldsymbol{\mu}}^{\top}\right]^{\top} \end{equation*} and \begin{equation*} h(\boldsymbol{y})=(2\pi)^{-pN/2}. \end{equation*}
Then, its conjugate prior distribution should have the form \begin{align} \pi(\boldsymbol{\mu},\boldsymbol{\Sigma})&\propto \exp\left\{-\frac{b_0}{2}\left(tr\left(\boldsymbol{\mu}\boldsymbol{\mu}^{\top}\boldsymbol{\Sigma}^{-1}\right)+\log(|\Sigma|)\right)\right\}\nonumber\\ &\times \exp\left\{\boldsymbol{a}_{01}^{\top} vec\left(\boldsymbol{\Sigma}^{-1}\right)+\boldsymbol{a}_{02}^{\top}vec\left(\boldsymbol{\mu}^{\top}\boldsymbol{\Sigma}^{-1}\right)\right\}\nonumber\\ &=|\Sigma|^{-b_0/2}\exp\left\{-\frac{b_0}{2}\left(tr\left(\boldsymbol{\mu}^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}\right)\right)+tr\left(\boldsymbol{a}_{02}^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}\right)\right\}\nonumber\\ &\times \exp\left\{\boldsymbol{a}_{01}^{\top} vec\left(\boldsymbol{\Sigma}^{-1}\right)+\frac{\boldsymbol{a}_{02}^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{a}_{02}}{2b_0}-\frac{\boldsymbol{a}_{02}^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{a}_{02}}{2b_0}\right\}\nonumber\\ &=|\Sigma|^{-b_0/2}\exp\left\{-\frac{b_0}{2}\left(\boldsymbol{\mu}-\frac{\boldsymbol{a}_{02}}{b_0}\right)^{\top}\boldsymbol{\Sigma}^{-1}\left(\boldsymbol{\mu}-\frac{\boldsymbol{a}_{02}}{b_0}\right)\right\}\nonumber\\ &\times \exp\left\{-\frac{1}{2}tr\left(\left(\boldsymbol{A}_{01}-\frac{\boldsymbol{a}_{02}\boldsymbol{a}_{02}^{\top}}{b_0}\right)\boldsymbol{\Sigma}^{-1}\right)\right\}\nonumber\\ &=\underbrace{|\Sigma|^{-1/2}\exp\left\{-\frac{b_0}{2}\left(\boldsymbol{\mu}-\frac{\boldsymbol{a}_{02}}{b_0}\right)^{\top}\boldsymbol{\Sigma}^{-1}\left(\boldsymbol{\mu}-\frac{\boldsymbol{a}_{02}}{b_0}\right)\right\}}_1\nonumber\\ &\times \underbrace{|\Sigma|^{-(\alpha_0+p+1)/2}\exp\left\{-\frac{1}{2}tr\left(\left(\boldsymbol{A}_{01}-\frac{\boldsymbol{a}_{02}\boldsymbol{a}_{02}^{\top}}{b_0}\right)\boldsymbol{\Sigma}^{-1}\right)\right\}}_2,\nonumber \end{align}
Here, b_0 represents the hypothetical sample size, and \boldsymbol{a}_{01} and \boldsymbol{a}_{02} are p^2-dimensional and p-dimensional vectors of prior sufficient statistics, respectively. Specifically, \boldsymbol{a}_{01} = -\frac{1}{2} \text{vec}(\boldsymbol{A}_{01}), where \boldsymbol{A}_{01} is a p \times p positive semi-definite matrix.
Setting b_0 = 1 + \alpha_0 + p + 1, we observe that the first part of the last expression is the kernel of a multivariate normal density with mean \boldsymbol{\mu}_0 = \frac{\boldsymbol{a}_{02}}{b_0} and covariance \frac{\boldsymbol{\Sigma}}{b_0}, i.e., \boldsymbol{\mu} \mid \boldsymbol{\Sigma} \sim N_p \left( \boldsymbol{\mu}_0, \frac{\boldsymbol{\Sigma}}{b_0} \right), where b_0 = \beta_0. This choice of hyperparameters is intuitive because \boldsymbol{a}_{02} represents the hypothetical sum of prior observations, and b_0 represents the hypothetical prior sample size.
Additionally, the second part of the last expression corresponds to the kernel of an inverse Wishart distribution with scale matrix \boldsymbol{\Psi}_0 = \left( \boldsymbol{A}_{01} - \frac{\boldsymbol{a}_{02} \boldsymbol{a}_{02}^{\top}}{b_0} \right) and \alpha_0 degrees of freedom, i.e., \boldsymbol{\Sigma} \sim IW_p (\boldsymbol{\Psi}_0, \alpha_0). Observe that \boldsymbol{\Psi}_0 has the same structure as the first part of the sufficient statistics in T(\boldsymbol{y}), except that it should be understood as arising from prior hypothetical observations.
Therefore, the prior distribution in this setting is normal/inverse-Wishart, and, due to conjugacy, the posterior distribution belongs to the same family. \begin{align} \pi(\boldsymbol{\mu},\boldsymbol{\Sigma}\mid \boldsymbol{y})&\propto (2\pi)^{-p N/2}|\boldsymbol{\Sigma}|^{-N/2}\exp\left\{-\frac{1}{2}tr\left[\left(\boldsymbol{S}+N\left(\boldsymbol{\mu}-\hat{\boldsymbol{\mu}}\right)\left(\boldsymbol{\mu}-\hat{\boldsymbol{\mu}}\right)^{\top}\right)\boldsymbol{\Sigma}^{-1}\right]\right\}\nonumber\\ &\times |\boldsymbol{\Sigma}|^{-1/2}\exp\left\{-\frac{\beta_0}{2}tr\left[(\boldsymbol{\mu}-\boldsymbol{\mu}_0)(\boldsymbol{\mu}-\boldsymbol{\mu}_0)^{\top}\boldsymbol{\Sigma}^{-1}\right]\right\}|\boldsymbol{\Sigma}|^{-(\alpha_0+p+1)/2}\nonumber\\ &\times\exp\left\{-\frac{1}{2}tr(\boldsymbol{\Psi}_0\boldsymbol{\Sigma}^{-1})\right\}\nonumber. \end{align}
Taking into account that
\begin{align} N\left(\boldsymbol{\mu}-\hat{\boldsymbol{\mu}}\right)\left(\boldsymbol{\mu}-\hat{\boldsymbol{\mu}}\right)^{\top}+\beta_0\left(\boldsymbol{\mu}-\boldsymbol{\mu}_0\right)\left(\boldsymbol{\mu}-\boldsymbol{\mu}_0\right)^{\top}&=(N+\beta_0)\left(\boldsymbol{\mu}-\boldsymbol{\mu}_n\right)\left(\boldsymbol{\mu}-\boldsymbol{\mu}_n\right)^{\top}\nonumber\\ &+\frac{N\beta_0}{N+\beta_0}\left(\hat{\boldsymbol{\mu}}-\boldsymbol{\mu}_0\right)\left(\hat{\boldsymbol{\mu}}-\boldsymbol{\mu}_0\right)^{\top},\nonumber \end{align}
where \boldsymbol{\mu}_n=\frac{N}{N+\beta_0}\hat{\boldsymbol{\mu}}+\frac{\beta_0}{N+\beta_0}\boldsymbol{\mu}_0 is the posterior mean. We have
\begin{align} \pi(\boldsymbol{\mu},\boldsymbol{\Sigma}\mid \boldsymbol{y})&\propto |\boldsymbol\Sigma|^{-1/2}\exp\left\{-\frac{N+\beta_0}{2}tr\left[\left(\left(\boldsymbol{\mu}-\boldsymbol{\mu}_n\right)\left(\boldsymbol{\mu}-\boldsymbol{\mu}_n\right)^{\top}\right)\boldsymbol{\Sigma}^{-1}\right]\right\}\nonumber\\ &\times |\boldsymbol{\Sigma}|^{-(N+\alpha_0+p+1)/2}\nonumber\\ &\times\exp\left\{-\frac{1}{2}tr\left[\left(\boldsymbol{\Psi}_0+\boldsymbol{S}+\frac{N\beta_0}{N+\beta_0}(\hat{\boldsymbol{\mu}}-\boldsymbol{\mu}_0)(\hat{\boldsymbol{\mu}}-\boldsymbol{\mu}_0)^{\top}\right)\boldsymbol{\Sigma}^{-1}\right]\right\}.\nonumber \end{align}
Then, \boldsymbol{\mu}\mid \boldsymbol{\Sigma},\boldsymbol{y}\sim N_p\left(\boldsymbol{\mu}_n,\frac{1}{\beta_n}\boldsymbol{\Sigma}\right), and \boldsymbol{\Sigma}\mid \boldsymbol{y}\sim IW\left(\boldsymbol{\Psi}_n,\alpha_n\right) where \beta_n=N+\beta_0, \alpha_n=N+\alpha_0 and \boldsymbol{\Psi}_n=\boldsymbol{\Psi}_0+\boldsymbol{S}+\frac{N\beta_0}{N+\beta_0}(\hat{\boldsymbol{\mu}}-\boldsymbol{\mu}_0)(\hat{\boldsymbol{\mu}}-\boldsymbol{\mu}_0)^{\top}.
The marginal posterior of \boldsymbol{\mu} is given by \int_{\mathcal{S}} \pi(\boldsymbol{\mu},\boldsymbol{\Sigma}\mid \boldsymbol{y})d\boldsymbol{\Sigma}, where \mathcal{S} is the space of positive definite matrices. Then,
\begin{align} \pi(\boldsymbol{\mu}\mid \boldsymbol{y})&\propto\int_{\mathcal{S}}\left\{|\boldsymbol{\Sigma}|^{-(\alpha_n+p+2)/2}\right.\nonumber\\ &\left. \exp\left\{-\frac{1}{2}tr\left[\left(\beta_n\left(\boldsymbol{\mu}-\boldsymbol{\mu}_n\right)\left(\boldsymbol{\mu}-\boldsymbol{\mu}_n\right)^{\top}+\boldsymbol{\Psi}_n\right)\boldsymbol{\Sigma}^{-1}\right]\right\} \right\}d\boldsymbol{\Sigma}\nonumber\\ &\propto \big\lvert\left(\beta_n\left(\boldsymbol{\mu}-\boldsymbol{\mu}_n\right)\left(\boldsymbol{\mu}-\boldsymbol{\mu}_n\right)^{\top}+\boldsymbol{\Psi}_n\right)\big\lvert^{-(\alpha_n+1)/2}\nonumber\\ &=\left[\big\lvert\boldsymbol{\Psi}_n\big\lvert\times \big\lvert1+\beta_n\left(\boldsymbol{\mu}-\boldsymbol{\mu}_n\right)^{\top}\boldsymbol{\Psi}_n^{-1}\left(\boldsymbol{\mu}-\boldsymbol{\mu}_n\right)\big\lvert\right]^{-(\alpha_n+1)/2}\nonumber\\ &\propto \left(1+\frac{1}{\alpha_n+1-p}\left(\boldsymbol{\mu}-\boldsymbol{\mu}_n\right)^{\top}\left(\frac{\boldsymbol{\Psi}_n}{(\alpha_n+1-p)\beta_n}\right)^{-1}\left(\boldsymbol{\mu}-\boldsymbol{\mu}_n\right)\right)^{-(\alpha_n+1-p+p)/2},\nonumber \end{align}
where the second line uses the normalizing constant of the inverse Wishart distribution, and the third line uses a particular case of Sylvester's determinant theorem.
We observe that the last line is the kernel of a multivariate t distribution, that is, \boldsymbol{\mu}\mid \boldsymbol{y}\sim t_p(v_n,\boldsymbol{\mu}_n,\boldsymbol{\Sigma}_n) where v_n=\alpha_n+1-p and \boldsymbol{\Sigma}_n=\frac{\boldsymbol{\Psi}_n}{(\alpha_n+1-p)\beta_n}.
The marginal likelihood is given by \begin{align} p(\boldsymbol{y})=\frac{\Gamma_p\left(\frac{\alpha_n}{2}\right)}{\Gamma_p\left(\frac{\alpha_0}{2}\right)}\frac{|\boldsymbol{\Psi}_0|^{\alpha_0/2}}{|\boldsymbol{\Psi}_n|^{\alpha_n/2}}\left(\frac{\beta_0}{\beta_n}\right)^{p/2}\pi^{-Np/2},\nonumber \end{align}
where \Gamma_p is the multivariate gamma function (see Exercise 5).
The posterior predictive distribution is \boldsymbol{Y}_0\mid \boldsymbol{y}\sim t_p(v_n,\boldsymbol{\mu}_n,(\beta_n+1)\boldsymbol{\Sigma}_n) (see Exercise 6).
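Although the multivariate Student's t distribution can be handled directly, a simple way to draw from this predictive distribution is by composition: draw \boldsymbol{\Sigma} from its inverse Wishart posterior, then \boldsymbol{\mu} from its conditional normal posterior, and finally \boldsymbol{Y}_0 from the multivariate normal likelihood. The sketch below assumes the posterior parameters (mun, betan, Psin, alphan) have already been computed, and uses MCMCpack::riwish and MASS::mvrnorm.
# Posterior predictive draws for the multivariate normal/inverse-Wishart model by composition
rpredictive <- function(S, mun, betan, Psin, alphan) {
  p <- length(mun)
  Y0 <- matrix(0, S, p)
  for (s in 1:S) {
    Sigma <- MCMCpack::riwish(alphan, Psin)       # Sigma | y ~ IW(Psi_n, alpha_n)
    mu <- MASS::mvrnorm(1, mun, Sigma/betan)      # mu | Sigma, y ~ N_p(mu_n, Sigma/beta_n)
    Y0[s, ] <- MASS::mvrnorm(1, mu, Sigma)        # Y_0 | mu, Sigma ~ N_p(mu, Sigma)
  }
  return(Y0)
}
# For instance: Y0draws <- rpredictive(1000, mun, betan, Psin, alphan)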
Example: Tangency portfolio of US tech stocks
The tangency portfolio is the portfolio that maximizes the Sharpe ratio, which is defined as the excess return of a portfolio standardized by its risk.
We aim to find the portfolio weights \boldsymbol{w} that maximize the Sharpe ratio, where \mu_{i,T+\kappa} = \mathbb{E}\left( R_{i,T+\kappa} - R_{f,T+\kappa} \mid \mathcal{I}_T \right), with R_{i,T+\kappa} and R_{f,T+\kappa} representing the returns of stock i and the risk-free asset, respectively. Here, \mu_{i,T+\kappa} is the expected value of the excess return at period T+\kappa, conditional on information available up to time T (\mathcal{I}_T), and \boldsymbol{\Sigma}_{T+\kappa} is the covariance matrix of the excess returns, which quantifies the risk. \begin{equation*} \text{argmax}_{{\boldsymbol w}\in \mathbb{R}^{p}} \frac{{\boldsymbol w}^{\top}\boldsymbol{\mu}_{T+\kappa}}{\sqrt{{\boldsymbol w}^{\top}{\boldsymbol{\Sigma}}_{T+\kappa} {\boldsymbol w}}}; \hspace{1cm} \text{s.t}\hspace{.5cm} {\boldsymbol w}^{\top}{\boldsymbol{1}}=1, \end{equation*} where the solution is \begin{equation*} {\boldsymbol w}^*=\frac{{\boldsymbol{\Sigma}}^{-1}_{T+\kappa}\boldsymbol{\mu}_{T+\kappa}}{{\boldsymbol{1}}^{\top}{\boldsymbol \Sigma}^{-1}_{T+\kappa}\boldsymbol{\mu}_{T+\kappa}}. \end{equation*}
If we want to find the optimal portfolio for the next period under the assumption that the excess returns follow a multivariate normal distribution –a common assumption in these applications– we can set \kappa = 1 and use the predictive distribution of the excess returns. In this case, \boldsymbol{\mu}_{T+1} = \boldsymbol{\mu}_n and \boldsymbol{\Sigma}_{T+1} = \frac{v_n}{v_n - 2} (\beta_n + 1) \boldsymbol{\Sigma}_n, based on the previous predictive result.
We apply this framework to ten tech stocks of the US market between January first, 2021, and September ninth, 2022. In particular, we use information from Yahoo Finance for Apple (AAPL), Netflix (NFLX), Amazon (AMZN), Microsoft (MSFT), Google (GOOG), Meta (META), Tesla (TSLA), NVIDIA Corporation (NVDA), Intel (INTC), and PayPal (PYPL).
library(quantmod)
# quantmod (which loads xts, zoo and TTR) provides getSymbols() and Ad()
library(gridExtra)
# grid.arrange
graphics.off()
rm(list=ls())
# Data Range
sdate <- as.Date("2021-01-01")
edate <- as.Date("2022-09-30")
Date <- seq(sdate, edate, by = "day")
tickers <- c("AAPL", "NFLX", "AMZN", "GOOG", "INTC","META", "MSFT", "TSLA", "NVDA", "PYPL")
p <- length(tickers)
# AAPL: Apple, NFLX: Netflix, AMZN: Amazon,
# MSFT: Microsoft, GOOG: Google, META: Meta,
# TSLA: Tesla, NVDA: NVIDIA Corporation
# INTC: Intel, PYPL: PayPal
ss_stock <- getSymbols(tickers, from=sdate, to=edate, auto.assign = T)
ss_stock <- purrr::map(tickers,function(x) Ad(get(x)))
ss_stock <- as.data.frame(purrr::reduce(ss_stock, merge))
colnames(ss_stock) <- tickers
# This is to get stock prices
ss_rtn <- as.data.frame(apply(ss_stock, 2, function(x) {diff(log(x), 1)}))
# Daily returns
t10yr <- getSymbols(Symbols = "DGS10", src = "FRED", from=sdate, to=edate, auto.assign = F)
# To get 10-Year US Treasury yield data from the Federal Reserve Electronic Database (FRED)
t10yrd <- (1 + t10yr/100)^(1/365)-1
# Daily returns
t10yrd <- t10yrd[row.names(ss_rtn)]
Exc_rtn <- as.matrix(ss_rtn) - kronecker(t(rep(1, p)), as.matrix(t10yrd))
# Excesses of return
df <- as.data.frame(Exc_rtn)
df$Date <- as.Date(rownames(df))
# Get months
df$Month <- months(df$Date)
# Get years
df$Year <- format(df$Date, format="%y")
# Aggregate on months and year and get mean
Data <- sapply(1:p, function(i) {
aggregate(df[, i] ~ Month + Year, df, mean)})
DataExcRtn <- matrix(0, length(Data[, 1]$Month), p)
for(i in 1:p){
DataExcRtn[, i] <- as.numeric(Data[, i]$`df[, i]`)
}
colnames(DataExcRtn) <- tickers
head(DataExcRtn)
## AAPL NFLX AMZN GOOG INTC
## [1,] 0.003453436 -0.0007978684 0.0053804495 0.0072313890 -0.0051193877
## [2,] 0.001856496 0.0042864089 0.0018802467 0.0032834533 0.0005462708
## [3,] 0.003214835 -0.0029236867 -0.0023355829 0.0006654206 0.0020368917
## [4,] 0.001054383 0.0009737922 0.0003104527 0.0033227708 0.0061459733
## [5,] -0.004406269 0.0006005357 -0.0019272761 0.0054374191 0.0050578243
## [6,] 0.002962127 -0.0010048976 -0.0016201572 0.0035865836 -0.0021341328
## META MSFT TSLA NVDA PYPL
## [1,] 0.0046552164 0.0031597866 0.002826751 0.0055412954 0.0036246294
## [2,] 0.0028180326 0.0026818375 0.003066171 0.0062470706 0.0020811144
## [3,] 0.0015960684 0.0007412593 -0.003674775 -0.0048193681 0.0008583936
## [4,] -0.0022658207 0.0034977089 0.004623756 -0.0005564295 0.0005398989
## [5,] -0.0001791074 0.0001820381 -0.008509941 0.0028232820 0.0054109929
## [6,] 0.0011262192 0.0023652517 0.000486675 -0.0012498933 -0.0027156472
# Hyperparameters #
N <- dim(DataExcRtn)[1]
mu0 <- rep(0, p)
beta0 <- 1
Psi0 <- 100 * diag(p)
alpha0 <- p + 2
# Posterior parameters #
alphan <- N + alpha0
vn <- alphan + 1 - p
muhat <- colMeans(DataExcRtn)
mun <- N/(N + beta0) * muhat + beta0/(N + beta0) * mu0
S <- t(DataExcRtn - rep(1, N)%*%t(muhat))%*%(DataExcRtn - rep(1, N) %*%t(muhat))
Psin <- Psi0 + S + N*beta0/(N + beta0)*(muhat - mu0)%*%t(muhat - mu0)
betan <- N + beta0
Sigman <- Psin/((alphan + 1 - p)*betan)
Covarn <- (Sigman * (1 + betan)) * vn / (vn - 2)
Covari <- solve(Covarn)
OptShare <- t(Covari%*%mun/as.numeric((t(rep(1, p))%*%Covari%*%mun)))
colnames(OptShare) <- tickers
OptShare
## AAPL NFLX AMZN GOOG INTC META MSFT
## [1,] -0.01871449 0.248481 0.1028211 -0.03408626 0.1733602 0.2299969 -0.0222197
## TSLA NVDA PYPL
## [1,] -0.01591862 0.03534303 0.3009368
We find that the optimal tangency portfolio has weights of 24.8%, 10.2%, 17.3%, 23%, 3.5% and 30.1% in Netflix, Amazon, Intel, Meta, NVIDIA and PayPal, and weights of -1.9%, -3.4%, -2.2% and -1.6% in Apple, Google, Microsoft and Tesla. A negative weight means a short position in financial jargon, that is, borrowing a stock to sell it.
References
Another parametrization of the gamma density is the scale parametrization, where \kappa_0 = 1/\beta_0. See the health insurance example in Chapter 1.↩︎
https://www.11v11.com/teams/manchester-city/tab/opposingTeams/opposition/Liverpool/.↩︎
https://www.oddsportal.com/soccer/england/premier-league/liverpool-manchester-city-WrqgEz5S/↩︎