Chapter 1 Basics of Bayesian linear regression

1.1 Bayes’ theorem

Theorem 1.1 (Bayes' theorem) For events $$U, K$$ and $$P(K) \neq 0$$, we have

$P(U \mid K) = \frac{P(K \mid U) P(U)}{P(K)}$

We denote $$U$$ as unknown parameters and $$K$$ as known parameters. We call $$P(U)$$ prior and $$P(K|U)$$ likelihood. The Bayes’ theorem gives us the posterior distribution of unknown parameters given the known parameters $P(U \mid K) \propto P(U) \cdot P(K \mid U)$

1.2 Normal-Inverse-Gamma (NIG) prior

1.2.1 Joint distribution of NIG prior

Definition 1.1 (Normal-Inverse-Gamma Distribution) Suppose \begin{aligned} & \beta \mid \sigma^2, \mu, M \sim N(\mu,\sigma^2 M) \\ & \sigma^2 \mid a, b \sim IG(a, b) \end{aligned} Then $$(\beta,\sigma^2)$$ has a Normal-Inverse-Gamma distribution, denoted as $$(\beta,\sigma^2) \sim NIG(\mu,M,a,b)$$.

We use a Normal-Inverse-Gamma prior for $$(\beta, \sigma^2)$$

\begin{align} P(\beta, \sigma^{2}) &= NIG \left(\beta, \sigma^{2} \mid m_{0}, M_{0}, a_{0}, b_{0}\right) \\ &= \frac{b_0^{a_0}}{\Gamma\left(a_{0}\right)} \left(\frac{1}{\sigma^{2}}\right)^{a_{0}+1} e^{-\frac{b_{0}}{\sigma^{2}}} \frac{1}{(2 \pi \sigma^{2})^{\frac{p}{2}}\left| M_{0}\right|^{\frac{1}{2}}} e^{-\frac{1}{2 \sigma^{2}} Q \left(\beta, m_{0}, M_{0}\right)} \end{align}

where $$Q(x, m, M)=(x-m)^{\mathrm{\scriptscriptstyle T}} M^{-1} (x-m) \, .$$

1.2.2 Marginal distribution of NIG prior

As for marginal priors, we can can get it by integration

\begin{aligned} P(\sigma^2) & = \int N I G\left(\beta, \sigma^{2} \mid m_{0}, M_{0}, a_{0}, b_{0}\right) \ d\beta=I G\left(\sigma^{2} \mid a_{0}, b_{0}\right) \\ P(\beta) & = \int N I G\left(\beta, \sigma^{2} \mid m_{0}, M_{0}, a_{0}, b_{0}\right) \ d\sigma^{2}=t_{2a_0}\left(m_0, \frac{b_0}{a_0}M_0\right) \\ \end{aligned}

Note: the density of multivariate t-distribution is given by

$t_v(\mu, \Sigma)=\frac{\Gamma\left(\frac{v+p}{2}\right)}{(v \pi)^{\frac{p}{2}} \Gamma\left(\frac{v}{2}\right) |\Sigma|^{\frac{1}{2}}}\left(1+\frac{1}{v}(x-\mu)^{\mathrm{\scriptscriptstyle T}} \Sigma^{-1}(x-\mu)\right)^{-\frac{v+p}{2}}$

1.3 Conjugate Bayesian linear regression and M&m formula

Let $$y_{n \times 1}$$ be outcome variable and $$X_{n \times p}$$ be corresponding covariates. Assume $$V$$ is known. The model is given by

\begin{align} & y=X \beta+\epsilon \ , \ \epsilon \sim N\left(0, \sigma^2 V\right) \\ & \beta=m_0+\omega \ , \ \omega \sim N\left(0, \sigma^2 M_0\right) \\ & \sigma^2 \sim I G\left(a_0, b_0\right) \end{align}

The posterior distribution of $$(\beta, \sigma^2)$$ is given by

\begin{align} P\left(\beta, \sigma^{2} \mid y\right) = NIG\left(\beta, \sigma^{2} \mid M_{1}m_{1}, M_{1}, a_{1}, b_{1}\right) \; \end{align}

where

\begin{align} M_{1}^{-1} &= M_{0}^{-1}+X^{\mathrm{\scriptscriptstyle T}} V^{-1} X \\ m_{1}&=M_{0}^{-1} m_{0}+X^{\mathrm{\scriptscriptstyle T}} V^{-1} y \\ a_{1}&=a_{0}+\frac{p}{2} \\ b_{1}&=b_{0}+\frac{c^{\ast}}{2}= b_{0}+\frac{1}{2}\left(m_{0}^{\mathrm{\scriptscriptstyle T}} M_{0}^{-1} m_{0}+y^{\mathrm{\scriptscriptstyle T}} V^{-1} y-m_{1}^{\mathrm{\scriptscriptstyle T}} M_{1} m_{1}\right) \end{align}

From derivation in marginal priors, the marginal posterior distributions can be easily get by updating corresponding parameters

\begin{aligned} &P\left(\sigma^{2} \mid y\right)=I G\left(\sigma^{2} \mid a_{1}, b_{1}\right) \\ &P(\beta \mid y)=t_{2a_1}\left(M_1m_1, \frac{b_1}{a_1}M_1\right) \end{aligned}

1.4 Updating form of the posterior distribution

We will use two ways to derive the updating form of the posterior distribution.

1.4.1 Method 1: Sherman-Woodbury-Morrison identity

Theorem 1.2 (Sherman-Woodbury-Morrison identity) We have $$$\label{ShermanWoodburyMorrison} \left(A + BDC\right)^{-1} = A^{-1} - A^{-1}B\left(D^{-1}+CA^{-1}B\right)^{-1}CA^{-1}$$$ where $$A$$ and $$D$$ are square matrices that are invertible and $$B$$ and $$C$$ are rectangular (square if $$A$$ and $$D$$ have the same dimensions) matrices such that the multiplications are well-defined.

Sherman-Woodbury-Morrison identity is easily verified by multiplying the right hand side with $$A + BDC$$ and simplifying to reduce it to the identity matrix. Using this formula, we have

\begin{aligned} M_1 & = (M_{0}^{-1} + X^{\mathrm{\scriptscriptstyle T}}V^{-1}X)^{-1} \\ & = M_0-M_0 X^{\mathrm{\scriptscriptstyle T}}\left(V+X M_0 X^{\mathrm{\scriptscriptstyle T}}\right)^{-1} X M_0 \\ & = M_0-M_0 X^{\mathrm{\scriptscriptstyle T}} Q^{-1} X M_0 \end{aligned}

where $$Q = V + X M_0 X^{\mathrm{\scriptscriptstyle T}}$$

We can show that \begin{align} M_1 m_1 & =m_0+M_0 X^{\mathrm{\scriptscriptstyle T}} Q^{-1}\left(y-X m_0\right) \end{align}

Furthermore, we can simplify that \begin{align} m_0^{\mathrm{\scriptscriptstyle T}} M_0^{-1} m_0+y^{\mathrm{\scriptscriptstyle T}} V^{-1} y-m_1^{\mathrm{\scriptscriptstyle T}} M_1 m_1 & = \left(y-X m_0\right)^{\mathrm{\scriptscriptstyle T}} Q^{-1}\left(y-X m_0\right) \end{align}

So, we get the following updating form of the posterior distribution from Bayesian linear regression

\begin{aligned} P\left(\beta, \sigma^{2} \mid y\right) = NIG\left(\beta, \sigma^{2} \mid \tilde{m}_1, \tilde{M}_1, a_{1}, b_{1}\right) \; \end{aligned} where

\begin{aligned} \tilde{m}_1 & =M_1 m_1=m_0+M_0 X^{\mathrm{\scriptscriptstyle T}} Q^{-1}\left(y-X m_0\right) \\ \tilde{M}_1 & =M_1=M_0-M_0 X^{\mathrm{\scriptscriptstyle T}} Q^{-1} X M_0 \\ a_1 & =a_0+\frac{p}{2} \\ b_1 & =b_0+\frac{1}{2}\left(y-X m_0\right)^{\mathrm{\scriptscriptstyle T}} Q^{-1}\left(y-X m_0\right) \\ Q & =V+X M_0 X^{\mathrm{\scriptscriptstyle T}} \end{aligned}

1.4.2 Method 2: Distribution theory

Previously, we got the Bayesian Linear Regression Updater using Sherman-Woodbury-Morrison identity. Here, we will derive the results without resorting to it. Recall that the model is given by

\begin{align} & y=X \beta+\epsilon \ , \ \epsilon \sim N\left(0, \sigma^2 V\right) \\ & \beta=m_0+\omega \ , \ \omega \sim N\left(0, \sigma^2 M_0\right) \\ & \sigma^2 \sim I G\left(a_0, b_0\right) \end{align}

This corresponds to the posterior distribution

\begin{align} P\left(\beta, \sigma^2 \mid y\right) \propto I G\left(\sigma^2 \mid a_0, b_0\right) & \times N\left(\beta \mid m_0, \sigma^2 M_0\right) \times N\left(y \mid X \beta, \sigma^2 V\right) \end{align}

We will derive $$P\left(\sigma^2 \mid y\right)$$ and $$P\left(\beta \mid \sigma^2, y\right)$$ in a form that will reflect updates from the prior to the posterior. Integrating out $$\beta$$ from the model is equivalent to substituting $$\beta$$ from its prior model. Thus, $$P\left(y \mid \sigma^2\right)$$ is derived simply from $$y =X \beta+\epsilon =X\left(m_0+\omega\right)+\epsilon =X m_0 + X \omega + \epsilon =X m_0+ \eta$$

where $$\eta = X \omega + \epsilon \sim N\left(0, \sigma^2Q\right)$$ and $$Q=X M_0 X^{\mathrm{\scriptscriptstyle T}}+V \, .$$

Therefore,

\begin{align} y \mid \sigma^2 \sim N\left(X m_0, \sigma^2 Q\right)\\ \end{align}

The posterior distribution is given by \begin{align} P\left(\sigma^2 \mid y\right) & \propto IG \left(\sigma^2 \mid a_1, b_1\right) \end{align}

where \begin{align} & a_1 = a_0 + \frac{p}{2} \\ & b_1 = b_0 + \frac{1}{2} (y-Xm_0)^{\mathrm{\scriptscriptstyle T}} Q^{-1} \left(y-Xm_0\right) \end{align}

Next, we turn to $$P\left(\beta \mid \sigma^2, y\right)$$. Note that $\left(\begin{array}{l} y \\ \beta \end{array}\right) \mid \sigma^2 \sim N\left(\left(\begin{array}{l} Xm_0 \\ m_0 \end{array}\right), \quad \sigma^2 \left(\begin{array}{cc} Q & X M_0 \\ M_0 X^{\mathrm{\scriptscriptstyle T}} & M_0 \end{array}\right)\right) \;$

From the expression of a conditional distribution derived from a multivariate Gaussian, we obtain $\beta \mid \sigma^2, y \sim N\left(\tilde{m}_1, \sigma^2 \tilde{M}_1\right)$

where \begin{align} & \tilde{m}_1=\operatorname{E}\left[\beta \mid \sigma^2, y\right]=m_0+M_0 X^{\mathrm{\scriptscriptstyle T}} Q^{-1}\left(y-X{m_0}\right) \\ & \tilde{M}_1=M_0-M_0 X^{\mathrm{\scriptscriptstyle T}} Q^{-1} X M_0 \\ \end{align}

1.5 Bayesian prediction

Assume $$V=I_{n}$$. Let $$\tilde{y}$$ denote an $$\tilde{n}\times 1$$ vector of outcomes. $$\tilde{X}$$ is corresponding predictors. We seek to predict $$\tilde{y}$$ based upon $$y$$

\begin{align} P(\tilde{y} \mid y) &= t_{2a_1}\left(\tilde{X} M_1 m_1, \frac{b_1}{a_1}\left(I_{\tilde{n}} + \tilde{X} M_{1} \tilde{X}^{\mathrm{\scriptscriptstyle T}}\right)\right) \; \end{align}

1.6 Sampling from the posterior distribution

We can get the joint posterior density $$P\left(\beta, \sigma^{2}, \tilde{y} \mid y\right)$$ by sampling process

1. Draw $$\hat{\sigma}_{(i)}^{2}$$ from $$I G\left(a_{1}, b_{1}\right)$$

2. Draw $$\hat{\beta}_{(i)}$$ from $$N\left(M_{1} m_{1}, \hat{\sigma}_{(i)}^{2} M_{1}\right)$$

3. Draw $$\tilde{y}_{(i)}$$ from $$N\left(\tilde{X} \hat{\beta}_{(i)}, \hat{\sigma}_{(i)}^{2}I_{\tilde{n}}\right)$$