Chapter 1 Basics of Bayesian linear regression
1.1 Bayes’ theorem
Theorem 1.1 (Bayes' theorem) For events \(U\) and \(K\) with \(P(K) \neq 0\), we have
\[P(U \mid K) = \frac{P(K \mid U) P(U)}{P(K)}\]
We denote by \(U\) the unknown parameters and by \(K\) the known quantities. We call \(P(U)\) the prior and \(P(K \mid U)\) the likelihood. Bayes' theorem then gives the posterior distribution of the unknown parameters given the known quantities, \[ P(U \mid K) \propto P(U) \cdot P(K \mid U)\]
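As a quick numerical illustration (the probabilities below are made up for this sketch), the theorem can be applied directly to discrete events, with \(U\) an unknown state and \(K\) an observed event:

```python
# A minimal discrete illustration of Bayes' theorem (hypothetical numbers).
p_U = 0.01             # prior P(U)
p_K_given_U = 0.95     # likelihood P(K | U)
p_K_given_notU = 0.10  # P(K | not U)

# Law of total probability for P(K), then Bayes' theorem for P(U | K).
p_K = p_K_given_U * p_U + p_K_given_notU * (1 - p_U)
p_U_given_K = p_K_given_U * p_U / p_K
print(round(p_U_given_K, 4))  # ~0.0876
```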
1.2 Normal-Inverse-Gamma (NIG) prior
1.2.1 Joint distribution of NIG prior
Definition 1.1 (Normal-Inverse-Gamma Distribution) Suppose \[ \begin{aligned} & \beta \mid \sigma^2, \mu, M \sim N(\mu,\sigma^2 M) \\ & \sigma^2 \mid a, b \sim IG(a, b) \end{aligned} \] Then \((\beta,\sigma^2)\) has a Normal-Inverse-Gamma distribution, denoted as \((\beta,\sigma^2) \sim NIG(\mu,M,a,b)\).
We use a Normal-Inverse-Gamma prior for \((\beta, \sigma^2)\)
\[\begin{align} P(\beta, \sigma^{2}) &= NIG \left(\beta, \sigma^{2} \mid m_{0}, M_{0}, a_{0}, b_{0}\right) \\ &= \frac{b_0^{a_0}}{\Gamma\left(a_{0}\right)} \left(\frac{1}{\sigma^{2}}\right)^{a_{0}+1} e^{-\frac{b_{0}}{\sigma^{2}}} \frac{1}{(2 \pi \sigma^{2})^{\frac{p}{2}}\left| M_{0}\right|^{\frac{1}{2}}} e^{-\frac{1}{2 \sigma^{2}} Q \left(\beta, m_{0}, M_{0}\right)} \end{align}\]
where \(Q(x, m, M)=(x-m)^{\mathrm{\scriptscriptstyle T}} M^{-1} (x-m) \, .\)
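As a minimal sketch (Python with NumPy/SciPy; the function name nig_logpdf and the hyperparameter values are ours), the joint density can be evaluated through its factorization \(IG(\sigma^2 \mid a_0, b_0) \times N(\beta \mid m_0, \sigma^2 M_0)\):

```python
import numpy as np
from scipy.stats import invgamma, multivariate_normal

def nig_logpdf(beta, sigma2, m0, M0, a0, b0):
    """Log-density of NIG(m0, M0, a0, b0) at (beta, sigma2), computed via
    the factorization IG(sigma2 | a0, b0) * N(beta | m0, sigma2 * M0)."""
    log_ig = invgamma.logpdf(sigma2, a0, scale=b0)
    log_normal = multivariate_normal.logpdf(beta, mean=m0, cov=sigma2 * M0)
    return log_ig + log_normal

# Illustrative hyperparameters with p = 2.
p = 2
m0, M0, a0, b0 = np.zeros(p), np.eye(p), 2.0, 1.0
print(nig_logpdf(np.array([0.1, -0.2]), 0.8, m0, M0, a0, b0))
```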
1.2.2 Marginal distribution of NIG prior
The marginal priors can be obtained by integration:
\[ \begin{aligned} P(\sigma^2) & = \int N I G\left(\beta, \sigma^{2} \mid m_{0}, M_{0}, a_{0}, b_{0}\right) \ d\beta=I G\left(\sigma^{2} \mid a_{0}, b_{0}\right) \\ P(\beta) & = \int N I G\left(\beta, \sigma^{2} \mid m_{0}, M_{0}, a_{0}, b_{0}\right) \ d\sigma^{2}=t_{2a_0}\left(m_0, \frac{b_0}{a_0}M_0\right) \\ \end{aligned} \]
Note: the density of multivariate t-distribution is given by
\[ t_v(\mu, \Sigma)=\frac{\Gamma\left(\frac{v+p}{2}\right)}{(v \pi)^{\frac{p}{2}} \Gamma\left(\frac{v}{2}\right) |\Sigma|^{\frac{1}{2}}}\left(1+\frac{1}{v}(x-\mu)^{\mathrm{\scriptscriptstyle T}} \Sigma^{-1}(x-\mu)\right)^{-\frac{v+p}{2}} \]
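These marginals can be checked numerically. The sketch below (our own check, assuming SciPy ≥ 1.6 for multivariate_t) integrates the NIG joint density over \(\sigma^2\) at a fixed \(\beta\) and compares it with the closed-form multivariate t density:

```python
import numpy as np
from scipy import integrate
from scipy.stats import invgamma, multivariate_normal, multivariate_t

p = 2
m0, M0, a0, b0 = np.zeros(p), np.eye(p), 3.0, 2.0
beta = np.array([0.5, -1.0])

# Marginal prior of beta: integrate the NIG joint over sigma^2 ...
def joint(sigma2):
    return (invgamma.pdf(sigma2, a0, scale=b0)
            * multivariate_normal.pdf(beta, mean=m0, cov=sigma2 * M0))

marginal_numeric, _ = integrate.quad(joint, 0.0, np.inf)

# ... and compare with the closed form t_{2 a0}(m0, (b0 / a0) M0).
marginal_t = multivariate_t.pdf(beta, loc=m0, shape=(b0 / a0) * M0, df=2 * a0)
print(marginal_numeric, marginal_t)  # should agree up to quadrature error
```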
1.3 Conjugate Bayesian linear regression and M&m formula
Let \(y_{n \times 1}\) be the vector of outcomes and \(X_{n \times p}\) the matrix of corresponding covariates. Assume \(V\) is known. The model is given by
\[\begin{align} & y=X \beta+\epsilon \ , \ \epsilon \sim N\left(0, \sigma^2 V\right) \\ & \beta=m_0+\omega \ , \ \omega \sim N\left(0, \sigma^2 M_0\right) \\ & \sigma^2 \sim I G\left(a_0, b_0\right) \end{align}\]
The posterior distribution of \((\beta, \sigma^2)\) is given by
\[\begin{align} P\left(\beta, \sigma^{2} \mid y\right) = NIG\left(\beta, \sigma^{2} \mid M_{1}m_{1}, M_{1}, a_{1}, b_{1}\right) \; \end{align}\]
where
\[\begin{align} M_{1}^{-1} &= M_{0}^{-1}+X^{\mathrm{\scriptscriptstyle T}} V^{-1} X \\ m_{1}&=M_{0}^{-1} m_{0}+X^{\mathrm{\scriptscriptstyle T}} V^{-1} y \\ a_{1}&=a_{0}+\frac{n}{2} \\ b_{1}&=b_{0}+\frac{c^{\ast}}{2}= b_{0}+\frac{1}{2}\left(m_{0}^{\mathrm{\scriptscriptstyle T}} M_{0}^{-1} m_{0}+y^{\mathrm{\scriptscriptstyle T}} V^{-1} y-m_{1}^{\mathrm{\scriptscriptstyle T}} M_{1} m_{1}\right) \end{align}\]
By the same derivation as for the marginal priors, the marginal posterior distributions are obtained by updating the corresponding parameters:
\[ \begin{aligned} &P\left(\sigma^{2} \mid y\right)=I G\left(\sigma^{2} \mid a_{1}, b_{1}\right) \\ &P(\beta \mid y)=t_{2a_1}\left(M_1m_1, \frac{b_1}{a_1}M_1\right) \end{aligned} \]
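A minimal sketch of this direct (information-form) update, under the model above; the function name nig_posterior is ours, and explicit inverses are used for clarity rather than numerical efficiency:

```python
import numpy as np

def nig_posterior(y, X, V, m0, M0, a0, b0):
    """Direct-form conjugate NIG update: returns (M1 @ m1, M1, a1, b1)."""
    n, p = X.shape
    Vinv, M0inv = np.linalg.inv(V), np.linalg.inv(M0)
    M1 = np.linalg.inv(M0inv + X.T @ Vinv @ X)
    m1 = M0inv @ m0 + X.T @ Vinv @ y
    a1 = a0 + n / 2.0
    b1 = b0 + 0.5 * (m0 @ M0inv @ m0 + y @ Vinv @ y - m1 @ M1 @ m1)
    return M1 @ m1, M1, a1, b1

# Illustrative use on simulated data.
rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=n)
post_mean, M1, a1, b1 = nig_posterior(y, X, np.eye(n), np.zeros(p), np.eye(p), 2.0, 1.0)
print(post_mean)      # posterior mean of beta, close to (1, -2, 0.5)
print(b1 / (a1 - 1))  # posterior mean of sigma^2
```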
1.4 Updating form of the posterior distribution
We will use two ways to derive the updating form of the posterior distribution.
1.4.1 Method 1: Sherman-Woodbury-Morrison identity
Theorem 1.2 (Sherman-Woodbury-Morrison identity) We have \[\begin{equation}\label{ShermanWoodburyMorrison} \left(A + BDC\right)^{-1} = A^{-1} - A^{-1}B\left(D^{-1}+CA^{-1}B\right)^{-1}CA^{-1} \end{equation}\] where \(A\) and \(D\) are square matrices that are invertible and \(B\) and \(C\) are rectangular (square if \(A\) and \(D\) have the same dimensions) matrices such that the multiplications are well-defined.
The Sherman-Woodbury-Morrison identity is easily verified by multiplying the right-hand side by \(A + BDC\) and simplifying to obtain the identity matrix. Using this formula, we have
\[ \begin{aligned} M_1 & = (M_{0}^{-1} + X^{\mathrm{\scriptscriptstyle T}}V^{-1}X)^{-1} \\ & = M_0-M_0 X^{\mathrm{\scriptscriptstyle T}}\left(V+X M_0 X^{\mathrm{\scriptscriptstyle T}}\right)^{-1} X M_0 \\ & = M_0-M_0 X^{\mathrm{\scriptscriptstyle T}} Q^{-1} X M_0 \end{aligned} \]
where \(Q = V + X M_0 X^{\mathrm{\scriptscriptstyle T}} \,.\)
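A quick numerical check of this step (with random matrices of our choosing) confirms that the Sherman-Woodbury-Morrison form of \(M_1\) matches the direct inverse:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 6, 3
X = rng.normal(size=(n, p))
V, M0 = np.eye(n), np.eye(p)

Q = V + X @ M0 @ X.T
M1_direct = np.linalg.inv(np.linalg.inv(M0) + X.T @ np.linalg.inv(V) @ X)
M1_swm = M0 - M0 @ X.T @ np.linalg.inv(Q) @ X @ M0
print(np.allclose(M1_direct, M1_swm))  # True
```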
We can show that \[ \begin{align} M_1 m_1 & =m_0+M_0 X^{\mathrm{\scriptscriptstyle T}} Q^{-1}\left(y-X m_0\right) \end{align} \]
Furthermore, we can show that \[ \begin{align} m_0^{\mathrm{\scriptscriptstyle T}} M_0^{-1} m_0+y^{\mathrm{\scriptscriptstyle T}} V^{-1} y-m_1^{\mathrm{\scriptscriptstyle T}} M_1 m_1 & = \left(y-X m_0\right)^{\mathrm{\scriptscriptstyle T}} Q^{-1}\left(y-X m_0\right) \end{align} \]
Thus, we obtain the following updating form of the posterior distribution for Bayesian linear regression
\[ \begin{aligned} P\left(\beta, \sigma^{2} \mid y\right) = NIG\left(\beta, \sigma^{2} \mid \tilde{m}_1, \tilde{M}_1, a_{1}, b_{1}\right) \; \end{aligned} \] where
\[ \begin{aligned} \tilde{m}_1 & =M_1 m_1=m_0+M_0 X^{\mathrm{\scriptscriptstyle T}} Q^{-1}\left(y-X m_0\right) \\ \tilde{M}_1 & =M_1=M_0-M_0 X^{\mathrm{\scriptscriptstyle T}} Q^{-1} X M_0 \\ a_1 & =a_0+\frac{n}{2} \\ b_1 & =b_0+\frac{1}{2}\left(y-X m_0\right)^{\mathrm{\scriptscriptstyle T}} Q^{-1}\left(y-X m_0\right) \\ Q & =V+X M_0 X^{\mathrm{\scriptscriptstyle T}} \end{aligned} \]
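The updating form can be checked against the direct form on simulated data; the sketch below (our own check, with arbitrary simulated inputs) verifies that \(\tilde{m}_1\), \(\tilde{M}_1\), and \(b_1\) agree under both parameterizations:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -0.5, 0.25]) + rng.normal(scale=0.5, size=n)
V, m0, M0, a0, b0 = np.eye(n), np.zeros(p), np.eye(p), 2.0, 1.0

# Updating form, driven by Q = V + X M0 X^T.
Q = V + X @ M0 @ X.T
Qinv = np.linalg.inv(Q)
resid = y - X @ m0
m1_tilde = m0 + M0 @ X.T @ Qinv @ resid
M1_tilde = M0 - M0 @ X.T @ Qinv @ X @ M0
a1 = a0 + n / 2.0
b1 = b0 + 0.5 * resid @ Qinv @ resid

# Direct form, for comparison.
M0inv, Vinv = np.linalg.inv(M0), np.linalg.inv(V)
M1 = np.linalg.inv(M0inv + X.T @ Vinv @ X)
m1 = M0inv @ m0 + X.T @ Vinv @ y
b1_direct = b0 + 0.5 * (m0 @ M0inv @ m0 + y @ Vinv @ y - m1 @ M1 @ m1)
print(np.allclose(m1_tilde, M1 @ m1), np.allclose(M1_tilde, M1),
      np.isclose(b1, b1_direct))  # True True True
```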
1.4.2 Method 2: Distribution theory
Previously, we obtained the updating form of the posterior distribution using the Sherman-Woodbury-Morrison identity. Here, we will derive the same results without resorting to it. Recall that the model is given by
\[\begin{align} & y=X \beta+\epsilon \ , \ \epsilon \sim N\left(0, \sigma^2 V\right) \\ & \beta=m_0+\omega \ , \ \omega \sim N\left(0, \sigma^2 M_0\right) \\ & \sigma^2 \sim I G\left(a_0, b_0\right) \end{align}\]
This corresponds to the posterior distribution
\[\begin{align} P\left(\beta, \sigma^2 \mid y\right) \propto I G\left(\sigma^2 \mid a_0, b_0\right) & \times N\left(\beta \mid m_0, \sigma^2 M_0\right) \times N\left(y \mid X \beta, \sigma^2 V\right) \end{align}\]
We will derive \(P\left(\sigma^2 \mid y\right)\) and \(P\left(\beta \mid \sigma^2, y\right)\) in a form that reflects updates from the prior to the posterior. Integrating out \(\beta\) from the model is equivalent to substituting its prior expression into the likelihood. Thus, \(P\left(y \mid \sigma^2\right)\) follows simply from \(y =X \beta+\epsilon =X\left(m_0+\omega\right)+\epsilon =X m_0 + X \omega + \epsilon =X m_0+ \eta\)
where \(\eta = X \omega + \epsilon\). Since \(\omega\) and \(\epsilon\) are independent zero-mean Gaussians, \(\eta \sim N\left(0, \sigma^2 Q\right)\) with \(Q=X M_0 X^{\mathrm{\scriptscriptstyle T}}+V \, .\)
Therefore,
\[\begin{align} y \mid \sigma^2 \sim N\left(X m_0, \sigma^2 Q\right) \end{align}\]
Combining this with the prior \(IG\left(\sigma^2 \mid a_0, b_0\right)\), the marginal posterior distribution of \(\sigma^2\) is \[\begin{align} P\left(\sigma^2 \mid y\right) & = IG \left(\sigma^2 \mid a_1, b_1\right) \end{align}\]
where \[\begin{align} & a_1 = a_0 + \frac{n}{2} \\ & b_1 = b_0 + \frac{1}{2} (y-Xm_0)^{\mathrm{\scriptscriptstyle T}} Q^{-1} \left(y-Xm_0\right) \end{align}\]
Next, we turn to \(P\left(\beta \mid \sigma^2, y\right)\). Note that \[ \left(\begin{array}{l} y \\ \beta \end{array}\right) \mid \sigma^2 \sim N\left(\left(\begin{array}{l} Xm_0 \\ m_0 \end{array}\right), \quad \sigma^2 \left(\begin{array}{cc} Q & X M_0 \\ M_0 X^{\mathrm{\scriptscriptstyle T}} & M_0 \end{array}\right)\right) \; \]
From the standard expression for a conditional distribution derived from a multivariate Gaussian, we obtain \[ \beta \mid \sigma^2, y \sim N\left(\tilde{m}_1, \sigma^2 \tilde{M}_1\right) \]
where \[\begin{align} & \tilde{m}_1=\operatorname{E}\left[\beta \mid \sigma^2, y\right]=m_0+M_0 X^{\mathrm{\scriptscriptstyle T}} Q^{-1}\left(y-X{m_0}\right) \\ & \tilde{M}_1=M_0-M_0 X^{\mathrm{\scriptscriptstyle T}} Q^{-1} X M_0 \\ \end{align}\]
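The sketch below (our own numerical check, with simulated inputs) applies the standard Gaussian conditioning formula to the blocks above and confirms that it reproduces \(\tilde{m}_1\) and \(\tilde{M}_1\):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 8, 2
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
V, m0, M0 = np.eye(n), np.zeros(p), 2.0 * np.eye(p)

# Blocks of the joint Gaussian of (y, beta) given sigma^2 (sigma^2 factors out).
Q = X @ M0 @ X.T + V   # Cov(y) / sigma^2
C = M0 @ X.T           # Cov(beta, y) / sigma^2
Qinv = np.linalg.inv(Q)

# Gaussian conditioning: mean and covariance of beta | sigma^2, y.
m1_tilde = m0 + C @ Qinv @ (y - X @ m0)
M1_tilde = M0 - C @ Qinv @ C.T

# Same quantities from the direct (information-form) update.
M1 = np.linalg.inv(np.linalg.inv(M0) + X.T @ np.linalg.inv(V) @ X)
m1 = np.linalg.inv(M0) @ m0 + X.T @ np.linalg.inv(V) @ y
print(np.allclose(M1_tilde, M1), np.allclose(m1_tilde, M1 @ m1))  # True True
```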
1.5 Bayesian prediction
Assume \(V=I_{n}\). Let \(\tilde{y}\) denote an \(\tilde{n}\times 1\) vector of new outcomes and \(\tilde{X}\) the corresponding \(\tilde{n}\times p\) matrix of predictors. We seek to predict \(\tilde{y}\) based upon \(y\). The posterior predictive distribution is
\[\begin{align} P(\tilde{y} \mid y) &= t_{2a_1}\left(\tilde{X} M_1 m_1, \frac{b_1}{a_1}\left(I_{\tilde{n}} + \tilde{X} M_{1} \tilde{X}^{\mathrm{\scriptscriptstyle T}}\right)\right) \; \end{align}\]
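A minimal sketch of this predictive distribution (assuming SciPy ≥ 1.6 for multivariate_t; the data and hyperparameters are simulated for illustration):

```python
import numpy as np
from scipy.stats import multivariate_t

rng = np.random.default_rng(3)
n, p, n_tilde = 30, 2, 4
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0]) + rng.normal(size=n)
X_tilde = rng.normal(size=(n_tilde, p))
m0, M0, a0, b0 = np.zeros(p), np.eye(p), 2.0, 1.0

# Posterior quantities with V = I_n.
M0inv = np.linalg.inv(M0)
M1 = np.linalg.inv(M0inv + X.T @ X)
m1 = M0inv @ m0 + X.T @ y
a1 = a0 + n / 2.0
b1 = b0 + 0.5 * (m0 @ M0inv @ m0 + y @ y - m1 @ M1 @ m1)

# Posterior predictive multivariate t for y_tilde.
pred = multivariate_t(loc=X_tilde @ M1 @ m1,
                      shape=(b1 / a1) * (np.eye(n_tilde) + X_tilde @ M1 @ X_tilde.T),
                      df=2 * a1)
print(pred.rvs(random_state=rng))  # one simulated draw of y_tilde
```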
1.6 Sampling from the posterior distribution
We can sample from the joint posterior distribution \(P\left(\beta, \sigma^{2}, \tilde{y} \mid y\right)\) by composition: for each draw \(i\),
1. Draw \(\hat{\sigma}_{(i)}^{2}\) from \(IG\left(a_{1}, b_{1}\right)\);
2. Draw \(\hat{\beta}_{(i)}\) from \(N\left(M_{1} m_{1}, \hat{\sigma}_{(i)}^{2} M_{1}\right)\);
3. Draw \(\tilde{y}_{(i)}\) from \(N\left(\tilde{X} \hat{\beta}_{(i)}, \hat{\sigma}_{(i)}^{2}I_{\tilde{n}}\right)\).
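A sketch of this composition sampler (the function name sample_joint_posterior is ours, and \(V = I_n\) is assumed as above):

```python
import numpy as np
from scipy.stats import invgamma

def sample_joint_posterior(y, X, X_tilde, m0, M0, a0, b0, n_draws=1000, seed=0):
    """Composition sampler for (beta, sigma^2, y_tilde | y) with V = I_n,
    following the three draw steps above. A sketch, not optimized code."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    M0inv = np.linalg.inv(M0)
    M1 = np.linalg.inv(M0inv + X.T @ X)
    m1 = M0inv @ m0 + X.T @ y
    a1 = a0 + n / 2.0
    b1 = b0 + 0.5 * (m0 @ M0inv @ m0 + y @ y - m1 @ M1 @ m1)

    sigma2 = invgamma.rvs(a1, scale=b1, size=n_draws, random_state=rng)  # step 1
    betas = np.array([rng.multivariate_normal(M1 @ m1, s2 * M1)          # step 2
                      for s2 in sigma2])
    y_tilde = np.array([rng.normal(X_tilde @ b, np.sqrt(s2))             # step 3
                        for b, s2 in zip(betas, sigma2)])
    return betas, sigma2, y_tilde
```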