Chapter 1 Basics of Bayesian linear regression
1.1 Bayes’ theorem
Theorem 1.1 (Bayes' theorem) For events $U$, $K$ with $P(K) \neq 0$, we have
\[
P(U \mid K) = \frac{P(K \mid U)\, P(U)}{P(K)}
\]
We denote by $U$ the unknown parameters and by $K$ the known quantities. We call $P(U)$ the prior and $P(K \mid U)$ the likelihood. Bayes' theorem gives us the posterior distribution of the unknown parameters given the known quantities,
\[
P(U \mid K) \propto P(U) \cdot P(K \mid U).
\]
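As a toy numerical illustration of this proportionality, the sketch below (all numbers are hypothetical, chosen only for illustration) normalizes prior times likelihood over a discrete set of parameter values.

```python
import numpy as np

# Toy discrete illustration (hypothetical numbers): U takes one of three
# values with prior P(U), and P(K | U) is the likelihood of the observed
# K under each value of U.
prior = np.array([0.5, 0.3, 0.2])          # P(U)
likelihood = np.array([0.10, 0.40, 0.70])  # P(K | U)

unnormalized = prior * likelihood              # P(U) * P(K | U)
posterior = unnormalized / unnormalized.sum()  # normalizing constant is P(K)
print(posterior)                               # P(U | K)
```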
1.2 Normal-Inverse-Gamma (NIG) prior
1.2.1 Joint distribution of NIG prior
Definition 1.1 (Normal-Inverse-Gamma Distribution) Suppose
\begin{align}
\beta \mid \sigma^2, \mu, M &\sim N(\mu, \sigma^2 M) \\
\sigma^2 \mid a, b &\sim IG(a, b)
\end{align}
Then $(\beta, \sigma^2)$ has a Normal-Inverse-Gamma distribution, denoted as $(\beta, \sigma^2) \sim NIG(\mu, M, a, b)$.
We use a Normal-Inverse-Gamma prior for $(\beta, \sigma^2)$:
\[
P(\beta, \sigma^2) = NIG(\beta, \sigma^2 \mid m_0, M_0, a_0, b_0)
= \frac{b_0^{a_0}}{\Gamma(a_0)} \left(\frac{1}{\sigma^2}\right)^{a_0 + 1} e^{-\frac{b_0}{\sigma^2}}
\cdot \frac{1}{(2\pi\sigma^2)^{\frac{p}{2}} |M_0|^{\frac{1}{2}}} e^{-\frac{1}{2\sigma^2} Q(\beta, m_0, M_0)}
\]
where $Q(x, m, M) = (x - m)^{\mathrm{T}} M^{-1} (x - m)$.
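A minimal sketch of evaluating this joint density on the log scale follows; the function and variable names are my own, not from the text.

```python
import numpy as np
from scipy.special import gammaln

def nig_logpdf(beta, sigma2, m0, M0, a0, b0):
    """log NIG(beta, sigma^2 | m0, M0, a0, b0): the product of an
    IG(a0, b0) density in sigma^2 and a N(m0, sigma^2 M0) density in beta."""
    p = len(m0)
    r = beta - m0
    Q = r @ np.linalg.solve(M0, r)   # (beta - m0)^T M0^{-1} (beta - m0)
    log_ig = a0 * np.log(b0) - gammaln(a0) - (a0 + 1) * np.log(sigma2) - b0 / sigma2
    log_normal = -0.5 * (p * np.log(2 * np.pi * sigma2)
                         + np.linalg.slogdet(M0)[1]
                         + Q / sigma2)
    return log_ig + log_normal
```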
1.2.2 Marginal distribution of NIG prior
The marginal priors can be obtained by integration:
\begin{align}
P(\sigma^2) &= \int NIG(\beta, \sigma^2 \mid m_0, M_0, a_0, b_0)\, d\beta = IG(\sigma^2 \mid a_0, b_0) \\
P(\beta) &= \int NIG(\beta, \sigma^2 \mid m_0, M_0, a_0, b_0)\, d\sigma^2 = t_{2a_0}\!\left(m_0, \frac{b_0}{a_0} M_0\right)
\end{align}
Note: the density of the multivariate $t$-distribution is given by
\[
t_v(\mu, \Sigma) = \frac{\Gamma\!\left(\frac{v + p}{2}\right)}{(v\pi)^{\frac{p}{2}}\, \Gamma\!\left(\frac{v}{2}\right) |\Sigma|^{\frac{1}{2}}}
\left(1 + \frac{1}{v}(x - \mu)^{\mathrm{T}} \Sigma^{-1} (x - \mu)\right)^{-\frac{v + p}{2}}
\]
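One way to see the $t$ marginal in practice is composition sampling: draw $\sigma^2 \sim IG(a_0, b_0)$, then $\beta \mid \sigma^2 \sim N(m_0, \sigma^2 M_0)$; the resulting $\beta$ draws follow $t_{2a_0}\!\left(m_0, \frac{b_0}{a_0} M_0\right)$. A sketch of an approximate check (the parameter values are arbitrary, chosen only so the moments exist):

```python
import numpy as np
from scipy.stats import invgamma

rng = np.random.default_rng(0)
a0, b0 = 3.0, 2.0
m0 = np.zeros(2)
M0 = np.eye(2)

# sigma^2 ~ IG(a0, b0), then beta | sigma^2 ~ N(m0, sigma^2 M0)
sigma2 = invgamma(a0, scale=b0).rvs(10_000, random_state=rng)
beta = np.array([rng.multivariate_normal(m0, s2 * M0) for s2 in sigma2])

# Marginally beta ~ t_{2 a0}(m0, (b0 / a0) M0), whose covariance is
# (2 a0 / (2 a0 - 2)) * (b0 / a0) * M0; the sample covariance should be close.
print(np.cov(beta.T))
print((2 * a0 / (2 * a0 - 2)) * (b0 / a0) * M0)
```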
1.3 Conjugate Bayesian linear regression and M&m formula
Let $y_{n \times 1}$ denote the outcome variable and $X_{n \times p}$ the corresponding matrix of covariates. Assume $V$ is known. The model is given by
\begin{align}
y &= X\beta + \epsilon, \quad \epsilon \sim N(0, \sigma^2 V) \\
\beta &= m_0 + \omega, \quad \omega \sim N(0, \sigma^2 M_0) \\
\sigma^2 &\sim IG(a_0, b_0)
\end{align}
The posterior distribution of $(\beta, \sigma^2)$ is given by
\[
P(\beta, \sigma^2 \mid y) = NIG(\beta, \sigma^2 \mid M_1 m_1, M_1, a_1, b_1)
\]
where
\begin{align}
M_1^{-1} &= M_0^{-1} + X^{\mathrm{T}} V^{-1} X \\
m_1 &= M_0^{-1} m_0 + X^{\mathrm{T}} V^{-1} y \\
a_1 &= a_0 + \frac{n}{2} \\
b_1 &= b_0 + \frac{c^{*}}{2} = b_0 + \frac{1}{2}\left(m_0^{\mathrm{T}} M_0^{-1} m_0 + y^{\mathrm{T}} V^{-1} y - m_1^{\mathrm{T}} M_1 m_1\right)
\end{align}
From the derivation of the marginal priors, the marginal posterior distributions are obtained by simply updating the corresponding parameters:
\begin{align}
P(\sigma^2 \mid y) &= IG(\sigma^2 \mid a_1, b_1) \\
P(\beta \mid y) &= t_{2a_1}\!\left(M_1 m_1, \frac{b_1}{a_1} M_1\right)
\end{align}
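A sketch of the corresponding parameter updates, written as a helper function of my own (not code from the text), following the formulas above:

```python
import numpy as np

def nig_posterior(y, X, V, m0, M0, a0, b0):
    """Posterior NIG(M1 m1, M1, a1, b1) for the conjugate Bayesian linear model."""
    Vinv_X = np.linalg.solve(V, X)                          # V^{-1} X
    Vinv_y = np.linalg.solve(V, y)                          # V^{-1} y
    M1 = np.linalg.inv(np.linalg.inv(M0) + X.T @ Vinv_X)    # M1^{-1} = M0^{-1} + X^T V^{-1} X
    m1 = np.linalg.solve(M0, m0) + X.T @ Vinv_y             # m1 = M0^{-1} m0 + X^T V^{-1} y
    a1 = a0 + len(y) / 2                                    # a1 = a0 + n/2
    b1 = b0 + 0.5 * (m0 @ np.linalg.solve(M0, m0)
                     + y @ Vinv_y
                     - m1 @ M1 @ m1)                        # b1 = b0 + c*/2
    return M1 @ m1, M1, a1, b1
```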
1.4 Updating form of the posterior distribution
We will use two ways to derive the updating form of the posterior distribution.
1.4.1 Method 1: Sherman-Woodbury-Morrison identity
Theorem 1.2 (Sherman-Woodbury-Morrison identity) We have
\[
(A + BDC)^{-1} = A^{-1} - A^{-1} B \left(D^{-1} + C A^{-1} B\right)^{-1} C A^{-1}
\]
where $A$ and $D$ are square matrices that are invertible and $B$ and $C$ are rectangular (square if $A$ and $D$ have the same dimensions) matrices such that the multiplications are well-defined.
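A quick numerical sanity check of the identity on randomly generated matrices (the dimensions and the way $A$ is made invertible are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(1)
p, k = 4, 2
A = np.eye(p) + 0.1 * rng.standard_normal((p, p))
A = A @ A.T                       # symmetric positive definite, hence invertible
B = rng.standard_normal((p, k))
C = rng.standard_normal((k, p))
D = np.eye(k)

lhs = np.linalg.inv(A + B @ D @ C)
Ainv = np.linalg.inv(A)
rhs = Ainv - Ainv @ B @ np.linalg.inv(np.linalg.inv(D) + C @ Ainv @ B) @ C @ Ainv
print(np.allclose(lhs, rhs))      # True
```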
The Sherman-Woodbury-Morrison identity is easily verified by multiplying the right-hand side by $A + BDC$ and simplifying to reduce it to the identity matrix. Using this formula, we have
\[
M_1 = \left(M_0^{-1} + X^{\mathrm{T}} V^{-1} X\right)^{-1}
= M_0 - M_0 X^{\mathrm{T}} \left(V + X M_0 X^{\mathrm{T}}\right)^{-1} X M_0
= M_0 - M_0 X^{\mathrm{T}} Q^{-1} X M_0
\]
where $Q = V + X M_0 X^{\mathrm{T}}$.
We can show that
\[
M_1 m_1 = m_0 + M_0 X^{\mathrm{T}} Q^{-1} (y - X m_0)
\]
Furthermore, the quadratic form in $b_1$ simplifies as
\[
m_0^{\mathrm{T}} M_0^{-1} m_0 + y^{\mathrm{T}} V^{-1} y - m_1^{\mathrm{T}} M_1 m_1 = (y - X m_0)^{\mathrm{T}} Q^{-1} (y - X m_0)
\]
So, we get the following updating form of the posterior distribution from Bayesian linear regression:
\[
P(\beta, \sigma^2 \mid y) = NIG(\beta, \sigma^2 \mid \tilde{m}_1, \tilde{M}_1, a_1, b_1)
\]
where
\begin{align}
\tilde{m}_1 &= M_1 m_1 = m_0 + M_0 X^{\mathrm{T}} Q^{-1} (y - X m_0) \\
\tilde{M}_1 &= M_1 = M_0 - M_0 X^{\mathrm{T}} Q^{-1} X M_0 \\
a_1 &= a_0 + \frac{n}{2} \\
b_1 &= b_0 + \frac{1}{2} (y - X m_0)^{\mathrm{T}} Q^{-1} (y - X m_0) \\
Q &= V + X M_0 X^{\mathrm{T}}
\end{align}
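The same posterior, computed in this updated form, might look like the following sketch (again my own function and variable names); unlike the earlier form, it only requires solving systems with the $n \times n$ matrix $Q$ and never inverts $M_0$. On the same inputs it should agree with the earlier helper up to numerical error.

```python
import numpy as np

def nig_posterior_update(y, X, V, m0, M0, a0, b0):
    """Updated form of the posterior: NIG(m1_tilde, M1_tilde, a1, b1)."""
    Q = V + X @ M0 @ X.T                        # Q = V + X M0 X^T
    r = y - X @ m0                              # prior predictive residual
    m1_tilde = m0 + M0 @ X.T @ np.linalg.solve(Q, r)
    M1_tilde = M0 - M0 @ X.T @ np.linalg.solve(Q, X @ M0)
    a1 = a0 + len(y) / 2
    b1 = b0 + 0.5 * (r @ np.linalg.solve(Q, r))
    return m1_tilde, M1_tilde, a1, b1
```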
1.4.2 Method 2: Distribution theory
Previously, we obtained the Bayesian linear regression updater using the Sherman-Woodbury-Morrison identity. Here, we derive the same results without resorting to it. Recall that the model is given by
\begin{align}
y &= X\beta + \epsilon, \quad \epsilon \sim N(0, \sigma^2 V) \\
\beta &= m_0 + \omega, \quad \omega \sim N(0, \sigma^2 M_0) \\
\sigma^2 &\sim IG(a_0, b_0)
\end{align}
This corresponds to the posterior distribution
\[
P(\beta, \sigma^2 \mid y) \propto IG(\sigma^2 \mid a_0, b_0) \times N(\beta \mid m_0, \sigma^2 M_0) \times N(y \mid X\beta, \sigma^2 V)
\]
We will derive $P(\sigma^2 \mid y)$ and $P(\beta \mid \sigma^2, y)$ in a form that reflects updates from the prior to the posterior. Integrating out $\beta$ from the model is equivalent to substituting $\beta$ with its prior model. Thus, $P(y \mid \sigma^2)$ is derived simply from
\[
y = X\beta + \epsilon = X(m_0 + \omega) + \epsilon = X m_0 + X\omega + \epsilon = X m_0 + \eta
\]
where $\eta = X\omega + \epsilon \sim N(0, \sigma^2 Q)$ and $Q = X M_0 X^{\mathrm{T}} + V$.
Therefore,
\[
y \mid \sigma^2 \sim N(X m_0, \sigma^2 Q)
\]
The posterior distribution is given by
\[
P(\sigma^2 \mid y) \propto IG(\sigma^2 \mid a_1, b_1)
\]
where
\begin{align}
a_1 &= a_0 + \frac{n}{2} \\
b_1 &= b_0 + \frac{1}{2} (y - X m_0)^{\mathrm{T}} Q^{-1} (y - X m_0)
\end{align}
Next, we turn to $P(\beta \mid \sigma^2, y)$. Note that
\[
\begin{pmatrix} y \\ \beta \end{pmatrix} \Big|\, \sigma^2
\sim N\left( \begin{pmatrix} X m_0 \\ m_0 \end{pmatrix},
\sigma^2 \begin{pmatrix} Q & X M_0 \\ M_0 X^{\mathrm{T}} & M_0 \end{pmatrix} \right)
\]
From the expression for a conditional distribution derived from a multivariate Gaussian, we obtain
\[
\beta \mid \sigma^2, y \sim N\left(\tilde{m}_1, \sigma^2 \tilde{M}_1\right)
\]
where
\begin{align}
\tilde{m}_1 &= \operatorname{E}\left[\beta \mid \sigma^2, y\right] = m_0 + M_0 X^{\mathrm{T}} Q^{-1}\left(y - X m_0\right) \\
\tilde{M}_1 &= M_0 - M_0 X^{\mathrm{T}} Q^{-1} X M_0
\end{align}
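A sketch of recovering $\tilde{m}_1$ and $\tilde{M}_1$ directly from the generic conditional-Gaussian formula applied to this joint distribution (the helper and its block names are mine, not from the text):

```python
import numpy as np

def conditional_gaussian(mu_y, mu_b, S_yy, S_yb, S_bb, y):
    """E[beta | y] and Cov[beta | y] (per unit sigma^2) for a joint Gaussian
    with mean (mu_y, mu_b) and covariance sigma^2 * [[S_yy, S_yb], [S_yb^T, S_bb]].
    S_yy is assumed symmetric."""
    gain = np.linalg.solve(S_yy, S_yb).T          # S_yb^T S_yy^{-1}
    m_cond = mu_b + gain @ (y - mu_y)
    S_cond = S_bb - gain @ S_yb
    return m_cond, S_cond

# With mu_y = X m0, mu_b = m0, S_yy = Q, S_yb = X M0, and S_bb = M0,
# this returns exactly m1_tilde and M1_tilde from the previous subsection.
```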
1.5 Bayesian prediction
Assume $V = I_n$. Let $\tilde{y}$ denote an $\tilde{n} \times 1$ vector of outcomes to be predicted and $\tilde{X}$ the corresponding $\tilde{n} \times p$ matrix of predictors. We seek to predict $\tilde{y}$ based upon $y$. The posterior predictive distribution is
\begin{align}
P(\tilde{y} \mid y) &= t_{2a_1}\left(\tilde{X} M_1 m_1, \frac{b_1}{a_1}\left(I_{\tilde{n}} + \tilde{X} M_1 \tilde{X}^{\mathrm{T}}\right)\right)
\end{align}
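A sketch of assembling the parameters of this predictive $t$ density, given the posterior quantities $M_1 m_1$, $M_1$, $a_1$, $b_1$ computed as above (the function name is mine):

```python
import numpy as np

def predictive_t_params(X_tilde, M1m1, M1, a1, b1):
    """Location, scale matrix, and degrees of freedom of P(y_tilde | y)
    under V = I: t_{2 a1}(X_tilde M1 m1, (b1/a1)(I + X_tilde M1 X_tilde^T))."""
    n_tilde = X_tilde.shape[0]
    loc = X_tilde @ M1m1
    scale = (b1 / a1) * (np.eye(n_tilde) + X_tilde @ M1 @ X_tilde.T)
    df = 2 * a1
    return loc, scale, df
```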
1.6 Sampling from the posterior distribution
We can sample from the joint posterior density $P\left(\beta, \sigma^{2}, \tilde{y} \mid y\right)$ by the following composition sampling scheme (a code sketch follows the steps below). For each draw $i$:
1. Draw $\hat{\sigma}_{(i)}^{2}$ from $IG\left(a_{1}, b_{1}\right)$
2. Draw $\hat{\beta}_{(i)}$ from $N\left(M_{1} m_{1}, \hat{\sigma}_{(i)}^{2} M_{1}\right)$
3. Draw $\tilde{y}_{(i)}$ from $N\left(\tilde{X} \hat{\beta}_{(i)}, \hat{\sigma}_{(i)}^{2} I_{\tilde{n}}\right)$
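A minimal sketch of this composition sampler, using scipy's invgamma for the $IG(a_1, b_1)$ draws; the inputs are the posterior quantities computed earlier and the function name is my own:

```python
import numpy as np
from scipy.stats import invgamma

def sample_joint_posterior(X_tilde, M1m1, M1, a1, b1, n_draws=1000, seed=0):
    """Draw (sigma^2, beta, y_tilde) from P(beta, sigma^2, y_tilde | y) by composition."""
    rng = np.random.default_rng(seed)
    draws = []
    for _ in range(n_draws):
        sigma2 = invgamma(a1, scale=b1).rvs(random_state=rng)    # sigma^2 ~ IG(a1, b1)
        beta = rng.multivariate_normal(M1m1, sigma2 * M1)        # beta | sigma^2, y
        y_tilde = rng.multivariate_normal(X_tilde @ beta,        # y_tilde | beta, sigma^2
                                          sigma2 * np.eye(X_tilde.shape[0]))
        draws.append((sigma2, beta, y_tilde))
    return draws
```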