Chapter 1 Basics of Bayesian linear regression
1.1 Bayes’ theorem
Theorem 1.1 (Bayes' theorem) For events $U$, $K$ with $P(K) \neq 0$, we have
$$P(U \mid K) = \frac{P(K \mid U)\,P(U)}{P(K)}$$
We denote by $U$ the unknown parameters and by $K$ the known quantities. We call $P(U)$ the prior and $P(K \mid U)$ the likelihood. Bayes' theorem then gives the posterior distribution of the unknown parameters given the known quantities,
$$P(U \mid K) \propto P(U) \cdot P(K \mid U)$$
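As a quick numerical illustration of the proportionality form, here is a minimal sketch (a hypothetical two-hypothesis example with made-up numbers, not taken from the notes) that recovers the posterior by normalizing prior times likelihood.

```python
import numpy as np

# Hypothetical two-hypothesis example: prior over U and likelihood of the
# observed K under each value of U (numbers chosen purely for illustration).
prior = np.array([0.5, 0.5])        # P(U)
likelihood = np.array([0.8, 0.3])   # P(K | U) for each value of U

# The posterior is proportional to prior * likelihood; normalizing gives P(U | K).
unnormalized = prior * likelihood
posterior = unnormalized / unnormalized.sum()
print(posterior)  # [0.72727273 0.27272727]
```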
1.2 Normal-Inverse-Gamma (NIG) prior
1.2.1 Joint distribution of NIG prior
Definition 1.1 (Normal-Inverse-Gamma Distribution) Suppose
$$\beta \mid \sigma^2, \mu, M \sim N(\mu, \sigma^2 M), \qquad \sigma^2 \mid a, b \sim IG(a, b).$$
Then $(\beta, \sigma^2)$ has a Normal-Inverse-Gamma distribution, denoted as $(\beta, \sigma^2) \sim NIG(\mu, M, a, b)$.
We use a Normal-Inverse-Gamma prior for $(\beta, \sigma^2)$:
$$P(\beta, \sigma^2) = NIG(\beta, \sigma^2 \mid m_0, M_0, a_0, b_0) = \frac{b_0^{a_0}}{\Gamma(a_0)} \left(\frac{1}{\sigma^2}\right)^{a_0 + 1} e^{-\frac{b_0}{\sigma^2}} \cdot \frac{1}{(2\pi\sigma^2)^{\frac{p}{2}} |M_0|^{\frac{1}{2}}} e^{-\frac{1}{2\sigma^2} Q(\beta, m_0, M_0)}$$
where $Q(x, m, M) = (x - m)^{T} M^{-1} (x - m)$.
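Since the joint density factors into an inverse-gamma term in $\sigma^2$ and a Gaussian term in $\beta$, it can be evaluated numerically by multiplying the two pieces. Below is a minimal sketch assuming scipy is available; the function name nig_pdf and the test values are mine, not from the notes.

```python
import numpy as np
from scipy.stats import invgamma, multivariate_normal

def nig_pdf(beta, sigma2, m0, M0, a0, b0):
    """Joint NIG density: IG(sigma^2 | a0, b0) * N(beta | m0, sigma^2 M0)."""
    ig_part = invgamma.pdf(sigma2, a=a0, scale=b0)  # (b0^a0 / Gamma(a0)) sigma2^-(a0+1) e^(-b0/sigma2)
    normal_part = multivariate_normal.pdf(beta, mean=m0, cov=sigma2 * M0)
    return ig_part * normal_part

# Illustrative values (chosen arbitrarily)
m0, M0 = np.zeros(2), np.eye(2)
print(nig_pdf(beta=np.array([0.1, -0.2]), sigma2=1.5, m0=m0, M0=M0, a0=2.0, b0=1.0))
```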
1.2.2 Marginal distribution of NIG prior
The marginal priors can be obtained by integration:
$$\begin{aligned}
P(\sigma^2) &= \int NIG(\beta, \sigma^2 \mid m_0, M_0, a_0, b_0)\, d\beta = IG(\sigma^2 \mid a_0, b_0) \\
P(\beta) &= \int NIG(\beta, \sigma^2 \mid m_0, M_0, a_0, b_0)\, d\sigma^2 = t_{2a_0}\!\left(m_0, \frac{b_0}{a_0} M_0\right)
\end{aligned}$$
Note: the density of the multivariate $t$-distribution is given by
$$t_{v}(\mu, \Sigma) = \frac{\Gamma\!\left(\frac{v + p}{2}\right)}{(v\pi)^{\frac{p}{2}}\, \Gamma\!\left(\frac{v}{2}\right) |\Sigma|^{\frac{1}{2}}} \left(1 + \frac{1}{v}(x - \mu)^{T} \Sigma^{-1} (x - \mu)\right)^{-\frac{v + p}{2}}$$
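The $t$-density above can be coded directly and sanity-checked against scipy's multivariate_t (available in SciPy 1.6+). This is a sketch with arbitrary test values, not part of the original notes.

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import multivariate_t

def mvt_pdf(x, v, mu, Sigma):
    """Density of t_v(mu, Sigma) at x, coded directly from the formula above."""
    p = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)
    log_norm = (gammaln((v + p) / 2) - gammaln(v / 2)
                - 0.5 * p * np.log(v * np.pi)
                - 0.5 * np.linalg.slogdet(Sigma)[1])
    return np.exp(log_norm - 0.5 * (v + p) * np.log1p(quad / v))

# Sanity check against scipy's implementation (illustrative values only).
mu, Sigma, v = np.zeros(2), np.array([[2.0, 0.3], [0.3, 1.0]]), 5.0
x = np.array([0.5, -1.0])
print(mvt_pdf(x, v, mu, Sigma), multivariate_t(loc=mu, shape=Sigma, df=v).pdf(x))
```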
1.3 Conjugate Bayesian linear regression and M&m formula
Let $y_{n \times 1}$ be the outcome variable and $X_{n \times p}$ the corresponding matrix of covariates. Assume $V$ is known. The model is given by
$$\begin{aligned}
y &= X\beta + \epsilon, \quad \epsilon \sim N(0, \sigma^2 V) \\
\beta &= m_0 + \omega, \quad \omega \sim N(0, \sigma^2 M_0) \\
\sigma^2 &\sim IG(a_0, b_0)
\end{aligned}$$
The posterior distribution of $(\beta, \sigma^2)$ is given by
$$P(\beta, \sigma^2 \mid y) = NIG(\beta, \sigma^2 \mid M_1 m_1, M_1, a_1, b_1)$$
where
$$\begin{aligned}
M_1^{-1} &= M_0^{-1} + X^{T} V^{-1} X \\
m_1 &= M_0^{-1} m_0 + X^{T} V^{-1} y \\
a_1 &= a_0 + \frac{n}{2} \\
b_1 &= b_0 + \frac{c^{*}}{2} = b_0 + \frac{1}{2}\left(m_0^{T} M_0^{-1} m_0 + y^{T} V^{-1} y - m_1^{T} M_1 m_1\right)
\end{aligned}$$
From the derivation of the marginal priors, the marginal posterior distributions are obtained by simply updating the corresponding parameters:
$$\begin{aligned}
P(\sigma^2 \mid y) &= IG(\sigma^2 \mid a_1, b_1) \\
P(\beta \mid y) &= t_{2a_1}\!\left(M_1 m_1, \frac{b_1}{a_1} M_1\right)
\end{aligned}$$
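The update formulas above amount to a few lines of linear algebra. The sketch below (function and variable names are mine; the simulated data are purely illustrative) computes $M_1$, $m_1$, $a_1$, $b_1$ from $(y, X, V)$ and the prior hyperparameters.

```python
import numpy as np

def nig_posterior(y, X, V, m0, M0, a0, b0):
    """Conjugate NIG update: returns (M1, m1, a1, b1) as defined above."""
    n = y.shape[0]
    Vinv_X = np.linalg.solve(V, X)
    Vinv_y = np.linalg.solve(V, y)
    M1 = np.linalg.inv(np.linalg.inv(M0) + X.T @ Vinv_X)
    m1 = np.linalg.solve(M0, m0) + X.T @ Vinv_y
    a1 = a0 + n / 2
    b1 = b0 + 0.5 * (m0 @ np.linalg.solve(M0, m0) + y @ Vinv_y - m1 @ M1 @ m1)
    return M1, m1, a1, b1

# Illustrative simulated data
rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)
M1, m1, a1, b1 = nig_posterior(y, X, np.eye(n), np.zeros(p), np.eye(p), a0=2.0, b0=1.0)
print(M1 @ m1)  # posterior mean of beta
```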
1.4 Updating form of the posterior distribution
We will use two ways to derive the updating form of the posterior distribution.
1.4.1 Method 1: Sherman-Woodbury-Morrison identity
Theorem 1.2 (Sherman-Woodbury-Morrison identity) We have
$$(A + BDC)^{-1} = A^{-1} - A^{-1} B \left(D^{-1} + C A^{-1} B\right)^{-1} C A^{-1}$$
where $A$ and $D$ are square matrices that are invertible and $B$ and $C$ are rectangular (square if $A$ and $D$ have the same dimensions) matrices such that the multiplications are well-defined.
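Before applying the identity, it can be reassuring to confirm it numerically on random matrices; the sketch below (dimensions and seed chosen arbitrarily) checks that the two sides agree to machine precision.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 5, 3
A = rng.normal(size=(n, n)) + n * np.eye(n)   # shifted toward the identity to keep it invertible
D = rng.normal(size=(k, k)) + k * np.eye(k)
B = rng.normal(size=(n, k))
C = rng.normal(size=(k, n))

lhs = np.linalg.inv(A + B @ D @ C)
Ainv = np.linalg.inv(A)
rhs = Ainv - Ainv @ B @ np.linalg.inv(np.linalg.inv(D) + C @ Ainv @ B) @ C @ Ainv
print(np.allclose(lhs, rhs))  # expect: True
```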
The Sherman-Woodbury-Morrison identity is easily verified by multiplying the right-hand side by $A + BDC$ and simplifying to reduce it to the identity matrix. Using this formula, we have
$$M_1 = \left(M_0^{-1} + X^{T} V^{-1} X\right)^{-1} = M_0 - M_0 X^{T} \left(V + X M_0 X^{T}\right)^{-1} X M_0 = M_0 - M_0 X^{T} Q^{-1} X M_0$$
where $Q = V + X M_0 X^{T}$.
We can show that
$$M_1 m_1 = m_0 + M_0 X^{T} Q^{-1} (y - X m_0)$$
Furthermore, the expression appearing in $b_1$ simplifies as
$$m_0^{T} M_0^{-1} m_0 + y^{T} V^{-1} y - m_1^{T} M_1 m_1 = (y - X m_0)^{T} Q^{-1} (y - X m_0)$$
So, we get the following updating form of the posterior distribution for Bayesian linear regression:
$$P(\beta, \sigma^2 \mid y) = NIG(\beta, \sigma^2 \mid \tilde{m}_1, \tilde{M}_1, a_1, b_1)$$
where
$$\begin{aligned}
\tilde{m}_1 &= M_1 m_1 = m_0 + M_0 X^{T} Q^{-1} (y - X m_0) \\
\tilde{M}_1 &= M_1 = M_0 - M_0 X^{T} Q^{-1} X M_0 \\
a_1 &= a_0 + \frac{n}{2} \\
b_1 &= b_0 + \frac{1}{2}(y - X m_0)^{T} Q^{-1} (y - X m_0) \\
Q &= V + X M_0 X^{T}
\end{aligned}$$
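Under these formulas, the updating form should agree exactly with the direct form of Section 1.3. The following self-contained sketch (simulated data and arbitrary hyperparameters of my own choosing) cross-checks the two numerically.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 40, 3
X = rng.normal(size=(n, p))
y = X @ np.array([0.5, 1.0, -1.5]) + rng.normal(size=n)
V, m0, M0, a0, b0 = np.eye(n), np.zeros(p), np.eye(p), 2.0, 1.0

# Direct form (Section 1.3)
M1 = np.linalg.inv(np.linalg.inv(M0) + X.T @ np.linalg.solve(V, X))
m1 = np.linalg.solve(M0, m0) + X.T @ np.linalg.solve(V, y)
b1_direct = b0 + 0.5 * (m0 @ np.linalg.solve(M0, m0)
                        + y @ np.linalg.solve(V, y) - m1 @ M1 @ m1)

# Updating form via Q = V + X M0 X^T
Q = V + X @ M0 @ X.T
resid = y - X @ m0
m1_tilde = m0 + M0 @ X.T @ np.linalg.solve(Q, resid)
M1_tilde = M0 - M0 @ X.T @ np.linalg.solve(Q, X @ M0)
b1_update = b0 + 0.5 * resid @ np.linalg.solve(Q, resid)

print(np.allclose(M1 @ m1, m1_tilde), np.allclose(M1, M1_tilde),
      np.isclose(b1_direct, b1_update))  # expect: True True True
```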
1.4.2 Method 2: Distribution theory
Previously, we obtained the Bayesian linear regression updater using the Sherman-Woodbury-Morrison identity. Here, we derive the same results without resorting to it. Recall that the model is given by
$$\begin{aligned}
y &= X\beta + \epsilon, \quad \epsilon \sim N(0, \sigma^2 V) \\
\beta &= m_0 + \omega, \quad \omega \sim N(0, \sigma^2 M_0) \\
\sigma^2 &\sim IG(a_0, b_0)
\end{aligned}$$
This corresponds to the posterior distribution
$$P(\beta, \sigma^2 \mid y) \propto IG(\sigma^2 \mid a_0, b_0) \times N(\beta \mid m_0, \sigma^2 M_0) \times N(y \mid X\beta, \sigma^2 V)$$
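This factorization can be evaluated term by term on the log scale, which is handy for checking samplers against the closed-form results later on. A minimal sketch, with a hypothetical helper name and illustrative inputs:

```python
import numpy as np
from scipy.stats import invgamma, multivariate_normal

def log_posterior_unnorm(beta, sigma2, y, X, V, m0, M0, a0, b0):
    """Unnormalized log posterior: log IG(sigma^2) + log N(beta | m0, sigma^2 M0) + log N(y | X beta, sigma^2 V)."""
    return (invgamma.logpdf(sigma2, a=a0, scale=b0)
            + multivariate_normal.logpdf(beta, mean=m0, cov=sigma2 * M0)
            + multivariate_normal.logpdf(y, mean=X @ beta, cov=sigma2 * V))

# Illustrative evaluation on simulated inputs.
rng = np.random.default_rng(4)
n, p = 20, 2
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -1.0]) + rng.normal(size=n)
print(log_posterior_unnorm(np.array([1.0, -1.0]), 1.0, y, X,
                           np.eye(n), np.zeros(p), np.eye(p), 2.0, 1.0))
```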
We will derive $P(\sigma^2 \mid y)$ and $P(\beta \mid \sigma^2, y)$ in a form that will reflect updates from the prior to the posterior. Integrating out $\beta$ from the model is equivalent to substituting $\beta$ from its prior model. Thus, $P(y \mid \sigma^2)$ is derived simply from
$$y = X\beta + \epsilon = X(m_0 + \omega) + \epsilon = X m_0 + X\omega + \epsilon = X m_0 + \eta$$
where $\eta = X\omega + \epsilon \sim N(0, \sigma^2 Q)$ and $Q = X M_0 X^{T} + V$.
Therefore,
$$y \mid \sigma^2 \sim N(X m_0, \sigma^2 Q)$$
The marginal posterior distribution then follows from
$$P(\sigma^2 \mid y) \propto IG(\sigma^2 \mid a_0, b_0) \times N(y \mid X m_0, \sigma^2 Q) \propto IG(\sigma^2 \mid a_1, b_1)$$
where
$$a_1 = a_0 + \frac{n}{2}, \qquad b_1 = b_0 + \frac{1}{2}(y - X m_0)^{T} Q^{-1} (y - X m_0)$$
Next, we turn to $P(\beta \mid \sigma^2, y)$. Note that
$$\begin{pmatrix} y \\ \beta \end{pmatrix} \Bigg|\, \sigma^2 \sim N\!\left(\begin{pmatrix} X m_0 \\ m_0 \end{pmatrix}, \; \sigma^2 \begin{pmatrix} Q & X M_0 \\ M_0 X^{T} & M_0 \end{pmatrix}\right)$$
From the expression for a conditional distribution derived from a multivariate Gaussian, we obtain
$$\beta \mid \sigma^2, y \sim N(\tilde{m}_1, \sigma^2 \tilde{M}_1)$$
where
$$\begin{aligned}
\tilde{m}_1 &= E[\beta \mid \sigma^2, y] = m_0 + M_0 X^{T} Q^{-1} (y - X m_0) \\
\tilde{M}_1 &= M_0 - M_0 X^{T} Q^{-1} X M_0
\end{aligned}$$
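The joint Gaussian claim for $(y, \beta)$ given $\sigma^2$ can be checked by simulating from the hierarchical model and comparing the empirical covariance of the stacked vector with the block matrix above. The sketch below uses arbitrary dimensions and hyperparameters of my own choosing.

```python
import numpy as np

# Monte Carlo check of the joint covariance of (y, beta) given sigma^2.
rng = np.random.default_rng(3)
n, p, sigma2, n_sims = 4, 2, 1.3, 200_000
X = rng.normal(size=(n, p))
V, M0, m0 = np.eye(n), np.array([[1.0, 0.2], [0.2, 0.5]]), np.zeros(p)

# Simulate beta = m0 + omega and y = X beta + eps under the hierarchical model.
omega = rng.multivariate_normal(np.zeros(p), sigma2 * M0, size=n_sims)
eps = rng.multivariate_normal(np.zeros(n), sigma2 * V, size=n_sims)
beta = m0 + omega
y = beta @ X.T + eps

# Theoretical covariance blocks: sigma^2 * [[Q, X M0], [M0 X^T, M0]]
Q = V + X @ M0 @ X.T
theory = sigma2 * np.block([[Q, X @ M0], [M0 @ X.T, M0]])
empirical = np.cov(np.hstack([y, beta]), rowvar=False)
print(np.max(np.abs(empirical - theory)))  # small (Monte Carlo error)
```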
1.5 Bayesian prediction
Assume $V = I_n$. Let $\tilde{y}$ denote an $\tilde{n} \times 1$ vector of new outcomes and $\tilde{X}$ the corresponding matrix of predictors. We seek to predict $\tilde{y}$ based upon $y$:
$$P(\tilde{y} \mid y) = t_{2a_1}\!\left(\tilde{X} M_1 m_1, \; \frac{b_1}{a_1}\left(I_{\tilde{n}} + \tilde{X} M_1 \tilde{X}^{T}\right)\right)$$
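Since the predictive density is a multivariate $t$, its parameters can be assembled directly from the posterior quantities. Below is a minimal sketch (helper name and inputs are mine) that builds the location, scale matrix, and degrees of freedom, and draws from the density via scipy's multivariate_t.

```python
import numpy as np
from scipy.stats import multivariate_t

def predictive_t_params(X_new, M1, m1, a1, b1):
    """Location, scale, and df of t_{2 a1}(X_new M1 m1, (b1/a1)(I + X_new M1 X_new^T))."""
    loc = X_new @ M1 @ m1
    scale = (b1 / a1) * (np.eye(X_new.shape[0]) + X_new @ M1 @ X_new.T)
    return loc, scale, 2 * a1

# Illustrative posterior quantities (would normally come from the update in Section 1.3).
p = 2
M1, m1, a1, b1 = 0.1 * np.eye(p), np.array([5.0, -3.0]), 27.0, 30.0
X_new = np.array([[1.0, 0.5], [1.0, -1.2]])
loc, scale, df = predictive_t_params(X_new, M1, m1, a1, b1)
draws = multivariate_t(loc=loc, shape=scale, df=df).rvs(size=1000, random_state=0)
print(loc, draws.mean(axis=0))  # predictive location vs. Monte Carlo average
```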
1.6 Sampling from the posterior distribution
We can obtain samples from the joint posterior density $P(\beta, \sigma^2, \tilde{y} \mid y)$ by the following sampling process. For $i = 1, 2, \dots$:
1. Draw $\hat{\sigma}^2_{(i)}$ from $IG(a_1, b_1)$.
2. Draw $\hat{\beta}_{(i)}$ from $N\!\left(M_1 m_1, \hat{\sigma}^2_{(i)} M_1\right)$.
3. Draw $\tilde{y}_{(i)}$ from $N\!\left(\tilde{X} \hat{\beta}_{(i)}, \hat{\sigma}^2_{(i)} I_{\tilde{n}}\right)$.
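A minimal sketch of this composition sampler, assuming the posterior quantities $M_1$, $m_1$, $a_1$, $b_1$ have already been computed (the function name and the illustrative values below are mine):

```python
import numpy as np
from scipy.stats import invgamma

def sample_posterior_predictive(X_new, M1, m1, a1, b1, n_draws=1000, seed=0):
    """Composition sampling: sigma^2 ~ IG(a1, b1), then beta | sigma^2, then y_new | beta, sigma^2."""
    rng = np.random.default_rng(seed)
    post_mean = M1 @ m1
    sigma2 = invgamma.rvs(a=a1, scale=b1, size=n_draws, random_state=rng)
    betas, y_news = [], []
    for s2 in sigma2:
        beta = rng.multivariate_normal(post_mean, s2 * M1)
        y_new = rng.normal(X_new @ beta, np.sqrt(s2))  # N(X_new beta, s2 * I)
        betas.append(beta)
        y_news.append(y_new)
    return sigma2, np.array(betas), np.array(y_news)

# Illustrative posterior quantities (would come from the conjugate update above).
p = 2
M1, m1, a1, b1 = 0.1 * np.eye(p), np.array([5.0, -3.0]), 27.0, 30.0
X_new = np.array([[1.0, 0.5], [1.0, -1.2]])
sigma2, betas, y_news = sample_posterior_predictive(X_new, M1, m1, a1, b1)
print(betas.mean(axis=0), y_news.mean(axis=0))
```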