Chapter 2 Updating form of the posterior distribution
Assume $y \sim N(X\beta, \sigma^2 V)$ and $P(\beta, \sigma^2) = \mathrm{NIG}(\beta, \sigma^2 \mid m_0, M_0, a_0, b_0)$. Then the posterior distribution is given by
$$P(\beta, \sigma^2 \mid y) = \mathrm{NIG}(\beta, \sigma^2 \mid M_1 m_1, M_1, a_1, b_1),$$
where
$$M_1 = \left(M_0^{-1} + X^\top V^{-1} X\right)^{-1}; \qquad m_1 = M_0^{-1} m_0 + X^\top V^{-1} y; \qquad a_1 = a_0 + \frac{n}{2}; \qquad b_1 = b_0 + \frac{1}{2}\left(m_0^\top M_0^{-1} m_0 + y^\top V^{-1} y - m_1^\top M_1 m_1\right),$$
and $n$ is the number of observations (the length of $y$).
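For concreteness, here is a minimal numpy sketch that computes these posterior parameters directly from the definitions above; the data, design matrix, and hyperparameter values are arbitrary placeholders.

```python
import numpy as np
from numpy.linalg import inv

rng = np.random.default_rng(0)
n, p = 20, 3                        # arbitrary problem sizes
X = rng.normal(size=(n, p))         # design matrix
y = rng.normal(size=n)              # response
V = np.eye(n)                       # error covariance (up to sigma^2)
m0, M0 = np.zeros(p), np.eye(p)     # prior mean and covariance (up to sigma^2)
a0, b0 = 2.0, 1.0                   # IG prior shape and rate

# Posterior parameters, computed directly from the definitions
M1 = inv(inv(M0) + X.T @ inv(V) @ X)
m1 = inv(M0) @ m0 + X.T @ inv(V) @ y
a1 = a0 + n / 2
b1 = b0 + 0.5 * (m0 @ inv(M0) @ m0 + y @ inv(V) @ y - m1 @ M1 @ m1)
beta_post_mean = M1 @ m1            # posterior mean of beta
```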
We will use two ways to calculate $M_1$ and, with it, the updated posterior parameters.
2.1 Method 1: Sherman-Woodbury-Morrison identity
Theorem 2.1 (Sherman-Woodbury-Morrison identity) We have
$$(A + BDC)^{-1} = A^{-1} - A^{-1} B \left(D^{-1} + C A^{-1} B\right)^{-1} C A^{-1},$$
where $A$ and $D$ are invertible square matrices, and $B$ and $C$ are rectangular (square if $A$ and $D$ have the same dimensions) matrices such that the multiplications are well-defined.
The Sherman-Woodbury-Morrison identity is easily verified by multiplying the right-hand side by $A + BDC$ and simplifying to reduce it to the identity matrix. Using this formula, we have
$$M_1 = \left(M_0^{-1} + X^\top V^{-1} X\right)^{-1} = M_0 - M_0 X^\top \left(V + X M_0 X^\top\right)^{-1} X M_0 = M_0 - M_0 X^\top Q^{-1} X M_0,$$
where $Q = V + X M_0 X^\top$.
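As a quick numerical sanity check of Theorem 2.1, the sketch below verifies the identity on randomly generated, well-conditioned matrices (the sizes are arbitrary):

```python
import numpy as np
from numpy.linalg import inv

rng = np.random.default_rng(1)
n, k = 5, 3
A = n * np.eye(n) + rng.normal(size=(n, n))  # invertible square matrix
D = k * np.eye(k) + rng.normal(size=(k, k))  # invertible square matrix
B = rng.normal(size=(n, k))
C = rng.normal(size=(k, n))

lhs = inv(A + B @ D @ C)
rhs = inv(A) - inv(A) @ B @ inv(inv(D) + C @ inv(A) @ B) @ C @ inv(A)
assert np.allclose(lhs, rhs)
```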
We can show that $M_1 m_1 = m_0 + M_0 X^\top Q^{-1}(y - X m_0)$:
$$\begin{aligned}
M_1 m_1 &= \left(M_0^{-1} + X^\top V^{-1} X\right)^{-1} m_1 \\
&= \left[M_0 - M_0 X^\top \left(V + X M_0 X^\top\right)^{-1} X M_0\right] m_1 \\
&= \left(M_0 - M_0 X^\top Q^{-1} X M_0\right) m_1 \\
&= \left(M_0 - M_0 X^\top Q^{-1} X M_0\right)\left(M_0^{-1} m_0 + X^\top V^{-1} y\right) \\
&= m_0 + M_0 X^\top V^{-1} y - M_0 X^\top Q^{-1} X m_0 - M_0 X^\top Q^{-1} X M_0 X^\top V^{-1} y \\
&= m_0 + M_0 X^\top \left(I - Q^{-1} X M_0 X^\top\right) V^{-1} y - M_0 X^\top Q^{-1} X m_0 \\
&= m_0 + M_0 X^\top Q^{-1}\left(Q - X M_0 X^\top\right) V^{-1} y - M_0 X^\top Q^{-1} X m_0 \qquad (\text{since } Q = V + X M_0 X^\top) \\
&= m_0 + M_0 X^\top Q^{-1} V V^{-1} y - M_0 X^\top Q^{-1} X m_0 \\
&= m_0 + M_0 X^\top Q^{-1} y - M_0 X^\top Q^{-1} X m_0 \\
&= m_0 + M_0 X^\top Q^{-1}(y - X m_0).
\end{aligned}$$
Furthermore, we can show that $m_0^\top M_0^{-1} m_0 + y^\top V^{-1} y - m_1^\top M_1 m_1 = (y - X m_0)^\top Q^{-1}(y - X m_0)$:
$$\begin{aligned}
& m_0^\top M_0^{-1} m_0 + y^\top V^{-1} y - m_1^\top M_1 m_1 \\
&= m_0^\top M_0^{-1} m_0 + y^\top V^{-1} y - m_1^\top\left[m_0 + M_0 X^\top Q^{-1}(y - X m_0)\right] \\
&= m_0^\top M_0^{-1} m_0 + y^\top V^{-1} y - m_1^\top m_0 - m_1^\top M_0 X^\top Q^{-1}(y - X m_0) \\
&= m_0^\top M_0^{-1} m_0 + y^\top V^{-1} y - m_0^\top\left(M_0^{-1} m_0 + X^\top V^{-1} y\right) - m_1^\top M_0 X^\top Q^{-1}(y - X m_0) \\
&= y^\top V^{-1} y - y^\top V^{-1} X m_0 - m_1^\top M_0 X^\top Q^{-1}(y - X m_0) \\
&= y^\top V^{-1}(y - X m_0) - \underbrace{m_1^\top M_0 X^\top Q^{-1}(y - X m_0)}_{\text{simplify from left to right}} \\
&= y^\top V^{-1}(y - X m_0) - (M_0 m_1)^\top X^\top Q^{-1}(y - X m_0) \\
&= y^\top V^{-1}(y - X m_0) - \left(m_0 + M_0 X^\top V^{-1} y\right)^\top X^\top Q^{-1}(y - X m_0) \\
&= y^\top V^{-1}(y - X m_0) - \left(X m_0 + X M_0 X^\top V^{-1} y\right)^\top Q^{-1}(y - X m_0) \\
&= y^\top V^{-1}(y - X m_0) - \left[Q^{-1} X m_0 + Q^{-1}\left(X M_0 X^\top\right) V^{-1} y\right]^\top (y - X m_0) \\
&= y^\top V^{-1}(y - X m_0) - \left[Q^{-1} X m_0 + Q^{-1}(Q - V) V^{-1} y\right]^\top (y - X m_0) \\
&= y^\top V^{-1}(y - X m_0) - \left(Q^{-1} X m_0 + V^{-1} y - Q^{-1} y\right)^\top (y - X m_0) \\
&= y^\top V^{-1}(y - X m_0) - \left[V^{-1} y + Q^{-1}(X m_0 - y)\right]^\top (y - X m_0) \\
&= y^\top V^{-1}(y - X m_0) - y^\top V^{-1}(y - X m_0) + (y - X m_0)^\top Q^{-1}(y - X m_0) \\
&= (y - X m_0)^\top Q^{-1}(y - X m_0),
\end{aligned}$$
where we used $m_1^\top m_0 = m_0^\top m_1$ (a scalar), $M_0 m_1 = m_0 + M_0 X^\top V^{-1} y$, and the symmetry of $V$ and $Q$.
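Both identities just derived, for $M_1 m_1$ and for the quadratic form appearing in $b_1$, are easy to check numerically. Here is a small sketch with arbitrary random inputs:

```python
import numpy as np
from numpy.linalg import inv

rng = np.random.default_rng(2)
n, p = 20, 3
X, y = rng.normal(size=(n, p)), rng.normal(size=n)
V, M0, m0 = np.eye(n), np.eye(p), rng.normal(size=p)

Q = V + X @ M0 @ X.T
M1 = inv(inv(M0) + X.T @ inv(V) @ X)
m1 = inv(M0) @ m0 + X.T @ inv(V) @ y

# Identity for M1 m1
assert np.allclose(M1 @ m1, m0 + M0 @ X.T @ inv(Q) @ (y - X @ m0))
# Quadratic-form identity used in b1
lhs = m0 @ inv(M0) @ m0 + y @ inv(V) @ y - m1 @ M1 @ m1
rhs = (y - X @ m0) @ inv(Q) @ (y - X @ m0)
assert np.isclose(lhs, rhs)
```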
Thus, we obtain the following updating form of the posterior distribution for Bayesian linear regression:
$$P(\beta, \sigma^2 \mid y) = \mathrm{NIG}(\beta, \sigma^2 \mid \tilde{m}_1, \tilde{M}_1, a_1, b_1),$$ where
$$\begin{aligned}
\tilde{m}_1 &= M_1 m_1 = m_0 + M_0 X^\top Q^{-1}(y - X m_0), \\
\tilde{M}_1 &= M_1 = M_0 - M_0 X^\top Q^{-1} X M_0, \\
a_1 &= a_0 + \frac{n}{2}, \\
b_1 &= b_0 + \frac{1}{2}(y - X m_0)^\top Q^{-1}(y - X m_0), \\
Q &= V + X M_0 X^\top.
\end{aligned}$$
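Collecting these formulas, here is a minimal sketch of the updater as a single function (the name `nig_update` and the test data are ours, for illustration only), with a sanity check against the direct formulas for $M_1$ and $m_1$:

```python
import numpy as np
from numpy.linalg import inv

def nig_update(y, X, V, m0, M0, a0, b0):
    # Bayesian linear regression updater in the Q-based form above
    n = y.shape[0]
    Q = V + X @ M0 @ X.T
    r = y - X @ m0                # residual of y against the prior mean
    K = M0 @ X.T @ inv(Q)         # "gain" applied to the residual
    m1_tilde = m0 + K @ r
    M1_tilde = M0 - K @ X @ M0
    a1 = a0 + n / 2
    b1 = b0 + 0.5 * r @ inv(Q) @ r
    return m1_tilde, M1_tilde, a1, b1

# Sanity check against the direct formulas
rng = np.random.default_rng(3)
n, p = 20, 3
X, y, V = rng.normal(size=(n, p)), rng.normal(size=n), np.eye(n)
m0, M0, a0, b0 = np.zeros(p), np.eye(p), 2.0, 1.0
m1_t, M1_t, a1, b1 = nig_update(y, X, V, m0, M0, a0, b0)
M1 = inv(inv(M0) + X.T @ inv(V) @ X)
m1 = inv(M0) @ m0 + X.T @ inv(V) @ y
assert np.allclose(m1_t, M1 @ m1) and np.allclose(M1_t, M1)
```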
2.2 Method 2: distribution theory
Previously, we obtained the Bayesian linear regression updater using the Sherman-Woodbury-Morrison identity. Here, we will derive the same results without resorting to it. The model is given by
$$y = X\beta + \epsilon, \quad \epsilon \sim N(0, \sigma^2 V); \qquad \beta = m_0 + \omega, \quad \omega \sim N(0, \sigma^2 M_0); \qquad \sigma^2 \sim \mathrm{IG}(a_0, b_0).$$
This corresponds to the posterior distribution
$$P(\beta, \sigma^2 \mid y) \propto \mathrm{IG}(\sigma^2 \mid a_0, b_0) \times N(\beta \mid m_0, \sigma^2 M_0) \times N(y \mid X\beta, \sigma^2 V).$$
We will derive $P(\sigma^2 \mid y)$ and $P(\beta \mid \sigma^2, y)$ in a form that reflects the update from the prior to the posterior.
Integrating out $\beta$ from the model is equivalent to substituting for $\beta$ from its prior model. Thus, $P(y \mid \sigma^2)$ is derived simply from
$$y = X\beta + \epsilon = X(m_0 + \omega) + \epsilon = X m_0 + X\omega + \epsilon = X m_0 + \eta,$$
where $\eta = X\omega + \epsilon \sim N(0, \sigma^2 Q)$ and $Q = X M_0 X^\top + V$.
Therefore, $y \mid \sigma^2 \sim N(X m_0, \sigma^2 Q)$.
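This marginal can be checked by simulation: drawing $\omega$ and $\epsilon$ from their priors and forming $y$ should reproduce the moments of $N(X m_0, \sigma^2 Q)$. A small Monte Carlo sketch (arbitrary sizes; loose tolerances to absorb simulation error):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma2 = 4, 2, 1.5
X = rng.normal(size=(n, p))
V, M0, m0 = np.eye(n), np.eye(p), np.zeros(p)

# Simulate y = X(m0 + omega) + eps and compare moments with N(X m0, sigma^2 Q)
draws = 200_000
omega = np.sqrt(sigma2) * rng.normal(size=(draws, p))  # omega ~ N(0, sigma^2 M0), using M0 = I
eps = np.sqrt(sigma2) * rng.normal(size=(draws, n))    # eps ~ N(0, sigma^2 V), using V = I
ys = (m0 + omega) @ X.T + eps

Q = X @ M0 @ X.T + V
assert np.allclose(ys.mean(axis=0), X @ m0, atol=0.05)
assert np.allclose(np.cov(ys.T), sigma2 * Q, atol=0.1)
```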
The posterior distribution is given by
$$\begin{aligned}
P(\sigma^2 \mid y) &\propto P(\sigma^2)\, P(y \mid \sigma^2) = \mathrm{IG}(\sigma^2 \mid a_0, b_0) \times N(y \mid X m_0, \sigma^2 Q) \\
&\propto \left(\frac{1}{\sigma^2}\right)^{a_0 + 1} e^{-\frac{b_0}{\sigma^2}} \times \left(\frac{1}{\sigma^2}\right)^{\frac{n}{2}} e^{-\frac{1}{2\sigma^2}(y - X m_0)^\top Q^{-1}(y - X m_0)} \\
&\propto \left(\frac{1}{\sigma^2}\right)^{a_0 + \frac{n}{2} + 1} \exp\left[-\frac{1}{\sigma^2}\left\{b_0 + \frac{1}{2}(y - X m_0)^\top Q^{-1}(y - X m_0)\right\}\right] \\
&\propto \mathrm{IG}(\sigma^2 \mid a_1, b_1),
\end{aligned}$$
where $a_1 = a_0 + \frac{n}{2}$ and $b_1 = b_0 + \frac{1}{2}(y - X m_0)^\top Q^{-1}(y - X m_0)$.
Next, we turn to $P(\beta \mid \sigma^2, y)$. Note that
$$\begin{bmatrix} y \\ \beta \end{bmatrix} \,\Bigg|\, \sigma^2 \sim N\left(\begin{bmatrix} X m_0 \\ m_0 \end{bmatrix}, \; \sigma^2 \begin{bmatrix} Q & X M_0 \\ M_0 X^\top & M_0 \end{bmatrix}\right),$$
where we have used the facts
$$E[y \mid \sigma^2] = X m_0; \qquad E[\beta \mid \sigma^2] = m_0; \qquad \mathrm{Var}(y \mid \sigma^2) = \sigma^2 Q; \qquad \mathrm{Var}(\beta \mid \sigma^2) = \sigma^2 M_0;$$
$$\begin{aligned}
\mathrm{Cov}(y, \beta \mid \sigma^2) &= \mathrm{Cov}(X\beta + \epsilon, \, \beta \mid \sigma^2) \\
&= \mathrm{Cov}\left(X(m_0 + \omega) + \epsilon, \, m_0 + \omega \mid \sigma^2\right) \\
&= \mathrm{Cov}(X\omega, \omega \mid \sigma^2) \qquad (\text{since } m_0 \text{ is constant and } \mathrm{Cov}(\omega, \epsilon) = 0) \\
&= \sigma^2 X M_0.
\end{aligned}$$
From the expression for a conditional distribution derived from a multivariate Gaussian (see the note below), we obtain $\beta \mid \sigma^2, y \sim N(\tilde{m}_1, \sigma^2 \tilde{M}_1),$
where $\tilde{m}_1 = E[\beta \mid \sigma^2, y] = m_0 + M_0 X^\top Q^{-1}(y - X m_0)$ and $\tilde{M}_1 = M_0 - M_0 X^\top Q^{-1} X M_0$.
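Putting the two pieces together, one can sample from the joint posterior by composition: first $\sigma^2 \sim \mathrm{IG}(a_1, b_1)$, then $\beta \mid \sigma^2 \sim N(\tilde{m}_1, \sigma^2 \tilde{M}_1)$. Below is a minimal sketch with placeholder data; note that scipy's `invgamma` with `scale=b1` matches the $\mathrm{IG}(a_1, b_1)$ parameterization used here:

```python
import numpy as np
from numpy.linalg import inv
from scipy.stats import invgamma

rng = np.random.default_rng(6)
n, p = 20, 3
X, y, V = rng.normal(size=(n, p)), rng.normal(size=n), np.eye(n)
m0, M0, a0, b0 = np.zeros(p), np.eye(p), 2.0, 1.0

Q = V + X @ M0 @ X.T
r = y - X @ m0
m1_tilde = m0 + M0 @ X.T @ inv(Q) @ r
M1_tilde = M0 - M0 @ X.T @ inv(Q) @ X @ M0
a1, b1 = a0 + n / 2, b0 + 0.5 * r @ inv(Q) @ r

# Composition sampling: sigma^2 ~ IG(a1, b1), then beta | sigma^2 ~ N(m1_tilde, sigma^2 M1_tilde)
sigma2 = invgamma.rvs(a1, scale=b1, size=1000, random_state=rng)
betas = np.array([rng.multivariate_normal(m1_tilde, s2 * M1_tilde) for s2 in sigma2])
```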
Note: if
$$\begin{bmatrix} X_1 \\ X_2 \end{bmatrix} \sim N\left(\begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}\right) \text{ with } \Sigma_{21} = \Sigma_{12}^\top,$$
then $X_2 \mid X_1 \sim N(\mu_{2 \cdot 1}, \Sigma_{2 \cdot 1})$, where
$$\mu_{2 \cdot 1} = \mu_2 + \Sigma_{21} \Sigma_{11}^{-1}(X_1 - \mu_1) \quad \text{and} \quad \Sigma_{2 \cdot 1} = \Sigma_{22} - \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12}.$$
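This conditioning formula is straightforward to implement. The sketch below (the helper name `gaussian_condition` is ours) applies it to the joint blocks of $(y, \beta)$ above and recovers $\tilde{m}_1$ and $\tilde{M}_1$; the $\sigma^2$ factor cancels in the conditional mean and simply scales the covariance, so we work with the unit-$\sigma^2$ blocks:

```python
import numpy as np
from numpy.linalg import inv

def gaussian_condition(mu1, mu2, S11, S12, S22, x1):
    # Mean and covariance of X2 | X1 = x1 for a joint Gaussian with S21 = S12.T
    S21 = S12.T
    mu_cond = mu2 + S21 @ inv(S11) @ (x1 - mu1)
    S_cond = S22 - S21 @ inv(S11) @ S12
    return mu_cond, S_cond

# Apply to the joint of (y, beta) given sigma^2: this recovers the updater
rng = np.random.default_rng(7)
n, p = 20, 3
X, y, V = rng.normal(size=(n, p)), rng.normal(size=n), np.eye(n)
m0, M0 = np.zeros(p), np.eye(p)
Q = X @ M0 @ X.T + V
m1_tilde, M1_tilde = gaussian_condition(X @ m0, m0, Q, X @ M0, M0, y)
assert np.allclose(m1_tilde, m0 + M0 @ X.T @ inv(Q) @ (y - X @ m0))
assert np.allclose(M1_tilde, M0 - M0 @ X.T @ inv(Q) @ X @ M0)
```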