Bayesian Linear Regression
2023-01-23
Chapter 1 Basics of Bayesian linear regression
1.1 Bayes’ theorem
Theorem 1.1 (Bayes' theorem) For events $A$, $B$ with $P(B) \neq 0$, we have
\begin{align} P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} \end{align}
We denote by $U$ the unknown parameters and by $K$ the known quantities. We call $P(U)$ the prior and $P(K \mid U)$ the likelihood. Bayes' theorem then gives the posterior distribution of the unknown parameters given the known quantities:
\begin{align} P(U \mid K) \propto P(U) \cdot P(K \mid U) \end{align}
Let $K = \{y_{n \times 1}, X_{n \times p}\}$ and assume $y \sim N(X\beta, \sigma^2 V)$, where $V$ is known and $U = \{\beta, \sigma^2\}$ is unknown. The likelihood is given by
\begin{align} P(K \mid U) = N(y \mid X\beta, \sigma^2 V) \end{align}
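To make this setup concrete, here is a small simulation sketch in Python; the dimensions, design matrix, and the "true" values of $\beta$ and $\sigma^2$ below are arbitrary illustrative choices rather than anything prescribed by the model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions and "true" parameters (arbitrary choices)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # n x p design matrix
beta_true = np.array([1.0, -2.0, 0.5])
sigma2_true = 0.8
V = np.eye(n)   # known covariance structure of the noise (identity here)

# Generate y ~ N(X beta, sigma^2 V)
y = rng.multivariate_normal(X @ beta_true, sigma2_true * V)
```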
1.2 Normal Inverse-Gamma (NIG) prior
1.2.1 Joint distribution of NIG prior
Definition 1.1 (Normal Inverse-Gamma distribution) Suppose
\begin{align} x \mid \sigma^2, \mu, M &\sim N(\mu, \sigma^2 M) \\ \sigma^2 \mid \alpha, \beta &\sim IG(\alpha, \beta) \end{align}
Then $(x, \sigma^2)$ has a Normal Inverse-Gamma distribution, denoted as $(x, \sigma^2) \sim NIG(\mu, M, \alpha, \beta)$.
We use a Normal Inverse-Gamma prior for $(\beta, \sigma^2)$:
\begin{align} P(\beta, \sigma^2) &= NIG(\beta, \sigma^2 \mid m_0, M_0, a_0, b_0) = IG(\sigma^2 \mid a_0, b_0) \cdot N(\beta \mid m_0, \sigma^2 M_0) \\ &= \frac{b_0^{a_0}}{\Gamma(a_0)}\left(\frac{1}{\sigma^2}\right)^{a_0+1} e^{-\frac{b_0}{\sigma^2}} \frac{1}{(2\pi)^{\frac{p}{2}}\left|\sigma^2 M_0\right|^{\frac{1}{2}}} e^{-\frac{1}{2\sigma^2} Q(\beta, m_0, M_0)} \\ &= \frac{b_0^{a_0}}{\Gamma(a_0)}\left(\frac{1}{\sigma^2}\right)^{a_0+1} e^{-\frac{b_0}{\sigma^2}} \frac{1}{(2\pi\sigma^2)^{\frac{p}{2}}\left|M_0\right|^{\frac{1}{2}}} e^{-\frac{1}{2\sigma^2} Q(\beta, m_0, M_0)} \end{align}
where $Q(x, m, M) = (x - m)^{\top} M^{-1}(x - m)$.
Note: the Inverse-Gamma (IG) distribution is related to the Gamma distribution. If $X \sim \text{Gamma}(\alpha, \beta)$, the density of $X$ is $f(x) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}$. Then $Y = \frac{1}{X} \sim IG(\alpha, \beta)$, with density $f(y) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} y^{-\alpha-1} e^{-\beta / y}$.
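As a quick numerical illustration of these definitions, the sketch below (assuming NumPy and SciPy are available; the hyperparameter values are arbitrary) evaluates the NIG log-density as the sum of its IG and Normal factors, and checks the Gamma/Inverse-Gamma relationship. Note that `scipy.stats.invgamma` takes the rate parameter $\beta$ through its `scale` argument.

```python
import numpy as np
from scipy.stats import invgamma, gamma, multivariate_normal

# Illustrative NIG hyperparameters
p = 3
m0 = np.zeros(p)
M0 = 100.0 * np.eye(p)
a0, b0 = 2.0, 1.0

def nig_logpdf(beta, sigma2, m0, M0, a0, b0):
    """log NIG(beta, sigma2 | m0, M0, a0, b0) = log IG(sigma2 | a0, b0) + log N(beta | m0, sigma2 M0)."""
    return (invgamma.logpdf(sigma2, a0, scale=b0)
            + multivariate_normal.logpdf(beta, mean=m0, cov=sigma2 * M0))

print(nig_logpdf(np.zeros(p), 1.0, m0, M0, a0, b0))

# Gamma / Inverse-Gamma relationship: if X ~ Gamma(alpha, rate=beta), then Y = 1/X ~ IG(alpha, beta).
y_val = 0.7
lhs = invgamma.pdf(y_val, a0, scale=b0)
rhs = gamma.pdf(1.0 / y_val, a0, scale=1.0 / b0) / y_val**2   # change of variables: |dx/dy| = 1/y^2
print(np.isclose(lhs, rhs))  # True
```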
1.2.2 Marginal distribution of NIG prior
The marginal priors can be obtained by integration:
\begin{align} P(\sigma^2) &= \int NIG(\beta, \sigma^2 \mid m_0, M_0, a_0, b_0) \, d\beta = IG(\sigma^2 \mid a_0, b_0) \\ P(\beta) &= \int NIG(\beta, \sigma^2 \mid m_0, M_0, a_0, b_0) \, d\sigma^2 = t_{2 a_0}\left(m_0, \frac{b_0}{a_0} M_0\right) \end{align}
Derivation:
\begin{align} P(\sigma^2) &= \int NIG(\beta, \sigma^2 \mid m_0, M_0, a_0, b_0) \, d\beta \\ &= IG(\sigma^2 \mid a_0, b_0) \int N(\beta \mid m_0, \sigma^2 M_0) \, d\beta \\ &= IG(\sigma^2 \mid a_0, b_0) \end{align}
\begin{align} P(\beta) &= \int NIG(\beta, \sigma^2 \mid m_0, M_0, a_0, b_0) \, d\sigma^2 \\ &= \int \frac{b_0^{a_0}}{\Gamma(a_0)}\left(\frac{1}{\sigma^2}\right)^{a_0+1} e^{-\frac{b_0}{\sigma^2}} \frac{1}{(2\pi\sigma^2)^{\frac{p}{2}}|M_0|^{\frac{1}{2}}} e^{-\frac{1}{2\sigma^2} Q(\beta, m_0, M_0)} \, d\sigma^2 \\ &= \frac{b_0^{a_0}}{\Gamma(a_0)(2\pi)^{\frac{p}{2}}|M_0|^{\frac{1}{2}}} \int \left(\frac{1}{\sigma^2}\right)^{a_0+\frac{p}{2}+1} e^{-\frac{1}{\sigma^2}\left(b_0 + \frac{1}{2} Q(\beta, m_0, M_0)\right)} \, d\sigma^2 \quad \left(\text{let } u = \frac{1}{\sigma^2},\ |d\sigma^2| = \frac{1}{u^2}\, du\right) \\ &= \frac{b_0^{a_0}}{\Gamma(a_0)(2\pi)^{\frac{p}{2}}|M_0|^{\frac{1}{2}}} \int u^{a_0+\frac{p}{2}+1} e^{-\left(b_0 + \frac{1}{2} Q(\beta, m_0, M_0)\right) u} \frac{1}{u^2} \, du \\ &= \frac{b_0^{a_0}}{\Gamma(a_0)(2\pi)^{\frac{p}{2}}|M_0|^{\frac{1}{2}}} \int u^{a_0+\frac{p}{2}-1} e^{-\left(b_0 + \frac{1}{2} Q(\beta, m_0, M_0)\right) u} \, du \quad \left(\text{by the Gamma integral } \int x^{\alpha-1} e^{-\beta x}\, dx = \frac{\Gamma(\alpha)}{\beta^{\alpha}}\right) \\ &= \frac{b_0^{a_0}}{\Gamma(a_0)(2\pi)^{\frac{p}{2}}|M_0|^{\frac{1}{2}}} \cdot \frac{\Gamma\left(a_0+\frac{p}{2}\right)}{\left[b_0 + \frac{1}{2} Q(\beta, m_0, M_0)\right]^{a_0+\frac{p}{2}}} \\ &= \frac{b_0^{a_0}\, \Gamma\left(a_0+\frac{p}{2}\right)}{\Gamma(a_0)(2\pi)^{\frac{p}{2}}|M_0|^{\frac{1}{2}}} \left[b_0\left(1 + \frac{1}{2 b_0} Q(\beta, m_0, M_0)\right)\right]^{-\left(a_0+\frac{p}{2}\right)} \\ &= \frac{b_0^{a_0}\, \Gamma\left(a_0+\frac{p}{2}\right) b_0^{-\left(a_0+\frac{p}{2}\right)}}{\Gamma(a_0)(2\pi)^{\frac{p}{2}}|M_0|^{\frac{1}{2}}} \left[1 + \frac{1}{2 b_0}(\beta - m_0)^{\top} M_0^{-1}(\beta - m_0)\right]^{-\left(a_0+\frac{p}{2}\right)} \\ &= \frac{\Gamma\left(a_0+\frac{p}{2}\right)}{(2\pi)^{\frac{p}{2}} b_0^{\frac{p}{2}}\, \Gamma(a_0)\, |M_0|^{\frac{1}{2}}} \left[1 + \frac{1}{2 b_0}(\beta - m_0)^{\top} M_0^{-1}(\beta - m_0)\right]^{-\left(a_0+\frac{p}{2}\right)} \\ &= \frac{\Gamma\left(a_0+\frac{p}{2}\right)}{(2\pi)^{\frac{p}{2}}\left(a_0 \cdot \frac{b_0}{a_0}\right)^{\frac{p}{2}} \Gamma(a_0)\, |M_0|^{\frac{1}{2}}} \left[1 + \frac{1}{2 a_0 \cdot \frac{b_0}{a_0}}(\beta - m_0)^{\top} M_0^{-1}(\beta - m_0)\right]^{-\left(a_0+\frac{p}{2}\right)} \\ &= \frac{\Gamma\left(a_0+\frac{p}{2}\right)}{(2 a_0 \pi)^{\frac{p}{2}} \Gamma(a_0)\left|\frac{b_0}{a_0} M_0\right|^{\frac{1}{2}}} \left[1 + \frac{1}{2 a_0}(\beta - m_0)^{\top}\left(\frac{b_0}{a_0} M_0\right)^{-1}(\beta - m_0)\right]^{-\left(a_0+\frac{p}{2}\right)} \\ &= t_{2 a_0}\left(m_0, \frac{b_0}{a_0} M_0\right) \end{align}
Note: the density of the multivariate t-distribution is given by
\begin{align} t_{\nu}(\mu, \Sigma) = \frac{\Gamma\left(\frac{\nu+p}{2}\right)}{(\nu \pi)^{\frac{p}{2}} \Gamma\left(\frac{\nu}{2}\right)|\Sigma|^{\frac{1}{2}}}\left[1 + \frac{1}{\nu}(x - \mu)^{\top} \Sigma^{-1}(x - \mu)\right]^{-\frac{\nu+p}{2}} \end{align}
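This marginalization can also be checked numerically. The sketch below (hyperparameter values are arbitrary; `scipy.stats.multivariate_t` requires SciPy ≥ 1.6) integrates the NIG density over $\sigma^2$ at a single point $\beta$ and compares the result with the multivariate $t$ density $t_{2 a_0}(m_0, \frac{b_0}{a_0} M_0)$.

```python
import numpy as np
from scipy.stats import invgamma, multivariate_normal, multivariate_t
from scipy.integrate import quad

# Illustrative hyperparameters and an arbitrary evaluation point
p = 3
m0 = np.zeros(p)
M0 = np.eye(p)
a0, b0 = 3.0, 2.0
beta = np.array([0.3, -0.1, 0.5])

# Numerically integrate NIG(beta, sigma2 | m0, M0, a0, b0) over sigma2 ...
integrand = lambda s2: (invgamma.pdf(s2, a0, scale=b0)
                        * multivariate_normal.pdf(beta, mean=m0, cov=s2 * M0))
marginal_numeric, _ = quad(integrand, 0.0, np.inf)

# ... and compare with the multivariate t density t_{2 a0}(m0, (b0 / a0) M0)
marginal_t = multivariate_t.pdf(beta, loc=m0, shape=(b0 / a0) * M0, df=2 * a0)

print(np.isclose(marginal_numeric, marginal_t))  # True
```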
1.3 Posterior distribution
The posterior distribution of $(\beta, \sigma^2)$ is given by
\begin{align} P(\beta, \sigma^2 \mid y) = NIG(\beta, \sigma^2 \mid M_1 m_1, M_1, a_1, b_1) \end{align}
where
\begin{align} M_1 &= \left(M_0^{-1} + X^{\top} V^{-1} X\right)^{-1}; \quad m_1 = M_0^{-1} m_0 + X^{\top} V^{-1} y; \\ a_1 &= a_0 + \frac{n}{2}; \quad b_1 = b_0 + \frac{c^{*}}{2} = b_0 + \frac{1}{2}\left(m_0^{\top} M_0^{-1} m_0 + y^{\top} V^{-1} y - m_1^{\top} M_1 m_1\right). \end{align}
Derivation:
\begin{align} P(\beta, \sigma^2 \mid y) &\propto NIG(\beta, \sigma^2 \mid m_0, M_0, a_0, b_0) \cdot N(y \mid X\beta, \sigma^2 V) \\ &\propto IG(\sigma^2 \mid a_0, b_0) \cdot N(\beta \mid m_0, \sigma^2 M_0) \cdot N(y \mid X\beta, \sigma^2 V) \\ &\propto \frac{b_0^{a_0}}{\Gamma(a_0)}\left(\frac{1}{\sigma^2}\right)^{a_0+1} e^{-\frac{b_0}{\sigma^2}} \frac{1}{(2\pi\sigma^2)^{\frac{p}{2}}|M_0|^{\frac{1}{2}}} e^{-\frac{1}{2\sigma^2} Q(\beta, m_0, M_0)} \frac{1}{(2\pi\sigma^2)^{\frac{n}{2}}|V|^{\frac{1}{2}}} e^{-\frac{1}{2\sigma^2} Q(y, X\beta, V)} \\ &\propto \left(\frac{1}{\sigma^2}\right)^{a_0+\frac{n+p}{2}+1} e^{-\frac{b_0}{\sigma^2}} e^{-\frac{1}{2\sigma^2}\left(Q(\beta, m_0, M_0) + Q(y, X\beta, V)\right)} \end{align}
where
\begin{align} Q(\beta, m_0, M_0) + Q(y, X\beta, V) &= (\beta - m_0)^{\top} M_0^{-1}(\beta - m_0) + (y - X\beta)^{\top} V^{-1}(y - X\beta) \\ &= \beta^{\top} M_0^{-1} \beta - 2\beta^{\top} M_0^{-1} m_0 + m_0^{\top} M_0^{-1} m_0 + \beta^{\top} X^{\top} V^{-1} X \beta - 2\beta^{\top} X^{\top} V^{-1} y + y^{\top} V^{-1} y \\ &= \beta^{\top}\left(M_0^{-1} + X^{\top} V^{-1} X\right)\beta - 2\beta^{\top}\left(M_0^{-1} m_0 + X^{\top} V^{-1} y\right) + m_0^{\top} M_0^{-1} m_0 + y^{\top} V^{-1} y \\ &= \beta^{\top} M_1^{-1} \beta - 2\beta^{\top} m_1 + c \\ &= (\beta - M_1 m_1)^{\top} M_1^{-1}(\beta - M_1 m_1) - m_1^{\top} M_1 m_1 + c \\ &= (\beta - M_1 m_1)^{\top} M_1^{-1}(\beta - M_1 m_1) + c^{*} \end{align}
where $M_1$ is a symmetric positive definite matrix, $m_1$ is a vector, and $c$ and $c^{*}$ are scalars given by
\begin{align} M_1 &= \left(M_0^{-1} + X^{\top} V^{-1} X\right)^{-1}; \quad m_1 = M_0^{-1} m_0 + X^{\top} V^{-1} y; \\ c &= m_0^{\top} M_0^{-1} m_0 + y^{\top} V^{-1} y; \quad c^{*} = c - m_1^{\top} M_1 m_1 = m_0^{\top} M_0^{-1} m_0 + y^{\top} V^{-1} y - m_1^{\top} M_1 m_1. \end{align}
Note: $M_1$, $m_1$ and $c$ do not depend upon $\beta$.
Then, we have
\begin{align} P(\beta, \sigma^2 \mid y) &\propto \left(\frac{1}{\sigma^2}\right)^{a_0+\frac{n+p}{2}+1} e^{-\frac{b_0}{\sigma^2}} e^{-\frac{1}{2\sigma^2}\left((\beta - M_1 m_1)^{\top} M_1^{-1}(\beta - M_1 m_1) + c^{*}\right)} \\ &\propto \left(\frac{1}{\sigma^2}\right)^{a_0+\frac{n+p}{2}+1} e^{-\frac{b_0 + \frac{c^{*}}{2}}{\sigma^2}} e^{-\frac{1}{2\sigma^2}(\beta - M_1 m_1)^{\top} M_1^{-1}(\beta - M_1 m_1)} \\ &\propto \left(\frac{1}{\sigma^2}\right)^{a_0+\frac{n}{2}+1} e^{-\frac{b_0 + \frac{c^{*}}{2}}{\sigma^2}} \left(\frac{1}{\sigma^2}\right)^{\frac{p}{2}} e^{-\frac{1}{2\sigma^2}(\beta - M_1 m_1)^{\top} M_1^{-1}(\beta - M_1 m_1)} \\ &\propto IG\left(\sigma^2 \,\Big|\, a_0 + \frac{n}{2},\, b_0 + \frac{c^{*}}{2}\right) \cdot N(\beta \mid M_1 m_1, \sigma^2 M_1) \\ &= IG(\sigma^2 \mid a_1, b_1) \cdot N(\beta \mid M_1 m_1, \sigma^2 M_1) \\ &= NIG(\beta, \sigma^2 \mid M_1 m_1, M_1, a_1, b_1) \end{align}
where
\begin{align} M_1 &= \left(M_0^{-1} + X^{\top} V^{-1} X\right)^{-1}; \quad m_1 = M_0^{-1} m_0 + X^{\top} V^{-1} y; \\ a_1 &= a_0 + \frac{n}{2}; \quad b_1 = b_0 + \frac{c^{*}}{2} = b_0 + \frac{1}{2}\left(m_0^{\top} M_0^{-1} m_0 + y^{\top} V^{-1} y - m_1^{\top} M_1 m_1\right). \end{align}
Following the derivation of the marginal priors, the marginal posterior distributions are obtained by simply updating the corresponding parameters:
\begin{align} P(\sigma^2 \mid y) &= IG(\sigma^2 \mid a_1, b_1) \\ P(\beta \mid y) &= t_{2 a_1}\left(M_1 m_1, \frac{b_1}{a_1} M_1\right) \end{align}
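A minimal Python sketch of these posterior updates, reusing the simulated `X`, `y`, `V` and the prior hyperparameters `m0`, `M0`, `a0`, `b0` introduced in the earlier sketches (all names and values are illustrative):

```python
import numpy as np
from scipy.stats import invgamma, multivariate_t

# Posterior updates for the NIG prior
n, p = X.shape
V_inv = np.linalg.inv(V)
M0_inv = np.linalg.inv(M0)

M1 = np.linalg.inv(M0_inv + X.T @ V_inv @ X)
m1 = M0_inv @ m0 + X.T @ V_inv @ y
a1 = a0 + n / 2
b1 = b0 + 0.5 * (m0 @ M0_inv @ m0 + y @ V_inv @ y - m1 @ M1 @ m1)

# Marginal posteriors
sigma2_posterior = invgamma(a1, scale=b1)                                      # P(sigma2 | y) = IG(a1, b1)
beta_posterior = multivariate_t(loc=M1 @ m1, shape=(b1 / a1) * M1, df=2 * a1)  # P(beta | y) = t_{2 a1}

print("posterior mean of beta:  ", M1 @ m1)
print("posterior mean of sigma2:", b1 / (a1 - 1))   # IG mean, valid when a1 > 1
```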
1.4 Bayesian prediction
Assume $V = I_n$. Let $\tilde{y}$ denote an $\tilde{n} \times 1$ vector of new outcomes and $\tilde{X}$ the corresponding $\tilde{n} \times p$ matrix of predictors. We seek to predict $\tilde{y}$ based upon $y$:
\begin{align} P(\tilde{y} \mid y) = t_{2 a_1}\left(\tilde{X} M_1 m_1, \frac{b_1}{a_1}\left(I_{\tilde{n}} + \tilde{X} M_1 \tilde{X}^{\top}\right)\right) \end{align}
Derivation:
\begin{align} P(\beta, \sigma^2, \tilde{y} \mid y) &= P(\beta, \sigma^2 \mid y) \cdot P(\tilde{y} \mid \beta, \sigma^2, y) \\ &\propto P(\beta, \sigma^2) \cdot P(y \mid \beta, \sigma^2) \cdot P(\tilde{y} \mid \beta, \sigma^2, y) \\ &= NIG(\beta, \sigma^2 \mid m_0, M_0, a_0, b_0) \cdot N(y \mid X\beta, \sigma^2 I_n) \cdot N(\tilde{y} \mid \tilde{X}\beta, \sigma^2 I_{\tilde{n}}) \\ &\propto NIG(\beta, \sigma^2 \mid M_1 m_1, M_1, a_1, b_1) \cdot N(\tilde{y} \mid \tilde{X}\beta, \sigma^2 I_{\tilde{n}}) \\ &= IG(\sigma^2 \mid a_1, b_1) \cdot N(\beta \mid M_1 m_1, \sigma^2 M_1) \cdot N(\tilde{y} \mid \tilde{X}\beta, \sigma^2 I_{\tilde{n}}) \end{align}
Then we can calculate the posterior predictive density $P(\tilde{y} \mid y)$ from $P(\beta, \sigma^2, \tilde{y} \mid y)$:
\begin{align} P(\tilde{y} \mid y) = \iint IG(\sigma^2 \mid a_1, b_1) \cdot N(\beta \mid M_1 m_1, \sigma^2 M_1) \cdot N(\tilde{y} \mid \tilde{X}\beta, \sigma^2 I_{\tilde{n}}) \ d\beta \ d\sigma^2 \end{align}
For the inner integral $\int N\left(\beta \mid M_{1} m_{1}, \sigma^{2} M_{1}\right) \cdot N\left(\tilde{y} \mid \tilde{X} \beta, \sigma^{2} I_{\tilde{n}}\right) \ d\beta$, there is an easy derivation that avoids any integration at all. Note that we can write the above model as
\begin{align} \tilde{y} &= \tilde{X} \beta + \tilde{\epsilon}, \text{ where } \tilde{\epsilon} \sim N(0,\sigma^2 I_{\tilde{n}}) \\ \beta &= M_{1} m_{1} + \epsilon_{\beta \mid y}, \text{ where } \epsilon_{\beta \mid y} \sim N(0,\sigma^2M_{1}) \end{align}
where $\tilde{\epsilon}$ and $\epsilon_{\beta \mid y}$ are independent of each other. It then follows that
\begin{align} \tilde{y} &= \tilde{X} M_{1} m_{1} + \tilde{X} \epsilon_{\beta \mid y} + \tilde{\epsilon} \sim N(\tilde{X} M_{1} m_{1}, \sigma^2(I_{\tilde{n}} + \tilde{X} M_{1} \tilde{X}^{\top})) \end{align}
As a result
\begin{align} P(\tilde{y} \mid y) &= \int IG(\sigma^{2} \mid a_{1}, b_{1}) \cdot N\left(\tilde{y} \mid \tilde{X} M_{1} m_{1}, \sigma^2(I_{\tilde{n}} + \tilde{X} M_{1} \tilde{X}^{\top})\right) \ d\sigma^{2} \\ &= t_{2a_1}\left(\tilde{X} M_1 m_1, \frac{b_1}{a_1}(I_{\tilde{n}} + \tilde{X} M_{1} \tilde{X}^{\top})\right) \end{align}
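The closed-form predictive distribution can be assembled directly from the posterior quantities. The sketch below assumes the `M1`, `m1`, `a1`, `b1` computed above and a hypothetical matrix `X_tilde` of new predictors:

```python
import numpy as np
from scipy.stats import multivariate_t

# Hypothetical new design matrix X_tilde (any ntilde x p matrix of predictors)
rng = np.random.default_rng(1)
ntilde = 5
X_tilde = np.column_stack([np.ones(ntilde), rng.normal(size=(ntilde, p - 1))])

# Posterior predictive P(y_tilde | y) = t_{2 a1}(X_tilde M1 m1, (b1 / a1)(I + X_tilde M1 X_tilde^T))
pred_loc = X_tilde @ M1 @ m1
pred_shape = (b1 / a1) * (np.eye(ntilde) + X_tilde @ M1 @ X_tilde.T)
predictive = multivariate_t(loc=pred_loc, shape=pred_shape, df=2 * a1)

print("predictive mean:    ", pred_loc)
print("one predictive draw:", predictive.rvs(random_state=0))
```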
1.5 Sampling process
We can draw samples from the joint posterior density $P\left(\beta, \sigma^{2}, \tilde{y} \mid y\right)$ with the following sampling process (a code sketch follows the steps); for $i = 1, \ldots, N$:
Draw $\hat{\sigma}_{(i)}^{2}$ from $IG\left(a_{1}, b_{1}\right)$
Draw $\hat{\beta}_{(i)}$ from $N\left(M_{1} m_{1}, \hat{\sigma}_{(i)}^{2} M_{1}\right)$
Draw $\tilde{y}_{(i)}$ from $N\left(\tilde{X} \hat{\beta}_{(i)}, \hat{\sigma}_{(i)}^{2} I_{\tilde{n}}\right)$
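A minimal sketch of this composition sampler in Python, assuming the posterior quantities `M1`, `m1`, `a1`, `b1` and the new design matrix `X_tilde` defined in the earlier sketches:

```python
import numpy as np
from scipy.stats import invgamma

rng = np.random.default_rng(42)
n_draws = 10_000
beta_mean = M1 @ m1

# Step 1: sigma2 | y ~ IG(a1, b1)
sigma2_draws = invgamma.rvs(a1, scale=b1, size=n_draws, random_state=7)

beta_draws = np.empty((n_draws, p))
ytilde_draws = np.empty((n_draws, ntilde))
for i, s2 in enumerate(sigma2_draws):
    # Step 2: beta | sigma2, y ~ N(M1 m1, sigma2 M1)
    beta_draws[i] = rng.multivariate_normal(beta_mean, s2 * M1)
    # Step 3: y_tilde | beta, sigma2 ~ N(X_tilde beta, sigma2 I)
    ytilde_draws[i] = rng.multivariate_normal(X_tilde @ beta_draws[i], s2 * np.eye(ntilde))

# (beta_draws, sigma2_draws, ytilde_draws) are joint draws from P(beta, sigma2, y_tilde | y);
# e.g. the empirical mean of ytilde_draws approximates the predictive mean X_tilde M1 m1.
print(ytilde_draws.mean(axis=0))
```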