Chapter 1 Basics of Bayesian linear regression

1.1 Bayes’ theorem

Theorem 1.1 (Bayes' theorem) For events $A$, $B$ with $P(B) > 0$, we have

\begin{align}
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
\end{align}

We denote the unknown parameters by $U$ and the known quantities (the data) by $K$. We call $P(U)$ the prior and $P(K \mid U)$ the likelihood. Bayes' theorem then gives the posterior distribution of the unknown parameters given the known quantities: $P(U \mid K) \propto P(U)\, P(K \mid U)$.
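As a quick numerical illustration (with made-up numbers, not taken from the text), here is a minimal Python sketch of Bayes' theorem, where $P(B)$ is obtained by total probability:

```python
# Bayes' theorem with hypothetical numbers: P(A|B) = P(B|A) P(A) / P(B).
p_A = 0.01                    # prior P(A)
p_B_given_A = 0.95            # likelihood P(B | A)
p_B_given_notA = 0.05         # P(B | not A)
p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)  # P(B) by total probability
p_A_given_B = p_B_given_A * p_A / p_B                 # posterior P(A | B)
print(p_A_given_B)            # ~0.161
```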

Let $K = \{y_{n \times 1}, X_{n \times p}\}$ and assume $y \sim N(X\beta, \sigma^2 V)$, where $V$ is known and $U = \{\beta, \sigma^2\}$ is unknown. The likelihood is given by

\begin{align}
P(K \mid U) = N(y \mid X\beta, \sigma^2 V)
\end{align}
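A minimal sketch (hypothetical dimensions and parameter values; NumPy only) of simulating data from this model, taking $V = I_n$ as a simple choice:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3                                    # hypothetical sample size and dimension
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # n x p design matrix
beta_true = np.array([1.0, -2.0, 0.5])           # hypothetical coefficients
sigma2_true = 0.7                                # hypothetical noise variance
V = np.eye(n)                                    # known V; identity here for simplicity
y = rng.multivariate_normal(X @ beta_true, sigma2_true * V)     # y ~ N(X beta, sigma^2 V)
```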

1.2 Normal Inverse-Gamma (NIG) prior

1.2.1 Joint distribution of NIG prior

Definition 1.1 (Normal Inverse-Gamma Distribution) Suppose

\begin{align}
x \mid \sigma^2, \mu, M &\sim N(\mu, \sigma^2 M) \\
\sigma^2 \mid \alpha, \beta &\sim IG(\alpha, \beta)
\end{align}

Then $(x, \sigma^2)$ has a Normal-Inverse-Gamma distribution, denoted as $(x, \sigma^2) \sim NIG(\mu, M, \alpha, \beta)$.

We use a Normal Inverse-Gamma prior for $(\beta, \sigma^2)$:

\begin{align}
P(\beta, \sigma^2) &= NIG(\beta, \sigma^2 \mid m_0, M_0, a_0, b_0) \\
&= IG(\sigma^2 \mid a_0, b_0) \cdot N(\beta \mid m_0, \sigma^2 M_0) \\
&= \frac{b_0^{a_0}}{\Gamma(a_0)} \left(\frac{1}{\sigma^2}\right)^{a_0 + 1} e^{-\frac{b_0}{\sigma^2}} \cdot \frac{1}{(2\pi)^{\frac{p}{2}} \left|\sigma^2 M_0\right|^{\frac{1}{2}}} e^{-\frac{1}{2\sigma^2} Q(\beta, m_0, M_0)} \\
&= \frac{b_0^{a_0}}{\Gamma(a_0)} \left(\frac{1}{\sigma^2}\right)^{a_0 + 1} e^{-\frac{b_0}{\sigma^2}} \cdot \frac{1}{(2\pi\sigma^2)^{\frac{p}{2}} \left|M_0\right|^{\frac{1}{2}}} e^{-\frac{1}{2\sigma^2} Q(\beta, m_0, M_0)}
\end{align}

where $Q(x, m, M) = (x - m)^{\top} M^{-1} (x - m)$.

Note: the Inverse-Gamma (IG) distribution is related to the Gamma distribution. If $X \sim \text{Gamma}(\alpha, \beta)$, the density of $X$ is $f(x) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x}$. If $Y = \frac{1}{X} \sim IG(\alpha, \beta)$, the density of $Y$ is $f(y) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} y^{-\alpha - 1} e^{-\frac{\beta}{y}}$.
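A minimal sketch (hypothetical hyperparameters) of evaluating the NIG joint density as the product $IG(\sigma^2 \mid a_0, b_0) \cdot N(\beta \mid m_0, \sigma^2 M_0)$; SciPy's `invgamma` with `a=a0, scale=b0` matches the $IG(a_0, b_0)$ density above:

```python
import numpy as np
from scipy.stats import invgamma, multivariate_normal

def nig_pdf(beta, sigma2, m0, M0, a0, b0):
    """Joint NIG density: IG(sigma2 | a0, b0) * N(beta | m0, sigma2 * M0)."""
    ig = invgamma.pdf(sigma2, a=a0, scale=b0)   # b0^a0 / Gamma(a0) * sigma2^(-a0-1) * exp(-b0/sigma2)
    norm = multivariate_normal.pdf(beta, mean=m0, cov=sigma2 * M0)
    return ig * norm

# hypothetical hyperparameters for a 2-dimensional beta
m0, M0, a0, b0 = np.zeros(2), np.eye(2), 2.0, 1.0
print(nig_pdf(np.array([0.1, -0.2]), 0.8, m0, M0, a0, b0))
```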

1.2.2 Marginal distribution of NIG prior

As for the marginal priors, we can obtain them by integration:

\begin{align}
P(\sigma^2) &= \int NIG(\beta, \sigma^2 \mid m_0, M_0, a_0, b_0)\ d\beta = IG(\sigma^2 \mid a_0, b_0) \\
P(\beta) &= \int NIG(\beta, \sigma^2 \mid m_0, M_0, a_0, b_0)\ d\sigma^2 = t_{2a_0}\!\left(m_0,\ \frac{b_0}{a_0} M_0\right)
\end{align}

Derivation:

\begin{align}
P(\sigma^2) &= \int NIG(\beta, \sigma^2 \mid m_0, M_0, a_0, b_0)\ d\beta \\
&= IG(\sigma^2 \mid a_0, b_0) \int N(\beta \mid m_0, \sigma^2 M_0)\ d\beta \\
&= IG(\sigma^2 \mid a_0, b_0)
\end{align}

\begin{align}
P(\beta) &= \int NIG(\beta, \sigma^2 \mid m_0, M_0, a_0, b_0)\ d\sigma^2 \\
&= \int \frac{b_0^{a_0}}{\Gamma(a_0)} \left(\frac{1}{\sigma^2}\right)^{a_0 + 1} e^{-\frac{b_0}{\sigma^2}} \cdot \frac{1}{(2\pi\sigma^2)^{\frac{p}{2}} |M_0|^{\frac{1}{2}}} e^{-\frac{1}{2\sigma^2} Q(\beta, m_0, M_0)}\ d\sigma^2 \\
&= \frac{b_0^{a_0}}{\Gamma(a_0) (2\pi)^{\frac{p}{2}} |M_0|^{\frac{1}{2}}} \int \left(\frac{1}{\sigma^2}\right)^{a_0 + \frac{p}{2} + 1} e^{-\frac{1}{\sigma^2}\left(b_0 + \frac{1}{2} Q(\beta, m_0, M_0)\right)}\ d\sigma^2 \\
&\qquad \left(\text{let } u = \frac{1}{\sigma^2},\ \left|d\sigma^2\right| = \frac{1}{u^2}\, du\right) \\
&= \frac{b_0^{a_0}}{\Gamma(a_0) (2\pi)^{\frac{p}{2}} |M_0|^{\frac{1}{2}}} \int u^{a_0 + \frac{p}{2} + 1} e^{-\left(b_0 + \frac{1}{2} Q(\beta, m_0, M_0)\right) u} \frac{1}{u^2}\ du \\
&= \frac{b_0^{a_0}}{\Gamma(a_0) (2\pi)^{\frac{p}{2}} |M_0|^{\frac{1}{2}}} \int u^{a_0 + \frac{p}{2} - 1} e^{-\left(b_0 + \frac{1}{2} Q(\beta, m_0, M_0)\right) u}\ du \\
&\qquad \left(\text{by the Gamma integral: } \int_0^{\infty} x^{\alpha - 1} e^{-\beta x}\ dx = \frac{\Gamma(\alpha)}{\beta^{\alpha}}\right) \\
&= \frac{b_0^{a_0}}{\Gamma(a_0) (2\pi)^{\frac{p}{2}} |M_0|^{\frac{1}{2}}}\, \Gamma\!\left(a_0 + \frac{p}{2}\right) \left[b_0 + \frac{1}{2} Q(\beta, m_0, M_0)\right]^{-\left(a_0 + \frac{p}{2}\right)} \\
&= \frac{b_0^{a_0}\, \Gamma\!\left(a_0 + \frac{p}{2}\right)}{\Gamma(a_0) (2\pi)^{\frac{p}{2}} |M_0|^{\frac{1}{2}}} \left[b_0 \left(1 + \frac{1}{2 b_0} Q(\beta, m_0, M_0)\right)\right]^{-\left(a_0 + \frac{p}{2}\right)} \\
&= \frac{b_0^{a_0}\, \Gamma\!\left(a_0 + \frac{p}{2}\right) b_0^{-\left(a_0 + \frac{p}{2}\right)}}{\Gamma(a_0) (2\pi)^{\frac{p}{2}} |M_0|^{\frac{1}{2}}} \left[1 + \frac{1}{2 b_0} (\beta - m_0)^{\top} M_0^{-1} (\beta - m_0)\right]^{-\left(a_0 + \frac{p}{2}\right)} \\
&= \frac{\Gamma\!\left(a_0 + \frac{p}{2}\right)}{(2\pi)^{\frac{p}{2}} b_0^{\frac{p}{2}} \Gamma(a_0) |M_0|^{\frac{1}{2}}} \left[1 + \frac{1}{2 b_0} (\beta - m_0)^{\top} M_0^{-1} (\beta - m_0)\right]^{-\left(a_0 + \frac{p}{2}\right)} \\
&= \frac{\Gamma\!\left(a_0 + \frac{p}{2}\right)}{(2\pi)^{\frac{p}{2}} \left(a_0 \frac{b_0}{a_0}\right)^{\frac{p}{2}} \Gamma(a_0) |M_0|^{\frac{1}{2}}} \left[1 + \frac{1}{2 a_0 \frac{b_0}{a_0}} (\beta - m_0)^{\top} M_0^{-1} (\beta - m_0)\right]^{-\left(a_0 + \frac{p}{2}\right)} \\
&= \frac{\Gamma\!\left(a_0 + \frac{p}{2}\right)}{(2 a_0 \pi)^{\frac{p}{2}} \Gamma(a_0) \left|\frac{b_0}{a_0} M_0\right|^{\frac{1}{2}}} \left[1 + \frac{1}{2 a_0} (\beta - m_0)^{\top} \left(\frac{b_0}{a_0} M_0\right)^{-1} (\beta - m_0)\right]^{-\left(a_0 + \frac{p}{2}\right)} \\
&= t_{2 a_0}\!\left(m_0, \frac{b_0}{a_0} M_0\right)
\end{align}

Note: the density of the multivariate $t$-distribution is given by

\begin{align}
t_{v}(\mu, \Sigma) = \frac{\Gamma\!\left(\frac{v + p}{2}\right)}{(v\pi)^{\frac{p}{2}}\, \Gamma\!\left(\frac{v}{2}\right) |\Sigma|^{\frac{1}{2}}} \left[1 + \frac{1}{v} (x - \mu)^{\top} \Sigma^{-1} (x - \mu)\right]^{-\frac{v + p}{2}}
\end{align}
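The marginal prior of $\beta$ can be checked by simulation: draw $(\beta, \sigma^2)$ from the NIG prior by composition and compare the sample covariance of $\beta$ with that of $t_{2a_0}(m_0, \tfrac{b_0}{a_0} M_0)$, which is $\tfrac{2a_0}{2a_0 - 2} \cdot \tfrac{b_0}{a_0} M_0$ when $a_0 > 1$. A minimal sketch with hypothetical hyperparameters:

```python
import numpy as np
from scipy.stats import invgamma

rng = np.random.default_rng(1)
m0, M0, a0, b0 = np.zeros(2), np.eye(2), 3.0, 2.0    # hypothetical hyperparameters
S = 200_000

# composition sampling from the prior: sigma^2 ~ IG(a0, b0), then beta | sigma^2 ~ N(m0, sigma^2 M0)
sigma2 = invgamma.rvs(a=a0, scale=b0, size=S, random_state=rng)
z = rng.multivariate_normal(np.zeros(2), M0, size=S)
beta = m0 + np.sqrt(sigma2)[:, None] * z

nu = 2 * a0                                          # degrees of freedom of the marginal t
print(np.cov(beta, rowvar=False))                    # Monte Carlo covariance of beta
print(nu / (nu - 2) * (b0 / a0) * M0)                # covariance of t_{2a0}(m0, (b0/a0) M0)
```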

1.3 Posterior distribution

The posterior distribution of $(\beta, \sigma^2)$ is given by

\begin{align}
P(\beta, \sigma^2 \mid y) = NIG(\beta, \sigma^2 \mid M_1 m_1, M_1, a_1, b_1)
\end{align}

where

\begin{align}
M_1 &= \left(M_0^{-1} + X^{\top} V^{-1} X\right)^{-1}; & m_1 &= M_0^{-1} m_0 + X^{\top} V^{-1} y; \\
a_1 &= a_0 + \frac{n}{2}; & b_1 &= b_0 + \frac{1}{2}\left(m_0^{\top} M_0^{-1} m_0 + y^{\top} V^{-1} y - m_1^{\top} M_1 m_1\right).
\end{align}
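These updates are a few lines of linear algebra. A minimal sketch (hypothetical simulated data and weakly informative hyperparameters) that computes $M_1$, $m_1$, $a_1$, $b_1$:

```python
import numpy as np

def nig_posterior(y, X, V, m0, M0, a0, b0):
    """NIG posterior updates: returns (M1, m1, a1, b1)."""
    n = len(y)
    Vinv = np.linalg.inv(V)
    M0inv = np.linalg.inv(M0)
    M1 = np.linalg.inv(M0inv + X.T @ Vinv @ X)
    m1 = M0inv @ m0 + X.T @ Vinv @ y
    a1 = a0 + n / 2
    b1 = b0 + 0.5 * (m0 @ M0inv @ m0 + y @ Vinv @ y - m1 @ M1 @ m1)
    return M1, m1, a1, b1

# hypothetical data: n = 50 observations, p = 2 predictors
rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, -2.0]) + rng.normal(scale=0.5, size=n)
M1, m1, a1, b1 = nig_posterior(y, X, np.eye(n), np.zeros(2), 100 * np.eye(2), 2.0, 1.0)
print(M1 @ m1)     # posterior mean of beta, close to (1, -2) here
```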

Derivation:

\begin{align}
P(\beta, \sigma^2 \mid y) &\propto NIG(\beta, \sigma^2 \mid m_0, M_0, a_0, b_0) \cdot N(y \mid X\beta, \sigma^2 V) \\
&= IG(\sigma^2 \mid a_0, b_0) \cdot N(\beta \mid m_0, \sigma^2 M_0) \cdot N(y \mid X\beta, \sigma^2 V) \\
&= \frac{b_0^{a_0}}{\Gamma(a_0)} \left(\frac{1}{\sigma^2}\right)^{a_0 + 1} e^{-\frac{b_0}{\sigma^2}} \cdot \frac{1}{(2\pi\sigma^2)^{\frac{p}{2}} |M_0|^{\frac{1}{2}}} e^{-\frac{1}{2\sigma^2} Q(\beta, m_0, M_0)} \cdot \frac{1}{(2\pi\sigma^2)^{\frac{n}{2}} |V|^{\frac{1}{2}}} e^{-\frac{1}{2\sigma^2} Q(y, X\beta, V)} \\
&\propto \left(\frac{1}{\sigma^2}\right)^{a_0 + \frac{n + p}{2} + 1} e^{-\frac{b_0}{\sigma^2}}\, e^{-\frac{1}{2\sigma^2}\left(Q(\beta, m_0, M_0) + Q(y, X\beta, V)\right)}
\end{align}

where

\begin{align}
Q(\beta, m_0, M_0) + Q(y, X\beta, V) &= (\beta - m_0)^{\top} M_0^{-1} (\beta - m_0) + (y - X\beta)^{\top} V^{-1} (y - X\beta) \\
&= \beta^{\top} M_0^{-1} \beta - 2 \beta^{\top} M_0^{-1} m_0 + m_0^{\top} M_0^{-1} m_0 + \beta^{\top} X^{\top} V^{-1} X \beta - 2 \beta^{\top} X^{\top} V^{-1} y + y^{\top} V^{-1} y \\
&= \beta^{\top} \left(M_0^{-1} + X^{\top} V^{-1} X\right) \beta - 2 \beta^{\top} \left(M_0^{-1} m_0 + X^{\top} V^{-1} y\right) + m_0^{\top} M_0^{-1} m_0 + y^{\top} V^{-1} y \\
&= \beta^{\top} M_1^{-1} \beta - 2 \beta^{\top} m_1 + c \\
&= (\beta - M_1 m_1)^{\top} M_1^{-1} (\beta - M_1 m_1) - m_1^{\top} M_1 m_1 + c \\
&= (\beta - M_1 m_1)^{\top} M_1^{-1} (\beta - M_1 m_1) + c^*
\end{align}

where $M_1$ is a symmetric positive definite matrix, $m_1$ is a $p \times 1$ vector, and $c$ and $c^*$ are scalars given by

\begin{align}
M_1 &= \left(M_0^{-1} + X^{\top} V^{-1} X\right)^{-1}; & m_1 &= M_0^{-1} m_0 + X^{\top} V^{-1} y; \\
c &= m_0^{\top} M_0^{-1} m_0 + y^{\top} V^{-1} y; & c^* &= c - m_1^{\top} M_1 m_1 = m_0^{\top} M_0^{-1} m_0 + y^{\top} V^{-1} y - m_1^{\top} M_1 m_1.
\end{align}

Note: $M_1$, $m_1$, $c$ and $c^*$ do not depend on $\beta$.

Then, we have

\begin{align}
P(\beta, \sigma^2 \mid y) &\propto \left(\frac{1}{\sigma^2}\right)^{a_0 + \frac{n + p}{2} + 1} e^{-\frac{b_0}{\sigma^2}}\, e^{-\frac{1}{2\sigma^2}\left((\beta - M_1 m_1)^{\top} M_1^{-1} (\beta - M_1 m_1) + c^*\right)} \\
&= \left(\frac{1}{\sigma^2}\right)^{a_0 + \frac{n + p}{2} + 1} e^{-\frac{b_0 + \frac{c^*}{2}}{\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(\beta - M_1 m_1)^{\top} M_1^{-1} (\beta - M_1 m_1)} \\
&= \left(\frac{1}{\sigma^2}\right)^{a_0 + \frac{n}{2} + 1} e^{-\frac{b_0 + \frac{c^*}{2}}{\sigma^2}} \cdot \left(\frac{1}{\sigma^2}\right)^{\frac{p}{2}} e^{-\frac{1}{2\sigma^2}(\beta - M_1 m_1)^{\top} M_1^{-1} (\beta - M_1 m_1)} \\
&\propto IG\!\left(\sigma^2 \,\middle|\, a_0 + \frac{n}{2},\ b_0 + \frac{c^*}{2}\right) \cdot N(\beta \mid M_1 m_1, \sigma^2 M_1) \\
&= IG(\sigma^2 \mid a_1, b_1) \cdot N(\beta \mid M_1 m_1, \sigma^2 M_1) \\
&= NIG(\beta, \sigma^2 \mid M_1 m_1, M_1, a_1, b_1)
\end{align}

where

\begin{align}
M_1 &= \left(M_0^{-1} + X^{\top} V^{-1} X\right)^{-1}; & m_1 &= M_0^{-1} m_0 + X^{\top} V^{-1} y; \\
a_1 &= a_0 + \frac{n}{2}; & b_1 &= b_0 + \frac{c^*}{2} = b_0 + \frac{1}{2}\left(m_0^{\top} M_0^{-1} m_0 + y^{\top} V^{-1} y - m_1^{\top} M_1 m_1\right).
\end{align}

From the derivation of the marginal priors, the marginal posterior distributions can be obtained by simply updating the corresponding parameters:

\begin{align}
P(\sigma^2 \mid y) &= IG(\sigma^2 \mid a_1, b_1) \\
P(\beta \mid y) &= t_{2a_1}\!\left(M_1 m_1,\ \frac{b_1}{a_1} M_1\right)
\end{align}
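For reporting, the marginal posterior of each coefficient $\beta_j$ is a univariate $t$ with $2a_1$ degrees of freedom, location $(M_1 m_1)_j$ and scale $\sqrt{\tfrac{b_1}{a_1}(M_1)_{jj}}$, so equal-tailed credible intervals come from $t$ quantiles. A minimal sketch with hypothetical posterior quantities (e.g. as produced by the update sketch above):

```python
import numpy as np
from scipy.stats import t as student_t

def beta_credible_intervals(M1, m1, a1, b1, level=0.95):
    """Posterior means and equal-tailed credible intervals for each beta_j,
    using the marginal posterior t_{2a1}(M1 m1, (b1/a1) M1)."""
    mean = M1 @ m1
    scale = np.sqrt((b1 / a1) * np.diag(M1))         # per-coordinate t scale
    q = student_t.ppf(0.5 + level / 2, df=2 * a1)    # t quantile
    return mean, mean - q * scale, mean + q * scale

# hypothetical posterior quantities
M1, m1, a1, b1 = 0.02 * np.eye(2), np.array([50.0, -100.0]), 27.0, 8.0
print(beta_credible_intervals(M1, m1, a1, b1))
```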

1.4 Bayesian prediction

Assume $V = I_n$. Let $\tilde{y}$ denote an $\tilde{n} \times 1$ vector of outcomes to be predicted and let $\tilde{X}$ be the corresponding $\tilde{n} \times p$ matrix of predictors. We seek to predict $\tilde{y}$ based upon $y$:

\begin{align}
P(\tilde{y} \mid y) = t_{2a_1}\!\left(\tilde{X} M_1 m_1,\ \frac{b_1}{a_1}\left(I_{\tilde{n}} + \tilde{X} M_1 \tilde{X}^{\top}\right)\right)
\end{align}

Derivation:

\begin{align}
P(\beta, \sigma^2, \tilde{y} \mid y) &= P(\beta, \sigma^2 \mid y) \cdot P(\tilde{y} \mid \beta, \sigma^2, y) \\
&\propto P(\beta, \sigma^2) \cdot P(y \mid \beta, \sigma^2) \cdot P(\tilde{y} \mid \beta, \sigma^2, y) \\
&= NIG(\beta, \sigma^2 \mid m_0, M_0, a_0, b_0) \cdot N(y \mid X\beta, \sigma^2 I_n) \cdot N(\tilde{y} \mid \tilde{X}\beta, \sigma^2 I_{\tilde{n}}) \\
&\propto NIG(\beta, \sigma^2 \mid M_1 m_1, M_1, a_1, b_1) \cdot N(\tilde{y} \mid \tilde{X}\beta, \sigma^2 I_{\tilde{n}}) \\
&= IG(\sigma^2 \mid a_1, b_1) \cdot N(\beta \mid M_1 m_1, \sigma^2 M_1) \cdot N(\tilde{y} \mid \tilde{X}\beta, \sigma^2 I_{\tilde{n}})
\end{align}

Then we can calculate the posterior predictive density $P(\tilde{y} \mid y)$ from $P(\beta, \sigma^2, \tilde{y} \mid y)$ by integrating out $\beta$ and $\sigma^2$:

\begin{align}
P(\tilde{y} \mid y) &= \iint P(\beta, \sigma^2, \tilde{y} \mid y)\ d\beta\ d\sigma^2 \\
&= \int IG(\sigma^{2} \mid a_{1}, b_{1}) \left[\int N(\beta \mid M_{1} m_{1}, \sigma^{2} M_{1}) \cdot N(\tilde{y} \mid \tilde{X} \beta, \sigma^{2} I_{\tilde{n}})\ d\beta\right] d\sigma^{2}
\end{align}

As for $\int N\left(\beta \mid M_{1} m_{1}, \sigma^{2} M_{1}\right) \cdot N\left(\tilde{y} \mid \tilde{X} \beta, \sigma^{2} I_{\tilde{n}}\right) d\beta$, we provide an easy way to derive it that avoids any integration. Note that we can write the above model as

\begin{align} \tilde{y} &= \tilde{X} \beta + \tilde{\epsilon}, \text{ where } \tilde{\epsilon} \sim N(0,\sigma^2 I_{\tilde{n}}) \\ \beta &= M_{1} m_{1} + \epsilon_{\beta \mid y}, \text{ where } \epsilon_{\beta \mid y} \sim N(0,\sigma^2M_{1}) \end{align}

where $\tilde{\epsilon}$ and $\epsilon_{\beta \mid y}$ are independent of each other. It then follows that

\begin{align} \tilde{y} &= \tilde{X} M_{1} m_{1} + \tilde{X} \epsilon_{\beta \mid y} + \tilde{\epsilon} \sim N(\tilde{X} M_{1} m_{1}, \sigma^2(I_{\tilde{n}} + \tilde{X} M_{1} \tilde{X}^{\top})) \end{align}

As a result

\begin{align} P(\tilde{y} \mid y) &= \int IG(\sigma^{2} \mid a_{1}, b_{1}) \cdot N(\tilde{y} \mid \tilde{X} M_{1} m_{1}, \sigma^2(I_{\tilde{n}} + \tilde{X} M_{1} \tilde{X}^{\top}))\ d\sigma^{2} \\ &= t_{2a_1}\left(\tilde{X} M_1 m_1, \frac{b_1}{a_1}(I_{\tilde{n}} + \tilde{X} M_{1} \tilde{X}^{\top})\right) \end{align}
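A minimal sketch (hypothetical posterior quantities and new design points) that computes the location, scale matrix, and degrees of freedom of this predictive $t$ density:

```python
import numpy as np

def predictive_t_params(X_new, M1, m1, a1, b1):
    """Parameters of P(y_new | y) = t_{2a1}(X_new M1 m1, (b1/a1)(I + X_new M1 X_new^T))."""
    loc = X_new @ M1 @ m1
    scale = (b1 / a1) * (np.eye(X_new.shape[0]) + X_new @ M1 @ X_new.T)
    return loc, scale, 2 * a1

# hypothetical posterior quantities and two new design points
M1, m1, a1, b1 = 0.02 * np.eye(2), np.array([50.0, -100.0]), 27.0, 8.0
X_new = np.array([[1.0, 0.0], [1.0, 1.5]])
loc, scale, df = predictive_t_params(X_new, M1, m1, a1, b1)
print(loc)             # predictive location (mean when df > 1)
print(np.diag(scale))  # predictive scale; the variance is df/(df-2) times this when df > 2
```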

1.5 Sampling process

We can draw samples from the joint posterior density $P\left(\beta, \sigma^{2}, \tilde{y} \mid y\right)$ by the following sampling process. For each draw $i$:

  1. Draw $\hat{\sigma}_{(i)}^{2}$ from $IG\left(a_{1}, b_{1}\right)$

  2. Draw $\hat{\beta}_{(i)}$ from $N\left(M_{1} m_{1}, \hat{\sigma}_{(i)}^{2} M_{1}\right)$

  3. Draw $\tilde{y}_{(i)}$ from $N\left(\tilde{X} \hat{\beta}_{(i)}, \hat{\sigma}_{(i)}^{2} I_{\tilde{n}}\right)$
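A minimal sketch of this sampler (hypothetical posterior quantities and new design points; NumPy/SciPy), drawing from the three conditionals in turn:

```python
import numpy as np
from scipy.stats import invgamma

def sample_joint_posterior(X_new, M1, m1, a1, b1, n_draws=1000, seed=0):
    """Composition sampling of (sigma^2, beta, y_new) from P(beta, sigma^2, y_new | y)."""
    rng = np.random.default_rng(seed)
    n_new, p = X_new.shape
    post_mean = M1 @ m1
    sigma2 = invgamma.rvs(a=a1, scale=b1, size=n_draws, random_state=rng)   # step 1
    beta = np.empty((n_draws, p))
    y_new = np.empty((n_draws, n_new))
    for i in range(n_draws):
        beta[i] = rng.multivariate_normal(post_mean, sigma2[i] * M1)        # step 2
        y_new[i] = rng.multivariate_normal(X_new @ beta[i], sigma2[i] * np.eye(n_new))  # step 3
    return sigma2, beta, y_new

# hypothetical inputs
M1, m1, a1, b1 = 0.02 * np.eye(2), np.array([50.0, -100.0]), 27.0, 8.0
X_new = np.array([[1.0, 0.0], [1.0, 1.5]])
sigma2, beta, y_new = sample_joint_posterior(X_new, M1, m1, a1, b1)
print(y_new.mean(axis=0))     # Monte Carlo estimate of the predictive mean
```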