Chapter 1 Basics of Bayesian linear regression

1.1 Bayes’ theorem

Theorem 1.1 (Bayes' theorem) For events $U, K$ with $P(K) > 0$, we have

$$P(U \mid K) = \frac{P(K \mid U)\,P(U)}{P(K)}$$

We denote by $U$ the unknown parameters and by $K$ the known quantities. We call $P(U)$ the prior and $P(K \mid U)$ the likelihood. Bayes' theorem gives us the posterior distribution of the unknown parameters given the known quantities: $P(U \mid K) \propto P(U)\,P(K \mid U)$.
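As a quick numerical illustration (not from the text; all numbers below are made up), Bayes' theorem can be applied directly in Python:

```python
# Minimal illustration of Bayes' theorem with made-up numbers:
# a diagnostic test with 95% sensitivity, 90% specificity, 1% prevalence.
p_U = 0.01                # P(U): prior probability of the unknown event
p_K_given_U = 0.95        # P(K | U): likelihood of observing K when U holds
p_K_given_notU = 0.10     # P(K | not U)

# P(K) via the law of total probability
p_K = p_K_given_U * p_U + p_K_given_notU * (1 - p_U)

# Posterior P(U | K) = P(K | U) P(U) / P(K)
p_U_given_K = p_K_given_U * p_U / p_K
print(round(p_U_given_K, 4))  # 0.0876
```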

1.2 Normal-Inverse-Gamma (NIG) prior

1.2.1 Joint distribution of NIG prior

Definition 1.1 (Normal-Inverse-Gamma Distribution) Suppose $\beta \mid \sigma^2, \mu, M \sim N(\mu, \sigma^2 M)$ and $\sigma^2 \mid a, b \sim IG(a, b)$. Then $(\beta, \sigma^2)$ has a Normal-Inverse-Gamma distribution, denoted as $(\beta, \sigma^2) \sim NIG(\mu, M, a, b)$.

We use a Normal-Inverse-Gamma prior for $(\beta, \sigma^2)$:

$$P(\beta, \sigma^2) = NIG(\beta, \sigma^2 \mid m_0, M_0, a_0, b_0) = \frac{b_0^{a_0}}{\Gamma(a_0)} \left(\frac{1}{\sigma^2}\right)^{a_0 + 1} e^{-\frac{b_0}{\sigma^2}} \times \frac{1}{(2\pi\sigma^2)^{\frac{p}{2}} |M_0|^{\frac{1}{2}}} e^{-\frac{1}{2\sigma^2} Q(\beta, m_0, M_0)}$$

where $Q(x, m, M) = (x - m)^{T} M^{-1} (x - m)$.
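Because the NIG density factors as a normal conditional on an inverse gamma, we can sample from it by composition. Below is a minimal Python sketch (the hyperparameter values and the helper name `sample_nig` are hypothetical, not from the text); note that `scipy.stats.invgamma` with `scale=b` matches the $IG(a, b)$ parameterization used above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical hyperparameters with p = 2
m0 = np.zeros(2)
M0 = np.eye(2)
a0, b0 = 2.0, 1.0

def sample_nig(m0, M0, a0, b0, size, rng):
    """Draw (beta, sigma2) from NIG(m0, M0, a0, b0) by composition:
    sigma2 ~ IG(a0, b0), then beta | sigma2 ~ N(m0, sigma2 * M0)."""
    sigma2 = stats.invgamma.rvs(a0, scale=b0, size=size, random_state=rng)
    beta = np.array([rng.multivariate_normal(m0, s2 * M0) for s2 in sigma2])
    return beta, sigma2

beta_draws, sigma2_draws = sample_nig(m0, M0, a0, b0, size=5, rng=rng)
```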

1.2.2 Marginal distribution of NIG prior

The marginal priors can be obtained by integration:

$$P(\sigma^2) = \int NIG(\beta, \sigma^2 \mid m_0, M_0, a_0, b_0)\, d\beta = IG(\sigma^2 \mid a_0, b_0)$$
$$P(\beta) = \int NIG(\beta, \sigma^2 \mid m_0, M_0, a_0, b_0)\, d\sigma^2 = t_{2a_0}\!\left(m_0, \frac{b_0}{a_0} M_0\right)$$

Note: the density of the multivariate $t$-distribution $t_v(\mu, \Sigma)$ is given by

$$t_v(\mu, \Sigma) = \frac{\Gamma\!\left(\frac{v + p}{2}\right)}{(v\pi)^{\frac{p}{2}}\, \Gamma\!\left(\frac{v}{2}\right)\, |\Sigma|^{\frac{1}{2}}} \left(1 + \frac{1}{v}(x - \mu)^{T} \Sigma^{-1} (x - \mu)\right)^{-\frac{v + p}{2}}$$
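A quick Monte Carlo check of the marginal for $\beta$, reusing the `sample_nig` sketch above: sample quantiles of one coordinate of $\beta$ should be close to the quantiles of a univariate $t_{2a_0}$ with location $m_{0,1}$ and scale $\sqrt{(b_0/a_0)\, M_{0,11}}$.

```python
# Continuing the sketch above: compare empirical quantiles of beta_1 under the
# NIG prior with the implied Student-t quantiles.
beta_draws, _ = sample_nig(m0, M0, a0, b0, size=100_000, rng=rng)

df = 2 * a0
scale = np.sqrt((b0 / a0) * M0[0, 0])
probs = [0.05, 0.25, 0.50, 0.75, 0.95]

empirical = np.quantile(beta_draws[:, 0], probs)
theoretical = stats.t.ppf(probs, df=df, loc=m0[0], scale=scale)
print(np.round(empirical, 3))
print(np.round(theoretical, 3))  # the two rows should agree closely
```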

1.3 Conjugate Bayesian linear regression and M&m formula

Let $y$ be an $n \times 1$ vector of outcomes and $X$ the corresponding $n \times p$ matrix of covariates. Assume $V$ is known. The model is given by

$$\begin{aligned}
y &= X\beta + \epsilon, \quad \epsilon \sim N(0, \sigma^2 V) \\
\beta &= m_0 + \omega, \quad \omega \sim N(0, \sigma^2 M_0) \\
\sigma^2 &\sim IG(a_0, b_0)
\end{aligned}$$

The posterior distribution of $(\beta, \sigma^2)$ is given by

$$P(\beta, \sigma^2 \mid y) = NIG(\beta, \sigma^2 \mid M_1 m_1, M_1, a_1, b_1)$$

where

$$\begin{aligned}
M_1^{-1} &= M_0^{-1} + X^{T} V^{-1} X \\
m_1 &= M_0^{-1} m_0 + X^{T} V^{-1} y \\
a_1 &= a_0 + \frac{n}{2} \\
b_1 &= b_0 + \frac{c}{2} = b_0 + \frac{1}{2}\left(m_0^{T} M_0^{-1} m_0 + y^{T} V^{-1} y - m_1^{T} M_1 m_1\right)
\end{aligned}$$

From the derivation of the marginal priors, the marginal posterior distributions can be obtained by simply updating the corresponding parameters:

$$P(\sigma^2 \mid y) = IG(\sigma^2 \mid a_1, b_1)$$
$$P(\beta \mid y) = t_{2a_1}\!\left(M_1 m_1, \frac{b_1}{a_1} M_1\right)$$
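A minimal Python sketch of the conjugate update above (the helper name `nig_posterior` is ours, and plain matrix inversion is used rather than more stable factorizations):

```python
import numpy as np

def nig_posterior(y, X, V, m0, M0, a0, b0):
    """Conjugate NIG update for y = X beta + eps, eps ~ N(0, sigma2 * V).
    Returns the posterior mean M1 m1 along with (M1, a1, b1)."""
    n = y.shape[0]
    Vinv = np.linalg.inv(V)
    M0inv = np.linalg.inv(M0)
    M1 = np.linalg.inv(M0inv + X.T @ Vinv @ X)   # M1^{-1} = M0^{-1} + X' V^{-1} X
    m1 = M0inv @ m0 + X.T @ Vinv @ y             # m1 = M0^{-1} m0 + X' V^{-1} y
    a1 = a0 + n / 2
    b1 = b0 + 0.5 * (m0 @ M0inv @ m0 + y @ Vinv @ y - m1 @ M1 @ m1)
    return M1 @ m1, M1, a1, b1
```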

1.4 Updating form of the posterior distribution

We will use two ways to derive the updating form of the posterior distribution.

1.4.1 Method 1: Sherman-Woodbury-Morrison identity

Theorem 1.2 (Sherman-Woodbury-Morrison identity) We have $$(A + BDC)^{-1} = A^{-1} - A^{-1} B \left(D^{-1} + C A^{-1} B\right)^{-1} C A^{-1}$$ where $A$ and $D$ are square matrices that are invertible and $B$ and $C$ are rectangular (square if $A$ and $D$ have the same dimensions) matrices such that the multiplications are well-defined.

The Sherman-Woodbury-Morrison identity is easily verified by multiplying the right-hand side by $A + BDC$ and simplifying to reduce it to the identity matrix. Using this formula, we have

$$M_1 = \left(M_0^{-1} + X^{T} V^{-1} X\right)^{-1} = M_0 - M_0 X^{T} \left(V + X M_0 X^{T}\right)^{-1} X M_0 = M_0 - M_0 X^{T} Q^{-1} X M_0$$

where $Q = V + X M_0 X^{T}$.
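A quick numerical spot-check of the identity in Python, with arbitrary random matrices (dimensions chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

n, k = 5, 3
W = rng.normal(size=(n, n))
A = W @ W.T + n * np.eye(n)          # symmetric positive definite, hence invertible
B = rng.normal(size=(n, k))
C = rng.normal(size=(k, n))
D = np.eye(k)

lhs = np.linalg.inv(A + B @ D @ C)
Ainv = np.linalg.inv(A)
rhs = Ainv - Ainv @ B @ np.linalg.inv(np.linalg.inv(D) + C @ Ainv @ B) @ C @ Ainv
print(np.allclose(lhs, rhs))  # True
```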

We can show that $M_1 m_1 = m_0 + M_0 X^{T} Q^{-1} (y - X m_0)$.

Furthermore, we can simplify $m_0^{T} M_0^{-1} m_0 + y^{T} V^{-1} y - m_1^{T} M_1 m_1 = (y - X m_0)^{T} Q^{-1} (y - X m_0)$.

Thus, we get the following updating form of the posterior distribution for Bayesian linear regression:

$$P(\beta, \sigma^2 \mid y) = NIG(\beta, \sigma^2 \mid \tilde{m}_1, \tilde{M}_1, a_1, b_1)$$ where

$$\begin{aligned}
\tilde{m}_1 &= M_1 m_1 = m_0 + M_0 X^{T} Q^{-1} (y - X m_0) \\
\tilde{M}_1 &= M_1 = M_0 - M_0 X^{T} Q^{-1} X M_0 \\
a_1 &= a_0 + \frac{n}{2} \\
b_1 &= b_0 + \frac{1}{2} (y - X m_0)^{T} Q^{-1} (y - X m_0) \\
Q &= V + X M_0 X^{T}
\end{aligned}$$
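As a check (with simulated data, not from the text), the updating form above can be compared numerically against the direct M&m update, reusing the `nig_posterior` sketch from Section 1.3:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data for illustration only
n, p = 20, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)
V = np.eye(n)
m0, M0, a0, b0 = np.zeros(p), np.eye(p), 2.0, 1.0

# Updating form via Q = V + X M0 X'
Q = V + X @ M0 @ X.T
m1_tilde = m0 + M0 @ X.T @ np.linalg.solve(Q, y - X @ m0)
M1_tilde = M0 - M0 @ X.T @ np.linalg.solve(Q, X @ M0)
b1_tilde = b0 + 0.5 * (y - X @ m0) @ np.linalg.solve(Q, y - X @ m0)

# Direct M&m update (nig_posterior defined earlier)
post_mean, M1, a1, b1 = nig_posterior(y, X, V, m0, M0, a0, b0)
print(np.allclose(m1_tilde, post_mean),
      np.allclose(M1_tilde, M1),
      np.isclose(b1_tilde, b1))  # True True True
```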

1.4.2 Method 2: Distribution theory

Previously, we obtained the Bayesian linear regression updater using the Sherman-Woodbury-Morrison identity. Here, we derive the same results without resorting to it. Recall that the model is given by

$$\begin{aligned}
y &= X\beta + \epsilon, \quad \epsilon \sim N(0, \sigma^2 V) \\
\beta &= m_0 + \omega, \quad \omega \sim N(0, \sigma^2 M_0) \\
\sigma^2 &\sim IG(a_0, b_0)
\end{aligned}$$

This corresponds to the posterior distribution

$$P(\beta, \sigma^2 \mid y) \propto IG(\sigma^2 \mid a_0, b_0) \times N(\beta \mid m_0, \sigma^2 M_0) \times N(y \mid X\beta, \sigma^2 V)$$

We will derive $P(\sigma^2 \mid y)$ and $P(\beta \mid \sigma^2, y)$ in a form that reflects the updates from the prior to the posterior. Integrating out $\beta$ from the model is equivalent to substituting $\beta$ with its prior model. Thus, $P(y \mid \sigma^2)$ is derived simply from

$$y = X\beta + \epsilon = X(m_0 + \omega) + \epsilon = X m_0 + X\omega + \epsilon = X m_0 + \eta$$

where $\eta = X\omega + \epsilon \sim N(0, \sigma^2 Q)$ and $Q = X M_0 X^{T} + V$.

Therefore,

$$y \mid \sigma^2 \sim N(X m_0, \sigma^2 Q)$$

The posterior distribution is given by $P(\sigma^2 \mid y) = IG(\sigma^2 \mid a_1, b_1)$

where

$$a_1 = a_0 + \frac{n}{2}, \qquad b_1 = b_0 + \frac{1}{2} (y - X m_0)^{T} Q^{-1} (y - X m_0)$$

Next, we turn to $P(\beta \mid \sigma^2, y)$. Note that

$$\begin{pmatrix} y \\ \beta \end{pmatrix} \Bigg|\, \sigma^2 \sim N\!\left( \begin{pmatrix} X m_0 \\ m_0 \end{pmatrix}, \; \sigma^2 \begin{pmatrix} Q & X M_0 \\ M_0 X^{T} & M_0 \end{pmatrix} \right)$$

From the expression for a conditional distribution derived from a multivariate Gaussian, we obtain $\beta \mid \sigma^2, y \sim N(\tilde{m}_1, \sigma^2 \tilde{M}_1)$

where

$$\tilde{m}_1 = E[\beta \mid \sigma^2, y] = m_0 + M_0 X^{T} Q^{-1} (y - X m_0), \qquad \tilde{M}_1 = M_0 - M_0 X^{T} Q^{-1} X M_0$$
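For reference, the step above uses the standard formula for conditioning within a multivariate Gaussian (a standard fact, not stated explicitly in the text): if

$$\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \sim N\!\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \right), \quad \text{then} \quad x_2 \mid x_1 \sim N\!\left( \mu_2 + \Sigma_{21} \Sigma_{11}^{-1} (x_1 - \mu_1), \; \Sigma_{22} - \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12} \right)$$

Applying this with $x_1 = y$, $x_2 = \beta$, $\Sigma_{11} = \sigma^2 Q$, $\Sigma_{21} = \sigma^2 M_0 X^{T}$, and $\Sigma_{22} = \sigma^2 M_0$ gives exactly $\tilde{m}_1$ and $\tilde{M}_1$ above.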

1.5 Bayesian prediction

Assume $V = I_n$. Let $\tilde{y}$ denote an $\tilde{n} \times 1$ vector of new outcomes and $\tilde{X}$ the corresponding $\tilde{n} \times p$ matrix of predictors. We seek to predict $\tilde{y}$ based upon $y$:

$$P(\tilde{y} \mid y) = t_{2a_1}\!\left(\tilde{X} M_1 m_1, \; \frac{b_1}{a_1}\left(I_{\tilde{n}} + \tilde{X} M_1 \tilde{X}^{T}\right)\right)$$
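A minimal Python sketch of this predictive distribution, continuing the simulated example and the `nig_posterior` helper from earlier sketches (the new design matrix `X_tilde` is hypothetical; `scipy.stats.multivariate_t` requires SciPy 1.6 or later):

```python
from scipy import stats

# Hypothetical new design matrix with n_tilde = 4 rows
X_tilde = rng.normal(size=(4, p))

post_mean, M1, a1, b1 = nig_posterior(y, X, np.eye(n), m0, M0, a0, b0)

pred_loc = X_tilde @ post_mean
pred_shape = (b1 / a1) * (np.eye(X_tilde.shape[0]) + X_tilde @ M1 @ X_tilde.T)
predictive = stats.multivariate_t(loc=pred_loc, shape=pred_shape, df=2 * a1)

y_tilde_draws = predictive.rvs(size=1000, random_state=rng)  # 1000 x 4 array
```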

1.6 Sampling from the posterior distribution

We can sample from the joint posterior density $P(\beta, \sigma^2, \tilde{y} \mid y)$ by the following composition process:

  1. Draw $\hat{\sigma}^2_{(i)}$ from $IG(a_1, b_1)$

  2. Draw $\hat{\beta}_{(i)}$ from $N\!\left(M_1 m_1, \hat{\sigma}^2_{(i)} M_1\right)$

  3. Draw $\tilde{y}_{(i)}$ from $N\!\left(\tilde{X} \hat{\beta}_{(i)}, \hat{\sigma}^2_{(i)} I_{\tilde{n}}\right)$
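A minimal Python sketch of this composition sampler, assuming the posterior parameters have already been computed (e.g., with the `nig_posterior` helper sketched earlier); the function and argument names are ours:

```python
import numpy as np
from scipy import stats

def sample_joint_posterior(X_tilde, post_mean, M1, a1, b1, n_draws, rng):
    """Composition sampler for (beta, sigma2, y_tilde | y), following steps 1-3."""
    n_tilde = X_tilde.shape[0]
    # Step 1: sigma2_(i) ~ IG(a1, b1)
    sigma2 = stats.invgamma.rvs(a1, scale=b1, size=n_draws, random_state=rng)
    # Step 2: beta_(i) ~ N(M1 m1, sigma2_(i) M1)
    beta = np.array([rng.multivariate_normal(post_mean, s2 * M1) for s2 in sigma2])
    # Step 3: y_tilde_(i) ~ N(X_tilde beta_(i), sigma2_(i) I)
    y_tilde = np.array([rng.multivariate_normal(X_tilde @ b, s2 * np.eye(n_tilde))
                        for b, s2 in zip(beta, sigma2)])
    return beta, sigma2, y_tilde
```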