4.3 Linear regression: The conjugate normal-normal/inverse gamma model

In this setting we analyze the conjugate normal-normal/inverse gamma model, which is the workhorse in econometrics. In this model, the dependent variable $y_i$ is related to a set of regressors $\mathbf{x}_i=(x_{i1},x_{i2},\dots,x_{iK})'$ in a linear way, that is, $y_i=\beta_1x_{i1}+\beta_2x_{i2}+\dots+\beta_Kx_{iK}+\mu_i=\mathbf{x}_i'\boldsymbol{\beta}+\mu_i$, where $\boldsymbol{\beta}=(\beta_1,\beta_2,\dots,\beta_K)'$ and $\mu_i\stackrel{iid}{\sim}N(0,\sigma^2)$ is a stochastic error that is independent of the regressors, $\mathbf{x}_i\perp\mu_i$.

Defining $\mathbf{y}=\begin{bmatrix}y_1 & y_2 & \dots & y_N\end{bmatrix}'$, $\mathbf{X}=\begin{bmatrix}x_{11} & x_{12} & \dots & x_{1K}\\ x_{21} & x_{22} & \dots & x_{2K}\\ \vdots & \vdots & \ddots & \vdots\\ x_{N1} & x_{N2} & \dots & x_{NK}\end{bmatrix}$ and $\boldsymbol{\mu}=\begin{bmatrix}\mu_1 & \mu_2 & \dots & \mu_N\end{bmatrix}'$, we can write the model in matrix form: $\mathbf{y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\mu}$, where $\boldsymbol{\mu}\sim N(\mathbf{0},\sigma^2\mathbf{I})$, which implies that $\mathbf{y}\sim N(\mathbf{X}\boldsymbol{\beta},\sigma^2\mathbf{I})$. Then, the likelihood function is

$$p(\mathbf{y}\mid\boldsymbol{\beta},\sigma^2,\mathbf{X})=(2\pi\sigma^2)^{-\frac{N}{2}}\exp\left\{-\frac{1}{2\sigma^2}(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})'(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})\right\}\propto(\sigma^2)^{-\frac{N}{2}}\exp\left\{-\frac{1}{2\sigma^2}(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})'(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})\right\}.$$

The conjugate priors for the parameters are $\boldsymbol{\beta}\mid\sigma^2\sim N(\boldsymbol{\beta}_0,\sigma^2\mathbf{B}_0)$ and $\sigma^2\sim IG(\alpha_0/2,\delta_0/2)$.

Then, the posterior distribution is

$$\begin{aligned}
\pi(\boldsymbol{\beta},\sigma^2\mid\mathbf{y},\mathbf{X})&\propto(\sigma^2)^{-\frac{N}{2}}\exp\left\{-\frac{1}{2\sigma^2}(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})'(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})\right\}\\
&\quad\times(\sigma^2)^{-\frac{K}{2}}\exp\left\{-\frac{1}{2\sigma^2}(\boldsymbol{\beta}-\boldsymbol{\beta}_0)'\mathbf{B}_0^{-1}(\boldsymbol{\beta}-\boldsymbol{\beta}_0)\right\}\\
&\quad\times\frac{(\delta_0/2)^{\alpha_0/2}}{\Gamma(\alpha_0/2)}\left(\frac{1}{\sigma^2}\right)^{\alpha_0/2+1}\exp\left\{-\frac{\delta_0}{2\sigma^2}\right\}\\
&\propto(\sigma^2)^{-\frac{K}{2}}\exp\left\{-\frac{1}{2\sigma^2}\left[\boldsymbol{\beta}'(\mathbf{B}_0^{-1}+\mathbf{X}'\mathbf{X})\boldsymbol{\beta}-2\boldsymbol{\beta}'(\mathbf{B}_0^{-1}\boldsymbol{\beta}_0+\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}})\right]\right\}\\
&\quad\times\left(\frac{1}{\sigma^2}\right)^{\frac{\alpha_0+N}{2}+1}\exp\left\{-\frac{\delta_0+\mathbf{y}'\mathbf{y}+\boldsymbol{\beta}_0'\mathbf{B}_0^{-1}\boldsymbol{\beta}_0}{2\sigma^2}\right\},
\end{aligned}$$

where $\hat{\boldsymbol{\beta}}=(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ is the maximum likelihood estimator.

Adding and subtracting $\boldsymbol{\beta}_n'\mathbf{B}_n^{-1}\boldsymbol{\beta}_n$ to complete the square, where $\mathbf{B}_n=(\mathbf{B}_0^{-1}+\mathbf{X}'\mathbf{X})^{-1}$ and $\boldsymbol{\beta}_n=\mathbf{B}_n(\mathbf{B}_0^{-1}\boldsymbol{\beta}_0+\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}})$,

$$\pi(\boldsymbol{\beta},\sigma^2\mid\mathbf{y},\mathbf{X})\propto\underbrace{(\sigma^2)^{-\frac{K}{2}}\exp\left\{-\frac{1}{2\sigma^2}(\boldsymbol{\beta}-\boldsymbol{\beta}_n)'\mathbf{B}_n^{-1}(\boldsymbol{\beta}-\boldsymbol{\beta}_n)\right\}}_{1}\times\underbrace{(\sigma^2)^{-\left(\frac{\alpha_n}{2}+1\right)}\exp\left\{-\frac{\delta_n}{2\sigma^2}\right\}}_{2}.$$

The first expression is the kernel of a normal density function, $\boldsymbol{\beta}\mid\sigma^2,\mathbf{y},\mathbf{X}\sim N(\boldsymbol{\beta}_n,\sigma^2\mathbf{B}_n)$. The second expression is the kernel of an inverse gamma density, $\sigma^2\mid\mathbf{y},\mathbf{X}\sim IG(\alpha_n/2,\delta_n/2)$, where $\alpha_n=\alpha_0+N$ and $\delta_n=\delta_0+\mathbf{y}'\mathbf{y}+\boldsymbol{\beta}_0'\mathbf{B}_0^{-1}\boldsymbol{\beta}_0-\boldsymbol{\beta}_n'\mathbf{B}_n^{-1}\boldsymbol{\beta}_n$.
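To make these formulas concrete, the following is a minimal sketch in Python (numpy/scipy); the simulated data, the hyperparameters $\boldsymbol{\beta}_0$, $\mathbf{B}_0$, $\alpha_0$, $\delta_0$, and the seed are illustrative assumptions, not values from the text. It computes $\mathbf{B}_n$, $\boldsymbol{\beta}_n$, $\alpha_n$ and $\delta_n$, and draws once from $\sigma^2\mid\mathbf{y}$ and $\boldsymbol{\beta}\mid\sigma^2,\mathbf{y}$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated data (illustrative assumption): N observations, K regressors
N, K = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
beta_true, sigma2_true = np.array([1.0, 0.5, -0.8]), 1.0
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2_true), size=N)

# Hyperparameters (illustrative assumption): vague but proper priors
beta0 = np.zeros(K)
B0 = 100.0 * np.eye(K)
alpha0, delta0 = 2.0, 2.0

# Posterior parameters of the conjugate normal/inverse gamma model
B0_inv = np.linalg.inv(B0)
Bn = np.linalg.inv(B0_inv + X.T @ X)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)           # maximum likelihood estimator
beta_n = Bn @ (B0_inv @ beta0 + X.T @ X @ beta_hat)
alpha_n = alpha0 + N
delta_n = delta0 + y @ y + beta0 @ B0_inv @ beta0 - beta_n @ np.linalg.inv(Bn) @ beta_n

# sigma^2 | y ~ IG(alpha_n/2, delta_n/2), then beta | sigma^2, y ~ N(beta_n, sigma^2 Bn)
sigma2_draw = stats.invgamma.rvs(a=alpha_n / 2, scale=delta_n / 2, random_state=rng)
beta_draw = rng.multivariate_normal(beta_n, sigma2_draw * Bn)
print(beta_n, delta_n / (alpha_n - 2))                 # posterior mean and E[sigma^2 | y]
```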

Taking into account that
$$\boldsymbol{\beta}_n=(\mathbf{B}_0^{-1}+\mathbf{X}'\mathbf{X})^{-1}(\mathbf{B}_0^{-1}\boldsymbol{\beta}_0+\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}})=(\mathbf{B}_0^{-1}+\mathbf{X}'\mathbf{X})^{-1}\mathbf{B}_0^{-1}\boldsymbol{\beta}_0+(\mathbf{B}_0^{-1}+\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}},$$

where $(\mathbf{B}_0^{-1}+\mathbf{X}'\mathbf{X})^{-1}\mathbf{B}_0^{-1}=\mathbf{I}_K-(\mathbf{B}_0^{-1}+\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}$ (Smith 1973). Setting $\mathbf{W}=(\mathbf{B}_0^{-1}+\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}$, we have $\boldsymbol{\beta}_n=(\mathbf{I}_K-\mathbf{W})\boldsymbol{\beta}_0+\mathbf{W}\hat{\boldsymbol{\beta}}$, that is, the posterior mean of $\boldsymbol{\beta}$ is a weighted average of the sample and prior information, where the weights depend on the precision of each piece of information. Observe that when the prior covariance matrix is highly vague (non-informative), such that $\mathbf{B}_0^{-1}\rightarrow\mathbf{0}_K$, we obtain $\mathbf{W}\rightarrow\mathbf{I}_K$, and hence $\boldsymbol{\beta}_n\rightarrow\hat{\boldsymbol{\beta}}$, that is, the posterior mean of the location parameters converges to the maximum likelihood estimator.
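A quick numerical check of this shrinkage representation, again under illustrative simulated data and hyperparameters (assumptions), might look as follows; it verifies that $\boldsymbol{\beta}_n=(\mathbf{I}_K-\mathbf{W})\boldsymbol{\beta}_0+\mathbf{W}\hat{\boldsymbol{\beta}}$ agrees with the direct formula and that a very vague prior drives $\boldsymbol{\beta}_n$ toward $\hat{\boldsymbol{\beta}}$.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
y = X @ np.array([1.0, 0.5, -0.8]) + rng.normal(size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
beta0 = np.zeros(K)

def posterior_mean(c):
    """Posterior mean under the prior covariance B0 = c * I_K (illustrative choice)."""
    B0_inv = np.eye(K) / c
    W = np.linalg.solve(B0_inv + X.T @ X, X.T @ X)      # W = (B0^{-1} + X'X)^{-1} X'X
    return (np.eye(K) - W) @ beta0 + W @ beta_hat       # weighted-average representation

# Agrees with the direct formula beta_n = Bn (B0^{-1} beta0 + X'X beta_hat)
c = 10.0
Bn = np.linalg.inv(np.eye(K) / c + X.T @ X)
direct = Bn @ ((np.eye(K) / c) @ beta0 + X.T @ X @ beta_hat)
assert np.allclose(posterior_mean(c), direct)

# As the prior becomes vague (c -> infinity), the posterior mean approaches beta_hat
print(np.linalg.norm(posterior_mean(1e8) - beta_hat))   # close to zero
```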

In addition, we know that the posterior conditional covariance matrix of the location parameters is $\sigma^2(\mathbf{B}_0^{-1}+\mathbf{X}'\mathbf{X})^{-1}=\sigma^2(\mathbf{X}'\mathbf{X})^{-1}-\sigma^2\left((\mathbf{X}'\mathbf{X})^{-1}\left(\mathbf{B}_0+(\mathbf{X}'\mathbf{X})^{-1}\right)^{-1}(\mathbf{X}'\mathbf{X})^{-1}\right)$, where the subtracted term is positive semi-definite.$^{17}$ Given that $\sigma^2(\mathbf{X}'\mathbf{X})^{-1}$ is the covariance matrix of the maximum likelihood estimator, we observe that prior information reduces estimation uncertainty.
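A brief numerical illustration of this variance reduction (same kind of illustrative simulated design and prior as before, both assumptions) compares the eigenvalues of the difference between the maximum likelihood covariance and the posterior conditional covariance, which should all be non-negative up to rounding.

```python
import numpy as np

rng = np.random.default_rng(5)

# Illustrative design matrix, prior covariance, and variance (assumptions)
N, K = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
B0 = 10.0 * np.eye(K)
sigma2 = 1.0

cov_mle = sigma2 * np.linalg.inv(X.T @ X)                        # covariance of the MLE
cov_post = sigma2 * np.linalg.inv(np.linalg.inv(B0) + X.T @ X)   # posterior conditional covariance

# The difference is positive semi-definite: prior information reduces uncertainty
print(np.linalg.eigvalsh(cov_mle - cov_post))                    # eigenvalues >= 0 (up to rounding)
```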

Now, we calculate the posterior marginal distribution of $\boldsymbol{\beta}$,

$$\pi(\boldsymbol{\beta}\mid\mathbf{y},\mathbf{X})=\int_0^\infty\pi(\boldsymbol{\beta},\sigma^2\mid\mathbf{y},\mathbf{X})\,d\sigma^2\propto\int_0^\infty\left(\frac{1}{\sigma^2}\right)^{\frac{\alpha_n+K}{2}+1}\exp\left\{-\frac{s}{2\sigma^2}\right\}d\sigma^2,$$
where $s=\delta_n+(\boldsymbol{\beta}-\boldsymbol{\beta}_n)'\mathbf{B}_n^{-1}(\boldsymbol{\beta}-\boldsymbol{\beta}_n)$. Then we can write
$$\pi(\boldsymbol{\beta}\mid\mathbf{y},\mathbf{X})\propto\int_0^\infty\left(\frac{1}{\sigma^2}\right)^{\frac{\alpha_n+K}{2}+1}\exp\left\{-\frac{s}{2\sigma^2}\right\}d\sigma^2=\frac{\Gamma((\alpha_n+K)/2)}{(s/2)^{(\alpha_n+K)/2}}\int_0^\infty\frac{(s/2)^{(\alpha_n+K)/2}}{\Gamma((\alpha_n+K)/2)}(\sigma^2)^{-\frac{\alpha_n+K}{2}-1}\exp\left\{-\frac{s}{2\sigma^2}\right\}d\sigma^2.$$

The right term is the integral of the probability density function of an inverse gamma distribution with parameters $\nu=(\alpha_n+K)/2$ and $\tau=s/2$. Since we are integrating over the whole support of $\sigma^2$, the integral is equal to 1, and therefore
$$\begin{aligned}
\pi(\boldsymbol{\beta}\mid\mathbf{y},\mathbf{X})&\propto\frac{\Gamma((\alpha_n+K)/2)}{(s/2)^{(\alpha_n+K)/2}}\propto s^{-(\alpha_n+K)/2}\\
&=\left[\delta_n+(\boldsymbol{\beta}-\boldsymbol{\beta}_n)'\mathbf{B}_n^{-1}(\boldsymbol{\beta}-\boldsymbol{\beta}_n)\right]^{-(\alpha_n+K)/2}\\
&=\left[1+\frac{(\boldsymbol{\beta}-\boldsymbol{\beta}_n)'\left(\frac{\delta_n}{\alpha_n}\mathbf{B}_n\right)^{-1}(\boldsymbol{\beta}-\boldsymbol{\beta}_n)}{\alpha_n}\right]^{-(\alpha_n+K)/2}(\delta_n)^{-(\alpha_n+K)/2}\\
&\propto\left[1+\frac{(\boldsymbol{\beta}-\boldsymbol{\beta}_n)'\mathbf{H}_n^{-1}(\boldsymbol{\beta}-\boldsymbol{\beta}_n)}{\alpha_n}\right]^{-(\alpha_n+K)/2},
\end{aligned}$$
where $\mathbf{H}_n=\frac{\delta_n}{\alpha_n}\mathbf{B}_n$. This last expression is the kernel of a multivariate Student's t distribution for $\boldsymbol{\beta}$, that is, $\boldsymbol{\beta}\mid\mathbf{y},\mathbf{X}\sim t_K(\alpha_n,\boldsymbol{\beta}_n,\mathbf{H}_n)$.

Observe that, as we have incorporated the uncertainty about the variance, the posterior for $\boldsymbol{\beta}$ changes from a normal to a Student's t distribution, which has heavier tails.
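As a sketch (same kind of illustrative simulated data and hyperparameters as above, all assumptions), one can evaluate or sample from this marginal Student's t directly with scipy, which parameterizes the multivariate t by location, shape matrix and degrees of freedom:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Illustrative simulated data and vague-but-proper hyperparameters (assumptions)
N, K = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
y = X @ np.array([1.0, 0.5, -0.8]) + rng.normal(size=N)
beta0, B0_inv = np.zeros(K), np.eye(K) / 100.0
alpha0, delta0 = 2.0, 2.0

# Posterior quantities
Bn = np.linalg.inv(B0_inv + X.T @ X)
beta_n = Bn @ (B0_inv @ beta0 + X.T @ y)               # since X'X beta_hat = X'y
alpha_n = alpha0 + N
delta_n = delta0 + y @ y + beta0 @ B0_inv @ beta0 - beta_n @ np.linalg.inv(Bn) @ beta_n
Hn = (delta_n / alpha_n) * Bn                          # scale matrix of the marginal t

# Marginal posterior: beta | y, X ~ t_K(alpha_n, beta_n, Hn)
beta_marginal = stats.multivariate_t(loc=beta_n, shape=Hn, df=alpha_n)
print(beta_marginal.pdf(beta_n))                       # density at the posterior mean
print(beta_marginal.rvs(size=5, random_state=rng))     # five draws from the marginal posterior
```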

The marginal likelihood of this model is

$$p(\mathbf{y})=\int_0^\infty\int_{\mathbb{R}^K}\pi(\boldsymbol{\beta}\mid\sigma^2,\mathbf{B}_0,\boldsymbol{\beta}_0)\,\pi(\sigma^2\mid\alpha_0/2,\delta_0/2)\,p(\mathbf{y}\mid\boldsymbol{\beta},\sigma^2,\mathbf{X})\,d\boldsymbol{\beta}\,d\sigma^2.$$

Taking into account that $(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})'(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})+(\boldsymbol{\beta}-\boldsymbol{\beta}_0)'\mathbf{B}_0^{-1}(\boldsymbol{\beta}-\boldsymbol{\beta}_0)=(\boldsymbol{\beta}-\boldsymbol{\beta}_n)'\mathbf{B}_n^{-1}(\boldsymbol{\beta}-\boldsymbol{\beta}_n)+m$, where $m=\mathbf{y}'\mathbf{y}+\boldsymbol{\beta}_0'\mathbf{B}_0^{-1}\boldsymbol{\beta}_0-\boldsymbol{\beta}_n'\mathbf{B}_n^{-1}\boldsymbol{\beta}_n$, we have that

$$\begin{aligned}
p(\mathbf{y})&=\int_0^\infty\int_{\mathbb{R}^K}\pi(\boldsymbol{\beta}\mid\sigma^2)\,\pi(\sigma^2)\,p(\mathbf{y}\mid\boldsymbol{\beta},\sigma^2,\mathbf{X})\,d\boldsymbol{\beta}\,d\sigma^2\\
&=\int_0^\infty\pi(\sigma^2)\frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left\{-\frac{1}{2\sigma^2}m\right\}\frac{1}{(2\pi\sigma^2)^{K/2}|\mathbf{B}_0|^{1/2}}\\
&\quad\times\int_{\mathbb{R}^K}\exp\left\{-\frac{1}{2\sigma^2}(\boldsymbol{\beta}-\boldsymbol{\beta}_n)'\mathbf{B}_n^{-1}(\boldsymbol{\beta}-\boldsymbol{\beta}_n)\right\}d\boldsymbol{\beta}\,d\sigma^2\\
&=\int_0^\infty\pi(\sigma^2)\frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left\{-\frac{1}{2\sigma^2}m\right\}\frac{|\mathbf{B}_n|^{1/2}}{|\mathbf{B}_0|^{1/2}}\,d\sigma^2\\
&=\int_0^\infty\frac{(\delta_0/2)^{\alpha_0/2}}{\Gamma(\alpha_0/2)}\left(\frac{1}{\sigma^2}\right)^{\alpha_0/2+1}\exp\left\{-\frac{\delta_0}{2\sigma^2}\right\}\frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left\{-\frac{1}{2\sigma^2}m\right\}\frac{|\mathbf{B}_n|^{1/2}}{|\mathbf{B}_0|^{1/2}}\,d\sigma^2\\
&=\frac{1}{(2\pi)^{N/2}}\frac{(\delta_0/2)^{\alpha_0/2}}{\Gamma(\alpha_0/2)}\frac{|\mathbf{B}_n|^{1/2}}{|\mathbf{B}_0|^{1/2}}\int_0^\infty\left(\frac{1}{\sigma^2}\right)^{\frac{\alpha_0+N}{2}+1}\exp\left\{-\frac{\delta_0+m}{2\sigma^2}\right\}d\sigma^2\\
&=\frac{1}{\pi^{N/2}}\frac{\delta_0^{\alpha_0/2}}{\delta_n^{\alpha_n/2}}\frac{|\mathbf{B}_n|^{1/2}}{|\mathbf{B}_0|^{1/2}}\frac{\Gamma(\alpha_n/2)}{\Gamma(\alpha_0/2)}.
\end{aligned}$$
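This closed form lends itself to direct computation on the log scale; the sketch below, again under simulated data and illustrative hyperparameters (assumptions), evaluates $\log p(\mathbf{y})$ using log-determinants and log-gamma functions for numerical stability.

```python
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(3)

# Illustrative simulated data and hyperparameters (assumptions)
N, K = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
y = X @ np.array([1.0, 0.5, -0.8]) + rng.normal(size=N)
beta0, B0 = np.zeros(K), 100.0 * np.eye(K)
alpha0, delta0 = 2.0, 2.0

def log_marginal_likelihood(y, X, beta0, B0, alpha0, delta0):
    """log p(y) = -N/2 log(pi) + (alpha0/2) log(delta0) - (alpha_n/2) log(delta_n)
    + (1/2)(log|Bn| - log|B0|) + log Gamma(alpha_n/2) - log Gamma(alpha0/2)."""
    N = len(y)
    B0_inv = np.linalg.inv(B0)
    Bn = np.linalg.inv(B0_inv + X.T @ X)
    beta_n = Bn @ (B0_inv @ beta0 + X.T @ y)
    alpha_n = alpha0 + N
    delta_n = delta0 + y @ y + beta0 @ B0_inv @ beta0 - beta_n @ np.linalg.inv(Bn) @ beta_n
    _, logdet_Bn = np.linalg.slogdet(Bn)
    _, logdet_B0 = np.linalg.slogdet(B0)
    return (-0.5 * N * np.log(np.pi)
            + 0.5 * alpha0 * np.log(delta0) - 0.5 * alpha_n * np.log(delta_n)
            + 0.5 * (logdet_Bn - logdet_B0)
            + gammaln(alpha_n / 2) - gammaln(alpha0 / 2))

print(log_marginal_likelihood(y, X, beta0, B0, alpha0, delta0))
```

Such a function can be used, for instance, to compare two competing sets of regressors: the Bayes factor is the exponential of the difference between their log marginal likelihoods.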

The posterior predictive is equal to

$$\begin{aligned}
\pi(\mathbf{Y}_0\mid\mathbf{y})&=\int_0^\infty\int_{\mathbb{R}^K}p(\mathbf{Y}_0\mid\boldsymbol{\beta},\sigma^2,\mathbf{y})\,\pi(\boldsymbol{\beta}\mid\sigma^2,\mathbf{y})\,\pi(\sigma^2\mid\mathbf{y})\,d\boldsymbol{\beta}\,d\sigma^2\\
&=\int_0^\infty\int_{\mathbb{R}^K}p(\mathbf{Y}_0\mid\boldsymbol{\beta},\sigma^2)\,\pi(\boldsymbol{\beta}\mid\sigma^2,\mathbf{y})\,\pi(\sigma^2\mid\mathbf{y})\,d\boldsymbol{\beta}\,d\sigma^2,
\end{aligned}$$

where we take into account the independence between $\mathbf{Y}_0$ and $\mathbf{Y}$ given the parameters. Let $\mathbf{X}_0$ denote the $N_0\times K$ matrix of regressors associated with $\mathbf{Y}_0$. Then,

$$\begin{aligned}
\pi(\mathbf{Y}_0\mid\mathbf{y})=\int_0^\infty\int_{\mathbb{R}^K}\Bigg\{&(2\pi\sigma^2)^{-\frac{N_0}{2}}\exp\left\{-\frac{1}{2\sigma^2}(\mathbf{Y}_0-\mathbf{X}_0\boldsymbol{\beta})'(\mathbf{Y}_0-\mathbf{X}_0\boldsymbol{\beta})\right\}\\
&\times(2\pi\sigma^2)^{-\frac{K}{2}}|\mathbf{B}_n|^{-1/2}\exp\left\{-\frac{1}{2\sigma^2}(\boldsymbol{\beta}-\boldsymbol{\beta}_n)'\mathbf{B}_n^{-1}(\boldsymbol{\beta}-\boldsymbol{\beta}_n)\right\}\\
&\times\frac{(\delta_n/2)^{\alpha_n/2}}{\Gamma(\alpha_n/2)}\left(\frac{1}{\sigma^2}\right)^{\alpha_n/2+1}\exp\left\{-\frac{\delta_n}{2\sigma^2}\right\}\Bigg\}\,d\boldsymbol{\beta}\,d\sigma^2.
\end{aligned}$$

Setting $\mathbf{M}=\mathbf{X}_0'\mathbf{X}_0+\mathbf{B}_n^{-1}$ and $\boldsymbol{\beta}^*=\mathbf{M}^{-1}(\mathbf{B}_n^{-1}\boldsymbol{\beta}_n+\mathbf{X}_0'\mathbf{Y}_0)$, we have

$$(\mathbf{Y}_0-\mathbf{X}_0\boldsymbol{\beta})'(\mathbf{Y}_0-\mathbf{X}_0\boldsymbol{\beta})+(\boldsymbol{\beta}-\boldsymbol{\beta}_n)'\mathbf{B}_n^{-1}(\boldsymbol{\beta}-\boldsymbol{\beta}_n)=(\boldsymbol{\beta}-\boldsymbol{\beta}^*)'\mathbf{M}(\boldsymbol{\beta}-\boldsymbol{\beta}^*)+\boldsymbol{\beta}_n'\mathbf{B}_n^{-1}\boldsymbol{\beta}_n+\mathbf{Y}_0'\mathbf{Y}_0-\boldsymbol{\beta}^{*\prime}\mathbf{M}\boldsymbol{\beta}^*.$$

Thus,

$$\begin{aligned}
\pi(\mathbf{Y}_0\mid\mathbf{y})\propto\int_0^\infty\Bigg\{&\left(\frac{1}{\sigma^2}\right)^{\frac{K+N_0+\alpha_n}{2}+1}\exp\left\{-\frac{1}{2\sigma^2}\left(\boldsymbol{\beta}_n'\mathbf{B}_n^{-1}\boldsymbol{\beta}_n+\mathbf{Y}_0'\mathbf{Y}_0-\boldsymbol{\beta}^{*\prime}\mathbf{M}\boldsymbol{\beta}^*+\delta_n\right)\right\}\\
&\times\int_{\mathbb{R}^K}\exp\left\{-\frac{1}{2\sigma^2}(\boldsymbol{\beta}-\boldsymbol{\beta}^*)'\mathbf{M}(\boldsymbol{\beta}-\boldsymbol{\beta}^*)\right\}d\boldsymbol{\beta}\Bigg\}\,d\sigma^2,
\end{aligned}$$
where the term in the second integral is the kernel of a multivariate normal density with mean $\boldsymbol{\beta}^*$ and covariance matrix $\sigma^2\mathbf{M}^{-1}$. Then,

$$\pi(\mathbf{Y}_0\mid\mathbf{y})\propto\int_0^\infty\left(\frac{1}{\sigma^2}\right)^{\frac{N_0+\alpha_n}{2}+1}\exp\left\{-\frac{1}{2\sigma^2}\left(\boldsymbol{\beta}_n'\mathbf{B}_n^{-1}\boldsymbol{\beta}_n+\mathbf{Y}_0'\mathbf{Y}_0-\boldsymbol{\beta}^{*\prime}\mathbf{M}\boldsymbol{\beta}^*+\delta_n\right)\right\}d\sigma^2,$$

which is the kernel of an inverse gamma density. Thus,

$$\pi(\mathbf{Y}_0\mid\mathbf{y})\propto\left[\frac{\boldsymbol{\beta}_n'\mathbf{B}_n^{-1}\boldsymbol{\beta}_n+\mathbf{Y}_0'\mathbf{Y}_0-\boldsymbol{\beta}^{*\prime}\mathbf{M}\boldsymbol{\beta}^*+\delta_n}{2}\right]^{-\frac{\alpha_n+N_0}{2}}.$$

Setting $\mathbf{C}^{-1}=\mathbf{I}_{N_0}+\mathbf{X}_0\mathbf{B}_n\mathbf{X}_0'$ such that $\mathbf{C}=\mathbf{I}_{N_0}-\mathbf{X}_0(\mathbf{B}_n^{-1}+\mathbf{X}_0'\mathbf{X}_0)^{-1}\mathbf{X}_0'=\mathbf{I}_{N_0}-\mathbf{X}_0\mathbf{M}^{-1}\mathbf{X}_0'$,$^{18}$ and $\boldsymbol{\beta}^{**}=\mathbf{C}^{-1}\mathbf{X}_0\mathbf{M}^{-1}\mathbf{B}_n^{-1}\boldsymbol{\beta}_n$, then

$$\begin{aligned}
\boldsymbol{\beta}_n'\mathbf{B}_n^{-1}\boldsymbol{\beta}_n+\mathbf{Y}_0'\mathbf{Y}_0-\boldsymbol{\beta}^{*\prime}\mathbf{M}\boldsymbol{\beta}^*&=\boldsymbol{\beta}_n'\mathbf{B}_n^{-1}\boldsymbol{\beta}_n+\mathbf{Y}_0'\mathbf{Y}_0-(\boldsymbol{\beta}_n'\mathbf{B}_n^{-1}+\mathbf{Y}_0'\mathbf{X}_0)\mathbf{M}^{-1}(\mathbf{B}_n^{-1}\boldsymbol{\beta}_n+\mathbf{X}_0'\mathbf{Y}_0)\\
&=\boldsymbol{\beta}_n'(\mathbf{B}_n^{-1}-\mathbf{B}_n^{-1}\mathbf{M}^{-1}\mathbf{B}_n^{-1})\boldsymbol{\beta}_n+\mathbf{Y}_0'\mathbf{C}\mathbf{Y}_0-2\mathbf{Y}_0'\mathbf{C}\mathbf{C}^{-1}\mathbf{X}_0\mathbf{M}^{-1}\mathbf{B}_n^{-1}\boldsymbol{\beta}_n\\
&\quad+\boldsymbol{\beta}^{**\prime}\mathbf{C}\boldsymbol{\beta}^{**}-\boldsymbol{\beta}^{**\prime}\mathbf{C}\boldsymbol{\beta}^{**}\\
&=\boldsymbol{\beta}_n'(\mathbf{B}_n^{-1}-\mathbf{B}_n^{-1}\mathbf{M}^{-1}\mathbf{B}_n^{-1})\boldsymbol{\beta}_n+(\mathbf{Y}_0-\boldsymbol{\beta}^{**})'\mathbf{C}(\mathbf{Y}_0-\boldsymbol{\beta}^{**})-\boldsymbol{\beta}^{**\prime}\mathbf{C}\boldsymbol{\beta}^{**},
\end{aligned}$$

where $\boldsymbol{\beta}_n'(\mathbf{B}_n^{-1}-\mathbf{B}_n^{-1}\mathbf{M}^{-1}\mathbf{B}_n^{-1})\boldsymbol{\beta}_n=\boldsymbol{\beta}^{**\prime}\mathbf{C}\boldsymbol{\beta}^{**}$ and $\boldsymbol{\beta}^{**}=\mathbf{X}_0\boldsymbol{\beta}_n$ (see Exercise 6).

Then,

$$\pi(\mathbf{Y}_0\mid\mathbf{y})\propto\left[\frac{(\mathbf{Y}_0-\mathbf{X}_0\boldsymbol{\beta}_n)'\mathbf{C}(\mathbf{Y}_0-\mathbf{X}_0\boldsymbol{\beta}_n)+\delta_n}{2}\right]^{-\frac{\alpha_n+N_0}{2}}\propto\left[\frac{(\mathbf{Y}_0-\mathbf{X}_0\boldsymbol{\beta}_n)'\left(\frac{\alpha_n}{\delta_n}\mathbf{C}\right)(\mathbf{Y}_0-\mathbf{X}_0\boldsymbol{\beta}_n)}{\alpha_n}+1\right]^{-\frac{\alpha_n+N_0}{2}}.$$

Then, the posterior predictive is a multivariate Student's t, $\mathbf{Y}_0\mid\mathbf{y}\sim t_{N_0}\left(\alpha_n,\ \mathbf{X}_0\boldsymbol{\beta}_n,\ \frac{\delta_n}{\alpha_n}(\mathbf{I}_{N_0}+\mathbf{X}_0\mathbf{B}_n\mathbf{X}_0')\right)$.
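A sketch of how one might evaluate or draw from this predictive distribution; the simulated data, illustrative hyperparameters, and the arbitrary matrix of out-of-sample regressors $\mathbf{X}_0$ are all assumptions here, not values from the text.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Illustrative simulated data and hyperparameters (assumptions)
N, K, N0 = 100, 3, 5
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
y = X @ np.array([1.0, 0.5, -0.8]) + rng.normal(size=N)
beta0, B0_inv = np.zeros(K), np.eye(K) / 100.0
alpha0, delta0 = 2.0, 2.0

# Posterior quantities
Bn = np.linalg.inv(B0_inv + X.T @ X)
beta_n = Bn @ (B0_inv @ beta0 + X.T @ y)
alpha_n = alpha0 + N
delta_n = delta0 + y @ y + beta0 @ B0_inv @ beta0 - beta_n @ np.linalg.inv(Bn) @ beta_n

# Out-of-sample regressors (illustrative) and the predictive Student's t
X0 = np.column_stack([np.ones(N0), rng.normal(size=(N0, K - 1))])
scale = (delta_n / alpha_n) * (np.eye(N0) + X0 @ Bn @ X0.T)
predictive = stats.multivariate_t(loc=X0 @ beta_n, shape=scale, df=alpha_n)

Y0_draws = predictive.rvs(size=1000, random_state=rng)   # predictive draws, shape (1000, N0)
print(Y0_draws.mean(axis=0))                             # close to X0 @ beta_n
print(predictive.pdf(X0 @ beta_n))                       # predictive density at its center
```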


  17. A particular case of the Woodbury matrix identity.

  18. Using this result: $(\mathbf{A}+\mathbf{B}\mathbf{D}\mathbf{C})^{-1}=\mathbf{A}^{-1}-\mathbf{A}^{-1}\mathbf{B}(\mathbf{D}^{-1}+\mathbf{C}\mathbf{A}^{-1}\mathbf{B})^{-1}\mathbf{C}\mathbf{A}^{-1}$.