4.3 Linear regression: The conjugate normal-normal/inverse gamma model
In this setting we analyze the conjugate normal-normal/inverse gamma model, which is the workhorse in econometrics. In this model, the dependent variable $y_i$ is related to a set of regressors $x_i=(x_{i1},x_{i2},\dots,x_{iK})^{\top}$ in a linear way, that is, $y_i=\beta_1 x_{i1}+\beta_2 x_{i2}+\dots+\beta_K x_{iK}+\mu_i=x_i^{\top}\beta+\mu_i$, where $\beta=(\beta_1,\beta_2,\dots,\beta_K)^{\top}$ and $\mu_i\stackrel{iid}{\sim}N(0,\sigma^2)$ is a stochastic error that is independent of the regressors, $x_i\perp\mu_i$.
Defining $y=\begin{bmatrix}y_1\\y_2\\\vdots\\y_N\end{bmatrix}$, $X=\begin{bmatrix}x_{11}&x_{12}&\dots&x_{1K}\\x_{21}&x_{22}&\dots&x_{2K}\\\vdots&\vdots&\ddots&\vdots\\x_{N1}&x_{N2}&\dots&x_{NK}\end{bmatrix}$ and $\mu=\begin{bmatrix}\mu_1\\\mu_2\\\vdots\\\mu_N\end{bmatrix}$, we can write the model in matrix form: $y=X\beta+\mu$, where $\mu\sim N(0,\sigma^2 I)$, which implies that $y\sim N(X\beta,\sigma^2 I)$. Then, the likelihood function is
$$
p(y\mid\beta,\sigma^2,X)=(2\pi\sigma^2)^{-N/2}\exp\left\{-\frac{1}{2\sigma^2}(y-X\beta)^{\top}(y-X\beta)\right\}\propto(\sigma^2)^{-N/2}\exp\left\{-\frac{1}{2\sigma^2}(y-X\beta)^{\top}(y-X\beta)\right\}.
$$
The conjugate priors for the parameters are
$$
\beta\mid\sigma^2\sim N(\beta_0,\sigma^2 B_0),\qquad \sigma^2\sim IG(\alpha_0/2,\delta_0/2).
$$
Then, the posterior distribution is
$$
\begin{aligned}
\pi(\beta,\sigma^2\mid y,X)&\propto(\sigma^2)^{-N/2}\exp\left\{-\frac{1}{2\sigma^2}(y-X\beta)^{\top}(y-X\beta)\right\}\\
&\quad\times(\sigma^2)^{-K/2}\exp\left\{-\frac{1}{2\sigma^2}(\beta-\beta_0)^{\top}B_0^{-1}(\beta-\beta_0)\right\}\\
&\quad\times\frac{(\delta_0/2)^{\alpha_0/2}}{\Gamma(\alpha_0/2)}\left(\frac{1}{\sigma^2}\right)^{\alpha_0/2+1}\exp\left\{-\frac{\delta_0}{2\sigma^2}\right\}\\
&\propto(\sigma^2)^{-K/2}\exp\left\{-\frac{1}{2\sigma^2}\left[\beta^{\top}(B_0^{-1}+X^{\top}X)\beta-2\beta^{\top}(B_0^{-1}\beta_0+X^{\top}X\hat{\beta})\right]\right\}\\
&\quad\times\left(\frac{1}{\sigma^2}\right)^{(\alpha_0+N)/2+1}\exp\left\{-\frac{\delta_0+y^{\top}y+\beta_0^{\top}B_0^{-1}\beta_0}{2\sigma^2}\right\},
\end{aligned}
$$
where $\hat{\beta}=(X^{\top}X)^{-1}X^{\top}y$ is the maximum likelihood estimator.
Adding and subtracting $\beta_n^{\top}B_n^{-1}\beta_n$ to complete the square, where $B_n=(B_0^{-1}+X^{\top}X)^{-1}$ and $\beta_n=B_n(B_0^{-1}\beta_0+X^{\top}X\hat{\beta})$,
$$
\pi(\beta,\sigma^2\mid y,X)\propto\underbrace{(\sigma^2)^{-K/2}\exp\left\{-\frac{1}{2\sigma^2}(\beta-\beta_n)^{\top}B_n^{-1}(\beta-\beta_n)\right\}}_{1}\times\underbrace{(\sigma^2)^{-(\alpha_n/2+1)}\exp\left\{-\frac{\delta_n}{2\sigma^2}\right\}}_{2}.
$$
The first expression is the kernel of a normal density function, $\beta\mid\sigma^2,y,X\sim N(\beta_n,\sigma^2 B_n)$. The second expression is the kernel of an inverse gamma density, $\sigma^2\mid y,X\sim IG(\alpha_n/2,\delta_n/2)$, where $\alpha_n=\alpha_0+N$ and $\delta_n=\delta_0+y^{\top}y+\beta_0^{\top}B_0^{-1}\beta_0-\beta_n^{\top}B_n^{-1}\beta_n$.
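To make the updating formulas concrete, the following Python sketch computes $\beta_n$, $B_n$, $\alpha_n$ and $\delta_n$ from simulated data; the function name `niw_update`, the simulated design and the prior hyperparameter values are illustrative assumptions, not part of the model above.

```python
import numpy as np

def niw_update(y, X, beta0, B0, alpha0, delta0):
    """Posterior hyperparameters of the conjugate normal/inverse gamma linear model."""
    B0_inv = np.linalg.inv(B0)
    XtX = X.T @ X
    beta_hat = np.linalg.solve(XtX, X.T @ y)            # maximum likelihood estimator
    Bn = np.linalg.inv(B0_inv + XtX)                    # posterior scale matrix B_n
    beta_n = Bn @ (B0_inv @ beta0 + XtX @ beta_hat)     # posterior mean beta_n
    alpha_n = alpha0 + y.shape[0]                       # alpha_n = alpha_0 + N
    delta_n = (delta0 + y @ y + beta0 @ B0_inv @ beta0
               - beta_n @ np.linalg.solve(Bn, beta_n))  # delta_n
    return beta_n, Bn, alpha_n, delta_n

# Illustrative data (arbitrary values, for demonstration only)
rng = np.random.default_rng(0)
N, K = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
y = X @ np.array([1.0, 0.5, -0.8]) + rng.normal(scale=0.7, size=N)

beta0, B0, alpha0, delta0 = np.zeros(K), 100 * np.eye(K), 2.0, 2.0
beta_n, Bn, alpha_n, delta_n = niw_update(y, X, beta0, B0, alpha0, delta0)
```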
Taking into account that
$$
\beta_n=(B_0^{-1}+X^{\top}X)^{-1}(B_0^{-1}\beta_0+X^{\top}X\hat{\beta})=(B_0^{-1}+X^{\top}X)^{-1}B_0^{-1}\beta_0+(B_0^{-1}+X^{\top}X)^{-1}X^{\top}X\hat{\beta},
$$
where $(B_0^{-1}+X^{\top}X)^{-1}B_0^{-1}=I_K-(B_0^{-1}+X^{\top}X)^{-1}X^{\top}X$ (Smith 1973). Setting $W=(B_0^{-1}+X^{\top}X)^{-1}X^{\top}X$, we have $\beta_n=(I_K-W)\beta_0+W\hat{\beta}$; that is, the posterior mean of $\beta$ is a weighted average of the sample and prior information, where the weights depend on the precision of each piece of information. Observe that when the prior covariance matrix is highly vague (non-informative), such that $B_0^{-1}\to 0_K$, we obtain $W\to I_K$ and hence $\beta_n\to\hat{\beta}$; that is, the posterior mean of the location parameters converges to the maximum likelihood estimator.
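Continuing the sketch above (the variables `X`, `y`, `beta0`, `beta_n` and `K` are assumed from that code), the weighted-average representation can be verified numerically, and making the prior vaguer drives $\beta_n$ toward $\hat{\beta}$:

```python
# Weighted-average form of the posterior mean: beta_n = (I - W) beta0 + W beta_hat
B0_inv = np.linalg.inv(100 * np.eye(K))
W = np.linalg.inv(B0_inv + X.T @ X) @ (X.T @ X)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
assert np.allclose(beta_n, (np.eye(K) - W) @ beta0 + W @ beta_hat)

# As the prior becomes vaguer (B0 grows, B0^{-1} -> 0), beta_n -> beta_hat.
# With beta0 = 0 the posterior mean reduces to W @ beta_hat.
for scale in (1e2, 1e4, 1e6):
    Bn_s = np.linalg.inv(np.eye(K) / scale + X.T @ X)
    print(scale, np.max(np.abs(Bn_s @ (X.T @ X) @ beta_hat - beta_hat)))
```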
In addition, the posterior conditional covariance matrix of the location parameters satisfies
$$
\sigma^2(B_0^{-1}+X^{\top}X)^{-1}=\sigma^2(X^{\top}X)^{-1}-\sigma^2(X^{\top}X)^{-1}\left(B_0+(X^{\top}X)^{-1}\right)^{-1}(X^{\top}X)^{-1},
$$
where the subtracted matrix is positive semi-definite.$^{17}$ Given that $\sigma^2(X^{\top}X)^{-1}$ is the covariance matrix of the maximum likelihood estimator, we observe that prior information reduces estimation uncertainty.
Now, we calculate the posterior marginal distribution of β,
$$
\pi(\beta\mid y,X)=\int_0^{\infty}\pi(\beta,\sigma^2\mid y,X)\,d\sigma^2=\int_0^{\infty}\left(\frac{1}{\sigma^2}\right)^{\frac{\alpha_n+K}{2}+1}\exp\left\{-\frac{s}{2\sigma^2}\right\}d\sigma^2,
$$
where $s=\delta_n+(\beta-\beta_n)^{\top}B_n^{-1}(\beta-\beta_n)$. Then we can write
$$
\pi(\beta\mid y,X)=\int_0^{\infty}\left(\frac{1}{\sigma^2}\right)^{\frac{\alpha_n+K}{2}+1}\exp\left\{-\frac{s}{2\sigma^2}\right\}d\sigma^2=\frac{\Gamma((\alpha_n+K)/2)}{(s/2)^{(\alpha_n+K)/2}}\int_0^{\infty}\frac{(s/2)^{(\alpha_n+K)/2}}{\Gamma((\alpha_n+K)/2)}(\sigma^2)^{-\frac{\alpha_n+K}{2}-1}\exp\left\{-\frac{s}{2\sigma^2}\right\}d\sigma^2.
$$
The right term is the integral of the probability density function of an inverse gamma distribution with parameters $\nu=(\alpha_n+K)/2$ and $\tau=s/2$. Since we are integrating over the whole support of $\sigma^2$, the integral is equal to 1, and therefore
$$
\begin{aligned}
\pi(\beta\mid y,X)&=\frac{\Gamma((\alpha_n+K)/2)}{(s/2)^{(\alpha_n+K)/2}}\propto s^{-(\alpha_n+K)/2}\\
&=\left[\delta_n+(\beta-\beta_n)^{\top}B_n^{-1}(\beta-\beta_n)\right]^{-(\alpha_n+K)/2}\\
&=\left[1+\frac{(\beta-\beta_n)^{\top}\left(\frac{\delta_n}{\alpha_n}B_n\right)^{-1}(\beta-\beta_n)}{\alpha_n}\right]^{-(\alpha_n+K)/2}(\delta_n)^{-(\alpha_n+K)/2}\\
&\propto\left[1+\frac{(\beta-\beta_n)^{\top}H_n^{-1}(\beta-\beta_n)}{\alpha_n}\right]^{-(\alpha_n+K)/2},
\end{aligned}
$$
where $H_n=\frac{\delta_n}{\alpha_n}B_n$. This last expression is a multivariate Student's t distribution for $\beta$, $\beta\mid y,X\sim t_K(\alpha_n,\beta_n,H_n)$.
Observe that, since we have incorporated the uncertainty about the variance, the marginal posterior of $\beta$ changes from a normal to a Student's t distribution, which has heavier tails.
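As a brief illustration (assuming SciPy is available and reusing `beta_n`, `Bn`, `alpha_n`, `delta_n`, `K` and `rng` from the earlier sketch), draws from this marginal posterior can be obtained by composition, which is equivalent to sampling the multivariate Student's t directly:

```python
import numpy as np
from scipy import stats

# Composition sampling: sigma2 ~ IG(alpha_n/2, delta_n/2),
# then beta | sigma2 ~ N(beta_n, sigma2 * Bn).
sigma2_draws = stats.invgamma.rvs(a=alpha_n / 2, scale=delta_n / 2,
                                  size=5000, random_state=rng)
L = np.linalg.cholesky(Bn)
beta_draws = beta_n + (np.sqrt(sigma2_draws)[:, None]
                       * (rng.standard_normal((5000, K)) @ L.T))

# Equivalently, the marginal posterior is t_K(alpha_n, beta_n, H_n) with
# H_n = (delta_n / alpha_n) * Bn (scipy >= 1.6 provides multivariate_t).
Hn = (delta_n / alpha_n) * Bn
beta_marginal = stats.multivariate_t(loc=beta_n, shape=Hn, df=alpha_n)
```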
The marginal likelihood of this model is
$$
p(y)=\int_0^{\infty}\int_{\mathbb{R}^K}\pi(\beta\mid\sigma^2,B_0,\beta_0)\pi(\sigma^2\mid\alpha_0/2,\delta_0/2)p(y\mid\beta,\sigma^2,X)\,d\beta\,d\sigma^2.
$$
Taking into account that $(y-X\beta)^{\top}(y-X\beta)+(\beta-\beta_0)^{\top}B_0^{-1}(\beta-\beta_0)=(\beta-\beta_n)^{\top}B_n^{-1}(\beta-\beta_n)+m$, where $m=y^{\top}y+\beta_0^{\top}B_0^{-1}\beta_0-\beta_n^{\top}B_n^{-1}\beta_n$, we have that
$$
\begin{aligned}
p(y)&=\int_0^{\infty}\int_{\mathbb{R}^K}\pi(\beta\mid\sigma^2)\pi(\sigma^2)p(y\mid\beta,\sigma^2,X)\,d\beta\,d\sigma^2\\
&=\int_0^{\infty}\pi(\sigma^2)\frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left\{-\frac{m}{2\sigma^2}\right\}\frac{1}{(2\pi\sigma^2)^{K/2}|B_0|^{1/2}}\int_{\mathbb{R}^K}\exp\left\{-\frac{1}{2\sigma^2}(\beta-\beta_n)^{\top}B_n^{-1}(\beta-\beta_n)\right\}d\beta\,d\sigma^2\\
&=\int_0^{\infty}\pi(\sigma^2)\frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left\{-\frac{m}{2\sigma^2}\right\}\frac{|B_n|^{1/2}}{|B_0|^{1/2}}\,d\sigma^2\\
&=\int_0^{\infty}\frac{(\delta_0/2)^{\alpha_0/2}}{\Gamma(\alpha_0/2)}\left(\frac{1}{\sigma^2}\right)^{\alpha_0/2+1}\exp\left\{-\frac{\delta_0}{2\sigma^2}\right\}\frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left\{-\frac{m}{2\sigma^2}\right\}\frac{|B_n|^{1/2}}{|B_0|^{1/2}}\,d\sigma^2\\
&=\frac{1}{(2\pi)^{N/2}}\frac{(\delta_0/2)^{\alpha_0/2}}{\Gamma(\alpha_0/2)}\frac{|B_n|^{1/2}}{|B_0|^{1/2}}\int_0^{\infty}\left(\frac{1}{\sigma^2}\right)^{\frac{\alpha_0+N}{2}+1}\exp\left\{-\frac{\delta_0+m}{2\sigma^2}\right\}d\sigma^2\\
&=\frac{1}{\pi^{N/2}}\frac{\delta_0^{\alpha_0/2}}{\delta_n^{\alpha_n/2}}\frac{|B_n|^{1/2}}{|B_0|^{1/2}}\frac{\Gamma(\alpha_n/2)}{\Gamma(\alpha_0/2)}.
\end{aligned}
$$
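In applications the marginal likelihood is usually evaluated on the log scale for numerical stability. A minimal sketch of the closed form above, assuming the hypothetical `niw_update` helper defined earlier:

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_likelihood(y, X, beta0, B0, alpha0, delta0):
    """log p(y) for the conjugate normal-normal/inverse gamma linear model."""
    N = y.shape[0]
    _, Bn, alpha_n, delta_n = niw_update(y, X, beta0, B0, alpha0, delta0)
    _, logdet_Bn = np.linalg.slogdet(Bn)
    _, logdet_B0 = np.linalg.slogdet(B0)
    return (-0.5 * N * np.log(np.pi)
            + 0.5 * alpha0 * np.log(delta0) - 0.5 * alpha_n * np.log(delta_n)
            + 0.5 * (logdet_Bn - logdet_B0)
            + gammaln(alpha_n / 2) - gammaln(alpha0 / 2))
```

Working on the log scale avoids overflow in $\Gamma(\alpha_n/2)$ and $\delta_n^{\alpha_n/2}$ even for moderate sample sizes.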
The posterior predictive is equal to
$$
\pi(Y_0\mid y)=\int_0^{\infty}\int_{\mathbb{R}^K}p(Y_0\mid\beta,\sigma^2,y)\pi(\beta\mid\sigma^2,y)\pi(\sigma^2\mid y)\,d\beta\,d\sigma^2=\int_0^{\infty}\int_{\mathbb{R}^K}p(Y_0\mid\beta,\sigma^2)\pi(\beta\mid\sigma^2,y)\pi(\sigma^2\mid y)\,d\beta\,d\sigma^2,
$$
where we take into account the independence between $Y_0$ and $Y$ given the parameters. Let $X_0$ be the $N_0\times K$ matrix of regressors associated with $Y_0$. Then,
$$
\begin{aligned}
\pi(Y_0\mid y)=\int_0^{\infty}\int_{\mathbb{R}^K}\Bigl\{&(2\pi\sigma^2)^{-N_0/2}\exp\left\{-\frac{1}{2\sigma^2}(Y_0-X_0\beta)^{\top}(Y_0-X_0\beta)\right\}\\
&\times(2\pi\sigma^2)^{-K/2}|B_n|^{-1/2}\exp\left\{-\frac{1}{2\sigma^2}(\beta-\beta_n)^{\top}B_n^{-1}(\beta-\beta_n)\right\}\\
&\times\frac{(\delta_n/2)^{\alpha_n/2}}{\Gamma(\alpha_n/2)}\left(\frac{1}{\sigma^2}\right)^{\alpha_n/2+1}\exp\left\{-\frac{\delta_n}{2\sigma^2}\right\}\Bigr\}\,d\beta\,d\sigma^2.
\end{aligned}
$$
Setting $M=X_0^{\top}X_0+B_n^{-1}$ and $\beta_*=M^{-1}(B_n^{-1}\beta_n+X_0^{\top}Y_0)$, we have
$$
(Y_0-X_0\beta)^{\top}(Y_0-X_0\beta)+(\beta-\beta_n)^{\top}B_n^{-1}(\beta-\beta_n)=(\beta-\beta_*)^{\top}M(\beta-\beta_*)+\beta_n^{\top}B_n^{-1}\beta_n+Y_0^{\top}Y_0-\beta_*^{\top}M\beta_*.
$$
Thus,
$$
\begin{aligned}
\pi(Y_0\mid y)\propto\int_0^{\infty}\Bigl\{&\left(\frac{1}{\sigma^2}\right)^{\frac{K+N_0+\alpha_n}{2}+1}\exp\left\{-\frac{1}{2\sigma^2}\left(\beta_n^{\top}B_n^{-1}\beta_n+Y_0^{\top}Y_0-\beta_*^{\top}M\beta_*+\delta_n\right)\right\}\\
&\times\int_{\mathbb{R}^K}\exp\left\{-\frac{1}{2\sigma^2}(\beta-\beta_*)^{\top}M(\beta-\beta_*)\right\}d\beta\Bigr\}\,d\sigma^2,
\end{aligned}
$$
where the term in the second integral is the kernel of a multivariate normal density with mean $\beta_*$ and covariance matrix $\sigma^2 M^{-1}$, so integrating over $\beta$ contributes a factor $(2\pi\sigma^2)^{K/2}|M|^{-1/2}$. Then,
$$
\pi(Y_0\mid y)\propto\int_0^{\infty}\left(\frac{1}{\sigma^2}\right)^{\frac{N_0+\alpha_n}{2}+1}\exp\left\{-\frac{1}{2\sigma^2}\left(\beta_n^{\top}B_n^{-1}\beta_n+Y_0^{\top}Y_0-\beta_*^{\top}M\beta_*+\delta_n\right)\right\}d\sigma^2,
$$
which is the kernel of an inverse gamma density. Thus,
$$
\pi(Y_0\mid y)\propto\left[\frac{\beta_n^{\top}B_n^{-1}\beta_n+Y_0^{\top}Y_0-\beta_*^{\top}M\beta_*+\delta_n}{2}\right]^{-\frac{\alpha_n+N_0}{2}}.
$$
Setting $C^{-1}=I_{N_0}+X_0B_nX_0^{\top}$, such that $C=I_{N_0}-X_0(B_n^{-1}+X_0^{\top}X_0)^{-1}X_0^{\top}=I_{N_0}-X_0M^{-1}X_0^{\top}$,$^{18}$ and $\beta_{**}=C^{-1}X_0M^{-1}B_n^{-1}\beta_n$, then
$$
\begin{aligned}
\beta_n^{\top}B_n^{-1}\beta_n+Y_0^{\top}Y_0-\beta_*^{\top}M\beta_*&=\beta_n^{\top}B_n^{-1}\beta_n+Y_0^{\top}Y_0-(\beta_n^{\top}B_n^{-1}+Y_0^{\top}X_0)M^{-1}(B_n^{-1}\beta_n+X_0^{\top}Y_0)\\
&=\beta_n^{\top}(B_n^{-1}-B_n^{-1}M^{-1}B_n^{-1})\beta_n+Y_0^{\top}CY_0-2Y_0^{\top}CC^{-1}X_0M^{-1}B_n^{-1}\beta_n+\beta_{**}^{\top}C\beta_{**}-\beta_{**}^{\top}C\beta_{**}\\
&=\beta_n^{\top}(B_n^{-1}-B_n^{-1}M^{-1}B_n^{-1})\beta_n+(Y_0-\beta_{**})^{\top}C(Y_0-\beta_{**})-\beta_{**}^{\top}C\beta_{**},
\end{aligned}
$$
where $\beta_n^{\top}(B_n^{-1}-B_n^{-1}M^{-1}B_n^{-1})\beta_n=\beta_{**}^{\top}C\beta_{**}$ and $\beta_{**}=X_0\beta_n$ (see Exercise 6).
Then,
$$
\pi(Y_0\mid y)\propto\left[\frac{(Y_0-X_0\beta_n)^{\top}C(Y_0-X_0\beta_n)+\delta_n}{2}\right]^{-\frac{\alpha_n+N_0}{2}}\propto\left[\frac{(Y_0-X_0\beta_n)^{\top}\left(\frac{\alpha_n}{\delta_n}C\right)(Y_0-X_0\beta_n)}{\alpha_n}+1\right]^{-\frac{\alpha_n+N_0}{2}}.
$$
Then, the posterior predictive is a multivariate Student's t, $Y_0\mid y\sim t_{N_0}\left(\alpha_n,\,X_0\beta_n,\,\frac{\delta_n}{\alpha_n}\left(I_{N_0}+X_0B_nX_0^{\top}\right)\right)$.
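A sketch of predictive simulation under the same assumptions as the earlier code (`beta_n`, `Bn`, `alpha_n`, `delta_n`, `K` and `rng` already defined; the matrix `X0` below is an arbitrary illustrative design):

```python
import numpy as np
from scipy import stats

# Posterior predictive: multivariate t with df alpha_n, location X0 @ beta_n and
# scale (delta_n / alpha_n) * (I + X0 Bn X0^T).
X0 = np.column_stack([np.ones(5), rng.normal(size=(5, K - 1))])
pred_scale = (delta_n / alpha_n) * (np.eye(X0.shape[0]) + X0 @ Bn @ X0.T)
predictive = stats.multivariate_t(loc=X0 @ beta_n, shape=pred_scale, df=alpha_n)
Y0_draws = predictive.rvs(size=1000, random_state=rng)  # simulated future outcomes
```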