Chapter 3 Posteriors Using Sufficient Statistics
Recall the following Bayesian (hierarchical) linear regression model from part 1:
$$NIG(M_0, m_0, a_0, b_0) \times N(y \mid X\beta, \sigma^2 I_n),$$ where $y = (y_1, y_2, \ldots, y_n)^T$ is an $n \times 1$ vector of outcomes and $X$ is a fixed $n \times p$ matrix of predictors with full column rank.
Suppose that, for privacy reasons, instead of $y = (y_1, y_2, \ldots, y_n)^T$ we are given data consisting only of the sufficient statistics $s^2$ and $\bar{y}$. Further suppose that we are working with the following simpler model: $y = 1_n\beta_0 + \epsilon$.
How do we obtain $p(\beta_0, \sigma^2 \mid y)$ using this data set?
3.1 Methods
We can use the factorization theorem to factor the normal distribution's pdf into a function of the sufficient statistics, $s^2$ and $\bar{y}$:
$$\begin{aligned} f(y \mid \beta_0, \sigma^2) &= \prod_{i=1}^{n}(2\pi\sigma^2)^{-\frac{1}{2}}\exp\left\{-\frac{1}{2\sigma^2}(y_i - \beta_0)^2\right\} \\ &= (2\pi\sigma^2)^{-\frac{n}{2}}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \beta_0)^2\right\} \\ &= (2\pi\sigma^2)^{-\frac{n}{2}}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \bar{y} + \bar{y} - \beta_0)^2\right\} \\ &= (2\pi\sigma^2)^{-\frac{n}{2}}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}\left[(y_i - \bar{y})^2 + (\bar{y} - \beta_0)^2\right]\right\} \\ &= (2\pi\sigma^2)^{-\frac{n}{2}}\exp\left\{-\frac{1}{2\sigma^2}\left[(n-1)s^2 + n(\bar{y} - \beta_0)^2\right]\right\} \end{aligned}$$
Here the cross term $2(\bar{y} - \beta_0)\sum_{i=1}^{n}(y_i - \bar{y})$ vanishes because $\sum_{i=1}^{n}(y_i - \bar{y}) = 0$.
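This identity can be verified numerically. Below is a minimal sketch, assuming NumPy is available (the simulated data and variable names are ours, not from the text), comparing the log-likelihood from the raw data against the version computed from $\bar{y}$ and $s^2$ alone:

```python
import numpy as np

rng = np.random.default_rng(42)
n, beta0, sigma2 = 50, 2.0, 1.5
y = rng.normal(beta0, np.sqrt(sigma2), size=n)

# Sufficient statistics: sample mean and sample variance (ddof=1).
ybar = y.mean()
s2 = y.var(ddof=1)

# Log-likelihood from the raw data.
loglik_raw = -0.5 * n * np.log(2 * np.pi * sigma2) \
             - 0.5 * np.sum((y - beta0) ** 2) / sigma2

# Log-likelihood from the sufficient statistics only.
loglik_suff = -0.5 * n * np.log(2 * np.pi * sigma2) \
              - 0.5 * ((n - 1) * s2 + n * (ybar - beta0) ** 2) / sigma2

assert np.isclose(loglik_raw, loglik_suff)  # the two forms agree
```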
We have now successfully expressed our likelihood as a function of the sufficient statistics $s^2$ and $\bar{y}$, and can use this factorized likelihood along with Bayes' Rule to compute $p(\beta_0, \sigma^2 \mid y)$.
$$\begin{aligned} p(\beta_0, \sigma^2 \mid y) &\propto p(\beta_0 \mid \mu_\beta, \sigma^2) \times p(\sigma^2 \mid a_0, b_0) \times p(y \mid \beta_0, \sigma^2) \\ &= N(\beta_0 \mid \mu_\beta, \sigma^2) \times IG(\sigma^2 \mid a_0, b_0) \times N(y \mid \beta_0, \sigma^2) \\ &= (2\pi\sigma^2)^{-\frac{1}{2}}\exp\left\{-\frac{1}{2\sigma^2}(\beta_0 - \mu_\beta)^2\right\} \times \frac{b_0^{a_0}}{\Gamma(a_0)}(\sigma^2)^{-a_0-1}\exp\left\{-\frac{b_0}{\sigma^2}\right\} \times (2\pi\sigma^2)^{-\frac{n}{2}}\exp\left\{-\frac{1}{2\sigma^2}\left[(n-1)s^2 + n(\bar{y} - \beta_0)^2\right]\right\} \\ &\propto (\sigma^2)^{-a_0-1} \times (\sigma^2)^{-\frac{n+1}{2}} \times \exp\left\{-\frac{b_0}{\sigma^2}\right\} \times \exp\left\{-\frac{1}{2\sigma^2}(\beta_0 - \mu_\beta)^2\right\} \times \exp\left\{-\frac{1}{2\sigma^2}\left[(n-1)s^2 + n(\bar{y} - \beta_0)^2\right]\right\} \\ &= (\sigma^2)^{-a_0-1} \times (\sigma^2)^{-\frac{n+1}{2}}\exp\left\{-\frac{b_0}{\sigma^2}\right\} \times \exp\left\{-\frac{1}{2\sigma^2}\left[(\beta_0 - \mu_\beta)^2 + (n-1)s^2 + n(\bar{y} - \beta_0)^2\right]\right\} \end{aligned}$$
Complete the square on the following term:
$$\begin{aligned} &(\beta_0 - \mu_\beta)^2 + (n-1)s^2 + n(\bar{y} - \beta_0)^2 \\ &= (n+1)\beta_0^2 - 2(\mu_\beta + n\bar{y})\beta_0 + n\bar{y}^2 + \mu_\beta^2 + (n-1)s^2 \\ &= (n+1)\left(\beta_0 - \frac{\mu_\beta + n\bar{y}}{n+1}\right)^2 + n\bar{y}^2 + \mu_\beta^2 + (n-1)s^2 - \frac{(\mu_\beta + n\bar{y})^2}{n+1} \\ &= (n+1)\left(\beta_0 - \frac{\mu_\beta + n\bar{y}}{n+1}\right)^2 + y^T\left(\frac{1_n 1_n^T}{n}\right)y + y^T\left(I - \frac{1_n 1_n^T}{n}\right)y + \mu_\beta^2 - \frac{(\mu_\beta + n\bar{y})^2}{n+1} \\ &= (n+1)\left(\beta_0 - \frac{\mu_\beta + n\bar{y}}{n+1}\right)^2 + y^Ty + \mu_\beta^2 - \frac{(\mu_\beta + n\bar{y})^2}{n+1} \end{aligned}$$
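The completing-the-square step can be checked symbolically. A quick sketch using SymPy (the symbol names are ours, not from the text):

```python
import sympy as sp

beta0, mu, ybar, s2, n = sp.symbols('beta0 mu_beta ybar s2 n', positive=True)

# Left-hand side: the quadratic-in-beta0 term from the posterior kernel.
lhs = (beta0 - mu)**2 + (n - 1)*s2 + n*(ybar - beta0)**2

# Right-hand side: the completed square plus the leftover constant.
mu_star = (mu + n*ybar) / (n + 1)
rhs = (n + 1)*(beta0 - mu_star)**2 + n*ybar**2 + mu**2 + (n - 1)*s2 \
      - (mu + n*ybar)**2 / (n + 1)

assert sp.simplify(lhs - rhs) == 0  # the identity holds exactly
```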
We now let:
- $\mu_\beta^* = \frac{n\bar{y} + \mu_\beta}{n+1}$
- $c = y^Ty + \mu_\beta^2 - \frac{(n\bar{y} + \mu_\beta)^2}{n+1}$,

and substitute this expansion back into the expression above.
$$\begin{aligned} &= (\sigma^2)^{-a_0-1} \times (\sigma^2)^{-\frac{n+1}{2}} \times \exp\left\{-\frac{b_0}{\sigma^2}\right\} \times \exp\left\{-\frac{n+1}{2\sigma^2}(\beta_0 - \mu_\beta^*)^2\right\} \times \exp\left\{-\frac{c}{2\sigma^2}\right\} \\ &= (\sigma^2)^{-a_0-1} \times (\sigma^2)^{-\frac{n+1}{2}} \times \exp\left\{-\frac{b_0}{\sigma^2}\right\} \times \exp\left\{-\frac{1}{2\sigma_*^2}(\beta_0 - \mu_\beta^*)^2\right\} \times \exp\left\{-\frac{c}{2\sigma^2}\right\} \\ &= (\sigma^2)^{-a_0-1-\frac{n}{2}} \times \exp\left\{-\frac{b_0 + \frac{c}{2}}{\sigma^2}\right\} \times (\sigma^2)^{-\frac{1}{2}} \times \exp\left\{-\frac{1}{2\sigma_*^2}(\beta_0 - \mu_\beta^*)^2\right\} \\ &= \underbrace{(\sigma^2)^{-a_1-1} \times \exp\left\{-\frac{b_1}{\sigma^2}\right\}}_{\text{kernel of } IG(a_1, b_1)} \times \underbrace{(\sigma^2)^{-\frac{1}{2}} \times \exp\left\{-\frac{1}{2\sigma_*^2}(\beta_0 - \mu_\beta^*)^2\right\}}_{\text{kernel of } N(\mu_\beta^*, \sigma_*^2)} \end{aligned}$$
Where:
- $\mu_\beta^* = \frac{n\bar{y} + \mu_\beta}{n+1}$
- $\sigma_*^2 = \frac{\sigma^2}{n+1}$
- $a_1 = a_0 + \frac{n}{2}$
- $b_1 = b_0 + \frac{c}{2}$
- $c = y^Ty + \mu_\beta^2 - \frac{(n\bar{y} + \mu_\beta)^2}{n+1}$

Note that $y^Ty = (n-1)s^2 + n\bar{y}^2$, so $c$ (and hence $b_1$) can be computed from the sufficient statistics alone.
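Putting the pieces together, the sketch below (a minimal illustration assuming NumPy; the function name `posterior_from_suff_stats` is ours) computes $(\mu_\beta^*, a_1, b_1)$ from $(n, \bar{y}, s^2)$ alone, exactly as the privacy-constrained setting requires:

```python
import numpy as np

def posterior_from_suff_stats(n, ybar, s2, mu_beta, a0, b0):
    """Posterior parameters for the intercept-only NIG model, computed
    from the sufficient statistics (n, ybar, s2) alone."""
    yty = (n - 1) * s2 + n * ybar**2           # recover y'y from (ybar, s2)
    mu_star = (n * ybar + mu_beta) / (n + 1)   # posterior mean of beta0
    c = yty + mu_beta**2 - (n * ybar + mu_beta)**2 / (n + 1)
    a1 = a0 + n / 2
    b1 = b0 + c / 2
    # Conditionally, beta0 | sigma^2, y ~ N(mu_star, sigma^2 / (n + 1)).
    return mu_star, a1, b1

# Example: simulated data reduced to its sufficient statistics.
rng = np.random.default_rng(0)
y = rng.normal(2.0, 1.0, size=100)
print(posterior_from_suff_stats(len(y), y.mean(), y.var(ddof=1),
                                mu_beta=0.0, a0=2.0, b0=1.0))
```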
3.2 Posterior From Improper Priors
Recall that when we work with this model in a general regression setting, where $X$ is an $n \times p$ matrix of predictors with full column rank, the posterior density is:
$$p(\beta, \sigma^2 \mid y) = IG(\sigma^2 \mid a_1, b_1) \times N(\beta \mid M_1m_1, \sigma^2 M_1)$$ We end up with:
- $M_1 = (X^TX + M_0^{-1})^{-1}$
- $m_1 = X^Ty + M_0^{-1}m_0$
- $a_1 = a_0 + \frac{n}{2}$
- $b_1 = b_0 + \frac{c}{2}$
- $c = y^Ty + m_0^TM_0^{-1}m_0 - m_1^TM_1m_1$
If we take $M_0^{-1} \to O$ (i.e., the null matrix), $a_0 \to -\frac{p}{2}$, and $b_0 \to 0$, we can see that this leads to the improper prior $p(\beta, \sigma^2) \propto \frac{1}{\sigma^2}$:
$$\begin{aligned} p(\beta, \sigma^2) &= IG(\sigma^2 \mid a_0, b_0) \times N(\beta \mid m_0, \sigma^2 M_0) \\ &= \frac{b_0^{a_0}}{\Gamma(a_0)}(\sigma^2)^{-a_0-1}\exp\left\{-\frac{b_0}{\sigma^2}\right\} \times (2\pi)^{-p/2}\det(\sigma^2 M_0)^{-\frac{1}{2}}\exp\left\{-\frac{1}{2\sigma^2}(\beta - m_0)^TM_0^{-1}(\beta - m_0)\right\} \\ &\propto (\sigma^2)^{-a_0-1}\exp\left\{-\frac{b_0}{\sigma^2}\right\} \times (\sigma^2)^{-\frac{p}{2}}\exp\left\{-\frac{1}{2\sigma^2}(\beta - m_0)^TM_0^{-1}(\beta - m_0)\right\} \\ &\to (\sigma^2)^{\frac{p}{2}-1} \times (\sigma^2)^{-\frac{p}{2}}\exp\left\{-\frac{1}{2\sigma^2}(\beta - m_0)^TO(\beta - m_0)\right\} \\ &= \frac{1}{\sigma^2}, \end{aligned}$$
since in the limit $\exp\{-b_0/\sigma^2\} \to 1$ and the quadratic form in the exponent vanishes. This ultimately yields a posterior distribution with:
- $M_1 = (X^TX)^{-1}$ and $m_1 = X^Ty$, so the posterior mean is $M_1m_1 = (X^TX)^{-1}X^Ty = \hat{\beta}$
- $a_1 = \frac{n-p}{2}$
- $b_1 = b_0 + \frac{c^*}{2} = \frac{c^*}{2}$ (since $b_0 \to 0$)
- $c^* = y^Ty - m_1^TM_1m_1 = y^T(I - P_x)y = (n-p)s^2$

where $P_x = X(X^TX)^{-1}X^T$ and $s^2$ is the usual unbiased estimate of the residual variance.
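As a quick numerical check of these limiting-case formulas, the sketch below (assuming NumPy; the simulated data and names are ours) verifies that $y^Ty - m_1^TM_1m_1$ equals $y^T(I - P_x)y$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, -0.5, 2.0]) + rng.normal(size=n)

M1 = np.linalg.inv(X.T @ X)       # posterior covariance factor (M0^{-1} -> O)
m1 = X.T @ y
beta_hat = M1 @ m1                # posterior mean M1 m1 = OLS estimate
a1 = (n - p) / 2
c_star = y @ y - m1 @ M1 @ m1     # c* = y'y - m1' M1 m1
b1 = c_star / 2                   # since b0 -> 0

Px = X @ M1 @ X.T                 # projection onto the column space of X
assert np.isclose(c_star, y @ (np.eye(n) - Px) @ y)
```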
If we apply this framework to the simpler model presented at the beginning of the section, $y = 1_n\beta_0 + \epsilon$, where $X$ is a vector of 1's instead of an $n \times p$ matrix, our posterior parameters should be:
- $\mu_\beta^* = \bar{y}$
- $a_1 = \frac{n-1}{2}$
- $b_1 = \frac{(n-1)s^2}{2}$
- $c = y^Ty - m_1^TM_1m_1 = y^T(I - P_x)y = (n-1)s^2$
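For the intercept-only model this reduces to familiar scalar quantities; a brief sketch (again assuming NumPy, with our own simulated data):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 150
y = rng.normal(3.0, 2.0, size=n)
ybar, s2 = y.mean(), y.var(ddof=1)

# With X = 1_n: M1 = 1/n, m1 = n*ybar, so m1' M1 m1 = n*ybar^2.
c = y @ y - n * ybar**2
assert np.isclose(c, (n - 1) * s2)   # hence b1 = (n - 1) s^2 / 2
```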
3.3 Extension to Divide & Conquer Algorithm
While maintaining the frameworks we established in parts 1 and 2, let's now additionally assume $n$ is so large that we are unable to store or load $y$ or $X$ into memory to carry out computations. As a result, we divide the data set into $K$ mutually exclusive and exhaustive subsets, each comprising a manageable number of points. Let $y_k$ denote the $q_k \times 1$ subvector of $y$ and $X_k$ the $q_k \times p$ submatrix of $X$ in subset $k$, such that $q_k > p$ and is small enough that we can fit the desired model on $\{y_k, X_k\}$. It is shown below that it is still possible to compute $a_1$, $b_1$, $M_1$ and $m_1$ without ever having to store or compute with $y$ or $X$, using only quantities computed from the subsets $\{y_k, X_k\}$, for $k = 1, 2, \ldots, K$.
Starting with a simple example where $K = 2$, and taking the assumptions made in parts 1 and 2 into consideration, we note that:
- $\bar{y}_1 = \frac{\sum_{i=1}^{q_1} y_{1i}}{q_1}$
- $\bar{y}_2 = \frac{\sum_{i=1}^{q_2} y_{2i}}{q_2}$
- This implies that $\bar{y} = \frac{q_1\bar{y}_1 + q_2\bar{y}_2}{n}$, where $q_1 + q_2 = n$
- Extending this to $K$ subsets: $\bar{y} = \frac{\sum_{k=1}^{K} q_k\bar{y}_k}{n}$, where $\sum_{k=1}^{K} q_k = n$
- $s_1^2 = \frac{\sum_{i=1}^{q_1}(y_{1i} - \bar{y}_1)^2}{q_1 - 1}$
- $s_2^2 = \frac{\sum_{i=1}^{q_2}(y_{2i} - \bar{y}_2)^2}{q_2 - 1}$
- Decomposing $\sum_{i=1}^{n}(y_i - \bar{y})^2$ into within-subset and between-subset pieces implies that $s^2 = \frac{(q_1-1)s_1^2 + (q_2-1)s_2^2 + q_1(\bar{y}_1 - \bar{y})^2 + q_2(\bar{y}_2 - \bar{y})^2}{q_1 + q_2 - 1}$, where $q_1 + q_2 = n$
- Extending this to $K$ subsets: $s^2 = \frac{\sum_{k=1}^{K}\left[(q_k-1)s_k^2 + q_k(\bar{y}_k - \bar{y})^2\right]}{n - 1}$, where $\sum_{k=1}^{K} q_k = n$ (see the sketch after this list)
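These combination rules are easy to verify numerically. The sketch below (assuming NumPy; the simulated data and names are ours) recombines subset means and variances and checks them against the full-data values. Note that the between-subset terms $q_k(\bar{y}_k - \bar{y})^2$ are required to recover $s^2$ exactly:

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(5.0, 2.0, size=1000)
subsets = np.array_split(y, 4)                           # K = 4 disjoint subsets

q = np.array([len(sub) for sub in subsets])              # subset sizes q_k
ybars = np.array([sub.mean() for sub in subsets])        # subset means
s2s = np.array([sub.var(ddof=1) for sub in subsets])     # subset variances

n = q.sum()
ybar = (q * ybars).sum() / n                             # combined mean
# Within-subset plus between-subset sums of squares:
s2 = ((q - 1) * s2s + q * (ybars - ybar) ** 2).sum() / (n - 1)

assert np.isclose(ybar, y.mean())
assert np.isclose(s2, y.var(ddof=1))
```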
Fitting the model to each subset separately yields the subset-level posterior parameters:
- $\mu_{\beta,k}^* = \bar{y}_k$
- $a_{1,k} = \frac{q_k - 1}{2}$
- $b_{1,k} = \frac{(q_k - 1)s_k^2}{2}$
- $c_k = (q_k - 1)s_k^2$

The full-data posterior parameters are:
- $\mu_\beta^* = \bar{y}$
- $a_1 = \frac{n-1}{2}$
- $b_1 = \frac{(n-1)s^2}{2}$
- $c = y^Ty - m_1^TM_1m_1 = y^T(I - P_x)y = (n-1)s^2$

Substituting the combination rules above (for $K = 2$), each of these can be written purely in terms of the subset summaries:
- $\mu_\beta^* = \bar{y} = \frac{q_1\bar{y}_1 + q_2\bar{y}_2}{n}$
- $a_1 = \frac{q_1 + q_2 - 1}{2}$
- $b_1 = \frac{(q_1-1)s_1^2 + (q_2-1)s_2^2 + q_1(\bar{y}_1 - \bar{y})^2 + q_2(\bar{y}_2 - \bar{y})^2}{2}$
- $c = (q_1-1)s_1^2 + (q_2-1)s_2^2 + q_1(\bar{y}_1 - \bar{y})^2 + q_2(\bar{y}_2 - \bar{y})^2$
which shows that it is possible to compute $a_1$, $b_1$, $M_1$ and $m_1$ without ever having to store or compute with $y$ or $X$, using only quantities computed from the subsets (for $k = 1, 2$).
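To close the loop, here is a minimal end-to-end sketch (assuming NumPy; the simulated data and names are ours) that keeps only $(q_k, \bar{y}_k, s_k^2)$ from each subset and recovers the full-data posterior parameters $(\mu_\beta^*, a_1, b_1)$:

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.normal(-1.0, 1.5, size=10_000)
subsets = np.array_split(y, 10)                   # K = 10 subsets

# The only quantities retained from each subset:
q = np.array([len(sub) for sub in subsets])
ybars = np.array([sub.mean() for sub in subsets])
s2s = np.array([sub.var(ddof=1) for sub in subsets])

# Combined posterior parameters, built from the subset summaries alone.
n = q.sum()
mu_star = (q * ybars).sum() / n
c = ((q - 1) * s2s + q * (ybars - mu_star) ** 2).sum()
a1, b1 = (n - 1) / 2, c / 2

# They match the full-data values.
assert np.isclose(mu_star, y.mean())
assert np.isclose(b1, (len(y) - 1) * y.var(ddof=1) / 2)
```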