Chapter 3 Posteriors Using Sufficient Statistics

Recall the following Bayesian (hierarchical) linear regression model from part 1:

$$ NIG(M_0, m_0, a_0, b_0) \times N(y \mid X\beta, \sigma^2 I_n), $$

where $y = (y_1, y_2, \ldots, y_n)^T$ is an $n \times 1$ vector of outcomes and $X$ is a fixed $n \times p$ matrix of predictors with full column rank.

Suppose that due to privacy reasons, instead of $y = (y_1, y_2, \ldots, y_n)^T$ we are given data consisting only of the sufficient statistics $s^2$ and $\bar{y}$. Further suppose that we are working with the following simpler (intercept-only) model: $y = 1_n\beta_0 + \epsilon$, with $\epsilon \sim N(0, \sigma^2 I_n)$.

How do we obtain $p(\beta_0, \sigma^2 \mid y)$ using this data set?

3.1 Methods

We can use the factorization theorem to factor the normal density into a function of the sufficient statistics $s^2$ and $\bar{y}$:

$$
\begin{aligned}
f(y \mid \beta_0, \sigma^2) &= \prod_{i=1}^{n} (2\pi\sigma^2)^{-\frac{1}{2}} \exp\left\{-\frac{1}{2\sigma^2}(y_i - \beta_0)^2\right\} \\
&= (2\pi\sigma^2)^{-\frac{n}{2}} \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \beta_0)^2\right\} \\
&= (2\pi\sigma^2)^{-\frac{n}{2}} \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \bar{y} + \bar{y} - \beta_0)^2\right\} \\
&= (2\pi\sigma^2)^{-\frac{n}{2}} \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}\left[(y_i - \bar{y})^2 + (\bar{y} - \beta_0)^2\right]\right\} \\
&= (2\pi\sigma^2)^{-\frac{n}{2}} \exp\left\{-\frac{1}{2\sigma^2}\left[(n-1)s^2 + n(\bar{y} - \beta_0)^2\right]\right\},
\end{aligned}
$$

where the cross term $2(\bar{y} - \beta_0)\sum_{i=1}^{n}(y_i - \bar{y})$ vanishes because $\sum_{i=1}^{n}(y_i - \bar{y}) = 0$.

We have now successfully expressed our likelihood as a function of the sufficient statistics $s^2$ and $\bar{y}$, and can use this factorized likelihood along with Bayes' Rule to compute $p(\beta_0, \sigma^2 \mid y)$.
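As a quick sanity check, this identity can be verified numerically. Below is a minimal sketch (all values are arbitrary and illustrative, not from the notes) confirming that the raw-data log-likelihood equals the sufficient-statistic form:

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta0, sigma2 = 50, 2.0, 1.5
y = rng.normal(beta0, np.sqrt(sigma2), size=n)

ybar, s2 = y.mean(), y.var(ddof=1)   # sufficient statistics

# Raw-data form: sum of N(y_i | beta0, sigma2) log-densities
ll_raw = -0.5 * n * np.log(2 * np.pi * sigma2) \
         - 0.5 * np.sum((y - beta0) ** 2) / sigma2

# Sufficient-statistic form: -(1/(2 sigma2)) [(n-1) s^2 + n (ybar - beta0)^2]
ll_suff = -0.5 * n * np.log(2 * np.pi * sigma2) \
          - 0.5 * ((n - 1) * s2 + n * (ybar - beta0) ** 2) / sigma2

print(np.isclose(ll_raw, ll_suff))   # True
```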

Applying Bayes' Rule:

$$
\begin{aligned}
p(\beta_0, \sigma^2 \mid y) &\propto p(\beta_0 \mid \mu_\beta, \sigma^2) \times p(\sigma^2 \mid a_0, b_0) \times p(y \mid \beta_0, \sigma^2) \\
&= N(\beta_0 \mid \mu_\beta, \sigma^2) \times IG(\sigma^2 \mid a_0, b_0) \times N(y \mid 1_n\beta_0, \sigma^2 I_n) \\
&= (2\pi\sigma^2)^{-\frac{1}{2}}\exp\left\{-\frac{1}{2\sigma^2}(\beta_0 - \mu_\beta)^2\right\} \times \frac{b_0^{a_0}}{\Gamma(a_0)}(\sigma^2)^{-a_0-1}\exp\left\{-\frac{b_0}{\sigma^2}\right\} \\
&\qquad \times (2\pi\sigma^2)^{-\frac{n}{2}}\exp\left\{-\frac{1}{2\sigma^2}\left[(n-1)s^2 + n(\bar{y} - \beta_0)^2\right]\right\} \\
&\propto (\sigma^2)^{-a_0-1} \times (\sigma^2)^{-\frac{n+1}{2}} \times \exp\left\{-\frac{b_0}{\sigma^2}\right\} \times \exp\left\{-\frac{1}{2\sigma^2}(\beta_0 - \mu_\beta)^2\right\} \times \exp\left\{-\frac{1}{2\sigma^2}\left[(n-1)s^2 + n(\bar{y} - \beta_0)^2\right]\right\} \\
&= (\sigma^2)^{-a_0-1} \times (\sigma^2)^{-\frac{n+1}{2}}\exp\left\{-\frac{b_0}{\sigma^2}\right\} \times \exp\left\{-\frac{1}{2\sigma^2}\left[(\beta_0 - \mu_\beta)^2 + (n-1)s^2 + n(\bar{y} - \beta_0)^2\right]\right\}.
\end{aligned}
$$

Here the prior on $\beta_0$ is $N(\mu_\beta, \sigma^2)$, which corresponds to taking $M_0 = 1$ in the general notation.

Complete the square on the following term:

$$
\begin{aligned}
(\beta_0 - \mu_\beta)^2 + (n-1)s^2 + n(\bar{y} - \beta_0)^2
&= (n+1)\beta_0^2 - 2(\mu_\beta + n\bar{y})\beta_0 + n\bar{y}^2 + \mu_\beta^2 + (n-1)s^2 \\
&= (n+1)\left(\beta_0 - \frac{\mu_\beta + n\bar{y}}{n+1}\right)^2 + n\bar{y}^2 + \mu_\beta^2 + (n-1)s^2 - \frac{(\mu_\beta + n\bar{y})^2}{n+1} \\
&= (n+1)\left(\beta_0 - \frac{\mu_\beta + n\bar{y}}{n+1}\right)^2 + y^T\left(\frac{1_n 1_n^T}{n}\right)y + y^T\left(I - \frac{1_n 1_n^T}{n}\right)y + \mu_\beta^2 - \frac{(\mu_\beta + n\bar{y})^2}{n+1} \\
&= (n+1)\left(\beta_0 - \frac{\mu_\beta + n\bar{y}}{n+1}\right)^2 + y^Ty + \mu_\beta^2 - \frac{(\mu_\beta + n\bar{y})^2}{n+1},
\end{aligned}
$$

where we used $n\bar{y}^2 = y^T(1_n 1_n^T/n)y$ and $(n-1)s^2 = y^T(I - 1_n 1_n^T/n)y$.
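This complete-the-square step is also easy to check numerically; a minimal sketch with arbitrary values of $n$, $\mu_\beta$, and $\beta_0$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, mu_beta, beta0 = 20, 0.5, 1.3
y = rng.normal(size=n)
ybar, s2 = y.mean(), y.var(ddof=1)

# Left-hand side: the term in the exponent before completing the square
lhs = (beta0 - mu_beta) ** 2 + (n - 1) * s2 + n * (ybar - beta0) ** 2

# Right-hand side: the completed-square form
mu_star = (mu_beta + n * ybar) / (n + 1)
rhs = (n + 1) * (beta0 - mu_star) ** 2 \
      + y @ y + mu_beta ** 2 - (mu_beta + n * ybar) ** 2 / (n + 1)

print(np.isclose(lhs, rhs))   # True
```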

We now let:
    • $\mu_\beta^* = \dfrac{n\bar{y} + \mu_\beta}{n+1}$
    • $c = y^Ty + \mu_\beta^2 - \dfrac{(n\bar{y} + \mu_\beta)^2}{n+1}$,

and substitute this expansion back into the posterior kernel:

$$
\begin{aligned}
&\propto (\sigma^2)^{-a_0-1} \times (\sigma^2)^{-\frac{n+1}{2}} \times \exp\left\{-\frac{b_0}{\sigma^2}\right\} \times \exp\left\{-\frac{n+1}{2\sigma^2}(\beta_0 - \mu_\beta^*)^2\right\} \times \exp\left\{-\frac{c}{2\sigma^2}\right\} \\
&= (\sigma^2)^{-a_0-1} \times (\sigma^2)^{-\frac{n+1}{2}} \times \exp\left\{-\frac{b_0}{\sigma^2}\right\} \times \exp\left\{-\frac{1}{2\sigma^{2*}}(\beta_0 - \mu_\beta^*)^2\right\} \times \exp\left\{-\frac{c}{2\sigma^2}\right\} \\
&= (\sigma^2)^{-a_0-1-\frac{n}{2}} \times \exp\left\{-\frac{b_0 + \frac{c}{2}}{\sigma^2}\right\} \times (\sigma^2)^{-\frac{1}{2}} \times \exp\left\{-\frac{1}{2\sigma^{2*}}(\beta_0 - \mu_\beta^*)^2\right\} \\
&= \underbrace{(\sigma^2)^{-a_1-1}\exp\left\{-\frac{b_1}{\sigma^2}\right\}}_{\text{kernel of } IG(a_1,\, b_1)} \times \underbrace{(\sigma^2)^{-\frac{1}{2}}\exp\left\{-\frac{1}{2\sigma^{2*}}(\beta_0 - \mu_\beta^*)^2\right\}}_{\text{kernel of } N(\mu_\beta^*,\, \sigma^{2*})}
\end{aligned}
$$

Where:
    • $\mu_\beta^* = \dfrac{n\bar{y} + \mu_\beta}{n+1}$
    • $\sigma^{2*} = \dfrac{\sigma^2}{n+1}$
    • $a_1 = a_0 + \dfrac{n}{2}$
    • $b_1 = b_0 + \dfrac{c}{2}$
    • $c = y^Ty + \mu_\beta^2 - \dfrac{(n\bar{y} + \mu_\beta)^2}{n+1}$
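As a sketch of how this update might look in code, assuming we are handed only $(n, \bar{y}, s^2)$ and the prior hyperparameters (the function name and example values below are ours, purely illustrative):

```python
def posterior_from_suff_stats(n, ybar, s2, mu_beta, a0, b0):
    """NIG posterior for the intercept-only model from (n, ybar, s2).

    Returns (mu_star, a1, b1); the posterior is
    sigma2 ~ IG(a1, b1) and beta0 | sigma2 ~ N(mu_star, sigma2 / (n + 1)).
    """
    mu_star = (n * ybar + mu_beta) / (n + 1)
    # y^T y is itself a function of the sufficient statistics:
    # y^T y = (n - 1) s^2 + n ybar^2, so the raw y is never needed.
    yty = (n - 1) * s2 + n * ybar ** 2
    c = yty + mu_beta ** 2 - (n * ybar + mu_beta) ** 2 / (n + 1)
    return mu_star, a0 + n / 2, b0 + c / 2

# Example call with arbitrary prior values
print(posterior_from_suff_stats(n=50, ybar=2.1, s2=1.4,
                                mu_beta=0.0, a0=2.0, b0=1.0))
```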

3.2 Posterior From Improper Priors

Recall that when we work with this model in a general regression setting, where $X$ is an $n \times p$ matrix of predictors with full column rank, the posterior density is:

$$ p(\beta, \sigma^2 \mid y) = IG(\sigma^2 \mid a_1, b_1) \times N(\beta \mid M_1m_1, \sigma^2 M_1). $$

We end up with:
    • posterior mean $E[\beta \mid \sigma^2, y] = M_1m_1$
    • $M_1 = (X^TX + M_0^{-1})^{-1}$
    • $m_1 = X^Ty + M_0^{-1}m_0$
    • $a_1 = a_0 + \dfrac{n}{2}$
    • $b_1 = b_0 + \dfrac{c}{2}$
    • $c = y^Ty + m_0^TM_0^{-1}m_0 - m_1^TM_1m_1$
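A minimal sketch of this general update, under the assumption that $X$ has full column rank (variable names follow the bullets above; the function itself is ours):

```python
import numpy as np

def nig_posterior(X, y, m0, M0, a0, b0):
    """Conjugate NIG update; returns (M1, m1, a1, b1).

    Posterior: sigma2 ~ IG(a1, b1), beta | sigma2 ~ N(M1 @ m1, sigma2 * M1).
    """
    n = len(y)
    M0_inv = np.linalg.inv(M0)
    M1 = np.linalg.inv(X.T @ X + M0_inv)
    m1 = X.T @ y + M0_inv @ m0
    c = y @ y + m0 @ M0_inv @ m0 - m1 @ M1 @ m1
    return M1, m1, a0 + n / 2, b0 + c / 2
```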

If we take $M_0^{-1} \to O$ (i.e., the null matrix), $a_0 \to -\dfrac{p}{2}$, and $b_0 \to 0$, we can see that this limit leads to the improper prior $p(\beta, \sigma^2) \propto \dfrac{1}{\sigma^2}$:

$$
\begin{aligned}
p(\beta, \sigma^2) &= IG(\sigma^2 \mid a_0, b_0) \times N(\beta \mid m_0, \sigma^2 M_0) \\
&= \frac{b_0^{a_0}}{\Gamma(a_0)}(\sigma^2)^{-a_0-1}\exp\left\{-\frac{b_0}{\sigma^2}\right\} \times (2\pi)^{-p/2}\det(\sigma^2 M_0)^{-\frac{1}{2}}\exp\left\{-\frac{1}{2\sigma^2}(\beta - m_0)^TM_0^{-1}(\beta - m_0)\right\} \\
&\propto (\sigma^2)^{-a_0-1}\exp\left\{-\frac{b_0}{\sigma^2}\right\} \times (\sigma^2)^{-\frac{p}{2}}\exp\left\{-\frac{1}{2\sigma^2}(\beta - m_0)^TM_0^{-1}(\beta - m_0)\right\} \\
&\to (\sigma^2)^{\frac{p}{2}-1}\exp\{0\} \times (\sigma^2)^{-\frac{p}{2}}\exp\left\{-\frac{1}{2\sigma^2}(\beta - m_0)^T O\,(\beta - m_0)\right\} \\
&= \frac{1}{\sigma^2},
\end{aligned}
$$

and ultimately yields a posterior distribution with:

    • $M_1 = (X^TX)^{-1}$ and $m_1 = X^Ty$, so the posterior mean is $M_1m_1 = (X^TX)^{-1}X^Ty = \hat{\beta}$, the least-squares estimate
    • $a_1 = \dfrac{n-p}{2}$
    • $b_1 = \dfrac{c}{2}$ (since $b_0 \to 0$)
    • $c = y^Ty - m_1^TM_1m_1 = y^T(I - P_x)y = (n-p)s^2$

where $P_x = X(X^TX)^{-1}X^T$ is the projection matrix onto the column space of $X$.
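The last bullet can be checked numerically on simulated data (dimensions and seed below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ rng.normal(size=p) + rng.normal(size=n)

Px = X @ np.linalg.solve(X.T @ X, X.T)   # projection onto col(X)
c = y @ (np.eye(n) - Px) @ y

resid = y - Px @ y
s2 = resid @ resid / (n - p)             # usual residual variance estimate

print(np.isclose(c, (n - p) * s2))       # True
print("a1 =", (n - p) / 2, " b1 =", c / 2)
```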

If we apply this framework to the simpler model presented at the beginning of the section, $y = 1_n\beta_0 + \epsilon$, where $X = 1_n$ is a vector of 1's (so that $p = 1$), our posterior parameters should be:
    • $\mu_\beta^* = \bar{y}$
    • $a_1 = \dfrac{n-1}{2}$
    • $b_1 = \dfrac{(n-1)s^2}{2}$
    • $c = y^Ty - m_1^TM_1m_1 = y^T(I - P_x)y = (n-1)s^2$
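Specializing the same check to $X = 1_n$ (again on arbitrary simulated data) recovers these values:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 40
y = rng.normal(loc=1.0, scale=2.0, size=n)
X = np.ones((n, 1))                          # intercept-only design

Px = X @ np.linalg.solve(X.T @ X, X.T)       # here Px = ones((n, n)) / n
mu_star = np.linalg.solve(X.T @ X, X.T @ y).item()
c = y @ (np.eye(n) - Px) @ y

print(np.isclose(mu_star, y.mean()))             # True: posterior mean is ybar
print(np.isclose(c, (n - 1) * y.var(ddof=1)))    # True: c = (n - 1) s^2
```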

3.3 Extension to Divide & Conquer Algorithm

While maintaining the frameworks we established in parts 1 and 2, let's now additionally assume $n$ is so large that we are unable to store or load $y$ or $X$ into memory to carry out computations. As a result, we divide the data set into $K$ mutually exclusive and exhaustive subsets, each comprising a manageable number of points. Let $y_k$ denote the $q_k \times 1$ subvector of $y$ and $X_k$ the $q_k \times p$ submatrix of $X$ in subset $k$, such that $q_k > p$ and $q_k$ is small enough that we can fit the desired model on $\{y_k, X_k\}$. It is shown below that it is still possible to compute $a_1$, $b_1$, $M_1$ and $m_1$ without ever having to store or compute with $y$ or $X$, using only quantities computed from the subsets $\{y_k, X_k\}$, for $k = 1, 2, \ldots, K$.

Starting with a simple example where $K = 2$, and taking the assumptions made in parts 1 and 2 into consideration, we note that:

    • $\bar{y}_1 = \dfrac{\sum_{i=1}^{q_1} y_{1i}}{q_1}$
    • $\bar{y}_2 = \dfrac{\sum_{i=1}^{q_2} y_{2i}}{q_2}$
    • This implies that $\bar{y} = \dfrac{q_1\bar{y}_1 + q_2\bar{y}_2}{n}$, where $q_1 + q_2 = n$.
    • Extending this to $K$ subsets: $\bar{y} = \dfrac{\sum_{k=1}^{K} q_k\bar{y}_k}{n}$, where $\sum_{k=1}^{K} q_k = n$.
Similarly,
    • $s_1^2 = \dfrac{\sum_{i=1}^{q_1}(y_{1i} - \bar{y}_1)^2}{q_1 - 1}$
    • $s_2^2 = \dfrac{\sum_{i=1}^{q_2}(y_{2i} - \bar{y}_2)^2}{q_2 - 1}$
    • Decomposing the total sum of squares into within-subset and between-subset pieces implies that $(n-1)s^2 = (q_1 - 1)s_1^2 + (q_2 - 1)s_2^2 + q_1(\bar{y}_1 - \bar{y})^2 + q_2(\bar{y}_2 - \bar{y})^2$, where $q_1 + q_2 = n$.
    • Extending this to $K$ subsets: $s^2 = \dfrac{\sum_{k=1}^{K}\left[(q_k - 1)s_k^2 + q_k(\bar{y}_k - \bar{y})^2\right]}{n - 1}$, where $\sum_{k=1}^{K} q_k = n$.

Note that the between-subset terms $q_k(\bar{y}_k - \bar{y})^2$ are needed because each $s_k^2$ is centered at its own subset mean $\bar{y}_k$ rather than at $\bar{y}$; a numerical check of both pooling identities is sketched below.
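Here is a small sketch verifying the pooling identities against the full data (the splitting and values are illustrative only):

```python
import numpy as np

def pool(qs, ybars, s2s):
    """Combine subset sizes, means, and variances into (n, ybar, s2)."""
    n = sum(qs)
    ybar = sum(q * yb for q, yb in zip(qs, ybars)) / n
    ss_within = sum((q - 1) * v for q, v in zip(qs, s2s))
    ss_between = sum(q * (yb - ybar) ** 2 for q, yb in zip(qs, ybars))
    return n, ybar, (ss_within + ss_between) / (n - 1)

rng = np.random.default_rng(3)
y = rng.normal(size=101)
parts = np.array_split(y, 3)                 # K = 3 subsets

n, ybar, s2 = pool([len(p) for p in parts],
                   [p.mean() for p in parts],
                   [p.var(ddof=1) for p in parts])

print(np.isclose(ybar, y.mean()), np.isclose(s2, y.var(ddof=1)))  # True True
```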
Applying what we already derived in the previous parts, we can see that for each subset $k = 1, 2$:
    • $\mu_{\beta,k}^* = \bar{y}_k$
    • $a_1^{(k)} = \dfrac{q_k - 1}{2}$
    • $b_1^{(k)} = \dfrac{(q_k - 1)s_k^2}{2}$
    • $c^{(k)} = (q_k - 1)s_k^2$
We also know that the posterior parameters for the full data set are:
    • $\mu_\beta^* = \bar{y}$
    • $a_1 = \dfrac{n-1}{2}$
    • $b_1 = \dfrac{(n-1)s^2}{2}$
    • $c = y^Ty - m_1^TM_1m_1 = y^T(I - P_x)y = (n-1)s^2$
We can see that the posterior parameters for the full data set can be expressed as functions of quantities computed only from the subsets ($k = 1, 2$):
    • $\mu_\beta^* = \bar{y} = \dfrac{q_1\bar{y}_1 + q_2\bar{y}_2}{n}$
    • $a_1 = \dfrac{q_1 + q_2 - 1}{2}$
    • $b_1 = \dfrac{(q_1 - 1)s_1^2 + (q_2 - 1)s_2^2 + q_1(\bar{y}_1 - \bar{y})^2 + q_2(\bar{y}_2 - \bar{y})^2}{2}$
    • $c = (q_1 - 1)s_1^2 + (q_2 - 1)s_2^2 + q_1(\bar{y}_1 - \bar{y})^2 + q_2(\bar{y}_2 - \bar{y})^2$

which shows that it is indeed possible to compute $a_1$, $b_1$, $M_1$ and $m_1$ without ever storing or computing with $y$ or $X$: every term above, including the between-subset corrections, is a function of the subset summaries $(q_k, \bar{y}_k, s_k^2)$ alone (here for $K = 2$). An end-to-end numerical sketch is given below.
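Finally, an end-to-end sketch under the flat prior for the intercept-only model: the posterior parameters computed from $K = 4$ subset summaries match those computed from the full vector $y$ (all data and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.normal(loc=2.0, scale=1.5, size=1000)
parts = np.array_split(y, 4)                       # K = 4 subsets

# Per-subset summaries (q_k, ybar_k, s2_k) are all we ever store
qs = [len(p) for p in parts]
ybars = [p.mean() for p in parts]
s2s = [p.var(ddof=1) for p in parts]

# Pool the summaries: within- plus between-subset sums of squares
n = sum(qs)
ybar = sum(q * yb for q, yb in zip(qs, ybars)) / n
s2 = (sum((q - 1) * v for q, v in zip(qs, s2s))
      + sum(q * (yb - ybar) ** 2 for q, yb in zip(qs, ybars))) / (n - 1)

# Flat-prior posterior for the intercept-only model, from summaries only
mu_star, a1, b1 = ybar, (n - 1) / 2, (n - 1) * s2 / 2

# Same quantities computed directly from the full data vector
print(np.isclose(mu_star, y.mean()),
      np.isclose(b1, (len(y) - 1) * y.var(ddof=1) / 2))   # True True
```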