Chapter 3 Posteriors Using Sufficient Statistics

Recall the following Bayesian (hierarchical) linear regression model from part 1: $NIG(M_0, m_0, a_0, b_0) \times N(y \mid X\beta, \sigma^2 I_n)$, where $y = (y_1, y_2, \ldots, y_n)^{\mathrm{\scriptscriptstyle T}}$ is an $n \times 1$ vector of outcomes and $X$ is a fixed $n \times p$ matrix of predictors with full column rank.

Suppose that, due to privacy reasons, instead of $y = (y_1, y_2, \ldots, y_n)^{\mathrm{\scriptscriptstyle T}}$ we are given data that consists only of the sufficient statistics $s^2$ and $\bar{y}$. Further suppose that we are working with the following simpler model: $y = 1_n\beta_0 + \epsilon$.

How do we obtain $p(\beta_0, \sigma^2 \mid y)$ using this data set?

3.1 Methods

We can use the factorization theorem to factor the normal likelihood into a function of the sufficient statistics $s^2$ and $\bar{y}$. Since $\sum_{i=1}^{n}(y_i - \bar{y}) = 0$, the cross term in the third line below vanishes:

$$
\begin{aligned}
f(y \mid \beta_0, \sigma^2) &= \prod_{i=1}^{n}(2\pi\sigma^2)^{-\frac{1}{2}}\exp\left\{-\frac{1}{2\sigma^2}(y_i - \beta_0)^2\right\} \\
&= (2\pi\sigma^2)^{-\frac{n}{2}}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \beta_0)^2\right\} \\
&= (2\pi\sigma^2)^{-\frac{n}{2}}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \bar{y} + \bar{y} - \beta_0)^2\right\} \\
&= (2\pi\sigma^2)^{-\frac{n}{2}}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}\left[(y_i - \bar{y})^2 + (\bar{y} - \beta_0)^2\right]\right\} \\
&= (2\pi\sigma^2)^{-\frac{n}{2}}\exp\left\{-\frac{1}{2\sigma^2}\left[(n-1)s^2 + n(\bar{y} - \beta_0)^2\right]\right\}
\end{aligned}
$$
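As a quick numerical check of this factorization, the following sketch (assuming NumPy is available; the simulated data and variable names are purely illustrative) evaluates the log-likelihood from the raw data and from $(\bar{y}, s^2)$ alone and confirms the two agree:

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta0, sigma2 = 50, 2.0, 1.5
y = rng.normal(beta0, np.sqrt(sigma2), size=n)

# Log-likelihood computed from the raw data
loglik_raw = -0.5 * n * np.log(2 * np.pi * sigma2) - np.sum((y - beta0) ** 2) / (2 * sigma2)

# Log-likelihood computed from the sufficient statistics (ybar, s2) only
ybar, s2 = y.mean(), y.var(ddof=1)
loglik_suff = (-0.5 * n * np.log(2 * np.pi * sigma2)
               - ((n - 1) * s2 + n * (ybar - beta0) ** 2) / (2 * sigma2))

print(np.isclose(loglik_raw, loglik_suff))  # expect True
```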

We have now successfully expressed our likelihood as a function of the sufficient statistics $s^2$ and $\bar{y}$, and we can use this factorized likelihood along with Bayes' Rule to compute $p(\beta_0, \sigma^2 \mid y)$.

$$
\begin{aligned}
p(\beta_0, \sigma^2 \mid y) &\propto p(\beta_0 \mid \mu_\beta, \sigma^2) \times p(\sigma^2 \mid a_0, b_0) \times p(y \mid \beta_0, \sigma^2) \\
&= N(\beta_0 \mid \mu_\beta, \sigma^2) \times IG(\sigma^2 \mid a_0, b_0) \times N(y \mid 1_n\beta_0, \sigma^2 I_n) \\
&= (2\pi\sigma^2)^{-\frac{1}{2}}\exp\left\{-\frac{1}{2\sigma^2}(\beta_0 - \mu_\beta)^2\right\} \times \frac{b_0^{a_0}}{\Gamma(a_0)}(\sigma^2)^{-a_0-1}\exp\left\{-\frac{b_0}{\sigma^2}\right\} \times (2\pi\sigma^2)^{-\frac{n}{2}}\exp\left\{-\frac{1}{2\sigma^2}\left[(n-1)s^2 + n(\bar{y} - \beta_0)^2\right]\right\} \\
&\propto (\sigma^2)^{-a_0-1} \times (\sigma^2)^{-\frac{n+1}{2}} \times \exp\left\{-\frac{b_0}{\sigma^2}\right\} \times \exp\left\{-\frac{1}{2\sigma^2}(\beta_0 - \mu_\beta)^2\right\} \times \exp\left\{-\frac{1}{2\sigma^2}\left[(n-1)s^2 + n(\bar{y} - \beta_0)^2\right]\right\} \\
&= (\sigma^2)^{-a_0-1} \times (\sigma^2)^{-\frac{n+1}{2}}\exp\left\{-\frac{b_0}{\sigma^2}\right\} \times \exp\left\{-\frac{1}{2\sigma^2}\left[(\beta_0 - \mu_\beta)^2 + (n-1)s^2 + n(\bar{y} - \beta_0)^2\right]\right\}
\end{aligned}
$$

Complete the square on the following term:

$$
\begin{aligned}
(\beta_0 - \mu_\beta)^2 + (n-1)s^2 + n(\bar{y} - \beta_0)^2 &= (n+1)\beta_0^2 - 2(\mu_\beta + n\bar{y})\beta_0 + n\bar{y}^2 + \mu_\beta^2 + (n-1)s^2 \\
&= (n+1)\left(\beta_0 - \frac{\mu_\beta + n\bar{y}}{n+1}\right)^2 + n\bar{y}^2 + \mu_\beta^2 + (n-1)s^2 - \frac{(\mu_\beta + n\bar{y})^2}{n+1} \\
&= (n+1)\left(\beta_0 - \frac{\mu_\beta + n\bar{y}}{n+1}\right)^2 + y^{\mathrm{\scriptscriptstyle T}}\frac{1_n1_n^{\mathrm{\scriptscriptstyle T}}}{n}y + y^{\mathrm{\scriptscriptstyle T}}\left(I - \frac{1_n1_n^{\mathrm{\scriptscriptstyle T}}}{n}\right)y + \mu_\beta^2 - \frac{(\mu_\beta + n\bar{y})^2}{n+1} \\
&= (n+1)\left(\beta_0 - \frac{\mu_\beta + n\bar{y}}{n+1}\right)^2 + y^{\mathrm{\scriptscriptstyle T}}y + \mu_\beta^2 - \frac{(\mu_\beta + n\bar{y})^2}{n+1}
\end{aligned}
$$
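Before substituting back in, the completed square can be verified symbolically. The sketch below (assuming SymPy is available; it uses the abbreviations $\mu_\beta^{*}$ and $c$ introduced in the list that follows, with $y^{\mathrm{\scriptscriptstyle T}}y$ written as $(n-1)s^2 + n\bar{y}^2$) checks that the two sides of the identity agree:

```python
import sympy as sp

b0, mu, ybar, s2, n = sp.symbols('beta_0 mu_beta ybar s2 n', positive=True)

lhs = (b0 - mu) ** 2 + (n - 1) * s2 + n * (ybar - b0) ** 2

mu_star = (n * ybar + mu) / (n + 1)                 # completed-square center
yty = (n - 1) * s2 + n * ybar ** 2                  # y^T y via the sufficient statistics
c = yty + mu ** 2 - (n * ybar + mu) ** 2 / (n + 1)  # leftover constant
rhs = (n + 1) * (b0 - mu_star) ** 2 + c

print(sp.simplify(sp.expand(lhs - rhs)))            # expect 0
```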

We now let:
    • $\mu_\beta^{*} = \frac{n\bar{y} + \mu_\beta}{n+1}$
    • $c = y^{\mathrm{\scriptscriptstyle T}}y + \mu_\beta^2 - \frac{(n\bar{y} + \mu_\beta)^2}{n+1}$,

and substitute these back into the expression above.

$$
\begin{aligned}
&= (\sigma^2)^{-a_0-1} \times (\sigma^2)^{-\frac{n+1}{2}} \times \exp\left\{-\frac{b_0}{\sigma^2}\right\} \times \exp\left\{-\frac{n+1}{2\sigma^2}(\beta_0 - \mu_\beta^{*})^2\right\} \times \exp\left\{-\frac{c}{2\sigma^2}\right\} \\
&= (\sigma^2)^{-a_0-1} \times (\sigma^2)^{-\frac{n+1}{2}} \times \exp\left\{-\frac{b_0}{\sigma^2}\right\} \times \exp\left\{-\frac{1}{2\sigma^{*2}}(\beta_0 - \mu_\beta^{*})^2\right\} \times \exp\left\{-\frac{c}{2\sigma^2}\right\} \\
&= (\sigma^2)^{-a_0-1-\frac{n}{2}} \times \exp\left\{-\frac{b_0 + \frac{c}{2}}{\sigma^2}\right\} \times (\sigma^2)^{-\frac{1}{2}} \times \exp\left\{-\frac{1}{2\sigma^{*2}}(\beta_0 - \mu_\beta^{*})^2\right\} \\
&= \underbrace{(\sigma^2)^{-a_1-1} \times \exp\left\{-\frac{b_1}{\sigma^2}\right\}}_{\text{kernel of } IG(a_1, b_1)} \times \underbrace{(\sigma^2)^{-\frac{1}{2}} \times \exp\left\{-\frac{1}{2\sigma^{*2}}(\beta_0 - \mu_\beta^{*})^2\right\}}_{\text{kernel of } N(\mu_\beta^{*}, \sigma^{*2})}
\end{aligned}
$$

Where:
    • $\mu_\beta^{*} = \frac{n\bar{y} + \mu_\beta}{n+1}$
    • $\sigma^{*2} = \frac{\sigma^2}{n+1}$
    • $a_1 = a_0 + \frac{n}{2}$
    • $b_1 = b_0 + \frac{c}{2}$
    • $c = y^{\mathrm{\scriptscriptstyle T}}y + \mu_\beta^2 - \frac{(n\bar{y} + \mu_\beta)^2}{n+1}$

Note that $y^{\mathrm{\scriptscriptstyle T}}y = (n-1)s^2 + n\bar{y}^2$, so $c$, and hence the posterior, depends on the data only through $\bar{y}$ and $s^2$.
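A minimal sketch of this update as code (assuming NumPy; the function and argument names are ours, not from the notes): given only $(n, \bar{y}, s^2)$ and the prior hyperparameters, it returns the posterior parameters and draws one sample from $p(\beta_0, \sigma^2 \mid y)$.

```python
import numpy as np

def nig_posterior_from_suffstats(n, ybar, s2, mu_beta, a0, b0):
    """Posterior parameters for y = 1_n beta_0 + eps using only (n, ybar, s2).

    Prior: beta_0 | sigma^2 ~ N(mu_beta, sigma^2), sigma^2 ~ IG(a0, b0).
    """
    mu_star = (n * ybar + mu_beta) / (n + 1)      # posterior mean of beta_0
    yty = (n - 1) * s2 + n * ybar ** 2            # y^T y recovered from (ybar, s2)
    c = yty + mu_beta ** 2 - (n * ybar + mu_beta) ** 2 / (n + 1)
    a1 = a0 + n / 2
    b1 = b0 + c / 2
    # beta_0 | sigma^2, y ~ N(mu_star, sigma^2 / (n + 1)), sigma^2 | y ~ IG(a1, b1)
    return mu_star, a1, b1

# Illustrative example: one draw from p(beta_0, sigma^2 | ybar, s2)
rng = np.random.default_rng(1)
n = 30
mu_star, a1, b1 = nig_posterior_from_suffstats(n, ybar=1.8, s2=2.1, mu_beta=0.0, a0=2.0, b0=1.0)
sigma2_draw = 1.0 / rng.gamma(shape=a1, scale=1.0 / b1)   # sigma^2 ~ IG(a1, b1)
beta0_draw = rng.normal(mu_star, np.sqrt(sigma2_draw / (n + 1)))
print(mu_star, a1, b1, sigma2_draw, beta0_draw)
```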

3.2 Posterior From Improper Priors

Recall that when we work with this model in a general regression setting, where $X$ is an $n \times p$ matrix of predictors with full column rank, the posterior density is:

$$p(\beta, \sigma^2 \mid y) = IG(\sigma^2 \mid a_1, b_1) \times N(\beta \mid M_1m_1, \sigma^2 M_1)$$

We end up with:
    • $M_1 = (X^{\mathrm{\scriptscriptstyle T}}X + M_0^{-1})^{-1}$
    • $m_1 = X^{\mathrm{\scriptscriptstyle T}}y + M_0^{-1}m_0$
    • $a_1 = a_0 + \frac{n}{2}$
    • $b_1 = b_0 + \frac{c}{2}$
    • $c = y^{\mathrm{\scriptscriptstyle T}}y + m_0^{\mathrm{\scriptscriptstyle T}}M_0^{-1}m_0 - m_1^{\mathrm{\scriptscriptstyle T}}M_1m_1$

where $m_0$ and $M_0$ are the prior mean and scale of $\beta$, i.e. $\beta \mid \sigma^2 \sim N(m_0, \sigma^2 M_0)$, and the posterior mean of $\beta$ is $M_1m_1$.
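For reference, a hedged sketch of this general update as code (assuming NumPy; the function name `nig_update` and its signature are ours, not from the notes):

```python
import numpy as np

def nig_update(X, y, m0, M0_inv, a0, b0):
    """Conjugate NIG update for y = X beta + eps; returns (M1, m1, a1, b1, c).

    Posterior: sigma^2 | y ~ IG(a1, b1) and beta | sigma^2, y ~ N(M1 m1, sigma^2 M1).
    """
    n, p = X.shape
    M1 = np.linalg.inv(M0_inv + X.T @ X)      # (X^T X + M0^{-1})^{-1}
    m1 = X.T @ y + M0_inv @ m0                # X^T y + M0^{-1} m0
    a1 = a0 + n / 2
    c = y @ y + m0 @ M0_inv @ m0 - m1 @ M1 @ m1
    b1 = b0 + c / 2
    return M1, m1, a1, b1, c
```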

If we take $M_0^{-1} \to 0$ (i.e. the null matrix), $a_0 \to -\frac{p}{2}$, and $b_0 \to 0$, we can see that this leads to the improper prior $p(\beta, \sigma^2) \propto \frac{1}{\sigma^2}$:

$$p(\beta, \sigma^2) = IG(\sigma^2 \mid a_0, b_0) \times N(\beta \mid m_0, \sigma^2 M_0) = \frac{b_0^{a_0}}{\Gamma(a_0)}(\sigma^2)^{-a_0-1}\exp\left\{-\frac{b_0}{\sigma^2}\right\} \times (2\pi)^{-p/2}\det(\sigma^2 M_0)^{-1/2}\exp\left\{-\frac{1}{2\sigma^2}(\beta - m_0)^{\mathrm{\scriptscriptstyle T}}M_0^{-1}(\beta - m_0)\right\}$$

In this limit the exponential terms tend to $1$ and the remaining powers of $\sigma^2$ collapse to $(\sigma^2)^{\frac{p}{2}-1} \times (\sigma^2)^{-\frac{p}{2}} = (\sigma^2)^{-1}$, which gives the improper prior above and ultimately yields a posterior distribution with:

    • $M_1 = (X^{\mathrm{\scriptscriptstyle T}}X)^{-1}$ and $m_1 = X^{\mathrm{\scriptscriptstyle T}}y$, so the posterior mean is $M_1m_1 = (X^{\mathrm{\scriptscriptstyle T}}X)^{-1}X^{\mathrm{\scriptscriptstyle T}}y$
    • $a_1 = \frac{n-p}{2}$
    • $b_1 = \frac{c^{*}}{2} = \frac{(n-p)s^2}{2}$
    • $c^{*} = y^{\mathrm{\scriptscriptstyle T}}y - m_1^{\mathrm{\scriptscriptstyle T}}M_1m_1 = y^{\mathrm{\scriptscriptstyle T}}(I - P_x)y = (n-p)s^2$

where $P_x = X(X^{\mathrm{\scriptscriptstyle T}}X)^{-1}X^{\mathrm{\scriptscriptstyle T}}$ and $s^2 = y^{\mathrm{\scriptscriptstyle T}}(I - P_x)y/(n-p)$ is the usual residual variance estimate.
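Reusing the `nig_update` sketch from the previous subsection, a quick numerical check (with illustrative simulated data) that the limits $M_0^{-1} \to 0$, $a_0 = -\frac{p}{2}$, $b_0 = 0$ reproduce $m_1 = X^{\mathrm{\scriptscriptstyle T}}y$, $a_1 = \frac{n-p}{2}$ and $b_1 = \frac{(n-p)s^2}{2}$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 0.5, -0.25]) + rng.normal(scale=0.8, size=n)

# Improper-prior limit: M0^{-1} = 0, a0 = -p/2, b0 = 0
M1, m1, a1, b1, c = nig_update(X, y, m0=np.zeros(p), M0_inv=np.zeros((p, p)),
                               a0=-p / 2, b0=0.0)

resid = y - X @ np.linalg.solve(X.T @ X, X.T @ y)   # (I - P_x) y
print(np.allclose(m1, X.T @ y),                     # m_1 = X^T y
      a1 == (n - p) / 2,                            # a_1 = (n - p)/2
      np.isclose(2 * b1, resid @ resid))            # 2 b_1 = (n - p) s^2
```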

If we apply this framework to the simpler model presented at the beginning of the section, $y = 1_n\beta_0 + \epsilon$, where $X$ is a vector of 1's instead of an $n \times p$ matrix, the posterior parameters should be:
    • $\mu_\beta^{*} = \bar{y}$
    • $a_1 = \frac{n-1}{2}$
    • $b_1 = \frac{(n-1)s^2}{2}$
    • $c = y^{\mathrm{\scriptscriptstyle T}}y - m_1^{\mathrm{\scriptscriptstyle T}}M_1m_1 = y^{\mathrm{\scriptscriptstyle T}}(I - P_x)y = (n-1)s^2$

3.3 Extension to Divide & Conquer Algorithm

While maintaining the frameworks we established in parts 1 and 2, let's now additionally assume $n$ is so large that we are unable to store or load $y$ or $X$ into memory to carry out computations. As a result, we divide the data set into $K$ mutually exclusive and exhaustive subsets, each comprising a manageable number of points. Let $y_k$ denote the $q_k \times 1$ subvector of $y$ and $X_k$ the $q_k \times p$ submatrix of $X$ in subset $k$, such that $q_k > p$ and is small enough that we can fit the desired model on $\{y_k, X_k\}$. It is shown below that it is still possible to compute $a_1$, $b_1$, $M_1$ and $m_1$ without ever having to store or compute with $y$ or $X$, using only quantities computed from the subsets $\{y_k, X_k\}$, for $k = 1, 2, \ldots, K$.

Starting with a simple example where $K = 2$, and keeping the assumptions made in parts 1 and 2, we note that:

    • $\bar{y}_1 = \frac{\sum_{i=1}^{q_1}y_{1i}}{q_1}$
    • $\bar{y}_2 = \frac{\sum_{i=1}^{q_2}y_{2i}}{q_2}$
    • This implies that $\bar{y} = \frac{q_1\bar{y}_1 + q_2\bar{y}_2}{n}$, where $q_1 + q_2 = n$
    • Extending this to $K$ subsets: $\bar{y} = \frac{\sum_{k=1}^{K}q_k\bar{y}_k}{n}$, where $\sum_{k=1}^{K}q_k = n$
Similarly,
    • $s_1^2 = \frac{\sum_{i=1}^{q_1}(y_{1i}-\bar{y}_1)^2}{q_1 - 1}$
    • $s_2^2 = \frac{\sum_{i=1}^{q_2}(y_{2i}-\bar{y}_2)^2}{q_2 - 1}$
    • Writing $\sum_{i=1}^{q_k}(y_{ki}-\bar{y})^2 = (q_k - 1)s_k^2 + q_k(\bar{y}_k - \bar{y})^2$ for each subset, this implies that $(n-1)s^2 = (q_1-1)s_1^2 + q_1(\bar{y}_1 - \bar{y})^2 + (q_2-1)s_2^2 + q_2(\bar{y}_2 - \bar{y})^2$, where $q_1 + q_2 = n$
    • Extending this to $K$ subsets: $(n-1)s^2 = \sum_{k=1}^{K}\left[(q_k-1)s_k^2 + q_k(\bar{y}_k - \bar{y})^2\right]$, where $\sum_{k=1}^{K}q_k = n$; a numerical check of these pooling rules is sketched below.
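The following sketch (assuming NumPy; names are illustrative) computes $(\bar{y}_k, s_k^2)$ on each subset, recombines them with the rules above, and checks the result against the statistics computed on the full vector $y$:

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(1.0, 2.0, size=101)
subsets = np.array_split(y, 2)              # K = 2 mutually exclusive, exhaustive subsets

q = np.array([len(s) for s in subsets])
ybar_k = np.array([s.mean() for s in subsets])
s2_k = np.array([s.var(ddof=1) for s in subsets])

n = q.sum()
ybar = (q * ybar_k).sum() / n                             # pooled mean
ss_within = ((q - 1) * s2_k).sum()                        # sum of (q_k - 1) s_k^2
ss_between = (q * (ybar_k - ybar) ** 2).sum()             # between-subset correction
s2 = (ss_within + ss_between) / (n - 1)                   # pooled sample variance

print(np.isclose(ybar, y.mean()), np.isclose(s2, y.var(ddof=1)))  # expect True True
```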
Applying what we already derived in the previous parts, we can see that for each subset $k = 1, 2$:
    • $\mu_{k\beta}^{*} = \bar{y}_k$
    • $a_{k1} = \frac{q_k - 1}{2}$
    • $b_{k1} = \frac{(q_k - 1)s_k^2}{2}$
    • $c_{k1} = (q_k - 1)s_k^2$
We also know that the posterior parameters for the full data set are:
    • $\mu_\beta^{*} = \bar{y}$
    • $a_1 = \frac{n-1}{2}$
    • $b_1 = \frac{(n-1)s^2}{2}$
    • $c = y^{\mathrm{\scriptscriptstyle T}}y - m_1^{\mathrm{\scriptscriptstyle T}}M_1m_1 = y^{\mathrm{\scriptscriptstyle T}}(I - P_x)y = (n-1)s^2$
We can see that the posterior parameters for the full data set can be expressed as functions of the subset sizes $q_k$ and the subset-level posterior quantities ($k = 1, 2$):
    • $\mu_\beta^{*} = \bar{y} = \frac{q_1\bar{y}_1 + q_2\bar{y}_2}{n} = \frac{q_1\mu_{1\beta}^{*} + q_2\mu_{2\beta}^{*}}{n}$
    • $a_1 = \frac{q_1 + q_2 - 1}{2}$
    • $b_1 = b_{11} + b_{21} + \frac{q_1(\bar{y}_1 - \bar{y})^2 + q_2(\bar{y}_2 - \bar{y})^2}{2}$
    • $c = c_{11} + c_{21} + q_1(\bar{y}_1 - \bar{y})^2 + q_2(\bar{y}_2 - \bar{y})^2$

which shows that it is indeed possible to compute $a_1$, $b_1$, $M_1$ and $m_1$ without ever having to store or compute with $y$ or $X$, using only quantities computed from the subsets (for $k = 1, 2$).
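An end-to-end sketch of this divide-and-conquer combination for the intercept-only model with the improper prior (assuming NumPy; all function names are ours): each subset contributes only $(q_k, \bar{y}_k, s_k^2)$, and the full-data posterior parameters are recovered without ever touching $y$ as a whole.

```python
import numpy as np

def subset_summaries(y_k):
    """Per-subset sufficient statistics: (q_k, ybar_k, s2_k)."""
    return len(y_k), y_k.mean(), y_k.var(ddof=1)

def combine_posterior(summaries):
    """Full-data posterior parameters (mu_star, a1, b1) from subset summaries only."""
    q = np.array([s[0] for s in summaries])
    ybar_k = np.array([s[1] for s in summaries])
    s2_k = np.array([s[2] for s in summaries])
    n = q.sum()
    ybar = (q * ybar_k).sum() / n
    c = ((q - 1) * s2_k).sum() + (q * (ybar_k - ybar) ** 2).sum()   # equals (n - 1) s^2
    return ybar, (n - 1) / 2, c / 2                                 # mu_star, a1, b1

rng = np.random.default_rng(4)
y = rng.normal(3.0, 1.5, size=200)
summaries = [subset_summaries(y_k) for y_k in np.array_split(y, 4)]  # K = 4 subsets

mu_star, a1, b1 = combine_posterior(summaries)

# Check against the full-data answers: ybar, (n - 1)/2, (n - 1) s^2 / 2
n = len(y)
print(np.isclose(mu_star, y.mean()),
      a1 == (n - 1) / 2,
      np.isclose(2 * b1, (n - 1) * y.var(ddof=1)))
```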