Chapter 3 Posteriors Using Sufficient Statistics

Recall the following Bayesian (hierarchical) linear regression model from part 1: \[p(\beta, \sigma^2, y) = NIG(\beta, \sigma^2 \mid m_0, M_0, a_0, b_0) \times N(y \mid X\beta, \sigma^2 I_n),\] where \(y = (y_1, y_2, \ldots, y_n)^{\mathrm{\scriptscriptstyle T}}\) is an \(n \times 1\) vector of outcomes and \(X\) is a fixed \(n \times p\) matrix of predictors with full column rank.

Suppose that for privacy reasons, instead of \(y = (y_1, y_2, \ldots, y_n)^{\mathrm{\scriptscriptstyle T}}\) we are given only the sample size \(n\) and the sufficient statistics \(\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i\) and \(s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2\). Further suppose that we are working with the following simpler, intercept-only model: \[y = 1_n \beta_0 + \epsilon, \qquad \epsilon \sim N(0, \sigma^2 I_n).\]

How do we obtain \(p(\beta_0, \sigma^2 \mid y)\) from these summaries alone? Because \((\bar{y}, s^2)\) is sufficient for \((\beta_0, \sigma^2)\), the posterior given the summaries coincides with the posterior given the full data.

3.1 Methods

We can use the factorization theorem to factor the normal distribution’s pdf into a function of the sufficient statistics \(s^2\) and \(\bar{y}\). \[\begin{align*} f(y \mid \beta_0, \sigma^2) &= \prod_{i=1}^{n} (2\pi\sigma^2)^{-\frac{1}{2}} \exp\{-\frac{1}{2\sigma^2}(y_i-\beta_0)^2\}\\ &= (2\pi\sigma^2)^{-\frac{n}{2}} \exp\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \beta_0)^2\} \\ &= (2\pi\sigma^2)^{-\frac{n}{2}} \exp\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \bar{y} + \bar{y} - \beta_0)^2\} \\ &= (2\pi\sigma^2)^{-\frac{n}{2}} \exp\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}[(y_i - \bar{y})^2 + (\bar{y} - \beta_0)^2]\} \\ &= (2\pi\sigma^2)^{-\frac{n}{2}} \exp\{-\frac{1}{2\sigma^2}[(n-1)s^2 + n(\bar{y} - \beta_0)^2]\} \end{align*}\] The cross term \(2(\bar{y} - \beta_0)\sum_{i=1}^{n}(y_i - \bar{y})\) drops out in the fourth line because \(\sum_{i=1}^{n}(y_i - \bar{y}) = 0\).
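This identity is easy to check numerically. A minimal Python sketch (numpy assumed; the simulated data and trial value are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.5, size=50)  # simulated outcomes
beta0 = 1.3                                  # arbitrary trial value of the intercept

n = y.size
ybar = y.mean()
s2 = y.var(ddof=1)  # sample variance with the (n - 1) denominator

lhs = np.sum((y - beta0) ** 2)
rhs = (n - 1) * s2 + n * (ybar - beta0) ** 2
print(np.isclose(lhs, rhs))  # True: the likelihood depends on y only through (ybar, s2)
```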

We have now successfully expressed our likelihood as a function of the sufficient statistics \(s^2\) and \(\bar{y}\) and can use this factorized likelihood along with Bayes’ Rule to compute \(p(\beta_0, \sigma^2 \mid y)\).

\[\begin{align*} p(\beta_0, \sigma^2 \mid y) &\propto p(\beta_0 \mid \mu_\beta, \sigma^2) \times p(\sigma^2 \mid a_0, b_0) \times p(y \mid \beta_0, \sigma^2) \\ &= N(\beta_0 \mid \mu_\beta, \sigma^2) \times IG(\sigma^2 \mid a_0,b_0) \times N(y \mid 1_n\beta_0, \sigma^2 I_n) \\ &= (2\pi\sigma^2)^{-\frac{1}{2}}\exp\{-\frac{1}{2\sigma^2}(\beta_0 - \mu_\beta)^2\} \times \frac{b_0^{a_0}}{\Gamma(a_0)}(\sigma^2)^{-a_0-1}\exp\{-\frac{b_0}{\sigma^2}\} \\ &\times (2\pi\sigma^2)^{-\frac{n}{2}}\exp\{-\frac{1}{2\sigma^2}[(n-1)s^2 + n(\bar{y} - \beta_0)^2]\} \\ &\propto (\sigma^2)^{-a_0-1} \times (\sigma^2)^{-\frac{n+1}{2}} \times \exp\{-\frac{b_0}{\sigma^2}\} \times \exp\{-\frac{1}{2\sigma^2}(\beta_0 - \mu_\beta)^2\} \\ &\times \exp\{-\frac{1}{2\sigma^2}[(n-1)s^2 + n(\bar{y} - \beta_0)^2]\} \\ &= (\sigma^2)^{-a_0-1} \times (\sigma^2)^{-\frac{n+1}{2}} \exp\{-\frac{b_0}{\sigma^2}\} \times \exp\{-\frac{1}{2\sigma^2}[(\beta_0 - \mu_{\beta})^2 +(n-1)s^2 + n(\bar{y}-\beta_0)^2]\} \end{align*}\]

Complete the square on the following term: \[\begin{align*} &(\beta_0 - \mu_{\beta})^2 +(n-1)s^2 + n(\bar{y}-\beta_0)^2 \\ &= (n+1)\beta_0^2 - 2(\mu_{\beta} + n\bar{y})\beta_0 + n\bar{y}^2 + \mu_\beta^2 + (n-1)s^2 \\ &= (n+1)(\beta_0 - \frac{\mu_{\beta} + n\bar{y}}{n+1})^2 + n\bar{y}^2 + \mu_\beta^2 + (n-1)s^2 - \frac{(\mu_\beta + n\bar{y})^2}{n+1}\\ &= (n+1)(\beta_0 - \frac{\mu_{\beta} + n\bar{y}}{n+1})^2 + y^{\mathrm{\scriptscriptstyle T}}\Big(\frac{1_n1_n^{\mathrm{\scriptscriptstyle T}}}{n}\Big)y + y^{\mathrm{\scriptscriptstyle T}}\Big(I - \frac{1_n1_n^{\mathrm{\scriptscriptstyle T}}}{n}\Big)y + \mu_\beta^2 - \frac{(\mu_\beta + n\bar{y})^2}{n+1} \\ &= (n+1)(\beta_0 - \frac{\mu_{\beta} + n\bar{y}}{n+1})^2 + y^{\mathrm{\scriptscriptstyle T}}y + \mu_\beta^2 - \frac{(\mu_\beta + n\bar{y})^2}{n+1} \end{align*}\]

We now let:
    • \(\mu_\beta^{*} = \frac{n\bar{y} + \mu_\beta}{n+1}\)
    • \(c = y^{\mathrm{\scriptscriptstyle T}}y + \mu_\beta^2 - \frac{(n\bar{y} + \mu_\beta)^2}{n+1}\), where \(y^{\mathrm{\scriptscriptstyle T}}y = (n-1)s^2 + n\bar{y}^2\) is itself computable from the sufficient statistics,

and substitute this expression back into the posterior:

\[\begin{align*} &=(\sigma^2)^{-a_0-1} \times (\sigma^2)^{-\frac{n+1}{2}} \times \exp\{-\frac{b_0}{\sigma^2}\} \times \exp\{-\frac{1}{2\sigma^2}(n+1)(\beta_0 - \mu_\beta^{*})^2\} \times \exp\{-\frac{c}{2\sigma^2}\} \\ &= (\sigma^2)^{-a_0-1} \times (\sigma^2)^{-\frac{n+1}{2}} \times \exp\{-\frac{b_0}{\sigma^2}\} \times \exp\{-\frac{1}{2\sigma^{2\ast}}(\beta_0 - \mu_\beta^{*})^2\} \times \exp\{-\frac{c}{2\sigma^2}\} \\ &= (\sigma^2)^{-a_0-1-\frac{n}{2}} \times \exp\{-\frac{(b_0 + \frac{c}{2})}{\sigma^2}\} \times (\sigma^2)^{-\frac{1}{2}} \times \exp\{-\frac{1}{2\sigma^{2\ast}}(\beta_0 - \mu_\beta^{*})^2\} \\ &= \underbrace{(\sigma^2)^{-a_1-1} \times \exp\{-\frac{b_1}{\sigma^2}\}}_\text{kernel of $IG(a_1, b_1)$} \times \underbrace{(\sigma^2)^{-\frac{1}{2}} \times \exp\{-\frac{1}{2\sigma^{2\ast}}(\beta_0 - \mu_\beta^{*})^2\}}_\text{kernel of $N(\mu_\beta^{\ast}, \sigma^{2\ast})$} \end{align*}\]

Where:
    • \(\mu_\beta^{*} = \frac{n\bar{y} + \mu_\beta}{n+1}\)
    • \(\sigma^{2*} = \frac{\sigma^2}{n+1}\)
    • \(a_1 = a_0 + \frac{n}{2}\)
    • \(b_1 = b_0 + \frac{c}{2}\)
    • \(c = y^{\mathrm{\scriptscriptstyle T}}y + \mu_\beta^2 - \frac{(n\bar{y} + \mu_\beta)^2}{n+1}\)
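These updates can be computed directly from \((n, \bar{y}, s^2)\) without ever touching the raw data. A minimal sketch (the function name and prior values are illustrative, not from the text):

```python
def posterior_from_suffstats(n, ybar, s2, mu_beta, a0, b0):
    """NIG posterior for the intercept-only model, computed from (n, ybar, s2) alone."""
    mu_star = (n * ybar + mu_beta) / (n + 1)
    yty = (n - 1) * s2 + n * ybar**2  # y'y recovered from the sufficient statistics
    c = yty + mu_beta**2 - (mu_beta + n * ybar) ** 2 / (n + 1)
    a1 = a0 + n / 2
    b1 = b0 + c / 2
    # beta0 | sigma2, y ~ N(mu_star, sigma2 / (n + 1));  sigma2 | y ~ IG(a1, b1)
    return mu_star, a1, b1

print(posterior_from_suffstats(n=50, ybar=2.1, s2=2.2, mu_beta=0.0, a0=2.0, b0=1.0))
```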

3.2 Posterior From Improper Priors

Recall that when we work with this model in a general regression setting, where \(X\) is an \(n \times p\) matrix of predictors with full column rank, the posterior density is:

\[\begin{align*} p(\beta, \sigma^2 \mid y) &= IG(\sigma^2 \mid a_1, b_1) \times N(\beta \mid M_1m_1, \sigma^2M_1) \end{align*}\] We end up with:
    • \(m_1^{*} = M_1m_1\), the posterior mean of \(\beta\)
    • \(M_1 = (X^{\mathrm{\scriptscriptstyle T}}X + M_{0}^{-1})^{-1}\)
    • \(m_1 = X^{\mathrm{\scriptscriptstyle T}}y + M_{0}^{-1}m_0\)
    • \(a_1 = a_0 + \frac{n}{2}\)
    • \(b_1 = b_0 + \frac{c}{2}\)
    • \(c = y^{\mathrm{\scriptscriptstyle T}}y + m_0^{\mathrm{\scriptscriptstyle T}}M_0^{-1}m_0 - m_1^{\mathrm{\scriptscriptstyle T}}M_1m_1\)
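These updates translate directly into code. Below is a minimal Python sketch of the general conjugate update (numpy assumed; `nig_posterior` and the toy data are illustrative):

```python
import numpy as np

def nig_posterior(X, y, m0, M0_inv, a0, b0):
    """Conjugate NIG update; returns (M1, m1, a1, b1) as defined in the list above."""
    n = y.size
    M1 = np.linalg.inv(X.T @ X + M0_inv)
    m1 = X.T @ y + M0_inv @ m0
    a1 = a0 + n / 2
    c = y @ y + m0 @ M0_inv @ m0 - m1 @ M1 @ m1
    b1 = b0 + c / 2
    return M1, m1, a1, b1

# toy usage with simulated data
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(40), rng.normal(size=40)])
y = X @ np.array([1.0, 0.5]) + rng.normal(scale=0.3, size=40)
M1, m1, a1, b1 = nig_posterior(X, y, m0=np.zeros(2), M0_inv=np.eye(2), a0=2.0, b0=1.0)
print(M1 @ m1)  # posterior mean of beta
```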

If we take \(M_0^{-1} \rightarrow 0\) (i.e., the null matrix), \(a_0 \rightarrow -\frac{p}{2}\), and \(b_0 \rightarrow 0\), we can see that these limits lead to the improper prior \(p(\beta, \sigma^2) \propto \frac{1}{\sigma^2}\):

\[\begin{align*} p(\beta, \sigma^2) &= IG(\sigma^2 \mid a_0, b_0) \times N(\beta \mid m_0, \sigma^2M_0) \\ &= \frac{b_0^{a_0}}{\Gamma(a_0)}(\sigma^2)^{-a_0-1} \exp\{-\frac{b_0}{\sigma^2}\} \times (2\pi)^{-\frac{p}{2}}\det(\sigma^2M_0)^{-\frac{1}{2}} \exp\{-\frac{1}{2}(\beta - m_0)^{\mathrm{\scriptscriptstyle T}}\frac{M_0^{-1}}{\sigma^2}(\beta - m_0)\} \\ &\propto (\sigma^2)^{-a_0-1}\exp\{-\frac{b_0}{\sigma^2}\} \times (\sigma^2)^{-\frac{p}{2}}\exp\{-\frac{1}{2}(\beta - m_0)^{\mathrm{\scriptscriptstyle T}}\frac{M_0^{-1}}{\sigma^2}(\beta - m_0)\} \\ &\rightarrow (\sigma^2)^{\frac{p}{2}-1} \times 1 \times (\sigma^2)^{-\frac{p}{2}} \times 1 \qquad \text{(as $a_0 \rightarrow -\tfrac{p}{2}$, $b_0 \rightarrow 0$, $M_0^{-1} \rightarrow 0$)}\\ &= \frac{1}{\sigma^2}, \end{align*}\] and ultimately yields a posterior distribution with:

    • \(M_1 = (X^{\mathrm{\scriptscriptstyle T}}X)^{-1}\)
    • \(m_1 = X^{\mathrm{\scriptscriptstyle T}}y\), so the posterior mean of \(\beta\) is \(M_1m_1 = (X^{\mathrm{\scriptscriptstyle T}}X)^{-1}X^{\mathrm{\scriptscriptstyle T}}y = \hat{\beta}\)
    • \(a_1 = \frac{n-p}{2}\)
    • \(b_1 = \frac{c^{*}}{2}\) (since \(b_0 \rightarrow 0\))
    • \(c^{*} = y^{\mathrm{\scriptscriptstyle T}}y - m_1^{\mathrm{\scriptscriptstyle T}}M_1m_1 = y^{\mathrm{\scriptscriptstyle T}}(I - P_x)y = (n-p)s^2\)

where \(P_x = X(X^{\mathrm{\scriptscriptstyle T}}X)^{-1}X^{\mathrm{\scriptscriptstyle T}}\) and \(s^2 = \frac{y^{\mathrm{\scriptscriptstyle T}}(I - P_x)y}{n-p}\) is the usual unbiased estimate of the error variance.
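Under the flat prior these quantities reduce to familiar least-squares output. A minimal sketch (numpy assumed; the function name is illustrative):

```python
import numpy as np

def flat_prior_posterior(X, y):
    """Posterior under the improper prior p(beta, sigma2) proportional to 1/sigma2."""
    n, p = X.shape
    M1 = np.linalg.inv(X.T @ X)
    beta_hat = M1 @ X.T @ y   # posterior mean of beta = OLS estimate
    resid = y - X @ beta_hat
    a1 = (n - p) / 2
    b1 = resid @ resid / 2    # c*/2 = (n - p) s^2 / 2
    return beta_hat, M1, a1, b1
```

The posterior mean coincides with the OLS fit, matching \(M_1m_1 = \hat{\beta}\) above.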

If we apply this framework to the simpler model presented at the beginning of the chapter, \(y = 1_n\beta_0 + \epsilon\), where \(X = 1_n\) is a vector of 1’s (so \(p = 1\)), our posterior parameters should be (a sampling sketch follows this list):
    • \(\mu_\beta^{*} = \bar{y}\)
    • \(a_1 = \frac{n-1}{2}\)
    • \(b_1 = \frac{(n-1)s^2}{2}\)
    • \(c^{*} = y^{\mathrm{\scriptscriptstyle T}}y - m_1^{\mathrm{\scriptscriptstyle T}}M_1m_1 = y^{\mathrm{\scriptscriptstyle T}}(I - P_x)y = (n-1)s^2\), with \(P_x = \frac{1_n1_n^{\mathrm{\scriptscriptstyle T}}}{n}\)
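This posterior is easy to sample by composition: draw \(\sigma^2\) from the inverse gamma, then \(\beta_0 \mid \sigma^2 \sim N(\bar{y}, \sigma^2/n)\). A minimal sketch (numpy/scipy assumed; names are illustrative):

```python
import numpy as np
from scipy import stats

def sample_flat_posterior(n, ybar, s2, size=10_000, seed=0):
    """Draw (beta0, sigma2) from the flat-prior posterior using only (n, ybar, s2)."""
    rng = np.random.default_rng(seed)
    a1, b1 = (n - 1) / 2, (n - 1) * s2 / 2
    sigma2 = stats.invgamma.rvs(a1, scale=b1, size=size, random_state=rng)  # sigma2 | y
    beta0 = rng.normal(loc=ybar, scale=np.sqrt(sigma2 / n))                 # beta0 | sigma2, y
    return beta0, sigma2

beta0, sigma2 = sample_flat_posterior(n=50, ybar=2.1, s2=2.2)
print(beta0.mean(), sigma2.mean())  # should land near ybar and s2
```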

3.3 Extension to Divide & Conquer Algorithm

While maintaining the framework we established in parts 1 and 2, let us now additionally assume \(n\) is so large that we are unable to store \(y\) or \(X\) in memory to carry out computations. As a result, we divide the data set into \(K\) mutually exclusive and exhaustive subsets, each comprising a manageable number of points. Let \(y_k\) denote the \(q_k \times 1\) subvector of \(y\) and \(X_k\) the \(q_k \times p\) submatrix of \(X\) in subset \(k\), where \(q_k > p\) is small enough that we can fit the desired model on \(\{y_k, X_k\}\). It is shown below that we can still compute \(a_1\), \(b_1\), \(M_1\) and \(m_1\) without ever having to store or compute with \(y\) or \(X\), using only quantities computed from the subsets \(\{y_k, X_k\}\), for \(k = 1, 2, \ldots, K\).

Starting with a simple example where \(K = 2\), and taking the assumptions made in parts 1 and 2 into consideration, we note that:

    • \(\bar{y}_1 = \frac{\sum_{i=1}^{q_1}y_{1i}}{q_1}\)
    • \(\bar{y}_2 = \frac{\sum_{i=1}^{q_2}y_{2i}}{q_2}\)
    • This implies that \(\bar{y} = \frac{q_1\bar{y}_1 + q_2\bar{y}_2}{n}\), where \(q_1+q_2 = n\).
    • Extending this to \(K\) subsets: \(\bar{y} = \frac{\sum_{k=1}^{K}q_k\bar{y}_k}{n}\), where \(\sum_{k=1}^{K}q_k = n\).
Similarly (the merging rule is sketched in code after this list),
    • \(s_1^2 = \frac{\sum_{i=1}^{q_1}({y_{1i}-\bar{y}_1})^2}{q_1-1}\)
    • \(s_2^2 = \frac{\sum_{i=1}^{q_2}({y_{2i}-\bar{y}_2})^2}{q_2-1}\)
    • Writing \(y_{ki} - \bar{y} = (y_{ki} - \bar{y}_k) + (\bar{y}_k - \bar{y})\) and summing, the cross terms vanish, which implies that \((n-1)s^2 = (q_1-1)s_1^2 + q_1(\bar{y}_1 - \bar{y})^2 + (q_2-1)s_2^2 + q_2(\bar{y}_2 - \bar{y})^2\), where \(q_1+q_2 = n\).
    • Extending this to \(K\) subsets: \((n-1)s^2 = \sum_{k=1}^{K}\left[(q_k-1)s_k^2 + q_k(\bar{y}_k - \bar{y})^2\right]\), where \(\sum_{k=1}^{K}q_k = n\).
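These combination rules are one-pass computable. A minimal sketch that merges the per-subset summaries into the full-data statistics and verifies the result (numpy assumed; `combine_suffstats` is illustrative):

```python
import numpy as np

def combine_suffstats(q, ybar_k, s2_k):
    """Merge per-subset (q_k, ybar_k, s_k^2) into the full-data (n, ybar, s^2)."""
    q, ybar_k, s2_k = map(np.asarray, (q, ybar_k, s2_k))
    n = q.sum()
    ybar = np.sum(q * ybar_k) / n
    # (n - 1) s^2 = sum_k [(q_k - 1) s_k^2 + q_k (ybar_k - ybar)^2]
    ss = np.sum((q - 1) * s2_k + q * (ybar_k - ybar) ** 2)
    return n, ybar, ss / (n - 1)

# sanity check against the pooled data
rng = np.random.default_rng(2)
parts = [rng.normal(size=m) for m in (20, 35, 15)]
n, ybar, s2 = combine_suffstats([p.size for p in parts],
                                [p.mean() for p in parts],
                                [p.var(ddof=1) for p in parts])
y = np.concatenate(parts)
print(np.isclose(ybar, y.mean()), np.isclose(s2, y.var(ddof=1)))  # True True
```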
Applying what we already derived in the previous parts, we can see that for each subset \(k = 1, 2\), analyzed on its own under the flat prior:
    • \(\mu_{k\beta}^{*} = \bar{y}_k\)
    • \(a_{k1} = \frac{q_k - 1}{2}\)
    • \(b_{k1} = \frac{(q_k - 1)s_k^2}{2}\)
    • \(c_{k}^{*} = (q_k - 1)s_k^2\)
We also know that the posterior parameters for the full data set are:
    • \(\mu_\beta^{*} = \bar{y}\)
    • \(a_1 = \frac{n-1}{2}\)
    • \(b_1 = \frac{(n-1)s^2}{2}\)
    • \(c^{*} = y^{\mathrm{\scriptscriptstyle T}}y - m_1^{\mathrm{\scriptscriptstyle T}}M_1m_1 = y^{\mathrm{\scriptscriptstyle T}}(I - P_x)y = (n-1)s^2\)
We can see that the full-data posterior parameters can be expressed entirely in terms of the subset summaries \((q_k, \bar{y}_k, s_k^2)\) for \(k = 1, 2\) (note that \(\bar{y}\) itself is computed from these summaries):
    • \(\mu_\beta^{*} = \bar{y} = \frac{q_1\bar{y}_1 + q_2\bar{y}_2}{n}\)
    • \(a_1 = \frac{q_1 + q_2 - 1}{2}\)
    • \(b_1 = \frac{(q_1-1)s_1^2 + q_1(\bar{y}_1 - \bar{y})^2 + (q_2-1)s_2^2 + q_2(\bar{y}_2 - \bar{y})^2}{2}\)
    • \(c^{*} = 2b_1\)

which shows that it is still possible to compute \(a_1\), \(b_1\), \(M_1\) and \(m_1\) without ever having to store or compute with \(y\) or \(X\), using only quantities computed from the subsets (here \(k = 1, 2\); the same argument extends to general \(K\)).
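An end-to-end numerical check: compute the flat-prior posterior parameter \(b_1\) from merged subset summaries and from the full vector, and confirm they agree. A minimal sketch (numpy assumed; the chunking is illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(loc=3.0, scale=2.0, size=1_000)
chunks = np.array_split(y, 4)  # K = 4 mutually exclusive, exhaustive subsets

# per-subset summaries: these are all we keep from each chunk
q = np.array([c.size for c in chunks], dtype=float)
ybars = np.array([c.mean() for c in chunks])
s2s = np.array([c.var(ddof=1) for c in chunks])

# merged sufficient statistics
n = q.sum()
ybar = np.sum(q * ybars) / n
s2 = np.sum((q - 1) * s2s + q * (ybars - ybar) ** 2) / (n - 1)

# flat-prior posterior parameter from the subsets vs. from the full vector
b1_subsets = (n - 1) * s2 / 2
b1_full = (y.size - 1) * y.var(ddof=1) / 2
print(np.isclose(b1_subsets, b1_full))  # True: y itself was never needed
```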