Chapter 3 Posteriors Using Sufficient Statistics
Recall the following Bayesian (hierarchical) linear regression model from part 1:
$$NIG(M_0, m_0, a_0, b_0) \times N(y \mid X\beta, \sigma^2 I_n),$$ where $y = (y_1, y_2, \ldots, y_n)^T$ is an $n \times 1$ vector of outcomes and $X$ is a fixed $n \times p$ matrix of predictors with full column rank.
Suppose that, for privacy reasons, instead of $y = (y_1, y_2, \ldots, y_n)^T$ we are given data consisting only of the sufficient statistics $s^2$ and $\bar{y}$. Further suppose that we are working with the following simpler model: $y = 1_n\beta_0 + \epsilon$.
How do we obtain $p(\beta_0, \sigma^2 \mid y)$ using this data set?
3.1 Methods
We can use the factorization theorem to factor the normal distribution's pdf into a function of the sufficient statistics, $s^2$ and $\bar{y}$:
$$\begin{aligned} f(y \mid \beta_0, \sigma^2) &= \prod_{i=1}^{n}(2\pi\sigma^2)^{-\frac{1}{2}}\exp\left\{-\frac{1}{2\sigma^2}(y_i - \beta_0)^2\right\} \\ &= (2\pi\sigma^2)^{-\frac{n}{2}}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \beta_0)^2\right\} \\ &= (2\pi\sigma^2)^{-\frac{n}{2}}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \bar{y} + \bar{y} - \beta_0)^2\right\} \\ &= (2\pi\sigma^2)^{-\frac{n}{2}}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}\left[(y_i - \bar{y})^2 + (\bar{y} - \beta_0)^2\right]\right\} \\ &= (2\pi\sigma^2)^{-\frac{n}{2}}\exp\left\{-\frac{1}{2\sigma^2}\left[(n-1)s^2 + n(\bar{y} - \beta_0)^2\right]\right\} \end{aligned}$$
Here the cross term $2(\bar{y} - \beta_0)\sum_{i=1}^{n}(y_i - \bar{y})$ vanishes because $\sum_{i=1}^{n}(y_i - \bar{y}) = 0$.
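This identity can be verified numerically. Below is a minimal sketch, assuming NumPy is available (the simulated data and variable names are ours, not from the text), comparing the log-likelihood from the raw data against the version computed from $\bar{y}$ and $s^2$ alone:

```python
import numpy as np

rng = np.random.default_rng(42)
n, beta0, sigma2 = 50, 2.0, 1.5
y = rng.normal(beta0, np.sqrt(sigma2), size=n)

# Sufficient statistics: sample mean and sample variance (ddof=1).
ybar = y.mean()
s2 = y.var(ddof=1)

# Log-likelihood from the raw data.
loglik_raw = -0.5 * n * np.log(2 * np.pi * sigma2) \
             - 0.5 * np.sum((y - beta0) ** 2) / sigma2

# Log-likelihood from the sufficient statistics only.
loglik_suff = -0.5 * n * np.log(2 * np.pi * sigma2) \
              - 0.5 * ((n - 1) * s2 + n * (ybar - beta0) ** 2) / sigma2

assert np.isclose(loglik_raw, loglik_suff)  # the two forms agree
```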
We have now successfully expressed our likelihood as a function of the sufficient statistics $s^2$ and $\bar{y}$, and can use this factorized likelihood along with Bayes' Rule to compute $p(\beta_0, \sigma^2 \mid y)$.
$$\begin{aligned} p(\beta_0, \sigma^2 \mid y) &\propto p(\beta_0 \mid \mu_\beta, \sigma^2) \times p(\sigma^2 \mid a_0, b_0) \times p(y \mid \beta_0, \sigma^2) \\ &= N(\beta_0 \mid \mu_\beta, \sigma^2) \times IG(\sigma^2 \mid a_0, b_0) \times N(y \mid \beta_0, \sigma^2) \\ &= (2\pi\sigma^2)^{-\frac{1}{2}}\exp\left\{-\frac{1}{2\sigma^2}(\beta_0 - \mu_\beta)^2\right\} \times \frac{b_0^{a_0}}{\Gamma(a_0)}(\sigma^2)^{-a_0-1}\exp\left\{-\frac{b_0}{\sigma^2}\right\} \times (2\pi\sigma^2)^{-\frac{n}{2}}\exp\left\{-\frac{1}{2\sigma^2}\left[(n-1)s^2 + n(\bar{y} - \beta_0)^2\right]\right\} \\ &\propto (\sigma^2)^{-a_0-1} \times (\sigma^2)^{-\frac{n+1}{2}} \times \exp\left\{-\frac{b_0}{\sigma^2}\right\} \times \exp\left\{-\frac{1}{2\sigma^2}(\beta_0 - \mu_\beta)^2\right\} \times \exp\left\{-\frac{1}{2\sigma^2}\left[(n-1)s^2 + n(\bar{y} - \beta_0)^2\right]\right\} \\ &= (\sigma^2)^{-a_0-1} \times (\sigma^2)^{-\frac{n+1}{2}}\exp\left\{-\frac{b_0}{\sigma^2}\right\} \times \exp\left\{-\frac{1}{2\sigma^2}\left[(\beta_0 - \mu_\beta)^2 + (n-1)s^2 + n(\bar{y} - \beta_0)^2\right]\right\} \end{aligned}$$
Complete the square on the following term:
$$\begin{aligned} &(\beta_0 - \mu_\beta)^2 + (n-1)s^2 + n(\bar{y} - \beta_0)^2 \\ &= (n+1)\beta_0^2 - 2(\mu_\beta + n\bar{y})\beta_0 + n\bar{y}^2 + \mu_\beta^2 + (n-1)s^2 \\ &= (n+1)\left(\beta_0 - \frac{\mu_\beta + n\bar{y}}{n+1}\right)^2 + n\bar{y}^2 + \mu_\beta^2 + (n-1)s^2 - \frac{(\mu_\beta + n\bar{y})^2}{n+1} \\ &= (n+1)\left(\beta_0 - \frac{\mu_\beta + n\bar{y}}{n+1}\right)^2 + y^T\left(\frac{1_n 1_n^T}{n}\right)y + y^T\left(I - \frac{1_n 1_n^T}{n}\right)y + \mu_\beta^2 - \frac{(\mu_\beta + n\bar{y})^2}{n+1} \\ &= (n+1)\left(\beta_0 - \frac{\mu_\beta + n\bar{y}}{n+1}\right)^2 + y^Ty + \mu_\beta^2 - \frac{(\mu_\beta + n\bar{y})^2}{n+1} \end{aligned}$$
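The completing-the-square step can be checked symbolically. A quick sketch using SymPy (the symbol names are ours, not from the text):

```python
import sympy as sp

beta0, mu, ybar, s2, n = sp.symbols('beta0 mu_beta ybar s2 n', positive=True)

# Left-hand side: the quadratic-in-beta0 term from the posterior kernel.
lhs = (beta0 - mu)**2 + (n - 1)*s2 + n*(ybar - beta0)**2

# Right-hand side: the completed square plus the leftover constant.
mu_star = (mu + n*ybar) / (n + 1)
rhs = (n + 1)*(beta0 - mu_star)**2 + n*ybar**2 + mu**2 + (n - 1)*s2 \
      - (mu + n*ybar)**2 / (n + 1)

assert sp.simplify(lhs - rhs) == 0  # the identity holds exactly
```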
We now let:
- $\mu_\beta^* = \frac{n\bar{y} + \mu_\beta}{n+1}$
- $c = y^Ty + \mu_\beta^2 - \frac{(n\bar{y} + \mu_\beta)^2}{n+1}$,

and substitute this expansion back into the expression above.
$$\begin{aligned} &= (\sigma^2)^{-a_0-1} \times (\sigma^2)^{-\frac{n+1}{2}} \times \exp\left\{-\frac{b_0}{\sigma^2}\right\} \times \exp\left\{-\frac{n+1}{2\sigma^2}(\beta_0 - \mu_\beta^*)^2\right\} \times \exp\left\{-\frac{c}{2\sigma^2}\right\} \\ &= (\sigma^2)^{-a_0-1} \times (\sigma^2)^{-\frac{n+1}{2}} \times \exp\left\{-\frac{b_0}{\sigma^2}\right\} \times \exp\left\{-\frac{1}{2\sigma_*^2}(\beta_0 - \mu_\beta^*)^2\right\} \times \exp\left\{-\frac{c}{2\sigma^2}\right\} \\ &= (\sigma^2)^{-a_0-1-\frac{n}{2}} \times \exp\left\{-\frac{b_0 + \frac{c}{2}}{\sigma^2}\right\} \times (\sigma^2)^{-\frac{1}{2}} \times \exp\left\{-\frac{1}{2\sigma_*^2}(\beta_0 - \mu_\beta^*)^2\right\} \\ &= \underbrace{(\sigma^2)^{-a_1-1} \times \exp\left\{-\frac{b_1}{\sigma^2}\right\}}_{\text{kernel of } IG(a_1, b_1)} \times \underbrace{(\sigma^2)^{-\frac{1}{2}} \times \exp\left\{-\frac{1}{2\sigma_*^2}(\beta_0 - \mu_\beta^*)^2\right\}}_{\text{kernel of } N(\mu_\beta^*, \sigma_*^2)} \end{aligned}$$
Where:
- $\mu_\beta^* = \frac{n\bar{y} + \mu_\beta}{n+1}$
- $\sigma_*^2 = \frac{\sigma^2}{n+1}$
- $a_1 = a_0 + \frac{n}{2}$
- $b_1 = b_0 + \frac{c}{2}$
- $c = y^Ty + \mu_\beta^2 - \frac{(n\bar{y} + \mu_\beta)^2}{n+1}$

Note that $y^Ty = (n-1)s^2 + n\bar{y}^2$, so $c$ (and hence $b_1$) can be computed from the sufficient statistics alone.
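Putting the pieces together, the sketch below (a minimal illustration assuming NumPy; the function name `posterior_from_suff_stats` is ours) computes $(\mu_\beta^*, a_1, b_1)$ from $(n, \bar{y}, s^2)$ alone, exactly as the privacy-constrained setting requires:

```python
import numpy as np

def posterior_from_suff_stats(n, ybar, s2, mu_beta, a0, b0):
    """Posterior parameters for the intercept-only NIG model, computed
    from the sufficient statistics (n, ybar, s2) alone."""
    yty = (n - 1) * s2 + n * ybar**2           # recover y'y from (ybar, s2)
    mu_star = (n * ybar + mu_beta) / (n + 1)   # posterior mean of beta0
    c = yty + mu_beta**2 - (n * ybar + mu_beta)**2 / (n + 1)
    a1 = a0 + n / 2
    b1 = b0 + c / 2
    # Conditionally, beta0 | sigma^2, y ~ N(mu_star, sigma^2 / (n + 1)).
    return mu_star, a1, b1

# Example: simulated data reduced to its sufficient statistics.
rng = np.random.default_rng(0)
y = rng.normal(2.0, 1.0, size=100)
print(posterior_from_suff_stats(len(y), y.mean(), y.var(ddof=1),
                                mu_beta=0.0, a0=2.0, b0=1.0))
```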
3.2 Posterior From Improper Priors
Recall that when we work with this model in a general regression setting, where $X$ is an $n \times p$ matrix of predictors with full column rank, the posterior density is:
$$p(\beta, \sigma^2 \mid y) = IG(\sigma^2 \mid a_1, b_1) \times N(\beta \mid M_1m_1, \sigma^2 M_1)$$ We end up with:
- $M_1 = (X^TX + M_0^{-1})^{-1}$
- $m_1 = X^Ty + M_0^{-1}m_0$
- $a_1 = a_0 + \frac{n}{2}$
- $b_1 = b_0 + \frac{c}{2}$
- $c = y^Ty + m_0^TM_0^{-1}m_0 - m_1^TM_1m_1$
If we take $M_0^{-1} \to O$ (i.e., the null matrix), $a_0 \to -\frac{p}{2}$, and $b_0 \to 0$, we can see that this leads to the improper prior $p(\beta, \sigma^2) \propto \frac{1}{\sigma^2}$:
$$\begin{aligned} p(\beta, \sigma^2) &= IG(\sigma^2 \mid a_0, b_0) \times N(\beta \mid m_0, \sigma^2 M_0) \\ &= \frac{b_0^{a_0}}{\Gamma(a_0)}(\sigma^2)^{-a_0-1}\exp\left\{-\frac{b_0}{\sigma^2}\right\} \times (2\pi)^{-p/2}\det(\sigma^2 M_0)^{-\frac{1}{2}}\exp\left\{-\frac{1}{2\sigma^2}(\beta - m_0)^TM_0^{-1}(\beta - m_0)\right\} \\ &\propto (\sigma^2)^{-a_0-1}\exp\left\{-\frac{b_0}{\sigma^2}\right\} \times (\sigma^2)^{-\frac{p}{2}}\exp\left\{-\frac{1}{2\sigma^2}(\beta - m_0)^TM_0^{-1}(\beta - m_0)\right\} \\ &\to (\sigma^2)^{\frac{p}{2}-1} \times (\sigma^2)^{-\frac{p}{2}}\exp\left\{-\frac{1}{2\sigma^2}(\beta - m_0)^TO(\beta - m_0)\right\} \\ &= \frac{1}{\sigma^2}, \end{aligned}$$
since in the limit $\exp\{-b_0/\sigma^2\} \to 1$ and the quadratic form in the exponent vanishes. This ultimately yields a posterior distribution with:
- $M_1 = (X^TX)^{-1}$ and $m_1 = X^Ty$, so the posterior mean is $M_1m_1 = (X^TX)^{-1}X^Ty = \hat{\beta}$
- $a_1 = \frac{n-p}{2}$
- $b_1 = b_0 + \frac{c^*}{2} = \frac{c^*}{2}$ (since $b_0 \to 0$)
- $c^* = y^Ty - m_1^TM_1m_1 = y^T(I - P_x)y = (n-p)s^2$

where $P_x = X(X^TX)^{-1}X^T$ and $s^2$ is the usual unbiased estimate of the residual variance.
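As a quick numerical check of these limiting-case formulas, the sketch below (assuming NumPy; the simulated data and names are ours) verifies that $y^Ty - m_1^TM_1m_1$ equals $y^T(I - P_x)y$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, -0.5, 2.0]) + rng.normal(size=n)

M1 = np.linalg.inv(X.T @ X)       # posterior covariance factor (M0^{-1} -> O)
m1 = X.T @ y
beta_hat = M1 @ m1                # posterior mean M1 m1 = OLS estimate
a1 = (n - p) / 2
c_star = y @ y - m1 @ M1 @ m1     # c* = y'y - m1' M1 m1
b1 = c_star / 2                   # since b0 -> 0

Px = X @ M1 @ X.T                 # projection onto the column space of X
assert np.isclose(c_star, y @ (np.eye(n) - Px) @ y)
```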
If we apply this framework to the simpler model presented at the beginning of the section, $y = 1_n\beta_0 + \epsilon$, where $X$ is a vector of 1's instead of an $n \times p$ matrix, our posterior parameters should be:
- $\mu_\beta^* = \bar{y}$
- $a_1 = \frac{n-1}{2}$
- $b_1 = \frac{(n-1)s^2}{2}$
- $c = y^Ty - m_1^TM_1m_1 = y^T(I - P_x)y = (n-1)s^2$
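For the intercept-only model this reduces to familiar scalar quantities; a brief sketch (again assuming NumPy, with our own simulated data):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 150
y = rng.normal(3.0, 2.0, size=n)
ybar, s2 = y.mean(), y.var(ddof=1)

# With X = 1_n: M1 = 1/n, m1 = n*ybar, so m1' M1 m1 = n*ybar^2.
c = y @ y - n * ybar**2
assert np.isclose(c, (n - 1) * s2)   # hence b1 = (n - 1) s^2 / 2
```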
3.3 Extension to Divide & Conquer Algorithm
While maintaining the frameworks we established in parts 1 and 2, let's now additionally assume $n$ is so large that we are unable to store or load $y$ or $X$ into memory to carry out computations. As a result, we divide the data set into $K$ mutually exclusive and exhaustive subsets, each comprising a manageable number of points. Let $y_k$ denote the $q_k \times 1$ subvector of $y$ and $X_k$ the $q_k \times p$ submatrix of $X$ in subset $k$, such that $q_k > p$ and is small enough that we can fit the desired model on $\{y_k, X_k\}$. It is shown below that it is still possible to compute $a_1$, $b_1$, $M_1$ and $m_1$ without ever having to store or compute with $y$ or $X$, using only quantities computed from the subsets $\{y_k, X_k\}$, for $k = 1, 2, \ldots, K$.
Starting with a simple example where $K = 2$, and taking the assumptions made in parts 1 and 2 into consideration, we note that:
- $\bar{y}_1 = \frac{\sum_{i=1}^{q_1} y_{1i}}{q_1}$
- $\bar{y}_2 = \frac{\sum_{i=1}^{q_2} y_{2i}}{q_2}$
- This implies that $\bar{y} = \frac{q_1\bar{y}_1 + q_2\bar{y}_2}{n}$, where $q_1 + q_2 = n$
- Extending this to $K$ subsets: $\bar{y} = \frac{\sum_{k=1}^{K} q_k\bar{y}_k}{n}$, where $\sum_{k=1}^{K} q_k = n$
- $s_1^2 = \frac{\sum_{i=1}^{q_1}(y_{1i} - \bar{y}_1)^2}{q_1 - 1}$
- $s_2^2 = \frac{\sum_{i=1}^{q_2}(y_{2i} - \bar{y}_2)^2}{q_2 - 1}$
- Decomposing $\sum_{i=1}^{n}(y_i - \bar{y})^2$ into within-subset and between-subset pieces implies that $s^2 = \frac{(q_1-1)s_1^2 + (q_2-1)s_2^2 + q_1(\bar{y}_1 - \bar{y})^2 + q_2(\bar{y}_2 - \bar{y})^2}{q_1 + q_2 - 1}$, where $q_1 + q_2 = n$
- Extending this to $K$ subsets: $s^2 = \frac{\sum_{k=1}^{K}\left[(q_k-1)s_k^2 + q_k(\bar{y}_k - \bar{y})^2\right]}{n - 1}$, where $\sum_{k=1}^{K} q_k = n$ (see the sketch after this list)
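These combination rules are easy to verify numerically. The sketch below (assuming NumPy; the simulated data and names are ours) recombines subset means and variances and checks them against the full-data values. Note that the between-subset terms $q_k(\bar{y}_k - \bar{y})^2$ are required to recover $s^2$ exactly:

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(5.0, 2.0, size=1000)
subsets = np.array_split(y, 4)                           # K = 4 disjoint subsets

q = np.array([len(sub) for sub in subsets])              # subset sizes q_k
ybars = np.array([sub.mean() for sub in subsets])        # subset means
s2s = np.array([sub.var(ddof=1) for sub in subsets])     # subset variances

n = q.sum()
ybar = (q * ybars).sum() / n                             # combined mean
# Within-subset plus between-subset sums of squares:
s2 = ((q - 1) * s2s + q * (ybars - ybar) ** 2).sum() / (n - 1)

assert np.isclose(ybar, y.mean())
assert np.isclose(s2, y.var(ddof=1))
```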
Fitting the model to each subset separately yields the subset-level posterior parameters:
- $\mu_{\beta,k}^* = \bar{y}_k$
- $a_{1,k} = \frac{q_k - 1}{2}$
- $b_{1,k} = \frac{(q_k - 1)s_k^2}{2}$
- $c_k = (q_k - 1)s_k^2$

The full-data posterior parameters are:
- $\mu_\beta^* = \bar{y}$
- $a_1 = \frac{n-1}{2}$
- $b_1 = \frac{(n-1)s^2}{2}$
- $c = y^Ty - m_1^TM_1m_1 = y^T(I - P_x)y = (n-1)s^2$

Substituting the combination rules above (for $K = 2$), each of these can be written purely in terms of the subset summaries:
- $\mu_\beta^* = \bar{y} = \frac{q_1\bar{y}_1 + q_2\bar{y}_2}{n}$
- $a_1 = \frac{q_1 + q_2 - 1}{2}$
- $b_1 = \frac{(q_1-1)s_1^2 + (q_2-1)s_2^2 + q_1(\bar{y}_1 - \bar{y})^2 + q_2(\bar{y}_2 - \bar{y})^2}{2}$
- $c = (q_1-1)s_1^2 + (q_2-1)s_2^2 + q_1(\bar{y}_1 - \bar{y})^2 + q_2(\bar{y}_2 - \bar{y})^2$
which shows that it is possible to compute $a_1$, $b_1$, $M_1$ and $m_1$ without ever having to store or compute with $y$ or $X$, using only quantities computed from the subsets (for $k = 1, 2$).
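To close the loop, here is a minimal end-to-end sketch (assuming NumPy; the simulated data and names are ours) that keeps only $(q_k, \bar{y}_k, s_k^2)$ from each subset and recovers the full-data posterior parameters $(\mu_\beta^*, a_1, b_1)$:

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.normal(-1.0, 1.5, size=10_000)
subsets = np.array_split(y, 10)                   # K = 10 subsets

# The only quantities retained from each subset:
q = np.array([len(sub) for sub in subsets])
ybars = np.array([sub.mean() for sub in subsets])
s2s = np.array([sub.var(ddof=1) for sub in subsets])

# Combined posterior parameters, built from the subset summaries alone.
n = q.sum()
mu_star = (q * ybars).sum() / n
c = ((q - 1) * s2s + q * (ybars - mu_star) ** 2).sum()
a1, b1 = (n - 1) / 2, c / 2

# They match the full-data values.
assert np.isclose(mu_star, y.mean())
assert np.isclose(b1, (len(y) - 1) * y.var(ddof=1) / 2)
```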