Chapter 2 The Divide & Conquer Algorithm

The divide-and-conquer algorithm is a strategy for solving a large problem by breaking it down into two or more smaller, more manageable sub-problems of the same or a similar type. The solutions to the sub-problems are then combined to obtain the desired output: the solution to the original problem.

In this section, we apply the divide and conquer algorithm to the Bayesian linear regression framework.

2.1 The Problem

Let $y = (y_1, y_2, \ldots, y_n)^{\top}$ be an $n \times 1$ random vector of outcomes and let $X$ be a fixed $n \times p$ matrix of predictors with full column rank. Consider the following Bayesian (hierarchical) linear regression model,

\[
NIG(\beta, \sigma^2 \mid m_0, M_0, a_0, b_0) \times N(y \mid X\beta, \sigma^2 I_n),
\]
where $NIG(\beta, \sigma^2 \mid m_0, M_0, a_0, b_0) = IG(\sigma^2 \mid a_0, b_0) \times N(\beta \mid m_0, \sigma^2 M_0)$ denotes the normal-inverse-gamma prior. The posterior density is
\[
p(\beta, \sigma^2 \mid y) = IG(\sigma^2 \mid a_1, b_1) \times N(\beta \mid M_1 m_1, \sigma^2 M_1),
\]
and to carry out Bayesian inference we first sample $\sigma^2 \sim IG(a_1, b_1)$ and then, for each sampled $\sigma^2$, we sample $\beta \sim N(M_1 m_1, \sigma^2 M_1)$.
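As a concrete illustration of this composition sampling step, here is a minimal sketch in Python with NumPy/SciPy, assuming the posterior parameters $a_1$, $b_1$, $m_1$ and $M_1$ have already been computed (the function name and interface are our own):

```python
import numpy as np
from scipy import stats

def sample_posterior(a1, b1, m1, M1, n_samples=1000, seed=0):
    """Composition sampling from p(beta, sigma^2 | y):
    draw sigma^2 ~ IG(a1, b1), then beta ~ N(M1 m1, sigma^2 M1)."""
    rng = np.random.default_rng(seed)
    mean = M1 @ m1                           # posterior mean of beta is M1 m1
    p = mean.shape[0]
    # sigma^2 ~ Inverse-Gamma(a1, b1)
    sigma2 = stats.invgamma.rvs(a1, scale=b1, size=n_samples, random_state=rng)
    # beta | sigma^2 ~ N(M1 m1, sigma^2 M1), drawn via a Cholesky factor of M1
    L = np.linalg.cholesky(M1)
    z = rng.standard_normal((n_samples, p))
    beta = mean + np.sqrt(sigma2)[:, None] * (z @ L.T)
    return sigma2, beta
```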

Assume that $n$ is so large that we are unable to store or load $y$ or $X$ into memory to carry out computations with them. We therefore divide our data set into $K$ mutually exclusive and exhaustive subsets, each comprising a manageable number of points. Note that $p$ is small, so computations involving $p \times p$ matrices are not a concern. Let $y_k$ denote the $q_k \times 1$ sub-vector of $y$ and $X_k$ the $q_k \times p$ sub-matrix of $X$ in subset $k$, where each $q_k$ is chosen so that $q_k > p$ and is small enough that we can fit the above model on $\{y_k, X_k\}$. This section explains how we can still compute $a_1$, $b_1$, $M_1$ and $m_1$ without ever having to store or compute with $y$ or $X$, using only quantities computed from the subsets $\{y_k, X_k\}$ for $k = 1, 2, \ldots, K$.

2.2 Parallel Computing

2.2.1 Background and Motivation

A first approach to the problem presented above is called ‘parallel computing’. Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously. In our context, this entails dividing the data set into manageable subsets, computing the posterior for each subset simultaneously, and finally expressing the posterior of the entire data set as a function of the posteriors of the subsets. The motivation behind this method is to make computation more efficient.

2.2.2 Solution

Using the multivariate completing-the-square method, the explicit expressions for $a_1$, $b_1$, $m_1$ and $M_1$ are given by:

  • $a_1 = a_0 + \frac{n}{2}$
  • $b_1 = b_0 + \frac{c}{2}$
  • $c = m_0^{\top} M_0^{-1} m_0 + y^{\top} y - m_1^{\top} M_1 m_1$
  • $m_1 = X^{\top} y + M_0^{-1} m_0$
  • $M_1^{-1} = X^{\top} X + M_0^{-1}$
This implies that the explicit expressions for $a_{1i}$, $b_{1i}$, $m_{1i}$ and $M_{1i}$ for the $i$th subset are given by:
  • $a_{1i} = a_0 + \frac{q_i}{2}$
  • $b_{1i} = b_0 + \frac{c_i}{2}$
  • $c_i = m_0^{\top} M_0^{-1} m_0 + y_i^{\top} y_i - m_{1i}^{\top} M_{1i} m_{1i}$
  • $m_{1i} = X_i^{\top} y_i + M_0^{-1} m_0$
  • $M_{1i}^{-1} = X_i^{\top} X_i + M_0^{-1}$
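Because the full-data and subset updates share the same functional form, a single routine serves both. A minimal sketch in Python/NumPy (the function name and the choice to pass the prior precision $M_0^{-1}$ directly are our own):

```python
import numpy as np

def nig_posterior(y, X, m0, M0_inv, a0, b0):
    """Normal-inverse-gamma posterior parameters (a1, b1, m1, M1^{-1}) for the model
    IG(sigma^2 | a0, b0) N(beta | m0, sigma^2 M0) N(y | X beta, sigma^2 I)."""
    n = y.shape[0]
    M1_inv = X.T @ X + M0_inv                 # M1^{-1} = X'X + M0^{-1}
    m1 = X.T @ y + M0_inv @ m0                # m1 = X'y + M0^{-1} m0
    M1 = np.linalg.inv(M1_inv)
    a1 = a0 + n / 2.0                         # a1 = a0 + n/2
    c = m0 @ M0_inv @ m0 + y @ y - m1 @ M1 @ m1
    b1 = b0 + c / 2.0                         # b1 = b0 + c/2
    return a1, b1, m1, M1_inv
```

Applied to $\{y_i, X_i\}$ with the common prior, the same call yields $a_{1i}$, $b_{1i}$, $m_{1i}$ and $M_{1i}^{-1}$.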

We can express the posteriors of the entire data set as a function of the posteriors of the subsets as follows:

  • $a_1 = \sum_{i=1}^{K} a_{1i} - (K-1) a_0$
  • $b_1 = b_0 + \frac{1}{2}\left(m_0^{\top} M_0^{-1} m_0 + \sum_{i=1}^{K} y_i^{\top} y_i - m_1^{\top} M_1 m_1\right)$
  • $m_1 = \sum_{i=1}^{K} m_{1i} - (K-1) M_0^{-1} m_0$
  • $M_1^{-1} = \sum_{i=1}^{K} X_i^{\top} X_i + M_0^{-1} = \sum_{i=1}^{K} M_{1i}^{-1} - (K-1) M_0^{-1}$
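A sketch of the combination step in Python/NumPy, assuming each subset posterior has been computed with the hypothetical nig_posterior function above and collected in a list; the term $y_i^{\top} y_i$ is recovered from $b_{1i}$ using the definition of $c_i$:

```python
import numpy as np

def combine_subset_posteriors(subset_posts, m0, M0_inv, a0, b0):
    """Recover the full-data (a1, b1, m1, M1^{-1}) from the K subset posteriors,
    each given as a tuple (a1i, b1i, m1i, M1i_inv)."""
    K = len(subset_posts)
    prior_quad = m0 @ M0_inv @ m0                              # m0' M0^{-1} m0
    a1 = sum(sp[0] for sp in subset_posts) - (K - 1) * a0
    m1 = sum(sp[2] for sp in subset_posts) - (K - 1) * (M0_inv @ m0)
    M1_inv = sum(sp[3] for sp in subset_posts) - (K - 1) * M0_inv
    M1 = np.linalg.inv(M1_inv)
    # y_i' y_i = 2 (b1i - b0) - m0' M0^{-1} m0 + m1i' M1i m1i, from the definition of c_i
    yty = sum(
        2.0 * (b1i - b0) - prior_quad + m1i @ np.linalg.solve(M1i_inv, m1i)
        for (_, b1i, m1i, M1i_inv) in subset_posts
    )
    b1 = b0 + 0.5 * (prior_quad + yty - m1 @ M1 @ m1)
    return a1, b1, m1, M1_inv
```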

2.2.3 Algebra

This section details the algebra used to obtain the posteriors of the entire data set expressed as functions of the posteriors of the subsets.

Starting with $a_1$:
\[
\sum_{i=1}^{K} a_{1i} = \sum_{i=1}^{K} a_0 + \sum_{i=1}^{K} \frac{q_i}{2} = K a_0 + \frac{n}{2} = (K-1) a_0 + a_0 + \frac{n}{2} = (K-1) a_0 + a_1,
\]
so that
\[
a_1 = \sum_{i=1}^{K} a_{1i} - (K-1) a_0.
\]

Moving on to $m_1$:

Recall that $m_1 = X^{\top} y + M_0^{-1} m_0$,

where $X = \begin{bmatrix} X_1^{\top} & X_2^{\top} & \cdots & X_K^{\top} \end{bmatrix}^{\top}$ with $X_i \in \mathbb{R}^{q_i \times p}$, and $y = \begin{bmatrix} y_1^{\top} & y_2^{\top} & \cdots & y_K^{\top} \end{bmatrix}^{\top}$ with $y_i \in \mathbb{R}^{q_i \times 1}$.

Using linear algebra we can see that
\[
X^{\top} y = \begin{bmatrix} X_1^{\top} & X_2^{\top} & \cdots & X_K^{\top} \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_K \end{bmatrix} = \sum_{i=1}^{K} X_i^{\top} y_i.
\]
Thus:
\[
\sum_{i=1}^{K} m_{1i} = \sum_{i=1}^{K} X_i^{\top} y_i + K M_0^{-1} m_0 = X^{\top} y + K M_0^{-1} m_0 = X^{\top} y + M_0^{-1} m_0 + (K-1) M_0^{-1} m_0 = m_1 + (K-1) M_0^{-1} m_0,
\]
so that
\[
m_1 = \sum_{i=1}^{K} m_{1i} - (K-1) M_0^{-1} m_0.
\]

Next is $M_1^{-1}$:

Recall that $M_1^{-1} = X^{\top} X + M_0^{-1}$.

Using linear algebra again, we can see that
\[
X^{\top} X = \begin{bmatrix} X_1^{\top} & X_2^{\top} & \cdots & X_K^{\top} \end{bmatrix} \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_K \end{bmatrix} = \sum_{i=1}^{K} X_i^{\top} X_i,
\]
so that
\[
M_1^{-1} = \sum_{i=1}^{K} X_i^{\top} X_i + M_0^{-1} = \sum_{i=1}^{K} M_{1i}^{-1} - (K-1) M_0^{-1}.
\]

Last is $b_1$:

Recall that $b_1 = b_0 + \dfrac{m_0^{\top} M_0^{-1} m_0 + y^{\top} y - m_1^{\top} M_1 m_1}{2}$.

Using linear algebra once again, we can see that
\[
y^{\top} y = \begin{bmatrix} y_1^{\top} & y_2^{\top} & \cdots & y_K^{\top} \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_K \end{bmatrix} = \sum_{i=1}^{K} y_i^{\top} y_i,
\]
so that
\[
b_1 = b_0 + \frac{m_0^{\top} M_0^{-1} m_0 + \sum_{i=1}^{K} y_i^{\top} y_i - m_1^{\top} M_1 m_1}{2}.
\]

This concludes the algebra behind obtaining $a_1$, $b_1$, $M_1$, and $m_1$ from the posteriors computed using the subsetted data.
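As a quick numerical check of this algebra, the following sketch reuses the hypothetical nig_posterior and combine_subset_posteriors functions above on arbitrary simulated data and verifies that the combined subset posteriors match the full-data posteriors:

```python
import numpy as np

rng = np.random.default_rng(42)
n, p, K = 1000, 3, 4
X = rng.standard_normal((n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(n)

# Prior parameters (arbitrary choices for illustration)
m0 = np.zeros(p)
M0_inv = np.eye(p)
a0, b0 = 2.0, 1.0

# Full-data posterior versus combined subset posteriors
full = nig_posterior(y, X, m0, M0_inv, a0, b0)
posts = [nig_posterior(yk, Xk, m0, M0_inv, a0, b0)
         for yk, Xk in zip(np.array_split(y, K), np.array_split(X, K))]
combined = combine_subset_posteriors(posts, m0, M0_inv, a0, b0)

for f, c in zip(full, combined):
    assert np.allclose(f, c)          # a1, b1, m1, M1^{-1} all agree
```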

2.3 Sequential Computing

2.3.1 Background and Motivation

A second approach to the problem presented in this section is called ‘sequential computing’. Sequential computing is a type of computation in which instructions are executed one at a time, each waiting for the previous instruction to finish. For our problem, this entails computing the posterior for the first subset, using it as the prior when computing the posterior for the next subset, and so on until we reach the last subset, whose posterior is the posterior of the entire data set.

2.3.2 Solution

Let $D_k = \{y_k, X_k\}$ be the $k$th subset of the entire data set for $k = 1, \ldots, K$, such that $D_i \cap D_j = \emptyset$ for $i \neq j$. We will start with a simple example of how sequential computing works by setting $K = 2$.

The posterior density from the first subset is
\[
p(\beta, \sigma^2 \mid D_1) = IG(\sigma^2 \mid a_1, b_1) \times N(\beta \mid M_1 m_1, \sigma^2 M_1),
\]
where $a_1$, $b_1$, $m_1$ and $M_1$ are computed from $D_1$ alone.

The posterior parameters $\{a_1, b_1, M_1, m_1\}$ then replace $\{a_0, b_0, M_0, m_0\}$ as the prior parameters when the second subset is used to calculate the posterior:

\[
p(\beta, \sigma^2 \mid D_1, D_2) \propto p(\beta, \sigma^2, D_1, D_2) = p(\beta, \sigma^2) \times p(D_1, D_2 \mid \beta, \sigma^2) = p(\beta, \sigma^2) \times p(D_1 \mid \beta, \sigma^2) \times p(D_2 \mid \beta, \sigma^2) \propto p(\beta, \sigma^2 \mid D_1) \times p(D_2 \mid \beta, \sigma^2),
\]
where the second equality uses the conditional independence of $D_1$ and $D_2$ given $(\beta, \sigma^2)$.

This illustrates how the posterior from the previous subset acts as the prior when calculating the posterior for the current subset.

If we generalize this to $K$ subsets and do some algebra, we can derive expressions for the posterior parameters after the last subset, which are identical to the posteriors obtained using the full data set (a code sketch of this sequential update follows the list below):

  • $a_{1K} = a_0 + \sum_{i=1}^{K} \frac{q_i}{2} = a_0 + \frac{n}{2}$
  • $b_{1K} = b_0 + \frac{c}{2}$
  • $c = m_0^{\top} M_0^{-1} m_0 + y^{\top} y - m_{1K}^{\top} M_{1K} m_{1K}$
  • $m_{1K} = X^{\top} y + M_0^{-1} m_0$
  • $M_{1K}^{-1} = X^{\top} X + M_0^{-1}$
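A sketch of the sequential scheme in Python/NumPy, reusing the hypothetical nig_posterior function from Section 2.2: the posterior parameters from each subset are passed in as the prior parameters for the next subset. Note that the prior mean supplied to the next step is the posterior mean $M_1 m_1$, not $m_1$ itself.

```python
import numpy as np

def sequential_posterior(subsets, m0, M0_inv, a0, b0):
    """Run the NIG update subset by subset: the posterior from subset k
    (mean M1 m1, precision M1^{-1}, shape a1, rate b1) becomes the prior
    for subset k+1.  Returns the final (a, b, posterior mean, precision)."""
    mean, prec, a, b = m0, M0_inv, a0, b0
    for y_k, X_k in subsets:                  # subsets = [(y_1, X_1), ..., (y_K, X_K)]
        a1, b1, m1, M1_inv = nig_posterior(y_k, X_k, mean, prec, a, b)
        mean = np.linalg.solve(M1_inv, m1)    # posterior mean M1 m1 -> next prior mean
        prec = M1_inv                         # posterior precision -> next prior precision
        a, b = a1, b1
    return a, b, mean, prec
```

After the last subset, the returned parameters coincide with the full-data posterior parameters listed above.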