10.1 Foundation
Remember from Chapter 1 that Bayesian model averaging (BMA) is an approach that takes model uncertainty into account. In particular, we consider uncertainty about the regressors (variable selection) in a regression framework with K possible explanatory variables.54 This implies 2^K potential models indexed by parameters \boldsymbol{\theta}_m, m=1,2,\dots,2^K.
Following Simmons et al. (2010), the posterior model probability is \begin{equation*} \pi(\mathcal{M}_j |\boldsymbol{y})=\frac{p(\boldsymbol{y} | \mathcal{M}_j)\pi(\mathcal{M}_j)}{\sum_{m=1}^{2^K}p(\boldsymbol{y} | \mathcal{M}_m)\pi(\mathcal{M}_m)}, \end{equation*} where \pi(\mathcal{M}_j) is the prior model probability,55 \begin{equation*} p(\boldsymbol{y} | \mathcal{M}_j)=\int_{\boldsymbol{\Theta}_j} p(\boldsymbol{y}| \boldsymbol{\theta}_j,\mathcal{M}_j)\pi(\boldsymbol{\theta}_j | \mathcal{M}_j) d\boldsymbol{\theta}_{j} \end{equation*} is the marginal likelihood, and \pi(\boldsymbol{\theta}_j | \mathcal{M}_j) is the prior distribution of \boldsymbol{\theta}_j conditional on model \mathcal{M}_j.
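To make the weighting mechanics concrete, here is a minimal numerical sketch (in Python, not the code behind our GUI) that normalizes marginal likelihoods times prior model probabilities on the log scale; the marginal likelihood values are invented purely for illustration.

```python
import numpy as np

# Hypothetical log marginal likelihoods log p(y | M_m) for the 2^K = 4 models of a
# toy setting with K = 2 regressors, and equal prior model probabilities.
log_marg_lik = np.array([-120.3, -118.7, -119.5, -125.0])
prior = np.full(4, 1 / 4)

# Posterior model probabilities: normalize p(y | M_m) * pi(M_m).
# Working on the log scale avoids numerical underflow.
log_num = log_marg_lik + np.log(prior)
post_model_prob = np.exp(log_num - log_num.max())
post_model_prob /= post_model_prob.sum()
print(post_model_prob.round(3))
```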
Following Adrian E. Raftery (1993), the posterior distribution of \boldsymbol{\theta} is \begin{equation*} \pi(\boldsymbol{\theta}|\boldsymbol{y})= \sum_{m=1}^{2^K}\pi(\boldsymbol{\theta}_m|\boldsymbol{y},\mathcal{M}_m) \pi(\mathcal{M}_m|\boldsymbol{y}) \end{equation*} The posterior distribution of the parameter vector \boldsymbol{\theta} under model \mathcal{M}_m is denoted as \pi(\boldsymbol{\theta}_m|\boldsymbol{y}, \mathcal{M}_m). The posterior mean of \boldsymbol{\theta} is given by: \mathbb{E}[\boldsymbol{\theta}|\boldsymbol{y}] = \sum_{m=1}^{2^K} \hat{\boldsymbol{\theta}}_m \, \pi(\mathcal{M}_m|\boldsymbol{y}),
where \hat{\boldsymbol{\theta}}_m represents the posterior mean under model \mathcal{M}_m.
The posterior variance of the k-th element of \boldsymbol{\theta} given the data \boldsymbol{y} is: \text{Var}(\theta_{k}|\boldsymbol{y}) = \sum_{m=1}^{2^K} \pi(\mathcal{M}_m|\boldsymbol{y}) \, \widehat{\text{Var}}(\theta_{km}|\boldsymbol{y}, \mathcal{M}_m) + \sum_{m=1}^{2^K} \pi(\mathcal{M}_m|\boldsymbol{y}) \left( \hat{\theta}_{km} - \mathbb{E}[\theta_{k}|\boldsymbol{y}] \right)^2,
where \widehat{\text{Var}}(\theta_{km}|\boldsymbol{y}, \mathcal{M}_m) denotes the posterior variance of the k-th element of \boldsymbol{\theta} under model \mathcal{M}_m, and \hat{\theta}_{km} is the corresponding posterior mean.
The posterior variance highlights how BMA accounts for model uncertainty. The first term is the average of the within-model posterior variances, weighted by the posterior model probabilities, while the second term captures how much the point estimates vary across models. The greater the variation in estimates between models, the higher the posterior variance.
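A short numerical sketch of the two formulas above, using invented posterior model probabilities and model-specific summaries for a single coefficient; models that exclude the regressor are treated as placing a point mass at zero, which is an assumption of this example rather than something derived above.

```python
import numpy as np

# Hypothetical posterior model probabilities pi(M_m | y) and model-specific
# posterior means/variances of one coefficient theta_k. Models that exclude
# x_k contribute a point mass at zero (mean 0, variance 0).
w = np.array([0.55, 0.30, 0.10, 0.05])
mean_m = np.array([1.9, 2.3, 0.0, 0.0])
var_m = np.array([0.04, 0.09, 0.0, 0.0])

# BMA posterior mean: probability-weighted average of the model-specific means.
bma_mean = np.sum(w * mean_m)

# BMA posterior variance: within-model variance plus between-model dispersion.
bma_var = np.sum(w * var_m) + np.sum(w * (mean_m - bma_mean) ** 2)
print(bma_mean, bma_var)
```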
The posterior predictive distribution is \begin{equation*} \pi(\boldsymbol{y}_0|\boldsymbol{y})= \sum_{m=1}^{2^K}p_m(\boldsymbol{y}_0|\boldsymbol{y},\mathcal{M}_m) \pi(\mathcal{M}_m|\boldsymbol{y}), \end{equation*}
where p_m(\boldsymbol{y}_0|\boldsymbol{y},\mathcal{M}_m)=\int_{\boldsymbol{\Theta}_m} p(\boldsymbol{y}_0|\boldsymbol{y},\boldsymbol{\theta}_m,\mathcal{M}_m)\pi(\boldsymbol{\theta}_m |\boldsymbol{y}, \mathcal{M}_m) d\boldsymbol{\theta}_{m} is the posterior predictive distribution under model \mathcal{M}_m.
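Because the BMA predictive is a mixture over models, one can simulate from it by first drawing a model with probability \pi(\mathcal{M}_m|\boldsymbol{y}) and then drawing from that model's predictive. The sketch below assumes, purely for illustration, that each model's predictive distribution is summarized by a normal with a known mean and standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior model probabilities and normal summaries of each
# model's posterior predictive p_m(y0 | y, M_m).
w = np.array([0.55, 0.30, 0.10, 0.05])
pred_mean = np.array([3.1, 2.8, 3.5, 2.0])
pred_sd = np.array([0.9, 1.1, 1.0, 1.4])

# Draws from the BMA predictive: pick a model, then draw from its predictive.
m = rng.choice(len(w), size=10_000, p=w)
y0 = rng.normal(pred_mean[m], pred_sd[m])
print(y0.mean(), y0.std())
```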
Another important statistic in BMA is the posterior inclusion probability associated with variable \boldsymbol{x}_k, k=1,2,\dots,K, which is
\begin{equation*} PIP(\boldsymbol{x}_k)=\sum_{m=1}^{2^K}\pi(\mathcal{M}_m|\boldsymbol{y})\times \mathbb{1}_{k,m}, \end{equation*} where \mathbb{1}_{k,m}= \begin{cases} 1 & \text{if } \boldsymbol{x}_{k}\in \mathcal{M}_m, \\ 0 & \text{if } \boldsymbol{x}_{k}\notin \mathcal{M}_m. \end{cases}
R. E. Kass and Raftery (1995) suggest that posterior inclusion probabilities (PIP) less than 0.5 are evidence against the regressor, 0.5\leq PIP<0.75 is weak evidence, 0.75\leq PIP<0.95 is positive evidence, 0.95\leq PIP<0.99 is strong evidence, and PIP\geq 0.99 is very strong evidence.
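The PIP is just a weighted count of the models that include each regressor. The sketch below uses the inclusion patterns of the table shown next (K = 3, eight models) together with invented posterior model probabilities.

```python
import numpy as np

# Rows are x_1, x_2, x_3; columns follow the models M_1, ..., M_8 of the table
# below. The posterior model probabilities are hypothetical and sum to one.
inclusion = np.array([
    [1, 1, 1, 1, 0, 0, 0, 0],  # x_1
    [1, 1, 0, 0, 1, 1, 0, 0],  # x_2
    [1, 0, 1, 0, 1, 0, 1, 0],  # x_3
])
post_model_prob = np.array([0.30, 0.25, 0.05, 0.10, 0.05, 0.15, 0.05, 0.05])

# PIP(x_k) = sum_m pi(M_m | y) * 1{x_k in M_m}
pip = inclusion @ post_model_prob
print(pip)  # compare each value against the evidence thresholds above
```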
There are two main computational issues when implementing BMA based on variable selection. First, the model space contains 2^K models, which can be enormous. For instance, three regressors imply just eight models (see the table below and the short enumeration sketch that follows it), whereas 40 regressors imply approximately 1.1\times 10^{12} models. Take into account that all models include the intercept, and all regressors should be standardized to avoid scale issues.56 The second computational issue is calculating the marginal likelihood p(\boldsymbol{y} | \mathcal{M}_j)=\int_{\boldsymbol{\Theta}_j} p(\boldsymbol{y}| \boldsymbol{\theta}_j,\mathcal{M}_j)\pi(\boldsymbol{\theta}_j | \mathcal{M}_j) d\boldsymbol{\theta}_{j}, which most of the time does not have an analytic solution.
Variable | M_{1} | M_{2} | M_{3} | M_{4} | M_{5} | M_{6} | M_{7} | M_{8} |
---|---|---|---|---|---|---|---|---|
x_1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
x_2 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
x_3 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
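The enumeration in the table can be generated mechanically; the short sketch below (plain Python, for illustration) reproduces the eight inclusion patterns for K = 3 in the same column order and shows how fast the model space grows.

```python
from itertools import product

K = 3
# Each tuple of inclusion indicators (1 = regressor enters, 0 = it does not)
# defines one model; the intercept is always included and is not enumerated.
models = list(product([1, 0], repeat=K))
print(len(models))    # 8 models for K = 3, matching the columns of the table
print(models)         # (1, 1, 1), (1, 1, 0), ..., (0, 0, 0)

print(2 ** 40)        # with 40 regressors there are about 1.1e+12 models
```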
The first computational issue is basically a problem of ranking models. This can be tackled using different approaches, such as Occam’s window criterion (D. Madigan and Raftery 1994; Adrian E. Raftery, Madigan, and Hoeting 1997), reversible jump Markov chain Monte Carlo computation (Green 1995), Markov chain Monte Carlo model composition (D. Madigan, York, and Allard 1995), and multiple testing using intrinsic priors (G. Casella and Moreno 2006) or nonlocal prior densities (Johnson and Rossell 2012). We focus on Occam’s window and Markov chain Monte Carlo model composition in our GUI.57
In Occam’s window, a model is discarded if its predictive performance is much worse than that of the best model (D. Madigan and Raftery 1994; Adrian E. Raftery, Madigan, and Hoeting 1997). Thus, models not belonging to \mathcal{M}'=\left\{\mathcal{M}_j:\frac{\max_m {\pi(\mathcal{M}_m|\boldsymbol{y})}}{\pi(\mathcal{M}_j|\boldsymbol{y})}\leq c\right\} should be discarded, where c is chosen by the user (D. Madigan and Raftery (1994) propose c=20). In addition, complicated models that are less supported by the data than simpler nested models are also discarded, that is, models in \mathcal{M}''=\left\{\mathcal{M}_j:\exists \mathcal{M}_m\in\mathcal{M}',\mathcal{M}_m\subset \mathcal{M}_j,\frac{\pi(\mathcal{M}_m|\boldsymbol{y})}{\pi(\mathcal{M}_j|\boldsymbol{y})}>1\right\}. Then, the set of models used in BMA is \mathcal{M}^*=\mathcal{M}'\cap \mathcal{M}''^c\subseteq\mathcal{M}. Adrian E. Raftery, Madigan, and Hoeting (1997) find that the number of models in \mathcal{M}^* is normally fewer than 25.
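The sketch below applies the two filters of Occam's window to a handful of hypothetical models, each represented by the set of regressors it includes and an invented posterior model probability; it is an illustration of the definitions above, not the implementation used in our GUI.

```python
import numpy as np

# Hypothetical models (sets of included regressors) and posterior probabilities.
models = [frozenset(), frozenset({1}), frozenset({1, 2}), frozenset({1, 2, 3})]
post = np.array([0.05, 0.50, 0.40, 0.05])
c = 20  # value proposed by D. Madigan and Raftery (1994)

# M': models whose posterior probability is within a factor c of the best model.
best = post.max()
m_prime = [j for j in range(len(models)) if best / post[j] <= c]

# M'': models containing a strictly simpler model in M' with higher posterior probability.
m_double_prime = [
    j for j in range(len(models))
    if any(models[m] < models[j] and post[m] / post[j] > 1 for m in m_prime)
]

# Occam's window keeps the models in M' that are not in M''.
keep = [j for j in m_prime if j not in m_double_prime]
print([sorted(models[j]) for j in keep])
```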
However, the previous theoretical framework requires finding the model with the maximum a posteriori model probability (\max_m {\pi(\mathcal{M}_m|\boldsymbol{y})}), which in principle implies evaluating all possible models in \mathcal{M}; this is computationally burdensome. Hence, Adrian Raftery et al. (2012) propose a heuristic approach based on ideas of D. Madigan and Raftery (1994). The search strategy is based on a series of nested comparisons of ratios of posterior model probabilities. Let \mathcal{M}_0 be a model with one regressor fewer than model \mathcal{M}_1; then:
If \log(\pi(\mathcal{M}_0|\boldsymbol{y})/\pi(\mathcal{M}_1|\boldsymbol{y}))>\log(O_R), then \mathcal{M}_1 is rejected and \mathcal{M}_0 is considered.
If \log(\pi(\mathcal{M}_0|\boldsymbol{y})/\pi(\mathcal{M}_1|\boldsymbol{y}))\leq \log(O_L), then \mathcal{M}_0 is rejected and \mathcal{M}_1 is considered.
If \log(O_L)<\log(\pi(\mathcal{M}_0|\boldsymbol{y})/\pi(\mathcal{M}_1|\boldsymbol{y}))\leq \log(O_R), \mathcal{M}_0 and \mathcal{M}_1 are considered.
Here O_R is a number specifying the maximum ratio for excluding models in Occam’s window, and O_L=1/O_R^{2} is defined by default in Adrian Raftery et al. (2012); since O_L<1, \log(O_L) is negative. The search strategy can be “up”, adding one regressor, or “down”, dropping one regressor (see D. Madigan and Raftery (1994) for details about the down and up algorithms). The leaps and bounds algorithm (Furnival and Wilson 1974) is implemented to improve the computational efficiency of this search strategy (Adrian Raftery et al. 2012). Once the set of potentially acceptable models is defined, we discard all the models that are not in \mathcal{M}', as well as the models in \mathcal{M}'', where the threshold 1 is replaced by \exp\left\{O_R\right\} because the leaps and bounds algorithm gives only an approximation to the BIC; this ensures that no good models are discarded.
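A minimal sketch of the pairwise decision rule just described, assuming O_L = 1/O_R^2 < 1 so that \log(O_L) is negative; the log posterior model probabilities passed in are hypothetical.

```python
import math

def occams_window_step(log_post_m0, log_post_m1, o_r=20.0):
    """Pairwise comparison of a smaller model M_0 against a larger model M_1
    based on their (possibly unnormalized) log posterior model probabilities."""
    o_l = 1.0 / o_r ** 2                 # default O_L = 1 / O_R^2, so log(O_L) < 0
    b = log_post_m0 - log_post_m1        # log posterior odds of M_0 versus M_1
    if b > math.log(o_r):
        return {"M_0"}                   # the larger model M_1 is rejected
    if b <= math.log(o_l):
        return {"M_1"}                   # the smaller model M_0 is rejected
    return {"M_0", "M_1"}                # both models remain under consideration

print(occams_window_step(-120.0, -124.5))  # smaller model clearly better: keep M_0
print(occams_window_step(-120.0, -119.0))  # inconclusive: keep both
```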
The second approach that we consider in our GUI to tackle the model space size issue is Markov chain Monte Carlo model composition (MC3) (David Madigan, York, and Allard 1995). In particular, we simulate a chain of models \mathcal{M}_s, s = 1, 2, \dots, S\ll 2^K, where at each step the algorithm randomly draws a candidate model \mathcal{M}_c from a neighborhood nbd(\mathcal{M}_{s-1}) of the current model \mathcal{M}_{s-1}, which consists of the current model itself and the set of models with either one variable more or one variable less (Adrian E. Raftery, Madigan, and Hoeting 1997). Therefore, there is a transition kernel in the space of models q(\mathcal{M}_{s-1}\rightarrow \mathcal{M}_c), such that q(\mathcal{M}_{s-1}\rightarrow \mathcal{M}_{c})=0 \ \forall \mathcal{M}_{c}\notin nbd(\mathcal{M}_{s-1}) and q(\mathcal{M}_{s-1}\rightarrow \mathcal{M}_{c})=\frac{1}{|nbd(\mathcal{M}_{s-1})|} \ \forall \mathcal{M}_{c}\in nbd(\mathcal{M}_{s-1}), where |nbd(\mathcal{M}_{s-1})| is the number of models in the neighborhood. This candidate model is accepted with probability
\begin{equation*} \alpha (\mathcal{M}_{s-1},\mathcal{M}_{c})=\min \bigg \{ \frac{|nbd(\mathcal{M}_{s-1})|\,p(\boldsymbol{y} | \mathcal{M}_c)\pi(\mathcal{M}_c)}{|nbd(\mathcal{M}_{c})|\,p(\boldsymbol{y}| \mathcal{M}_{s-1})\pi(\mathcal{M}_{s-1})},1 \bigg \}. \end{equation*}
Observe that by construction |nbd(\mathcal{M}_{s-1})|=|nbd(\mathcal{M}_c)|=K+1, since any of the K potential regressors can be either added to or dropped from a given model; hence, these factors cancel in the acceptance probability.
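The following sketch runs MC3 on a toy model space with K = 3 regressors. The marginal likelihoods are invented (drawn at random once) and the prior model probabilities are equal, so the acceptance ratio reduces to the ratio of marginal likelihoods; the neighborhood sizes cancel as noted above. This is an illustration of the algorithm, not the code behind our GUI.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
K = 3

# Hypothetical log marginal likelihoods log p(y | M) for each of the 2^K models,
# indexed by the tuple of inclusion indicators; equal prior model probabilities.
log_ml = {m: rng.normal(-100.0, 2.0) for m in product([0, 1], repeat=K)}

def neighborhood(model):
    """The model itself plus every model obtained by toggling one regressor."""
    nbd = [model]
    for k in range(K):
        flipped = list(model)
        flipped[k] = 1 - flipped[k]
        nbd.append(tuple(flipped))
    return nbd

# MC3: a Metropolis-Hastings random walk over the model space.
current = (0, 0, 0)
visits = {m: 0 for m in log_ml}
S = 20_000
for s in range(S):
    candidate = neighborhood(current)[rng.integers(K + 1)]
    log_alpha = log_ml[candidate] - log_ml[current]   # |nbd| and priors cancel
    if np.log(rng.uniform()) < min(0.0, log_alpha):
        current = candidate
    visits[current] += 1

# Visit frequencies approximate the posterior model probabilities.
freq = {m: round(v / S, 3) for m, v in visits.items()}
print(max(freq, key=freq.get), max(freq.values()))
```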
The Bayesian information criterion (BIC) is a possible solution for the second computational issue in BMA, that is, calculating the marginal likelihood when there is no analytic solution. Defining h(\boldsymbol{\theta}_j|\mathcal{M}_j)=-\frac{\log(p(\boldsymbol{y}| \boldsymbol{\theta}_j,\mathcal{M}_j)\pi(\boldsymbol{\theta}_j | \mathcal{M}_j))}{N}, where N is the sample size, we can write p(\boldsymbol{y} | \mathcal{M}_j)=\int_{\boldsymbol{\Theta}_j} \exp\left\{-N h(\boldsymbol{\theta}_j|\mathcal{M}_j)\right\} d\boldsymbol{\theta}_{j}. If N is sufficiently large (technically, N\to \infty), we can make the following assumptions (Hoeting et al. 1999):
- We can use the Laplace method for approximating integrals (Tierney and Kadane 1986).
- The posterior mode is reached at the same point as the maximum likelihood estimator (MLE), denoted by \hat{\boldsymbol{\theta}}_{MLE}.
We get the following result under these assumptions: \begin{align*} p(\boldsymbol{y} | \mathcal{M}_j)\approx&\left( \frac{2\pi}{N}\right)^{K_j/2}|\boldsymbol{\Sigma}|^{-1/2} \exp\left\{-N h(\boldsymbol{\hat{\theta}}_j^{MLE}|\mathcal{M}_j)\right\}, \ N\rightarrow\infty, \end{align*} where \boldsymbol{\Sigma} is the Hessian matrix of h(\boldsymbol{\theta}_j|\mathcal{M}_j) evaluated at \boldsymbol{\hat{\theta}}_j^{MLE}, and K_j=\dim\left\{\boldsymbol{\theta}_j\right\}.
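As a numerical check of the Laplace approximation, consider a toy model where the exact marginal likelihood is available in closed form: y_i \sim N(\theta,1) with prior \theta \sim N(0,1). The sketch below compares the exact log marginal likelihood with the approximation above evaluated at the MLE; the specific model and prior are chosen only because they admit an analytic answer.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 500
y = rng.normal(loc=0.3, scale=1.0, size=N)   # y_i ~ N(theta, 1), prior theta ~ N(0, 1)

# Exact log marginal likelihood of the conjugate normal-normal model.
s, ss = y.sum(), y @ y
exact = -0.5 * N * np.log(2 * np.pi) - 0.5 * np.log(N + 1) - 0.5 * (ss - s ** 2 / (N + 1))

# Laplace approximation evaluated at the MLE (the sample mean), as in the text:
# (2*pi/N)^{1/2} |Sigma|^{-1/2} exp{-N h(theta_hat)}, with Sigma the Hessian of h.
theta_hat = y.mean()
log_lik = -0.5 * N * np.log(2 * np.pi) - 0.5 * np.sum((y - theta_hat) ** 2)
log_prior = -0.5 * np.log(2 * np.pi) - 0.5 * theta_hat ** 2
hessian_h = (N + 1) / N                      # second derivative of h at theta_hat
laplace = 0.5 * np.log(2 * np.pi / N) - 0.5 * np.log(hessian_h) + log_lik + log_prior

print(exact, laplace)                        # the two values are nearly identical
```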
Taking logarithms, the Laplace approximation implies \begin{align*} \log\left(p(\boldsymbol{y} | \mathcal{M}_j)\right)\approx& \frac{K_j}{2}\log(2\pi)- \frac{K_j}{2}\log(N) -\frac{1}{2}\log(|\boldsymbol{\Sigma}|) + \log(p(\boldsymbol{y}| \boldsymbol{\hat{\theta}}_j^{MLE},\mathcal{M}_j))+\log(\pi(\boldsymbol{\hat{\theta}}_j^{MLE} | \mathcal{M}_j)), \ N\rightarrow\infty. \end{align*}
Since \frac{K_j}{2}\log(2\pi), \log(\pi(\boldsymbol{\hat{\theta}}_j^{MLE} | \mathcal{M}_j)), and \frac{1}{2}\log(|\boldsymbol{\Sigma}|) do not grow with the sample size, they can be dropped relative to the terms that do, so that \begin{align*} \log\left(p(\boldsymbol{y} | \mathcal{M}_j)\right)\approx& -\frac{K_j}{2}\log(N)+\log(p(\boldsymbol{y}| \boldsymbol{\hat{\theta}}_j^{MLE},\mathcal{M}_j))= -\frac{BIC}{2}, \ N \rightarrow \infty, \end{align*} where BIC=K_j\log(N)-2\log(p(\boldsymbol{y}| \boldsymbol{\hat{\theta}}_j^{MLE},\mathcal{M}_j)) is the Bayesian information criterion of model \mathcal{M}_j.
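The practical upshot is that one can approximate \log(p(\boldsymbol{y}|\mathcal{M}_j)) by -BIC/2 computed from a standard fit of each model. The sketch below does this for a Gaussian linear regression on synthetic data using ordinary least squares; the parameter count (coefficients plus the error variance) is one common convention, stated here as an assumption of the example.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data: y depends on x1 only; x2 is irrelevant.
n = 200
X = rng.normal(size=(n, 2))
y = 1.0 + 0.8 * X[:, 0] + rng.normal(scale=1.0, size=n)

def bic_gaussian(y, X_model):
    """BIC of a Gaussian linear regression fitted by OLS (intercept always included)."""
    n = len(y)
    Z = np.column_stack([np.ones(n), X_model]) if X_model.size else np.ones((n, 1))
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    sigma2 = resid @ resid / n                       # MLE of the error variance
    log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    k = Z.shape[1] + 1                               # coefficients plus error variance
    return k * np.log(n) - 2 * log_lik

# Approximate log p(y | M_j) by -BIC_j / 2 and normalize under equal model priors.
subsets = [np.empty((n, 0)), X[:, [0]], X[:, [1]], X]
log_ml = np.array([-0.5 * bic_gaussian(y, Z) for Z in subsets])
w = np.exp(log_ml - log_ml.max())
w /= w.sum()
print(np.round(w, 3))   # most weight should fall on the model that includes x1 only
```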
The log marginal likelihood thus asymptotically behaves like a linear transformation of the Bayesian information criterion (BIC), which significantly simplifies its calculation. In addition, the BIC is consistent: under an \mathcal{M}-closed view (J. Bernardo and Smith 1994), that is, when one of the models under consideration is the population statistical model (the data generating process), the probability of selecting that model converges to one as the sample size goes to infinity (Schwarz 1978; Burnham and Anderson 2004). Under an \mathcal{M}-completed view of nature, that is, when there is a true data generating process but the space of models being compared does not include it, the BIC asymptotically selects the model that minimizes the Kullback-Leibler (KL) divergence to the true (population) model (Claeskens and Hjort 2008).
54. Take into account that K can increase when interaction terms and/or polynomial terms of the original control variables are included.
55. We attach equal prior probabilities to each model in our GUI. However, this choice gives more prior probability to the set of models of medium size (think about the K-th row of Pascal’s triangle). An interesting alternative is to use the Beta-Binomial prior proposed by Ley and Steel (2009).
56. Scaling variables is always an important step in variable selection.
57. Variable selection (model selection or regularization) is a topic related to model uncertainty. Approaches such as stochastic search variable selection (spike and slab) (George and McCulloch 1993, 1997) and Bayesian Lasso (Park and Casella 2008) are good examples of how to tackle this issue. See Chapter 12.