1.3 Bayesian reports: Decision theory under uncertainty

The Bayesian framework allows reporting the full posterior distribution. However, some situations demand reporting a specific value of the posterior distribution (point estimate), an informative interval (set), point or interval predictions, and/or selecting a specific model. Decision theory offers an elegant framework for deciding which posterior values are optimal to report (Berger 2013).

The point of departure is a loss function, which is a non-negative real-valued function whose arguments are the unknown state of nature (\(\mathbf{\Theta}\)) and a set of actions to be taken (\(\mathcal{A}\)), that is, \[\begin{equation} L(\mathbf{\theta}, a):\mathbf{\Theta}\times \mathcal{A}\rightarrow R^+. \end{equation}\]

This function is a mathematical expression of the loss incurred from making mistakes; in particular, from selecting action \(a\in\mathcal{A}\) when \(\mathbf{\theta}\in\mathbf{\Theta}\) is the true state of nature. In our case, the unknown state of nature can be parameters, functions of them, future or unknown realizations, models, etc.

From a Bayesian perspective, we should choose the action (\(\delta(\mathbf{y})\)) that minimizes the posterior expected loss, which is the posterior risk function (\(\mathbb{E}[L(\mathbf{\theta}, a)|\mathbf{y}]\)),

\[\begin{equation} \delta(\mathbf{y})=\underset{a \in \mathcal{A}}{\mathrm{argmin}} \ \mathbb{E}[L(\mathbf{\theta}, a)|\mathbf{y}], \end{equation}\]

where \(\mathbb{E}[L(\mathbf{\theta}, a)|\mathbf{y}]= \int_{\mathbf{\Theta}} L(\mathbf{\theta}, a)\pi(\mathbf{\theta}|\mathbf{y})d\mathbf{\theta}\).12

Obviously, different loss functions imply different optimal decisions. We illustrate this assuming \(\theta \in R\).

  • \(L({\theta},a)=[{\theta}-a]^2\), then

\[\begin{equation} \mathbb{E}[{\theta}|\mathbf{y}] = \underset{a \in \mathcal{A}}{\mathrm{argmin}} \ \int_{{\Theta}} [{\theta}-a]^2\pi({\theta}|\mathbf{y})d{\theta}. \end{equation}\]

Using the first-order condition with respect to \(a\), and interchanging differentiation and integration, we find that the posterior mean is the Bayesian optimal action, \(\delta(\mathbf{y})=\mathbb{E}[{\theta}|\mathbf{y}]\). This means that we should report the posterior mean as the point estimate of \(\theta\) when facing the quadratic loss function.

  • \(L({\theta},a)=w({\theta})[{\theta}-a]^2\), where \(w({\theta})>0\) is a weighting function. Then, following the same steps as in the previous result, we have \(\delta(\mathbf{y})=\frac{\mathbb{E}[w({\theta})\times{\theta}|\mathbf{y}]}{\mathbb{E}[w({\theta})|\mathbf{y}]}\); that is, we should report a weighted average driven by \(w({\theta})\) when facing a generalized quadratic loss function.

  • In the case of an absolute error loss function, \(L({\theta},a)=|{\theta}-a|\), the optimal action is the posterior median (exercise 5).

  • Given the loss function,

\[\begin{equation} L(\theta,a)=\begin{Bmatrix} K_0(\theta-a), & \theta-a\geq 0\\ K_1(a-\theta), & \theta-a < 0 \end{Bmatrix}, K_0, K_1 >0, \end{equation}\]

then,

\[\begin{align} \mathbb{E}[L(\theta, a)|\mathbf{y}]&=\int_{-\infty}^a K_1(a-\theta)\pi(\theta|\mathbf{y})d\theta + \int_a^{\infty} K_0(\theta-a)\pi(\theta|\mathbf{y})d\theta. \end{align}\]

Differentiating w.r.t \(a\), and equating to zero, \[\begin{align} K_1\int_{-\infty}^a \pi(\theta|\mathbf{y})d\theta-K_0\int_a^{\infty} \pi(\theta|\mathbf{y})d\theta&=0, \end{align}\]

then, \(\int_{-\infty}^a \pi(\theta|\mathbf{y})d\theta=\frac{K_0}{K_0+K_1}\); that is, any \(K_0/(K_0+K_1)\)-percentile of \(\pi(\theta|\mathbf{y})\) is an optimal Bayesian estimate of \(\theta\) (see the numerical sketch after this list).
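
These closed-form results can be checked numerically. The following sketch approximates the posterior expected loss by Monte Carlo and minimizes it over a grid of candidate actions; the Gamma(3, 2) posterior, the weighting function, and the constants are purely illustrative assumptions, not part of the text.

```python
import numpy as np
from scipy import stats

# Illustrative posterior: theta | y ~ Gamma(shape = 3, scale = 2) (hypothetical choice)
posterior = stats.gamma(a=3, scale=2)
draws = posterior.rvs(size=50_000, random_state=0)

# Grid of candidate actions over the support of the draws
actions = np.linspace(draws.min(), draws.max(), 500)

def bayes_action(loss):
    """Minimize the Monte Carlo estimate of E[L(theta, a) | y] over the grid of actions."""
    risks = np.array([loss(draws, a).mean() for a in actions])
    return actions[risks.argmin()]

# Quadratic loss -> posterior mean
print(bayes_action(lambda t, a: (t - a) ** 2), draws.mean())

# Generalized quadratic loss with w(theta) = theta -> E[w * theta | y] / E[w | y]
print(bayes_action(lambda t, a: t * (t - a) ** 2), np.mean(draws**2) / np.mean(draws))

# Absolute error loss -> posterior median
print(bayes_action(lambda t, a: np.abs(t - a)), np.median(draws))

# Asymmetric linear loss with K0 = 2, K1 = 1 -> K0 / (K0 + K1) quantile
K0, K1 = 2.0, 1.0
print(bayes_action(lambda t, a: np.where(t - a >= 0, K0 * (t - a), K1 * (a - t))),
      np.quantile(draws, K0 / (K0 + K1)))
```

Each pair of printed values should agree up to the grid resolution and Monte Carlo error.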

We can also use decision theory under uncertainty in hypothesis testing. In particular, when testing \(H_0:\theta\in\Theta_0\) versus \(H_1:\theta\in\Theta_1\), with \(\Theta=\Theta_0 \cup \Theta_1\) and \(\Theta_0 \cap \Theta_1=\emptyset\), there are two actions of interest, \(a_0\) and \(a_1\), where \(a_j\) denotes not rejecting \(H_j\), \(j\in\left\{0,1\right\}\). Given the loss function,

\[\begin{equation} L(\theta,a_j)=\begin{Bmatrix} 0, & \theta\in\Theta_j\\ K_j, & \theta\in\Theta_i, i\neq j \end{Bmatrix}. \end{equation}\]

The posterior expected loss associated with \(a_j\) is \(K_jP(\Theta_i|\mathbf{y})\), \(i\neq j\). Therefore, the Bayes optimal decision is the one with the smallest posterior expected loss; that is, the null hypothesis is rejected (action \(a_1\) is taken) when \(K_0P(\Theta_1|\mathbf{y}) > K_1P(\Theta_0|\mathbf{y})\). Given our framework \((\Theta=\Theta_0 \cup \Theta_1, \Theta_0 \cap \Theta_1=\emptyset)\), \(P(\Theta_0|\mathbf{y})=1-P(\Theta_1|\mathbf{y})\), so this condition becomes \(P(\Theta_1|\mathbf{y})>\frac{K_1}{K_1+K_0}\); that is, the rejection region of the Bayesian test is \(R=\left\{\mathbf{y}:P(\Theta_1|\mathbf{y})>\frac{K_1}{K_1+K_0}\right\}\).
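
As a small numerical illustration (assuming, hypothetically, a normal posterior for \(\theta\), the hypotheses \(H_0:\theta\leq 0\) versus \(H_1:\theta> 0\), and arbitrary values of \(K_0\) and \(K_1\)), the decision rule can be evaluated as follows:

```python
from scipy import stats

# Illustrative posterior theta | y ~ N(0.8, 0.5^2); H0: theta <= 0 versus H1: theta > 0
posterior = stats.norm(loc=0.8, scale=0.5)
p_theta1 = 1 - posterior.cdf(0.0)        # P(Theta_1 | y)

# K0: loss of wrongly not rejecting H0; K1: loss of wrongly rejecting H0 (arbitrary values)
K0, K1 = 1.0, 3.0

# Reject H0 (take action a1) when P(Theta_1 | y) > K1 / (K0 + K1)
reject_h0 = p_theta1 > K1 / (K0 + K1)
print(p_theta1, K1 / (K0 + K1), reject_h0)
```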

Decision theory also helps to construct interval (region) estimates. Let \(\Theta_{C(\mathbf{y})}\subset \Theta\) be a credible set for \(\theta\), and \(L(\theta,\Theta_{C(\mathbf{y})})=1-\mathbb{I}\left\{\theta\in \Theta_{C(\mathbf{y})}\right\}\), where

\[\begin{equation} \mathbb{I}\left\{\theta\in \Theta_{C(\mathbf{y})}\right\}=\begin{Bmatrix}1, & \theta\in \Theta_{C(\mathbf{y})}\\ 0, & \theta\notin \Theta_{C(\mathbf{y})} \end{Bmatrix}. \end{equation}\]

Then,

\[\begin{equation} L(\theta,\Theta_{C(\mathbf{y})})=\begin{Bmatrix}0, & \theta\in \Theta_{C(\mathbf{y})}\\ 1, & \theta\notin \Theta_{C(\mathbf{y})} \end{Bmatrix}. \end{equation}\]

Then, the posterior expected loss is \(1-P(\theta\in \Theta_{C(\mathbf{y})}|\mathbf{y})\), so minimizing it amounts to maximizing the posterior probability that \(\theta\) lies in \(\Theta_{C(\mathbf{y})}\).

Given a measure of credibility (\(\alpha(\mathbf{y})\)) that defines the level of trust that \(\theta\in \Theta_{C(\mathbf{y})}\), we can measure the accuracy of the report by the loss function \(L(\theta, \alpha(\mathbf{y}))=[\mathbb{I}\left\{\theta\in \Theta_{C(\mathbf{y})}\right\}-\alpha(\mathbf{y})]^2\). This loss function can be used to suggest a choice of the report \(\alpha(\mathbf{y})\). Given that this is a quadratic loss function, the optimal action is the posterior mean, that is, \(\mathbb{E}[\mathbb{I}\left\{\theta\in \Theta_{C(\mathbf{y})}\right\}|\mathbf{y}]=P(\theta\in \Theta_{C(\mathbf{y})}|\mathbf{y})\). This probability can be calculated from the posterior distribution, \(P(\theta\in \Theta_{C(\mathbf{y})}|\mathbf{y})=\int_{\Theta_{C(\mathbf{y})}}\pi(\theta|\mathbf{y})d\theta\), and it measures the belief that \(\theta\in \Theta_{C(\mathbf{y})}\) given the prior beliefs and sample information. The set \(\Theta_{C(\mathbf{y})}\subset\Theta\) is a \(100(1-\alpha)\%\) credible set with respect to \(\pi(\theta|\mathbf{y})\) if \(P(\theta\in \Theta_{C(\mathbf{y})}|\mathbf{y})=\int_{\Theta_{C(\mathbf{y})}}\pi(\theta|\mathbf{y})d\theta=1-\alpha\).
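
When the posterior distribution is available in closed form, \(P(\theta\in \Theta_{C(\mathbf{y})}|\mathbf{y})\) for an interval set is just a difference of distribution function values. A minimal sketch, with an illustrative normal posterior and an arbitrary interval:

```python
from scipy import stats

# Illustrative posterior theta | y ~ N(1.0, 0.4^2) and a candidate interval set (arbitrary)
posterior = stats.norm(loc=1.0, scale=0.4)
lower, upper = 0.5, 1.6

# P(theta in [lower, upper] | y): the optimal report alpha(y) under the quadratic loss above
credibility = posterior.cdf(upper) - posterior.cdf(lower)
print(credibility)
```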

Two alternatives for reporting credible sets are the symmetric credible set and the highest posterior density (HPD) set. The former is based on the \(100(\alpha/2)\%\) and \(100(1-\alpha/2)\%\) percentiles of the posterior distribution, while the latter is a \(100(1-\alpha)\%\) credible interval for \(\theta\) with the property that it is smaller (shorter) than any other \(100(1-\alpha)\%\) credible interval for \(\theta\). That is, \(C(\mathbf{y})=\left\{\theta:\pi(\theta|\mathbf{y})\geq k(\alpha)\right\}\), where \(k(\alpha)\) is the largest number such that \(\int_{\theta:\pi(\theta|\mathbf{y})\geq k(\alpha)}\pi(\theta|\mathbf{y})d\theta=1-\alpha\). HPD sets can be collections of disjoint intervals when working with multimodal posterior densities. In addition, they have the limitation of not necessarily being invariant under transformations.
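
The following sketch contrasts the two reports for a skewed, unimodal posterior (an illustrative Gamma(2, 1) distribution). The HPD interval is found by scanning intervals that contain \(1-\alpha\) posterior mass and keeping the shortest one, which is valid for unimodal densities; the specific posterior and grid size are assumptions of the example.

```python
import numpy as np
from scipy import stats

alpha = 0.05
posterior = stats.gamma(a=2, scale=1)     # skewed, unimodal illustrative posterior

# Symmetric (equal-tailed) 95% credible interval
symmetric = posterior.ppf([alpha / 2, 1 - alpha / 2])

# HPD interval: among all intervals [q(p), q(p + 1 - alpha)], keep the shortest
p_grid = np.linspace(0, alpha, 1_000, endpoint=False)
lowers = posterior.ppf(p_grid)
uppers = posterior.ppf(p_grid + 1 - alpha)
shortest = np.argmin(uppers - lowers)
hpd = np.array([lowers[shortest], uppers[shortest]])

print(symmetric, symmetric[1] - symmetric[0])   # equal-tailed interval and its length
print(hpd, hpd[1] - hpd[0])                     # the HPD interval is shorter
```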

Decision theory can also be used to perform prediction (point, set, or probabilistic). Suppose that one has a loss function \(L(Y_0,a)\) involving the prediction of \(Y_0\). Then, \(L(\theta,a)=\mathbb{E}_{\theta}^{Y_0}[L(Y_0,a)]=\int_{\mathcal{Y}_0}L(y_0,a)g(y_0|\theta)dy_0\), where \(g(y_0|\theta)\) is the density function of \(Y_0\). Predictive exercises can also be based directly on the predictive density \(\pi(Y_0|\mathbf{y})\). In this case, the predictive density can be used to obtain a point prediction given a loss function \(L(Y_0,y_0)\), where \(y_0\) is a point prediction for \(Y_0\): we seek the \(y_0\) that minimizes the expectation of the loss function with respect to \(\pi(Y_0|\mathbf{y})\).
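
A minimal sketch of point prediction from the predictive density, assuming an illustrative setup in which \(\pi(\theta|\mathbf{y})\) is normal and \(Y_0|\theta\sim N(\theta,1)\), so that draws from \(\pi(Y_0|\mathbf{y})\) are obtained by composition; the distributions and grid are hypothetical choices for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative setup: draws from pi(theta | y) ~ N(2, 0.3^2), and Y0 | theta ~ N(theta, 1),
# so composition gives draws from the predictive density pi(Y0 | y)
theta_draws = rng.normal(loc=2.0, scale=0.3, size=50_000)
y0_draws = rng.normal(loc=theta_draws, scale=1.0)

# Point prediction: minimize the Monte Carlo estimate of E[L(Y0, y0) | y] over a grid
candidates = np.linspace(y0_draws.min(), y0_draws.max(), 500)
risk_sq = [np.mean((y0_draws - c) ** 2) for c in candidates]
risk_abs = [np.mean(np.abs(y0_draws - c)) for c in candidates]

print(candidates[np.argmin(risk_sq)], y0_draws.mean())       # quadratic loss -> predictive mean
print(candidates[np.argmin(risk_abs)], np.median(y0_draws))  # absolute loss -> predictive median
```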

Although BMA allows incorporating model uncertainty in a regression framework, sometimes it is desirable to select just one model. A compelling alternative is the model with the highest posterior model probability. This model is the best alternative for prediction under a 0–1 loss function (Clyde and George 2004).

References

Berger, James O. 2013. Statistical Decision Theory and Bayesian Analysis. Springer Science & Business Media.
Chernozhukov, V., and H. Hong. 2003. “An MCMC Approach to Classical Estimation.” Journal of Econometrics 115: 293–346.
Clyde, M., and E. George. 2004. “Model Uncertainty.” Statistical Science 19 (1): 81–94.

  1. Chernozhukov and Hong (2003) propose Laplace-type estimators (LTE) based on the quasi-posterior, \(p(\mathbf{\theta})=\frac{\exp\left\{L_n(\mathbf{\theta})\right\}\pi(\mathbf{\theta})}{\int_{\mathbf{\Theta}}\exp\left\{L_n(\mathbf{\theta})\right\}\pi(\mathbf{\theta})d\mathbf{\theta}}\), where \(L_n(\mathbf{\theta})\) is not necessarily a log-likelihood function. The LTE minimizes the quasi-posterior risk.↩︎