1.3 Bayesian reports: Decision theory under uncertainty
The Bayesian framework allows reporting the full posterior distribution. However, some situations require reporting a specific value of the posterior distribution (a point estimate), an informative interval (a set), point or interval predictions, or a selected model. Decision theory offers an elegant framework for deciding which posterior quantities are optimal to report (Berger 2013).
The point of departure is a loss function, a non-negative real-valued function whose arguments are the unknown state of nature ($\theta\in\Theta$) and the action to be taken ($a\in A$), that is, $L(\theta,a):\Theta\times A\rightarrow\mathbb{R}^+$.
This function is a mathematical expression of the cost of making mistakes, in particular, of selecting action $a\in A$ when $\theta\in\Theta$ is the true state. In our case, the unknown state of nature can be parameters, functions of them, future or unknown realizations, models, etc.
From a Bayesian perspective, we should choose the action $a^*(y)$ that minimizes the posterior expected loss, also called the posterior risk function, $E[L(\theta,a)\mid y]$,
$$a^*(y)=\underset{a\in A}{\operatorname{argmin}}\ E[L(\theta,a)\mid y],$$
where $E[L(\theta,a)\mid y]=\int_{\Theta}L(\theta,a)\pi(\theta\mid y)\,d\theta$.¹²
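When the set of states and the set of actions are both finite, the integral in the posterior risk becomes a sum, and the optimal action can be read off a small table. The following is a minimal sketch of that case; the loss matrix, the state and action labels, and the posterior probabilities are hypothetical values chosen only for illustration.

```r
# Hypothetical finite decision problem: two states of nature, three actions.
# Rows index states, columns index actions; entries are losses L(theta, a).
loss <- matrix(c(0, 4, 1,
                 5, 0, 2),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("theta1", "theta2"), c("a1", "a2", "a3")))

# Assumed posterior probabilities of the two states given the data
post <- c(theta1 = 0.7, theta2 = 0.3)

# Posterior expected loss of each action: sum over states of L(theta, a) * pi(theta | y)
post_risk <- colSums(loss * post)
post_risk

# Bayes action: the one with the smallest posterior expected loss
best_action <- names(which.min(post_risk))
best_action
```

With these numbers the posterior expected losses are 1.5, 2.8 and 1.3, so the hedging action a3 is chosen even though it is never the best action in any single state.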
Different loss functions imply different optimal decisions. We illustrate this assuming $\theta\in\mathbb{R}$.
- The quadratic loss function, $L(\theta,a)=[\theta-a]^2$, gives the posterior mean as the optimal decision, $a^*(y)=E[\theta\mid y]$, that is,
$$E[\theta\mid y]=\underset{a\in A}{\operatorname{argmin}}\int_{\Theta}[\theta-a]^2\pi(\theta\mid y)\,d\theta.$$
To get this result, we use the first-order condition: differentiate the risk function with respect to $a$, interchange the order of differentiation and integration, and set the result equal to zero. Then $-2\int_{\Theta}[\theta-a^*]\pi(\theta\mid y)\,d\theta=0$ implies that $a^*\int_{\Theta}\pi(\theta\mid y)\,d\theta=a^*(y)=\int_{\Theta}\theta\pi(\theta\mid y)\,d\theta=E[\theta\mid y]$, that is, the posterior mean is the Bayesian optimal action. This means that we should report the posterior mean as a point estimate of $\theta$ under the quadratic loss function.
- The generalized quadratic loss function, $L(\theta,a)=w(\theta)[\theta-a]^2$, where $w(\theta)>0$ is a weighting function, gives the weighted mean as the optimal decision rule. Following the same steps as in the previous result, we get $a^*(y)=\frac{E[w(\theta)\times\theta\mid y]}{E[w(\theta)\mid y]}$. Observe that the weighted average is driven by the weighting function $w(\theta)$.
- The absolute error loss function, $L(\theta,a)=|\theta-a|$, gives the posterior median as the optimal action (exercise 5).
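- The generalized absolute error loss function,
$$L(\theta,a)=\begin{cases}K_0(\theta-a), & \theta-a\geq 0\\ K_1(a-\theta), & \theta-a<0\end{cases},\quad K_0,K_1>0,$$
implies the following risk function,
$$E[L(\theta,a)\mid y]=\int_{-\infty}^{a}K_1(a-\theta)\pi(\theta\mid y)\,d\theta+\int_{a}^{\infty}K_0(\theta-a)\pi(\theta\mid y)\,d\theta.$$
Differentiating with respect to $a$, interchanging the order of differentiation and integration, and equating to zero,
$$K_1\int_{-\infty}^{a^*}\pi(\theta\mid y)\,d\theta-K_0\int_{a^*}^{\infty}\pi(\theta\mid y)\,d\theta=0,$$
then $\int_{-\infty}^{a^*}\pi(\theta\mid y)\,d\theta=\frac{K_0}{K_0+K_1}$, that is, any $K_0/(K_0+K_1)$ quantile of $\pi(\theta\mid y)$ is an optimal Bayesian estimate of $\theta$.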
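The four results above can be checked numerically by minimizing a Monte Carlo estimate of the posterior expected loss and comparing the minimizer with the corresponding closed-form rule. The sketch below assumes a Gamma posterior with shape 3 and scale 0.5, a weighting function w(θ) = 1/θ, and K0 = 2, K1 = 1; all of these are illustrative choices, and `bayes_action` is a hypothetical helper name.

```r
set.seed(10101)
shape <- 3; scale <- 0.5                 # assumed Gamma posterior (illustrative)
theta <- rgamma(1e5, shape = shape, scale = scale)

# Minimize the Monte Carlo estimate of E[L(theta, a) | y] over the action a
bayes_action <- function(loss) {
  optimize(function(a) mean(loss(theta, a)), interval = range(theta))$minimum
}

w  <- function(th) 1 / th                # hypothetical weighting function
K0 <- 2; K1 <- 1                         # underestimation twice as costly

# Numerical minimizers of the posterior expected loss
num <- c(
  quadratic     = bayes_action(function(th, a) (th - a)^2),
  gen_quadratic = bayes_action(function(th, a) w(th) * (th - a)^2),
  absolute      = bayes_action(function(th, a) abs(th - a)),
  gen_absolute  = bayes_action(function(th, a)
    ifelse(th - a >= 0, K0 * (th - a), K1 * (a - th)))
)

# Closed-form rules evaluated on the same draws
closed <- c(
  quadratic     = mean(theta),                                # posterior mean
  gen_quadratic = mean(w(theta) * theta) / mean(w(theta)),    # weighted mean
  absolute      = median(theta),                              # posterior median
  gen_absolute  = unname(quantile(theta, K0 / (K0 + K1)))     # K0/(K0+K1) quantile
)

round(cbind(numerical = num, closed_form = closed), 3)
```

The two columns should agree up to the tolerance of `optimize`, while the four rows differ from each other, which illustrates how the loss function drives the reported point estimate.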
We can also use decision theory under uncertainty in hypothesis testing. In particular, when testing $H_0:\theta\in\Theta_0$ versus $H_1:\theta\in\Theta_1$, with $\Theta=\Theta_0\cup\Theta_1$ and $\emptyset=\Theta_0\cap\Theta_1$, there are two actions of interest, $a_0$ and $a_1$, where $a_j$ denotes not rejecting $H_j$, $j\in\{0,1\}$.
Consider the $0$-$K_j$ loss function,
$$L(\theta,a_j)=\begin{cases}0, & \theta\in\Theta_j\\ K_j, & \theta\in\Theta_i,\ j\neq i\end{cases},$$
where there is no loss if the right decision is made, for instance, not rejecting $H_0$ when $\theta\in\Theta_0$, and the loss is $K_j$ when an error is made. For instance, a type I error, rejecting the null hypothesis ($H_0$) when it is true ($\theta\in\Theta_0$), implies a loss equal to $K_1$ due to picking $a_1$, that is, not rejecting $H_1$.
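The posterior expected loss associated with decision $a_j$, that is, not rejecting $H_j$, is $E[L(\theta,a_j)\mid y]=0\times P(\Theta_j\mid y)+K_jP(\Theta_i\mid y)=K_jP(\Theta_i\mid y)$, $j\neq i$. Therefore, the Bayes optimal decision is the one that gives the smallest posterior expected loss, that is, the null hypothesis is rejected ($a_1$ is chosen, so $H_1$ is not rejected) when $K_0P(\Theta_1\mid y)>K_1P(\Theta_0\mid y)$. Given our framework ($\Theta=\Theta_0\cup\Theta_1$, $\emptyset=\Theta_0\cap\Theta_1$), $P(\Theta_0\mid y)=1-P(\Theta_1\mid y)$, and as a consequence, $H_0$ is rejected when $P(\Theta_1\mid y)>\frac{K_1}{K_1+K_0}$, that is, the rejection region of the Bayesian test is $R=\left\{y:P(\Theta_1\mid y)>\frac{K_1}{K_1+K_0}\right\}$.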
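The decision rule can be written as a small function that takes the posterior probability of the alternative and the two losses. This is a minimal sketch; `bayes_test` is a hypothetical helper, and the probabilities and losses in the calls are made-up inputs.

```r
# Bayes decision for H0 versus H1 under the 0-K_j loss.
# prob_h1: posterior probability P(Theta_1 | y)
# K0: loss of choosing a0 (not rejecting H0) when theta is in Theta_1
# K1: loss of choosing a1 (not rejecting H1) when theta is in Theta_0
bayes_test <- function(prob_h1, K0 = 1, K1 = 1) {
  risk <- c(a0 = K0 * prob_h1,          # posterior expected loss of a0
            a1 = K1 * (1 - prob_h1))    # posterior expected loss of a1
  threshold <- K1 / (K0 + K1)
  list(risk = risk, threshold = threshold, reject_h0 = prob_h1 > threshold)
}

bayes_test(0.8)$reject_h0                   # symmetric losses, threshold 0.5
bayes_test(0.8, K0 = 1, K1 = 9)$reject_h0   # costly type I error, threshold 0.9
```

The first call rejects the null hypothesis and the second does not: making a type I error nine times as costly raises the posterior probability required to reject from 0.5 to 0.9.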
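Decision theory also helps to construct interval (region) estimates. Let $\Theta_C(y)\subset\Theta$ be a credible set for $\theta$, and $L(\theta,\Theta_C(y))=1-I\{\theta\in\Theta_C(y)\}$, where
$$I\{\theta\in\Theta_C(y)\}=\begin{cases}1, & \theta\in\Theta_C(y)\\ 0, & \theta\notin\Theta_C(y)\end{cases}.$$
Then,
$$L(\theta,\Theta_C(y))=\begin{cases}0, & \theta\in\Theta_C(y)\\ 1, & \theta\notin\Theta_C(y)\end{cases},$$
that is, the 0–1 loss function is equal to zero if $\theta\in\Theta_C(y)$, and one if $\theta\notin\Theta_C(y)$. Then, the risk function is $1-P(\theta\in\Theta_C(y)\mid y)$.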
Given a measure of credibility $\alpha(y)$ that defines the level of trust that $\theta\in\Theta_C(y)$, we can measure the accuracy of the report by $L(\theta,\alpha(y))=[I\{\theta\in\Theta_C(y)\}-\alpha(y)]^2$. This loss function can be used to suggest a choice of the report $\alpha(y)$. Given that this is a quadratic loss function, the optimal action is the posterior mean, that is, $E[I\{\theta\in\Theta_C(y)\}\mid y]=P(\theta\in\Theta_C(y)\mid y)$. This probability can be calculated from the posterior distribution, that is, $P(\theta\in\Theta_C(y)\mid y)=\int_{\Theta_C(y)}\pi(\theta\mid y)\,d\theta$. This is a measure of the belief that $\theta\in\Theta_C(y)$ given the prior beliefs and sample information.
The set $\Theta_C(y)\subset\Theta$ is a $100(1-\alpha)\%$ credible set with respect to $\pi(\theta\mid y)$ if $P(\theta\in\Theta_C(y)\mid y)=\int_{\Theta_C(y)}\pi(\theta\mid y)\,d\theta=1-\alpha$.
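The credibility of a candidate set is just the posterior probability that it contains the parameter, which can be computed from the posterior distribution function or as the posterior mean of the indicator using draws. The sketch below assumes a Gamma posterior with shape 3 and scale 0.5 and a hypothetical candidate interval [0.5, 3].

```r
shape <- 3; scale <- 0.5        # assumed Gamma posterior (illustrative)
set_c <- c(0.5, 3)              # hypothetical candidate set [0.5, 3]

# Exact posterior probability of the set from the distribution function
cred_exact <- diff(pgamma(set_c, shape = shape, scale = scale))

# Monte Carlo version: posterior mean of the indicator I{theta in set}
set.seed(10101)
theta <- rgamma(1e5, shape = shape, scale = scale)
cred_mc <- mean(theta >= set_c[1] & theta <= set_c[2])

c(exact = cred_exact, monte_carlo = cred_mc)
```

The set is a 100(1−α)% credible set for whichever α makes 1−α equal to this probability.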
Two alternatives for reporting credible sets are the symmetric credible set and the highest posterior density (HPD) set. The former is based on the $100\frac{\alpha}{2}\%$ and $100\left(1-\frac{\alpha}{2}\right)\%$ percentiles of the posterior distribution, and the latter is a $100(1-\alpha)\%$ credible interval for $\theta$ with the property that it has the smallest length compared to any other $100(1-\alpha)\%$ credible interval for $\theta$ based on the posterior distribution. That is, $C(y)=\{\theta:\pi(\theta\mid y)\geq k(\alpha)\}$, where $k(\alpha)$ is the largest number such that $\int_{\theta:\pi(\theta\mid y)\geq k(\alpha)}\pi(\theta\mid y)\,d\theta=1-\alpha$. HPD sets can be a collection of disjoint intervals when working with multimodal posterior densities. In addition, they have the limitation of not necessarily being invariant under transformations.
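For a unimodal posterior, the HPD interval can be found as the shortest interval with posterior mass 1−α, by searching over how much probability is left in the lower tail. The sketch below compares it with the symmetric (equal-tailed) interval, assuming a Gamma posterior with shape 3 and scale 0.5; it is not meant to cover the multimodal case, where the HPD set may be a union of intervals.

```r
shape <- 3; scale <- 0.5; alpha <- 0.05   # assumed Gamma posterior and level

# Symmetric interval from the alpha/2 and 1 - alpha/2 quantiles
eq_tail <- qgamma(c(alpha / 2, 1 - alpha / 2), shape = shape, scale = scale)

# Length of the interval that leaves probability p in the lower tail
# and has posterior mass 1 - alpha
len <- function(p) {
  qgamma(p + 1 - alpha, shape = shape, scale = scale) -
    qgamma(p, shape = shape, scale = scale)
}

# Shortest such interval (valid as HPD only for a unimodal density)
p_low <- optimize(len, interval = c(0, alpha), tol = 1e-8)$minimum
hpd <- qgamma(c(p_low, p_low + 1 - alpha), shape = shape, scale = scale)

rbind(equal_tailed = eq_tail, hpd = hpd)
diff(hpd) < diff(eq_tail)                   # the HPD interval is shorter
dgamma(hpd, shape = shape, scale = scale)   # density is (nearly) equal at both ends
```

The last line reflects the definition of the HPD set through the cutoff k(α): both endpoints sit at approximately the same posterior density height.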
Decision theory can also be used to perform prediction (point, set, or probabilistic). Suppose that there is a loss function $L(Y_0,a)$ involving the prediction of $Y_0$. Then, the predictive expected loss is $E_{Y_0}[L(Y_0,a)]=\int_{\mathcal{Y}_0}L(Y_0,a)\pi(Y_0\mid y)\,dY_0$, where $\pi(Y_0\mid y)$ is the predictive density function. Thus, the optimal prediction is the action that minimizes this risk function for the given loss function.
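The same minimization can be carried out with draws from the predictive distribution. The sketch below assumes a Poisson model for the new observation with a Gamma posterior (shape 3, scale 0.5) for its rate, both chosen only for illustration, and checks that the action minimizing the predictive expected quadratic loss is the predictive mean.

```r
set.seed(10101)
shape <- 3; scale <- 0.5                     # assumed Gamma posterior for a Poisson rate
lambda <- rgamma(1e5, shape = shape, scale = scale)
y0 <- rpois(length(lambda), lambda)          # draws from the predictive distribution

# Monte Carlo estimate of the predictive expected quadratic loss of action a
pred_risk <- function(a) mean((y0 - a)^2)

a_star <- optimize(pred_risk, interval = range(y0))$minimum
c(minimizer = a_star, predictive_mean = mean(y0), shape_times_scale = shape * scale)
```

Under these assumptions the three numbers should be close to each other, since the predictive mean equals the posterior mean of the rate.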
Although Bayesian model averaging (BMA) allows incorporating model uncertainty in a regression framework, sometimes it is desirable to select just one model. A compelling alternative is the model with the highest posterior model probability. This model is the best alternative for prediction in the case of a 0–1 loss function (Clyde and George 2004).
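Selecting the model with the highest posterior model probability only requires the marginal likelihood and the prior probability of each candidate, since the posterior model probability is proportional to their product. The sketch below uses hypothetical log marginal likelihoods for three models and assumes equal prior model probabilities; the computation is done on the log scale to avoid numerical underflow.

```r
# Hypothetical log marginal likelihoods for three candidate models
log_ml <- c(M1 = -1012.4, M2 = -1009.8, M3 = -1015.1)
prior  <- c(M1 = 1, M2 = 1, M3 = 1) / 3    # assumed equal prior model probabilities

# Posterior model probabilities: proportional to marginal likelihood times prior
log_post <- log_ml + log(prior)
pmp <- exp(log_post - max(log_post))       # subtract the maximum for numerical stability
pmp <- pmp / sum(pmp)
round(pmp, 3)

# Optimal single model under the 0-1 loss: highest posterior model probability
best_model <- names(which.max(pmp))
best_model
```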
Example: Health insurance (continued)
We show some optimal rules in the health insurance example, in particular, the best point estimates of $\lambda$ under the quadratic, absolute, and generalized absolute loss functions. For the latter, we assume that underestimating $\lambda$ is twice as costly as overestimating it, that is, $K_0=2$ and $K_1=1$.
Taking into account that the posterior distribution of $\lambda$ is $G\left(\alpha_0+\sum_{i=1}^N y_i,\ \beta_0/(\beta_0N+1)\right)$, and using the hyperparameters from empirical Bayes, we have that $E[\lambda\mid y]=\alpha_n\beta_n=1.2$, the median is 1.19, and the 2/3 quantile is 1.26. These are the optimal point estimates under the quadratic, absolute, and generalized absolute loss functions, respectively.
In addition, we test the null hypothesis $H_0:\lambda\in[0,1)$ versus $H_1:\lambda\in[1,\infty)$. Setting $K_0=K_1=1$, we should reject the null hypothesis because $P(\lambda\in[1,\infty)\mid y)=0.90>K_1/(K_0+K_1)=0.5$.
The 95% symmetric credible interval is (0.91, 1.53), and the highest posterior density interval is (0.90, 1.51). Finally, the optimal point prediction under a quadratic loss function is 1.2, which is the mean of the posterior predictive distribution. The optimal model under a 0–1 loss function is the one using the hyperparameters from the empirical Bayes procedure, because its posterior model probability is approximately 1, whereas the posterior model probability of the model using vague hyperparameters is approximately 0.
an <- sum(y) + a0EB # Posterior shape parameter
bn <- b0EB / (N*b0EB + 1) # Posterior scale parameter
S <- 1000000 # Number of posterior draws
Draws <- rgamma(S, shape = an, scale = bn) # Posterior draws
###### Point estimation ########
OptQua <- an*bn # Mean: Optimal choice quadratic loss function
OptQua
## [1] 1.200952
OptAbs <- qgamma(0.5, shape = an, scale = bn) # Median: Optimal choice absolute loss function
OptAbs
## [1] 1.194034
# Setting K0 = 2 and K1 = 1, that is, to underestimate lambda is twice as costly as to overestimate it.
K0 <- 2; K1 <- 1
OptGenAbs <- quantile(Draws, K0/(K0 + K1)) # K0/(K0 + K1) quantile: Optimal choice generalized absolute loss function
OptGenAbs
## 66.66667%
## 1.262986
###### Hypothesis test ########
# H0: lambda in [0,1) vs H1: lambda in [1, Inf)
K0 <- 1; K1 <- 1
ProbH0 <- pgamma(1, shape = an, scale = bn)
ProbH0 # Posterior probability H0
## [1] 0.09569011
ProbH1 <- 1 - ProbH0
ProbH1 # Posterior probability H1
## [1] 0.9043099
# we should reject H0 given ProbH1 > K1 / (K0 + K1)
###### Credible intervals ########
LimInf <- qgamma(0.025, shape = an, scale = bn) # Lower bound
LimInf
## [1] 0.9114851
LimSup <- qgamma(0.975, shape = an, scale = bn) # Upper bound
LimSup
## [1] 1.529724
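HDL <- HDInterval::hdi(Draws, credMass = 0.95) # Highest posterior density interval (assumes the HDInterval package)
HDL
## lower upper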
## 0.8971505 1.5125911
## attr(,"credMass")
## [1] 0.95
###### Predictive optimal choices ########
p <- bn / (bn + 1) # Probability negative binomial density
OptPred <- p/(1-p)*an # Optimal point prediction given a quadratic loss function in prediction
OptPred
## [1] 1.200952
References
12. Chernozhukov and Hong (2003) propose Laplace-type estimators (LTE) based on the quasi-posterior, $p(\theta)=\frac{\exp\{L_n(\theta)\}\pi(\theta)}{\int_{\Theta}\exp\{L_n(\theta)\}\pi(\theta)\,d\theta}$, where $L_n(\theta)$ is not necessarily a log-likelihood function. The LTE minimizes the quasi-posterior risk.