2.3 Estimation, hypothesis testing and prediction
All that is required to perform estimation, hypothesis testing (model selection) and prediction in the Bayesian approach is to apply Bayes’ rule. This implies coherence under a probabilistic view. But there is no free lunch: coherence reduces flexibility. On the other hand, the Frequentist approach may not be coherent from a probabilistic point of view, but it is very flexible. This approach can be seen as a tool kit that offers inferential solutions under the umbrella of understanding probability as relative frequency. For instance, a point estimator in the Frequentist approach is chosen so that it satisfies good sampling properties such as unbiasedness, efficiency, or a large-sample property such as consistency.
A remarkable difference is that optimal Bayesian decisions are calculated by minimizing the expected value of the loss function with respect to the posterior distribution, that is, they are conditional on the observed data. On the other hand, Frequentist “optimal” actions are based on expected values over the distribution of the estimator (a function of the data) conditional on the unknown parameters, that is, they account for sampling variability.
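As a minimal sketch of what minimizing posterior expected loss looks like in practice, the following Python code approximates a posterior on a grid and searches for the action with the smallest expected loss. The Bernoulli data (7 successes, 3 failures), the uniform Beta(1, 1) prior, the grid size, and the function names `beta_posterior_grid` and `bayes_action` are all illustrative assumptions, not part of the text.

```python
def beta_posterior_grid(successes, failures, a=1.0, b=1.0, size=501):
    """Grid approximation to a Beta(a + successes, b + failures) posterior."""
    grid = [i / (size - 1) for i in range(size)]
    # Unnormalized Beta density evaluated at each grid point.
    weights = [t ** (a + successes - 1) * (1 - t) ** (b + failures - 1) for t in grid]
    total = sum(weights)
    return grid, [w / total for w in weights]


def bayes_action(grid, probs, loss):
    """Return the grid point that minimizes posterior expected loss."""
    def expected_loss(action):
        return sum(p * loss(theta, action) for theta, p in zip(grid, probs))
    return min(grid, key=expected_loss)


# Hypothetical data: 7 successes and 3 failures, uniform prior (assumption).
grid, probs = beta_posterior_grid(successes=7, failures=3)
posterior_mean = sum(t * p for t, p in zip(grid, probs))

# Quadratic loss leads to the posterior mean, absolute loss to the posterior median.
quadratic_action = bayes_action(grid, probs, lambda theta, a: (theta - a) ** 2)
absolute_action = bayes_action(grid, probs, lambda theta, a: abs(theta - a))
```

The search conditions only on the observed data through the posterior weights. No averaging over hypothetical samples is involved, which is the contrast with the Frequentist risk calculation described above.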
The Bayesian approach allows us to obtain the posterior distribution of any unknown object, such as parameters, latent variables, future or unobserved variables, or models. A nice advantage is that prediction can take estimation error into account, and predictive distributions (probabilistic forecasts) can be easily recovered.
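To illustrate how a predictive distribution absorbs estimation error, the sketch below compares a Beta-binomial posterior predictive distribution with a plug-in binomial forecast that treats the point estimate as if it were the true parameter. The data (7 successes in 10 trials), the forecast horizon of 10 trials, and the uniform prior are assumptions chosen only for illustration.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def posterior_predictive_pmf(k, m, a_post, b_post):
    """Beta-binomial probability of k successes in m future trials."""
    log_ratio = log_beta(a_post + k, b_post + m - k) - log_beta(a_post, b_post)
    return math.comb(m, k) * math.exp(log_ratio)


def plug_in_pmf(k, m, theta_hat):
    """Binomial probability that ignores uncertainty about theta."""
    return math.comb(m, k) * theta_hat ** k * (1 - theta_hat) ** (m - k)


def variance(pmf_values):
    """Variance of a distribution on 0, 1, ..., len(pmf_values) - 1."""
    mean = sum(k * p for k, p in enumerate(pmf_values))
    return sum((k - mean) ** 2 * p for k, p in enumerate(pmf_values))


# Hypothetical data and a uniform Beta(1, 1) prior (assumptions).
s, n, m = 7, 10, 10
a_post, b_post = 1 + s, 1 + n - s

predictive = [posterior_predictive_pmf(k, m, a_post, b_post) for k in range(m + 1)]
plug_in = [plug_in_pmf(k, m, s / n) for k in range(m + 1)]

predictive_variance = variance(predictive)
plug_in_variance = variance(plug_in)
```

In this setup the predictive distribution is more dispersed than the plug-in forecast, because it averages the binomial likelihood over the posterior of the success probability instead of fixing it at a single value.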
Hypothesis testing (model selection) is based on inductive logic reasoning (inverse probability): on the basis of what we see, we evaluate which hypothesis is most tenable. It is performed using posterior odds, which in turn are based on Bayes factors. Bayes factors evaluate evidence in favor of a null hypothesis while explicitly taking the alternative into account (R. E. Kass and Raftery 1995), following the rules of probability (D. V. Lindley 2000), comparing how well each hypothesis predicts the data (Goodman 1999), minimizing the weighted sum of type I and type II error probabilities (DeGroot 1975; Pericchi and Pereira 2015), and taking the implicit balance of losses into account (Jeffreys 1961; Bernardo and Smith 1994). Posterior odds allow us to use the same framework to analyze nested and non-nested models and to perform model averaging. However, Bayes factors cannot be based on improper or vague priors (Koop 2003), the practical interplay between model selection and posterior distributions is not as easy as it may be in the Frequentist approach, and the computational burden can be heavier because potentially difficult integrals must be solved.
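A small worked example may help fix ideas about Bayes factors and posterior odds. The sketch below compares a point null \(H_0: \theta=0.5\) with an alternative \(H_1: \theta\sim Beta(1,1)\) for binomial data. This is one of the few cases where the marginal likelihood integral has a closed form. The data (7 successes in 10 trials), the proper uniform prior under \(H_1\), the equal prior odds, and the function names are illustrative assumptions.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def marginal_likelihood_point(s, n, theta0):
    """Marginal likelihood when theta is fixed at theta0."""
    return math.comb(n, s) * theta0 ** s * (1 - theta0) ** (n - s)


def marginal_likelihood_beta(s, n, a, b):
    """Marginal likelihood when theta has a proper Beta(a, b) prior."""
    return math.comb(n, s) * math.exp(log_beta(a + s, b + n - s) - log_beta(a, b))


def posterior_prob_null(bf01, prior_odds=1.0):
    """Posterior probability of H0 from the Bayes factor and the prior odds."""
    posterior_odds = bf01 * prior_odds
    return posterior_odds / (1 + posterior_odds)


# Hypothetical data: 7 successes in 10 trials.
s, n = 7, 10
bf01 = marginal_likelihood_point(s, n, 0.5) / marginal_likelihood_beta(s, n, 1.0, 1.0)
prob_h0 = posterior_prob_null(bf01)  # equal prior odds (assumption)
```

With these numbers the Bayes factor of \(H_0\) to \(H_1\) is about 1.29, so the data barely discriminate between the two hypotheses. The prior under \(H_1\) must be proper here: an improper prior would leave the marginal likelihood, and hence the Bayes factor, defined only up to an arbitrary constant.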
On the other hand, the Frequentist approach establishes most of its estimators as the solution of a system of equations. Observe that optimization problems reduce to solving systems of equations (the first-order conditions). We can potentially get the distribution of these estimators, but most of the time asymptotic arguments or resampling techniques are needed. Hypothesis testing requires pivotal quantities and/or resampling, and prediction is most of the time based on a plug-in approach, which means not taking estimation error into account.15 In addition, ancillary statistics can be used to build prediction intervals.16 Comparing models depends on their structure; for instance, there are different Frequentist statistical approaches to compare nested and non-nested models. A nice feature in some situations is that there is a practical interplay between hypothesis testing and confidence intervals. For instance, in the normal population mean hypothesis framework you cannot reject at the \(\alpha\) significance level (Type I error) any null hypothesis \(H_0: \ \mu=\mu^0\) if \(\mu^0\) is in the \(1-\alpha\) confidence interval \(P(\mu\in[\hat{\mu}-|t_{N-1}^{\alpha/2}|\times\hat\sigma_{\hat{\mu}},\hat{\mu}+|t_{N-1}^{\alpha/2}|\times \hat\sigma_{\hat{\mu}}])=1-\alpha\), where \(\hat{\mu}\) and \(\hat\sigma_{\hat{\mu}}\) are the maximum likelihood estimators of the mean and standard error, \(t_{N-1}^{\alpha/2}\) is the quantile of the Student’s t distribution at probability \(\alpha/2\) with \(N-1\) degrees of freedom, and \(N\) is the sample size.
A remarkable difference between the Bayesian and the Frequentist inferential frameworks is the interpretation of credible/confidence intervals. Observe that once we have estimates, such that, for example, the previous interval is \([0.2, 0.4]\) at a 95% confidence level, we cannot say that \(P(\mu\in [0.2, 0.4])=0.95\) in the Frequentist framework. In fact, this probability is 0 or 1 under this approach, as \(\mu\) is either in the interval or not; the problem is that we will never know which in applied settings. This is because \(P(\mu\in[\hat{\mu}-|t_{N-1}^{0.025}|\times \hat\sigma_{\hat{\mu}},\hat{\mu}+|t_{N-1}^{0.025}|\times \hat\sigma_{\hat{\mu}}])=0.95\) holds in the sense of repeated sampling. On the other hand, once we have the posterior distribution, we can say that \(P(\mu\in [0.2, 0.4])=0.95\) under the Bayesian framework.
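The repeated-sampling meaning of a confidence level can be checked with a short simulation. The sketch below draws many samples from a normal population, builds an interval from each one, and records how often the intervals contain the true mean. To stay within the Python standard library it assumes that the population standard deviation is known, so a normal quantile replaces the Student's t quantile used in the text. The true mean 0.3, the sample size, the number of replications, and the seed are arbitrary illustrative choices.

```python
import random
from statistics import NormalDist, fmean


def z_interval(sample, sigma, level=0.95):
    """Confidence interval for the mean when sigma is known (assumption)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    center = fmean(sample)
    half_width = z * sigma / len(sample) ** 0.5
    return center - half_width, center + half_width


def coverage(mu, sigma, n, replications, level=0.95, seed=0):
    """Share of simulated intervals that contain the true mean mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(replications):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        lower, upper = z_interval(sample, sigma, level)
        hits += lower <= mu <= upper
    return hits / replications


estimated_coverage = coverage(mu=0.3, sigma=1.0, n=20, replications=4000)
```

Each individual interval either contains 0.3 or it does not. The 95% figure describes the procedure across replications, which is why it should come out close to 0.95 here, and why it cannot be attached to any single realized interval in the Frequentist framework.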
Following common practice, most researchers and practitioners perform hypothesis testing based on the p-value in the Frequentist framework. But what is a p–value? Most users do not know the answer, because statistical inference is often not performed by statisticians (J. Berger 2006).17 A p–value is the probability of obtaining a statistical summary of the data equal to or “more extreme” than what was actually observed, assuming that the null hypothesis is true.
Therefore, p–value calculations involve not just the observed data, but also more “extreme” hypothetical observations. So,
“What the use of p implies, therefore, is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred.” (Jeffreys 1961)
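The definition of the p–value given above can be made concrete with an exact calculation. The following sketch computes a one-sided p–value for binomial data and separates the probability of the observed outcome from the probability of the more extreme outcomes that were never observed. The data (7 successes in 10 trials), the null value 0.5, and the one-sided alternative are assumptions for illustration only.

```python
import math


def binomial_pmf(k, n, theta):
    """Probability of k successes in n trials with success probability theta."""
    return math.comb(n, k) * theta ** k * (1 - theta) ** (n - k)


def one_sided_p_value(observed, n, theta0):
    """P(X >= observed) computed under the null hypothesis theta = theta0."""
    return sum(binomial_pmf(k, n, theta0) for k in range(observed, n + 1))


# Hypothetical data: 7 successes in 10 trials, null value 0.5.
observed, n, theta0 = 7, 10, 0.5
p_value = one_sided_p_value(observed, n, theta0)

# Split the p-value into the observed outcome and the unobserved, more extreme ones.
prob_observed = binomial_pmf(observed, n, theta0)
prob_more_extreme = p_value - prob_observed
```

Here the p–value is 176/1024, of which 56/1024 comes from the outcomes 8, 9 and 10 that did not occur. This is the feature that the quotation from Jeffreys objects to.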
It seems that common Frequentist inferential practice intertwines two different logical arguments: the p–value (Fisher 1958) and the significance level (Neyman and Pearson 1933). The former is an informal short–run criterion, whose philosophical foundation is reduction to absurdity, and which measures the discrepancy between the data and the null hypothesis. So, the p–value is not a direct measure of the probability that the null hypothesis is false. The latter, whose philosophical foundation is deduction, is based on long–run performance: it controls the overall number of incorrect inferences under repeated sampling, without regard to individual cases. The p–value fallacy consists in interpreting the p–value as the strength of evidence against the null hypothesis while simultaneously using it as the frequency of type I error under the null hypothesis (Goodman 1999).
The American Statistical Association (ASA) has several concerns regarding the use of the p–value as a cornerstone of hypothesis testing in science. These concerns motivated the ASA’s statement on p–values (Wasserstein and Lazar 2016), which can be summarized in the following principles:
“P–values can indicate how incompatible the data are with a specified statistical model.”
“P–values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.”
“Scientific conclusions and business or policy decisions should not be based only on whether a p–value passes a specific threshold.”
“Proper inference requires full reporting and transparency.”
“A p–value, or statistical significance, does not measure the size of an effect or the importance of a result.”
“By itself, a p–value does not provide a good measure of evidence regarding a model or hypothesis.”
To sum up, Fisher proposed the p-value as a witness rather than a judge. So, a p-value lower than the significance level calls for closer inspection of the null hypothesis, but it is not a final verdict on it.
Another difference between Frequentists and Bayesians is the way scientific hypotheses are tested. The former use the p-value, whereas the latter use the Bayes factor. Observe that the p–value is associated with the probability of the data given the hypothesis, whereas the Bayes factor is associated with the probability of the hypothesis given the data. However, there is an approximate link between the \(t\) statistic and the Bayes factor for regression coefficients (A. Raftery 1995). In particular, \(|t|>(\log(N)+6)^{1/2}\) corresponds to strong evidence in favor of rejecting the irrelevance of a control in a regression. Observe that in this setting the threshold of the \(t\) statistic, and as a consequence the significance level, depends on the sample size. This agrees with the idea in experimental design of selecting the sample size so that we control Type I and Type II errors. In observational studies we cannot control the sample size, but we can select the significance level.
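The dependence of the threshold on the sample size is easy to tabulate. The sketch below evaluates \((\log(N)+6)^{1/2}\) for a few sample sizes and converts each threshold into an implied two-sided significance level. The conversion uses a standard normal approximation to the \(t\) statistic, which is an assumption made here for simplicity and not part of the rule itself. The function names and the sample sizes are illustrative.

```python
import math
from statistics import NormalDist


def raftery_t_threshold(n):
    """Threshold (log(N) + 6)^(1/2) on |t| for strong evidence, as stated in the text."""
    return math.sqrt(math.log(n) + 6)


def implied_two_sided_level(n):
    """Two-sided tail probability beyond the threshold.

    Assumption: the t statistic is approximated by a standard normal.
    """
    return 2 * (1 - NormalDist().cdf(raftery_t_threshold(n)))


for n in (30, 100, 1000, 100000):
    print(n, round(raftery_t_threshold(n), 3), round(implied_two_sided_level(n), 5))
```

The threshold grows with \(N\), so the implied significance level shrinks as the sample gets larger. All of the thresholds lie well above the conventional value of about 1.96 associated with a fixed 5% level.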
See also (Sellke, Bayarri, and Berger 2001) and (Benjamin et al. 2018) for nice exercises that reveal potential flaws of the p–value (\(p\)) due to \(p\sim U[0,1]\) under the null hypothesis,18 and for calibrations of the p-value that make it interpretable as an odds ratio and as an error probability. In particular, \(B(p)=-e\times p\times\log(p)\) when \(p<e^{-1}\) can be interpreted as the Bayes factor of \(H_0\) to \(H_1\), where \(H_1\) denotes the unspecified alternative to \(H_0\), and \(\alpha(p)=(1+[-e\times p\times \log(p)]^{-1})^{-1}\) as the error probability \(\alpha\) in rejecting \(H_0\). Take into account that \(B(p)\) and \(\alpha(p)\) are lower bounds.
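The two calibrations are simple enough to evaluate directly. The sketch below implements \(B(p)\) and \(\alpha(p)\) exactly as written above and applies them to a few conventional p–values. The function names and the choice to raise an error outside \(0<p<e^{-1}\) are illustrative assumptions.

```python
import math


def bayes_factor_bound(p):
    """Lower bound B(p) = -e * p * log(p) on the Bayes factor of H0 to H1."""
    if not 0 < p < math.exp(-1):
        raise ValueError("the calibration requires 0 < p < 1/e")
    return -math.e * p * math.log(p)


def error_probability_bound(p):
    """Lower bound alpha(p) = (1 + B(p)^(-1))^(-1) on the error probability."""
    b = bayes_factor_bound(p)
    return 1 / (1 + 1 / b)


for p in (0.05, 0.01, 0.005):
    print(p, round(bayes_factor_bound(p), 3), round(error_probability_bound(p), 3))
```

For \(p=0.05\) the bounds are roughly 0.41 for the Bayes factor and 0.29 for the error probability. Read this way, a result that is just significant at the 5% level amounts to odds against \(H_0\) of at most about 2.5 to 1.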
The logic of argumentation in the Frequentist approach is deductive: it starts from a statement about the true state of nature (the null hypothesis) and predicts what should be seen if this statement were true. On the other hand, the Bayesian approach is based on inductive logic: it determines which hypothesis is more consistent with what is seen. The former inferential approach establishes that the truth of the premises implies the truth of the conclusion, which is why we reject or do not reject hypotheses. The latter establishes that the premises supply some evidence, but not full assurance, of the truth of the conclusion, which is why we get probabilistic statements.
Here, there is a difference between effects of causes (forward causal inference) and causes of effects (reverse causal inference) (Andrew Gelman and Imbens 2013; Dawid, Musio, and Fienberg 2016). To illustrate this point, imagine that a firm increases the price of a specific good; economic theory would then say that its demand decreases. The premise (null hypothesis) is a price increase, and the consequence is a demand reduction. Another view would be to observe a demand reduction and try to identify which cause is most tenable. For instance, demand reduction can be caused by any positive supply shocks or any negative demand shocks. The Frequentist logic takes the first view, whereas Bayesian reasoning gives the probability associated with each possible cause.