2.3 Estimation, hypothesis testing and prediction

All that is required to perform estimation, hypothesis testing (model selection) and prediction in the Bayesian approach is to apply Bayes’ rule. This implies coherence from a probabilistic point of view. But there is no free lunch: coherence reduces flexibility. The Frequentist approach, on the other hand, may not be coherent from a probabilistic point of view, but it is very flexible. It can be seen as a tool kit that offers inferential solutions under the umbrella of understanding probability as relative frequency. For instance, a point estimator in the Frequentist approach is chosen so that it satisfies good sampling properties such as unbiasedness or efficiency, or a large sample property such as consistency.

A remarkable difference is that optimal Bayesian decisions are calculated by minimizing the expected value of the loss function with respect to the posterior distribution, that is, the expectation is conditional on the observed data. Frequentist “optimal” actions, on the other hand, are based on expected values over the distribution of the estimator (a function of the data) conditional on the unknown parameters, that is, they account for sampling variability.
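As a minimal sketch of a Bayesian decision, consider estimating a Bernoulli proportion under quadratic loss. The data (7 successes in 20 trials), the uniform Beta(1, 1) prior, and all function and variable names are assumptions made for illustration. The code approximates the posterior expected loss on a grid and checks that its minimizer is close to the posterior mean, which is the known Bayes action under quadratic loss.

```python
from math import exp, lgamma, log


def beta_pdf(theta, a, b):
    """Density of a Beta(a, b) distribution at theta in (0, 1)."""
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)
    return exp(log_norm + (a - 1) * log(theta) + (b - 1) * log(1 - theta))


# Hypothetical data: 7 successes in 20 Bernoulli trials, uniform Beta(1, 1) prior.
successes, trials = 7, 20
a_post, b_post = 1 + successes, 1 + trials - successes  # Beta(8, 14) posterior

# Midpoint grid on (0, 1) with approximate posterior probability weights.
n_grid = 1000
thetas = [(i + 0.5) / n_grid for i in range(n_grid)]
weights = [beta_pdf(t, a_post, b_post) / n_grid for t in thetas]


def posterior_expected_loss(action):
    """Approximate E[(theta - action)^2 | data] on the grid."""
    return sum(w * (t - action) ** 2 for t, w in zip(thetas, weights))


# Bayes action: the candidate that minimizes the posterior expected loss.
candidates = [i / 1000 for i in range(1, 1000)]
bayes_action = min(candidates, key=posterior_expected_loss)
posterior_mean = a_post / (a_post + b_post)

print(f"grid minimizer: {bayes_action:.3f}, posterior mean: {posterior_mean:.3f}")
```

Note that the expectation is taken over the parameter given the observed data; no hypothetical repeated samples enter the calculation.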

The Bayesian approach allows us to obtain the posterior distribution of any unknown object, such as parameters, latent variables, future or unobserved variables, or models. A nice advantage is that prediction can take estimation error into account, and predictive distributions (probabilistic forecasts) can be easily recovered. Hypothesis testing (model selection) is based on inductive reasoning (inverse probability): on the basis of what we see, we evaluate which hypothesis is most tenable. It is performed using posterior odds, which in turn are based on Bayes factors. Bayes factors evaluate the evidence in favor of a null hypothesis while explicitly taking the alternative into account (R. E. Kass and Raftery 1995), follow the rules of probability (Lindley 2000), compare how well each hypothesis predicts the data (Goodman 1999), minimize the weighted sum of type I and type II error probabilities (DeGroot 1975; Pericchi and Pereira 2015), and take the implicit balance of losses into account (Jeffreys 1961; Bernardo and Smith 1994). Posterior odds allow us to use the same framework to analyze nested and non-nested models and to perform model averaging. However, Bayes factors cannot be based on improper or vague priors (Koop 2003), the practical interplay between model selection and posterior distributions is not as straightforward as the interplay between hypothesis tests and confidence intervals in the Frequentist approach, and the computational burden can be more demanding because potentially difficult integrals must be solved.
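The following sketch illustrates posterior odds and a Bayes factor in a case where the integrals have closed forms. The setting is an assumption chosen for illustration: binomial data (7 successes in 20 trials), a point null \(H_0: \theta=0.5\), and an alternative \(H_1\) with a proper uniform Beta(1, 1) prior on \(\theta\). The function names are hypothetical.

```python
from math import comb, exp, lgamma


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def marginal_likelihood_point(y, n, theta0):
    """Marginal likelihood when theta is fixed at theta0 (point hypothesis)."""
    return comb(n, y) * theta0**y * (1 - theta0) ** (n - y)


def marginal_likelihood_beta(y, n, a, b):
    """Marginal likelihood when theta has a proper Beta(a, b) prior."""
    return comb(n, y) * exp(log_beta(a + y, b + n - y) - log_beta(a, b))


# Hypothetical data: y successes in n trials.
y, n = 7, 20
m0 = marginal_likelihood_point(y, n, 0.5)      # H0: theta = 0.5
m1 = marginal_likelihood_beta(y, n, 1.0, 1.0)  # H1: theta ~ Beta(1, 1)

bf01 = m0 / m1                      # Bayes factor of H0 against H1
prior_odds = 1.0                    # assumption: equal prior model probabilities
posterior_odds = prior_odds * bf01  # posterior odds = prior odds x Bayes factor
post_prob_h0 = posterior_odds / (1.0 + posterior_odds)

print(f"BF01 = {bf01:.3f}, P(H0 | data) = {post_prob_h0:.3f}")
```

The prior under \(H_1\) must be proper here: with an improper prior the marginal likelihood, and hence the Bayes factor, would only be defined up to an arbitrary constant.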

On the other hand, the Frequentist approach establishes most of its estimators as the solution of a system of equations. Observe that optimization problems reduce to solving systems of equations. We can potentially get the distribution of these estimators, but most of the time asymptotic arguments or resampling techniques are needed. Hypothesis testing requires pivotal quantities[15] and/or resampling, and prediction is most of the time based on a plug-in approach, which means not taking estimation error into account. In addition, ancillary statistics can be used to build prediction intervals.[16] Comparing models depends on their structure; for instance, there are different Frequentist statistical approaches to compare nested and non-nested models. A nice feature in some situations is that there is a practical interplay between hypothesis testing and confidence intervals: in a linear model, you cannot reject at the \(\alpha\) significance level (Type I error) any null hypothesis \(H_0: \beta_k=\beta_k^0\) if \(\beta_k^0\) is in the \(1-\alpha\) confidence interval, \(P(\beta_k\in[\hat{\beta_k}-|t_{N-K}^{\alpha/2}|\times\hat\sigma_{\hat{\beta_k}},\hat{\beta_k}+|t_{N-K}^{\alpha/2}|\times \hat\sigma_{\hat{\beta_k}}])=1-\alpha\). Here \(\hat{\beta_k}\) and \(\hat\sigma_{\hat{\beta_k}}\) are the least squares estimator of \(\beta_k\) and its estimated standard error, \(t_{N-K}^{\alpha/2}\) is the quantile of the Student’s t distribution at probability \(\alpha/2\) with \(N-K\) degrees of freedom, \(N\) is the sample size, and \(K\) is the number of location parameters.

A remarkable difference between the Bayesian and the Frequentist inferential frameworks is the interpretation of credible/confidence intervals. Observe that once we have estimates, so that for example the previous interval is \([0.2, 0.4]\) at a 95% confidence level, we cannot say that \(P(\beta_k\in [0.2, 0.4])=0.95\) in the Frequentist framework. In fact, this probability is 0 or 1 under this approach, as \(\beta_k\) is either in the interval or not; the problem is that we will never know which in applied settings. This is because \(P(\beta_k\in[\hat{\beta_k}-|t_{N-K}^{0.025}|\times \hat\sigma_{\hat{\beta_k}},\hat{\beta_k}+|t_{N-K}^{0.025}|\times \hat\sigma_{\hat{\beta_k}}])=0.95\) holds in the sense of repeated sampling. On the other hand, once we have the posterior distribution, we can say that \(P(\beta_k\in [0.2, 0.4])=0.95\) under the Bayesian framework.
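The repeated sampling interpretation can be illustrated with a small simulation. This is a simplified sketch, not the regression setting of the text: it assumes a normal mean with known standard deviation, so the interval uses a standard normal quantile instead of a Student’s t quantile. The true mean, sample size, number of replications and seed are arbitrary illustrative choices.

```python
import random
from statistics import NormalDist

random.seed(123)  # fixed seed so the simulation is reproducible

# Assumed data generating process: normal observations with known sigma.
true_mean, sigma, n_obs, n_reps = 0.3, 1.0, 50, 5000

# Half width of the 95% interval for the mean when sigma is known.
z = NormalDist().inv_cdf(0.975)
half_width = z * sigma / n_obs**0.5

covered = 0
for _ in range(n_reps):
    sample_mean = sum(random.gauss(true_mean, sigma) for _ in range(n_obs)) / n_obs
    lower, upper = sample_mean - half_width, sample_mean + half_width
    # Each realized interval either contains the true mean or it does not.
    covered += lower <= true_mean <= upper

coverage = covered / n_reps
print(f"share of intervals containing the true mean: {coverage:.3f}")
```

The 0.95 is a property of the procedure across replications: roughly 95% of the simulated intervals contain the true mean, while any single realized interval contains it with probability 0 or 1.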

Following common practice, most researchers and practitioners do hypothesis testing based on the p–value in the Frequentist framework. But what is a p–value? Most users do not know the answer, because statistical inference is often not performed by statisticians (J. Berger 2006).[17] A p–value is the probability of obtaining a statistical summary of the data equal to or “more extreme” than what was actually observed, assuming that the null hypothesis is true.

Therefore, p–value calculations involve not just the observed data, but also more “extreme” hypothetical observations. So,

“What the use of p implies, therefore, is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred.” (Jeffreys 1961)
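The role of outcomes that did not occur can be made concrete with an exact one-sided binomial test. The numbers are hypothetical (15 successes in 20 trials, null hypothesis \(\theta=0.5\)), and the function name is an illustrative choice. The p–value adds the probability of the observed count to the probabilities of all larger counts, which were never observed.

```python
from math import comb


def binomial_pmf(k, n, theta):
    """Probability of k successes in n Bernoulli(theta) trials."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)


# Hypothetical data and null hypothesis.
n_trials, y_obs, theta_null = 20, 15, 0.5

# Probability, under the null, of the outcome that was actually observed.
prob_observed = binomial_pmf(y_obs, n_trials, theta_null)

# One-sided p-value: observed outcome plus all "more extreme" outcomes.
p_value = sum(
    binomial_pmf(k, n_trials, theta_null) for k in range(y_obs, n_trials + 1)
)

print(f"P(Y = {y_obs} | H0) = {prob_observed:.4f}")
print(f"p-value = P(Y >= {y_obs} | H0) = {p_value:.4f}")
```

The gap between the two printed numbers is entirely due to counts of 16 to 20, that is, to data that were not observed.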

It seems that common Frequentist inferential practice intertwines two different lines of logical reasoning: the p–value (Fisher 1958) and the significance level (Neyman and Pearson 1933). The former is an informal short–run criterion, whose philosophical foundation is reduction to absurdity, and which measures the discrepancy between the data and the null hypothesis. So, the p–value is not a direct measure of the probability that the null hypothesis is false. The latter, whose philosophical foundation is deduction, is based on long–run performance: it controls the overall number of incorrect inferences under repeated sampling, without regard to individual cases. The p–value fallacy consists of interpreting the p–value as the strength of evidence against the null hypothesis while simultaneously using it as the frequency of type I error under the null hypothesis (Goodman 1999).

The American Statistical Association has several concerns regarding the use of the p–value as a cornerstone of hypothesis testing in science. These concerns motivated the ASA’s statement on p–values (Wasserstein and Lazar 2016), which can be summarized in the following principles:

  • “P–values can indicate how incompatible the data are with a specified statistical model.”

  • “P–values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.”

  • “Scientific conclusions and business or policy decisions should not be based only on whether a p–value passes a specific threshold.”

  • “Proper inference requires full reporting and transparency.”

  • “A p–value, or statistical significance, does not measure the size of an effect or the importance of a result.”

  • “By itself, a p–value does not provide a good measure of evidence regarding a model or hypothesis.”

Another difference between Frequentists and Bayesians is the way scientific hypotheses are tested. The former use the p–value, whereas the latter use the Bayes factor. Observe that the p–value is associated with the probability of the data given the hypothesis, whereas the Bayes factor is associated with the probability of the hypothesis given the data. However, there is an approximate link between the \(t\) statistic and the Bayes factor for regression coefficients (A. Raftery 1995). In particular, \(|t|>(\log(N)+6)^{1/2}\) corresponds to strong evidence in favor of rejecting the irrelevance of a control in a regression. Observe that in this setting the threshold of the \(t\) statistic, and as a consequence the significance level, depends on the sample size. This agrees with the idea in experimental design of selecting the sample size so that we control Type I and Type II errors. In observational studies we cannot control the sample size, but we can select the significance level.
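A short sketch shows how the threshold \((\log(N)+6)^{1/2}\) grows with the sample size and what two-sided significance level it roughly implies. Two assumptions are made here: the logarithm is the natural logarithm, and the Student’s t distribution is replaced by its standard normal approximation, which is reasonable only for large \(N-K\). The function names and sample sizes are illustrative.

```python
from math import log, sqrt
from statistics import NormalDist


def t_threshold(n):
    """Threshold (log(N) + 6)^(1/2) on |t| for a sample of size n."""
    return sqrt(log(n) + 6.0)


def implied_level(n):
    """Approximate two-sided significance level implied by the threshold.

    Assumption: standard normal approximation to the t distribution.
    """
    return 2.0 * (1.0 - NormalDist().cdf(t_threshold(n)))


sample_sizes = [50, 100, 1000, 10000, 100000]
for n in sample_sizes:
    print(f"N = {n:>6}: |t| > {t_threshold(n):.3f}, "
          f"implied level = {implied_level(n):.5f}")
```

Under these assumptions the implied level falls well below the conventional 0.05 and keeps decreasing as \(N\) grows.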

See also Sellke, Bayarri, and Berger (2001) and Benjamin et al. (2018) for nice exercises that reveal potential flaws of the p–value (\(p\)) arising from \(p\sim U[0,1]\) under the null hypothesis,[18] and for calibrations of the p–value that make it interpretable as an odds ratio and as an error probability. In particular, for \(p<e^{-1}\), \(B(p)=-e\times p\times\log(p)\) can be interpreted as the Bayes factor of \(H_0\) to \(H_1\), where \(H_1\) denotes the unspecified alternative to \(H_0\), and \(\alpha(p)=(1+[-e\times p\times \log(p)]^{-1})^{-1}\) as the error probability \(\alpha\) in rejecting \(H_0\). Take into account that \(B(p)\) and \(\alpha(p)\) are lower bounds.
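The two calibrations can be evaluated directly. The sketch below implements \(B(p)\) and \(\alpha(p)\) exactly as written above, with \(\log\) taken as the natural logarithm; the function names and the list of p–values are illustrative choices.

```python
from math import e, log


def bayes_factor_bound(p):
    """Lower bound B(p) = -e * p * log(p), valid for 0 < p < 1/e."""
    if not 0.0 < p < 1.0 / e:
        raise ValueError("the calibration requires 0 < p < 1/e")
    return -e * p * log(p)


def error_probability_bound(p):
    """Lower bound alpha(p) = (1 + 1/B(p))^(-1) on the error probability."""
    return 1.0 / (1.0 + 1.0 / bayes_factor_bound(p))


for p in (0.1, 0.05, 0.01, 0.005, 0.001):
    print(f"p = {p:<6}: B(p) = {bayes_factor_bound(p):.4f}, "
          f"alpha(p) = {error_probability_bound(p):.4f}")
```

For example, \(p=0.05\) gives \(B(p)\approx 0.41\) and \(\alpha(p)\approx 0.29\), so a result that is just significant at the 5% level corresponds, at best, to odds of roughly 2.5 to 1 against \(H_0\).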

References

Benjamin, Daniel J, James O Berger, Magnus Johannesson, Brian A Nosek, E-J Wagenmakers, Richard Berk, Kenneth A Bollen, et al. 2018. “Redefine Statistical Significance.” Nature Human Behaviour 2 (1): 6–10.
Berger, J. 2006. “The Case for Objective Bayesian Analysis.” Bayesian Analysis 1 (3): 385–402.
Bernardo, J., and A. Smith. 1994. Bayesian Theory. Chichester: Wiley.
DeGroot, M. H. 1975. Probability and Statistics. London: Addison-Wesley Publishing Co.
Fisher, R. 1958. Statistical Methods for Research Workers. 13th ed. New York: Hafner.
Goodman, S. N. 1999. “Toward Evidence-Based Medical Statistics. 1: The P Value Fallacy.” Annals of Internal Medicine 130 (12): 995–1004.
Jeffreys, H. 1961. Theory of Probability. London: Oxford University Press.
Kass, R E, and A E Raftery. 1995. “Bayes Factors.” Journal of the American Statistical Association 90 (430): 773–95.
Koop, Gary M. 2003. Bayesian Econometrics. John Wiley & Sons Inc.
Lindley, D. V. 2000. “The Philosophy of Statistics.” The Statistician 49 (3): 293–337.
Neyman, J., and E. Pearson. 1933. “On the Problem of the Most Efficient Tests of Statistical Hypotheses.” Philosophical Transactions of the Royal Society, Series A 231: 289–337.
Pericchi, Luis, and Carlos Pereira. 2015. “Adaptative Significance Levels Using Optimal Decision Rules: Balancing by Weighting the Error Probabilities.” Brazilian Journal of Probability and Statistics.
Raftery, A. 1995. “Bayesian Model Selection in Social Research.” Sociological Methodology 25: 111–63.
Sellke, Thomas, MJ Bayarri, and James O Berger. 2001. “Calibration of p Values for Testing Precise Null Hypotheses.” The American Statistician 55 (1): 62–71.
Wasserstein, Ronald L., and Nicole A. Lazar. 2016. “The ASA’s Statement on p–Values: Context, Process and Purpose.” The American Statistician.

  15. A pivotal quantity is a function of unobserved parameters and observations whose probability distribution does not depend on the unknown parameters.

  16. An ancillary statistic is a pivotal quantity that is also a statistic.

  17. https://fivethirtyeight.com/features/not-even-scientists-can-easily-explain-p-values/

  18. See https://joyeuserrance.wordpress.com/2011/04/22/proof-that-p-values-under-the-null-are-uniformly-distributed/ for a simple proof.