Chapter 8 Bayesian Testing






8.1 Motivation


The previous chapter was pretty intense in terms of exploring different methods of hypothesis testing. As we did in the first half of the book, we’re now going to shift to a Bayesian perspective, and apply this perspective to hypothesis testing. Thankfully, we’re already familiar with what it means to ‘think Bayesian’, so this section should be relatively light.




8.2 Bayesian Testing


Let’s think back to the Bayesian/Frequentist approaches we used earlier in the book. Recall that the likelihood function underpinned a lot of the frequentist approach: that is, \(f(X|\theta)\), or the likelihood of the data we observed given specific parameters. The Bayesian approach flipped this on its head: we found \(f(\theta|X)\), or the distribution of the parameter given the data, by employing Bayes’ rule.


Now let’s think back to hypothesis testing. Remember that a p-value gives the probability of observing the data we observed (or more extreme data) given that the null hypothesis is true. We could write this as \(P(data | H_0 \; true)\). With a Bayesian approach, we can try to flip this and find \(P(H_0 \; true | data)\). How do we write this? Well, using Bayes’ rule!

\[P(H_0 \; true | data) = \frac{P(data|H_0 \; true)P(H_0 \; true)}{P(data)}\]

One tricky thing here is finding \(P(H_0 \; true)\), but that’s just our prior! That is, this is the quantity we assign a prior probability to, just as we’ve been doing thus far in our Bayesian work.
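To make this concrete, here’s a minimal numerical sketch in Python; the specific hypotheses and data are invented for illustration. A coin is either fair under \(H_0\) or biased (heads probability 0.7) under \(H_A\), and we apply Bayes’ rule directly:

```python
from scipy.stats import binom

# A minimal sketch of Bayes' rule for hypotheses; the numbers are
# invented for illustration. H0: the coin is fair (p = 0.5);
# HA: the coin is biased (p = 0.7). We observe 8 heads in 10 flips.
prior_h0, prior_ha = 0.5, 0.5          # P(H0 true), P(HA true)
like_h0 = binom.pmf(8, 10, 0.5)        # P(data | H0 true)
like_ha = binom.pmf(8, 10, 0.7)        # P(data | HA true)

# P(data) by the law of total probability over the two hypotheses
p_data = like_h0 * prior_h0 + like_ha * prior_ha

posterior_h0 = like_h0 * prior_h0 / p_data   # P(H0 true | data)
print(posterior_h0)                          # about 0.16
```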


This is especially useful when we’d like to compare two different hypotheses. Let’s think about \(P(H_0 \; true | data)\) vs. \(P(H_A \; true | data)\). If we take the ratio, we get:

\[\frac{P(H_0 \; true | data)}{P(H_A \; true | data)} = \frac{\frac{P(data|H_0 \; true)P(H_0 \; true)}{P(data)}}{\frac{P(data|H_A \; true)P(H_A \; true)}{P(data)}} = \frac{P(data | H_0 \; true) P(H_0 \; true)}{P(data | H_A \; true) P(H_A \; true)}\]

This is pretty useful, as we will see in a second. Let’s think through this intuitively. There are two forces that increase the probability of a model being true: the prior probability that it is true, or \(P(H_0 \; true)\), and the probability of the observed data conditional on that model, or \(P(data | H_0 \; true)\). If either of these is high, it pushes up the overall probability of the model being true given the data.
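As a quick sketch of this ratio (continuing the invented coin example from above, with a fair coin under \(H_0\) and a \(p = 0.7\) coin under \(H_A\)), notice that \(P(data)\) cancels, so we only ever need the priors and the likelihoods:

```python
from scipy.stats import binom

# Posterior odds of H0 vs. HA for the invented coin example:
# prior odds times the likelihood ratio. P(data) cancels out,
# so we never have to compute it.
prior_odds = 0.5 / 0.5                                   # P(H0) / P(HA)
likelihood_ratio = binom.pmf(8, 10, 0.5) / binom.pmf(8, 10, 0.7)
posterior_odds = prior_odds * likelihood_ratio
print(posterior_odds)   # about 0.19: the data favor HA
```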


This is all pretty general; don’t worry, it’s more about understanding the approach at this point! We’ll get into more specific testing methods coming up.




8.3 Bayes’ Factor


We just went through the basic logic of Bayesian hypothesis testing. Let’s talk about a specific way to test. Suppose we want to compare two models, \(M_A\) and \(M_B\). We saw above that we can use a Bayesian approach to say \(P(M_A \; true | data) = \frac{P(data|M_A \; true)P(M_A \; true)}{P(data)}\), and taking the ratio of the two posteriors yields:

\[\frac{P(data | M_A \; true) P(M_A \; true)}{P(data | M_B \; true) P(M_B \; true)}\]

The \(\frac{P(M_A \; true)}{P(M_B \; true)}\) part of this ratio is just the ratio of the prior probabilities. The other part of this ratio:

\[\frac{P(data|M_A \; true)}{P(data|M_B \; true)}\]

is called the Bayes factor. Intuitively, what is the Bayes factor giving us? It’s just the ratio of how likely the observed data is under each model. What if \(M_A\) were a much better model than \(M_B\)? Then we would expect the probability of the observed data under \(M_A\) to be higher, which means the numerator would be a lot larger than the denominator, which means the Bayes factor would be large.


In general, then, a large Bayes factor is evidence supporting the model in the numerator, and a small Bayes factor is evidence supporting the model in the denominator. Generally, if one of the probabilities is more than 10 times as large as the other, it’s considered strong evidence for that model; a probability between 3 and 10 times as large is generally considered ‘substantial’ evidence.
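As a rough sketch of this scale (with invented point models and data): suppose \(M_A\) says a coin has heads probability 0.9, \(M_B\) says 0.5, and we observe 9 heads in 10 flips.

```python
from scipy.stats import binom

# Bayes factor for two hypothetical point models of a coin,
# M_A: p = 0.9 vs. M_B: p = 0.5, after 9 heads in 10 flips.
bf = binom.pmf(9, 10, 0.9) / binom.pmf(9, 10, 0.5)

# The rough evidence scale described above
if bf > 10:
    verdict = "strong evidence for M_A"
elif bf > 3:
    verdict = "substantial evidence for M_A"
else:
    verdict = "inconclusive (or check 1 / bf for M_B)"
print(round(bf, 1), verdict)   # about 39.7: strong evidence for M_A
```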



The Bayes factor can be tricky to calculate, so let’s think about finding it for a two-sided hypothesis test. That is, consider:

\[H_0: \theta = \theta_0, \; H_A: \theta \neq \theta_0\]

The Bayes factor relating the hypotheses is:

\[\frac{P(data|H_A \; true)}{P(data|H_0 \; true)}\]

Let’s think about the denominator. This should be relatively easy to find, since it’s just the probability of the data given a specific value of the parameter, or \(P(data | \theta_0)\). The numerator is a little more complicated, since \(H_A\) by design includes many values of the parameter \(\theta\). For this, then, we have to integrate over all possible values of \(\theta\), weighting by the probability of each \(\theta\) along the way. That is:

\[P(data | H_A \; true) = \int_{\theta} f(data | \theta) f(\theta) \; d\theta\]

Where \(f(\theta)\) is just the prior, which we’re used to by now. So, we could re-write our Bayes factor from above:

\[\frac{\int_{\theta} f(data | \theta) f(\theta) \; d\theta}{f(data|\theta_0)}\]
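Here’s a minimal numerical sketch of this ratio, with invented specifics: 8 heads in 10 coin flips, \(H_0: \theta = 0.5\), and a Uniform(0, 1) prior \(f(\theta)\) under the alternative.

```python
from scipy.stats import binom
from scipy.integrate import quad

# Two-sided Bayes factor sketch: x = 8 heads in n = 10 flips,
# H0: theta = 0.5, and a Uniform(0, 1) prior (f(theta) = 1) under HA.
x, n, theta0 = 8, 10, 0.5

# Numerator: integrate f(data | theta) * f(theta) over theta in (0, 1)
marginal_ha, _ = quad(lambda theta: binom.pmf(x, n, theta) * 1.0, 0, 1)

# Denominator: the likelihood at the single null value theta_0
like_h0 = binom.pmf(x, n, theta0)

print(marginal_ha / like_h0)   # about 2.1: only mild evidence for HA
```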

Again, we judge which model is more likely by the magnitude of the Bayes factor.




8.3.1 BIC


Short for Bayesian Information Criterion, the BIC gives us another way to compare two models. Let’s just jump right in; here’s the equation:

\[BIC = -2\big[ l(\hat{\theta}_{MLE})\big] + k \big[log(n) + log(2\pi)\big]\]

Where \(l(\hat{\theta}_{MLE})\) is the value of the log likelihood evaluated at the MLEs, \(k\) is the number of free parameters (e.g., a Normal distribution has 2 parameters), and \(n\) is the sample size.


If you calculate the BIC for two different models, you ideally want the model with the lowest BIC. So, if \(H_0\) had a BIC of -50 and \(H_A\) had a BIC of -80, you would prefer \(H_A\), since -80 is lower.
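To see the formula in action, here’s a sketch on simulated data (all specifics invented): we compute the BIC above for a Normal model with both parameters free (\(k = 2\)) versus one with the mean fixed at 0 (\(k = 1\)).

```python
import numpy as np

# BIC per the formula above: -2 * loglik + k * (log(n) + log(2 * pi))
def bic(loglik, k, n):
    return -2 * loglik + k * (np.log(n) + np.log(2 * np.pi))

def normal_loglik(x, mu, sigma):
    # log likelihood of i.i.d. Normal(mu, sigma^2) data
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - mu)**2 / (2 * sigma**2))

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=100)   # simulated data
n = len(x)

# MLEs: sample mean and the (biased) MLE standard deviation
mu_hat, sigma_hat = x.mean(), x.std()
# Restricted model (mean fixed at 0): MLE of sigma uses mu = 0
sigma_hat_0 = np.sqrt(np.mean(x**2))

bic_free  = bic(normal_loglik(x, mu_hat, sigma_hat), k=2, n=n)
bic_fixed = bic(normal_loglik(x, 0.0, sigma_hat_0), k=1, n=n)
print(bic_free, bic_fixed)   # prefer whichever BIC is lower
```

Since the simulated data really do have a nonzero mean, the free-mean model should typically come out with the lower BIC here, despite its extra parameter.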