Chapter 8 Likelihood Based Testing
8.1 Motivation
We’ve learned about the mechanics of Hypothesis Testing; now it’s time to apply this knowledge through a couple of different channels. We’ll focus on three major approaches to testing, which should prove useful in actual inference applications.
8.2 Likelihood Ratio Testing
It’s probably best to jump right into this. Remember the logic we established in the last chapter: a parameter \(\theta\) has ‘support’ \(\Omega\), which is partitioned (split in two) by \(\Omega_0\) and \(\Omega_1\). The null hypothesis, \(H_0\), is that \(\theta\) is in the ‘null space’, or \(\Omega_0\), and the alternative hypothesis \(H_A\) is that \(\theta\) is in \(\Omega_1\).
Also recall that we need a test statistic, which is just some function of the data, and we reject the null if this test statistic falls in a certain rejection region (generally, the test statistic is the sample mean or proportion, and the rejection is some distance that’s far enough away from the hypothesized value to make the null unlikely to be true). We often choose this rejection region by setting our Type I or Type II error to be a specific value.
In the case of the Likelihood Ratio Test, the test statistic is a little funky. It’s given by \(\bigwedge(x)\), where \(x\) is the data. For a test \(H_0: \theta \in \Omega_0\) vs. \(H_a: \theta \in \Omega_1\), the test statistic \(\bigwedge(x)\) is:
\[\bigwedge(x) = \frac{max_{\theta \in \Omega_0} \big[f(x|\theta)\big]}{max_{\theta \in \Omega} \big[f(x|\theta)\big]}\]
Yikes. Let’s break this down.
We can already see why this is called a ‘Ratio’ test, since we have the ratio of two very similar terms! Both things are taking the max of our likelihood \(f(x|\theta)\), with respect to the parameter \(\theta\). This is something we’ve been doing for a while now (the MLE is just the value for \(\theta\) that maximizes the log likelihood).
In fact, you’ll notice the only difference between the top and bottom is that the top has a \(\Omega_0\) and the bottom is just \(\Omega\). What’s going on here? Well, for the denominator, we’re maximizing the log likelihood for any \(\theta\) (remember that, by definition, \(\theta\) is in \(\Omega\), or \(\theta \in \Omega\)). The numerator, then, maximizes the likelihood but only for values of \(\theta\) in the subset \(\Omega_0\).
This is actually kind of an intuitive approach. First, let’s consider if the null hypothesis is true, and \(\theta\) is in \(\Omega_0\). In this case, the numerator and denominator are going to be the same! To find the denominator, we’ll find the maximum of the log likelihood for any value of \(\theta\), which we just said is in \(\Omega_0\). We’ll find the same value for the numerator, since we’re maximizing within \(\Omega_0\), which we just said is the overall max! So, in this case, \(\bigwedge(x) = 1\), since maxed likelihoods will be equal.
What about when the null hypothesis is false, and \(\theta\) is not in \(\Omega_0\)? In fact, what if the true \(\theta\) is very far away from \(\Omega_0\)? In this case, we would expect the denominator to be much bigger than the numerator, since the most likely value of \(\theta\) within \(\Omega_0\) is much less than global maximum likelihood. That is, it’s much less likely that the \(\theta\) from \(\Omega_0\) is the true \(\theta\), since the likelihood is smaller.
Hopefully this intuition makes sense. Anyways, this \(\bigwedge(x)\) is thus bounded between 0 and 1, and the smaller it is, the more evidence against the null hypothesis (since if \(\bigwedge(x)\) is small, it means the max \(\theta\) inside \(\Omega_0\) is much less likely than the overall max). Therefore, the rejection region for \(\bigwedge(x)\) is generally less than or equal to some value (you might say .5 or less is the rejection region: that means that the global max was twice as likely as the \(\Omega_0\) max. It really depends, though, on the test and what type of error you are trying to minimize).
Summing up LRT (likelihood ratio test) is pretty straightforward: take the ratio of the maximum likelihoods for the null space and the overall space, and reject \(H_0\) if this is too small (‘too small’ varies for different problems).
Now that we know how to test using LR, we should also learn how to build the Confidence Interval for the parameter. Recall up to this point we’ve built confidence intervals by doing plus or minus standard errors (Frequentist), or pulling off quantiles from posterior distributions of the parameters (Bayesian). The LRT requires a slightly different and more computational approach.
The idea is actually pretty intuitive. We’re going to build an interval for \(\theta\) such that every value in the interval does not reject the LRT that we’re running. That is, if \(\theta = 100\) fails the LRT (we get a ratio that is far too small), then 100 wouldn’t be in our interval! Maybe only the values 3.5 to 6.7 are the values that do not reject the LRT; these would be the values for our interval!
Although it seems a little odd, this is a pretty valid approach: if a value for \(\theta\) fails the LRT, it means that value isn’t very likely, and thus it probably shouldn’t be in our confidence interval, since confidence intervals are intervals of likely values for our parameters! Clearly, this whole setup is less than ideal, because it means we have to run the LRT calculation and see if we reject for a whole range of values, and then pick the values that we didn’t reject for to make our interval. That’s why we have computers! Indeed, the LRT-based confidence interval is a very natural use for R.
There are a couple of other important concepts related to Likelihood Testing:
8.2.1 Monotone Likelihood Ratio
Let data \(X_1, X_2,...X_n\) have the joint PDF \(f(x|\theta)\) where \(\theta\) is a parameter, and let \(T\) be a test statistic that is some function of the data. If for every possible combo of \(\theta_1\) and \(\theta_2\) such that \(\theta_2 > \theta_1\), the ratio \(\frac{f(x|theta_2)}{f(x|theta_1)}\) depends on \(x\) through the test statistic \(T\) and the ratio is a monotone function of \(T\) for all possible values of \(T\), then the joint distribution of \(X\) has a monotone likelihood ratio.
Phew. What does that look like in practice? Consider an i.i.d. Binomial example for a r.v. \(X\) with known \(n\) where we are trying to estimate the parameter \(p\). Consider two values of \(p\) such that \(p_2 > p_1\). Let’s say our test statistic in this case will be the sum of all of the successes we see, so \(T = \sum_{i=1}^n x_i\). This is a natural test statistic because the total number of successes helps to determine the true proportion of successes, especially since we know the total number of trials! It’s straightforward to show that
\[\frac{f(x|\theta_2)}{f(x|\theta_1)} = \frac{\big(\prod_{i=1}^n{n \choose x}\big) p_2^{\sum_{i=1}^nx} (1-p_2)^{n-{\sum_{i=1}^nx}}}{\big(\prod_{i=1}^n{n \choose x}\big) p_1^{\sum_{i=1}^nx} (1-p_1)^{n-{\sum_{i=1}^nx}}} = \frac{ p_2^{\sum_{i=1}^nx} (1-p_2)^{n-{\sum_{i=1}^nx}}}{ p_1^{\sum_{i=1}^nx} (1-p_1)^{n-{\sum_{i=1}^nx}}}\]
We already defined \(T = \sum_{i=1}^nx\), so this is straightforward to plug in:
\[\frac{ p_2^T (1-p_2)^{n -T}}{ p_1^T (1-p_1)^{n -T}}\]
We can check the first box here: the data vector \(x\) only shows up through the statistic \(T\). How do we know that? Well, look at it! There’s no \(x\); we were able to re-write everything simply in terms of \(T\), which recall is the total of the \(x\) values.
How about the second requirement, that the ratio is monotone for all values of \(T\)? Well, let’s do some re-arranging:
\[\frac{ p_2^T (1-p_2)^{n -T}}{ p_1^T (1-p_1)^{n -T}} = \Big(\frac{ p_2^T (1-p_1)^T}{ p_1^T (1-p_2)^T}\Big) \Big(\frac{(1-p_2)^n}{(1-p_1)^n}\Big)\]
We only have to look at the first term, because it’s the part that deals with \(T\). We know that \(p_2 > p_1\) and \(1 - p_1 > 1 - p_2\), so the numerator will be larger than the denominator. This means that this function is always increasing with \(T\), so it is monotone!
Phew. We just showed that we have a monotone likelihood ratio, first by checking if the ratio only involves the data through \(T\) and then checking if the function is monotone w.r.t. \(T\). Why does it matter? Well, enter the next part…
8.2.2 Uniformly Most Powerful Tests -
This means that for a given level \(\alpha\), a test has greater or equal power to any other test for any value of a parameter. Crazy! Recall that we generally want high power (want to reject false nulls), and a uniformly most powerful test (UMP for short) automatically gives us the most possible power.
What’s neat, then, is that if we have a monotone likelihood ratio (which we just spent some time showing) then the test is uniformly most powerful (if it is 1-sided)! There, then, is the value of showing a monotone likelihood ratio: we get a UMP test! Remember, this only holds for one-sided test.
8.2.3 Sufficient Statistics
Closely related to UMP. Basically, a statistic \(T\) is sufficient if you can factor the PDF as a function of the data times a function of the parameter and the statistic \(T\). Intuitively, this means that all we need to estimate \(\theta\) is \(T\), and there’s nothing else from the sample that’s important.
8.2.4 Log Likelihood Ratio Tests
We’ve been through a lot about the Likelihood Ratio Tests, and you’ve probably been seeing a bunch of likelihood and have been itching to take the log of them. Have no fear! This is where the log LRT comes into play.
Let’s say we have a two-sided alternative hypothesis, so something like \(H_A \theta \neq \theta_0\). Also assume that we have a large \(n\) (sample size). Recall that the LRT statistic is given by:
\[\bigwedge(x) = \frac{max_{\theta \in \Omega_0} \big[f(x|\theta)\big]}{max_{\theta \in \Omega} \big[f(x|\theta)\big]}\]
Well, interestingly enough:
\[-2\big[log\big(\bigwedge(x)\big)\big] \rightarrow^D \chi^2\]
This is saying that negative two times the log of our LRT statistic converges in distribution to a chi-squared distribution, with degrees of freedom equal to the number of parameters (so usually 1)!
That’s pretty funky. First, what does the log of the LRT statistic actually look like? Well, using our knowledge of log rules, we get:
\[log(\bigwedge(x)) = log(max_{\theta \in \Omega_0} \big[f(x|\theta)\big]) - log(max_{\theta \in \Omega} \big[f(x|\theta)\big])\]
That’s nifty, because it just means we have a subtraction of the log likelihoods! And, apparently, when we multiply this by negative 2 (we won’t cover the proof here) we get something that should be distributed chi-squared.
This allows us to perform testing very easily, since we can find the log of the LRT statistic and then compare it to the \(95^{th}\) percentile (or whatever level) of the appropriate chi-squared distribution to see if we reject or not. For example, we know that the \(95^{th}\) percentile for a \(\chi^2\) with \(df = 1\) is around 3.84. If we get a statistic greater than this, we have evidence to reject the null!
In summary, the log likelihood ratio test is a simple way to find a test statistic and then compare it to a well known distribution. We can find confidence intervals the same way we found them for regular old likelihood ratio tests: by selecting the values for \(\theta\) that don’t reject the null and thus are more likely.
8.3 Score Test
The LRT is definitely a lot. Thankfully, the next two forms of testing, Score and Wald, are much tamer.
Let’s start with the score function. This is actually something we’ve worked with countless times up until this point: the Score Function is just the derivative of the log likelihood! We denote this as \(U(\theta)\).
So, if we want to test \(H_0: \theta = \theta_0\) vs. \(H_A: \theta \neq \theta_0\), then your test statistic would be:
\[\frac{[U(\theta_0)]^2}{I_n(\theta_0)}\]
Or the score function (derivative of the log-likelihood) with \(\theta_0\) plugged in, squared, divided by Fisher’s information (from earlier in the book) with \(\theta_0\) plugged in.
This is our test statistic, and, interestingly enough, when \(H_0\) is true, this converges to a \(\chi^2\) with \(df = 1\). This means that we can compare our test statistic to a \(\chi^2_{df=1}\) distribution; that is, if 3.84 is the \(95^{th}\) percentile for a \(chi^2\) with \(df = 1\), and we get a score test statistic over 3.84, we would reject if we were working with 95\(\%\) confidence.
Really quickly, let’s think about the intuition for this test. Imagine if the null hypothesis is true, and we have that \(\theta = \theta_0\). Remember that the score function is the derivative of the log likelihood, so it’s essentially the slope of the log likelihood function. When we’re at the MLE, the slope is 0, since the MLE maximizes the function (flat hill on top). Well, then, we know the slope at the true parameter \(\theta\) will also be pretty small, since it should be relatively close to the MLE and thus close to the peak (in fact, although we won’t prove it here, the score function with \(\theta\) has expectation 0, so it will be at the top on average!). Therefore, the farther away we get from this peak, the steeper and steeper the slope and thus the higher \(U(\theta_0)\) is. That’s why a larger value of the Score test statistic means the data is less likely; because it’s farther away from maximizing the log likelihood! We divide by Fisher’s information to give a scale to what we’re doing.
Lastly, let’s talk about Score Confidence intervals. Like the LRT, there’s no nice closed-form way to solve for these intervals. Instead, you have to just manually find values that do not reject the score function, and build your interval that way.
8.4 Wald Test
Like the Score Test, the Wald is relatively straightforward. It allows us to test \(H_0: \theta = \theta_0\) vs. \(H_A: \theta \neq \theta_0\). The test statistic is given by:
\[I_n(\hat{\theta}_{MLE}) [ \hat{\theta}_{MLE} - \theta_0]^2\]
What do we measure this test statistic against? Well, fortunately, like the Score this converges to a \(\chi^2\) with \(df = 1\).
Let’s think about why a high value for the Wald test is evidence against the null. If the hypothesized value for \(\theta\) is far from the MLE, then the Wald test statistic will be large. Well, we know that MLEs are (asymptotically) unbiased, so if the hypothesized value for \(\theta\) is far from the MLE, it’s probably far from the true parameter as well! That’s why a large test statistic is evidence against \(\theta_0\).
Notice something else that’s a little strange about this test: we’re using \(\hat{\theta}_{MLE}\) when we plug into Fisher for both the test statistic and the confidence interval. This can give us strange results.
Finally (and fortunately), we also have a closed form equation for the Confidence Interval, which means we don’t have to do that weird manual calculation thing we had to do for Score and LRT. Here it is:
\[\hat{\theta}_{MLE} \pm z\sqrt{\frac{1}{I_n(\hat{\theta}_{MLE})}}\]
Which is pretty much what we’ve been using all along!
Now that we’ve gone over all of these different tests, let’s close by reviewing each at a high level:
1. Likelihood Ratio Test: Take the ratio of the maximum likelihood of the null space to the maximum likelihood overall. If you take the log of this, then you can compare the test statistic to a \(\chi^2_{df = 1}\). Usually this is the best approach, but you also have to calculate two likelihoods and it’s difficult to find the CI.
2. Score Test: We take the derivative of the log likelihood (the score function), plug in the hypothesized value for the parameter, square it and divide by Fisher’s information. We can compare this test statistic to a \(\chi^2_{df = 1}\). It’s a combination of being more robust than the Wald test and easier to calculate than the LRT, but it’s usually the least powerful approach.
3. Wald Test: We take the distance between the MLE and the hypothesized value, multiply by Fisher’s and compare this test statistic to the \(\chi^2_{df=1}\). It’s easy to calculate and gives a closed form equation for the Confidence Interval, which is nifty. However, we find the standard error under the MLE, which can give wacky results!
8.5 Practice
1a. Let \(X_1,...,X_n \sim Bern(p)\). Assume i.i.d. Is \(T = \sum_{i=1}^n X_i\) sufficient?
Solution: Let’s see if we can factor the PDF solely in terms of \(T\). We can find the likelihood:
\[f(X_1,...,X_n | p) = \prod_{i=1}^n p^x_i (1 - p)^{1 - x_i}\]
\[p_i^{\sum_{i=1}^n X_i}(1-p)^{n - \sum_{i=1}^n X_i}\]
We can plug in \(T\) here:
\[p_i^{T}(1-p)^{n - T}\]
Since the data only shows up through \(T\) (there’s no other \(x_i\) here) \(T\) is sufficient.
- Find a sufficient statistic for \(X_1,...,X_n \sim Geom(p)\). Assume i.i.d.
Solution: Let’s start by finding the likelihood:
\[f(x_1,...,x_n|p) = \prod_{i=1}^n (1 - p)^{x_i} p\]
\[ = (1 - p)^{\sum_{i=1}^n x_i} p\]
Clearly, \(T = \sum_{i=1}^n x_i\) works for a sufficient statistic here, since if we plug in \(T\) we get:
\[ = (1 - p)^{T} p\]
And \(T\) is the only thing that involves data here.
- Let \(X_1,...,X_n\) be i.i.d. \(Unif(0,b)\), where \(b\) is unknown. Find a sufficient statistic.
Solution: Let’s write down the likelihood:
\[\prod_{i=1}^n \frac{1}{b - 0}\]
This seems like this would be it…however, we require a bit more. If we see any \(x_i\) greater than \(b\), for example, then we know that \(b\) is not the correct upper bound. So, we need an ‘indicator random variable’ that all values are less than \(b\). This gives:
\[f(x_1,...,x_n) = \prod_{i=1}^n \frac{1}{b} I_i\]
Where \(I_i\) is the indicator for \(x_i \leq b\) for \(i \in (1,n)\) (that is, \(I_i = 1\) if \(x_i\) is less than \(b\) and \(I_i = 0\) otherwise). This comes out to:
\[b^{-n} I_{max\{x_1,...,x_n\} \leq b}\]
Where \(I_{max\{x_1,...,x_n\} \leq b}\) is the indicator that \(max\{x_1,...,x_n\} \leq b\), which is the same as saying that all of the \(x_i\) values are \(\leq b\) (or all of the \(I_i\) terms are \(1\)). Anyways, since we’ve got…
\[b^{-n} I_{max\{x_1,...,x_n\} \leq b}\]
…this means that \(T = max\{x_1,...,x_n\}\) is a sufficient statistic since it encapsulates all values of \(x_i\).
Consider a sample \(X_{1},\ldots,X_{n}\) that are i.i.d. \(Pois(\lambda)\) random variables. We conduct the following test (based on this sample) with level \(\alpha = 0.05\),
\[H_{0}:\lambda=\lambda_0,H_{A}:\lambda \neq \lambda_0\]
2a. Find the score function.
Solution: This is just the derivative of the log likelihood. We know the likelihood is:
\[\prod_{i=1}^n \frac{e^{-\lambda} \lambda^{x_i}}{x_i!} = \frac{e^{-n\lambda} \lambda^{\sum_{i=1}^nx_i}}{\prod_{i=1}^nx_i!}\]
Taking the log gives (we can ignore the factorial term in the denominator because when we derive it will just disappear):
\[-n\lambda + \sum_{i=1}^n x_i log(\lambda)\]
And deriving gives the score function:
\[U(\lambda) = -n + \frac{\sum_{i=1}^n x_i}{\lambda}\]
- Find Fisher’s Information
Solution: We’ve already found one derivative of the log likelihood, so we just need to take another derivative:
\[U^{\prime} = -\frac{\sum_{i=1}^n x_i}{\lambda^2}\]
And then we can take the negative expectation:
\[I(\lambda) = -E(-\frac{\sum_{i=1}^n x_i}{\lambda^2}) = \frac{n\lambda}{\lambda^2}\]
Since the expected value of each \(x_i\) is \(\lambda\). So, Fisher’s information is:
\[\frac{n}{\lambda}\]
- Find \(\hat{\lambda}_{MLE}\)
Solution: We already have the score function, so we set it equal to zero and solve:
\[0 = -n + \frac{\sum_{i=1}^n x_i}{\lambda}\]
\[\hat{\lambda_{MLE}} = \frac{\sum_{i=1}^n x_i}{n} = \bar{X}\]
We can check in R if our MLE gets us close to the true parameter:
## [1] 7.13
- Calculate the LRT statistic
Solution: We simply take the ratio of the likelihood; the top consists of the null space, or \(\lambda_0\), plugged in for \(\lambda\), in the denominator we plug in the value for \(\lambda\) that maximizes the likelihood, or \(\hat{\lambda}_{MLE}\).
\[\Lambda(X) = \frac{e^{-n\lambda_0} \lambda_0^{\sum_i X_i}}{e^{-n\bar{X}} \bar{X}^{\sum_i X_i}} = e^{-n\left(\lambda_0-\bar{X}\right)}\left(\frac{\lambda_0}{\bar{X}}\right)^{\sum_i X_i}\]
- Find the log of the LRT, then describe how you would find an LRT-based CI
Solution: We know that we just do -2 times the log of what we got above.
\[-2\log\Lambda(X) = -2\left[-n(\lambda_0 - \bar{X}) + \sum_i X_i (\log\lambda_0 - \log\bar{X})\right] = 2n(\lambda - \bar{X} + \bar{X}\log(\lambda/\bar{X}))\]
The CI is a little annoying. We know, since we’re testing at the level \(\alpha = .05\) level, that we will reject if we see a value of 3.84 or more (since this is the critical value for a \(\chi^2_{df=1}\)). So, we would iterate through possible values of \(\lambda\) and check if the output of \(-2log\Lambda(X)\) was greater than 3.84. If the output is less, we include that value of \(\lambda\) in the CI, since it does not reject (it’s a plausible value for the parameter).
- Find the Score test statistic, and describe how to find the Score-based CI
Solution: This is just the score function with \(\lambda_0\) plugged in, squared, and then divided by Fisher’s information with \(\lambda_0\) plugged in.
\[S(\lambda) = \frac{\left(-n + \sum_i X_i/\lambda_0 \right)^2}{n/\lambda_0} \sim \chi_1^2\]
We have to do the same inversion of the CI as described in the previous part. Again, we’re comparing to 3.84, since this statistic converges to a \(\chi^2_{df=1}\).
- Calculate the Wald test statistic, and then the Wald-based CI
Solution: We know this is just Fisher’s information with the MLE plugged in times the squared difference between the MLE and the null value. We have all of this:
\[W(\lambda) = \frac{n}{\bar{X}}(\bar{X}-\lambda)^2\]
And we can similarly plug in for the CI:
\[\bar{X} \pm 1.96\sqrt{\frac{\bar{X}}{n}}\]