5.7 GLM Inference Tests
Okay, suppose we feel pretty good about the assumptions and conditions for our model. Let’s do some actual tests!
5.7.1 Wald tests
Now that we know what the coefficients mean, we can ask whether they matter. The parallel to the \(t\)-tests we used to do in MLR (testing whether individual coefficients were zero, and thus whether individual predictors mattered) is called a Wald test. Like all hypothesis tests, it relies on knowing the distribution of a particular test statistic if the null hypothesis is true. Here, our null hypothesis is still “this predictor isn’t useful” – that is, \(H_0:\beta_j = 0\).
The first thing to note here is that GLMs are fit using (dun dun dunnnnnn) maximum likelihood. Yes, your old friend ML! The coefficient estimates that you see in the R output are maximum likelihood estimates of the corresponding \(\beta\)'s. Now, via some fun asymptotic theory that you can enjoy later in life, you can prove what happens to the distribution of these maximum likelihood estimates as you increase the sample size:
\[\hat{\beta}_{MLE}\sim N(\beta,\mathcal{I}^{-1}(\hat{\beta}_{MLE})) \quad\mbox{as }n\rightarrow \infty\]
The immediate takeaway here is that as \(n\rightarrow\infty\), your ML estimates of the \(\beta\)'s are unbiased, with a normal distribution.
What is that squiggly \(\mathcal{I}\) thing? That's the symbol for something called the Fisher information. This is also a fun adventure for later in life. You can just think of it as a quantity that's related to how much information you have about \(\beta\), which is why the inverse of the Fisher information corresponds to the variance of \(\hat{\beta}\). This is similar to a principle we've seen before: the more information you have about the coefficient you're trying to estimate, the more reliable (and therefore less variable) your estimates will be!
Yay! This means that we know:
- the SE of our estimates. This allows us to do confidence intervals for the \(\beta\)’s!
- the null distribution of our estimates (i.e., normal). This allows us to compare our observed values to the appropriate distribution and thus derive a p-value!
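Putting those two bullet points together: the Wald statistic that R reports for each coefficient is just the estimate divided by its standard error, and that same standard error gives an approximate confidence interval. As a quick sketch of the standard formulas:
\[z_j = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)} \sim N(0,1) \mbox{ under } H_0, \qquad \hat{\beta}_j \pm z^*\, SE(\hat{\beta}_j)\]
Here \(SE(\hat{\beta}_j)\) comes from the corresponding diagonal entry of \(\mathcal{I}^{-1}(\hat{\beta}_{MLE})\), and \(z^*\) is the usual normal critical value.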
Even more yay: you don’t have to do it by hand. Check out that summary information: R is doing those tests automatically.
##
## Call:
## glm(formula = admit ~ gre + rank, family = binomial, data = admissions_dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5199 -0.8715 -0.6588 1.1775 2.1113
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.802365 0.672982 -2.678 0.007402 **
## gre 0.003224 0.001019 3.163 0.001562 **
## rank2 -0.721737 0.313033 -2.306 0.021132 *
## rank3 -1.291305 0.340775 -3.789 0.000151 ***
## rank4 -1.602054 0.414932 -3.861 0.000113 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 499.98 on 399 degrees of freedom
## Residual deviance: 464.53 on 395 degrees of freedom
## AIC: 474.53
##
## Number of Fisher Scoring iterations: 4
Here we can see that everything looks significant. For example, the p-value for the Wald test on GRE is 0.0016. So at an alpha level of, let’s say, 0.01, we’re confident that, even after accounting for class rank, GRE score provides extra information that helps us predict the chance of admission.
Likewise, the p-value for rank2 is 0.02. That means we're somewhat confident that students in the second quartile have a different chance of admission than those in the first quartile who have the same GRE score – remembering that the first quartile is the baseline category. We're not super confident: that p-value of 0.02 comes in below an alpha of 0.05 but not below 0.01. We're much more confident that fourth-quartile students have different admissions chances than first-quartile students: that p-value is very low.
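If you want to double-check R's arithmetic, here's a quick sketch that recomputes the Wald test for gre by hand, using the (rounded) estimate and standard error printed in the summary above:
# Wald statistic for gre: estimate divided by its standard error
z_gre = 0.003224 / 0.001019
z_gre                                           # about 3.16, matching the "z value" column
2 * pnorm(-abs(z_gre))                          # two-sided p-value, about 0.0016
# approximate 95% confidence interval for the gre coefficient
0.003224 + c(-1, 1) * qnorm(0.975) * 0.001019
(In practice you'd pull these numbers out of the fitted model with coef(summary(...)) rather than retyping rounded values.)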
5.7.2 Likelihood Ratio Tests
Back in linear regression, we also had \(F\) tests that we used to assess the usefulness of a whole bunch of predictors at once. When we had two nested models, where the larger (full) model used all of the predictors in the smaller (reduced) model plus some extras, we could do an \(F\) test with the null hypothesis that the two models were effectively the same – the extra terms in the larger model didn’t really help it do any better at predicting the response.
There's a parallel test to this for logistic regression, as well. It's called a likelihood ratio test.
…and yes, I do mean that likelihood, the one we've seen before. It just keeps popping up!
Like all our hypothesis tests, the LR test relies on calculating a test statistic. In this case, the test statistic is the deviance:
\[D = 2\log\left(\frac{L_{full}(b)}{L_{reduced}(b)}\right)= 2(\ell_{full}(b) - \ell_{reduced}(b)) \]
Here, \(L_{full}(b)\) is the likelihood of the coefficient estimates under the larger, full model (given the observed data), and \(\ell_{full}(b)\) is the corresponding log-likelihood from the full model. Similarly, \(L_{reduced}(b)\) is the likelihood according to the smaller, reduced model, and \(\ell_{reduced}(b)\) is its log-likelihood.
Remember, the likelihood is a measure of how well the coefficients fit with the observed data. If the full model is doing a better job than the reduced model, then its coefficients should be a better fit with the data. In that case, \(L_{full} > L_{reduced}\), and the deviance is positive. But if the two models are equally good (more precisely, if the maximized likelihood is the same according to both models), the deviance is 0. And since the reduced model is nested inside the full model, the full model's likelihood can never be smaller, so \(D\) is never negative – the real question is whether it's large enough to matter.
This test statistic is compared to a chi-squared distribution with \(df = df_{full} - df_{reduced}\), for reasons we will not go into here :) But the interpretation of this test is quite familiar – it’s a lot like the old partial \(F\) test!
For example, let’s create a comparison model for admissions that just uses GRE. This is nested inside our previous model.
admissions_glm_gre = glm(admit ~ gre,
data = admissions_dat,
family = binomial)
admissions_glm_gre %>% summary()
##
## Call:
## glm(formula = admit ~ gre, family = binomial, data = admissions_dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1623 -0.9052 -0.7547 1.3486 1.9879
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.901344 0.606038 -4.787 1.69e-06 ***
## gre 0.003582 0.000986 3.633 0.00028 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 499.98 on 399 degrees of freedom
## Residual deviance: 486.06 on 398 degrees of freedom
## AIC: 490.06
##
## Number of Fisher Scoring iterations: 4
This model looks okay – I mean, GRE does come back significant, so it’s clearly doing something. But does the larger model, including rank, work better?
We can perform this test with the lrtest function, which is in the lmtest library. It looks like this:
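The call itself isn't shown here; as a sketch, it would look something like this, where admissions_glm is just a placeholder name for the full gre-plus-rank model we fit earlier:
library(lmtest)
# likelihood ratio test: full model (gre + rank) vs. reduced model (gre only)
lrtest(admissions_glm, admissions_glm_gre)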
## Likelihood ratio test
##
## Model 1: admit ~ gre + rank
## Model 2: admit ~ gre
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 5 -232.27
## 2 2 -243.03 -3 21.524 8.192e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Quite a lot like the \(F\) test output, in fact! You can see how different the two models are: the model with rank has 3 extra terms, or degrees of freedom, because of the 3 indicator variables for rank 2, 3, and 4. And, of course, the takeaway: the p-value. This one is real low! So we reject the null hypothesis that the two models are equally effective; we’re confident that adding rank to the model really does improve its predictions.
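If you'd like to see where that chi-squared statistic comes from, here's a sketch that recomputes it from the two fitted models (again treating admissions_glm as a placeholder name for the full model):
# the LR statistic is twice the difference in log-likelihoods:
# 2 * (-232.27 - (-243.03)) is about 21.52
D = 2 * (as.numeric(logLik(admissions_glm)) - as.numeric(logLik(admissions_glm_gre)))
# equivalently, it's the difference in residual deviances: 486.06 - 464.53
# compare to a chi-squared distribution with 3 df (the three extra rank indicators)
pchisq(D, df = 3, lower.tail = FALSE)
Base R's anova(admissions_glm_gre, admissions_glm, test = "Chisq") runs the same comparison if you'd rather not load lmtest.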