7.5 Deviance and Residuals

7.5.1 Pearson residuals

There are actually a bunch of different kinds of residuals we could think about in the GLM context; I’m just discussing two here. If you want to see more, here’s a pretty decent (though not very detailed) blog post on the topic.

We’ve been doing a lot of talking about residuals. But what are the residuals in this context? Traditionally, we’ve been interested in the discrepancy between the predicted value of a given point \(i\) (the \(E(y_i)\) from our model, which we called \(\hat{\mu}_i\) in linear regression, \(\hat{p}_i\) in logistic regression, and \(\hat{\lambda}_i\) in Poisson regression) and the actual response \(y_i\).

Sometimes we like to standardize these discrepancies: scale them down so that we can easily see which ones are “large.” It’s the same principle as calculating z-scores. One way of doing this is to scale them by \(SE(y_i)\) (i.e., the square root of the estimated variance of \(y_i\)). This gives us the Pearson residuals:

\[r_i = \frac{y_i - \widehat{\mu}_i}{SE(y_i)} = \frac{y_i - g^{-1}(\boldsymbol x_i' \boldsymbol b)}{SE(y_i)}\]

For logistic regression, in particular:

\[r_{P_i} = \frac{y_i - \hat{p}_i}{\sqrt{\hat{p}_i(1-\hat{p}_i)}}\] where \(\hat{\pi_i} = \frac{e^{\boldsymbol x_i' \boldsymbol b}}{1+e^{\boldsymbol x_i' \boldsymbol b}}\).

And for Poisson regression: \[r_{P_i} = \frac{y_i - e^{\boldsymbol x_i' \boldsymbol b}}{\sqrt{e^{\boldsymbol x_i' \boldsymbol b}}}\]

These are the sorts of residuals we’ve been looking at previously, though we haven’t always bothered to standardize them.

7.5.2 Deviance: a new approach

But there’s another way of thinking about residuals in this context, by drawing on the idea of deviance (yes, the same deviance we were examining in the likelihood ratio test). Deviance residuals can be standardized just as residuals from MLR can be.

When we encountered deviance before, it was a way of talking about the difference between a full model and a smaller, reduced model – specifically, in terms of the likelihood of each model. Now, we want to think about the difference between our current model and the saturated model.

What’s the saturated model? The biggest possible model! We’ve thought about the smallest model you could possibly use (just the intercept). What’s the biggest?

Remember that you can’t use any more predictors than that, because then your \(\boldsymbol X'\boldsymbol X\) matrix wouldn’t be full rank! That’s why we call this a “saturated” model: there’s no room for anything more.

We’ve talked about this a bit before: the biggest, most complicated model you can fit has \(n\) coefficients. That’s like having a separate predictor for every single observation. Then you could estimate every point you’ve seen, exactly correctly! So for the saturated model, \(\mu_i = y_i\).

The overall model deviance is defined thus:

\[D = 2\log\left(\frac{L_{sat}(\hat{\beta})}{L_{model}(\hat{\beta})}\right) = 2(\ell_{sat}(\hat{\beta})-\ell_{model}(\hat{\beta}))\]

Compare the formula for the deviance we used for the likelihood ratio test: that one compared the full and reduced model. Here we compare our model of interest to the most full possible model.

In R, you can find deviance information in the summary output:

admissions_glm %>% summary()
## 
## Call:
## glm(formula = admit ~ gre + rank, family = binomial, data = admissions_dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5199  -0.8715  -0.6588   1.1775   2.1113  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.802365   0.672982  -2.678 0.007402 ** 
## gre          0.003224   0.001019   3.163 0.001562 ** 
## rank2       -0.721737   0.313033  -2.306 0.021132 *  
## rank3       -1.291305   0.340775  -3.789 0.000151 ***
## rank4       -1.602054   0.414932  -3.861 0.000113 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 499.98  on 399  degrees of freedom
## Residual deviance: 464.53  on 395  degrees of freedom
## AIC: 474.53
## 
## Number of Fisher Scoring iterations: 4

This is similar to the principle of the overall F-test in MLR (although R does not actually do a test/p-value here). If anything in your model is useful, then your model should be outperforming the intercept-only model: that is, it should be getting closer to the saturated model’s perfect predictions. So the residual deviance should be noticeably lower than the null deviance.

The null deviance shown here is based on the intercept-only model: how different is that intercept-only model from the saturated model? The residual deviance is based on your fitted model: how different is your model from the saturated model?

7.5.3 Deviance residuals

It turns out that the deviance can be broken up into a sum of values that come from each data point, and written as:

Remember that those likelihoods in the definition of deviance are the joint likelihood of the parameters given all the observations, and have the same form as the joint pdf of all the observations given the parameters. So, assuming your observations are independent, they can be written as the product of the individual pdfs of each observation. When you take the log, you get a sum over all the observations. Each observation’s contribution to that sum is its contribution to the overall deviance!

\[D = \sum_{i=1}^n d_i\]

…where \(d_i\) represents the deviance due to data point \(i\).

Using this fact, the deviance residual for an individual point is defined:

\[dev_i = sign(y_i - \mu_i) \sqrt{d_i}\]

Fun fact: this means that \[D = \sum_{i=1}^n dev^2_i\] Look familiar? That’s a sum of squared (deviance) residuals. Just as with the old SSR, we use it to tell us about how good the model’s fit is overall!

This is an interesting expression! Note that the deviance residuals use the sign of “actual minus predicted.” So just like the old days, if you underestimated the response, you get a positive residual, whereas if you guessed too high, you get a negative residual. But here, the magnitude of the residual is defined using that individual deviance. A point with a large residual is one that “makes the coefficients look bad” – one that’s responsible for a big chunk of deviation between our model and the perfect saturated model.

You can get R to tell you several different kinds of residuals using the residuals function and the argument type. For example:

admissions_glm %>% residuals(type = "pearson") %>% head()
##          1          2          3          4          5          6 
## -0.3928639  1.6208406  0.6781537  1.9553287 -0.4214758  1.0376569
admissions_glm %>% residuals(type = "deviance") %>% head()
##          1          2          3          4          5          6 
## -0.5357807  1.6052664  0.8698997  1.7739563 -0.5718637  1.2089628
admissions_glm %>% residuals(type = "response") %>% head()
##          1          2          3          4          5          6 
## -0.1337056  0.7242995  0.3150180  0.7926735 -0.1508454  0.5184742

Those last ones ("response") are on the response scale – in this case, in terms of \(p\), the probability of admission. That’s why they’re all less than 1. We might underestimate or overestimate someone’s probability of admission, but we can’t be off by more than 1!

There is definitely more to be said about this subject, but I have to leave you something to do in grad school :) The important thing for the purposes of this course is to be aware of deviance, and the fact that there are multiple ways to think about residuals.

Which residuals should you care about? As usual, it depends. Response residuals are handy if you want to explain a particular data point to someone who’s not statistically trained, since they’re on a more intuitive scale. Pearson residuals (and other standardized residuals) are helpful for trying to see if a point is really unusual, since they’re scaled, like z-scores. Deviance residuals make a lot of sense if you want to be consistent about the math you’re using – they are based on likelihood, and in GLMs, your model fitting is also based on maximum likelihood. And as always, you may have a client or audience that’s used to one or the other!