6.13 Outliers

Outliers are observations with predictions that are very far from the observed values. In linear regression, outliers signaled a possible problem with the normality and/or constant variance assumptions, particularly in small samples. Logistic regression does not make these assumptions; however, it is still useful to examine outliers to find observations that are not predicted well by the model.

In a logistic regression, the observed values are each 0 or 1 (even if the response is coded as a factor, glm() uses 0s and 1s internally) and the predicted probabilities always lie between 0 and 1, so the differences between them always lie between -1 and 1. The residuals in a logistic regression, called deviance residuals, are more complex than simple differences between observed outcomes and predicted probabilities, however. For a glm object, the usual residual functions resid(), rstandard(), and rstudent() compute deviance residuals and their standardized and Studentized counterparts, respectively.
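For a binary response, the deviance residual has a closed form: the sign of \(y - \hat{p}\) times the square root of \(-2\) times the observation's log-likelihood contribution. The following sketch (simulated data, not the text's example) verifies that this formula matches the default deviance residuals returned by resid().

# Sketch: verify the deviance residual formula on a small simulated fit
set.seed(1)
x <- rnorm(100)
y <- rbinom(100, 1, plogis(0.5 * x))
fit <- glm(y ~ x, family = binomial)
p <- fitted(fit)
# Deviance residual: sign(y - p) * sqrt(-2 * log-likelihood contribution)
d.manual <- sign(y - p) * sqrt(-2 * (y * log(p) + (1 - y) * log(1 - p)))
all.equal(as.numeric(resid(fit, type = "deviance")), as.numeric(d.manual))
## [1] TRUE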

Examine outliers by highlighting points on a residual plot and inspecting their predictor and outcome values. Again, here we are simply examining observations that are not fit well by the model. After identifying them, we do not remove them from the model, although you could carry out a sensitivity analysis to evaluate their impact on your conclusions (a sketch of such an analysis appears after the outlier inspection below).

Example 6.3 (continued): Look for outliers in the model that includes an interaction. The cutoff you choose here is arbitrary. In this example, a cutoff of 2.5 highlights a few observations, and Figure 6.6 shows that one of them in particular stands out as highly unusual.

# Studentized (deviance) residuals
RSTUDENT <- rstudent(fit.ex6.3.int)
# Flag observations with |residual| above the (arbitrary) cutoff of 2.5
SUB <- abs(RSTUDENT) > 2.5
sum(SUB, na.rm = TRUE)
## [1] 3
# Residual plot vs. the linear predictor, with flagged points in black
car::residualPlots(fit.ex6.3.int, terms = ~ 1,
                   tests = FALSE, quadratic = FALSE, fitted = TRUE,
                   type = "rstudent", pch = 20, col = "gray")
points(logit(fitted(fit.ex6.3.int))[SUB],
       RSTUDENT[SUB],
       pch = 20, cex = 1.5)
abline(h = c(-2.5, 2.5), lty = 2)

Figure 6.6: Identifying outliers in a logistic regression

Outliers in a logistic regression are observations with either (a) \(Y = 1\) but predictors at levels at which the predicted probability is very low (resulting in a large positive residual) or (b) \(Y = 0\) but predictors at levels at which the predicted probability is very high (resulting in a large negative residual). If we examine the observation with the large positive residual more closely, we see that it falls into the former category. This individual used marijuana (\(Y = 1\)) but is in the age and income groups with the lowest odds of lifetime marijuana use, and this discrepancy results in a large positive residual.
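To get a feel for the magnitudes involved, the unstandardized deviance residual for a case with \(Y = 1\) reduces to \(\sqrt{-2\log \hat{p}}\), which grows as the predicted probability shrinks. A quick computed illustration (the value of p here is made up, not from the example):

# Deviance residual for Y = 1 at an assumed predicted probability of 0.05
p <- 0.05
round(sign(1 - p) * sqrt(-2 * log(p)), 2)
## [1] 2.45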

# Which row has the large positive residual?
RSTUDENT[SUB]
##   3043  48455  40288 
##  2.951 -2.653 -2.509
# Examine that row
nsduh["3043",
      c("mj_lifetime",
        "alc_agefirst",
        "demog_age_cat6",
        "demog_sex",
        "demog_income")]
##      mj_lifetime alc_agefirst demog_age_cat6 demog_sex      demog_income
## 3043         Yes           29            65+    Female $20,000 - $49,999
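
As mentioned above, rather than removing these observations, you could check their influence with a sensitivity analysis. A minimal sketch, assuming the fitted model and SUB indicator from above (drop.rows and fit.sens are hypothetical names; this code is not from the source):

# Sensitivity analysis (sketch): refit without the flagged rows and
# compare regression coefficients
drop.rows <- names(RSTUDENT)[SUB & !is.na(SUB)]
fit.sens  <- update(fit.ex6.3.int,
                    data = nsduh[!(rownames(nsduh) %in% drop.rows), ])
round(cbind("All rows"  = coef(fit.ex6.3.int),
            "Excluding" = coef(fit.sens)), 3)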

Just to see how extreme the residuals can be for this example, let’s add two individuals to the dataset with the most extreme residuals possible for this model. The first is someone who has used marijuana but whose predictor values are at the levels for which marijuana use is least prevalent: first used alcohol at age 45 years (the maximum value in the dataset), age 65+, male, and income of $20,000 - $49,999. The second is someone who has never used marijuana but whose predictor values are at the levels for which marijuana use is most prevalent: first used alcohol at age 3 years (the minimum value in the dataset), age 18-25, female, and income less than $20,000.

The data for these two hypothetical individuals, as well as their predicted probabilities in the original model, are shown below. The first has a very low predicted probability of lifetime marijuana use, contrary to their reported use, and the second a very high predicted probability, contrary to their reported lack of use.
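The code for constructing these rows is not shown in the text; a sketch like the following would produce the predictions below (new.obs is a hypothetical name; the predictor levels match those in the dataset):

# Sketch (assumed code): two hypothetical individuals with the most
# extreme possible predictor profiles
new.obs <- data.frame(
  alc_agefirst   = c(45, 3),
  demog_age_cat6 = c("65+", "18-25"),
  demog_sex      = c("Male", "Female"),
  demog_income   = c("$20,000 - $49,999", "Less than $20,000"),
  mj_lifetime    = c("Yes", "No"))
# Predicted probabilities from the original (unaugmented) model
new.obs$pred <- predict(fit.ex6.3.int, newdata = new.obs, type = "response")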

##       alc_agefirst demog_age_cat6 demog_sex      demog_income mj_lifetime   pred
## 37290           45            65+      Male $20,000 - $49,999         Yes 0.0017
## 25435            3          18-25    Female Less than $20,000          No 0.9983

Figure 6.7 shows the residual plot after adding these two individuals to the dataset and refitting the model. The dark point in the upper left corresponds to our first anomalous individual (\(Y = 1\) but low predicted probability) and the dark point in the lower right to the other (\(Y = 0\) but high predicted probability). In each case, the large magnitude of the Studentized residual indicates that the observed value is incongruous with the predicted probability. This does not mean there is anything wrong with the model, however, only that there are some individuals who are very unlike the others.
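A sketch of the refit and plot, assuming the new.obs data frame above (nsduh.extreme and fit.extreme are hypothetical names, and the exact code in the source may differ):

# Append the two extreme individuals and refit (sketch, assumed code)
vars <- c("mj_lifetime", "alc_agefirst", "demog_age_cat6",
          "demog_sex", "demog_income")
nsduh.extreme <- rbind(nsduh[, vars], new.obs[, vars])
fit.extreme   <- update(fit.ex6.3.int, data = nsduh.extreme)
RSTUDENT.X    <- rstudent(fit.extreme)
# Residual plot, highlighting the two appended rows (the last two)
car::residualPlots(fit.extreme, terms = ~ 1,
                   tests = FALSE, quadratic = FALSE, fitted = TRUE,
                   type = "rstudent", pch = 20, col = "gray")
idx <- tail(seq_along(RSTUDENT.X), 2)
points(logit(fitted(fit.extreme))[idx], RSTUDENT.X[idx], pch = 20, cex = 1.5)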


Figure 6.7: Extreme outliers in a logistic regression