Cook’s Distance and Leverage

Sometimes individual observations exert a great deal of influence on our fitted model. One routine way of checking for this is to fit the model \(n\) times, omitting each observation in turn. If we remove the \(i\)th observation and compare the fitted values from the full model, say \(\hat{y}_{j}\), to those obtained when that point is removed, denoted by \(\hat{y}_{j(i)}\), then

\[\mbox{Cook's Distance} \propto \sum_{j=1}^{n}(\hat{y}_j - \hat{y}_{j(i)})^2.\]
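This sum can be computed directly by refitting the model \(n\) times. A minimal sketch on simulated data (the rodent data appear later in this section; the variables here are purely illustrative):

set.seed(1)
n <- 30
x <- rnorm(n)
y <- 2 + 0.5 * x + rnorm(n)
full <- lm(y ~ x)

# For each i, refit without observation i and sum the squared changes
# in the fitted values across all n points.
loo_influence <- sapply(seq_len(n), function(i) {
  fit_i <- lm(y ~ x, subset = -i)
  yhat_i <- predict(fit_i, newdata = data.frame(x = x))
  sum((fitted(full) - yhat_i)^2)
})

# The constant of proportionality is 1 / ((p + 1) * s^2), where s^2 is
# the residual variance estimate from the full model.
s2 <- summary(full)$sigma^2
all.equal(loo_influence / (length(coef(full)) * s2),
          unname(cooks.distance(full)))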

Observations with a high Cook’s distance are potentially influential. Recall that the fitted values can be written as

\[\begin{aligned} \boldsymbol{\hat{Y}} & = \boldsymbol{X} \boldsymbol{\hat{\beta}}\\ & = \boldsymbol{X} (\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{Y}\\ & = \boldsymbol{H}\boldsymbol{Y}. \end{aligned}\]
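To see this in action, here is a small sketch that builds \(\boldsymbol{H}\) explicitly from any fitted lm object model (such as the rodent model fitted later in this section) and checks that \(\boldsymbol{H}\boldsymbol{Y}\) reproduces the fitted values:

X <- model.matrix(model)                    # design matrix
H <- X %*% solve(crossprod(X)) %*% t(X)     # H = X (X'X)^{-1} X'
y <- model.response(model.frame(model))     # response used in the fit

# H %*% y should equal the fitted values.
all.equal(drop(H %*% y), fitted(model), check.attributes = FALSE)

# The diagonal of H gives the leverages, also available as hatvalues().
all.equal(diag(H), hatvalues(model), check.attributes = FALSE)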

The matrix \(\boldsymbol{H}\) is called the hat matrix, because it “puts the hat on” \(\boldsymbol{Y}\). Its diagonal entries \(h_{ii}\) measure how strongly the \(i\)th observed value pulls the fitted value at that point towards itself: large hat values indicate observations with high leverage. Leverages are judged relative to the average leverage \[\frac{p+1}{n},\] where \(p\) is the number of predictors.
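In R the leverages are available via hatvalues(). A minimal sketch, where model is any fitted lm object; the two-times-average cut-off is a common rule of thumb, not something prescribed here:

h <- hatvalues(model)

# Average leverage (p + 1)/n: length(coef(model)) counts the intercept
# plus the p predictors, length(h) is the number of observations n.
h_bar <- length(coef(model)) / length(h)

# Flag observations whose leverage is well above average.
h[h > 2 * h_bar]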

A plot of all the leave-one-out regression lines for the Rodent data shows that the porcupine exerts a great deal of influence on the model: when this point is omitted, the fitted line changes from horizontal (blue line) to one with a strong positive slope (red line).

There is a good reason why porcupines do not follow the relationship between mass and speed that we see in the other rodents. For most rodents, speed is their defence against predators, whereas the porcupine is protected by its sharp spines, so it does not need to move as quickly as other rodents.

Ideally we would fit a regression line excluding each data point in turn. However, refitting \(n\) models quickly becomes infeasible for large numbers of data points.
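Fortunately, for linear models no refitting is actually required: Cook’s distance has a closed form in terms of the single full fit,

\[D_i = \frac{r_i^2}{p+1} \cdot \frac{h_{ii}}{1-h_{ii}},\]

where \(r_i\) is the \(i\)th standardized residual and \(h_{ii}\) the \(i\)th leverage. A quick sketch checking this identity against R’s built-in cooks.distance(), for any lm object model:

h <- hatvalues(model)
r <- rstandard(model)   # internally studentized residuals

# Closed-form Cook's distance; length(coef(model)) is p + 1.
D <- r^2 / length(coef(model)) * h / (1 - h)
all.equal(D, cooks.distance(model))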

In R, Cook’s distance is part of the standard diagnostics of a linear model fit: cooks.distance() returns the values themselves, and autoplot() from the ggfortify package draws the diagnostic plot (which = 4).

library(ggfortify)  # provides autoplot() methods for lm objects

model <- lm(log(Speed) ~ log(Mass), data = rodent)
autoplot(model, which = 4)
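The underlying values can also be inspected directly. A small sketch, assuming the model fitted above:

# Named vector of Cook's distances, one per rodent; the largest
# values belong to the most influential observations.
cd <- cooks.distance(model)
head(sort(cd, decreasing = TRUE), 3)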

The helper functions below give the standardized residual that corresponds to a fixed Cook’s distance at a given leverage (solving the closed-form identity above for \(r_i\)), so that contour lines for chosen values of \(D\) can be overlaid on the residuals-versus-leverage plot.

# Standardized residual r such that Cook's distance equals `level`
# at the given leverage: r = sqrt(level * (p + 1) * (1 - h) / h).
cd_cont_pos <- function(leverage, level, model) {
  sqrt(level * length(coef(model)) * (1 - leverage) / leverage)
}
cd_cont_neg <- function(leverage, level, model) {
  -cd_cont_pos(leverage, level, model)
}

autoplot(model, which = 5) +
  stat_function(fun = cd_cont_pos, args = list(level = 0.5, model = model),
                xlim = c(0, 0.4), lty = 2, colour = "red") +
  stat_function(fun = cd_cont_neg, args = list(level = 0.5, model = model),
                xlim = c(0, 0.4), lty = 2, colour = "red") +
  stat_function(fun = cd_cont_pos, args = list(level = 1, model = model),
                xlim = c(0, 0.4), lty = 3, colour = "red") +
  stat_function(fun = cd_cont_neg, args = list(level = 1, model = model),
                xlim = c(0, 0.4), lty = 3, colour = "red") +
  scale_y_continuous(limits = c(-5, 3.5))

Again the North American Porcupine stands out, this time by its large Cook’s distance in the residuals-versus-leverage plot. The R package car also has a function, outlierTest, which performs a formal test for detecting an outlier.

library(car)
outlierTest(model)
##                          rstudent unadjusted p-value Bonferroni p
## North American Porcupine 4.026755         0.00066086       0.0152

The details are omitted here, but you can check out this function in R. In summary, the output flags observations that could be outliers, based on the largest absolute studentized residual; the Bonferroni p-value is the unadjusted p-value multiplied by the number of observations tested. In this example, the North American Porcupine stands out.

?car::outlierTest