5.5 Deviance

The deviance is a key concept in generalized linear models. Intuitively, it measures the deviation of the fitted generalized linear model from a perfect model for the sample \(\{(\mathbf{x}_i,Y_i)\}_{i=1}^n\). This perfect model, known as the saturated model, is the model that perfectly fits the data, in the sense that the fitted responses (\(\hat Y_i\)) equal the observed responses (\(Y_i\)). For example, in logistic regression this would be the model such that

\[
\hat{\mathbb{P}}[Y=1|X_1=X_{i1},\ldots,X_p=X_{ip}]=Y_i,\quad i=1,\ldots,n.
\]

Figure 5.9 shows a¹ saturated model and a fitted logistic model.


Figure 5.9: Fitted logistic regression versus a saturated model and the null model.

Formally, the deviance is defined through the difference of the log-likelihoods between the fitted model, \(\ell(\hat{\boldsymbol\beta})\), and the saturated model, \(\ell_s\). Computing \(\ell_s\) amounts to substituting \(\mu_i\) by \(Y_i\) in (5.20). If the canonical link function is used, this corresponds to setting \(\theta_i=g(Y_i)\) (recall (5.12)). The deviance is then defined as:

\[
D:=2\left[\ell_s-\ell(\hat{\boldsymbol\beta})\right]\phi.
\]

The log-likelihood \(\ell(\hat{\boldsymbol\beta})\) is always smaller than \(\ell_s\) (the saturated model is the most likely model given the sample, since it is the sample itself). As a consequence, the deviance is always greater than or equal to zero, being zero only if the fit of the model is perfect.
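For binary logistic regression, the saturated model assigns probability one to each observed response, so \(\ell_s=0\) and the deviance reduces to \(-2\ell(\hat{\boldsymbol\beta})\). A minimal numerical check of this (a sketch using the built-in mtcars dataset and an arbitrary illustrative model, not part of this section's running examples):

# For 0/1 responses the saturated log-likelihood is zero, so D = -2 * loglik
fit <- glm(am ~ hp, family = "binomial", data = mtcars)
deviance(fit)                 # deviance reported by R
-2 * as.numeric(logLik(fit))  # 2 * (ell_s - ell(beta_hat)), with ell_s = 0
# Both numbers coincide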

If the canonical link function is employed, the deviance can be expressed as

\[
D=-\frac{2}{a(\phi)}\sum_{i=1}^n\left(Y_i\hat\theta_i-b(\hat\theta_i)-Y_ig(Y_i)+b(g(Y_i))\right)\phi=\frac{2\phi}{a(\phi)}\sum_{i=1}^n\left(Y_i\left(g(Y_i)-\hat\theta_i\right)-b(g(Y_i))+b(\hat\theta_i)\right). \tag{5.30}
\]

In most cases, \(a(\phi)=\phi\), so the deviance does not depend on \(\phi\). Expression (5.30) is interesting, since it delivers the following key insight:

The deviance generalizes the Residual Sum of Squares (RSS) of the linear model. The generalization is driven by the likelihood and its equivalence with the RSS in the linear model.

To see this insight, let’s consider the linear model in (5.30) by setting \(\phi=\sigma^2\), \(a(\phi)=\phi\), \(b(\theta)=\frac{\theta^2}{2}\), \(c(y,\phi)=-\frac{1}{2}\left\{\frac{y^2}{\phi}+\log(2\pi\phi)\right\}\), and \(\theta=\mu=\eta\).² Then:

\[
D=\frac{2\sigma^2}{\sigma^2}\sum_{i=1}^n\left(Y_i(Y_i-\hat\eta_i)-\frac{Y_i^2}{2}+\frac{\hat\eta_i^2}{2}\right)=\sum_{i=1}^n\left(2Y_i^2-2Y_i\hat\eta_i-Y_i^2+\hat\eta_i^2\right)=\sum_{i=1}^n\left(Y_i-\hat\eta_i\right)^2=\mathrm{RSS}(\hat{\boldsymbol\beta}),
\]

since \(\hat\eta_i=\hat\beta_0+\hat\beta_1x_{i1}+\cdots+\hat\beta_px_{ip}\). Remember that \(\mathrm{RSS}(\hat{\boldsymbol\beta})\) is just another name for the SSE.
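This identity is easy to verify numerically by fitting a linear model as a Gaussian glm (a sketch with mtcars; any linear fit would do):

# A linear model as a glm: gaussian family with (canonical) identity link
fitGlm <- glm(mpg ~ hp + wt, family = "gaussian", data = mtcars)
deviance(fitGlm)                            # deviance of the gaussian glm
sum(residuals(fitGlm, type = "response")^2) # RSS -- the same number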

A benchmark for evaluating the scale of the deviance is the null deviance,

\[
D_0:=2\left[\ell_s-\ell(\hat\beta_0)\right]\phi,
\]

which is the deviance of the model without predictors, that is, the model featuring only an intercept, with respect to the perfect model. In logistic regression, this model is

\[
Y|(X_1=x_1,\ldots,X_p=x_p)\sim\mathrm{Ber}(\mathrm{logistic}(\beta_0))
\]

and, in this case, \(\hat\beta_0=\mathrm{logit}(m/n)=\log\frac{m/n}{1-m/n}\), where \(m\) is the number of 1’s in \(Y_1,\ldots,Y_n\) (see Figure 5.9).
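The closed-form expression for \(\hat\beta_0\) is straightforward to verify (a sketch where the binary am variable of mtcars plays the role of \(Y\)):

# Intercept-only logistic regression: the fitted intercept is logit(m / n)
m <- sum(mtcars$am) # number of 1's in the response
n <- nrow(mtcars)
coef(glm(am ~ 1, family = "binomial", data = mtcars)) # beta0_hat
log((m / n) / (1 - m / n))                            # logit(m / n) -- same value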

Using again (5.30), we can see that the null deviance is a generalization of the total sum of squares of the linear model (see Section 2.6):

\[
D_0=\sum_{i=1}^n\left(Y_i-\hat\eta_i\right)^2=\sum_{i=1}^n\left(Y_i-\hat\beta_0\right)^2=\mathrm{SST},
\]

since \(\hat\beta_0=\bar Y\) because there are no predictors.
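As before, this can be checked with a Gaussian glm, whose 'Null deviance' entry equals the total sum of squares (a sketch with mtcars):

# The null deviance of a gaussian glm is the total sum of squares
fitGlm <- glm(mpg ~ hp + wt, family = "gaussian", data = mtcars)
fitGlm$null.deviance                   # null deviance
sum((mtcars$mpg - mean(mtcars$mpg))^2) # SST -- the same number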


Figure 5.10: Pictorial representation of the deviance (\(D\)) and the null deviance (\(D_0\)).

Using the deviance and the null deviance, we can compare how much the model has improved by adding the predictors \(X_1,\ldots,X_p\) and quantify the percentage of deviance explained. This can be done by means of the \(R^2\) statistic, which is a generalization of the coefficient of determination of linear regression:

\[
R^2:=1-\frac{D}{D_0}\overset{\text{linear model}}{=}1-\frac{\mathrm{SSE}}{\mathrm{SST}}.
\]

The \(R^2\) for generalized linear models shares the same philosophy as the coefficient of determination in linear regression: it is a proportion measuring how good the model fit is. If the fit is perfect, then \(D=0\) and \(R^2=1\). If the predictors are unrelated to \(Y\) under the model, then \(D=D_0\) and \(R^2=0\).

However, this \(R^2\) has a different interpretation than in linear regression. In particular:

  • It is not the percentage of variance explained by the model, but rather a ratio indicating how close the fit is to being perfect or to being the worst possible.
  • It is not related to any correlation coefficient.

Both deviances are returned by summary. It is important to recall that R refers to the deviance as the 'Residual deviance', whereas the null deviance is referred to as the 'Null deviance'.

# Summary of model
nasa <- glm(fail.field ~ temp, family = "binomial", data = challenger)
summaryLog <- summary(nasa)
summaryLog
## 
## Call:
## glm(formula = fail.field ~ temp, family = "binomial", data = challenger)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0566  -0.7575  -0.3818   0.4571   2.2195  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   7.5837     3.9146   1.937   0.0527 .
## temp         -0.4166     0.1940  -2.147   0.0318 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 28.267  on 22  degrees of freedom
## Residual deviance: 20.335  on 21  degrees of freedom
## AIC: 24.335
## 
## Number of Fisher Scoring iterations: 5
# 'Residual deviance' is the deviance; 'Null deviance' is the null deviance

# Null model (only intercept)
null <- glm(fail.field ~ 1, family = "binomial", data = challenger)
summaryNull <- summary(null)
summaryNull
## 
## Call:
## glm(formula = fail.field ~ 1, family = "binomial", data = challenger)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -0.852  -0.852  -0.852   1.542   1.542  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  -0.8267     0.4532  -1.824   0.0681 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 28.267  on 22  degrees of freedom
## Residual deviance: 28.267  on 22  degrees of freedom
## AIC: 30.267
## 
## Number of Fisher Scoring iterations: 4

# Compute the R^2 with a function -- useful for repetitive computations
r2glm <- function(model) {

  summaryLog <- summary(model)
  1 - summaryLog$deviance / summaryLog$null.deviance

}

# R^2
r2glm(nasa)
## [1] 0.280619
r2glm(null)
## [1] -4.440892e-16

A quantity related to the deviance is the scaled deviance:

\[
D^*:=\frac{D}{\phi}=2\left[\ell_s-\ell(\hat{\boldsymbol\beta})\right].
\]

If \(\phi=1\), such as in the binomial or Poisson regression models, then the deviance and the scaled deviance agree. The scaled deviance has asymptotic distribution

\[
D^*\overset{a}{\sim}\chi^2_{n-p-1}, \tag{5.32}
\]

where \(\chi^2_k\) is the Chi-squared distribution with \(k\) degrees of freedom. In the case of the linear model, \(D^*=\frac{1}{\sigma^2}\mathrm{RSS}\) is exactly distributed as a \(\chi^2_{n-p-1}\).
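The exactness in the linear model can be illustrated with a small simulation (a sketch; the sample size, coefficients, and \(\sigma^2\) below are arbitrary choices):

# Simulation check of (5.32) in the linear model, where it holds exactly
set.seed(1)
n <- 50; p <- 2; sigma2 <- 4
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
Dstar <- replicate(1e3, {
  Y <- 1 + X %*% c(0.5, -1) + rnorm(n, sd = sqrt(sigma2))
  deviance(glm(Y ~ X)) / sigma2  # D* = RSS / sigma^2 for the gaussian glm
})
mean(Dstar) # close to the chi^2 mean, n - p - 1 = 47
n - p - 1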

The result (5.32) provides a way of estimating \(\phi\) when it is unknown: match \(D^*\) with the expectation \(\mathbb{E}[\chi^2_{n-p-1}]=n-p-1\). This provides

\[
\hat\phi_D:=\frac{D}{n-p-1},
\]

which, as expected, in the case of the linear model is equivalent to \(\hat\sigma^2\) as given in (2.15).
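In R, \(\hat\phi_D\) can be obtained by dividing the deviance by its residual degrees of freedom. A minimal sketch for the Gaussian case (using mtcars; here \(\hat\phi_D\) must agree with the dispersion estimate that summary reports):

# Deviance-based estimate of the dispersion phi
fitGlm <- glm(mpg ~ hp + wt, family = "gaussian", data = mtcars)
deviance(fitGlm) / df.residual(fitGlm) # phi_hat_D = D / (n - p - 1)
summary(fitGlm)$dispersion             # R's dispersion estimate -- same value

More importantly, the scaled deviance can be used for performing hypothesis tests on sets of coefficients of a generalized linear model.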

Assume we have one model, say \(M_2\), with \(p_2\) predictors, and another model, say \(M_1\), with \(p_1<p_2\) predictors that are contained in the set of predictors of \(M_2\). In other words, assume \(M_1\) is nested within \(M_2\). Then we can test the null hypothesis that the extra coefficients of \(M_2\) are simultaneously zero. For example, if \(M_1\) has the coefficients \(\{\beta_0,\beta_1,\ldots,\beta_{p_1}\}\) and \(M_2\) has coefficients \(\{\beta_0,\beta_1,\ldots,\beta_{p_1},\beta_{p_1+1},\ldots,\beta_{p_2}\}\), we can test

\[
H_0:\beta_{p_1+1}=\cdots=\beta_{p_2}=0\quad\text{vs.}\quad H_1:\beta_j\neq 0\ \text{for some}\ p_1<j\leq p_2.
\]

This can be done by means of the statistic³

\[
D^*_{p_1}-D^*_{p_2}\overset{a}{\underset{H_0}{\sim}}\chi^2_{p_2-p_1}. \tag{5.33}
\]

If \(H_0\) is true,⁴ then \(D^*_{p_1}-D^*_{p_2}\) is expected to be small, thus we will reject \(H_0\) if the value of the statistic is above the \(\alpha\)-upper quantile of the \(\chi^2_{p_2-p_1}\) distribution, denoted \(\chi^2_{\alpha;p_2-p_1}\).

The scaled deviance \(D^*\) apparently removes the effect of \(\phi\), but it still depends on \(\phi\), since \(\phi\) is hidden inside the log-likelihoods (see (5.30)). Therefore, \(D^*\) cannot be computed unless \(\phi\) is known, which forbids using (5.33) directly. Fortunately, this dependence can be removed by employing (5.32) and (5.33) together and assuming that they are asymptotically independent. This gives the F-test for \(H_0\):

\[
F=\frac{(D^*_{p_1}-D^*_{p_2})/(p_2-p_1)}{D^*_{p_2}/(n-p_2-1)}=\frac{(D_{p_1}-D_{p_2})/(p_2-p_1)}{D_{p_2}/(n-p_2-1)}\overset{a}{\underset{H_0}{\sim}}F_{p_2-p_1,\,n-p_2-1}.
\]

Note that \(F\) is perfectly computable, since \(\phi\) cancels in the quotient (and because we assume that \(a(\phi)=\phi\)).

Note also that this is an extension of the F-test seen in Section 2.6: take \(p_1=0\) and \(p_2=p\), and then one tests the significance of all the predictors included in the model (both models contain an intercept).
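When \(\phi=1\), the χ²-test (5.33) can also be replicated by hand from the deviances. A minimal sketch with two nested logistic models on mtcars (models and variables chosen only for illustration):

# Manual chi-square test for two nested binomial models (phi = 1)
fit1 <- glm(am ~ hp, family = "binomial", data = mtcars)      # p1 = 1
fit2 <- glm(am ~ hp + wt, family = "binomial", data = mtcars) # p2 = 2
stat <- deviance(fit1) - deviance(fit2)         # D_{p1} - D_{p2}
dfDiff <- df.residual(fit1) - df.residual(fit2) # p2 - p1
pchisq(stat, df = dfDiff, lower.tail = FALSE)   # p-value of the test
anova(fit1, fit2, test = "Chisq")               # same p-value via anova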

The computation of deviances and associated tests is done through anova, which implements the Analysis of Deviance. This is illustrated in the following code, which coincidentally also illustrates the inclusion of nonlinear transformations of the predictors.

# Polynomial predictors
nasa0 <- glm(fail.field ~ 1, family = "binomial", data = challenger)
nasa1 <- glm(fail.field ~ temp, family = "binomial", data = challenger)
nasa2 <- glm(fail.field ~ poly(temp, degree = 2), family = "binomial",
             data = challenger)
nasa3 <- glm(fail.field ~ poly(temp, degree = 3), family = "binomial",
             data = challenger)

# Plot fits
temp <- seq(-1, 35, l = 200)
tt <- data.frame(temp = temp)
plot(fail.field ~ temp, data = challenger, pch = 16, xlim = c(-1, 30),
     xlab = "Temperature", ylab = "Incident probability")
lines(temp, predict(nasa0, newdata = tt, type = "response"), col = 1)
lines(temp, predict(nasa1, newdata = tt, type = "response"), col = 2)
lines(temp, predict(nasa2, newdata = tt, type = "response"), col = 3)
lines(temp, predict(nasa3, newdata = tt, type = "response"), col = 4)
legend("bottomleft", legend = c("Null model", "Linear", "Quadratic", "Cubic"),
       lwd = 2, col = 1:4)


# R^2's
r2glm(nasa0)
## [1] -4.440892e-16
r2glm(nasa1)
## [1] 0.280619
r2glm(nasa2)
## [1] 0.3138925
r2glm(nasa3)
## [1] 0.4831863

# Chisq and F tests -- same results since phi is known
anova(nasa1, test = "Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: fail.field
## 
## Terms added sequentially (first to last)
## 
## 
##      Df Deviance Resid. Df Resid. Dev Pr(>Chi)   
## NULL                    22     28.267            
## temp  1   7.9323        21     20.335 0.004856 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(nasa1, test = "F")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: fail.field
## 
## Terms added sequentially (first to last)
## 
## 
##      Df Deviance Resid. Df Resid. Dev      F   Pr(>F)   
## NULL                    22     28.267                   
## temp  1   7.9323        21     20.335 7.9323 0.004856 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Incremental comparisons of nested models
anova(nasa1, nasa2, nasa3, test = "Chisq")
## Analysis of Deviance Table
## 
## Model 1: fail.field ~ temp
## Model 2: fail.field ~ poly(temp, degree = 2)
## Model 3: fail.field ~ poly(temp, degree = 3)
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
## 1        21     20.335                       
## 2        20     19.394  1   0.9405   0.3321  
## 3        19     14.609  1   4.7855   0.0287 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Quadratic effects are not significant

# Cubic vs. linear
anova(nasa1, nasa3, test = "Chisq")
## Analysis of Deviance Table
## 
## Model 1: fail.field ~ temp
## Model 2: fail.field ~ poly(temp, degree = 3)
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
## 1        21     20.335                       
## 2        19     14.609  2    5.726   0.0571 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Example in Poisson regression
species1 <- glm(Species ~ ., data = species, family = poisson)
species2 <- glm(Species ~ Biomass * pH, data = species, family = poisson)

# Comparison
anova(species1, species2, test = "Chisq")
## Analysis of Deviance Table
## 
## Model 1: Species ~ pH + Biomass
## Model 2: Species ~ Biomass * pH
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1        86     99.242                          
## 2        84     83.201  2    16.04 0.0003288 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
r2glm(species1)
## [1] 0.7806071
r2glm(species2)
## [1] 0.8160674

  1. Several are possible, depending on the interpolation between points.

  2. The canonical link function \(g\) is the identity; check (5.12) and (5.13).

  3. Note that \(D^*_{p_1}>D^*_{p_2}\) because \(\ell(\hat{\boldsymbol\beta}_{p_1})<\ell(\hat{\boldsymbol\beta}_{p_2})\).

  4. In other words, \(M_2\) does not add significant predictors when compared with the submodel \(M_1\).