5.6 Model selection

The discussion in Section 3.2 applies to generalized linear models with small changes:

  1. The deviance of the model (conversely, the likelihood and the \(R^2\)) always decreases (increases) with the inclusion of more predictors, no matter whether they are significant or not.
  2. An excess of predictors is paid for with a larger variability in the estimation of the model, which results in less precise predictions.
  3. Multicollinearity may hide significant variables, change the sign of their coefficients, and inflate the variability of the estimates.
  4. Roughly speaking, the BIC is consistent in performing model selection and penalizes complex models more heavily than the AIC and LOOCV do.

In addition, variable selection can also be done with the lasso for generalized linear models, as seen in Section 5.8.
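For instance, assuming the glmnet package is available, a minimal sketch of a lasso-penalized logistic regression for the Boston dataset revisited below could look as follows (the details are deferred to Section 5.8):

# Sketch of lasso-penalized logistic regression (assumes the glmnet package
# is installed; see Section 5.8 for the proper treatment)
data(Boston, package = "MASS")
x <- as.matrix(subset(Boston, select = -medv)) # all predictors except medv
y <- as.numeric(Boston$medv > 25)              # binary response
cvLasso <- glmnet::cv.glmnet(x = x, y = y, family = "binomial")
coef(cvLasso, s = "lambda.1se") # zeroed-out coefficients = excluded predictors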

Stepwise selection can be done through MASS::stepAIC, as in linear models.¹⁷⁶ Conveniently, summary also reports the AIC:

# Models
nasa1 <- glm(fail.field ~ temp, family = "binomial", data = challenger)
nasa2 <- glm(fail.field ~ temp + pres.field, family = "binomial",
             data = challenger)

# Summaries
summary(nasa1)
## 
## Call:
## glm(formula = fail.field ~ temp, family = "binomial", data = challenger)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0566  -0.7575  -0.3818   0.4571   2.2195  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   7.5837     3.9146   1.937   0.0527 .
## temp         -0.4166     0.1940  -2.147   0.0318 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 28.267  on 22  degrees of freedom
## Residual deviance: 20.335  on 21  degrees of freedom
## AIC: 24.335
## 
## Number of Fisher Scoring iterations: 5
summary(nasa2)
## 
## Call:
## glm(formula = fail.field ~ temp + pres.field, family = "binomial", 
##     data = challenger)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2109  -0.6081  -0.4292   0.3498   2.0913  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  6.642709   4.038547   1.645   0.1000  
## temp        -0.435032   0.197008  -2.208   0.0272 *
## pres.field   0.009376   0.008821   1.063   0.2878  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 28.267  on 22  degrees of freedom
## Residual deviance: 19.078  on 20  degrees of freedom
## AIC: 25.078
## 
## Number of Fisher Scoring iterations: 5

# AICs
AIC(nasa1) # Better
## [1] 24.33485
AIC(nasa2)
## [1] 25.07821
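Note that the residual deviance still decreases when the non-significant pres.field is added (from 20.335 to 19.078), in agreement with the first point above, yet the AIC worsens. A quick check with the BIC (a sketch, output omitted) points in the same direction, with an even larger gap between both models due to the heavier \(\log(n)\) penalty:

# BICs -- the log(n) penalty widens the gap between both models
BIC(nasa1) # Better
BIC(nasa2)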

MASS::stepAIC works analogously to how it does in linear regression. An illustration is given next for predicting a binary variable that measures whether a Boston suburb (Boston dataset from Section 3.1) is wealthy or not. The binary variable is medv > 25: it is TRUE (1) for suburbs with a median house value larger than $25,000 and FALSE (0) otherwise.¹⁷⁷ The $25,000 cutoff corresponds roughly to the 25% richest suburbs.
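The claim about the cutoff can be quickly checked (a sketch, output omitted):

# The 75% quantile of medv is approximately 25, so medv > 25 flags
# (roughly) the 25% wealthiest suburbs
data(Boston, package = "MASS")
quantile(Boston$medv, probs = 0.75)
mean(Boston$medv > 25) # close to 0.25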

# Boston dataset
data(Boston, package = "MASS")

# Model whether a suburb has a median house value larger than $25000
mod <- glm(I(medv > 25) ~ ., data = Boston, family = "binomial")
summary(mod)
## 
## Call:
## glm(formula = I(medv > 25) ~ ., family = "binomial", data = Boston)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3498  -0.2806  -0.0932  -0.0006   3.3781  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  5.312511   4.876070   1.090 0.275930    
## crim        -0.011101   0.045322  -0.245 0.806503    
## zn           0.010917   0.010834   1.008 0.313626    
## indus       -0.110452   0.058740  -1.880 0.060060 .  
## chas         0.966337   0.808960   1.195 0.232266    
## nox         -6.844521   4.483514  -1.527 0.126861    
## rm           1.886872   0.452692   4.168 3.07e-05 ***
## age          0.003491   0.011133   0.314 0.753853    
## dis         -0.589016   0.164013  -3.591 0.000329 ***
## rad          0.318042   0.082623   3.849 0.000118 ***
## tax         -0.010826   0.004036  -2.682 0.007314 ** 
## ptratio     -0.353017   0.122259  -2.887 0.003884 ** 
## black       -0.002264   0.003826  -0.592 0.554105    
## lstat       -0.367355   0.073020  -5.031 4.88e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 563.52  on 505  degrees of freedom
## Residual deviance: 209.11  on 492  degrees of freedom
## AIC: 237.11
## 
## Number of Fisher Scoring iterations: 7
r2glm(mod)
## [1] 0.628923
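Recall that r2glm stands for the deviance-based \(R^2\) introduced earlier in these notes; a minimal definition consistent with the outputs reported here is:

# Proportion of null deviance explained by the model (assumed definition,
# consistent with the r2glm values shown in this section)
r2glm <- function(model) {
  1 - model$deviance / model$null.deviance
}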

# With BIC -- ends up with only the significant variables and a similar R^2
modBIC <- MASS::stepAIC(mod, trace = 0, k = log(nrow(Boston)))
summary(modBIC)
## 
## Call:
## glm(formula = I(medv > 25) ~ indus + rm + dis + rad + tax + ptratio + 
##     lstat, family = "binomial", data = Boston)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3077  -0.2970  -0.0947  -0.0005   3.2552  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.556433   3.948818   0.394 0.693469    
## indus       -0.143236   0.054771  -2.615 0.008918 ** 
## rm           1.950496   0.441794   4.415 1.01e-05 ***
## dis         -0.426830   0.111572  -3.826 0.000130 ***
## rad          0.301060   0.076542   3.933 8.38e-05 ***
## tax         -0.010240   0.003631  -2.820 0.004800 ** 
## ptratio     -0.404964   0.112086  -3.613 0.000303 ***
## lstat       -0.384823   0.069121  -5.567 2.59e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 563.52  on 505  degrees of freedom
## Residual deviance: 215.03  on 498  degrees of freedom
## AIC: 231.03
## 
## Number of Fisher Scoring iterations: 7
r2glm(modBIC)
## [1] 0.6184273

The logistic model is at the intersection between regression models and classification methods. Therefore, the search for adequate predictors to include in the model can also be carried out in terms of classification performance. Although we do not explore this direction in detail, we mention how the overall predictive accuracy can be summarized with the hit matrix (also called confusion matrix):

| Reality vs. classified | \(\hat Y=0\) | \(\hat Y=1\) |
|------------------------|--------------|--------------|
| \(Y=0\) | Correct\(_{0}\) | Incorrect\(_{01}\) |
| \(Y=1\) | Incorrect\(_{10}\) | Correct\(_{1}\) |

The hit ratio, \(\frac{\text{Correct}_0+\text{Correct}_1}{n},\) is the proportion of correct classifications. The hit matrix is easily computed with the table function which, whenever it is called with two vectors, computes the cross-table between them.

# Fitted probabilities for Y = 1
nasa1$fitted.values
##          1          2          3          4          5          6          7          8          9         10 
## 0.42778935 0.23014393 0.26910358 0.32099837 0.37772880 0.15898364 0.12833090 0.23014393 0.85721594 0.60286639 
##         11         12         13         14         15         16         17         18         19         20 
## 0.23014393 0.04383877 0.37772880 0.93755439 0.37772880 0.08516844 0.23014393 0.02299887 0.07027765 0.03589053 
##         21         22         23 
## 0.08516844 0.07027765 0.82977495

# Classified Y's
yHat <- nasa1$fitted.values > 0.5

# Hit matrix:
# - 16 correctly classified as 0
# - 4 correctly classified as 1
# - 3 incorrectly classified as 0
tab <- table(challenger$fail.field, yHat)
tab
##    yHat
##     FALSE TRUE
##   0    16    0
##   1     3    4

# Hit ratio (proportion of correct classifications)
sum(diag(tab)) / sum(tab)
## [1] 0.8695652

It is important to recall that the hit matrix will always be biased towards unrealistically good classification rates if it is computed on the same sample used for fitting the logistic model. An approach based on data splitting or cross-validation is therefore needed to estimate the hit matrix without bias.
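A minimal sketch of the cross-validation approach, using leave-one-out on the challenger data, is the following (the exercise below explores the data-splitting alternative):

# Leave-one-out estimation of the hit matrix: each observation is classified
# by a model fitted without that observation
n <- nrow(challenger)
yHatCV <- logical(n)
for (i in 1:n) {
  modCV <- glm(fail.field ~ temp, family = "binomial", data = challenger[-i, ])
  yHatCV[i] <- predict(modCV, newdata = challenger[i, ], type = "response") > 0.5
}
tabCV <- table(challenger$fail.field, factor(yHatCV, levels = c(FALSE, TRUE)))
sum(diag(tabCV)) / sum(tabCV) # typically smaller than the in-sample hit ratio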

For the Boston dataset, do the following:

  1. Compute the hit matrix and hit ratio for the regression I(medv > 25) ~ . fitted above.
  2. Fit I(medv > 25) ~ . but now using only the first 300 observations of Boston, the training dataset.
  3. With the previous model, predict the probabilities of the response for the last \(206\) observations of Boston, the testing dataset, and classify them into 0 or 1.
  4. Compute the hit matrix and hit ratio for these new predictions. Check that this hit ratio is smaller than the one obtained in the first point.

  176. The leaps package does not support generalized linear models directly. There are, however, other packages that perform best subset selection for generalized linear models; we do not cover them here.

  177. This is a common way of generating a binary variable from a quantitative variable.