## 4.8 Model selection and multicollinearity

The same discussion we did in Section 3.7 is applicable to logistic regression with small changes:

1. The deviance of the model (reciprocally the likelihood and the $$R^2$$) always decreases (increase) with the inclusion of more predictors – no matter whether they are significant or not.
2. The excess of predictors in the model is paid by a larger variability in the estimation of the model which results in less precise prediction.
3. Multicollinearity may hide significant variables, change the sign of them and result in an increase of the variability of the estimation at the expense of little improvement in .

The use of information criteria, stepwise and vif allow to efficiently fight back these issues. Let’s review them quickly from the perspective of logistic regression.

First, remember that the BIC/AIC information criteria are based on a balance between the model fitness, given by the likelihood, and its complexity. In the logistic regression, the BIC is \begin{align*} \text{BIC}(\text{model}) &= -2\log \text{lik}(\hat{\boldsymbol{\beta}}) + (k + 1)\times\log n\\ &=D+(k+1)\times \log n, \end{align*}

where $$\text{lik}(\hat{\boldsymbol{\beta}})$$ is the likelihood of the model. The AIC replaces $$\log n$$ by $$2$$, hence penalizing less model complexity. The BIC and AIC can be computed in R through the functions BIC and AIC, and we can check manually that they match with its definition.

# Models
nasa <- glm(fail.field ~ temp, family = "binomial", data = challenger)
nasa2 <- glm(fail.field ~ temp + pres.field, family = "binomial",
data = challenger)

# nasa
summary1 <- summary(nasa)
summary1
##
## Call:
## glm(formula = fail.field ~ temp, family = "binomial", data = challenger)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.0566  -0.7575  -0.3818   0.4571   2.2195
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)   7.5837     3.9146   1.937   0.0527 .
## temp         -0.4166     0.1940  -2.147   0.0318 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 28.267  on 22  degrees of freedom
## Residual deviance: 20.335  on 21  degrees of freedom
## AIC: 24.335
##
## Number of Fisher Scoring iterations: 5
BIC(nasa)
## [1] 26.60584
summary1$deviance + 2 * log(dim(challenger)[1]) ## [1] 26.60584 AIC(nasa) ## [1] 24.33485 summary1$deviance + 2 * 2
## [1] 24.33485

# nasa2
summary2 <- summary(nasa2)
summary2
##
## Call:
## glm(formula = fail.field ~ temp + pres.field, family = "binomial",
##     data = challenger)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.2109  -0.6081  -0.4292   0.3498   2.0913
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)  6.642709   4.038547   1.645   0.1000
## temp        -0.435032   0.197008  -2.208   0.0272 *
## pres.field   0.009376   0.008821   1.063   0.2878
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 28.267  on 22  degrees of freedom
## Residual deviance: 19.078  on 20  degrees of freedom
## AIC: 25.078
##
## Number of Fisher Scoring iterations: 5
BIC(nasa2)
## [1] 28.48469
summary2$deviance + 3 * log(dim(challenger)[1]) ## [1] 28.48469 AIC(nasa2) ## [1] 25.07821 summary2$deviance + 3 * 2
## [1] 25.07821

Second, stepwise works analogously to the linear regression situation. Here is an illustration for a binary variable that measures whether a Boston suburb (Boston dataset) is wealth or not. The binary variable is medv > 25: it is TRUE (1) for suburbs with median house value larger than 25000$) and FALSE (0) otherwise. The cutoff 25000$ corresponds to the 25% richest suburbs.

# Boston dataset
data(Boston)

# Model whether a suburb has a median house value larger than 25000\$
mod <- glm(I(medv > 25) ~ ., data = Boston, family = "binomial")
summary(mod)
##
## Call:
## glm(formula = I(medv > 25) ~ ., family = "binomial", data = Boston)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -3.3498  -0.2806  -0.0932  -0.0006   3.3781
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)  5.312511   4.876070   1.090 0.275930
## crim        -0.011101   0.045322  -0.245 0.806503
## zn           0.010917   0.010834   1.008 0.313626
## indus       -0.110452   0.058740  -1.880 0.060060 .
## chas         0.966337   0.808960   1.195 0.232266
## nox         -6.844521   4.483514  -1.527 0.126861
## rm           1.886872   0.452692   4.168 3.07e-05 ***
## age          0.003491   0.011133   0.314 0.753853
## dis         -0.589016   0.164013  -3.591 0.000329 ***
## rad          0.318042   0.082623   3.849 0.000118 ***
## tax         -0.010826   0.004036  -2.682 0.007314 **
## ptratio     -0.353017   0.122259  -2.887 0.003884 **
## black       -0.002264   0.003826  -0.592 0.554105
## lstat       -0.367355   0.073020  -5.031 4.88e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 563.52  on 505  degrees of freedom
## Residual deviance: 209.11  on 492  degrees of freedom
## AIC: 237.11
##
## Number of Fisher Scoring iterations: 7
r2Log(mod)
## [1] 0.628923

# With BIC - wnds up with only the significant variables and a similar R^2
modBIC <- stepwise(mod, trace = 0)
##
## Direction:  backward/forward
## Criterion:  BIC
summary(modBIC)
##
## Call:
## glm(formula = I(medv > 25) ~ indus + rm + dis + rad + tax + ptratio +
##     lstat, family = "binomial", data = Boston)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -3.3077  -0.2970  -0.0947  -0.0005   3.2552
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)  1.556433   3.948818   0.394 0.693469
## indus       -0.143236   0.054771  -2.615 0.008918 **
## rm           1.950496   0.441794   4.415 1.01e-05 ***
## dis         -0.426830   0.111572  -3.826 0.000130 ***
## rad          0.301060   0.076542   3.933 8.38e-05 ***
## tax         -0.010240   0.003631  -2.820 0.004800 **
## ptratio     -0.404964   0.112086  -3.613 0.000303 ***
## lstat       -0.384823   0.069121  -5.567 2.59e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 563.52  on 505  degrees of freedom
## Residual deviance: 215.03  on 498  degrees of freedom
## AIC: 231.03
##
## Number of Fisher Scoring iterations: 7
r2Log(modBIC)
## [1] 0.6184273

Finally, multicollinearity can also be present in logistic regression. Despite the nonlinear logistic curve, the predictors are combined linearly in (4.4). Due to this, if two or more predictors are highly correlated between them, the fit of the model will be compromised since the individual linear effect of each predictor is hard to disentangle from the rest of correlated predictors.

In addition to inspecting the correlation matrix and look for high correlations, a powerful tool to detect multicollinearity is the generalized Variance Inflation Factor (gVIF) of each coefficient $$\hat\beta_j$$. This is a measure of how much the variability in the estimation has increased due to the addition of the predictor $$X_j$$. The next rule of thumb gives direct insight into which predictors are multicollinear:

• gVIF close to 1: absence of multicollinearity.
• gVIF larger than 5 or 10: problematic amount of multicollinearity. Advised to remove the predictor with largest gVIF.

Here is an example illustrating the use of gVIF, through vif, in practice. It also shows also how the simple inspection of the covariance matrix is not enough for detecting collinearity in tricky situations.

# Create predictors with multicollinearity: x4 depends on the rest
set.seed(45678)
x1 <- rnorm(100)
x2 <- 0.5 * x1 + rnorm(100)
x3 <- 0.5 * x2 + rnorm(100)
x4 <- -x1 + x2 + rnorm(100, sd = 0.25)

# Response
z <- 1 + 0.5 * x1 + 2 * x2 - 3 * x3 - x4
y <- rbinom(n = 100, size = 1, prob = 1/(1 + exp(-z)))
data <- data.frame(x1 = x1, x2 = x2, x3 = x3, x4 = x4, y = y)

# Correlations - none seems suspicius
cor(data)
##            x1         x2         x3         x4           y
## x1  1.0000000 0.38254782  0.2142011 -0.5261464  0.20198825
## x2  0.3825478 1.00000000  0.5167341  0.5673174  0.07456324
## x3  0.2142011 0.51673408  1.0000000  0.2500123 -0.49853746
## x4 -0.5261464 0.56731738  0.2500123  1.0000000 -0.11188657
## y   0.2019882 0.07456324 -0.4985375 -0.1118866  1.00000000

# Generalized variance inflation factors anormal: largest for x4, we remove it
modMultiCo <- glm(y ~ x1 + x2 + x3 + x4, family = "binomial")
vif(modMultiCo)
##       x1       x2       x3       x4
## 27.84756 36.66514  4.94499 36.78817

# Without x4
modClean <- glm(y ~ x1 + x2 + x3, family = "binomial")

# Comparison
summary(modMultiCo)
##
## Call:
## glm(formula = y ~ x1 + x2 + x3 + x4, family = "binomial")
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.4743  -0.3796   0.1129   0.4052   2.3887
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)   1.2527     0.4008   3.125  0.00178 **
## x1           -3.4269     1.8225  -1.880  0.06007 .
## x2            6.9627     2.1937   3.174  0.00150 **
## x3           -4.3688     0.9312  -4.691 2.71e-06 ***
## x4           -5.0047     1.9440  -2.574  0.01004 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 132.81  on 99  degrees of freedom
## Residual deviance:  59.76  on 95  degrees of freedom
## AIC: 69.76
##
## Number of Fisher Scoring iterations: 7
summary(modClean)
##
## Call:
## glm(formula = y ~ x1 + x2 + x3, family = "binomial")
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.0952  -0.4144   0.1839   0.4762   2.5736
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)   0.9237     0.3221   2.868 0.004133 **
## x1            1.2803     0.4235   3.023 0.002502 **
## x2            1.7946     0.5290   3.392 0.000693 ***
## x3           -3.4838     0.7491  -4.651 3.31e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 132.813  on 99  degrees of freedom
## Residual deviance:  68.028  on 96  degrees of freedom
## AIC: 76.028
##
## Number of Fisher Scoring iterations: 6
r2Log(modMultiCo)
## [1] 0.5500437
r2Log(modClean)
## [1] 0.4877884

# Genralized variance inflation factors normal
vif(modClean)
##       x1       x2       x3
## 1.674300 2.724351 3.743940

For the Boston dataset, do the following:

1. Compute the hit matrix and hit ratio for the regression I(medv > 25) ~ . (hint: do table(medv > 25, …)).
2. Fit I(medv > 25) ~ . but now using only the first 300 observations of Boston, the training dataset (hint: use subset).
3. For the previous model, predict the probability of the responses and classify them into 0 or 1 in the last 206 observations, the testing dataset (hint: use predict on that subset).
4. Compute the hit matrix and hit ratio for the new predictions. Check that the hit ratio is smaller than the one in the first point. The hit ratio on the testing dataset, and not the first hit rate, is an estimator of how well the model is going to classify future observations.