2 Stepwise methods
2.1 Introduction
We have seen that fitting all the models to select the best one may be computationally intensive. Stepwise methods decrease the number of models to fit by adding (forward) or removing (backward) one variable at each step.
In backward stepwise, we fit with all the predictors in the model. We then remove the predictor with the lowest contribution to the model. This can be based on the change of AIC or some other statistic, if the variable is removed. It is also common to remove the predictor with the highest p-value. One keeps removing variables until the removal of any other predictor will decrease the prediction ability (or all the predictors have a significant p-value).
The forward stepwise starts by choosing the predictor with best prediction ability. Then, with that predictor in the model, looks for the next predictor that most improves the model. This process stops when no more predictors improve the model.
Despite being computationally appealing, stepwise methods don’t necessarily choose the correct model, or even the best one.
2.2 Readings
Read the following chapters of An introduction to statistical learning:
- 6.1.2 Stepwise Selection
2.3 Practical session
Task 1 - Fit a linear model with stepwise backward
As in the previous section, we will use the fat dataset in the library(faraway), we want to fit a linear model to predict body fat (variable brozek) using the other variables available, except for siri (another way of computing body fat), density (it is used in the brozek and siri formulas) and free (it is computed using brozek formula).
Let’s first read the data
#install.packages(c("leaps", "faraway"))
library(leaps) #function to search for the best model
library(faraway) #has the dataset fat
set.seed(1221)
## brozek siri density age weight height adipos free neck chest abdom hip
## 1 12.6 12.3 1.0708 23 154.25 67.75 23.7 134.9 36.2 93.1 85.2 94.5
## 2 6.9 6.1 1.0853 22 173.25 72.25 23.4 161.3 38.5 93.6 83.0 98.7
## 3 24.6 25.3 1.0414 22 154.00 66.25 24.7 116.0 34.0 95.8 87.9 99.2
## 4 10.9 10.4 1.0751 26 184.75 72.25 24.9 164.7 37.4 101.8 86.4 101.2
## 5 27.8 28.7 1.0340 24 184.25 71.25 25.6 133.1 34.4 97.3 100.0 101.9
## 6 20.6 20.9 1.0502 24 210.25 74.75 26.5 167.0 39.0 104.5 94.4 107.8
## thigh knee ankle biceps forearm wrist
## 1 59.0 37.3 21.9 32.0 27.4 17.1
## 2 58.7 37.3 23.4 30.5 28.9 18.2
## 3 59.6 38.9 24.0 28.8 25.2 16.6
## 4 60.1 37.3 22.8 32.4 29.4 18.2
## 5 63.2 42.2 24.0 32.2 27.7 17.7
## 6 66.0 42.0 25.6 35.7 30.6 18.8
The function regsubset()
that we have used before, it also implements
backward and forward selection.
We will start with backward selection for variables selection.
############################################################
#BACKWARD SELECTION
############################################################
model.bwrd <- regsubsets(brozek ~ age + weight +
height + adipos +
neck + chest +
abdom + hip + thigh +
knee + ankle +
biceps + forearm +
wrist,
data = fat,
nvmax = 14,
method = "backward")
summary(model.bwrd)
## Subset selection object
## Call: regsubsets.formula(brozek ~ age + weight + height + adipos +
## neck + chest + abdom + hip + thigh + knee + ankle + biceps +
## forearm + wrist, data = fat, nvmax = 14, method = "backward")
## 14 Variables (and intercept)
## Forced in Forced out
## age FALSE FALSE
## weight FALSE FALSE
## height FALSE FALSE
## adipos FALSE FALSE
## neck FALSE FALSE
## chest FALSE FALSE
## abdom FALSE FALSE
## hip FALSE FALSE
## thigh FALSE FALSE
## knee FALSE FALSE
## ankle FALSE FALSE
## biceps FALSE FALSE
## forearm FALSE FALSE
## wrist FALSE FALSE
## 1 subsets of each size up to 14
## Selection Algorithm: backward
## age weight height adipos neck chest abdom hip thigh knee ankle biceps
## 1 ( 1 ) " " " " " " " " " " " " "*" " " " " " " " " " "
## 2 ( 1 ) " " "*" " " " " " " " " "*" " " " " " " " " " "
## 3 ( 1 ) " " "*" " " " " " " " " "*" " " " " " " " " " "
## 4 ( 1 ) " " "*" " " " " " " " " "*" " " " " " " " " " "
## 5 ( 1 ) "*" "*" " " " " " " " " "*" " " " " " " " " " "
## 6 ( 1 ) "*" "*" " " " " " " " " "*" " " "*" " " " " " "
## 7 ( 1 ) "*" "*" " " " " "*" " " "*" " " "*" " " " " " "
## 8 ( 1 ) "*" "*" " " " " "*" " " "*" "*" "*" " " " " " "
## 9 ( 1 ) "*" "*" " " " " "*" " " "*" "*" "*" " " " " "*"
## 10 ( 1 ) "*" "*" " " " " "*" " " "*" "*" "*" " " "*" "*"
## 11 ( 1 ) "*" "*" "*" " " "*" " " "*" "*" "*" " " "*" "*"
## 12 ( 1 ) "*" "*" "*" " " "*" "*" "*" "*" "*" " " "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" " " "*" "*"
## 14 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## forearm wrist
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " "*"
## 4 ( 1 ) "*" "*"
## 5 ( 1 ) "*" "*"
## 6 ( 1 ) "*" "*"
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) "*" "*"
## 9 ( 1 ) "*" "*"
## 10 ( 1 ) "*" "*"
## 11 ( 1 ) "*" "*"
## 12 ( 1 ) "*" "*"
## 13 ( 1 ) "*" "*"
## 14 ( 1 ) "*" "*"
#prints the Cp and matrix of predictors
cbind(summary(model.bwrd)$outmat,
Cp=round(summary(model.bwrd)$cp,4))
## age weight height adipos neck chest abdom hip thigh knee ankle biceps
## 1 ( 1 ) " " " " " " " " " " " " "*" " " " " " " " " " "
## 2 ( 1 ) " " "*" " " " " " " " " "*" " " " " " " " " " "
## 3 ( 1 ) " " "*" " " " " " " " " "*" " " " " " " " " " "
## 4 ( 1 ) " " "*" " " " " " " " " "*" " " " " " " " " " "
## 5 ( 1 ) "*" "*" " " " " " " " " "*" " " " " " " " " " "
## 6 ( 1 ) "*" "*" " " " " " " " " "*" " " "*" " " " " " "
## 7 ( 1 ) "*" "*" " " " " "*" " " "*" " " "*" " " " " " "
## 8 ( 1 ) "*" "*" " " " " "*" " " "*" "*" "*" " " " " " "
## 9 ( 1 ) "*" "*" " " " " "*" " " "*" "*" "*" " " " " "*"
## 10 ( 1 ) "*" "*" " " " " "*" " " "*" "*" "*" " " "*" "*"
## 11 ( 1 ) "*" "*" "*" " " "*" " " "*" "*" "*" " " "*" "*"
## 12 ( 1 ) "*" "*" "*" " " "*" "*" "*" "*" "*" " " "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" " " "*" "*"
## 14 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## forearm wrist Cp
## 1 ( 1 ) " " " " "71.0755"
## 2 ( 1 ) " " " " "19.6177"
## 3 ( 1 ) " " "*" "13.2793"
## 4 ( 1 ) "*" "*" "8.148"
## 5 ( 1 ) "*" "*" "8.3016"
## 6 ( 1 ) "*" "*" "6.5901"
## 7 ( 1 ) "*" "*" "5.3117"
## 8 ( 1 ) "*" "*" "5.2307"
## 9 ( 1 ) "*" "*" "6.2969"
## 10 ( 1 ) "*" "*" "7.5953"
## 11 ( 1 ) "*" "*" "9.1148"
## 12 ( 1 ) "*" "*" "11.0509"
## 13 ( 1 ) "*" "*" "13"
## 14 ( 1 ) "*" "*" "15"
Start reading the above matrix from below. First, you have the model will all the predictors (line 14). The knee is removed, followed by adipos. The last model fitted only has abdom. Among these models, the one with 8 predictors has the lowest \(Cp\).
We can now fit the model with those predictors:
#final model from backward
bwrd.model <- lm(brozek ~ age + weight +
neck + abdom + hip +
thigh + forearm +
wrist,
data = fat)
summary(bwrd.model)
##
## Call:
## lm(formula = brozek ~ age + weight + neck + abdom + hip + thigh +
## forearm + wrist, data = fat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.0574 -2.7411 -0.1912 2.6929 9.4977
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -20.06213 10.84654 -1.850 0.06558 .
## age 0.05922 0.02850 2.078 0.03876 *
## weight -0.08414 0.03695 -2.277 0.02366 *
## neck -0.43189 0.20799 -2.077 0.03889 *
## abdom 0.87721 0.06661 13.170 < 2e-16 ***
## hip -0.18641 0.12821 -1.454 0.14727
## thigh 0.28644 0.11949 2.397 0.01727 *
## forearm 0.48255 0.17251 2.797 0.00557 **
## wrist -1.40487 0.47167 -2.978 0.00319 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.965 on 243 degrees of freedom
## Multiple R-squared: 0.7467, Adjusted R-squared: 0.7383
## F-statistic: 89.53 on 8 and 243 DF, p-value: < 2.2e-16
Now, let’s use forward stepwise. Using the \(Cp\) to choose the best model, will result, in this case,in the same set of the predictors as the backward stepwise. This is not always the case but it is quite common to happen.
Also, the matrix in the output is not exactly the same as the backward method.
############################################################
#FORWARD SELECTION
############################################################
model.fwrd <- regsubsets(brozek ~ age + weight +
height + adipos +
neck + chest +
abdom + hip + thigh +
knee + ankle +
biceps + forearm +
wrist,
data = fat,
nvmax = 14,
method = "forward")
summary(model.fwrd)
## Subset selection object
## Call: regsubsets.formula(brozek ~ age + weight + height + adipos +
## neck + chest + abdom + hip + thigh + knee + ankle + biceps +
## forearm + wrist, data = fat, nvmax = 14, method = "forward")
## 14 Variables (and intercept)
## Forced in Forced out
## age FALSE FALSE
## weight FALSE FALSE
## height FALSE FALSE
## adipos FALSE FALSE
## neck FALSE FALSE
## chest FALSE FALSE
## abdom FALSE FALSE
## hip FALSE FALSE
## thigh FALSE FALSE
## knee FALSE FALSE
## ankle FALSE FALSE
## biceps FALSE FALSE
## forearm FALSE FALSE
## wrist FALSE FALSE
## 1 subsets of each size up to 14
## Selection Algorithm: forward
## age weight height adipos neck chest abdom hip thigh knee ankle biceps
## 1 ( 1 ) " " " " " " " " " " " " "*" " " " " " " " " " "
## 2 ( 1 ) " " "*" " " " " " " " " "*" " " " " " " " " " "
## 3 ( 1 ) " " "*" " " " " " " " " "*" " " " " " " " " " "
## 4 ( 1 ) " " "*" " " " " " " " " "*" " " " " " " " " " "
## 5 ( 1 ) " " "*" " " " " "*" " " "*" " " " " " " " " " "
## 6 ( 1 ) "*" "*" " " " " "*" " " "*" " " " " " " " " " "
## 7 ( 1 ) "*" "*" " " " " "*" " " "*" " " "*" " " " " " "
## 8 ( 1 ) "*" "*" " " " " "*" " " "*" "*" "*" " " " " " "
## 9 ( 1 ) "*" "*" " " " " "*" " " "*" "*" "*" " " " " "*"
## 10 ( 1 ) "*" "*" " " " " "*" " " "*" "*" "*" " " "*" "*"
## 11 ( 1 ) "*" "*" "*" " " "*" " " "*" "*" "*" " " "*" "*"
## 12 ( 1 ) "*" "*" "*" " " "*" "*" "*" "*" "*" " " "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" " " "*" "*"
## 14 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## forearm wrist
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " "*"
## 4 ( 1 ) "*" "*"
## 5 ( 1 ) "*" "*"
## 6 ( 1 ) "*" "*"
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) "*" "*"
## 9 ( 1 ) "*" "*"
## 10 ( 1 ) "*" "*"
## 11 ( 1 ) "*" "*"
## 12 ( 1 ) "*" "*"
## 13 ( 1 ) "*" "*"
## 14 ( 1 ) "*" "*"
## age weight height adipos neck chest abdom hip thigh knee ankle biceps
## 1 ( 1 ) " " " " " " " " " " " " "*" " " " " " " " " " "
## 2 ( 1 ) " " "*" " " " " " " " " "*" " " " " " " " " " "
## 3 ( 1 ) " " "*" " " " " " " " " "*" " " " " " " " " " "
## 4 ( 1 ) " " "*" " " " " " " " " "*" " " " " " " " " " "
## 5 ( 1 ) " " "*" " " " " "*" " " "*" " " " " " " " " " "
## 6 ( 1 ) "*" "*" " " " " "*" " " "*" " " " " " " " " " "
## 7 ( 1 ) "*" "*" " " " " "*" " " "*" " " "*" " " " " " "
## 8 ( 1 ) "*" "*" " " " " "*" " " "*" "*" "*" " " " " " "
## 9 ( 1 ) "*" "*" " " " " "*" " " "*" "*" "*" " " " " "*"
## 10 ( 1 ) "*" "*" " " " " "*" " " "*" "*" "*" " " "*" "*"
## 11 ( 1 ) "*" "*" "*" " " "*" " " "*" "*" "*" " " "*" "*"
## 12 ( 1 ) "*" "*" "*" " " "*" "*" "*" "*" "*" " " "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" " " "*" "*"
## 14 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## forearm wrist Cp
## 1 ( 1 ) " " " " "71.0755"
## 2 ( 1 ) " " " " "19.6177"
## 3 ( 1 ) " " "*" "13.2793"
## 4 ( 1 ) "*" "*" "8.148"
## 5 ( 1 ) "*" "*" "7.4123"
## 6 ( 1 ) "*" "*" "7.0794"
## 7 ( 1 ) "*" "*" "5.3117"
## 8 ( 1 ) "*" "*" "5.2307"
## 9 ( 1 ) "*" "*" "6.2969"
## 10 ( 1 ) "*" "*" "7.5953"
## 11 ( 1 ) "*" "*" "9.1148"
## 12 ( 1 ) "*" "*" "11.0509"
## 13 ( 1 ) "*" "*" "13"
## 14 ( 1 ) "*" "*" "15"
#final model from forward
fwrd.model <- lm(brozek ~ age + weight +
neck + abdom +
hip + thigh +
forearm + wrist,
data = fat)
summary(fwrd.model)
##
## Call:
## lm(formula = brozek ~ age + weight + neck + abdom + hip + thigh +
## forearm + wrist, data = fat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.0574 -2.7411 -0.1912 2.6929 9.4977
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -20.06213 10.84654 -1.850 0.06558 .
## age 0.05922 0.02850 2.078 0.03876 *
## weight -0.08414 0.03695 -2.277 0.02366 *
## neck -0.43189 0.20799 -2.077 0.03889 *
## abdom 0.87721 0.06661 13.170 < 2e-16 ***
## hip -0.18641 0.12821 -1.454 0.14727
## thigh 0.28644 0.11949 2.397 0.01727 *
## forearm 0.48255 0.17251 2.797 0.00557 **
## wrist -1.40487 0.47167 -2.978 0.00319 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.965 on 243 degrees of freedom
## Multiple R-squared: 0.7467, Adjusted R-squared: 0.7383
## F-statistic: 89.53 on 8 and 243 DF, p-value: < 2.2e-16
TRY IT YOURSELF:
- What variables are selected in the example above using forward stepwise, if instead of the \(Cp\) we use the \(BIC\)?
See the solution code
- What is the adjusted r-square for the model in 1) and the model with the 8 predictors?
See the solution code
Task 2 - Stepwise in a logistic model
With the lowbwt.csv dataset, in the library(faraway), we want to fit a logistic regression to predict low birthweight (<2500gr), using age, lwt, race, smoke, ptl, ht, ui, ftv.
Let’s read the data and make sure that race and ftv are factor predictors. And recode ftv into (0, 1, 2+)
#install.packages("bestglm")
#install.packages("dplyr")
library(bestglm)
library(dplyr) #to use the rename() function
lowbwt <- read.csv("https://www.dropbox.com/s/1odxxsbwd5anjs8/lowbwt.csv?dl=1",
stringsAsFactors = TRUE)
lowbwt$race <- factor(lowbwt$race,
levels = c(1,2,3),
labels = c("White", "Black", "Other"))
#number of previous appointments
lowbwt$ftv[lowbwt$ftv>1] <- 2 #categories 0, 1, 2+
lowbwt$ftv <- factor(lowbwt$ftv,
levels = c(0,1,2),
labels = c("0", "1", "2+"))
head(lowbwt)
## id low age lwt race smoke ptl ht ui ftv bwt
## 1 85 0 19 182 Black 0 0 0 1 0 2523
## 2 86 0 33 155 Other 0 0 0 0 2+ 2551
## 3 87 0 20 105 White 1 0 0 0 1 2557
## 4 88 0 21 108 White 1 0 0 1 2+ 2594
## 5 89 0 18 107 White 1 0 0 1 0 2600
## 6 91 0 21 124 Other 0 0 0 0 0 2622
Let’s first fit the model for low using all the predictors.
lowbwt.logit <- glm(low ~ age +
lwt + race +
smoke + ptl +
ht + ui + ftv,
family=binomial, data=lowbwt)
summary(lowbwt.logit)
##
## Call:
## glm(formula = low ~ age + lwt + race + smoke + ptl + ht + ui +
## ftv, family = binomial, data = lowbwt)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.590930 1.213568 0.487 0.62630
## age -0.026945 0.037414 -0.720 0.47141
## lwt -0.015748 0.006963 -2.262 0.02371 *
## raceBlack 1.254016 0.529382 2.369 0.01784 *
## raceOther 0.820827 0.451042 1.820 0.06878 .
## smoke 0.865245 0.414450 2.088 0.03683 *
## ptl 0.604607 0.357445 1.691 0.09075 .
## ht 1.894689 0.705977 2.684 0.00728 **
## ui 0.732822 0.460442 1.592 0.11148
## ftv1 -0.299057 0.463918 -0.645 0.51917
## ftv2+ 0.178370 0.450777 0.396 0.69233
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 234.67 on 188 degrees of freedom
## Residual deviance: 200.61 on 178 degrees of freedom
## AIC: 222.61
##
## Number of Fisher Scoring iterations: 4
The function step()
also implements stepwise selection based on AIC for
generalised linear models. We will use it to select the best predictors using
backward stepwise
## Start: AIC=222.61
## low ~ age + lwt + race + smoke + ptl + ht + ui + ftv
##
## Df Deviance AIC
## - ftv 2 201.43 219.43
## - age 1 201.13 221.13
## <none> 200.61 222.61
## - ui 1 203.10 223.10
## - ptl 1 203.58 223.58
## - smoke 1 205.06 225.06
## - race 2 207.41 225.41
## - lwt 1 206.35 226.35
## - ht 1 208.22 228.22
##
## Step: AIC=219.43
## low ~ age + lwt + race + smoke + ptl + ht + ui
##
## Df Deviance AIC
## - age 1 201.99 217.99
## <none> 201.43 219.43
## - ptl 1 203.95 219.95
## - ui 1 204.11 220.11
## - race 2 208.77 222.77
## - lwt 1 206.81 222.81
## - smoke 1 206.92 222.92
## - ht 1 208.81 224.81
##
## Step: AIC=217.99
## low ~ lwt + race + smoke + ptl + ht + ui
##
## Df Deviance AIC
## <none> 201.99 217.99
## - ptl 1 204.22 218.22
## - ui 1 204.90 218.90
## - smoke 1 207.73 221.73
## - lwt 1 208.11 222.11
## - race 2 210.31 222.31
## - ht 1 209.46 223.46
##
## Call: glm(formula = low ~ lwt + race + smoke + ptl + ht + ui, family = binomial,
## data = lowbwt)
##
## Coefficients:
## (Intercept) lwt raceBlack raceOther smoke ptl
## -0.08655 -0.01591 1.32572 0.89708 0.93873 0.50321
## ht ui
## 1.85504 0.78570
##
## Degrees of Freedom: 188 Total (i.e. Null); 181 Residual
## Null Deviance: 234.7
## Residual Deviance: 202 AIC: 218
We can see, step by step, the variables removed. First ftv is removed, followed by age, up to the final model that includes the variables above.
If you get an error because there are missing values in dataset and the
step()
function does not know how to handle them, try to restrict the dataset to
complete cases only as below.
lowbwt.complete <- lowbwt[complete.cases(lowbwt), ]
lowbwt.logit <- glm(low ~ age +
lwt + race +
smoke + ptl +
ht + ui + ftv,
family=binomial, data=lowbwt.complete)
step(lowbwt.logit)
We can see, step by step, the variables removed. First ftv is removed, followed by age, up to the final model that includes the variables above.
2.4 Exercises
With the fat dataset (Task 1), use the
step()
function to implement backward selection, to select the predictors for body fat (variable brozek) using all the other variables available, except for siri, density and freeThe dataset
prostate
available in the packageprostate
contains information on 97 men who were about to receive a radical prostatectomy. These data come from a study examining the correlation between the prostate specific antigen (lpsa) and a number of other clinical measures.Use backward stepwise to select a subset of predictors of lpsa, using all other the variables in the model. You can use the \(adjusted \, r^2\) to select the best model. You can compare the model that you have obtained with the model using best subset selection (section 1.3)