Chapter 2 Examples
2.1 Example 1 - Backward Elimination and Forward Selection
Create regression models for predicting wrestlers’ hydrostatic fat (hwfat
) using the data frame
HSWRESTLER
.
The dataset can be found in the R
package PASWR2
, which can be loaded in using
library(PASWR2)
- Perform quality checks on the data in
HSWRESTLER
as well as some basic exploratory analysis to see how the predictors (age
,ht
,wt
,abs
,triceps
, andsubscap
) are related tohwfat.
Consider using the pairs()
function to perform preliminary analysis of how the predictors relate to hwfat
, as well as the cor()
function.
Since we are only interested in the predictors listed in the question, we can remove the last two.
<- HSWRESTLER[,-c(8:9)] HSWRESTLER
pairs(HSWRESTLER)
The predictors that appear to best relate to hwfat
are wt
, abs
, triceps
and subscap
.
cor(HSWRESTLER)
## age ht wt abs triceps subscap
## age 1.00000000 0.4177471 0.2406721 -0.05970301 -0.1349368 -0.1077762
## ht 0.41774714 1.0000000 0.6044705 0.25882739 0.1578298 0.2525007
## wt 0.24067211 0.6044705 1.0000000 0.84808755 0.7526595 0.8149226
## abs -0.05970301 0.2588274 0.8480876 1.00000000 0.9057768 0.9254833
## triceps -0.13493683 0.1578298 0.7526595 0.90577680 1.0000000 0.9455530
## subscap -0.10777616 0.2525007 0.8149226 0.92548332 0.9455530 1.0000000
## hwfat -0.17053777 0.1399135 0.7334894 0.91813556 0.9166337 0.9059425
## hwfat
## age -0.1705378
## ht 0.1399135
## wt 0.7334894
## abs 0.9181356
## triceps 0.9166337
## subscap 0.9059425
## hwfat 1.0000000
The correlations back up the initial interpretation, with wt
having a slightly weaker linear relationship than the others.
The package car
has several useful functions, and the function scatterplotMatrix()
may be useful for exploring relationships among the predictors and hwfat.
There are a number of arguments/options for the function scatterplotMatrix()
, so it's advised to refer to its help page for further
details.
library(car)
scatterplotMatrix(x = HSWRESTLER)
#or
scatterplotMatrix(formula = ~ hwfat + age + ht + wt + abs + triceps +
+ subscap, data = HSWRESTLER)
The plots generated by scatterplotMatrix()
display with more clarity why wt
doesn't follow as linear a relationship as the other predictors mentioned above.
- Use backward elimination with the predictors
age
,ht
,wt
,abs
,triceps
, andsubscap
and an \(\alpha_{crit}\) of 0.20 to create a regression model wherehwfat
is the response variable.
Start by defining the full model, with all listed predictors. Then use the function drop1()
with a test="F"
argument to find the first term to remove based on highest (least significant) p-value.
<- lm(hwfat ~ ., data = HSWRESTLER)
model.all
drop1(model.all, test = "F") # single term deletions
## Single term deletions
##
## Model:
## hwfat ~ age + ht + wt + abs + triceps + subscap
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 651.05 179.51
## age 1 9.594 660.64 178.65 1.0463 0.309839
## ht 1 1.613 652.66 177.70 0.1759 0.676225
## wt 1 2.546 653.60 177.81 0.2777 0.599879
## abs 1 162.000 813.05 194.84 17.6669 7.549e-05 ***
## triceps 1 72.683 723.73 185.76 7.9264 0.006301 **
## subscap 1 5.921 656.97 178.21 0.6458 0.424315
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the largest p-value from drop1()
is 0.6762, which corresponds to the variable ht
, one needs to update the original model by dropping the variable ht
. The function update()
allows the user to update a previous model (model.all
) using the short-hand notation. ~ . -ht. The periods on the left and right side of the tilde tell R
to use what is specified in the left- and right-hand sides of the tilde in model.all
minus the variable ht.
<- update(model.all, . ~ . - ht) model.new
The first variable we remove is
The second variable to be removed is
The third variable dropped is
The fourth variable dropped is
- Use forward selection with the predictors
age
,ht
,wt
,abs
,triceps
, andsubscap
and an \(\alpha_{crit}\) of 0.20 to create a regression model wherehwfat
is the response variable.
The functions add1() and update() are used to create a model using forward selection.
<- (~ . + age + ht + wt + abs + triceps + subscap)
SCOPE
<- lm(hwfat ~ 1, data = HSWRESTLER) # create a new empty model for Forwards Selection
mod.fs
add1(mod.fs, scope = SCOPE, test = "F") # SCOPE defined above determines the variables to be added
## Single term additions
##
## Model:
## hwfat ~ 1
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 6017.8 340.97
## age 1 175.0 5842.8 340.67 2.2765 0.1355
## ht 1 117.8 5900.0 341.43 1.5175 0.2218
## wt 1 3237.6 2780.2 282.74 88.5045 2.219e-14 ***
## abs 1 5072.8 945.0 198.57 407.9929 < 2.2e-16 ***
## triceps 1 5056.3 961.5 199.92 399.6462 < 2.2e-16 ***
## subscap 1 4939.0 1078.8 208.90 347.9456 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
<- update(mod.fs, . ~ . + abs) mod.fs
We look to add the term which results in the lowest AIC for the model, since multiple variables are below the \(\alpha_{crit}\) of 0.20.
The first variable we add is
The second variable to be added is
The third variable added is
The third variable added is