Practical 1 Solutions
Note that the questions in all these practicals have multiple possible solutions. Here you will find one that is in line with the course content.
In this first practical we will investigate the dataset fat
from the library faraway
.
With this data we want to predict body fat (variable brozek
) using the other variables available, except for siri
(another way of computing body fat), density
(it is used in the brozek
and siri
formulae) and free
(it is computed using the brozek formula).
Q1 Data Installation and Initial Manipulation
- If you have not yet done so, install library
faraway
.
install.packages("faraway")
- Load library
faraway
and read the help file on the datasetfat
.
library("faraway")
# read the help file for more information about the variables ?fat
- Look at the top six rows of the dataset
fat
. Use relevant commands to find the dimension of the dataframe and the names of the variables.
head(fat) # to look at the top six rows only.
## brozek siri density age weight height adipos free neck chest abdom hip thigh knee
## 1 12.6 12.3 1.0708 23 154.25 67.75 23.7 134.9 36.2 93.1 85.2 94.5 59.0 37.3
## 2 6.9 6.1 1.0853 22 173.25 72.25 23.4 161.3 38.5 93.6 83.0 98.7 58.7 37.3
## 3 24.6 25.3 1.0414 22 154.00 66.25 24.7 116.0 34.0 95.8 87.9 99.2 59.6 38.9
## 4 10.9 10.4 1.0751 26 184.75 72.25 24.9 164.7 37.4 101.8 86.4 101.2 60.1 37.3
## 5 27.8 28.7 1.0340 24 184.25 71.25 25.6 133.1 34.4 97.3 100.0 101.9 63.2 42.2
## 6 20.6 20.9 1.0502 24 210.25 74.75 26.5 167.0 39.0 104.5 94.4 107.8 66.0 42.0
## ankle biceps forearm wrist
## 1 21.9 32.0 27.4 17.1
## 2 23.4 30.5 28.9 18.2
## 3 24.0 28.8 25.2 16.6
## 4 22.8 32.4 29.4 18.2
## 5 24.0 32.2 27.7 17.7
## 6 25.6 35.7 30.6 18.8
dim(fat) # see what the dimension of the dataframe is.
## [1] 252 18
# 252 rows and 18 columns.
names(fat) # the variable names.
## [1] "brozek" "siri" "density" "age" "weight" "height" "adipos" "free"
## [9] "neck" "chest" "abdom" "hip" "thigh" "knee" "ankle" "biceps"
## [17] "forearm" "wrist"
- Create a new dataframe called
fat1
which is equivalent tofat
, but with the variablessiri
,density
andfree
removed. Check that the dimension of this new dataframe is correct (that is, the number of rows should be the same as that offat
in part (c) and the number of columns should be three less). It is important to get this part correct, as we will use this new dataframefat1
throughout the remainder of this practical session.
# Remove variables 2, 3 and 8.
<- fat[,-c(2,3,8)] # siri is variable 2, density variable 3 and
fat1 # free variable 8.
dim(fat1) # Check the dimension of the new dataframe.
## [1] 252 15
# 252 rows and 15 columns.
- Check whether there is any missing data in the (new) dataframe.
sum(is.na(fat1)) # We can just check is.na for the whole
## [1] 0
# dataframe to start with.
# Since the result is zero, there is no missing data.
Q2 Exploratory Data Analysis
- Calculate the correlations of
brozek
with respect to the 14 predictor variables.
cor(fat1[,1], fat1[,-1])
## age weight height adipos neck chest abdom hip
## [1,] 0.2891735 0.6131561 -0.08910641 0.7279942 0.4914889 0.7028852 0.8137062 0.6256999
## thigh knee ankle biceps forearm wrist
## [1,] 0.5612844 0.5077859 0.2667826 0.4930309 0.3632774 0.3475728
- Which three predictors have the highest correlations with the response
brozek
? Calculate the pair-wise correlation between each of these three predictors. Are they also highly correlated with each other?
# adipos, chest and abdom are the most correlated variables.
# We see that these three predictors exhibit strong correlations
# with the response. We can check the pair-wise correlations
# between these three variables as well.
cor(fat$adipos, fat$chest)
## [1] 0.9117986
cor(fat$adipos, fat$abdom)
## [1] 0.9238801
cor(fat$chest, fat$abdom)
## [1] 0.9158277
# These three predictors are all highly correlated with each other as well.
- Using the command
pairs()
, produce a 4 by 4 frame of scatter plots of the response variable and the three predictors.
pairs( fat[, c("brozek", "adipos", "chest", "abdom")] )
- What does this plot tell us about potential correlations among the three predictors? Does it back up the correlation values obtained in parts (a) and (b)?
# The strong pair-wise positive correlations between the predictors,
# and also between each predictor and the response, is clearly seen
# in the scatter plots.
Q3 Forward Stepwise Regression
- Use
regsubsets()
to perform forward stepwise selection, assigning the result to a variable calledfwd
. Note that you will need to (install and) load libraryleaps
.
library("leaps") # required library for regsubsets() function.
= regsubsets(brozek~., data = fat1, method = 'forward', nvmax = 14) fwd
- Look at the summary of the resulting object
fwd
and extract the following statistics for the model of each size: RSS, \(R^2\), \(C_p\), BIC and adjusted \(R^2\). Combine them into a single matrix.
= summary(fwd)
results results
## Subset selection object
## Call: regsubsets.formula(brozek ~ ., data = fat1, method = "forward",
## nvmax = 14)
## 14 Variables (and intercept)
## Forced in Forced out
## age FALSE FALSE
## weight FALSE FALSE
## height FALSE FALSE
## adipos FALSE FALSE
## neck FALSE FALSE
## chest FALSE FALSE
## abdom FALSE FALSE
## hip FALSE FALSE
## thigh FALSE FALSE
## knee FALSE FALSE
## ankle FALSE FALSE
## biceps FALSE FALSE
## forearm FALSE FALSE
## wrist FALSE FALSE
## 1 subsets of each size up to 14
## Selection Algorithm: forward
## age weight height adipos neck chest abdom hip thigh knee ankle biceps forearm
## 1 ( 1 ) " " " " " " " " " " " " "*" " " " " " " " " " " " "
## 2 ( 1 ) " " "*" " " " " " " " " "*" " " " " " " " " " " " "
## 3 ( 1 ) " " "*" " " " " " " " " "*" " " " " " " " " " " " "
## 4 ( 1 ) " " "*" " " " " " " " " "*" " " " " " " " " " " "*"
## 5 ( 1 ) " " "*" " " " " "*" " " "*" " " " " " " " " " " "*"
## 6 ( 1 ) "*" "*" " " " " "*" " " "*" " " " " " " " " " " "*"
## 7 ( 1 ) "*" "*" " " " " "*" " " "*" " " "*" " " " " " " "*"
## 8 ( 1 ) "*" "*" " " " " "*" " " "*" "*" "*" " " " " " " "*"
## 9 ( 1 ) "*" "*" " " " " "*" " " "*" "*" "*" " " " " "*" "*"
## 10 ( 1 ) "*" "*" " " " " "*" " " "*" "*" "*" " " "*" "*" "*"
## 11 ( 1 ) "*" "*" "*" " " "*" " " "*" "*" "*" " " "*" "*" "*"
## 12 ( 1 ) "*" "*" "*" " " "*" "*" "*" "*" "*" " " "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" " " "*" "*" "*"
## 14 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## wrist
## 1 ( 1 ) " "
## 2 ( 1 ) " "
## 3 ( 1 ) "*"
## 4 ( 1 ) "*"
## 5 ( 1 ) "*"
## 6 ( 1 ) "*"
## 7 ( 1 ) "*"
## 8 ( 1 ) "*"
## 9 ( 1 ) "*"
## 10 ( 1 ) "*"
## 11 ( 1 ) "*"
## 12 ( 1 ) "*"
## 13 ( 1 ) "*"
## 14 ( 1 ) "*"
= results$rss
RSS = results$rsq
r2 = results$cp
Cp = results$bic
BIC = results$adjr2
Adj_r2
# Combine the calculated criteria values above into a single matrix.
<- cbind(RSS, r2, Cp, BIC, Adj_r2)
criteria_values criteria_values
## RSS r2 Cp BIC Adj_r2
## [1,] 5094.931 0.6621178 71.075501 -262.3758 0.6607663
## [2,] 4241.328 0.7187265 19.617727 -303.0555 0.7164672
## [3,] 4108.183 0.7275563 13.279341 -305.5638 0.7242606
## [4,] 3994.311 0.7351080 8.147986 -307.1180 0.7308182
## [5,] 3950.628 0.7380049 7.412297 -304.3597 0.7326798
## [6,] 3913.377 0.7404753 7.079431 -301.2177 0.7341196
## [7,] 3853.214 0.7444651 5.311670 -299.5925 0.7371342
## [8,] 3819.985 0.7466688 5.230664 -296.2456 0.7383287
## [9,] 3805.075 0.7476576 6.296894 -291.7018 0.7382730
## [10,] 3793.873 0.7484005 7.595324 -286.9153 0.7379607
## [11,] 3786.200 0.7489093 9.114827 -281.8960 0.7374010
## [12,] 3785.179 0.7489771 11.050872 -276.4346 0.7363734
## [13,] 3784.367 0.7490309 13.000019 -270.9592 0.7353225
## [14,] 3784.367 0.7490309 15.000000 -265.4298 0.7342058
- How many predictors are included in the models with i) minimum \(C_p\), ii) minimum BIC, and iii) maximum adjusted \(R^2\)? Which predictors are included in each case?
which.min(Cp) # i) 8
## [1] 8
which.min(BIC) # ii) 4
## [1] 4
which.max(Adj_r2) # iii) 8
## [1] 8
coef(fwd, 4)
## (Intercept) weight abdom forearm wrist
## -31.2967858 -0.1255654 0.9213725 0.4463824 -1.3917662
# Looking at the summary above, the model with 4 predictors (minimum BIC)
# includes the predictors weight, abdom, forearm and wrist.
coef(fwd, 8)
## (Intercept) age weight neck abdom hip
## -20.06213373 0.05921577 -0.08413521 -0.43189267 0.87720667 -0.18641032
## thigh forearm wrist
## 0.28644340 0.48254563 -1.40486912
# The model with 8 predictors (minimum Cp and maximum adjusted R-squared)
# includes additionally the predictors age, neck, hip and thigh.
- In a single plotting window, generate the following three plots, showing the relationship between the two variables as lines (as seen in the practical demonstration 2.4): i) \(C_p\) against number of predictors, ii) BIC against number of predictors, and iii) adjusted \(R^2\) against number of predictors. Make sure to add relevant labels to the plots. Highlight the optimal model appropriately on each plot using a big red point.
par(mfrow = c(1, 3))
plot(Cp, xlab = "Number of Predictors", ylab = "Cp", type = 'l', lwd = 2)
points(8, Cp[8], col = "red", cex = 2, pch = 8, lwd = 2)
plot(BIC, xlab = "Number of Predictors", ylab = "BIC", type = 'l', lwd = 2)
points(4, BIC[4], col = "red", cex = 2, pch = 8, lwd = 2)
plot(Adj_r2, xlab = "Number of Predictors", ylab = "Adjusted RSq",
type = "l", lwd = 2)
points(8, Adj_r2[8], col = "red", cex = 2, pch = 8, lwd = 2)
- Use the special
plot()
function forregsubsets
to visualise the models obtained by \(C_p\).
plot(fwd, scale = "Cp")
Q4 Best Subset and Backward Selection
Do the best models (as determined by \(C_p\), BIC and adjusted-\(R^2\)) from best subset and backward stepwise selections have the same number of predictors? If yes, are the predictors the same as those from forward selection? Hint: if it is not obvious how to perform backward selection type ?regsubsets
and see information about argument method
.
= regsubsets(brozek~., data = fat1, nvmax = 14)
best = regsubsets(brozek~., data = fat1, method = 'backward', nvmax = 14)
bwd
which.min(summary(best)$cp)
## [1] 8
which.min(summary(best)$bic)
## [1] 4
which.max(summary(best)$adjr2)
## [1] 8
which.min(summary(bwd)$cp)
## [1] 8
which.min(summary(bwd)$bic)
## [1] 4
which.max(summary(bwd)$adjr2)
## [1] 8
# Yes, the three optimal models (under each of the criteria Cp, BIC and
# adj-R-squared) for each of forward stepwise, backward stepwise and
# best subset selections all have the same number of predictors.
# Cp
coef(fwd,8)
## (Intercept) age weight neck abdom hip
## -20.06213373 0.05921577 -0.08413521 -0.43189267 0.87720667 -0.18641032
## thigh forearm wrist
## 0.28644340 0.48254563 -1.40486912
coef(best,8)
## (Intercept) age weight neck abdom hip
## -20.06213373 0.05921577 -0.08413521 -0.43189267 0.87720667 -0.18641032
## thigh forearm wrist
## 0.28644340 0.48254563 -1.40486912
coef(bwd,8)
## (Intercept) age weight neck abdom hip
## -20.06213373 0.05921577 -0.08413521 -0.43189267 0.87720667 -0.18641032
## thigh forearm wrist
## 0.28644340 0.48254563 -1.40486912
# BIC
coef(fwd,4)
## (Intercept) weight abdom forearm wrist
## -31.2967858 -0.1255654 0.9213725 0.4463824 -1.3917662
coef(best,4)
## (Intercept) weight abdom forearm wrist
## -31.2967858 -0.1255654 0.9213725 0.4463824 -1.3917662
coef(bwd,4)
## (Intercept) weight abdom forearm wrist
## -31.2967858 -0.1255654 0.9213725 0.4463824 -1.3917662
# In addition, the predictors are also the same.
# Note that comparison of the models under R-squared is as for
# Cp as this criterion also suggested the model with 8 predictors.