Practical Class Problem Sheets
Practical 1 - Subset Selection
In this first practical we will investigate the dataset fat from the library faraway.
With this data we want to predict body fat (variable brozek) using the other variables available, except for siri (another way of computing body fat), density (it is used in the brozek and siri formulae) and free (it is computed using the brozek formula).
Q1 Data Installation and Initial Manipulation
(a) If you have not yet done so, install library faraway.
(b) Load library faraway and read the help file on the dataset fat.
(c) Look at the top six rows of the dataset fat. Use relevant commands to find the dimension of the dataframe and the names of the variables.
(d) Create a new dataframe called fat1 which is equivalent to fat, but with the variables siri, density and free removed. Check that the dimension of this new dataframe is correct (that is, the number of rows should be the same as that of fat in part (c) and the number of columns should be three less). It is important to get this part correct, as we will use this new dataframe fat1 throughout the remainder of this practical session (to check you got this part right, have a quick look at the opening paragraph to Practical 2).
(e) Check whether there is any missing data in the (new) dataframe.
Q2 Exploratory Data Analysis
(a) Calculate the correlations of brozek with respect to the 14 predictor variables.
(b) Which three predictors have the highest correlations with the response brozek? Calculate the pair-wise correlation between each of these three predictors. Are they also highly correlated with each other?
(c) Using the command pairs(), produce a 4 by 4 frame of scatter plots of the response variable and the three predictors. What does this plot tell us about potential correlations among the three predictors? Does it back up the correlation values obtained in parts (a) and (b)?
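If you get stuck, the following is one possible starting point for the exploratory analysis, assuming fat1 has been constructed as in Question 1(d). The top three predictors are computed rather than hard-coded, so the answer is not given away:

```r
library(faraway)
fat1 <- fat[, -c(2, 3, 8)]  # drop siri, density and free, as in Q1(d)

# (a) correlations of the response with the 14 predictors
round(cor(fat1)[-1, "brozek"], 2)

# (b) the three predictors most correlated (in absolute value) with brozek,
#     and their pair-wise correlations
top3 <- names(sort(abs(cor(fat1)[-1, "brozek"]), decreasing = TRUE))[1:3]
cor(fat1[, top3])

# (c) 4-by-4 scatter plot matrix of the response and the three predictors
pairs(fat1[, c("brozek", top3)])
```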
Q3 Forward Stepwise Regression
(a) Use regsubsets() to perform forward stepwise selection, assigning the result to a variable called fwd. Note that you will need to (install and) load library leaps.
(b) Look at the summary of the resulting object fwd and extract the following statistics for the model of each size: RSS, \(R^2\), \(C_p\), BIC and adjusted \(R^2\). Combine them into a single matrix.
(c) How many predictors are included in the models with i) minimum \(C_p\), ii) minimum BIC, and iii) maximum adjusted \(R^2\)? Which predictors are included in each case?
(d) In a single plotting window, generate the following three plots, showing the relationship between the two variables as lines (as seen in practical demonstration 2.4): i) \(C_p\) against number of predictors, ii) BIC against number of predictors, and iii) adjusted \(R^2\) against number of predictors. Make sure to add relevant labels to the plots. Highlight the optimal model on each plot with a large red point.
(e) Use the special plot() function for regsubsets to visualise the models obtained by \(C_p\).
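As a hedged sketch of the workflow (not the full solution), the forward stepwise analysis can be structured along these lines:

```r
library(leaps)
library(faraway)
fat1 <- fat[, -c(2, 3, 8)]

# (a) forward stepwise selection over all 14 predictors
fwd <- regsubsets(brozek ~ ., data = fat1, method = "forward", nvmax = 14)
fwd.sum <- summary(fwd)

# (b) model-size statistics combined into one matrix
stats <- cbind(RSS = fwd.sum$rss, R2 = fwd.sum$rsq, Cp = fwd.sum$cp,
               BIC = fwd.sum$bic, adjR2 = fwd.sum$adjr2)

# (c) sizes of the optimal models under each criterion
which.min(fwd.sum$cp); which.min(fwd.sum$bic); which.max(fwd.sum$adjr2)
coef(fwd, which.min(fwd.sum$cp))  # predictors in the min-Cp model

# (e) the regsubsets plot method, with rows ordered by Cp
plot(fwd, scale = "Cp")
```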
Q4 Best Subset and Backward Selection
Do the best models (as determined by \(C_p\), BIC and adjusted \(R^2\)) from best subset and backward stepwise selection have the same number of predictors? If yes, are the predictors the same as those from forward selection? Hint: if it is not obvious how to perform backward selection, type ?regsubsets and see the information about the argument method.
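The call pattern is identical to forward selection; only the method argument changes (a sketch, assuming fat1 from Practical 1):

```r
library(leaps)
library(faraway)
fat1 <- fat[, -c(2, 3, 8)]

best <- regsubsets(brozek ~ ., data = fat1, method = "exhaustive", nvmax = 14)
bwd  <- regsubsets(brozek ~ ., data = fat1, method = "backward",   nvmax = 14)

summary(best)$cp  # compare the criteria across the three approaches
summary(bwd)$cp
```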
Practical 2 - Subset Selection
Make sure to finish Practical 1 before continuing with this practical! It might be tempting to skip some stuff if you're behind, but you may end up more lost... we will happily answer questions from any previous practical each session.
In this practical, we continue our analysis of the fat data in library faraway introduced in Practical 1. Before starting, run the following commands to obtain our reduced dataset of interest, fat1, as you should have defined in Question 1(d) of Practical 1. We also load library leaps for function regsubsets later.
library("leaps")
library("faraway")
fat1 <- fat[,-c(2, 3, 8)]
Q1 Predictive Performance
(a) Based on forward selection, evaluate the predictive accuracy (using validation mean squared errors) of models with numbers of predictors ranging from 1 to 14 based on a training and test sample. Use approximately 75% of the samples for training. You can use the predict.regsubsets function defined in practical demonstration Section 2.4.4. To make the results of your answer to this part reproducible (and in line with the published solutions), include the line set.seed(1) at the top of your code to set the random seed equal to 1.
(b) How many predictors are included in the model which is optimal based on the validation mean squared errors obtained for the particular splitting of training and test data used in part (a)? Which predictors are they?
Whilst we set the seed equal to 1 in part (a), you should also play around with changing the seed and investigate whether the sampling randomness of training and test data affects your results. This is the focus of parts (c) and (d).
(c) Change the seed in part (a) to 67 and repeat the analysis. Does this affect the results?
(d) Finally, we investigate the spread in the number of predictors chosen for many different splits of training and test data, as was carried out for the Hitters data in Section 2.4.6. Loop over the forward stepwise selection analysis of part (a) for seeds 1 to 100, storing the number of predictors chosen in each case in a vector. Display the results in a histogram.
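A possible skeleton for the validation analysis follows. The predict.regsubsets definition below is the standard one from the demonstration notes; check it against Section 2.4.4:

```r
library(leaps)
library(faraway)
fat1 <- fat[, -c(2, 3, 8)]

# predict method for regsubsets objects, as in demonstration Section 2.4.4
predict.regsubsets <- function(object, newdata, id, ...) {
  form <- as.formula(object$call[[2]])
  mat <- model.matrix(form, newdata)
  coefi <- coef(object, id = id)
  mat[, names(coefi)] %*% coefi
}

# (a) one training/test split with roughly 75% of rows for training
set.seed(1)
train <- sample(nrow(fat1), round(0.75 * nrow(fat1)))
fwd <- regsubsets(brozek ~ ., data = fat1[train, ],
                  method = "forward", nvmax = 14)
val.mse <- sapply(1:14, function(i)
  mean((fat1$brozek[-train] - predict(fwd, fat1[-train, ], id = i))^2))

# (b) optimal model size for this split
which.min(val.mse)

# (d) repeat over seeds 1 to 100 and display the spread
sizes <- sapply(1:100, function(s) {
  set.seed(s)
  train <- sample(nrow(fat1), round(0.75 * nrow(fat1)))
  fwd <- regsubsets(brozek ~ ., data = fat1[train, ],
                    method = "forward", nvmax = 14)
  val <- sapply(1:14, function(i)
    mean((fat1$brozek[-train] - predict(fwd, fat1[-train, ], id = i))^2))
  which.min(val)
})
hist(sizes, breaks = 0:14 + 0.5, xlab = "Number of predictors", main = "")
```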
Q2 Cross-Validation
Using a similar approach to that discussed in Section 2.4.7, compare the 5-fold cross-validated predictive accuracy of the two models with i) minimum \(C_p\) and ii) minimum BIC, using the whole dataset under forward stepwise selection (in other words, those found in Practical 1, Q3(c)). Which model does this analysis suggest that you use? For reproducibility, and to keep your answer in line with the published solutions, it is recommended to include the line set.seed(1) before generating the random folds.
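One way to organise the comparison is a small helper that computes the 5-fold CV mean squared error of a given lm formula; the two formulas themselves are left for you to fill in from Practical 1:

```r
library(faraway)
fat1 <- fat[, -c(2, 3, 8)]

# 5-fold CV mean squared error for a given lm formula on fat1
cv.mse <- function(formula, data, folds, K = 5) {
  mean(sapply(1:K, function(k) {
    fit <- lm(formula, data = data[folds != k, ])
    mean((data$brozek[folds == k] - predict(fit, data[folds == k, ]))^2)
  }))
}

set.seed(1)
folds <- sample(rep(1:5, length.out = nrow(fat1)))

# replace the right-hand sides with the predictor sets from Practical 1, Q3(c):
# cv.mse(brozek ~ ..., fat1, folds)  # min-Cp model
# cv.mse(brozek ~ ..., fat1, folds)  # min-BIC model
```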
Practical 3 - Ridge Regression
Make sure to at least finish Practical 1 before continuing with this practical! It might be tempting to skip some stuff if you're behind, but you may end up more lost... we will happily answer questions from any previous practical each session.
We continue analysing the fat data in package faraway in this practical. Run the following commands to load the package and obtain our reduced dataset fat1 of interest. We also load the glmnet package for performing ridge regression.
library(faraway)
fat1 <- fat[,-c(2, 3, 8)]
library(glmnet)
Q1 Ridge Regression
(a) Define a vector y containing all the values of the response variable brozek.
(b) Stack the predictors column-wise into a matrix, and define this matrix as x.
(c) Fit a ridge regression of the response on the predictors over a grid of 200 \(\lambda\) values. Hint: type ?glmnet and look at argument nlambda.
(d) Plot the regularisation paths of the ridge regression coefficients as a function of log \(\lambda\).
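A hedged sketch of the ridge fit (recall that alpha = 0 selects ridge in glmnet):

```r
library(faraway)
library(glmnet)
fat1 <- fat[, -c(2, 3, 8)]

# (a), (b) response vector and predictor matrix
y <- fat1$brozek
x <- model.matrix(brozek ~ ., data = fat1)[, -1]  # drop the intercept column

# (c) ridge regression (alpha = 0) over a grid of 200 lambda values
ridge <- glmnet(x, y, alpha = 0, nlambda = 200)

# (d) regularisation paths against log(lambda)
plot(ridge, xvar = "lambda", label = TRUE)
```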
Q2 Cross-Validation
(a) Using cv.glmnet with 5-fold CV and random seed equal to 2, find the values of the min-CV \(\lambda\) and the 1-se \(\lambda\).
(b) Plot the CV \(\lambda\) graph next to the regularisation-path plot, indicating the two \(\lambda\) values with vertical lines.
(c) Which predictor has the highest absolute coefficient?
(d) Type plot(ridge.cv) into the R console. What is plotted?
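The cross-validation step might look something like the following sketch (x and y as in Question 1):

```r
library(faraway)
library(glmnet)
fat1 <- fat[, -c(2, 3, 8)]
y <- fat1$brozek
x <- model.matrix(brozek ~ ., data = fat1)[, -1]

# (a) 5-fold cross-validation for ridge
set.seed(2)
ridge.cv <- cv.glmnet(x, y, alpha = 0, nfolds = 5)
ridge.cv$lambda.min  # min-CV lambda
ridge.cv$lambda.1se  # 1-se lambda

# (b) CV curve next to the regularisation paths, with the two lambdas marked
par(mfrow = c(1, 2))
plot(ridge.cv)
plot(glmnet(x, y, alpha = 0), xvar = "lambda")
abline(v = log(c(ridge.cv$lambda.min, ridge.cv$lambda.1se)), lty = 2)

# (c) coefficients at the min-CV lambda
coef(ridge.cv, s = "lambda.min")
```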
Q3 Comparison of Predictive Performance
Using 50 repetitions with random seed set to 2, compare whether the model with min-CV \(\lambda\) or 1-se \(\lambda\) may be preferable in terms of predictive performance. In this case use 5-fold CV within cv.glmnet() and also a 50% splitting for training and testing. Produce boxplots of the correlation values obtained from each cycle of your for loop. Finally, calculate the mean correlations from the two models. Under which approach does ridge seem to perform better?
Practical 4 - Lasso and PCR Regression
Try to at least finish Question 1 from Practical 3 before continuing with this practical! It might be tempting to skip some stuff if you're behind, but you may end up more lost... we will happily answer questions from any previous practical each session.
In this practical, we will start by considering the seatpos dataset in library faraway.
Q1 Data Analysis
This question focuses on generally improving your data analytical skills, so just go with it! A subsequent lasso regression analysis follows in Question 2.
(a) Load library faraway and check the help file for dataset seatpos.
(b) Looking at the help file, which variable might the researchers at the HuMoSim laboratory consider to be a good response variable?
(c) Using the help file only for the moment (that is, without performing any analysis on the dataset), are there any predictor variables that we might expect to be highly correlated?
(d) What is the dimension of the dataset? Check whether there is any missing data.
(e) Use exploratory data analysis commands and plots to investigate the correlation of hipcenter with the other (possible predictor) variables. With which predictor variables is hipcenter most correlated? Are those predictor variables correlated with each other? Does this exploratory analysis support your initial thoughts from part (c)? What do your findings here suggest about the fixed location referred to in the help file description of hipcenter (is it in front of or behind the driver)?
Q2 Lasso Regression
Load library glmnet. Fit a lasso regression of hipcenter on all of the remaining variables. Plot the regularisation paths against log \(\lambda\) and against the \(\ell_1\) norm. Investigate which variables are included under various values of \(\lambda\). Do you notice anything interesting about the trends for the predictors Ht and HtShoes? Do the findings here support the exploratory analysis of Question 1?
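A minimal sketch of the lasso fit and the two path plots (alpha = 1 selects the lasso in glmnet):

```r
library(faraway)
library(glmnet)

y <- seatpos$hipcenter
x <- model.matrix(hipcenter ~ ., data = seatpos)[, -1]

lasso <- glmnet(x, y, alpha = 1)  # alpha = 1 gives the lasso

par(mfrow = c(1, 2))
plot(lasso, xvar = "lambda", label = TRUE)  # paths against log(lambda)
plot(lasso, xvar = "norm",   label = TRUE)  # paths against the l1 norm
```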
Q3 Principal Component Regression
(a) Load library pls (required for PCR).
(b) Use the command pcr to fit a principal component regression model to hipcenter based on the remaining variables. Select the number of principal components using cross-validation. Remember to standardise the variables. How many principal components are recommended, according to the cross-validation error?
(c) Appropriately plot the mean squared validation error.
(d) Use the command coef() to find the corresponding coefficient estimates in terms of the original \(\hat{\beta}\)'s. How do these estimates coincide with the exploratory analysis of Question 1?
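A sketch of the PCR workflow; note the seed below is an assumption (the sheet does not fix one for this CV), and the ncomp value in the last line is a placeholder:

```r
library(faraway)
library(pls)

set.seed(1)  # assumed seed; the sheet does not specify one here
pcr.fit <- pcr(hipcenter ~ ., data = seatpos, scale = TRUE, validation = "CV")
summary(pcr.fit)

# (c) mean squared validation error against number of components
validationplot(pcr.fit, val.type = "MSEP")

# (d) coefficients on the original scale for a chosen number of components
coef(pcr.fit, ncomp = 2)  # ncomp = 2 is a placeholder; use the CV-recommended value
```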
Q4 Predictive Accuracy of Methods
Use 50 repetitions of data-splitting with 28 samples for training (10 for testing) to compare the predictive performance of the following models of the remaining predictors for hipcenter:
- Best subset selection with \(C_p\)
- Ridge with min-CV \(\lambda\) (5 folds)
- Lasso with min-CV \(\lambda\) (5 folds)
- PCR with min-CV number of principal components.
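The comparison above can be organised as a loop over repeated splits. The skeleton below fills in only the ridge column; best subset, lasso and PCR follow the same fit-predict-score recipe. The seed is an assumption, since the sheet does not fix one:

```r
library(faraway)
library(glmnet)

y <- seatpos$hipcenter
x <- model.matrix(hipcenter ~ ., data = seatpos)[, -1]

set.seed(1)  # assumed seed; the sheet does not specify one
reps <- 50
mse <- matrix(NA, reps, 4,
              dimnames = list(NULL, c("best", "ridge", "lasso", "pcr")))
for (r in 1:reps) {
  train <- sample(nrow(seatpos), 28)  # 28 rows for training, 10 for testing

  # ridge with min-CV lambda (5 folds); the lasso is identical with alpha = 1
  cv.r <- cv.glmnet(x[train, ], y[train], alpha = 0, nfolds = 5)
  pred <- predict(cv.r, x[-train, ], s = "lambda.min")
  mse[r, "ridge"] <- mean((y[-train] - pred)^2)

  # best subset with Cp, lasso, and PCR: fit on the training rows,
  # predict the 10 test rows, store the test MSE in the matching column
}
boxplot(mse)                  # only the ridge column is filled in this sketch
colMeans(mse, na.rm = TRUE)
```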
Practical 5 - Polynomial and Step Function Regression
In this practical we consider two datasets for analysis by polynomial and step-wise function regression. Firstly the seatpos dataset analysed in Practical 4, followed by the Boston data analysed in practical demonstration Section 7.4.
Q1 seatpos Data
(a) Load libraries MASS and faraway.
(b) We will analyse the effect of the predictor variable Ht on the response variable hipcenter. Assign the hipcenter values to a vector y, and the Ht values to a vector x.
(c) Plot variables Ht and hipcenter against each other. Remember to include suitable axis labels. From visual inspection of this plot, what sort of polynomial might be appropriate for this data?
(d) Fit first and second order polynomials to the data using the commands lm and poly. Look at the corresponding summary objects. Do these back up your answer to part (c)?
(e) Plot the first and second order polynomial fits to the data, along with \(\pm2\) standard deviation confidence intervals, similar to those generated in practical demonstration Section 7.4. What do you notice about the degree-2 polynomial plot?
(f) Perform an analysis of variance to confirm whether higher order polynomial degrees are useful for modelling hipcenter based on Ht.
(g) Use step function regression with 5 cut-points to model hipcenter based on Ht. Plot the results.
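The polynomial and step function fits might be sketched as follows. Note that cut()'s second argument counts intervals rather than interior cut-points, so adjust it to match your reading of "5 cut-points":

```r
library(faraway)

y <- seatpos$hipcenter
x <- seatpos$Ht

# (c) scatter plot with labels
plot(x, y, xlab = "Ht", ylab = "hipcenter")

# (d) first and second order polynomial fits
fit1 <- lm(y ~ x)
fit2 <- lm(y ~ poly(x, 2))
summary(fit2)

# (f) ANOVA over a sequence of nested polynomial fits
fit3 <- lm(y ~ poly(x, 3))
anova(fit1, fit2, fit3)

# (g) step function regression via cut()
fit.step <- lm(y ~ cut(x, 5))
```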
Q2 Boston Data
The relationship between predictor indus and response medv of the Boston data also looks interesting. Predictor indus is the proportion of non-retail business acres per town. First plot the variables to see what the relationship between indus and medv looks like. Then analyse the data. Feel free to select between polynomials and step functions (of course you can experiment with both if you have time). If you use step functions you can also choose to define the intervals manually if you so wish. Use ANOVA to decide which model fits the data best.
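If you go down the polynomial route, one compact way to compare nested fits is:

```r
library(MASS)  # provides the Boston data

plot(Boston$indus, Boston$medv, xlab = "indus", ylab = "medv")

# nested polynomial fits of increasing degree, compared by ANOVA
fits <- lapply(1:4, function(d) lm(medv ~ poly(indus, d), data = Boston))
do.call(anova, fits)
```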
Practical 6 - Splines and GAMs
Make sure to at least finish Question 1 from Practical 5 before continuing with this practical! It might be tempting to skip some stuff if you're behind, but you may end up more lost... we will happily answer questions from any previous practical each session.
In Questions 1 and 2, we again analyse the seatpos data from library faraway.
Q1 Splines
In this question, we will again analyse the effects of predictor variable Ht on the response variable hipcenter.
(a) Load the necessary packages for the dataset and spline functions. Assign the hipcenter values to a vector y, and the Ht variable values to a vector x.
(b) Find the 25th, 50th and 75th percentiles of x, storing them in a vector cuts.
(c) Use a linear spline to model hipcenter as a function of Ht, putting knots at the 25th, 50th and 75th percentiles of x.
(d) Plot the fitted linear spline from part (c) over the data, along with \(\pm2\) standard deviation confidence intervals.
(e) Use a smoothing spline to model hipcenter as a function of Ht, selecting \(\lambda\) with cross-validation, and generate a relevant plot. What do you notice?
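A possible sketch for the spline fits, assuming the splines package for bs() and base R's smooth.spline():

```r
library(faraway)
library(splines)

y <- seatpos$hipcenter
x <- seatpos$Ht

# (b) knot locations at the quartiles
cuts <- quantile(x, probs = c(0.25, 0.5, 0.75))

# (c) linear spline: bs() with degree = 1 and knots at cuts
fit.ls <- lm(y ~ bs(x, knots = cuts, degree = 1))

# (d) fit and +/- 2 standard error bands over a grid of x values
grid <- seq(min(x), max(x), length.out = 100)
pr <- predict(fit.ls, newdata = list(x = grid), se.fit = TRUE)
plot(x, y, xlab = "Ht", ylab = "hipcenter")
lines(grid, pr$fit)
lines(grid, pr$fit + 2 * pr$se.fit, lty = 2)
lines(grid, pr$fit - 2 * pr$se.fit, lty = 2)

# (e) smoothing spline with lambda chosen by cross-validation
fit.ss <- smooth.spline(x, y, cv = TRUE)
lines(fit.ss, col = "red")
```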
Q2 GAMs
- Fit a GAM for hipcenter that consists of three terms:
  - a natural spline with 5 degrees of freedom for Age,
  - a smoothing spline with 3 degrees of freedom for Thigh, and
  - a simple linear model term for Ht.
- Plot the resulting contributions of each term to the GAM, and compare them with plots of hipcenter against each of the three variables Age, Thigh and Ht. Does the contribution of each term of the GAM make sense in light of these pair-wise plots? Is the GAM fitting the data well?
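One way to set this up, assuming the gam package (whose s() gives smoothing-spline terms) and ns() from splines:

```r
library(faraway)
library(gam)      # provides gam() with s() smoothing-spline terms
library(splines)  # provides ns() for natural splines

fit.gam <- gam(hipcenter ~ ns(Age, df = 5) + s(Thigh, df = 3) + Ht,
               data = seatpos)

# contribution of each term, with standard error bands
par(mfrow = c(1, 3))
plot(fit.gam, se = TRUE)

# pair-wise plots for comparison
par(mfrow = c(1, 3))
plot(seatpos$Age,   seatpos$hipcenter, xlab = "Age",   ylab = "hipcenter")
plot(seatpos$Thigh, seatpos$hipcenter, xlab = "Thigh", ylab = "hipcenter")
plot(seatpos$Ht,    seatpos$hipcenter, xlab = "Ht",    ylab = "hipcenter")
```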