# Practical Class Problem Sheets

## Practical 1 - Subset Selection

In this first practical we will investigate the dataset `fat` from the library `faraway`. With this data we want to predict body fat (variable `brozek`) using the other variables available, except for `siri` (another way of computing body fat), `density` (it is used in the `brozek` and `siri` formulae) and `free` (it is computed using the `brozek` formula).

### Q1 Data Installation and Initial Manipulation

(a) If you have not yet done so, install library `faraway`.

(b) Load library `faraway` and read the help file on the dataset `fat`.

(c) Look at the top six rows of the dataset `fat`. Use relevant commands to find the dimension of the dataframe and the names of the variables.

(d) Create a new dataframe called `fat1` which is equivalent to `fat`, but with the variables `siri`, `density` and `free` removed. Check that the dimension of this new dataframe is correct (that is, the number of rows should be the same as that of `fat` in part (c) and the number of columns should be three less). It is important to get this part correct, as we will use this new dataframe `fat1` throughout the remainder of this practical session (to check you got this part right, have a quick look at the opening paragraph to Practical 2).

(e) Check whether there is any missing data in the (new) dataframe.
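If you get stuck, a minimal sketch of the commands involved might look like this (the column indices 2, 3 and 8 assume the standard column ordering of `fat` in `faraway`):

```
library(faraway)            # install.packages("faraway") if not yet installed
?fat                        # help file for the dataset
head(fat)                   # top six rows
dim(fat)                    # dimension of the dataframe
names(fat)                  # variable names
fat1 <- fat[, -c(2, 3, 8)]  # drop siri, density and free
dim(fat1)                   # same number of rows, three fewer columns
anyNA(fat1)                 # check for missing data
```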

### Q2 Exploratory Data Analysis

(a) Calculate the correlations of `brozek` with respect to the 14 predictor variables.

(b) Which three predictors have the highest correlations with the response `brozek`? Calculate the pair-wise correlation between each of these three predictors. Are they also highly correlated with each other?

(c) Using the command `pairs()`, produce a 4 by 4 frame of scatter plots of the response variable and the three predictors.

(d) What does this plot tell us about potential correlations among the three predictors? Does it back up the correlation values obtained in parts (a) and (b)?
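A sketch for parts (a)–(c), assuming `fat1` from Q1 with `brozek` as its first column:

```
cors <- cor(fat1)[, "brozek"]             # correlations with brozek
cors <- cors[names(cors) != "brozek"]     # drop brozek itself
top3 <- names(sort(abs(cors), decreasing = TRUE))[1:3]
cors[top3]                                # the three highest correlations
cor(fat1[, top3])                         # pair-wise correlations of the top three
pairs(fat1[, c("brozek", top3)])          # 4 by 4 scatter plot frame
```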

### Q3 Forward Stepwise Regression

(a) Use `regsubsets()` to perform forward stepwise selection, assigning the result to a variable called `fwd`. Note that you will need to (install and) load library `leaps`.

(b) Look at the summary of the resulting object `fwd` and extract the following statistics for the model of each size: RSS, \(R^2\), \(C_p\), BIC and adjusted \(R^2\). Combine them into a single matrix.

(c) How many predictors are included in the models with i) minimum \(C_p\), ii) minimum BIC, and iii) maximum adjusted \(R^2\)? Which predictors are included in each case?

(d) In a single plotting window, generate the following three plots, showing the relationship between the two variables as lines (as seen in practical demonstration 2.4): i) \(C_p\) against number of predictors, ii) BIC against number of predictors, and iii) adjusted \(R^2\) against number of predictors. Make sure to add relevant labels to the plots. Highlight the optimal model appropriately on each plot using a big red point.

(e) Use the special `plot()` function for `regsubsets` objects to visualise the models obtained by \(C_p\).
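One possible skeleton for parts (a)–(c); the column names chosen for the matrix are a suggestion, not the published layout:

```
library(leaps)
fwd <- regsubsets(brozek ~ ., data = fat1, method = "forward", nvmax = 14)
fwd.sum <- summary(fwd)
stats <- cbind(RSS = fwd.sum$rss, R2 = fwd.sum$rsq, Cp = fwd.sum$cp,
               BIC = fwd.sum$bic, adjR2 = fwd.sum$adjr2)
which.min(fwd.sum$cp)     # size of the minimum-Cp model
which.min(fwd.sum$bic)    # size of the minimum-BIC model
which.max(fwd.sum$adjr2)  # size of the maximum adjusted-R2 model
```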

### Q4 Best Subset and Backward Selection

Do the best models (as determined by \(C_p\), BIC and adjusted \(R^2\)) from best subset and backward stepwise selection have the same number of predictors? If yes, are the predictors the same as those from forward selection? *Hint:* if it is not obvious how to perform backward selection, type `?regsubsets` and see the information about argument `method`.

## Practical 2 - Subset Selection

Make sure to finish Practical 1 before continuing with this practical! It might be tempting to skip some stuff if you’re behind, but you may end up even more lost. We will happily answer questions from any previous practical in each session.

In this practical, we continue our analysis of the `fat` data in library `faraway` introduced in Practical 1. Before starting, run the following commands to obtain our reduced dataset of interest, `fat1`, as you should have defined in Question 1(d) of Practical 1. We also load library `leaps` for function `regsubsets` later.

```
library("leaps")
library("faraway")
fat1 <- fat[, -c(2, 3, 8)]
```

### Q1 Predictive Performance

(a) Based on forward selection, evaluate the predictive accuracy (using validation mean squared errors) of models with numbers of predictors ranging from 1 to 14, based on a training and test sample. Use approximately 75% of the samples for training. You can use the `predict.regsubsets` function defined in practical demonstration Section 2.4.4. To make the results of your answer to this part reproducible (and in line with the published solutions), include the line `set.seed(1)` at the top of your code to set the random seed equal to 1.

(b) How many predictors are included in the model which is optimal based on the validation mean squared errors obtained for the particular split of training and test data used in part (a)? Which predictors are they?

Whilst we set the seed equal to 1 in part (a), you should also play around with changing the seed number and investigate whether the sampling randomness of the training and test data affects your results. This is the focus of parts (c) and (d).

(c) Change the seed number in part (a) to 67 and repeat the analysis. Does this affect the results?

(d) Finally, we investigate the spread in the number of predictors chosen for many different splits of training and test data, as was carried out for the `Hitters` data in Section 2.4.6. Loop over the forward stepwise selection analysis of part (a) for seed numbers 1 to 100, storing the number of predictors chosen in each case in a vector. Display the results in a histogram.
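A possible skeleton for part (a), assuming the `predict.regsubsets` helper from practical demonstration Section 2.4.4 has already been defined:

```
set.seed(1)
n <- nrow(fat1)
train <- sample(n, round(0.75 * n))   # roughly 75% of samples for training
fwd <- regsubsets(brozek ~ ., data = fat1[train, ],
                  method = "forward", nvmax = 14)
val.mse <- sapply(1:14, function(k) {
  pred <- predict.regsubsets(fwd, fat1[-train, ], id = k)
  mean((fat1$brozek[-train] - pred)^2)
})
which.min(val.mse)                    # optimal number of predictors
```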

### Q2 Cross-Validation

Using a similar approach to that discussed in Section 2.4.7, compare the 5-fold cross-validated predictive accuracy of the two models with i) minimum \(C_p\), ii) minimum BIC, using the whole dataset under forward stepwise selection (in other words, those found in Practical 1, Q3c). Which model does this analysis suggest that you use? For reproducibility purposes, and to make your answer in line with the published solutions, it is recommended to include the line `set.seed(1)` before generating the random folds.
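One way to generate the random fold labels (a sketch; each fold is then held out in turn when computing the cross-validated errors):

```
set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(fat1)))  # random fold label per row
table(folds)                                        # roughly equal fold sizes
```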

## Practical 3 - Ridge Regression

Make sure to at least finish Practical 1 before continuing with this practical! It might be tempting to skip some stuff if you’re behind, but you may end up even more lost. We will happily answer questions from any previous practical in each session.

We continue analysing the `fat` data in package `faraway` in this practical. Run the following commands to load the package and obtain our reduced dataset of interest, `fat1`. We also load the `glmnet` package for performing ridge regression.

```
library(faraway)
fat1 <- fat[, -c(2, 3, 8)]
library(glmnet)
```

### Q1 Ridge Regression

(a) Define a vector `y` containing all the values of the response variable `brozek`.

(b) Stack the predictors column-wise into a matrix, and define this matrix as `x`.

(c) Fit a model of the predictors for the response using ridge regression over a grid of 200 \(\lambda\) values. *Hint:* type `?glmnet` and look at argument `nlambda`.

(d) Plot the regularisation paths of the ridge regression coefficients as a function of log \(\lambda\).
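Parts (a)–(d) might start along these lines (using `model.matrix` to build `x` is one convenient option; `alpha = 0` selects the ridge penalty in `glmnet`):

```
y <- fat1$brozek
x <- model.matrix(brozek ~ ., data = fat1)[, -1]  # drop the intercept column
ridge <- glmnet(x, y, alpha = 0, nlambda = 200)   # ridge over 200 lambda values
plot(ridge, xvar = "lambda")                      # paths against log(lambda)
```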

### Q2 Cross-Validation

(a) Using `cv.glmnet` with 5-fold CV and random seed equal to 2, find the values of the min-CV \(\lambda\) and the 1-se \(\lambda\).

(b) Plot the CV \(\lambda\) graph next to the regularisation-path plot, indicating the two \(\lambda\) values with vertical lines.

(c) Which predictor has the highest absolute coefficient?

(d) Type `plot(ridge.cv)` into the R console. What is plotted?
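A sketch for part (a), reusing `x` and `y` from Q1 (the object name `ridge.cv` matches the one used in part (d)):

```
set.seed(2)
ridge.cv <- cv.glmnet(x, y, alpha = 0, nfolds = 5)
ridge.cv$lambda.min   # min-CV lambda
ridge.cv$lambda.1se   # 1-se lambda
```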

### Q3 Comparison of Predictive Performance

Using 50 repetitions with random seed set to 2, compare whether the model with min-CV \(\lambda\) or 1-se \(\lambda\) may be preferable in terms of predictive performance. In this case use 5-fold CV within `cv.glmnet()` and a 50% split for training and testing. Produce boxplots of the correlation values obtained from each cycle of your `for` loop. Finally, calculate the mean correlations from the two models. Under which approach does ridge seem to perform better?

## Practical 4 - Lasso and PCR Regression

Try to at least finish Question 1 from Practical 3 before continuing with this practical! It might be tempting to skip some stuff if you’re behind, but you may end up even more lost. We will happily answer questions from any previous practical in each session.

In this practical, we will start by considering the `seatpos` dataset in library `faraway`.

### Q1 Data Analysis

This question focuses on generally improving your data analytical skills, so just go with it! A subsequent lasso regression analysis follows in Question 2.

(a) Load library `faraway` and check the help file for dataset `seatpos`.

(b) Looking at the help file, which variable might the researchers at the HuMoSim laboratory consider to be a good response variable?

(c) Using the help file only for the moment (that is, without performing any analysis on the dataset), are there any predictor variables that we might expect to be highly correlated?

(d) What is the dimension of the dataset? Check whether there is any missing data.

(e) Use exploratory data analysis commands and plots to investigate the correlation of `hipcenter` with the other (possible predictor) variables. With which predictor variables is `hipcenter` most correlated? Are those predictor variables correlated with each other? Does this exploratory analysis support your initial thoughts from part (c)? What do your findings here suggest about the fixed location referred to in the help file description of `hipcenter` (is it in front of or behind the driver)?

### Q2 Lasso Regression

Load library `glmnet`. Fit a model of all of the remaining variables for `hipcenter` using lasso regression. Plot the regularisation paths against log \(\lambda\) and the \(\ell_1\) norm. Investigate which variables are included under various values of \(\lambda\). Do you notice anything interesting about the trends for the predictors `Ht` and `HtShoes`? Do the findings here support the exploratory analysis of Question 1?
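A possible starting sketch (note that `alpha = 1`, the lasso, is the default in `glmnet`):

```
library(faraway)
library(glmnet)
y <- seatpos$hipcenter
x <- model.matrix(hipcenter ~ ., data = seatpos)[, -1]  # drop intercept column
lasso <- glmnet(x, y)            # alpha = 1 (lasso) by default
par(mfrow = c(1, 2))
plot(lasso, xvar = "lambda")     # paths against log(lambda)
plot(lasso, xvar = "norm")       # paths against the l1 norm
```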

### Q3 Principal Component Regression

(a) Load library `pls` (required for PCR).

(b) Use the command `pcr` to fit a principal component regression model to `hipcenter` based on the remaining variables. Select the number of principal components using cross-validation. Remember to standardise the variables. How many principal components are recommended according to the cross-validation error?

(c) Appropriately plot the mean squared validation error.

(d) Use the command `coef()` to find the corresponding coefficient estimates in terms of the original \(\hat{\beta}\)’s. How do these estimates coincide with the exploratory analysis of Question 1?
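A sketch for parts (b)–(d); the seed line is our own addition so the CV folds are reproducible, and `ncomp = 2` in the last line is just an illustrative choice, not the recommended number:

```
library(pls)
set.seed(1)                                       # assumption: any fixed seed
pcr.fit <- pcr(hipcenter ~ ., data = seatpos,
               scale = TRUE, validation = "CV")   # standardise, CV errors
validationplot(pcr.fit, val.type = "MSEP")        # mean squared validation error
coef(pcr.fit, ncomp = 2)                          # beta-hats for, e.g., 2 components
```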

### Q4 Predictive Accuracy of Methods

Use 50 repetitions of data-splitting with 28 samples for training (10 for testing) to compare the predictive performance of the following models of the remaining predictors for `hipcenter`:

- Best subset selection with \(C_p\)
- Ridge with min-CV \(\lambda\) (5 folds)
- Lasso with min-CV \(\lambda\) (5 folds)
- PCR with min-CV number of principal components.

## Practical 5 - Polynomial and Step Function Regression

In this practical we consider two datasets for analysis by polynomial and step function regression: firstly the `seatpos` dataset analysed in Practical 4, followed by the `Boston` data analysed in practical demonstration Section 7.4.

### Q1 `seatpos` Data

(a) Load libraries `MASS` and `faraway`.

(b) We will analyse the effects of predictor variable `Ht` on the response variable `hipcenter`. Assign the `hipcenter` values to a vector `y`, and the `Ht` variable values to a vector `x`.

(c) Plot variables `Ht` and `hipcenter` against each other. Remember to include suitable axis labels. From visual inspection of this plot, what sort of polynomial might be appropriate for this data?

(d) Fit a first and second order polynomial to the data using the commands `lm` and `poly`. Look at the corresponding summary objects. Do these back up your answer to part (c)?

(e) Plot the first and second polynomial fits to the data, along with \(\pm2\) standard deviation confidence intervals, similar to those generated in practical demonstration Section 7.4. What do you notice about the degree-2 polynomial plot?

(f) Perform an analysis of variance to confirm whether higher order degrees of polynomial are useful for modelling `hipcenter` based on `Ht`.

(g) Use step function regression with 5 cut-points to model `hipcenter` based on `Ht`. Plot the results.
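Parts (b), (d) and (f) might be sketched as follows:

```
library(MASS)
library(faraway)
y <- seatpos$hipcenter
x <- seatpos$Ht
fit1 <- lm(y ~ poly(x, 1))             # first order polynomial
fit2 <- lm(y ~ poly(x, 2))             # second order polynomial
summary(fit2)
anova(fit1, fit2, lm(y ~ poly(x, 3)))  # are higher order terms useful?
```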

### Q2 `Boston` Data

The relationship between predictor `indus` and response `medv` of the `Boston` data also looks interesting. Predictor `indus` is the proportion of non-retail business acres per town. First plot the variables to see what the relationship between `indus` and `medv` looks like. Then analyse the data. Feel free to select between polynomials and step functions (of course you can experiment with both if you have time). If you use step functions you can also choose to define the intervals manually if you so wish. Use ANOVA to decide which model fits the data best.

## Practical 6 - Splines and GAMs

Make sure to at least finish Question 1 from Practical 5 before continuing with this practical! It might be tempting to skip some stuff if you’re behind, but you may end up even more lost. We will happily answer questions from any previous practical in each session.

In Questions 1 and 2, we again analyse the `seatpos` data from library `faraway`.

### Q1 Splines

In this question, we will again analyse the effects of predictor variable `Ht` on the response variable `hipcenter`.

(a) Load the necessary packages for the dataset and spline functions. Assign the `hipcenter` values to a vector `y`, and the `Ht` variable values to a vector `x`.

(b) Find the 25th, 50th and 75th percentiles of `x`, storing them in a vector `cuts`.

(c) Use a linear spline to model `hipcenter` as a function of `Ht`, putting knots at the 25th, 50th and 75th percentiles of `x`.

(d) Plot the fitted linear spline from part (c) over the data, along with \(\pm2\) standard deviation confidence intervals.

(e) Use a smoothing spline to model `hipcenter` as a function of `Ht`, selecting \(\lambda\) with cross-validation, and generate a relevant plot. What do you notice?
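A sketch for parts (b), (c) and (e), using `bs()` with `degree = 1` for the linear spline and `smooth.spline()` with `cv = TRUE` for a cross-validated choice of \(\lambda\):

```
library(faraway)
library(splines)
y <- seatpos$hipcenter
x <- seatpos$Ht
cuts <- quantile(x, probs = c(0.25, 0.5, 0.75))     # 25th, 50th, 75th percentiles
lin.spl <- lm(y ~ bs(x, knots = cuts, degree = 1))  # linear spline with 3 knots
sm.spl <- smooth.spline(x, y, cv = TRUE)            # lambda by cross-validation
```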

### Q2 GAMs

Fit a GAM for `hipcenter` that consists of three terms:

- a natural spline with 5 degrees of freedom for `Age`,
- a smoothing spline with 3 degrees of freedom for `Thigh`, and
- a simple linear model term for `Ht`.

Plot the resulting contributions of each term to the GAM, and compare them with plots of `hipcenter` against each of the three variables `Age`, `Thigh` and `Ht`. Does the contribution of each term of the GAM make sense in light of these pair-wise plots? Is the GAM fitting the data well?
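One way to fit such a model, assuming the `gam` package (with `ns()` from `splines` for the natural spline and `s()` for the smoothing spline):

```
library(faraway)
library(splines)
library(gam)
gam.fit <- gam(hipcenter ~ ns(Age, df = 5) + s(Thigh, df = 3) + Ht,
               data = seatpos)
par(mfrow = c(1, 3))
plot(gam.fit, se = TRUE)   # contribution of each term, with standard error bands
```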