# Practical Class Problem Sheets

## Practical 1 - Subset Selection

In this first practical we will investigate the dataset `fat` from the library `faraway`. With this data we want to predict body fat (variable `brozek`) using the other variables available, except for `siri` (another way of computing body fat), `density` (it is used in the `brozek` and `siri` formulae) and `free` (it is computed using the `brozek` formula).

### Q1 Data Installation and Initial Manipulation

(a) If you have not yet done so, install library `faraway`.

(b) Load library `faraway` and read the help file on the dataset `fat`.

(c) Look at the top six rows of the dataset `fat`. Use relevant commands to find the dimension of the dataframe and the names of the variables.

(d) Create a new dataframe called `fat1` which is equivalent to `fat`, but with the variables `siri`, `density` and `free` removed. Check that the dimension of this new dataframe is correct (that is, the number of rows should be the same as that of `fat` in part (c) and the number of columns should be three less). It is important to get this part correct, as we will use this new dataframe `fat1` throughout the remainder of this practical session (to check you got this part right, have a quick look at the opening paragraph to Practical 2).

(e) Check whether there is any missing data in the (new) dataframe.
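If you get stuck, a minimal sketch of the commands involved might look like this (the column indices 2, 3 and 8 assume the standard column ordering of `fat` in `faraway`):

```
library(faraway)            # install.packages("faraway") if not yet installed
?fat                        # help file for the dataset
head(fat)                   # top six rows
dim(fat)                    # dimension of the dataframe
names(fat)                  # variable names
fat1 <- fat[, -c(2, 3, 8)]  # drop siri, density and free
dim(fat1)                   # same number of rows, three fewer columns
anyNA(fat1)                 # check for missing data
```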

### Q2 Exploratory Data Analysis

(a) Calculate the correlations of `brozek` with respect to the 14 predictor variables.

(b) Which three predictors have the highest correlations with the response `brozek`? Calculate the pair-wise correlation between each of these three predictors. Are they also highly correlated with each other?

(c) Using the command `pairs()`, produce a 4 by 4 frame of scatter plots of the response variable and the three predictors.

(d) What does this plot tell us about potential correlations among the three predictors? Does it back up the correlation values obtained in parts (a) and (b)?
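A sketch for parts (a)–(c), assuming `fat1` from Q1 with `brozek` as its first column:

```
cors <- cor(fat1)[, "brozek"]             # correlations with brozek
cors <- cors[names(cors) != "brozek"]     # drop brozek itself
top3 <- names(sort(abs(cors), decreasing = TRUE))[1:3]
cors[top3]                                # the three highest correlations
cor(fat1[, top3])                         # pair-wise correlations of the top three
pairs(fat1[, c("brozek", top3)])          # 4 by 4 scatter plot frame
```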

### Q3 Forward Stepwise Regression

(a) Use `regsubsets()` to perform forward stepwise selection, assigning the result to a variable called `fwd`. Note that you will need to (install and) load library `leaps`.

(b) Look at the summary of the resulting object `fwd` and extract the following statistics for the model of each size: RSS, \(R^2\), \(C_p\), BIC and adjusted \(R^2\). Combine them into a single matrix.

(c) How many predictors are included in the models with i) minimum \(C_p\), ii) minimum BIC, and iii) maximum adjusted \(R^2\)? Which predictors are included in each case?

(d) In a single plotting window, generate the following three plots, showing the relationship between the two variables as lines (as seen in practical demonstration 2.4): i) \(C_p\) against number of predictors, ii) BIC against number of predictors, and iii) adjusted \(R^2\) against number of predictors. Make sure to add relevant labels to the plots. Highlight the optimal model appropriately on each plot using a big red point.

(e) Use the special `plot()` function for `regsubsets` objects to visualise the models obtained by \(C_p\).
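One possible skeleton for parts (a)–(c); the column names chosen for the matrix are a suggestion, not the published layout:

```
library(leaps)
fwd <- regsubsets(brozek ~ ., data = fat1, method = "forward", nvmax = 14)
fwd.sum <- summary(fwd)
stats <- cbind(RSS = fwd.sum$rss, R2 = fwd.sum$rsq, Cp = fwd.sum$cp,
               BIC = fwd.sum$bic, adjR2 = fwd.sum$adjr2)
which.min(fwd.sum$cp)     # size of the minimum-Cp model
which.min(fwd.sum$bic)    # size of the minimum-BIC model
which.max(fwd.sum$adjr2)  # size of the maximum adjusted-R2 model
```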

### Q4 Best Subset and Backward Selection

Do the best models (as determined by \(C_p\), BIC and adjusted \(R^2\)) from best subset and backward stepwise selection have the same number of predictors? If yes, are the predictors the same as those from forward selection? *Hint:* if it is not obvious how to perform backward selection, type `?regsubsets` and see the information about argument `method`.

## Practical 2 - Subset Selection

Make sure to finish Practical 1 before continuing with this practical! It might be tempting to skip some stuff if you’re behind, but you may end up even more lost. We will happily answer questions from any previous practical in each session.

In this practical, we continue our analysis of the `fat` data in library `faraway` introduced in Practical 1. Before starting, run the following commands to obtain our reduced dataset of interest, `fat1`, as you should have defined in Question 1(d) of Practical 1. We also load library `leaps` for function `regsubsets` later.

```
library("leaps")
library("faraway")
fat1 <- fat[, -c(2, 3, 8)]
```

### Q1 Predictive Performance

(a) Based on forward selection, evaluate the predictive accuracy (using validation mean squared errors) of models with numbers of predictors ranging from 1 to 14, based on a training and test sample. Use approximately 75% of the samples for training. You can use the `predict.regsubsets` function defined in practical demonstration Section 2.4.4. To make the results of your answer to this part reproducible (and in line with the published solutions), include the line `set.seed(1)` at the top of your code to set the random seed equal to 1.

(b) How many predictors are included in the model which is optimal based on the validation mean squared errors obtained for the particular split of training and test data used in part (a)? Which predictors are they?

Whilst we set the seed equal to 1 in part (a), you should also play around with changing the seed number and investigate whether the sampling randomness of the training and test data affects your results. This is the focus of parts (c) and (d).

(c) Change the seed number in part (a) to 67 and repeat the analysis. Does this affect the results?

(d) Finally, we investigate the spread in the number of predictors chosen for many different splits of training and test data, as was carried out for the `Hitters` data in Section 2.4.6. Loop over the forward stepwise selection analysis of part (a) for seed numbers 1 to 100, storing the number of predictors chosen in each case in a vector. Display the results in a histogram.
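A possible skeleton for part (a), assuming the `predict.regsubsets` helper from practical demonstration Section 2.4.4 has already been defined:

```
set.seed(1)
n <- nrow(fat1)
train <- sample(n, round(0.75 * n))   # roughly 75% of samples for training
fwd <- regsubsets(brozek ~ ., data = fat1[train, ],
                  method = "forward", nvmax = 14)
val.mse <- sapply(1:14, function(k) {
  pred <- predict.regsubsets(fwd, fat1[-train, ], id = k)
  mean((fat1$brozek[-train] - pred)^2)
})
which.min(val.mse)                    # optimal number of predictors
```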

### Q2 Cross-Validation

Using a similar approach to that discussed in Section 2.4.7, compare the 5-fold cross-validated predictive accuracy of the two models with i) minimum \(C_p\), ii) minimum BIC, using the whole dataset under forward stepwise selection (in other words, those found in Practical 1, Q3c). Which model does this analysis suggest that you use? For reproducibility purposes, and to make your answer in line with the published solutions, it is recommended to include the line `set.seed(1)` before generating the random folds.
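One way to generate the random fold labels (a sketch; each fold is then held out in turn when computing the cross-validated errors):

```
set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(fat1)))  # random fold label per row
table(folds)                                        # roughly equal fold sizes
```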

## Practical 3 - Ridge Regression

Make sure to at least finish Practical 1 before continuing with this practical! It might be tempting to skip some stuff if you’re behind, but you may end up even more lost. We will happily answer questions from any previous practical in each session.

We continue analysing the `fat` data in package `faraway` in this practical. Run the following commands to load the package and obtain our reduced dataset of interest, `fat1`. We also load the `glmnet` package for performing ridge regression.

```
library(faraway)
fat1 <- fat[, -c(2, 3, 8)]
library(glmnet)
```

### Q1 Ridge Regression

(a) Define a vector `y` containing all the values of the response variable `brozek`.

(b) Stack the predictors column-wise into a matrix, and define this matrix as `x`.

(c) Fit a model of the predictors for the response using ridge regression over a grid of 200 \(\lambda\) values. *Hint:* type `?glmnet` and look at argument `nlambda`.

(d) Plot the regularisation paths of the ridge regression coefficients as a function of log \(\lambda\).
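Parts (a)–(d) might start along these lines (using `model.matrix` to build `x` is one convenient option; `alpha = 0` selects the ridge penalty in `glmnet`):

```
y <- fat1$brozek
x <- model.matrix(brozek ~ ., data = fat1)[, -1]  # drop the intercept column
ridge <- glmnet(x, y, alpha = 0, nlambda = 200)   # ridge over 200 lambda values
plot(ridge, xvar = "lambda")                      # paths against log(lambda)
```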

### Q2 Cross-Validation

(a) Using `cv.glmnet` with 5-fold CV and random seed equal to 2, find the values of the min-CV \(\lambda\) and the 1-se \(\lambda\).

(b) Plot the CV \(\lambda\) graph next to the regularisation-path plot, indicating the two \(\lambda\) values with vertical lines.

(c) Which predictor has the highest absolute coefficient?

(d) Type `plot(ridge.cv)` into the R console. What is plotted?
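A sketch for part (a), reusing `x` and `y` from Q1 (the object name `ridge.cv` matches the one used in part (d)):

```
set.seed(2)
ridge.cv <- cv.glmnet(x, y, alpha = 0, nfolds = 5)
ridge.cv$lambda.min   # min-CV lambda
ridge.cv$lambda.1se   # 1-se lambda
```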

### Q3 Comparison of Predictive Performance

Using 50 repetitions with random seed set to 2, compare whether the model with min-CV \(\lambda\) or 1-se \(\lambda\) may be preferable in terms of predictive performance. In this case use 5-fold CV within `cv.glmnet()` and a 50% split for training and testing. Produce boxplots of the correlation values obtained from each cycle of your `for` loop. Finally, calculate the mean correlations from the two models. Under which approach does ridge seem to perform better?

## Practical 4 - Lasso and PCR Regression

Try to at least finish Question 1 from Practical 3 before continuing with this practical! It might be tempting to skip some stuff if you’re behind, but you may end up even more lost. We will happily answer questions from any previous practical in each session.

In this practical, we will start by considering the `seatpos` dataset in library `faraway`.

### Q1 Data Analysis

This question focuses on generally improving your data analytical skills, so just go with it! A subsequent lasso regression analysis follows in Question 2.

(a) Load library `faraway` and check the help file for dataset `seatpos`.

(b) Looking at the help file, which variable might the researchers at the HuMoSim laboratory consider to be a good response variable?

(c) Using the help file only for the moment (that is, without performing any analysis on the dataset), are there any predictor variables that we might expect to be highly correlated?

(d) What is the dimension of the dataset? Check whether there is any missing data.

(e) Use exploratory data analysis commands and plots to investigate the correlation of `hipcenter` with the other (possible predictor) variables. With which predictor variables is `hipcenter` most correlated? Are those predictor variables correlated with each other? Does this exploratory analysis support your initial thoughts from part (c)? What do your findings here suggest about the fixed location referred to in the help file description of `hipcenter` (is it in front of or behind the driver)?

### Q2 Lasso Regression

Load library `glmnet`. Fit a model of all of the remaining variables for `hipcenter` using lasso regression. Plot the regularisation paths against log \(\lambda\) and the \(\ell_1\) norm. Investigate which variables are included under various values of \(\lambda\). Do you notice anything interesting about the trends for the predictors `Ht` and `HtShoes`? Do the findings here support the exploratory analysis of Question 1?
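A possible starting sketch (note that `alpha = 1`, the lasso, is the default in `glmnet`):

```
library(faraway)
library(glmnet)
y <- seatpos$hipcenter
x <- model.matrix(hipcenter ~ ., data = seatpos)[, -1]  # drop intercept column
lasso <- glmnet(x, y)            # alpha = 1 (lasso) by default
par(mfrow = c(1, 2))
plot(lasso, xvar = "lambda")     # paths against log(lambda)
plot(lasso, xvar = "norm")       # paths against the l1 norm
```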

### Q3 Principal Component Regression

(a) Load library `pls` (required for PCR).

(b) Use the command `pcr` to fit a principal component regression model to `hipcenter` based on the remaining variables. Select the number of principal components using cross-validation. Remember to standardise the variables. How many principal components are recommended according to the cross-validation error?

(c) Appropriately plot the mean squared validation error.

(d) Use the command `coef()` to find the corresponding coefficient estimates in terms of the original \(\hat{\beta}\)’s. How do these estimates coincide with the exploratory analysis of Question 1?
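A sketch for parts (b)–(d); the seed line is our own addition so the CV folds are reproducible, and `ncomp = 2` in the last line is just an illustrative choice, not the recommended number:

```
library(pls)
set.seed(1)                                       # assumption: any fixed seed
pcr.fit <- pcr(hipcenter ~ ., data = seatpos,
               scale = TRUE, validation = "CV")   # standardise, CV errors
validationplot(pcr.fit, val.type = "MSEP")        # mean squared validation error
coef(pcr.fit, ncomp = 2)                          # beta-hats for, e.g., 2 components
```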

### Q4 Predictive Accuracy of Methods

Use 50 repetitions of data-splitting with 28 samples for training (10 for testing) to compare the predictive performance of the following models of the remaining predictors for `hipcenter`:

- Best subset selection with \(C_p\)
- Ridge with min-CV \(\lambda\) (5 folds)
- Lasso with min-CV \(\lambda\) (5 folds)
- PCR with min-CV number of principal components.

## Practical 5 - Polynomial and Step Function Regression

In this practical we consider two datasets for analysis by polynomial and step function regression: firstly the `seatpos` dataset analysed in Practical 4, followed by the `Boston` data analysed in practical demonstration Section 7.4.

### Q1 `seatpos` Data

(a) Load libraries `MASS` and `faraway`.

(b) We will analyse the effects of predictor variable `Ht` on the response variable `hipcenter`. Assign the `hipcenter` values to a vector `y`, and the `Ht` variable values to a vector `x`.

(c) Plot variables `Ht` and `hipcenter` against each other. Remember to include suitable axis labels. From visual inspection of this plot, what sort of polynomial might be appropriate for this data?

(d) Fit a first and second order polynomial to the data using the commands `lm` and `poly`. Look at the corresponding summary objects. Do these back up your answer to part (c)?

(e) Plot the first and second polynomial fits to the data, along with \(\pm2\) standard deviation confidence intervals, similar to those generated in practical demonstration Section 7.4. What do you notice about the degree-2 polynomial plot?

(f) Perform an analysis of variance to confirm whether higher order degrees of polynomial are useful for modelling `hipcenter` based on `Ht`.

(g) Use step function regression with 5 cut-points to model `hipcenter` based on `Ht`. Plot the results.
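Parts (b), (d) and (f) might be sketched as follows:

```
library(MASS)
library(faraway)
y <- seatpos$hipcenter
x <- seatpos$Ht
fit1 <- lm(y ~ poly(x, 1))             # first order polynomial
fit2 <- lm(y ~ poly(x, 2))             # second order polynomial
summary(fit2)
anova(fit1, fit2, lm(y ~ poly(x, 3)))  # are higher order terms useful?
```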

### Q2 `Boston` Data

The relationship between predictor `indus` and response `medv` of the `Boston` data also looks interesting. Predictor `indus` is the proportion of non-retail business acres per town. First plot the variables to see what the relationship between `indus` and `medv` looks like. Then analyse the data. Feel free to select between polynomials and step functions (of course you can experiment with both if you have time). If you use step functions you can also choose to define the intervals manually if you so wish. Use ANOVA to decide which model fits the data best.

## Practical 6 - Splines and GAMs

Make sure to at least finish Question 1 from Practical 5 before continuing with this practical! It might be tempting to skip some stuff if you’re behind, but you may end up even more lost. We will happily answer questions from any previous practical in each session.

In Questions 1 and 2, we again analyse the `seatpos` data from library `faraway`.

### Q1 Splines

In this question, we will again analyse the effects of predictor variable `Ht` on the response variable `hipcenter`.

(a) Load the necessary packages for the dataset and spline functions. Assign the `hipcenter` values to a vector `y`, and the `Ht` variable values to a vector `x`.

(b) Find the 25th, 50th and 75th percentiles of `x`, storing them in a vector `cuts`.

(c) Use a linear spline to model `hipcenter` as a function of `Ht`, putting knots at the 25th, 50th and 75th percentiles of `x`.

(d) Plot the fitted linear spline from part (c) over the data, along with \(\pm2\) standard deviation confidence intervals.

(e) Use a smoothing spline to model `hipcenter` as a function of `Ht`, selecting \(\lambda\) with cross-validation, and generate a relevant plot. What do you notice?
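A sketch for parts (b), (c) and (e), using `bs()` with `degree = 1` for the linear spline and `smooth.spline()` with `cv = TRUE` for a cross-validated choice of \(\lambda\):

```
library(faraway)
library(splines)
y <- seatpos$hipcenter
x <- seatpos$Ht
cuts <- quantile(x, probs = c(0.25, 0.5, 0.75))     # 25th, 50th, 75th percentiles
lin.spl <- lm(y ~ bs(x, knots = cuts, degree = 1))  # linear spline with 3 knots
sm.spl <- smooth.spline(x, y, cv = TRUE)            # lambda by cross-validation
```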

### Q2 GAMs

Fit a GAM for `hipcenter` that consists of three terms:

- a natural spline with 5 degrees of freedom for `Age`,
- a smoothing spline with 3 degrees of freedom for `Thigh`, and
- a simple linear model term for `Ht`.

Plot the resulting contributions of each term to the GAM, and compare them with plots of `hipcenter` against each of the three variables `Age`, `Thigh` and `Ht`. Does the contribution of each term of the GAM make sense in light of these pair-wise plots? Is the GAM fitting the data well?
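One way to fit such a model, assuming the `gam` package (with `ns()` from `splines` for the natural spline and `s()` for the smoothing spline):

```
library(faraway)
library(splines)
library(gam)
gam.fit <- gam(hipcenter ~ ns(Age, df = 5) + s(Thigh, df = 3) + Ht,
               data = seatpos)
par(mfrow = c(1, 3))
plot(gam.fit, se = TRUE)   # contribution of each term, with standard error bands
```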