Chapter 1 Variable Subset Selection

Consider the standard linear model:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon.$$

Here, $y$ is the response variable, the $x$'s are the predictor variables and $\epsilon$ is an error term.

Subset selection involves identifying a subset of the p predictors x1,...,xp of size d that we believe to be related to the response. We then fit a least squares linear regression model using just the reduced set of variables.

For example, if we believe that only $x_1$ and $x_2$ are related to the response, then we may take $d = 2$ and fit a model of the following form:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon,$$

where we have deemed variables $x_3, \dots, x_p$ to be irrelevant.

But how do we determine which variables are relevant? We may just 'know' which variables are most informative for a particular response (for example, from previous data investigations or talking to an expert in the field), but often we need to investigate.

How many possible linear models are we trying to choose from? Well, we will always include the intercept term $\beta_0$. Then each variable $x_1, \dots, x_p$ can either be included or not, hence we have $2^p$ possible models to choose from.
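
To make the count concrete, here is a small Python sketch (the language choice and the use of `itertools` are assumptions, not part of the notes) that enumerates every candidate subset of predictors for $p = 3$; each subset corresponds to one candidate model, with the intercept always included.

```python
from itertools import combinations

predictors = ["x1", "x2", "x3"]  # p = 3 predictors (hypothetical names)

# Every subset of the predictors, from the empty set up to all p of them;
# each subset defines one candidate model (the intercept is always included).
subsets = [subset
           for d in range(len(predictors) + 1)
           for subset in combinations(predictors, d)]

print(len(subsets))  # 8 = 2^3 candidate models
for s in subsets:
    print(s)         # (), ('x1',), ('x2',), ..., ('x1', 'x2', 'x3')
```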

Question: how do we choose which variables should be included in our model?

1.1 Validation by Prediction Error

Much of the material in this section should be a refresher of validation techniques which you have seen in previous modules.

1.1.1 Training Error

The training error of a linear model is evaluated via the Residual Sum of Squares (RSS), which as we have seen is given by

$$\text{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2,$$

where

$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2} + \cdots + \hat{\beta}_d x_{id}$$

is the predicted response value at input $x_i = (x_{i1}, \dots, x_{id})$. Another very common metric is the Mean Squared Error (MSE), which is simply a scaled-down version of the RSS:

$$\text{MSE} = \frac{\text{RSS}}{n},$$

where $n$ is the size of the training sample. Unfortunately, neither of these metrics is appropriate when comparing models of different dimensionality (different numbers of predictors), because both RSS and MSE generally decrease as we add predictors to a linear model.
In fact, in the extreme case of n=p both metrics will be 0! This does not mean that we have a “good” model, just that we have overfitted our linear model to perfectly adjust to the training data. Overfitted models exhibit poor predictive performance (low training error but high prediction error). Our aim is to have simple, interpretable models with relatively small p (in relation to n), which have good predictive performance.
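
As a concrete illustration, the sketch below (using numpy and scikit-learn on simulated data with made-up coefficients; all of these are assumptions rather than part of the notes) fits a least squares model and computes its training RSS and MSE.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, d = 100, 2
X = rng.normal(size=(n, d))                      # illustrative training inputs
y = 2.0 + 1.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)

model = LinearRegression().fit(X, y)             # least squares fit
y_hat = model.predict(X)                         # fitted values on the training data

rss = np.sum((y - y_hat) ** 2)                   # Residual Sum of Squares
mse = rss / n                                    # MSE = RSS / n
print(rss, mse)
```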

1.1.2 Validation

Validation involves keeping a subset of the available data separate for the purposes of calculating a prediction error (otherwise called a test error):

  1. Split the data into a training sample of size $n_T$ and a test/validation sample of size $n_V$ (note: $n = n_T + n_V$).
  2. Train the model using the training sample to get $\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_d$.
  3. Use the validation sample to get predictions: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_d x_d$.
  4. Calculate the validation $\text{RSS} = \sum_{i=1}^{n_V} (y_i - \hat{y}_i)^2$ (a code sketch of these four steps is given after the list below).
  • This procedure can be performed for all possible models and the model with the smallest validation RSS can be selected.
  • A common option is to use 2/3 of the sample for training and 1/3 for testing/validating.
  • One criticism is that we do not use the entire sample for training: this can be problematic especially when the sample size is small to begin with.
  • Solution: K-fold cross-validation (CV) (see Section 1.1.6).
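
A minimal sketch of steps 1 to 4, assuming numpy and scikit-learn (neither is prescribed by the notes) and using the common 2/3 training, 1/3 validation split mentioned above on simulated data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 150
X = rng.normal(size=(n, 3))                                   # illustrative data
y = 1.0 + 1.5 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)

# Step 1: split into a training sample (2/3) and a validation sample (1/3)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=1/3, random_state=1)

# Step 2: train on the training sample to get the estimated coefficients
model = LinearRegression().fit(X_train, y_train)

# Step 3: predict on the validation sample
y_val_hat = model.predict(X_val)

# Step 4: validation RSS (and the corresponding MSE)
val_rss = np.sum((y_val - y_val_hat) ** 2)
val_mse = val_rss / len(y_val)
print(val_rss, val_mse)
```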

1.1.3 Example - Training and Prediction Error

Let us assume a hypothetical scenario where the true model is $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$ with $\beta_0 = 1$, $\beta_1 = 1.5$, $\beta_2 = 0.5$, for some given values of the predictors $x_1$ and $x_2$. Let's also say that the errors are normally distributed with zero mean and variance equal to one; i.e., $\epsilon \sim N(0, 1)$. We will further assume that we have three additional predictors $x_3$, $x_4$ and $x_5$ which are irrelevant to $y$ (in the sense that $\beta_3 = \beta_4 = \beta_5 = 0$), but of course we do not know that beforehand.
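
For concreteness, one way to simulate data from this hypothetical scenario in Python (numpy, the sample size and the predictor distribution are assumptions; the notes only specify the coefficients and the error distribution):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100

# Five predictors, but only the first two affect the response
X = rng.normal(size=(n, 5))
beta0, beta1, beta2 = 1.0, 1.5, 0.5
epsilon = rng.normal(loc=0.0, scale=1.0, size=n)  # errors ~ N(0, 1)

y = beta0 + beta1 * X[:, 0] + beta2 * X[:, 1] + epsilon
# Columns 2-4 of X play the role of x3, x4, x5: they are irrelevant
# to y since beta3 = beta4 = beta5 = 0.
```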

We can plot the training MSE and the prediction MSE against the number of predictors used in the model, as illustrated in Figure 1.1.¹

Figure 1.1: Training MSE and prediction MSE against number of predictors for the example discussed in the text.

Notice the following:

  • Training error: steady (negligible) decline after adding x3 – not really obvious how many predictors to include.
  • Prediction error: clearly here one would select the model with x1 and x2!
  • Prediction errors are larger than training errors – this is almost always the case!

1.1.4 R-squared

The R-squared value is another metric for evaluating model fit:

$$R^2 = 1 - \frac{\text{RSS}}{\text{TSS}},$$

where $\text{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2$ is the Total Sum of Squares. As we know, $R^2$ ranges from 0 (worst fit) to 1 (perfect fit).

$R^2$ is just a function of the RSS (since TSS is a constant), so as we add further predictors $R^2$ will never decrease (and will typically increase).

In the case of $n = p$ we would have an $R^2$ value of 1 (overfitting).
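
A quick sketch (using the same simulated setup as above, with numpy and scikit-learn again assumed) showing the training $R^2$ creeping upwards as predictors are added in index order:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 100
X = rng.normal(size=(n, 5))
y = 1.0 + 1.5 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)

# Fit the model using the first d predictors (in index order) for d = 1, ..., 5
for d in range(1, 6):
    fit = LinearRegression().fit(X[:, :d], y)
    r2 = fit.score(X[:, :d], y)   # training R^2
    print(d, round(r2, 3))        # R^2 never decreases as d grows
```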

In Figure 1.2, we show the value of $R^2$ against the number of predictors in the model for the example of Section 1.1.3.

Figure 1.2: R-squared values against number of predictors for the example discussed in the text.

1.1.5 So, which model?

Given that in practice we may not wish to exclude part of the data for calculating a prediction error using validation (Section 1.1.2), we have two choices:

  • directly estimate the prediction error using cross-validation techniques (Section 1.1.6). This approach is computational in nature.
  • indirectly estimate prediction error by making an adjustment to the training error that accounts for overfitting. Here, we have several model selection criteria available (Section 1.2).

We will proceed to look into these two options in the following sections.

1.1.6 Cross-Validation

k-fold Cross-validation (CV) involves randomly dividing the set of n observations into k groups (or folds) of approximately equal size.
Then, for each group i=1,...,k:

  • Fold $i$ is treated as a validation set, and the model is fit on the remaining $k - 1$ folds.
  • The $\text{MSE}_i$ is then computed based on the observations of the held-out fold $i$.

This process results in $k$ estimates of the test error, $\text{MSE}_1, \dots, \text{MSE}_k$. The $k$-fold CV estimate is computed by averaging these values:

$$\text{CV}_{(k)} = \frac{1}{k} \sum_{i=1}^{k} \text{MSE}_i.$$

Figure 1.3 illustrates this nicely.


Figure 1.3: An illustration of k-fold CV with 5 folds.
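
A minimal sketch of $k$-fold CV with $k = 5$ for a single candidate model, using scikit-learn's KFold on simulated data (the library and the data are assumptions; the notes do not prescribe an implementation):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(7)
n = 100
X = rng.normal(size=(n, 2))                                   # one candidate model
y = 1.0 + 1.5 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)

k = 5
fold_mses = []
for train_idx, val_idx in KFold(n_splits=k, shuffle=True, random_state=7).split(X):
    fit = LinearRegression().fit(X[train_idx], y[train_idx])  # fit on the other k-1 folds
    resid = y[val_idx] - fit.predict(X[val_idx])              # held-out fold i
    fold_mses.append(np.mean(resid ** 2))                     # MSE_i

cv_estimate = np.mean(fold_mses)  # CV_(k): average of the k fold MSEs
print(cv_estimate)
```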

1.2 Model Selection Criteria

As an alternative to the direct estimation of prediction error considered in Section 1.1.6, we now consider a number of indirect techniques that involve adjusting the training error for the model size. These approaches can be used to select among a set of models with different numbers of variables. We proceed to consider four such criteria.

1.2.1 Cp

For a given model with $d$ predictors (out of the $p$ available predictors), Mallows’ $C_p$ criterion is

$$C_p = \frac{1}{n}\left(\text{RSS} + 2d\hat{\sigma}^2\right),$$

where RSS is the residual sum of squares for the model of interest (with $d$ predictors) and $\hat{\sigma}^2 = \text{RSS}_p/(n - p - 1)$ is an estimate of the error variance, with $\text{RSS}_p$ denoting the Residual Sum of Squares for the full model with all $p$ possible predictors included.

In practice, we choose the model which has the minimum $C_p$ value: we essentially penalise models of higher dimensionality (the larger $d$ is, the greater the penalty).

1.2.2 AIC

For linear models (with normal errors, as is often assumed), Mallows’ $C_p$ is equivalent to the Akaike Information Criterion (AIC), as the two are proportional. The AIC is given by

$$\text{AIC} = \frac{1}{n\hat{\sigma}^2}\left(\text{RSS} + 2d\hat{\sigma}^2\right).$$

1.2.3 BIC

Another metric is the Bayesian information criterion (BIC),

$$\text{BIC} = \frac{1}{n\hat{\sigma}^2}\left(\text{RSS} + \log(n)\,d\hat{\sigma}^2\right),$$

where again the model with the minimum BIC value is selected.

In comparison to $C_p$ or AIC, where the penalty is $2d\hat{\sigma}^2$, the BIC penalty is $\log(n)\,d\hat{\sigma}^2$. This means that BIC generally has a heavier penalty (because $\log(n) > 2$ for $n > 7$), and thus BIC selects models with fewer variables than $C_p$ or AIC.

In general, all three criteria are based on rigorous theoretical asymptotic ($n \to \infty$) justifications.

1.2.4 Adjusted R-squared value

A simple method which is not backed up by statistical theory, but that often works well in practice, is to simply adjust the $R^2$ metric by taking into account the number of predictors.

The adjusted $R^2$ value for a model with $d$ variables is calculated as follows:

$$\text{Adjusted } R^2 = 1 - \frac{\text{RSS}/(n - d - 1)}{\text{TSS}/(n - 1)}.$$

In this case we choose the model with the maximum adjusted $R^2$ value.
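
Pulling the four criteria together, here is a Python sketch (numpy/scikit-learn assumed; the function name and the simulated data are illustrative, not from the notes) that evaluates $C_p$, AIC, BIC and adjusted $R^2$ as defined above, with $\hat{\sigma}^2$ taken from the full model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def selection_criteria(X, y, d):
    """Cp, AIC, BIC and adjusted R^2 for the model using the first d predictors of X."""
    n, p = X.shape

    # Error variance estimate from the full model with all p predictors
    full_fit = LinearRegression().fit(X, y)
    rss_p = np.sum((y - full_fit.predict(X)) ** 2)
    sigma2_hat = rss_p / (n - p - 1)

    # RSS and TSS for the model of interest (first d predictors)
    fit_d = LinearRegression().fit(X[:, :d], y)
    rss = np.sum((y - fit_d.predict(X[:, :d])) ** 2)
    tss = np.sum((y - y.mean()) ** 2)

    cp = (rss + 2 * d * sigma2_hat) / n
    aic = (rss + 2 * d * sigma2_hat) / (n * sigma2_hat)
    bic = (rss + np.log(n) * d * sigma2_hat) / (n * sigma2_hat)
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    return cp, aic, bic, adj_r2

# Compare models with d = 1, ..., 5 predictors on simulated data
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = 1.0 + 1.5 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=100)
for d in range(1, 6):
    print(d, [round(v, 3) for v in selection_criteria(X, y, d)])
```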

1.2.5 Example

By plotting the various model selection criteria against the number of predictors (Figure 1.4), we can see that, in our example, $C_p$, AIC and BIC are in agreement (2 predictors). Adjusted $R^2$ would select 3 predictors.


Figure 1.4: Various model selection criteria against number of predictors.

1.2.6 Further points

  • A shortcoming of the previous approaches is that they are not applicable for models that have more variables than the size of the sample (d>n).
  • Also, in the setting of penalised regression (which we will see later on in the course) deciding what d actually is becomes a bit problematic.
  • CV is an effective computational tool which can be used in all settings.

  1. For the purposes of this illustration, we add the variables in index order, so three predictors corresponds to the model with $x_1$, $x_2$ and $x_3$, etc. How we should select $x_1$ and $x_2$ when we don’t know which variables are important comes later!↩︎