Chapter 1 Prerequisites

This book is intended as accompanying material for the course on High-dimensional statistics. Much of the content and many of the examples are taken from Hastie, Tibshirani, and Friedman (2001). The book focuses on regularized regression and its application using R (R Core Team 2021).

1.1 General notation

We will typically denote an input variable by the symbol \(X\). If \(X\) is a vector, its components can be accessed by subscripts \(X_j\). Quantitative outputs will be denoted by \(Y\), and qualitative outputs by \(G\) (for group). We use uppercase letters such as \(X\), \(Y\) or \(G\) when referring to the generic aspects of a variable. Observed values are written in lowercase; hence the \(i\)th observed value of \(X\) is written as \(x_i\) (where \(x_i\) is again a scalar or vector). Matrices are represented by bold uppercase letters; for example, a set of \(N\) input \(p\)-vectors \(x_i\), \(i=1,\ldots,N\), would be represented by the \(N\times p\) matrix \(\mathbf{X}\). All vectors are assumed to be column vectors; the \(i\)th row of \(\mathbf{X}\) is \(x_i^T\).
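This notation can be made concrete in R. The following sketch uses a small hypothetical data set with \(N=5\) observations and \(p=3\) inputs; all numbers are illustrative.

```r
# Hypothetical example: N = 5 observations of p = 3 input variables.
set.seed(1)
N <- 5
p <- 3
X <- matrix(rnorm(N * p), nrow = N, ncol = p)  # the N x p matrix X
dim(X)   # N and p
X[1, ]   # first row of X: the transposed input vector x_1^T
```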

1.2 Statistical learning

For the moment we can loosely state the learning task as follows: given the value of an input vector \(X\), make a good prediction of the output \(Y\), denoted by \(\hat Y\) (pronounced “y-hat”). \(\hat Y\) is the outcome of a learning rule \(\hat{f}(X)\).

We need data to construct learning rules. We thus suppose we have available a set of measurements \((x_i,y_i)\), \(i=1,\ldots,N\), known as the training data, with which to construct \(\hat{f}(X)\).
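As a minimal illustration, such training data can be simulated in R; the linear signal and all names below are hypothetical choices for this sketch.

```r
# Simulated training data (x_i, y_i), i = 1, ..., N, with an assumed
# linear relationship between input and output.
set.seed(1)
N <- 50
x <- rnorm(N)
y <- 2 * x + rnorm(N)   # noisy outputs
fhat <- lm(y ~ x)       # a simple learning rule f-hat(X)
yhat <- predict(fhat, newdata = data.frame(x = 0.5))  # prediction at X = 0.5
```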

A key goal of statistical learning is prediction. Assessing how a method generalizes beyond the observed data is therefore extremely important in practice.

When building a good learning rule \(\hat{f}(X)\) we need to keep in mind the following two fundamental steps:

  1. Training step: In this step we explore, based on the training data, a series of learning rules \(\hat{f}_{\alpha}(X)\), where the index \(\alpha\) is a tuning parameter. We compare the prediction performance of the rules and choose the model with the smallest error. This step is often referred to as model selection. An approximation of the prediction error is obtained either analytically, using so-called information criteria (e.g. AIC, BIC or Mallows’ Cp), or by efficient use of re-sampling (cross-validation and the bootstrap).

  2. Testing step: In this step we evaluate the generalization error of the learning rule \(\hat{f}(X)\) obtained in the training step based on independent test data.

A key part of the training step is the selection of the tuning parameter. We briefly review cross-validation and the Akaike information criterion (AIC). Probably the simplest and most widely used method is \(K\)-fold cross-validation. It approximates the prediction error by splitting the training data into \(K\) chunks, as illustrated below (here \(K=5\)).
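The split can be sketched in R as follows; fold labels are assigned at random, and the sample size is a hypothetical choice.

```r
# Assign each of N = 20 training observations to one of K = 5 folds.
set.seed(1)
N <- 20
K <- 5
folds <- sample(rep(1:K, length.out = N))  # random fold labels
table(folds)  # each fold ("chunk") contains N/K = 4 observations
```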

Each chunk is then used as “hold-out” validation data to calculate the error of the \(\alpha\)th model trained on the other \(K-1\) data chunks. In that way we obtain \(K\) error estimates, and we typically take their average as the cross-validation error (denoted by \({\rm CV}(\alpha)\)). The next plot shows a typical cross-validation error curve. The curve attains its minimum at the model with \(\alpha=4\) (here \(\alpha\) represents the number of included predictors).
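The full procedure can be sketched for a single candidate model; the simulated data, the linear fit, and the use of mean squared error are all assumptions made for illustration.

```r
# K-fold cross-validation for one candidate model (a simple linear fit).
set.seed(1)
N <- 100
K <- 5
x <- rnorm(N)
y <- 1 + 2 * x + rnorm(N)
folds <- sample(rep(1:K, length.out = N))
cv_err <- numeric(K)
for (k in 1:K) {
  train <- folds != k
  fit <- lm(y ~ x, subset = train)                         # fit on K-1 chunks
  pred <- predict(fit, newdata = data.frame(x = x[!train]))
  cv_err[k] <- mean((y[!train] - pred)^2)                  # hold-out MSE of fold k
}
cv <- mean(cv_err)  # CV(alpha): average of the K fold errors
```

Repeating this loop over a grid of tuning parameters \(\alpha\) and picking the minimizer of \({\rm CV}(\alpha)\) yields the selected model.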

The AIC approach is grounded in information theory. We select the model with the smallest AIC

\[ {\rm AIC}(\alpha)=-2\;{\rm loglik}+2\;d_{\alpha}. \] Thus, AIC rewards goodness of fit (as assessed by the log-likelihood \({\rm loglik}\)), but it also includes a penalty that is an increasing function of the number of estimated parameters \(d_{\alpha}\). The figure below shows the AIC curve for the same example. The AIC approach also suggests using a model with \(\alpha=4\) predictors.
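In R, the built-in `AIC()` function computes this criterion for fitted models. The sketch below compares nested linear models on simulated data in which, by construction, only the first two predictors carry signal; the design is a hypothetical example.

```r
# AIC for a sequence of nested linear models indexed by alpha,
# the number of included predictors (simulated data).
set.seed(1)
N <- 100
X <- matrix(rnorm(N * 4), N, 4)
y <- X[, 1] + X[, 2] + rnorm(N)
fits <- lapply(1:4, function(a) lm(y ~ X[, 1:a, drop = FALSE]))
aic <- sapply(fits, AIC)  # -2 loglik + 2 d_alpha for each model
which.min(aic)            # the alpha with the smallest AIC
```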


Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2001. The Elements of Statistical Learning. Springer Series in Statistics. New York, NY, USA: Springer New York Inc.
R Core Team. 2021. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.