2 Statistical Learning

2.1 What Is Statistical Learning?

For a quantitative response \(Y\) and \(p\) different predictors \(X_1, X_2, \dots, X_p\), we assume there is some relationship between \(Y\) and the predictors \(X\). This can be written in the very general form:

\[ Y = f(X) + \epsilon \]

where \(f\) is some fixed but unknown function of \(X_1, \dots, X_p\), and \(\epsilon\) is a random error term, which is independent of \(X\) and has mean zero.
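To make the model concrete, here is a minimal Python sketch that simulates data from \(Y = f(X) + \epsilon\). The function `true_f`, the noise level, and the range of \(X\) are arbitrary choices made purely for illustration; in practice \(f\) is unknown and only the pairs \((X, Y)\) are observed.

```python
import random


def true_f(x):
    # Hypothetical "unknown" function, chosen only for illustration.
    return 2.0 + 3.0 * x


def simulate(n, noise_sd=1.0, seed=0):
    """Draw n observations from Y = f(X) + eps with mean-zero noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    # The errors are drawn independently of X and have mean zero.
    eps = [rng.gauss(0.0, noise_sd) for _ in range(n)]
    ys = [true_f(x) + e for x, e in zip(xs, eps)]
    return xs, ys, eps


xs, ys, eps = simulate(10_000)
mean_eps = sum(eps) / len(eps)
print(f"sample mean of the errors: {mean_eps:.4f}")
```

The sample mean of the simulated errors should be close to zero, matching the assumption on \(\epsilon\).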

In essence, statistical learning refers to a set of approaches for estimating \(f\). In this chapter we outline some of the key theoretical concepts that arise in estimating \(f\), as well as tools for evaluating the estimates obtained.

2.1.1 Why Estimate \(f\)?

There are two main reasons to estimate \(f\): prediction and inference.

Prediction

In cases where inputs \(X\) are available, but output \(Y\) is not easily obtained, we can predict \(Y\) using

\[ \hat{Y} = \hat{f}(X) \]

where \(\hat{f}\) represents our estimate for \(f\), and \(\hat{Y}\) represents our prediction for \(Y\). We don’t necessarily care about the form of \(\hat{f}\); it can be treated as a black box as long as it gives good predictions.

Inference

When we are interested in understanding the association between \(Y\) and \(X\), we wish to estimate \(f\) and need to know the exact form of \(\hat{f}\) (i.e. we cannot treat it as a black box).

Depending on whether our ultimate goal is prediction, inference, or a combination of the two, different methods for estimating \(f\) may be appropriate. For example, linear models allow for relatively simple and interpretable inference, but may not yield as accurate predictions as some other approaches. In contrast, some of the highly non-linear approaches that we discuss in the later chapters of this book can potentially provide quite accurate predictions for \(Y\), but this comes at the expense of a less interpretable model for which inference is more challenging.

2.1.2 How Do We Estimate \(f\)?

Our goal is to apply a statistical learning method to the training data in order to estimate the unknown function \(f\). In other words, we want to find a function \(\hat{f}\) such that \(Y \approx \hat{f}(X)\) for any observation \((X, Y)\). Broadly speaking, most statistical learning methods for this task can be characterized as either parametric or non-parametric. We now briefly discuss these two types of approaches.

Parametric Methods

The parametric method involves estimating a set of parameters. It is a two-step model-based approach:

  1. Make an assumption about the functional form or shape of \(f\).
  2. Fit or train the model.

A simple assumption is that \(f\) is linear in \(X\):

\[ f(X) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p \]

In the case of this linear model, we need to estimate the \(p + 1\) coefficients \(\beta_0, \beta_1, \dots, \beta_p\) such that \(Y \approx \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p\). The most common approach to fitting a model like this is (ordinary) least squares.
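As a rough illustration of the two-step parametric approach, the sketch below assumes a linear form with a single predictor (\(p = 1\)) and fits it by ordinary least squares using the closed-form solution. The function name `fit_simple_ols` and the simulated data (true intercept 1, true slope 2) are assumptions for this example, not something taken from the book.

```python
import random


def fit_simple_ols(xs, ys):
    """Least squares estimates (beta0, beta1) for Y ~ beta0 + beta1 * X."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope: sample covariance of (x, y) divided by sample variance of x.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = sxy / sxx
    # Intercept: forces the fitted line through the point (x_bar, y_bar).
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1


# Step 1: assume f is linear. Step 2: fit the model to (simulated) training data.
rng = random.Random(1)
xs = [rng.uniform(0.0, 10.0) for _ in range(200)]
ys = [1.0 + 2.0 * x + rng.gauss(0.0, 0.5) for x in xs]  # hypothetical truth

beta0, beta1 = fit_simple_ols(xs, ys)
print(f"beta0_hat = {beta0:.3f}, beta1_hat = {beta1:.3f}")
```

The estimates should land near the values used to generate the data. With more than one predictor the same idea applies, but the solution is usually computed with matrix routines rather than by hand.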

Non-Parametric Methods

Non-parametric methods do not make explicit assumptions about the functional form of \(f\). Instead they seek an estimate of \(f\) that gets as close to the data points as possible without being too rough or wiggly. Such approaches can have a major advantage over parametric approaches: by avoiding the assumption of a particular functional form for \(f\), they have the potential to accurately fit a wider range of possible shapes for \(f\).
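One simple non-parametric estimator, used here only as an example, is \(K\)-nearest-neighbors regression: the estimate of \(f(x_0)\) is the average response of the \(K\) training points closest to \(x_0\). No functional form is assumed. The sine curve, noise level, and choice of \(K = 15\) below are arbitrary assumptions for the sketch.

```python
import math
import random


def knn_regress(x0, xs, ys, k):
    """Estimate f(x0) as the mean response of the k nearest training points."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / k


# Simulated training data from a non-linear f (hypothetical: f(x) = sin(x)).
rng = random.Random(0)
xs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(500)]
ys = [math.sin(x) + rng.gauss(0.0, 0.2) for x in xs]

x0 = math.pi / 2
estimate = knn_regress(x0, xs, ys, k=15)
print(f"estimate at x0 = {estimate:.3f}, true f(x0) = {math.sin(x0):.3f}")
```

The estimate tracks the curved shape of \(f\) without ever being told that \(f\) is a sine function, which is the advantage described above. The price is that many observations are needed near each \(x_0\).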

2.1.3 The Trade-Off Between Prediction Accuracy and Model Interpretability

One might reasonably ask the following question: why would we ever choose to use a more restrictive method instead of a very flexible approach? There are several reasons that we might prefer a more restrictive model. If we are mainly interested in inference, then restrictive models are much more interpretable.

We have established that when inference is the goal, there are clear advantages to using simple and relatively inflexible statistical learning methods. In some settings, however, we are only interested in prediction, and the interpretability of the predictive model is simply not of interest. For instance, if we seek to develop an algorithm to predict the price of a stock, our sole requirement for the algorithm is that it predict accurately; interpretability is not a concern. In this setting, we might expect that it will be best to use the most flexible model available. Surprisingly, this is not always the case! We will often obtain more accurate predictions using a less flexible method, because highly flexible methods are prone to overfitting the training data.

2.1.4 Supervised Versus Unsupervised Learning

  • Supervised
    • for each observation, there is an associated response
    • we wish to fit a model that relates the response to predictors, with the aim of accurately predicting the response or of better understanding the relationship between the response and the predictors
    • the focus of the book
  • Unsupervised
    • no associated response
    • we can seek to understand relationships between variables or between observations
    • one example: cluster analysis

2.2 Assessing Model Accuracy

There is no free lunch in statistics: no one method dominates all others over all possible data sets. On a particular data set, one specific method may work best, but some other method may work better on a similar but different data set. Hence it is an important task to decide for any given set of data which method produces the best results. Selecting the best approach can be one of the most challenging parts of performing statistical learning in practice.

2.2.1 Measuring the Quality of Fit

In the regression setting, the most commonly-used measure is the mean squared error (MSE):

\[ \text{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{f}(x_i))^2 \]

We can compute the MSE on both training and test data, but we care much more about the latter. The goal is to minimize the test MSE, but what if we only have training data? Choosing the model that minimizes the training MSE is not always a good idea, because a sufficiently flexible model can overfit: it lowers the training MSE by fitting noise, which tends to raise the test MSE.

In practice, one can usually compute the training MSE with relative ease, but estimating the test MSE is considerably more difficult because usually no test data are available. The flexibility level corresponding to the model with the minimal test MSE can also vary considerably among data sets.
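The gap between training and test MSE can be seen in a small simulation. The sketch below uses \(K\)-nearest-neighbors regression as a stand-in for a method with adjustable flexibility (small \(K\) is more flexible); that choice, the sine-shaped \(f\), and the sample sizes are all assumptions made for illustration.

```python
import math
import random


def mse(ys, preds):
    """Mean squared error between observed responses and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def knn_regress(x0, xs, ys, k):
    """Mean response of the k training points nearest to x0."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / k


def make_data(n, rng):
    # Hypothetical truth: f(x) = sin(x), with noise standard deviation 0.3.
    xs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0.0, 0.3) for x in xs]
    return xs, ys


rng = random.Random(42)
train_x, train_y = make_data(200, rng)
test_x, test_y = make_data(1000, rng)

results = {}
for k in (1, 10, 50):
    train_pred = [knn_regress(x, train_x, train_y, k) for x in train_x]
    test_pred = [knn_regress(x, train_x, train_y, k) for x in test_x]
    results[k] = (mse(train_y, train_pred), mse(test_y, test_pred))
    print(f"K={k:2d}  train MSE={results[k][0]:.3f}  test MSE={results[k][1]:.3f}")
```

With \(K = 1\) every training point is its own nearest neighbor, so the training MSE is zero, yet the test MSE is clearly not. This is overfitting in its most extreme form, and it is why the training MSE alone is a poor guide for choosing a model.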

2.2.2 The Bias-Variance Trade-Off

The expected test MSE for a given value of \(x_0\) can be decomposed into the sum of three quantities:

\[ E(y_0 - \hat{f}(x_0))^2 = \text{Var}(\hat{f}(x_0)) + [\text{Bias}(\hat{f}(x_0))]^2 + \text{Var}(\epsilon) \]

This tells us that, to minimize the expected test MSE, we need to select a statistical learning method that simultaneously achieves low variance and low bias (the variance of the error term is irreducible, so no method can push the expected test MSE below it).

The variance here refers to the amount by which \(\hat{f}\) would change if estimated using different training data. More rigid models tend to have low variance; flexible models tend to have high variance.

The bias is, conceptually, the error introduced by approximating a complex real-life scenario with a much simpler model. A linear regression model can introduce substantial bias when the true relationship is not linear. Flexible methods tend to result in less bias.
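The decomposition can be checked numerically by refitting an estimator on many independent training sets and looking at its predictions at a fixed \(x_0\). The sketch below again uses \(K\)-nearest-neighbors regression as the estimator; the true function, noise level, \(x_0 = 1\), and the values of \(K\) are assumptions chosen for this example.

```python
import math
import random
import statistics

NOISE_SD = 0.3  # standard deviation of epsilon (hypothetical)
X0 = 1.0        # point at which the expected test MSE is decomposed


def f(x):
    # Hypothetical true function.
    return math.sin(x)


def knn_predict(x0, xs, ys, k):
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / k


def decompose(k, n_train=50, n_sims=2000, seed=0):
    """Monte Carlo estimates of (variance, squared bias, expected test MSE) at X0."""
    rng = random.Random(seed)
    preds, sq_errors = [], []
    for _ in range(n_sims):
        # A fresh training set for every simulation.
        xs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_train)]
        ys = [f(x) + rng.gauss(0.0, NOISE_SD) for x in xs]
        pred = knn_predict(X0, xs, ys, k)
        # A fresh test response at X0.
        y0 = f(X0) + rng.gauss(0.0, NOISE_SD)
        preds.append(pred)
        sq_errors.append((y0 - pred) ** 2)
    variance = statistics.pvariance(preds)
    bias_sq = (statistics.fmean(preds) - f(X0)) ** 2
    return variance, bias_sq, statistics.fmean(sq_errors)


for k in (1, 25):
    variance, bias_sq, test_mse = decompose(k)
    total = variance + bias_sq + NOISE_SD ** 2
    print(f"K={k:2d}  var={variance:.4f}  bias^2={bias_sq:.4f}  "
          f"var+bias^2+Var(eps)={total:.4f}  simulated test MSE={test_mse:.4f}")
```

The flexible fit (\(K = 1\)) should show higher variance and lower bias than the rigid one (\(K = 25\)), and in both cases the three terms should add up to roughly the simulated test MSE.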

2.2.3 The Classification Setting

When estimating \(f\) with a qualitative response \(y\), the most common approach for quantifying the accuracy of \(\hat{f}\) is the training error rate. This is the proportion of mistakes made when \(\hat{f}\) is applied to the training observations:

\[ \frac{1}{n} \sum_{i=1}^n I(y_i \neq \hat{y}_i) \]

As with the regression setting, we seek a classifier for which the test error rate is smallest.
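The error rate is just the average of the indicator values \(I(y_i \neq \hat{y}_i)\), so it is straightforward to compute. A minimal Python sketch follows; the labels are made up solely to exercise the function.

```python
def error_rate(y_true, y_pred):
    """Proportion of observations whose predicted class differs from the true class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    mistakes = sum(yt != yp for yt, yp in zip(y_true, y_pred))
    return mistakes / len(y_true)


# Hypothetical labels: one of the four predictions is wrong.
y_obs = ["yes", "no", "yes", "no"]
y_hat = ["yes", "yes", "yes", "no"]
print(error_rate(y_obs, y_hat))  # 1 mistake out of 4 -> 0.25
```

The same function gives the training error rate when applied to training observations and the test error rate when applied to held-out observations.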

The Bayes Classifier

It is possible to show (though the proof is outside of the scope of this book) that the test error rate given in (2.9) is minimized, on average, by a very simple classifier that assigns each observation to the most likely class, given its predictor values. In other words, we should simply assign a test observation with predictor vector \(x_0\) to the class \(j\) for which

\[ \text{Pr}(Y = j | X = x_0) \]

is largest… This very simple classifier is called the Bayes classifier.

The Bayes error rate is analogous to the irreducible error: when the classes overlap, some observations cannot be classified correctly even with knowledge of the true conditional probabilities.
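When the data-generating distribution is known, as it is in a simulation, the Bayes classifier and its error rate can be computed directly. The sketch below assumes a toy setup that is not from the book: two equally likely classes, with \(X\) normally distributed with unit variance and mean \(-1\) in class 0 and mean \(+1\) in class 1. In that setup the Bayes error rate is \(\Phi(-1) \approx 0.159\), where \(\Phi\) is the standard normal distribution function.

```python
import math
import random

# Hypothetical setup: class means, unit variance, equal prior probabilities.
MEANS = {0: -1.0, 1: 1.0}
PRIOR = 0.5


def normal_pdf(x, mean, sd=1.0):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior(j, x0):
    """Pr(Y = j | X = x0), computed with Bayes' theorem."""
    numerator = PRIOR * normal_pdf(x0, MEANS[j])
    denominator = sum(PRIOR * normal_pdf(x0, m) for m in MEANS.values())
    return numerator / denominator


def bayes_classify(x0):
    """Assign x0 to the class with the largest conditional probability."""
    return max(MEANS, key=lambda j: posterior(j, x0))


# Estimate the error rate of the Bayes classifier by simulation.
rng = random.Random(0)
n = 20_000
errors = 0
for _ in range(n):
    y = rng.randrange(2)
    x = rng.gauss(MEANS[y], 1.0)
    errors += int(bayes_classify(x) != y)

empirical = errors / n
theoretical = 0.5 * (1.0 + math.erf(-1.0 / math.sqrt(2.0)))  # Phi(-1)
print(f"empirical error rate: {empirical:.4f}")
print(f"Bayes error rate:     {theoretical:.4f}")
```

Even with full knowledge of the conditional probabilities, the classifier is wrong about 16% of the time here, because the two classes overlap.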

\(K\)-Nearest Neighbors

The Bayes classifier is the gold standard, but unattainable for real data because we do not know the conditional distribution \(\text{Pr}(Y|X)\).

One method that attempts to estimate this conditional distribution is the \(K\)-nearest neighbors (KNN) classifier. For a test observation \(x_0\), KNN identifies the \(K\) points in the training data that are closest to \(x_0\), represented by \(\mathcal{N}_0\). It then estimates the conditional probability of class \(j\) as the fraction of points in \(\mathcal{N}_0\) whose response equals \(j\):

\[ \text{Pr}(Y = j|X = x_0) = \frac{1}{K} \sum_{i \in \mathcal{N}_0} I (y_i = j). \]

KNN then classifies \(x_0\) into the class \(j\) with the highest estimated probability.

The choice of \(K\) has a drastic effect on the performance of the classifier. \(K = 1\) is the most flexible, i.e. low bias but high variance. A large value of \(K\) (which is relative to the density of points) is the least flexible, i.e. high bias but low variance.
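The KNN rule is short enough to write out in full. The following is a minimal sketch using Euclidean distance, which is an assumption (any distance could be used); the function names and the tiny two-class data set are made up for illustration.

```python
import math
from collections import Counter


def knn_class_probs(x0, train_x, train_y, k):
    """Estimated Pr(Y = j | X = x0): fraction of the k nearest neighbors in class j."""
    nearest = sorted(range(len(train_x)),
                     key=lambda i: math.dist(train_x[i], x0))[:k]
    counts = Counter(train_y[i] for i in nearest)
    return {label: count / k for label, count in counts.items()}


def knn_classify(x0, train_x, train_y, k):
    """Assign x0 to the class with the highest estimated probability."""
    probs = knn_class_probs(x0, train_x, train_y, k)
    return max(probs, key=probs.get)


# Hypothetical training data: two well-separated groups in two dimensions.
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["blue", "blue", "blue", "orange", "orange", "orange"]

x0 = (1, 1)
for k in (3, 5):
    print(k, knn_class_probs(x0, train_x, train_y, k),
          knn_classify(x0, train_x, train_y, k))
```

With \(K = 3\) all neighbors of \(x_0\) are blue; with \(K = 5\) two orange points enter the neighborhood and the estimated probabilities become less extreme, which is the smoothing effect of a larger \(K\).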

2.3 Lab: Introduction to R

I’ll be skipping this, as it is just a basic introduction of base R.