Chapter 2 Assessing ML models

To assess whether a generated model performs well, or to compare the performance of different models, we need metrics that measure this performance. This chapter describes the most commonly used metrics.
It is important to realize that we are mainly interested in how well a model performs out of sample, i.e. on data outside the data set used to generate the model. Comparing the in-sample performance with the out-of-sample performance is also informative: if a model performs well within the sample but poorly outside it, this is generally a signal that it is overfitting. It is quite easy to assess how well a model performs within the sample; the out-of-sample performance can only be estimated. Commonly used techniques to do so are discussed in 2.3.

First we discuss the most common model performance metrics in 2.1 and 2.2.

2.1 Assessing Regression Models

The most common metrics for assessing a regression model are without doubt the Mean Squared Error (MSE) and the Root Mean Squared Error (RMSE). The latter is the square root of the former and is expressed in the same unit as the target variable, which makes it the natural choice when assessing a single model; when comparing several models the MSE is just as useful.
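For reference, with $y_i$ the observed target values, $\hat{y}_i$ the corresponding model predictions and $n$ the number of observations, the two metrics are defined as

$$
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}.
$$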

2.2 Assessing Classification Models

When assessing a classification model, a justified choice of metric must be made; the application domain of the model determines which metric is appropriate. Commonly used metrics are: (1) Accuracy, (2) Sensitivity or True Positive Rate, (3) Specificity or True Negative Rate. Wikipedia gives a good overview of these and other metrics for the binary classification case.
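As an illustration, a minimal sketch in Python of these three metrics for a binary problem, assuming the labels are encoded as 0 (negative) and 1 (positive); the helper name `binary_metrics` and the example labels at the end are made up.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Compute accuracy, sensitivity (TPR) and specificity (TNR) for 0/1 labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))   # true positives
    tn = np.sum((y_true == 0) & (y_pred == 0))   # true negatives
    fp = np.sum((y_true == 0) & (y_pred == 1))   # false positives
    fn = np.sum((y_true == 1) & (y_pred == 0))   # false negatives
    return {
        "accuracy":    (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),           # true positive rate
        "specificity": tn / (tn + fp),           # true negative rate
    }

# made-up example labels
print(binary_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 0, 1, 1]))
```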

2.3 Measuring out-of-sample performance

One pitfall in generating a Machine Learning model is overfitting: the model performs very well on the data used to generate it, but much worse on data outside that sample. When assessing a model, we should therefore estimate how well it performs out of sample.
There are different ways to estimate this out-of-sample performance. Commonly used techniques are discussed in the following sections.
The described techniques are used to investigate which type of model best describes the structure in a data set, e.g. a linear regression model, a regression tree or some other model. After it has been decided which model is most useful in a given situation, the final model parameters are estimated using the complete available data set.

2.3.1 Training and test data set

The most basic way to estimate out-of-sample performance is to split the data set at random into a training data set and a test data set. In general, the training data set contains 50% or more of all the available data.
The model is generated and trained using the training data set. The performance of the model on the test data set is then used as an estimate of the out-of-sample performance.
To improve this estimate, the procedure can be repeated, e.g. a few hundred times, which leads to a collection of estimates of the out-of-sample performance. The mean of these estimates can be used as the final estimate. The spread of the estimates also gives an idea of how accurate the final estimate is; in other words, the procedure can be used to construct an interval estimate for the out-of-sample performance of the model.
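A minimal sketch of this repeated splitting in Python, using scikit-learn with a linear regression model purely as a stand-in; the data, the 100 repetitions and the 70/30 split are all made-up choices. The last line gives a point estimate and a rough interval estimate for the out-of-sample MSE.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))                               # made-up predictors
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(size=200)   # made-up target

mse_estimates = []
for i in range(100):                                        # repeat the random split
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=i)
    model = LinearRegression().fit(X_tr, y_tr)              # train on the training set
    mse_estimates.append(mean_squared_error(y_te, model.predict(X_te)))

print(np.mean(mse_estimates))                               # point estimate
print(np.percentile(mse_estimates, [2.5, 97.5]))            # rough 95% interval
```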

2.3.2 Cross validation

In cross-validation, the available data set is randomly divided into k groups (folds) of (approximately) equal size. One group is set apart as the test set and the other groups are used to train the model; then the next group is used as the test set, and so on, until all k groups have served once as a test set. This procedure leads to k estimates of the out-of-sample error. The average of these estimates is the final estimate of the out-of-sample performance.
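A corresponding sketch of 5-fold cross-validation, again with scikit-learn, made-up data and a linear regression model as a stand-in.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))                               # made-up predictors
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(size=200)   # made-up target

kf = KFold(n_splits=5, shuffle=True, random_state=1)
fold_mse = []
for train_idx, test_idx in kf.split(X):                     # each fold is the test set once
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    fold_mse.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

print(np.mean(fold_mse))                                    # final out-of-sample estimate
```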

2.3.3 Repeated Cross Validation

To arrive at an even better estimate of the out-of-sample performance, the cross-validation procedure can be repeated a number of times, say t times. This leads to t*k estimates of the out-of-sample performance, which are averaged into one final estimate.
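With scikit-learn this can be sketched using RepeatedKFold; here t = 10 and k = 5 are arbitrary choices, and the data and the linear regression model are again made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))                               # made-up predictors
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(size=200)   # made-up target

cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=1)   # t*k = 50 estimates
scores = cross_val_score(LinearRegression(), X, y, cv=cv,
                         scoring="neg_mean_squared_error")
print(-scores.mean())                                          # final MSE estimate
```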

2.3.3.1 Leave One Out Cross Validation

Leave One Out Cross Validation (LOOCV) is a special case of cross-validation in which each test set contains exactly one data point and the model is trained on all data points except the one left out. This leads to n estimates of the out-of-sample performance, where n is the number of data points, which are averaged to find the final estimate.
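A minimal LOOCV sketch along the same lines, using LeaveOneOut from scikit-learn on a small made-up data set; note that LOOCV trains n models, so it becomes expensive for large n.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))                                # small made-up data set
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(size=50)

scores = cross_val_score(LinearRegression(), X, y, cv=LeaveOneOut(),
                         scoring="neg_mean_squared_error")  # n estimates
print(-scores.mean())                                       # final estimate
```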

2.3.4 Bootstrapping

Bootstrapping is a resampling technique for estimating a population parameter. It can be used for point estimates as well as for interval estimates.
The bootstrapping technique rests on one assumption: that the sample is a representative sample of the population being studied, at least as far as the parameter under study is concerned. If that is the case, a simulation of this population can be generated by replicating the sample many times. From this simulated population, many samples of the original sample size can be drawn, and every sample gives an estimate of the parameter under study. Together, these estimates form a good approximation of the sampling distribution of the parameter, which can be used for point and interval estimates of the parameter.
Instead of actually replicating the sample many times and drawing samples from this simulated population, in practice samples are drawn from the original sample with replacement, with the same size as the original sample.
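A minimal sketch of this resampling in Python for a simple parameter, the population mean; the sample, the 1000 bootstrap replicates and the 95% level are arbitrary, made-up choices.

```python
import numpy as np

rng = np.random.default_rng(1)
sample = rng.normal(loc=10, scale=2, size=80)        # made-up original sample

boot_means = np.empty(1000)
for b in range(1000):
    # draw with replacement, same size as the original sample
    resample = rng.choice(sample, size=sample.size, replace=True)
    boot_means[b] = resample.mean()

print(boot_means.mean())                             # point estimate
print(np.percentile(boot_means, [2.5, 97.5]))        # 95% interval estimate
```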
Bootstrapping is widely used in ML to estimate the out-of-sample error. Bootstrapped samples are used as training data, and the observations that fall outside a bootstrapped sample (the out-of-bag observations) are used as test data. If the procedure makes use of 1000 bootstrapped samples, it yields 1000 estimates of the out-of-sample performance.
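A sketch of this use of bootstrapping, again with made-up data and a linear regression model as a stand-in: each model is trained on a bootstrapped sample and evaluated on the observations that were not drawn into that sample.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 3))                                 # made-up predictors
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(size=n)     # made-up target

oob_mse = []
for b in range(1000):                                       # 1000 bootstrapped samples
    boot_idx = rng.integers(0, n, size=n)                   # indices drawn with replacement
    oob_mask = np.ones(n, dtype=bool)
    oob_mask[boot_idx] = False                              # observations left out of this sample
    if not oob_mask.any():
        continue                                            # extremely unlikely, but be safe
    model = LinearRegression().fit(X[boot_idx], y[boot_idx])
    oob_mse.append(mean_squared_error(y[oob_mask], model.predict(X[oob_mask])))

print(np.mean(oob_mse))                                     # estimated out-of-sample MSE
```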

