Chapter 5 Evaluating predictive models

In this chapter we present various metrics for evaluating predictive models, both for discrete targets and continuous targets. The content here will be used throughout the remainder of the text as different algorithms are presented.

5.1 Introduction

In predictive analytics applications it is common practice to try multiple different models, assess their predictions using a training sample, and then select the best model in terms of some criterion.

There are, however, many things that can go wrong. The data may be bad, the underlying model may be wrong, or the right predictors may not have been included. This can result in false positives and false negatives in binary classification models. One of the two outcomes is usually the one we are concerned with, or the one with a favorable result. If the model accurately predicts this outcome, then it is a true positive. If the model accurately predicts the outcome we are not interested in, then it is a true negative.

While looking at many different models is normally frowned upon in statistics, with large data sets where prediction is the objective, it is common practice.

Even if a single model is used, such as logistic regression, there are different formulations to be tried, e.g., different predictor variables, different transformations of the predictors, and so on.

Because of the practice of trying different models, we need a means to compare the models.

5.2 Training, Testing, and Validation samples

Since the process of building a predictive model usually involves trying several (or many) different models to determine which performs best on a given data set, it is important to have a separate data set that was not used to build the model.

The strategy for validating a predictive model depends on the number of observations available. The question is whether to create two subsets (training and validation) or three (training, testing, and validation). There should be enough observations that a training set can be created with at least 50 to 100 observations per predictor variable being considered. If the data set has enough observations, then a common approach is to create three subsets: training, test, and validation.1

A model is created using the training set. Then, the performance of the model is assessed using the test set. If the model does not perform well on the test set, the process returns to the training subset and the model is retrained using different settings on the algorithm or a totally different algorithm. This process may be repeated several times.

The relative proportions for the three subsets are somewhat arbitrary, but typical ratios are 60/20/20, 50/30/20, and 50/25/25 for the training, test, and validation sets respectively.

KNIME and other software tools provide convenient ways to construct training, testing, and validation sets. These subsets are created via a random sampling process as a way of avoiding bias.
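To make the random-sampling idea concrete, here is a minimal sketch in Python of a 60/20/20 split into training, test, and validation subsets (function and variable names are illustrative):

```python
import random

def split_data(rows, train=0.6, test=0.2, seed=42):
    """Randomly partition rows into training, test, and validation subsets.

    The remaining fraction (1 - train - test) goes to validation.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)        # random sampling avoids ordering bias
    n = len(rows)
    n_train = int(n * train)
    n_test = int(n * test)
    return (rows[:n_train],                  # training set
            rows[n_train:n_train + n_test],  # test set
            rows[n_train + n_test:])         # validation set

train_set, test_set, valid_set = split_data(range(1000))
print(len(train_set), len(test_set), len(valid_set))  # 600 200 200
```

Shuffling before slicing is what makes the three subsets random samples of the original data rather than contiguous (and possibly biased) chunks.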

One approach where the number of observations is limited is to create just training and validation sets. There is no firm rule for what the split should be, but typical ratios are 50/50, 60/40, 70/30, and 80/20 for the training and validation sets respectively. However, with only two subsets of data, using the validation set to inform the model building process will not provide a proper test of the ability of the model to predict new data. For example, some procedures, such as stepwise regression, use the validation set to evaluate the accuracy of each iteration.

Where only two subsets are used, the validation set should be held separate from the model building process. When a final model has been developed, the validation set can be used to evaluate performance on “new” data. Testing can be done using only the training data via k-fold validation.

With k-fold performance assessment, the training data is randomly divided into k groups, where k is typically five or ten. The analysis is run k times, with each of the k subsets in turn assigned to the role of test data and the remaining data used to build the model. The goal is to create a model that will perform well on unseen data, which is held out in the validation set.

The process is illustrated in Figure 5.1 for k = 5. The training data is randomly divided into five equally sized subsets. Then, the first subset is held out and the remaining four subsets used to build a model. The model is assessed on the held out first subset. This process is repeated five times. The model performance is then averaged across the five iterations.


Figure 5.1: K-fold validation.
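The iteration just described can be sketched in Python as follows (a minimal illustration with hypothetical names; the model-fitting step is left as a comment):

```python
import random

def kfold_indices(n, k=5, seed=1):
    """Split indices 0..n-1 into k folds; yield (train, held_out) index pairs."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]   # k roughly equal random subsets
    for i in range(k):
        held_out = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, held_out

# Each observation is held out exactly once across the k iterations.
scores = []
for train_idx, test_idx in kfold_indices(100, k=5):
    # model = fit(data[train_idx]); scores.append(evaluate(model, data[test_idx]))
    scores.append(len(test_idx))
print(scores)  # five folds of 20 observations each
```

In practice the collected scores would be performance metrics, and their average is the k-fold estimate of model performance.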

KNIME also has nodes to perform k-fold validation.

5.3 Evaluating continuous versus discrete targets

There are two main types of supervised models for which evaluation measures are needed. First, prediction models involve continuous dependent or target variables. This is typically the case with multiple regression, although other techniques make predictions of continuous variables as well.

The second main type is performance evaluation for classification, which involves assignment of cases to groups.

5.3.1 Evaluating performance with continuous targets

The objective is to find out how well the model predicts new data, so prediction metrics should be computed on a new data set, e.g., the validation and/or test set, or with k-fold validation methods. Sometimes it is useful to compare the performance on the training data with that on the holdout or new data to assess whether the model overfit the training data.

Typical measures used to evaluate continuous targets are listed below, where \(y\) contains the actual target values and \(\hat{y}\) contains the predicted values using the validation data.


\(R^2\) can be interpreted as the variance “explained” by a model. As such, the value ranges between 0 and 1.0. \(R^2\) is probably the most familiar measure for continuous targets because of its use in multiple regression.

The formula for \(R^2\) is:

\[\begin{equation} R^{2} = 1 - \frac { \sum_{i=1}^{n} ( y_i -\hat{y_i} )^{2} } { \sum_{i=1}^{n} ( y_i -\bar{y} )^{2} } \end{equation}\]


The mean absolute error (MAE) is measured in the same units as the data. Because it does not square the errors, as other measures do, it is less sensitive to occasional very large errors.

The formula for MAE is:

\[\begin{equation} MAE = \frac{1}{n} \sum_{i=1}^{n} |{y_i}-\hat{y_i}| \tag{5.1} \end{equation}\]


The MAPE, or mean absolute percent error, is one of the more attractive measures because reporting the error as a percentage makes intuitive sense. However, it has some limitations. First, if any of the actual values are zero, the error is undefined because there would be division by zero. Second, it has been found to favor models that systematically predict values that are too low rather than too high. That is because for estimates that are too low, the percentage error cannot exceed 100%, but for predictions that are too high, there is no upper limit to the percentage error. Finally, if there are both positive and negative values in the data, MAPE should not be used.

The formula for MAPE is:

\[\begin{equation} MAPE = \frac{1}{n} \sum_{i=1}^{n} {\left|\frac{{y_i}-\hat{y_i}}{y_i}\right|} \tag{5.2} \end{equation}\]


This is the square root of the mean of the squared errors. This metric captures how far the predicted values are from the actual values in the same units as the target variable. One issue with the root mean squared error (RMSE) is that it is affected more by occasional large errors than MAE or MAPE because of the squaring of errors. Whether this is a problem depends to some extent on the cost or seriousness of large errors in the situation being modeled. If the costs are roughly linear with respect to the error, then this statistic might not be the best one to use.

The formula for RMSE is:

\[\begin{equation} RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} ({y_i}-\hat{y_i}){^2}} \tag{5.3} \end{equation}\]
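The four measures above can be computed directly from equations (5.1)-(5.3) and the \(R^2\) formula. Here is a minimal Python sketch (the function name and sample values are illustrative; MAPE is returned as a fraction, consistent with equation (5.2)):

```python
import math

def regression_metrics(y, y_hat):
    """Compute R^2, MAE, MAPE, and RMSE for actuals y and predictions y_hat."""
    n = len(y)
    y_bar = sum(y) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(y, y_hat))  # residual sum of squares
    ss_tot = sum((a - y_bar) ** 2 for a in y)             # total sum of squares
    return {
        "R2": 1 - ss_res / ss_tot,
        "MAE": sum(abs(a - p) for a, p in zip(y, y_hat)) / n,
        "MAPE": sum(abs((a - p) / a) for a, p in zip(y, y_hat)) / n,  # no zero actuals
        "RMSE": math.sqrt(ss_res / n),
    }

m = regression_metrics([100, 200, 300, 400], [110, 190, 310, 390])
print({k: round(v, 4) for k, v in m.items()})
```

With every prediction off by 10 units, MAE and RMSE both equal 10; they diverge only when the error sizes vary, which is how RMSE penalizes occasional large errors.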

5.3.2 Evaluating performance with classification models

This discussion is focused on assessing performance where the target variable takes on two nominal levels (i.e., binary). The binary case is not the only situation that might be encountered, however. In some cases where the target has three or more categorical levels, the assessment can be reduced to binary for evaluation purposes. This is done by selecting one of multiple levels as the positive or 1 level and assigning the rest of the levels to the negative or 0 level.

The basic structure for evaluating binary models is the contingency table or classification matrix (sometimes called the confusion matrix). It is a two by two table with the actual classifications on the rows and the predicted classifications on the columns.2

2 X 2 Classification Matrix

Figure 5.2 shows a general two by two contingency table with the actual in the rows and the predicted in the columns. The predictions can be either positive or negative; these can be thought of as buy or not buy, try or not try, churn or not churn in the case of telecommunications, error or not an error, or even fraudulent or not fraudulent.


Figure 5.2: A labeled 2 X 2 table.

Each of the cells can be labeled as true positive or true negative (correct) or false positive or false negative (incorrect). To begin the evaluation, the “naive rule” can be used as a basic benchmark. This rule classifies all records as belonging to the most prevalent actual level. The hope is that the model will perform better than this naive model.

There are several measures that are frequently used to assess model performance using a 2 X 2 table. These are:

  1. Accuracy = (True positive + True negative) / Total cases; this is also known as the “fraction correct.”
  2. Sensitivity = (True positive) / (True positive + False negative); this is also known as “recall,” “hit rate,” or the “true positive rate.”
  3. Specificity = (True negative) / ( True negative + False positive); this is also known as “true negative rate.”
  4. Precision= (True positive) / (True positive + False positive); this is also known as “positive predictive value.”
  5. Negative predictive value = (True negative) / (True negative + False negative).
  6. F-score = 2 / [(1 / sensitivity) + (1 / precision)]; this is the harmonic mean of sensitivity and precision.
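All six measures follow directly from the four cell counts of the 2 X 2 table. A minimal Python sketch (function name and counts are illustrative):

```python
def classification_metrics(tp, fn, fp, tn):
    """Metrics from a 2 x 2 classification matrix (counts of each cell)."""
    sensitivity = tp / (tp + fn)   # recall, hit rate, true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)     # positive predictive value
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "npv": tn / (tn + fn),     # negative predictive value
        # F-score: harmonic mean of sensitivity and precision
        "f_score": 2 / (1 / sensitivity + 1 / precision),
    }

print(classification_metrics(tp=40, fn=10, fp=20, tn=30))
```

Note that each ratio conditions on a different margin of the table: sensitivity and specificity condition on the actual class (rows), while precision and negative predictive value condition on the predicted class (columns).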

It is important to select a metric appropriate to the situation being modeled. Accuracy is a good measure when the numbers of false positives and false negatives are approximately equal and the cost or seriousness of false positives and false negatives is similar.

Consider the hypothetical data results shown in Figure 5.3.


Figure 5.3: Evaluation metrics for two situations.

The results on the left are for two different models for testing for Covid-19. The results on the right are for two models for SPAM detection. All four of the 2 X 2 tables have the same accuracy, yet the models are not equally favorable. Accuracy is not a good measure for these two situations since the consequences of false positives and false negatives are not the same.

For the Covid tests, model 1 is preferred to model 2 because only 10 actual Covid cases were mistakenly classified as Not Covid. Sensitivity (0.998) is the best measure of performance since it is arguably more important to avoid classifying cases with Covid as not having Covid. If a person without Covid is classified as having Covid, this of course causes stress for the individual, but it can be remedied by conducting further tests.

On the other hand, for the SPAM tests the best metric is precision, and model 2 has the higher precision (0.998). Letting some SPAM through as Not SPAM is less serious than filtering messages that are Not SPAM as SPAM; in the latter case, critical messages might never be seen by the email user.

In general, if false positives are not as serious as false negatives, sensitivity is a good metric. If false positives are more serious than false negatives, precision is a good metric.

Evaluating models with class imbalance

There are situations where the outcome of interest in a binary prediction model occurs infrequently. This causes an imbalance in the outcomes. For example, say we are developing a model to predict fraud and we have 50,000 cases where only 500 cases are fraudulent and 49,500 are not.

There are many domains where imbalance occurs, including:

  • Churn prediction at telecoms - most users do not churn.
  • Predicting insurance fraud - most claims are not fraudulent.
  • Diagnoses of rare medical conditions - most patients do not have the condition.
  • Online advertising click-through rates - most viewers will not click.
  • Pharmaceutical research - most candidate molecules are not useful for further development.

Many classification models will not handle this situation very well. One likely outcome in such situations is that the model will simply predict all cases to be non-fraudulent. After all, for the example cited, if all 50,000 cases are classified as non-fraudulent, then accuracy equals 49,500/50,000, or 99%!

There are several ways to handle situations of severe class imbalance. One way is to use stratified sampling on the training set to force a balance in the two outcomes. In the fraud data case, assume a training set of 40,000 cases and a validation set of 10,000 cases are formed, stratified by the presence of fraud. In the training sample, it is expected that there will be 400 cases of fraud and 39,600 cases of non-fraud. A random sample of 400 is taken from the 39,600 non-fraud cases. This creates a balanced data set with 400 cases each of fraud and non-fraud.
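The undersampling step just described can be sketched as follows (a minimal Python illustration with hypothetical names); the majority class is sampled down to the size of the minority class:

```python
import random

def undersample(cases, label_of, seed=7):
    """Balance a binary training set by randomly sampling the majority
    class down to the size of the minority class."""
    pos = [c for c in cases if label_of(c) == 1]
    neg = [c for c in cases if label_of(c) == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    sampled = random.Random(seed).sample(majority, len(minority))
    return minority + sampled

# 400 fraud cases (label 1) and 39,600 non-fraud cases (label 0) -> 400 of each
data = [(i, 1) for i in range(400)] + [(i, 0) for i in range(39_600)]
balanced = undersample(data, label_of=lambda c: c[1])
print(len(balanced))  # 800
```

The balanced set would then be shuffled before model training; the validation set is left untouched.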

A model is then created using the balanced data set. The evaluation metrics from this model are not very useful, however. Instead, the model should then be applied to the 10,000 cases in the validation set and then evaluation metrics computed. A disadvantage of this approach is that most of the cases in the training set are not used to create the model.

Another approach is to oversample the rare cases in the data set. In the training set for the fraud situation, the 400 fraud cases are replicated so that each appears 99 times, giving 39,600 cases each of fraud and non-fraud in the set. The validation set is left alone. A more sophisticated approach is to oversample using similar cases: using the characteristics of each case in the minority class, a set of similar cases is generated. This is implemented in the R package smotefamily (SMOTE 2019) and the SMOTE node in KNIME.
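Simple replication-based oversampling can be sketched as follows (a minimal Python illustration with hypothetical names; this is not the SMOTE algorithm, which generates synthetic similar cases rather than exact copies):

```python
def oversample(cases, label_of, times):
    """Balance a training set by adding `times` extra copies of each
    minority-class case; majority cases pass through unchanged."""
    out = []
    for c in cases:
        out.append(c)
        if label_of(c) == 1:          # minority (fraud) case
            out.extend([c] * times)   # replicate it
    return out

data = [("fraud", 1)] * 400 + [("ok", 0)] * 39_600
# 98 extra copies means each fraud case appears 99 times: 400 x 99 = 39,600
balanced = oversample(data, label_of=lambda c: c[1], times=98)
print(sum(1 for c in balanced if c[1] == 1))  # 39600
```

Unlike undersampling, this approach uses every majority-class observation, at the cost of a much larger training set containing duplicated minority cases.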

Other approaches to dealing with imbalance include the following.

  • Lower the cutoff to increase the number of predictions of the minority class.
  • Adjust the prior probabilities (e.g., in naïve Bayes or discriminant analysis).
  • Weight the minority class more heavily, where case weighting is available in the algorithm; this is essentially like oversampling.
  • Use costs to differentially weight specific types of errors.

Examples of using some of these approaches will be discussed in the chapters on classification prediction models.

ROC curves

Another useful method for assessing the accuracy of predictive binary classification models is the ROC (receiver operating characteristic) curve. ROC curves show the tradeoff between sensitivity and (1-specificity) or the tradeoff between the true positive rate and the false positive rate for a predictive model. These curves are especially useful in comparing the performance of two or more models.

The ROC curve in Figure 5.4 shows the line of random performance in red. The blue line represents a model that achieves much better than random performance. The curve is generated by varying the cutoff threshold for assigning cases to one of the binary outcomes, computing sensitivity and 1 - specificity at each cutoff, and plotting the values. The cutoffs from 0.0 to 1.0 which generate the curve are shown in the figure. The closer the ROC curve is to the upper left corner (0.0, 1.0), the more accurate the model.

A measure of the quality of the model is the area under the curve, or AUC, which in this case is approximately 0.72. The maximum possible area, for a model that predicts perfectly, is 1.0; the area under the red line, for a model that only predicts randomly, is 0.50.


Figure 5.4: Example of an ROC curve.
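The cutoff-sweeping procedure that generates an ROC curve, and the trapezoid-rule computation of AUC, can be sketched as follows (a minimal Python illustration with made-up labels and scores):

```python
def roc_curve(y, scores):
    """Compute (false positive rate, true positive rate) points by sweeping
    the cutoff over the predicted scores, plus the AUC via the trapezoid rule."""
    thresholds = sorted(set(scores), reverse=True)
    pos = sum(y)
    neg = len(y) - pos
    points = [(0.0, 0.0)]
    for t in thresholds:
        tp = sum(1 for yi, s in zip(y, scores) if s >= t and yi == 1)
        fp = sum(1 for yi, s in zip(y, scores) if s >= t and yi == 0)
        points.append((fp / neg, tp / pos))  # (1 - specificity, sensitivity)
    points.append((1.0, 1.0))
    auc = sum((x2 - x1) * (y1 + y2) / 2
              for (x1, y1), (x2, y2) in zip(points, points[1:]))
    return points, auc

y = [1, 1, 0, 1, 0, 0]                     # actual classes
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]    # predicted probabilities
points, auc = roc_curve(y, scores)
print(round(auc, 3))  # 0.889
```

Each distinct score acts as a candidate cutoff: cases at or above it are classified positive, producing one (1 - specificity, sensitivity) point on the curve.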

The concept of a tradeoff between true positives and false positives in a 2 X 2 table is illustrated in Figure 5.5.


Figure 5.5: Tradeoff between true positives and false positives.

If a threshold or cutoff is selected that achieves .85 true positives (in the chart on the left), then the model represented by the blue line will result in about .45 false positives. To lower the percentage of false positives to about .20 (in the chart in the center), the tradeoff shows that true positives are identified only about .65 of the time. The only way to achieve higher true positives and fewer false positives is to create a model that results in an ROC curve above and to the left, which is shown in the panel on the right. This model achieves a true positive rate of .85 with only a .10 false positive rate.

5.4 Summary


SMOTE. 2019. “A Collection of Oversampling Techniques for Class Imbalance Problem Based on SMOTE.”

  1. Some books and software reverse the names for the test and validation sets, labeling the second set the validation set and the third set the test set.↩︎

  2. In some software the rows and columns are exchanged so that the actual classifications are in the columns.↩︎