## 3.2 Measuring Performance

While often overlooked, the metric used to assess how effectively a model predicts the outcome is very important and can influence the conclusions. The metric we select to evaluate model performance depends on the nature of the outcome, and the subsections below describe the main statistics that are used.

### 3.2.1 Regression Metrics

When the outcome is a number, the most common metric is the root mean squared error (RMSE). To calculate this value, a model is built and then it is used to predict the outcome. The *residuals* are the difference between the observed outcome and predicted outcome values. To get the RMSE for a model, the average of the squared residuals is computed, then the square root of this value is taken. Taking the square root puts the metric back into the original measurement units. We can think of RMSE as the average distance of a sample from its observed value to its predicted value. Simply put, the lower the RMSE, the better a model can predict samples’ outcomes.
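The calculation above can be sketched in a few lines of code. This is a minimal illustration in Python; the function name and the data are ours, not from the text:

```python
import math

def rmse(observed, predicted):
    """Root mean squared error: square the residuals, average, take the root."""
    residuals = [o - p for o, p in zip(observed, predicted)]
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))

# Hypothetical observed and predicted outcome values
observed = [10.0, 12.0, 9.0, 15.0]
predicted = [11.0, 11.0, 10.0, 13.0]
print(round(rmse(observed, predicted), 3))  # 1.323, in the outcome's units
```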

Another popular metric is the coefficient of determination, usually known as \(R^2\). There are several formulas for computing this value (Kvalseth 1985), but the most conceptually simple one finds the standard correlation between the observed and predicted values (a.k.a. \(R\)) and squares it. The benefit of this statistic is that, for linear models, it has a straightforward interpretation: \(R^2\) is the proportion of the total variability in the outcome that can be explained by the model. A value near 1.0 indicates an almost perfect fit while values near zero result from a model where the predictions have no linear association with the outcome. One other advantage of this number is that it makes comparisons between different outcomes easy since it is unitless.
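The correlation-squared form of \(R^2\) can be sketched directly (a pure-Python illustration with invented data; the example also previews why this statistic can mislead, as discussed next):

```python
import math

def r_squared(observed, predicted):
    """R^2 as the square of the Pearson correlation of observed vs. predicted."""
    n = len(observed)
    mo, mp = sum(observed) / n, sum(predicted) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    sso = sum((o - mo) ** 2 for o in observed)
    ssp = sum((p - mp) ** 2 for p in predicted)
    r = cov / math.sqrt(sso * ssp)
    return r * r

# Predictions that are a shifted, rescaled copy of the outcome still yield
# R^2 = 1, even though they do not agree with the observed values at all.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [7.0, 9.0, 11.0, 13.0]
print(r_squared(observed, predicted))  # 1.0 despite poor agreement
```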

Unfortunately, \(R^2\) can be a deceiving metric. The main problem is that it is a measure of correlation and not accuracy. When assessing the predictive ability of a model, we need to know how well the observed and predicted values *agree*. It is possible, and not unusual, that a model could produce predicted values that have a strong linear relationship with the observed values but the predicted values do *not* conform to the 45-degree line of agreement. One example of this phenomenon occurs when a model under-predicts at one extreme of the outcome and overpredicts at the other extreme of the outcome. Tree-based ensemble methods (e.g., random forest, boosted trees, etc.) are notorious for these kinds of predictions. A second problem with using \(R^2\) as a performance metric is that it can show very optimistic results when the outcome has large variance. Finally, \(R^2\) can be misleading if there are a handful of outcome values that are far away from the overall scatter of the observed and predicted values. In this case the handful of points can artificially increase \(R^2\).

To illustrate the problems with \(R^2\), let’s look at the results of one particular model of the Chicago train ridership data. For this model \(R^2\) was estimated to be 0.9; at face value we may conclude that this is an extremely good model. However, the high value is mostly due to the inherent nature of the ridership numbers which are high during the workweek and correspondingly low on the weekends. The bimodal nature of the outcome inflates the outcome variance and, in turn, the \(R^2\). We can see the impacts of the bi-modal outcome in Figure 3.1 (a). Part (b) of the figure displays a histogram of the residuals, some of which are greater than 10K rides. The RMSE for this model is 3,853 rides, which is somewhat large relative to the observed ridership values.

A second illustration of the problem of using \(R^2\) can be seen by examining the blue and black lines in Figure 3.1(a). The blue line is the linear regression fit between the observed and predicted values, while the black line represents the line of agreement. Here we can see that the model under-predicts the smaller observed values (left) and over-predicts the larger observed values (right). In this case, the offset is not huge but it does illustrate how the RMSE and \(R^2\) metrics can produce discordant results. For these reasons, we advise using RMSE instead of \(R^2\).

To address the problem that the correlation coefficient is overly optimistic when the data illustrates correlation but not agreement, Lawrence and Lin (1989) developed the concordance correlation coefficient (CCC). This metric provides a measure of correlation relative to the line of agreement and is defined as the product of the usual correlation coefficient and a measure of bias from the line of agreement. The bias coefficient ranges from 0 to 1, where a value of 1 indicates that the data falls on the line of agreement. The further the data deviates from the line of agreement, the smaller the bias coefficient. Therefore, the CCC can be thought of as a penalized version of the correlation coefficient. The penalty will apply if the data exhibits poor correlation between the observed and predicted values or if the relationship between the observed and predicted values is far from the line of agreement.
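A common closed form for the CCC uses means, variances, and the covariance of the observed and predicted values; it is algebraically equivalent to the correlation-times-bias-coefficient definition above. A sketch with invented data:

```python
def ccc(observed, predicted):
    """Concordance correlation coefficient: correlation penalized for
    departures from the 45-degree line of agreement."""
    n = len(observed)
    mo, mp = sum(observed) / n, sum(predicted) / n
    var_o = sum((o - mo) ** 2 for o in observed) / n
    var_p = sum((p - mp) ** 2 for p in predicted) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted)) / n
    return 2 * cov / (var_o + var_p + (mo - mp) ** 2)

# Perfectly correlated but badly biased predictions: R^2 is 1, CCC is near 0.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [7.0, 9.0, 11.0, 13.0]
print(ccc(observed, predicted))   # 0.08
print(ccc(observed, observed))    # 1.0: exactly on the line of agreement
```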

Both RMSE and \(R^2\) are very sensitive to extreme values because each is based on the squared value of the individual samples’ residuals. Therefore a sample with a large residual will have an inordinately large effect on the resulting summary measure. In general, this type of situation makes the model performance metric appear worse than it would be without the sample. Depending on the problem at hand, this vulnerability is not necessarily a vice but could be a virtue. For example, if the goal of the modeling problem is to rank-order new data points (e.g., the highest spending customers), then the size of the residual is not an issue so long as the most extreme values are predicted to be the most extreme. However, it is more often the case that we are interested in predicting the actual response value rather than just the rank. In this case, we need metrics that are not skewed by one or just a handful of extreme values.

The field of robustness was developed to study the effects of extreme values (i.e., outliers) on commonly used statistical metrics and to derive alternative metrics that achieve the same purpose but are *insensitive* to the impact of outliers (Hampel et al. 1972). As a broad description, robust techniques seek to find numerical summaries for the majority of the data. To lessen the impact of extreme values, robust approaches down-weight the extreme samples or transform the original values in a way that brings the extreme samples closer to the majority of the data. Rank-ordering the samples is one type of transformation that reduces the impact of extreme values. In the hypothetical case of predicting customers’ spending, *rank correlation* might be a better choice of metric for the model since it measures how well the predictions rank-order with their true values. This statistic computes the ranks of the data (e.g., 1, 2, etc.) and then computes the standard correlation statistic from these values. Other robust measures for regression are the median absolute deviation (MAD) (Rousseeuw and Croux 1993) and the absolute error.
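Both ideas are easy to sketch. The following illustration (pure Python, invented data, no tie handling in the ranks) computes the rank correlation as the Pearson correlation of the ranks, and the MAD of a set of residuals:

```python
import statistics

def ranks(x):
    """Rank the values 1..n (ties not handled; for illustration only)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0] * len(x)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(observed, predicted):
    """Rank correlation: the Pearson correlation computed on the ranks."""
    ro, rp = ranks(observed), ranks(predicted)
    m = (len(ro) + 1) / 2  # mean rank
    cov = sum((a - m) * (b - m) for a, b in zip(ro, rp))
    ss = sum((a - m) ** 2 for a in ro)  # same for any permutation of 1..n
    return cov / ss

# A monotone but nonlinear model still rank-orders the samples perfectly
print(spearman([1, 2, 3, 4], [2, 9, 30, 400]))  # 1.0

# MAD of residuals: one huge residual barely moves the summary
residuals = [0.5, -0.3, 0.2, -0.4, 50.0]
mad = statistics.median(abs(r - statistics.median(residuals)) for r in residuals)
print(mad)  # 0.5
```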

### 3.2.2 Classification Metrics

Table 3.1: The confusion matrix for the OkCupid data (columns are the observed classes, rows are the predicted classes).

|  | stem | other |
| --- | --- | --- |
| stem | 5134 | 6385 |
| other | 2033 | 25257 |

When the outcome is a discrete set of values (i.e., qualitative data), there are two different types of performance metrics that can be utilized. The first type described below is based on qualitative class prediction (e.g., `stem` or `other`) while the second type uses the predicted class probabilities to measure model effectiveness (e.g., Pr[`stem`] = 0.254).

Given a set of predicted classes, the first step in understanding how well the model is working is to create a *confusion matrix* which is a simple cross-tabulation of the observed and predicted classes. For the OkCupid data, a simple logistic regression model was built using the predictor set mentioned above and Table 3.1 shows the resulting confusion matrix^{17}.

The samples that were correctly predicted sit on the diagonal of the table. The STEM profiles mistakenly predicted as non-STEM are shown in the bottom left of the table (*n* = 2033) while the non-STEM profiles that were erroneously predicted are in the upper right cell (*n* = 6385). The most widely utilized metric is classification accuracy, which is simply the proportion of samples that were correctly predicted. In this example, the accuracy is 0.78 = (5134 + 25257)/(5134 + 6385 + 2033 + 25257). There is an implicit tendency to assess model performance by comparing the observed accuracy value to 1/*C*, where *C* is the number of classes. In this case, 0.78 is much greater than 0.5. However, this comparison should be made only when there are nearly the same number of samples in each class. When there is an imbalance between the classes, as there is in these data, accuracy can be a quite deceiving measure of model performance since a value of 0.82 can be achieved by predicting all profiles as non-STEM.

As an alternative to accuracy, another statistic called Cohen’s Kappa (Agresti 2012) can be used to account for class imbalances. This metric normalizes the error rate to what would be expected by chance. Kappa takes on values between -1 and 1 where a value of 1 indicates complete concordance between the observed and predicted values (and thus perfect accuracy). A value of -1 is complete discordance and is rarely seen^{18}. Values near zero indicate that there is no relationship between the model predictions and the true results. The Kappa statistic can also be generalized to problems that have more than two groups.
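Both statistics can be computed directly from the cells of a 2x2 confusion matrix. The sketch below (the function name is ours) uses the cell counts from Table 3.1; Kappa compares the observed accuracy to the agreement expected by chance given the table's margins:

```python
def accuracy_and_kappa(tp, fp, fn, tn):
    """Accuracy and Cohen's Kappa from the cells of a 2x2 confusion matrix."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    # Agreement expected by chance, from the row (predicted) and
    # column (observed) margins of the table
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    return accuracy, (accuracy - expected) / (1 - expected)

# Cell counts from Table 3.1
acc, kappa = accuracy_and_kappa(tp=5134, fp=6385, fn=2033, tn=25257)
print(round(acc, 2))    # 0.78
print(round(kappa, 2))  # 0.42
```

Note how the Kappa value (about 0.42) is far less flattering than the 0.78 accuracy once the class imbalance is accounted for.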

A visualization technique that can be used for confusion matrices is the mosaic plot (see Figure 3.3). In these plots, each cell of the table is represented as a rectangle whose area is proportional to the number of values in the cell. These plots can be rendered in a number of different ways and for tables of many sizes. See Friendly and Meyer (2015) for more examples.

There are also specialized sets of classification metrics when the outcome has two classes. To use them, one of the class values must be designated as the *event of interest*. This is somewhat subjective. In some cases, this value might be the worst-case scenario (i.e., death) but the designated event should be the value that one is most interested in predicting.

The first paradigm of classification metrics focuses on false positives and false negatives and is most useful when there is interest in comparing the two types of errors. The *sensitivity* metric is simply the proportion of the events that were predicted correctly and is the *true positive* rate in the data. For our example,

\[\begin{align*} sensitivity &= \frac{\text{# truly STEM predicted correctly}}{\text{# truly STEM}} \notag \\ &= 5,134/7,167 \notag \\ &= 0.716 \notag \end{align*}\]

The *false positive* rate is associated with the *specificity*, which is

\[\begin{align*} specificity &= \frac{\text{# truly non-STEM predicted correctly}}{\text{# truly non-STEM}} \notag \\ &= 25,257/31,642 \notag \\ &= 0.798 \notag \end{align*}\]

The false positive rate is 1 - specificity (0.202 in this example).

The other paradigm for the two-class system is rooted in the field of *information retrieval* where the goal is to find the events. In this case, the metrics commonly used are *precision* and *recall*. Recall is equivalent to sensitivity and focuses on the number of true events found by the model. Precision is the proportion of events that are predicted correctly out of the total number of predicted events, or

\[\begin{align*} precision &= \frac{\text{# truly STEM predicted correctly}}{\text{# predicted STEM}} \notag \\ &= 5,134/11,519 \notag \\ &= 0.446 \notag \end{align*}\]
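These three conditional statistics fall out of the same four cell counts. A small sketch reproducing the values above from Table 3.1 (variable names are ours):

```python
# Cell counts from Table 3.1: rows are predicted, columns are observed classes
tp = 5134   # truly STEM, predicted STEM
fp = 6385   # truly non-STEM, predicted STEM
fn = 2033   # truly STEM, predicted non-STEM
tn = 25257  # truly non-STEM, predicted non-STEM

sensitivity = tp / (tp + fn)  # a.k.a. recall, the true positive rate
specificity = tn / (tn + fp)
precision = tp / (tp + fp)
false_positive_rate = 1 - specificity

print(round(sensitivity, 3))  # 0.716
print(round(specificity, 3))  # 0.798
print(round(precision, 3))    # 0.446
```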

One facet of sensitivity, specificity, and precision that is worth understanding is that they are *conditional* statistics. For example, sensitivity reflects the probability that an event is correctly predicted *given that a sample is truly an event*. The latter part of this sentence shows the conditional nature of the metric. Of course, the true class is usually unknown and, if it were known, a model would not be needed. In any case, if *Y* denotes the true class and *P* denotes the prediction, we could write sensitivity as Pr[*P* = STEM|*Y* = STEM].

The question that one really wants answered is “if my value was predicted to be an event, what are the chances that it truly is an event?” or Pr[*Y* = STEM|*P* = STEM]. Thankfully, the field of Bayesian analysis (McElreath 2015) has an answer to this question. In this context, Bayes’ Rule states that

\[Pr[Y|P] = \frac{Pr[Y] \times Pr[P|Y]}{Pr[P]} = \frac{Prior \times Likelihood}{Evidence}\]

Sensitivity (or specificity, depending on one’s point of view) is the “likelihood” part of this equation. The *prior probability*, or *prevalence*, is the overall rate that we see events in the wild (which may be different from what was observed in our training set). Usually, one would specify the overall event rate before data are collected and use it in the computations to determine the unconditional statistics. For sensitivity, its unconditional analog is called the *positive predictive value* (PPV):

\[PPV = \frac{sensitivity \times prevalence}{(sensitivity\times prevalence) + ((1-specificity)\times (1-prevalence))}\]

The *negative predictive value* (NPV) is the analog to specificity and can be computed as

\[NPV = \frac{specificity \times (1-prevalence)}{((1-sensitivity)\times prevalence) + (specificity\times (1-prevalence))}\]

See Altman and Bland (1994b) for a clear and concise discussion of these measures. Simplified versions of these formulas, which assume the prevalence to be 0.50, are often shown for these statistics. These formulas, while correct when the prevalence is 0.50, can produce very misleading results if the prevalence differs from this number.

For the OkCupid data, the difference between the sensitivity and PPV can be described as follows:

- *sensitivity*: if the profile is truly STEM, what is the probability that it is correctly predicted?
- *PPV*: if the profile was predicted as STEM, what is the probability that it is STEM?

The positive and negative predictive values are not often used to measure performance. This is partly due to the nature of the prevalence. If the outcome is not well understood, it is very difficult to provide a value (even when asking experts). When there is a sufficient amount of data, the prevalence is typically estimated by the proportion of the outcome data that correspond to the event of interest. Also, in other situations, the prevalence may depend on certain factors. For example, the proportion of STEM profiles in the San Francisco area can be estimated from the training set to be 0.18. Using this value as the prevalence, our estimates are PPV = 0.45 and NPV = 0.93. The PPV is significantly smaller than the sensitivity due to the model missing almost 28% of the true STEM profiles and the fact that the overall likelihood of being in the STEM fields is already fairly low.

The prevalence of people in STEM professions in San Francisco is likely to be larger than in other parts of the country. If we thought that the overall STEM prevalence in the United States was about 5%, then our estimates would change to PPV = 0.16 and NPV = 0.98. These computations only differ by the prevalence estimates and demonstrate how the smaller prevalence affects the unconditional probabilities of the results.
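The two formulas above can be applied directly to the sensitivity and specificity from Table 3.1. The sketch below uses the rounded prevalence values 0.18 and 0.05; the first PPV comes out near 0.44 rather than the 0.45 quoted earlier, a small gap attributable to rounding in the inputs:

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value via Bayes' Rule."""
    top = sensitivity * prevalence
    return top / (top + (1 - specificity) * (1 - prevalence))

def npv(sensitivity, specificity, prevalence):
    """Negative predictive value via Bayes' Rule."""
    top = specificity * (1 - prevalence)
    return top / ((1 - sensitivity) * prevalence + top)

sens, spec = 5134 / 7167, 25257 / 31642

# San Francisco prevalence, estimated from the training set
print(round(ppv(sens, spec, 0.18), 2), round(npv(sens, spec, 0.18), 2))  # 0.44 0.93
# A hypothetical national prevalence of 5%
print(round(ppv(sens, spec, 0.05), 2), round(npv(sens, spec, 0.05), 2))  # 0.16 0.98
```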

The metrics discussed so far depend on having a *hard* prediction (e.g., STEM or other). Most classification models can produce class probabilities as *soft* predictions that can be converted to a definitive class by choosing the class with the largest probability. There are a number of metrics that can be created using the probabilities.

For a two-class problem, an example metric is the binomial log-likelihood statistic. To illustrate this statistic, let \(i\) represent the index of the samples where \(i=1, 2, \ldots, n\), and let \(j\) index the outcome classes where \(j=1, 2\). Next, we will use \(y_{ij}\) to represent the indicator of the true class of the \(i^{th}\) sample. That is, \(y_{ij} = 1\) if the \(i^{th}\) sample is in the \(j^{th}\) class and 0 otherwise. Finally, let \(p_{ij}\) represent the predicted probability of the \(i^{th}\) sample in the \(j^{th}\) class. Then the log-likelihood is calculated as

\[ \log \ell = \sum_{i=1}^n \sum_{j=1}^C y_{ij} \log(p_{ij}), \]

where \(C\) = 2 for the two-class problem. In general, we want to maximize the log-likelihood. This value will be maximized if all samples are predicted with high probability to be in the correct class.
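The double sum collapses to one log term per sample, since only the true class has \(y_{ij} = 1\). A sketch that reproduces the log-likelihood column of Table 3.2 for a single sample whose true class is the first class:

```python
import math

def log_likelihood(y, p):
    """Sum of y_ij * log(p_ij) over samples i and classes j.
    y[i][j] is 1 for the true class of sample i and 0 otherwise."""
    return sum(math.log(p_i[j])
               for y_i, p_i in zip(y, p)
               for j, y_ij in enumerate(y_i) if y_ij == 1)

# Three single-sample "models"; the true class is class 1
y = [[1, 0]]
for label, probs in [("equivocal", [[0.5, 0.5]]),
                     ("good", [[0.8, 0.2]]),
                     ("bad", [[0.2, 0.8]])]:
    print(label, round(log_likelihood(y, probs), 3))
# equivocal -0.693, good -0.223, bad -1.609
```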

Two other metrics that are commonly computed on class probabilities are the Gini criterion (Breiman et al. 1984)

\[ G = \sum_{i=1}^n \sum_{j \ne j'} p_{ij} p_{ij'} \]

and entropy (\(H\)) (MacKay 2003):

\[ H = -\sum_{i=1}^n \sum_{j=1}^C p_{ij} \log_2p_{ij} \]

Unlike the log-likelihood statistic, both of these metrics are measures of *variance* or *impurity* in the class probabilities^{19} and should be *minimized*.
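For a single sample, both statistics are simple functions of its probability vector. In the sketch below, each unordered pair of classes is counted once in the Gini sum, which is the convention that reproduces the values shown in Table 3.2:

```python
import math

def gini(probs):
    """Gini statistic for one sample: sum of p_j * p_j' over class pairs,
    counting each unordered pair once."""
    return sum(probs[j] * probs[k]
               for j in range(len(probs))
               for k in range(j + 1, len(probs)))

def entropy(probs):
    """Shannon entropy (in bits) of one sample's class probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(round(gini([0.5, 0.5]), 2), round(entropy([0.5, 0.5]), 3))  # 0.25 1.0
print(round(gini([0.8, 0.2]), 2), round(entropy([0.8, 0.2]), 3))  # 0.16 0.722
```

Note that neither function looks at the true class: the "good" and "bad" probability vectors are mirror images and score identically, which is exactly the unsupervised behavior discussed below.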

Table 3.2: Class probability statistics for a two-class example where the true outcome is the first class.

|  | Class 1 | Class 2 | Log-Likelihood | Gini | Entropy |
| --- | --- | --- | --- | --- | --- |
| Equivocal Model | 0.5 | 0.5 | -0.693 | 0.25 | 1.000 |
| Good Model | 0.8 | 0.2 | -0.223 | 0.16 | 0.722 |
| Bad Model | 0.2 | 0.8 | -1.609 | 0.16 | 0.722 |

Of these three metrics, it is important to note that the likelihood statistic is the only one to use the true class information. Because of this, it penalizes poor models in a supervised manner. The Gini and entropy statistics would only penalize models that are equivocal (i.e., produce roughly equal class probabilities). For example, Table 3.2 shows a two-class example. If the true outcome was the first class, the model results shown in the second row would be best. The likelihood statistic only takes into account the column called “Class 1” since that is the only column where \(y_{ij} = 1\). In terms of the likelihood statistic, the equivocal model does better than the model that confidently predicts the wrong class. When considering Gini and entropy, the equivocal model does worst while the good and bad model are equivalent^{20}.

When there are two classes, one advantage that the log-likelihood has over metrics based on a hard prediction is that it sidesteps the issue of the appropriateness of the probability cutoff. For example, when discussing accuracy, sensitivity, specificity, and other measures, there is the implicit assumption that the probability cutoff used to go from a soft prediction to a hard prediction is valid. This is often not the case, especially when the data have a severe class imbalance^{21}. Consider the OkCupid data and the logistic regression model that was previously discussed. The class probability estimates that were used to make the definitive predictions contained in Table 3.1 are shown in Figure 3.2, where the top panel contains the profiles that were truly STEM and the bottom panel has the class probability distribution for the other profiles^{22}. The common 50% cutoff was used to create the original table of observed by predicted classes. Table 3.1 can also be visualized using a mosaic plot, such as the one shown in Figure 3.3(b), where the sizes of the blocks are proportional to the amount of data in each cell.

What would happen to this table if we were more permissive about the level of evidence needed to call a profile STEM? Instead of using a 50% cutoff, we might *lower* the threshold for the event to 20%. In this instance, more profiles would be called STEM. This might raise sensitivity since the true STEM profiles are more likely to be correctly predicted, but the cost is an increase in the number of false positives. The mosaic plot for this confusion matrix is shown in Figure 3.3(a), where the blue block in the upper left becomes larger. But there is also an increase in the red block in the upper right. In doing so, the sensitivity increases from 0.72 to 0.96, but the specificity drops from 0.8 to 0.35. Increasing the level of evidence needed to predict a STEM profile to 80% has the opposite effect, as shown in Figure 3.3(c). Here, specificity improves but sensitivity is undermined.

The question then becomes “what probability cutoff should be used?” This depends on a number of things, including which error (false positive or false negative) hurts the most. However, if both types of errors are equally bad, there may be cutoffs that do better than the default.

The receiver operating characteristic (ROC) (Altman and Bland 1994a) curve can be used to alleviate this issue. It considers *all possible cutoffs* and tracks the changes in sensitivity and specificity. The curve is constructed by plotting the false positive rate (1 - specificity) versus the true positive rate. The ROC curve for the OkCupid data is shown in Figure 3.4(a). The best model is one that hugs the *y*-axis and directly proceeds to the upper left corner (where neither type of error is made) while a completely ineffective model’s curve would track along the diagonal line shown in grey. This curve allows the user to do two important tasks. First, an appropriate cutoff can be determined based on one’s expectations regarding the importance of either sensitivity or specificity. This cutoff can then be used to make the qualitative predictions. Secondly, and perhaps more importantly, it allows a model to be assessed without having to identify the best cutoff. Commonly, the area under the ROC curve (AUC) is used to evaluate models. If the best model immediately proceeds to the upper left corner, the area under this curve would be one while the poor model would produce an AUC in the neighborhood of 0.50. Caution should be used though since two curves for two different models may cross; this indicates that there are areas where one model does better than the other. Used as a summary measure, the AUC annihilates any subtleties that can be seen in the curves. For the curve in Figure 3.4(a), the AUC was 0.839, indicating a moderately good fit.
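One convenient way to compute the AUC without enumerating cutoffs uses its equivalence to the Mann-Whitney statistic: the AUC equals the probability that a randomly chosen event receives a higher predicted probability than a randomly chosen non-event, with ties counted as half. A sketch with invented toy data (this is a quadratic-time illustration, not a production implementation):

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney statistic: the chance
    that a random event outscores a random non-event (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Toy class probabilities: 1 = event (e.g., STEM), 0 = non-event
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
print(roc_auc(labels, scores))  # 8/9, about 0.889
```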

From the information retrieval point of view, the precision-recall curve is more appropriate (Manning, Raghavan, and Schütze 2008). This is similar to the ROC curve in that the two statistics are calculated over every possible cutoff in the data. For the OkCupid data, the curve is shown in Figure 3.4(b). A poor model would result in a precision-recall curve that is in the vicinity of the horizontal grey line that is at the value of the observed prevalence (0.18 here). The area under the curve is used to summarize model performance. The best possible value is 1.0, while the worst is the prevalence. The area under this curve is 0.603.

During the initial phase of model building, a good strategy for data sets with two classes is to focus on the AUC statistics from these curves instead of metrics based on hard class predictions. Once a reasonable model is found, the ROC or precision-recall curves can be carefully examined to find a reasonable cutoff for the data and then qualitative prediction metrics can be used.

### 3.2.3 Context-Specific Metrics

While the metrics discussed previously can be used to develop effective models, they may not answer the underlying question of interest. As an example, consider a scenario where we have collected data on customer characteristics and whether or not the customers clicked on an ad. Our goal may be to relate customer characteristics to the probability of a customer clicking on an ad. Several of the metrics described above would enable us to assess model performance if this was the goal. Alternatively, we may be more interested in answering “how much money will my company make if this model is used to predict who will click on an ad?” In another context, we may be interested in building a model to answer the question “what is my expected profit when the model is used to determine if this customer will repay a loan?”. These questions are very context specific and do not directly fit into the previously described metrics.

Take the loan example. If a loan is requested for \(\$\)*M*, can we compute the expected profit (or loss)? Let’s assume that our model is created on an appropriate data set and can produce a class probability \(P_r\) that the loan will be paid on time. Given these quantities, the interest rate, fees, and other known factors, the gross return on the loan can be computed for each data point and the average can then be used to optimize the model.

Therefore, we should let the question of interest lead us to an appropriate metric for assessing a model’s ability to answer the question. It may be possible to use common, existing metrics to address the question. Or the problem may require development of custom metrics for the context. See Chapter 16 of Kuhn and Johnson (2013) for an additional discussion.

^{17} Note that these values were *not* obtained by simply re-predicting the data set. The values in this table are the set of “assessment” sets generated during the cross-validation procedure defined in Section 3.4.1.

^{18} Values close to -1 are rarely seen in predictive modeling since the models are seeking to find predicted values that are similar to the observed values. We have found that a predictive model that has difficulty finding a relationship between the predictors and the response has a Kappa value slightly below or near 0.

^{19} In fact, the Gini statistic is equivalent to the binomial variance when there are two classes.

^{20} While not helpful for comparing models, these two statistics are widely used in the process of creating decision trees. See Breiman et al. (1984) and Quinlan (1993) for examples. In that context, these metrics enable tree-based algorithms to create effective models.

^{22} As with the confusion matrix in Table 3.1, these data were created during 10-fold cross-validation.