9 Types of Error

We deal with two primary error measures in this course, mean squared error (\(MSE\)) and error rate. Although this seems simple on the surface, people often mix these up when actually applying their models, so I figured it was worth talking about them a bit here, along with issues that arise when calculating them.

9.1 Regression vs. Classification

When I talk about a regression problem, I’m talking about any problem where you have to understand/predict a continuous target (e.g. the weight of a person, the cost of an item, how much to sell a house for, etc.). Sometimes people get tripped up as we’ll later use non-regression-based models to predict a continuous target. We still consider the problem a ‘regression’ one.

When I talk about a classification problem, I’m talking about any problem where you have to understand/predict a categorical target (e.g. click or not, churn or not, is a transaction fraudulent or not, etc.). The key thing is that these are defined groups… If you are predicting whether a credit card transaction is fraudulent, there’s no middle ground in your groups. It’s either fraudulent or it’s not. We use logistic regression, kNN, decision trees, and a variety of other models to understand/predict these categorical targets. And yes, it’s confusing that you use logistic regression for a classification problem.

9.2 \(MSE\) vs. Error Rate

So the simple version is this. Use \(MSE\) to assess regression problems, and error rate to assess classification problems.

9.2.1 What is \(MSE\)

\(MSE\) is the mean squared error. In other words, it’s the average squared difference between your predicted target (\(\hat{y}\)) and your actual target value (\(y\)).

Here’s the formula. Notice that it’s actually just our \(RSS\) divided by the total sample size.

\[MSE = \frac{1}{n}\sum_{i = 1}^{n} (\hat{y}_i - y_i)^2\]

If you take the square root of your \(MSE\) you get the root mean squared error (\(RMSE\)), which tells you on average how far your predicted target is from the actual target, in the target’s original units. So say we were trying to predict a home price for Zillow and you had an \(RMSE\) of 6500 dollars. That would mean on average your prediction for any given home price is off by that amount. Sometimes it’s higher, other times lower, but that’s the average.
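To see the formula in action, here’s a quick sketch with made-up values for \(y\) and \(\hat{y}\) (the numbers are purely for illustration):

```r
# Hypothetical true values and predictions (made up for illustration)
y     <- c(10, 12, 15)
y_hat <- c(11, 12, 13)

# (1/n) * the sum of squared differences between predicted and true
mse  <- mean((y_hat - y)^2)
rmse <- sqrt(mse)   # back in the original units of y

mse    # (1 + 0 + 4) / 3 = 1.666667
rmse   # about 1.29
```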

9.2.2 What is error rate

Error rate is on average how often we predict the class of our target incorrectly.

You can see in the formula below it’s very similar to our \(MSE\), but instead it’s asking how often \(\hat{y}\) does not equal our true \(y\). So, how often do you predict someone will click an advertisement when they actually don’t, for example.
\[Error\ rate = \frac{1}{n} \sum_{i = 1}^{n} I(y_i \ne \hat{y}_i)\]
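Here’s the same idea as a quick sketch, using made-up click predictions (the vectors are purely for illustration):

```r
# Hypothetical true classes and predictions (made up for illustration)
y     <- c('click', 'click', 'no click', 'no click')
y_hat <- c('click', 'no click', 'no click', 'no click')

# (y != y_hat) is TRUE whenever the prediction is wrong;
# mean() of that logical vector is the fraction of wrong predictions
error_rate <- mean(y != y_hat)
error_rate   # 1 wrong out of 4 = 0.25
```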

9.3 Where do people get tripped up? Regression problems.

Where I see people make mistakes on this topic is when assessing a model. The rest of this class will revolve around you learning a new model type and comparing it to the ones you’ve learned before. Thus it’s important to know how to do this! A common mistake is to calculate \(MSE\) for a classification problem, or error rate for a regression problem.

Let’s say you have the following sets of true targets and predicted targets for home prices in Tucson:

true_home_value <- c(214000, 349000, 144500, 273800, 89000)
predicted_home_value <- c(225000, 336000, 175000, 249000, 105000)

Predicting a dollar value is predicting a continuous target, so we need to calculate \(MSE\). We can do that quickly.

home_value_mse <- mean((true_home_value - predicted_home_value)^2)
home_value_mse
## [1] 418258000

There we go! So we can calculate \(MSE\) values for various models and see which one has the lowest value. That’s our best model then.
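As a sketch of that comparison, here are the same home values scored against a second, hypothetical model’s predictions (the second vector is made up for illustration):

```r
true_home_value       <- c(214000, 349000, 144500, 273800, 89000)
predicted_home_value  <- c(225000, 336000, 175000, 249000, 105000)  # model 1
predicted_home_value2 <- c(210000, 355000, 150000, 270000, 95000)   # model 2 (hypothetical)

# Calculate MSE for each model the same way as before
mse1 <- mean((true_home_value - predicted_home_value)^2)
mse2 <- mean((true_home_value - predicted_home_value2)^2)

c(model_1 = mse1, model_2 = mse2)
```

Whichever model has the lower \(MSE\) is the better one; here that’s model 2.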

If we want to make this more understandable we can take the square root. It’s telling us that on average our model’s prediction of a home’s value is off by 20451 dollars.

sqrt(home_value_mse)
## [1] 20451.36

Where people make mistakes is if they try to calculate the error rate here. You do that by comparing if the predicted value is equal to the true value. But, that doesn’t work with continuous targets as they vary, well, continuously. Watch:

true_home_value != predicted_home_value
## [1] TRUE TRUE TRUE TRUE TRUE

So if you tried to calculate the error rate from this you’d get one, which is obviously wrong.

mean(true_home_value != predicted_home_value)
## [1] 1

9.4 Where do people get tripped up? Classification problems.

The same issues pop up if you try to calculate \(MSE\) for classification problems. Let’s say you are looking at credit card fraud, where you are predicting if a transaction is fraudulent or not. You have the following sets of true and predicted values:

true_trans_fraud <- c('yes', 'no', 'no', 'no', 'yes', 'no', 'yes')
predicted_trans_fraud <- c('yes', 'yes', 'no', 'no', 'no', 'no', 'yes')

We can get error rate by taking the average of how often the true doesn’t equal the predicted.

mean(true_trans_fraud != predicted_trans_fraud)
## [1] 0.2857143

So our model incorrectly classifies transactions roughly 29% of the time.

Calculating \(MSE\) obviously doesn’t work here, as you can’t subtract text classes from each other.
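You can see the failure for yourself; the subtraction errors out, so it’s wrapped in `tryCatch` below just to capture the error message and keep the script running:

```r
true_trans_fraud      <- c('yes', 'no', 'no', 'no', 'yes', 'no', 'yes')
predicted_trans_fraud <- c('yes', 'yes', 'no', 'no', 'no', 'no', 'yes')

# Attempting MSE on character vectors fails; capture the error message
result <- tryCatch(
  mean((true_trans_fraud - predicted_trans_fraud)^2),
  error = function(e) conditionMessage(e)
)
result   # "non-numeric argument to binary operator"
```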

9.5 Conclusion

To wrap up, make sure you understand the nature of the problem you’re trying to solve (regression vs. classification) so that you can use the right error measure. R will happily give you an answer, as it does when you try to calculate the error rate for a regression problem, but that sure doesn’t mean the answer is right!