19.2 Essentials of prediction
Consider the following examples:
This screening test is 99% accurate in detecting some condition.
This algorithm detects fraudulent credit card transactions with an accuracy of 99%.
Both statements sound impressive, but we should not stop there. Instead, we should ask further questions:
What is the measure on which the test quality is being evaluated and reported? What about other measures?
What is the baseline performance and the performance of alternative algorithms?
Three key questions to ask before trying to predict anything:
What type of predictive task is being performed?
How is performance or predictive success evaluated?
What is the baseline performance and the performance of alternative benchmarks?
Distinguish between different prediction tasks and the measures used for evaluating success:
1. Two principal predictive tasks: classification vs. point predictions
2. Different measures for evaluating (quantifying) the quality of predictions
Note: It is tempting to view classification tasks and point-prediction tasks as “qualitative” vs. “quantitative” types of prediction. However, this would be misleading, as qualitative predictions are also evaluated in a quantitative fashion. Thus, we prefer to distinguish between different tasks, rather than between different types of prediction.
19.2.1 Types of tasks
ad 1A: Two types of predictive tasks
qualitative prediction tasks: Classification tasks. Main goal: Predict the membership in some category.
Secondary goal: Evaluation by a 2x2 matrix of predicted vs. true cases (with 2 correct cases and 2 errors).
quantitative prediction tasks: Point predictions with numeric outcomes. Main goal: Predict some value on a scale.
Secondary goal: Evaluation by the distance between predicted and true values.
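The two evaluation schemes above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the cell labels (hit, false alarm, miss, correct rejection), and the small data sets are made up for this example, and mean absolute error and root mean squared error are only two of several possible distance measures.

```python
import math

def confusion_matrix_2x2(predicted, actual):
    """Count the four cells of a 2x2 table for binary labels (True = positive)."""
    counts = {"hit": 0, "false_alarm": 0, "miss": 0, "correct_rejection": 0}
    for p, a in zip(predicted, actual):
        if p and a:
            counts["hit"] += 1                # correct: predicted positive, truly positive
        elif p and not a:
            counts["false_alarm"] += 1        # error: predicted positive, truly negative
        elif not p and a:
            counts["miss"] += 1               # error: predicted negative, truly positive
        else:
            counts["correct_rejection"] += 1  # correct: predicted negative, truly negative
    return counts

def mean_absolute_error(predicted, actual):
    """Average absolute distance between predicted and true values."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def root_mean_squared_error(predicted, actual):
    """Square root of the average squared distance (penalizes large errors more)."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

# Classification task (hypothetical data): 2 correct cell types, 2 error types
pred_class = [True, True, False, False, True, False]
true_class = [True, False, False, True, True, False]
cm = confusion_matrix_2x2(pred_class, true_class)

# Point prediction task (hypothetical data): distance between predicted and true values
pred_value = [2.0, 4.0, 6.0]
true_value = [3.0, 4.0, 8.0]
mae = mean_absolute_error(pred_value, true_value)
rmse = root_mean_squared_error(pred_value, true_value)

print(cm)
print(mae, rmse)
```

Note that both tasks end up with numbers: counts per cell in one case, distances in the other.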
Note: Some authors (e.g., in “7 Fitting models with parsnip” of Tidy Modeling with R) distinguish between different modes. The mode reflects the type of prediction outcome. For numeric outcomes, the mode is regression; for qualitative outcomes, it is classification.
19.2.2 Evaluating predictive success
ad 1B: Given some prediction, how is predictive success evaluated (quantified)?
Remember the earlier example of the mammography problem: Screening has high sensitivity and specificity, but a low positive predictive value (PPV).
Note that, for all types of prediction, there are always trade-offs between the many alternative measures for quantifying success. Maximizing only one of them can be dangerous and misleading.
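To see how a test can score well on some measures and poorly on others, the following sketch derives several measures from the same three inputs. The numbers are assumptions chosen for illustration (a prevalence of 1%, a sensitivity of 90%, and a specificity of 90%), not the figures of the earlier mammography example, and the function name is hypothetical.

```python
def screening_measures(prevalence, sensitivity, specificity, n=1000):
    """Derive accuracy, PPV, and NPV from expected frequencies in a population of n."""
    true_pos = n * prevalence * sensitivity               # condition present, test positive
    false_neg = n * prevalence * (1 - sensitivity)        # condition present, test negative
    true_neg = n * (1 - prevalence) * specificity         # condition absent, test negative
    false_pos = n * (1 - prevalence) * (1 - specificity)  # condition absent, test positive
    return {
        "accuracy": (true_pos + true_neg) / n,
        "ppv": true_pos / (true_pos + false_pos),  # P(condition | positive test)
        "npv": true_neg / (true_neg + false_neg),  # P(no condition | negative test)
    }

# Illustrative values only (assumed, not taken from the text)
m = screening_measures(prevalence=0.01, sensitivity=0.90, specificity=0.90)

for name, value in m.items():
    print(f"{name}: {value:.3f}")
# With these inputs the PPV is 9/108, i.e., about 0.083,
# although sensitivity, specificity, and accuracy are all about 0.90.
```

Under these assumptions, fewer than one in ten positive results indicates an actual case, which is why reporting a single measure can mislead.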
19.2.3 Baseline performance and other benchmarks
ad 2: At first glance, the instruction “Predict the phenomenon of interest with high accuracy” seems a reasonable answer to the question “What characterizes a successful predictive algorithm?”
However, high accuracy is not very impressive if the baseline is already quite high. For instance, if it rains on only 10% of all summer days in some region, always predicting “no rain” will achieve an accuracy of 90%.
This seems trivial, but consider the above examples of detecting some medical condition or fraudulent credit card transactions: For a rare medical condition, a pseudo-test that always says “healthy” would achieve an impressive accuracy without detecting a single case. Similarly, if over 99% of all credit card transactions are legitimate, always predicting “transaction is ok” would achieve an accuracy of over 99%…
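The rain example can be checked with a short sketch. The data set of 100 summer days is hypothetical and simply encodes the 10% rain rate mentioned above; the helper functions are written for this illustration.

```python
def accuracy(predicted, actual):
    """Proportion of predictions that match the true outcome."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def sensitivity(predicted, actual, positive):
    """Proportion of truly positive cases that were predicted as positive."""
    preds_for_positives = [p for p, a in zip(predicted, actual) if a == positive]
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)

# 100 hypothetical summer days, 10 of them with rain
actual = ["rain"] * 10 + ["no rain"] * 90

# Naive rule: always predict the more frequent outcome
always_no_rain = ["no rain"] * len(actual)

baseline_accuracy = accuracy(always_no_rain, actual)
baseline_sensitivity = sensitivity(always_no_rain, actual, positive="rain")

print(baseline_accuracy)     # 0.9
print(baseline_sensitivity)  # 0.0
```

The naive rule reaches 90% accuracy while identifying none of the rainy days, so any serious algorithm has to be judged against this 90%, not against 0%.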
Hence, we should always ask ourselves:
What is the lowest possible benchmark (e.g., for random predictions)?
What levels can be achieved by naive or very simple predictions?
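Both questions can be answered before building any model. The sketch below computes the expected accuracy of three naive strategies for a binary outcome, assuming that the strategy predicts the positive class with a fixed probability that is independent of the true outcome. The function name and the 10% base rate (borrowed from the rain example) are assumptions made for illustration.

```python
def expected_accuracy(base_rate, p_predict_positive):
    """Expected accuracy when the positive class is predicted with a fixed
    probability, independently of the true outcome."""
    correct_positive = base_rate * p_predict_positive
    correct_negative = (1 - base_rate) * (1 - p_predict_positive)
    return correct_positive + correct_negative

base_rate = 0.10  # assumed proportion of positive cases (e.g., rainy days)

benchmarks = {
    "coin flip": expected_accuracy(base_rate, 0.5),
    "random guess matching the base rate": expected_accuracy(base_rate, base_rate),
    "always predict the majority class": expected_accuracy(base_rate, 0.0),
}

for strategy, acc in benchmarks.items():
    print(f"{strategy}: {acc:.2f}")
# coin flip: 0.50
# random guess matching the base rate: 0.82
# always predict the majority class: 0.90
```

With a 10% base rate, even a random strategy that merely respects the base rate is right about 82% of the time, which shows how much of a reported accuracy may come from the data rather than from the algorithm.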
As perfection is typically impossible, we need to decide how much better our rule needs to be than alternative algorithms. For this latter evaluation, it is important to know the competition.