
8.3 The dangers of backtesting

We have learned from the “seven sins of quantitative investing” in Section 8.2 that backtesting is a dangerous process, fraught with many potential pitfalls. In fact, there are more than these seven types of errors one can make (López de Prado, 2018a):

“A full book could be written listing all the different errors people make while backtesting.”

Arguably, the most common mistake in backtesting involves overfitting or data snooping. The following quote by John von Neumann about the general concept of overfitting is quite amusing and illustrative (Mayer et al., 2010):

“With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.”

Backtest overfitting

Overfitting is a concept borrowed from machine learning and refers to the situation in which a model fits particular (noisy) observations rather than a general and persistent structure in the data. In the context of investment strategies, it takes place when a strategy is developed to perform well on a backtest by monetizing random historical patterns. Because those random patterns are unlikely to occur again in the future, the strategy so developed will fail.

This is why the performance of a strategy on the training data (in-sample data) can be totally misleading and we need to resort to test data (out-of-sample data). However, even the performance on the test data can be totally misleading (D. H. Bailey et al., 2014; D. Bailey et al., 2016). The reason is that, when a researcher is backtesting an investment strategy, it is only natural to adjust some of the strategy’s parameters to see the effect on the backtest performance. By doing this, typically over and over again, the test data inevitably and indirectly becomes part of the training data (or cross-validation data) and is not really test data anymore. This leads to the unfounded belief that a portfolio will perform well, only to find out in live trading that it does not live up to expectations. It also leads to publications with backtest results that are not representative of reality.

The reality is that it takes a relatively small number of trials to identify an investment strategy with a spuriously high backtested performance, especially for complex strategies (D. H. Bailey et al., 2014), where it is argued that

“Not reporting the number of trials involved in identifying a successful backtest is a similar kind of fraud.”

Indeed, what makes backtest overfitting so hard to assess is that the probability of false positives changes with every new test conducted on the same dataset. That information is either unknown to the researcher or not shared with investors or referees. The only backtests that most people share are those that portray supposedly winning investment strategies.
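To get a feel for how quickly multiple trials on the same dataset produce a spuriously good backtest, consider the following minimal sketch in Python (the sample sizes, number of trials, and volatility level are hypothetical). It generates purely random strategies with no true edge, cherry-picks the one with the best in-sample Sharpe ratio, and then evaluates it out of sample:

```python
import numpy as np

rng = np.random.default_rng(42)

T_is, T_oos, n_trials = 1250, 1250, 200   # hypothetical: ~5 years in/out of sample, 200 trials

# purely random daily P&L for each trial: zero true edge by construction
pnl_is  = rng.normal(loc=0.0, scale=0.01, size=(n_trials, T_is))
pnl_oos = rng.normal(loc=0.0, scale=0.01, size=(n_trials, T_oos))

def sharpe_annualized(x, periods_per_year=252):
    return np.mean(x, axis=-1) / np.std(x, axis=-1) * np.sqrt(periods_per_year)

sr_is = sharpe_annualized(pnl_is)
best = np.argmax(sr_is)   # cherry-pick the best backtest

print(f"best in-sample Sharpe ratio: {sr_is[best]:.2f}")                          # typically > 1 by luck alone
print(f"its out-of-sample Sharpe   : {sharpe_annualized(pnl_oos[best]):.2f}")     # close to 0
```

With a couple of hundred trials, the selected strategy routinely shows an annualized in-sample Sharpe ratio above 1, even though every trial is pure noise.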

\(p\)-hacking

In fact, backtest overfitting due to multiple testing is related to a more general phenomenon in statistics called “\(p\)-hacking”, also known as cherry-picking, data dredging, significance chasing, significance questing, and selective inference. The term \(p\)-hacking refers to the misuse of data analysis to find patterns in data that can be presented as statistically significant, thus dramatically increasing the risk of false positives while simultaneously understating it. This is done by performing many statistical tests on the data and reporting only those that come back with significant results. We next elaborate on the concepts of \(p\)-value and \(p\)-hacking.

In hypothesis testing, one wants to determine whether the data really come from some candidate distribution, the so-called null hypothesis. This can be formally assessed via the \(p\)-value, which is the probability of obtaining results at least as extreme as those observed, under the assumption that the null hypothesis is correct. A small \(p\)-value means that there is strong evidence to reject the null hypothesis in favor of the alternative hypothesis. Typical thresholds for determining whether a \(p\)-value is small enough are in the range 0.01–0.05. Thus, if the \(p\)-value is smaller than the threshold, we can reject the null hypothesis that the data came from that distribution. The \(p\)-value is routinely used in all scientific areas, such as physics (to determine whether the data support or reject a hypothesis) or medicine (to determine the effectiveness of new drugs).
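As a concrete, hypothetical illustration (not taken from the cited references), the \(p\)-value for the null hypothesis “the strategy has zero mean daily return” can be obtained with a one-sample \(t\)-test on the backtested daily returns; the return series below is simulated and the drift level is arbitrary:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# hypothetical backtested daily returns of a strategy (pure noise plus a tiny drift)
daily_returns = rng.normal(loc=0.0002, scale=0.01, size=1000)

# null hypothesis H0: the mean daily return is zero (no edge)
t_stat, p_value = stats.ttest_1samp(daily_returns, popmean=0.0, alternative="greater")
print(f"t-statistic = {t_stat:.2f}, p-value = {p_value:.3f}")
# a p-value below, say, 0.05 would lead to rejecting H0 -- for a single, pre-registered test
```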

The term “\(p\)-hacking” refers to the dangerous practice of testing multiple hypotheses and reporting (cherry-picking) only the one that produces a small \(p\)-value. For example, a researcher may report a portfolio showing excellent results during the period 1970–2017, but not reveal that the same result is weaker for the period 1960–2017. Similarly, a portfolio may look profitable on some specific universe of stocks, while the fact that a variation of the universe produced degraded performance goes unreported. The problem with this practice is that the number of experiments conducted is typically omitted when reporting the results. The reader may then wrongly infer that the result came from a single trial.
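The effect of cherry-picking across many tests can be seen with a small simulation (again a sketch with hypothetical numbers, not from the cited references): test many variations of a strategy that has no true edge and keep only the smallest \(p\)-value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_variations, T = 50, 1000   # hypothetical: 50 strategy variations, each with no true edge

p_values = np.array([
    stats.ttest_1samp(rng.normal(0.0, 0.01, size=T), popmean=0.0, alternative="greater").pvalue
    for _ in range(n_variations)
])

print(f"smallest p-value over {n_variations} trials: {p_values.min():.4f}")
print(f"probability of at least one 'significant' result at the 5% level: "
      f"{1 - 0.95**n_variations:.2f}")   # about 0.92 for 50 independent trials
```

Reporting only the winning variation, without mentioning the other 49, is exactly the kind of selective inference described above.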

Indeed, most of the claimed research findings in financial economics are likely false due to \(p\)-hacking (C. R. Harvey, 2017; C. R. Harvey et al., 2016). For example, some observations from (C. R. Harvey, 2017) include:

“Empirical research in financial economics relies too much on \(p\)-values, which are poorly understood in the first place.”

and

“Journals want to publish papers with positive results and this incentivizes researchers to engage in data mining and \(p\)-hacking.”

Backtests are not experiments

Experiments, e.g., in physics, are conducted in a lab and can be repeated multiple times to control for different variables. In contrast, a backtest is a historical simulation of how a strategy would have performed in the past. Thus, a backtest is not an experiment, and it does not prove anything.

In fact, a backtest guarantees nothing, not even achieving that same Sharpe ratio if we could travel back in time, simply because the random draws would have been different and the past would not repeat itself (López de Prado, 2018a).

The paradox of flawless backtests

The irony of a backtest is that, even if it is flawless, it is probably wrong (López de Prado, 2018a). Indeed, suppose you have implemented a flawless backtest (i.e., everyone can reproduce your results, you have considered more than the necessary slippage and transaction costs, etc.) and it still shows a good performance. Unfortunately, this flawless backtest is still probably wrong. Why?

First of all, only an expert can produce a flawless backtest. This expert must have run a myriad of backtests over the years. So we need to account for the possibility that this is a false discovery, a statistical fluke that inevitably comes up after running multiple tests on the same dataset (i.e., overfitting).

The maddening thing about backtesting is that, the better you become at it, the more likely it is that false discoveries will pop up (López de Prado, 2018a).

Limitations of backtesting insights

Backtesting provides us with very little insight into the reason why a particular strategy would have made money (López de Prado, 2018a). Just as a lottery winner may feel he has done something to deserve his luck, there is always some ex-post story.

Regarding financial data, many authors claim to have found hundreds of “alphas” and “factors,” and there is always some convoluted explanation for them. Instead, what they have found are the lottery tickets that won the last game. Those authors never tell us about all the tickets that were sold, that is, the millions of simulations it took to find these “lucky” alphas.

What is the point of backtesting then?

While backtesting cannot guarantee the future good performance of a strategy, it can serve the opposite purpose, i.e., to identify strategies that underperform so that we can eliminate them.

In addition, a backtest can provide a sanity check on a number of variables, including bet sizing, turnover, resilience to costs, and behavior under a given scenario.
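For instance, a backtest readily exposes quantities such as turnover and sensitivity to transaction costs. The following minimal sketch (with simulated returns, arbitrary daily weights, and a hypothetical cost level of 30 bps per unit of turnover) illustrates the kind of sanity check meant here:

```python
import numpy as np

rng = np.random.default_rng(2)
T, N = 252, 5                                  # hypothetical: one year, 5 assets
returns = rng.normal(0.0005, 0.01, size=(T, N))
weights = rng.dirichlet(np.ones(N), size=T)    # hypothetical daily portfolio weights

gross_returns = np.sum(weights * returns, axis=1)
# simplified turnover: ignores weight drift between rebalances
turnover = np.sum(np.abs(np.diff(weights, axis=0)), axis=1)
fee = 30e-4                                    # hypothetical 30 bps per unit of turnover
net_returns = gross_returns[1:] - fee * turnover

print(f"average daily turnover       : {turnover.mean():.2f}")
print(f"annualized gross Sharpe ratio: {gross_returns.mean()/gross_returns.std()*np.sqrt(252):.2f}")
print(f"annualized net Sharpe ratio  : {net_returns.mean()/net_returns.std()*np.sqrt(252):.2f}")
```

A strategy whose performance evaporates once turnover costs are charged fails this sanity check and can be discarded without further tuning.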

Thus, the purpose of a backtest is to discard bad models, not to improve them. It may sound counterintuitive, but one should not adjust the model based on the backtest results, since doing so is a waste of time and is dangerous due to overfitting.

One should invest time and effort developing a sound strategy. However, by the time backtests are performed, it is too late to modify the strategy. So never backtest until your model has been fully specified (López de Prado, 2018a).

Summarizing, a good backtest can still be extremely helpful, but backtesting well is extremely hard.

Recommendations to avoid overfitting

How to address backtest overfitting is arguably the most fundamental question in quantitative finance. While there is no easy way to prevent backtest overfitting, a number of recommendations were compiled in (López de Prado, 2018a) and some are listed here for convenience:

  • Develop models for entire asset classes or investment universes, rather than for specific securities, to reduce the probability of false discoveries.

  • Apply model averaging (see Chapter 14 for details) as a means to both prevent overfitting and reduce the variance of the forecasting error.

  • Do not backtest until all your research is complete (i.e., do not fall into the vicious circle of tweaking parameters and rerunning the backtest over and over again).

  • Keep track of the number of backtests conducted on a dataset so that the probability of backtest overfitting may be estimated and the Sharpe ratio may be properly deflated (D. Bailey and López de Prado, 2014).

  • Apart from backtesting on historical data, consider simulating scenarios rather than history, such as stress tests (see Section 8.5 for details). Your strategy should be profitable under a wide range of scenarios, not just the anecdotal historical path; a minimal resampling sketch is given after this list.
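As a minimal illustration of the last point (a sketch only; Section 8.5 covers proper stress testing), one can bootstrap alternative return paths from the strategy’s backtested P&L and check that its performance does not hinge on one particular historical path. The P&L series below is a simulated placeholder:

```python
import numpy as np

rng = np.random.default_rng(3)
historical_pnl = rng.normal(0.0004, 0.01, size=1250)   # placeholder for the strategy's backtested daily P&L

def sharpe(x, periods_per_year=252):
    return x.mean() / x.std() * np.sqrt(periods_per_year)

# resample the historical P&L to generate alternative scenarios
# (an i.i.d. bootstrap ignores serial dependence; a block bootstrap would be more realistic)
n_scenarios = 1000
boot_sharpes = np.array([
    sharpe(rng.choice(historical_pnl, size=historical_pnl.size, replace=True))
    for _ in range(n_scenarios)
])

print(f"historical Sharpe ratio        : {sharpe(historical_pnl):.2f}")
print(f"5th percentile across scenarios: {np.percentile(boot_sharpes, 5):.2f}")
```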

A list of best research practices for backtesting was proposed in (Arnott et al., 2019), including:

  • Establish an ex ante economic foundation: following the scientific method, a hypothesis is developed, and the empirical tests attempt to find evidence inconsistent with the hypothesis.

  • Beware an ex post economic foundation: it is also almost always a mistake to create an economic story—a rationale to justify the findings—after the data mining has occurred.

  • Keep track of the multiple tests tried: this is needed to assess the statistical significance of the results; with enough trials, almost any spurious result can be obtained.

  • Define the test sample ex ante: the sample data and data transformations (such as volatility scaling or standardization) should never change after the research begins.

  • Acknowledge that out-of-sample data is not really out of sample: researchers have lived through the hold-out sample and thus understand the history, know when markets rose and fell, and associate leading variables with past experience. As such, no true out-of-sample data exists; the only true out-of-sample test is the live trading experience.

  • Understand that iterated out-of-sample testing is not out of sample.

  • Do not ignore trading costs and fees.

  • Refrain from tweaking the model.

  • Beware of model complexity: pursue simplicity and regularization.

Mathematical tools to combat overfitting

A number of mathematical techniques have been proposed in the past decade to combat backtest overfitting; to name a few:

  • A general framework to assess the probability of backtest overfitting was proposed in (D. H. Bailey et al., 2017).

  • For the single-testing case, the minimum backtest length metric was proposed in (D. H. Bailey et al., 2014) to avoid selecting a strategy with a high Sharpe ratio on in-sample data but a zero or negative Sharpe ratio on out-of-sample data. A probabilistic Sharpe ratio (PSR) was proposed in (D. H. Bailey and López de Prado, 2012) to calculate the probability of an estimated Sharpe ratio being greater than a benchmark Sharpe ratio.

  • For the multiple-testing case, the deflated Sharpe ratio (DSR) was developed in (D. Bailey and López de Prado, 2014) to provide a more robust performance statistic, in particular when the returns follow a non-normal distribution; a minimal sketch of the PSR and DSR computations is given after this list.

  • Online tools are presented in (D. H. Bailey et al., 2016) to demonstrate how easy it is to overfit an investment strategy, and how this overfitting may impact the financial bottom-line performance.

  • Section 8.4.4 describes a way to execute multiple randomized backtests that helps prevent overfitting.
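To make the PSR and DSR items above more concrete, the following sketch implements the probabilistic Sharpe ratio of (D. H. Bailey and López de Prado, 2012) and the expected-maximum-Sharpe benchmark used by the deflated Sharpe ratio of (D. Bailey and López de Prado, 2014); the input numbers (number of observations, number of trials, variance of Sharpe ratios across trials) are hypothetical.

```python
import numpy as np
from scipy import stats

def psr(sr_hat, sr_benchmark, T, skew, kurt):
    """Probabilistic Sharpe ratio: Prob(true SR > sr_benchmark).
    sr_hat and sr_benchmark are non-annualized; kurt is the non-excess kurtosis (3 for Gaussian)."""
    denom = np.sqrt(1 - skew * sr_hat + (kurt - 1) / 4 * sr_hat**2)
    return stats.norm.cdf((sr_hat - sr_benchmark) * np.sqrt(T - 1) / denom)

def expected_max_sharpe(var_sr_trials, n_trials):
    """Expected maximum Sharpe ratio among n_trials unskilled trials, used as the DSR benchmark."""
    gamma = 0.5772156649  # Euler-Mascheroni constant
    return np.sqrt(var_sr_trials) * (
        (1 - gamma) * stats.norm.ppf(1 - 1 / n_trials)
        + gamma * stats.norm.ppf(1 - 1 / (n_trials * np.e))
    )

# hypothetical example: daily Sharpe ratio of 0.1 over 1250 days, selected among 100 trials
sr_hat, T = 0.1, 1250
sr0 = expected_max_sharpe(var_sr_trials=0.002, n_trials=100)   # hypothetical variance across trials
print(f"PSR against SR=0 benchmark: {psr(sr_hat, 0.0, T, skew=0.0, kurt=3.0):.3f}")
print(f"deflated benchmark SR*    : {sr0:.3f}")
print(f"DSR (PSR against SR*)     : {psr(sr_hat, sr0, T, skew=0.0, kurt=3.0):.3f}")
```

In this hypothetical setting, the backtest looks almost certainly skillful when judged against a zero benchmark, but once the benchmark is deflated to account for the 100 trials, the evidence of skill largely disappears.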

References

Arnott, R., Harvey, C. R., and Markowitz, H. (2019). A backtesting protocol in the era of machine learning. The Journal of Financial Data Science, 64–74.
Bailey, D. H., Borwein, J. M., López de Prado, M., Salehipour, A., and Zhu, Q. J. (2016). Backtest overfitting in financial markets. Automated Trader, 39(2), 52–57.
Bailey, D. H., Borwein, J. M., López de Prado, M., and Zhu, Q. J. (2014). Pseudo-mathematics and financial charlatanism: The effects of backtest overfitting on out-of-sample performance. Notices of the American Mathematical Society, 61(5), 458–471.
Bailey, D. H., Borwein, J. M., López de Prado, M., and Zhu, Q. J. (2017). The probability of backtest overfitting. Journal of Computational Finance (Risk Journals), 20(4), 458–471.
Bailey, D. H., and López de Prado, M. (2012). The Sharpe ratio efficient frontier. The Journal of Risk, 15(2), 3–44.
Bailey, D., Borwein, J., and López de Prado, M. (2016). Stock portfolio design and backtest overfitting. Journal of Investment Management, 15(1), 1–13.
Bailey, D., and López de Prado, M. (2014). The deflated Sharpe ratio: Correcting for selection bias, backtest overfitting and non-normality. Journal of Portfolio Management, 40(5), 94–107.
Harvey, C. R. (2017). Presidential address: The scientific outlook in financial economics. The Journal of Finance, 72(4), 1399–1440.
Harvey, C. R., Liu, Y., and Zhu, H. (2016). …and the cross-section of expected returns. Review of Financial Studies, 29(1), 5–68.
López de Prado, M. (2018a). Advances in financial machine learning. Wiley.
Mayer, J., Khairy, K., and Howard, J. (2010). Drawing an elephant with four complex parameters. American Journal of Physics, 78(6).