Pseudo-Mathematics and Financial Charlatanism: The Effects of Backtest Overfitting on Out-of-Sample Performance

David H. Bailey, Jonathan M. Borwein, Marcos López de Prado, Qiji Jim Zhu
Notices of the AMS
Lawrence Berkeley National Laboratory

Overall Summary

Study Background and Main Findings

This paper examines the critical issue of backtest overfitting in financial modeling, where investment strategies appear successful in historical simulations (backtests) due to chance rather than genuine predictive power. The authors argue that this phenomenon is widespread, particularly due to the common practice of not reporting the number of trials (different strategy configurations) attempted during research. They introduce a key metric, the Minimum Backtest Length (MinBTL), which quantifies the minimum amount of historical data needed to have reasonable confidence that a strategy's apparent success is not just due to random chance, given the number of trials performed.

The methodology involves theoretical derivations based on the statistical properties of the Sharpe ratio, a common performance metric. The authors demonstrate how, as the number of trials increases, the likelihood of finding a spuriously high Sharpe ratio also increases. They then derive the MinBTL formula, which shows that the required backtest length grows with the number of trials. Monte Carlo simulations are used to illustrate the concepts and demonstrate the divergence between in-sample (IS, the data used to develop the strategy) and out-of-sample (OOS, new, unseen data) performance. These simulations explore scenarios both with and without "compensation effects," which are factors that can create a negative relationship between IS and OOS performance, making overfitting even more detrimental.
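The MinBTL bound is short enough to sketch in a few lines. The paper approximates the expected maximum of N independent standard-normal Sharpe ratio estimates as E[max_N] ≈ (1-γ)Z⁻¹[1-1/N] + γZ⁻¹[1-1/(Ne)], with γ the Euler–Mascheroni constant, and MinBTL follows by requiring the annualized expected maximum to stay below the target Sharpe ratio. The sketch below is my own rendering of that formula, not the authors' code; function names are illustrative:

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant

def expected_max_sharpe(n_trials: int) -> float:
    # Approximate expected maximum of n_trials independent
    # standard-normal Sharpe ratio estimates.
    q = NormalDist().inv_cdf  # standard normal quantile function
    return ((1 - EULER_GAMMA) * q(1 - 1 / n_trials)
            + EULER_GAMMA * q(1 - 1 / (n_trials * math.e)))

def min_backtest_length(n_trials: int, target_sr: float = 1.0) -> float:
    # Years of data needed so that n_trials skill-less configurations
    # are not expected to yield an annualized in-sample Sharpe ratio
    # of target_sr purely by chance.
    return (expected_max_sharpe(n_trials) / target_sr) ** 2
```

Evaluating this reproduces the paper's examples: `min_backtest_length(45)` is about 5.0 years and `min_backtest_length(7)` about 1.9 years, matching the five-year/45-trial and two-year/seven-trial figures discussed in the text.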

The results reveal that even a relatively small number of trials can lead to significantly inflated IS performance. Critically, when compensation effects are present (e.g., global constraints or serial correlation in returns), optimizing a strategy for IS performance can actively lead to negative OOS performance. A practical example involving the search for a seasonal trading pattern in random data highlights how easily spurious "discoveries" can be made. The authors argue that not reporting the number of trials is akin to fraud, as it obscures the risk of overfitting and misleads investors.

The paper concludes with a strong call for higher standards in financial research and practice. The authors emphasize the importance of reporting the number of trials and using the MinBTL to assess backtest reliability. They also discuss the ethical implications of presenting overfit results and urge the mathematical community to address the misuse of mathematical concepts in finance. The paper aims to spark a broader discussion about overfitting and its consequences, advocating for greater transparency and rigor in the evaluation of investment strategies.

Research Impact and Future Directions

This paper provides valuable insights into the pervasive problem of backtest overfitting in finance. By quantifying the relationship between the number of trials and the minimum backtest length needed to avoid spurious results, the authors offer a practical tool (MinBTL) for researchers and investors. The paper's strength lies in its clear exposition of the problem, rigorous mathematical framework, and compelling simulations demonstrating how overfitting can lead to actively detrimental out-of-sample performance. The ethical discussion surrounding the non-disclosure of trials is particularly relevant, highlighting the potential for misleading practices in the field.

However, the paper's reliance on the assumption of independent trials in the MinBTL calculation is a notable limitation. While the authors briefly mention dimension reduction techniques like PCA, further exploration of how to handle correlated trials, which are common in practice, would enhance the framework's applicability. Additionally, a more detailed sensitivity analysis of MinBTL to its input parameters would provide further practical guidance. Despite these limitations, the paper makes a significant contribution by raising awareness about the dangers of overfitting and advocating for greater transparency and rigor in financial research and practice. The call for higher standards, stricter validation procedures, and explicit reporting of all trials performed is crucial for improving the reliability and trustworthiness of backtested investment strategies.

The study's simulation-based design, while effective in demonstrating the potential for overfitting, inherently limits its ability to definitively capture the full complexity of real-world financial markets. The simplified scenarios, though illustrative, may not fully represent the intricate interactions and dynamics that influence actual trading outcomes. Therefore, the findings should be interpreted as demonstrating the potential for harm from overfitting, rather than providing precise predictions of real-world losses. Further research involving empirical data and more realistic market simulations would be valuable in validating and refining the proposed framework. Despite this inherent limitation of its design, the study successfully raises a critical issue and provides a valuable starting point for developing more robust methods for evaluating and mitigating the risks of backtest overfitting.

Critical Analysis and Recommendations

Clear Problem Definition (written-content)
The introduction clearly articulates the problem of backtest overfitting and its implications, setting a strong foundation for the paper's arguments. This clarity is essential for engaging both expert and lay readers and establishing the relevance of the research.
Section: Introduction
Define 'Pseudo-Mathematics' (written-content)
The paper lacks an early, explicit definition of "pseudo-mathematics," a central term. Adding a concise definition when the term is first introduced would enhance clarity and strengthen the framing of the core argument.
Section: Introduction
Rigorous Mathematical Framework (written-content)
The method section develops a rigorous mathematical framework for analyzing backtest overfitting, culminating in the MinBTL formula. This quantitative approach provides concrete tools for assessing backtest reliability.
Section: Method
Address Correlated Trials (written-content)
The MinBTL calculation assumes independent trials, a significant practical limitation. Expanding on how to estimate an 'effective N' for correlated trials would greatly enhance the framework's real-world applicability.
Section: Method
Effective Visualization of MinBTL (graphical-figure)
Figure 2 effectively visualizes the relationship between the number of trials and the minimum backtest length, clearly illustrating the tradeoff and supporting the core concept of MinBTL. This visualization aids in understanding the practical implications of increasing trials.
Section: Results
Quantify OOS Degradation (written-content)
While Figures 6 and 7 show a negative relationship between in-sample and out-of-sample performance, quantifying the extent of out-of-sample degradation for top in-sample performers would strengthen these key results.
Section: Results
Strong Conclusion and Call to Action (written-content)
The discussion powerfully concludes the paper by synthesizing findings, calling for higher standards, and highlighting ethical considerations. This strong call to action increases the paper's impact beyond purely academic circles.
Section: Discussion
Quantify Negative OOS Performance (written-content)
The conclusion states that out-of-sample performance can be "significantly negative." Quantifying this with a representative range from the simulations would provide a more concrete and memorable takeaway.
Section: Discussion

Section Analysis

Non-Text Elements

Figure 1. Overfitting a backtest's results as the number of trials grows.
First Reference in Text
Figure 1 provides a graphical representation of Proposition 1.
Description
Figure 2. Minimum Backtest Length needed to avoid overfitting, as a function of...
Full Caption

Figure 2. Minimum Backtest Length needed to avoid overfitting, as a function of the number of trials.

First Reference in Text
Figure 2 shows the tradeoff between the number of trials (N) and the minimum backtest length (MinBTL) needed to prevent skill-less strategies to be generated with a Sharpe ratio IS of 1.
Description
  • Axes and Variables: The graph displays a single curve on a 2D plot. The horizontal x-axis, labeled 'Number of Trials (N)', ranges from 0 to 1000. The 'Number of Trials' refers to how many different investment strategy configurations are tested. The vertical y-axis, labeled 'Minimum Backtest Length (in Years)', ranges from 0 to 12. The 'Minimum Backtest Length (MinBTL)' signifies the shortest historical data period, in years, that a simulated investment strategy (a backtest) should be run on to make its results reliable enough.
  • Trend of the Curve: The blue curve shows that as the 'Number of Trials (N)' increases, the 'Minimum Backtest Length (MinBTL)' required also increases. This increase is steep for a small number of trials (low N) and becomes progressively flatter as N gets larger. For example, to avoid being misled by a strategy that appears to have an in-sample Sharpe ratio (a measure of risk-adjusted return on the tested data) of 1 purely by chance:
  • Data Point: Low N: If only a few trials are conducted (e.g., around N=7, as implied by the text referencing a two-year backtest), the MinBTL is approximately 2 years.
  • Data Point: Moderate N (45): If the number of trials increases to N=45 (as implied by the text for a five-year backtest), the MinBTL needed rises to 5 years.
  • Data Point: Moderate N (200): For N=200 trials, the MinBTL is approximately 7 years.
  • Data Point: High N: When a large number of trials, N=1000, are performed, the MinBTL required is about 11.5 years.
  • Core Implication for Overfitting: The figure illustrates a crucial concept in financial strategy development: the more strategies you try (increasing N), the longer your historical testing period (MinBTL) must be. This is to avoid 'overfitting,' where a strategy looks good on past data simply because it was one of many tested, rather than possessing genuine predictive power. The reference text specifies this graph is for preventing 'skill-less strategies' (those with no real insight) from falsely appearing to achieve an in-sample Sharpe ratio of 1.
Scientific Validity
  • ✅ Appropriate Visualization Method: The line graph is an appropriate choice for visualizing the functional relationship between the number of trials (N) and the Minimum Backtest Length (MinBTL).
  • ✅ Supports Stated Claims: The figure clearly supports the paper's argument, as stated in the caption and reference text, that increasing the number of trials necessitates a longer backtest period to avoid spurious results, specifically achieving an in-sample Sharpe ratio of 1 with a skill-less strategy.
  • 💡 Dependence on Underlying Model Assumptions: The curve is presumably derived from a theoretical model or formula (likely Theorem 2 mentioned later in the paper). While the figure itself doesn't detail these assumptions, its validity hinges on the soundness of that underlying model. For instance, the definition of 'skill-less' (implying a true Sharpe ratio of zero) and the statistical properties of returns are crucial implicit assumptions.
  • 💡 Specificity to SR_IS = 1: The figure is generated for a specific target in-sample Sharpe ratio of 1. The relationship shown would change if a different target Sharpe ratio (e.g., 0.5 or 2.0) were considered. This specificity is noted in the reference text but is a key constraint on the direct applicability of these exact MinBTL values.
  • 💡 Assumption of Independent Trials: The concept of 'independent model configurations' (trials) is central. If the N trials are not truly independent (e.g., slight variations of the same core idea), the effective N might be lower, which would affect the MinBTL. The figure assumes independence of trials.
  • ✅ Significant Practical Implication: The practical implication for researchers is significant: it provides a quantitative guideline for determining necessary data length based on the exploratory breadth of their research, which is a valuable contribution to preventing overfitting.
Communication
  • ✅ Clear Axis Labels: The x-axis ('Number of Trials (N)') and y-axis ('Minimum Backtest Length (in Years)') are clearly labeled, making the variables under consideration immediately understandable.
  • ✅ Informative Title and Caption: The title and caption effectively summarize the figure's purpose, which is to show the relationship between the number of strategy trials and the required historical data length to avoid being misled by chance findings.
  • ✅ Visual Clarity and Simplicity: The single, solid blue line is easy to follow, and the overall graph is clean and uncluttered, facilitating a quick grasp of the depicted trend.
  • ✅ Effective Illustration of Tradeoff: The plot directly illustrates the tradeoff described in the reference text and caption, making the concept of increasing MinBTL with more trials intuitive.
  • 💡 Enhanced Gridlines/Tick Marks: While the major gridlines on the y-axis are helpful, adding minor gridlines or more frequent tick marks on both axes could improve the ease of estimating specific values, particularly for intermediate numbers of trials. For example, explicitly marking N=50, 150, etc., and corresponding values on the y-axis.
  • 💡 Reinforce Key Parameter Visually: The caption mentions the goal is to prevent generating a Sharpe ratio IS of 1. Adding a small annotation directly on the graph or explicitly in the axis label, like 'MinBTL (Years) to avoid spurious SR_IS=1', could further reinforce this critical parameter without requiring the reader to refer back to the caption text.
Figure 3. Performance IS vs. OOS before introducing strategy selection.
First Reference in Text
Figure 3 shows the relation between SR IS (x-axis) and SR OOS (y-axis) for μ = 0, σ = 1,N = 1000, T = 1000.
Description
  • Axes and Variables: The figure is a scatter plot illustrating the relationship between two types of performance measures for simulated investment strategies. The x-axis, labeled 'SR a priori', represents the In-Sample Sharpe Ratio (SR IS). The Sharpe Ratio is a measure of risk-adjusted return; 'In-Sample' means it's calculated on the data used to develop or initially observe the strategy. Values range from approximately -2.0 to 1.5. The y-axis, labeled 'SR a posteriori', represents the Out-of-Sample Sharpe Ratio (SR OOS). 'Out-of-Sample' means it's calculated on new data that the strategy hasn't 'seen' before, which is a better test of its true predictive power. Values range from approximately -2.0 to 2.0.
  • Data Points and Simulation Parameters: Each point on the scatter plot represents one simulated strategy (out of N=1000 strategies). These strategies are based on a random walk process, meaning they have no genuine predictive skill. This is indicated by the parameters μ=0 (average return is zero) and σ=1 (standard deviation of returns is one). The simulation period length is T=1000 data points.
  • Distribution of Points: The points are distributed in a roughly circular cloud centered around the origin (0,0). This visual pattern suggests that there is little to no linear relationship between the in-sample performance (SR IS) and the out-of-sample performance (SR OOS) under these conditions.
  • Regression Analysis Results: A regression line is faintly visible, and its equation is provided at the top: '[SR a posteriori]=0.01+0.01*[SR a priori]+err | adjR2=-0.0'. This equation quantifies the relationship. The coefficient for 'SR a priori' (0.01) is very close to zero, and the adjusted R-squared (adjR2) value of -0.0 (effectively zero) indicates that the in-sample performance explains virtually none of the variance in the out-of-sample performance.
  • Core Implication: No Relationship Before Selection: The key takeaway, as stated in the detailed caption below the figure, is that when there are no 'compensation effects' (biases that might link IS and OOS performance), overfitting the in-sample performance (getting a high SR IS by chance) has no bearing on the out-of-sample performance, which remains centered around the true mean (zero in this case). This is before any selection based on IS performance is made.
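The null scenario in this figure is straightforward to reproduce: simulate N skill-less return series as Gaussian noise, compute a Sharpe ratio on the first half (IS) and second half (OOS) of each, and regress one on the other. The sketch below uses per-period Sharpe ratios (annualizing merely rescales both axes and leaves the slope's significance unchanged); all names and parameters are illustrative, not the paper's code:

```python
import random
from statistics import mean, stdev

def sharpe(returns):
    # Per-period Sharpe ratio: mean return over its standard deviation.
    return mean(returns) / stdev(returns)

def simulate_is_oos(n_strategies=1000, n_obs=1000, seed=42):
    # For each skill-less strategy (mu=0, sigma=1 Gaussian returns),
    # return its (SR_IS, SR_OOS) pair from the two halves of the series.
    rng = random.Random(seed)
    half = n_obs // 2
    pairs = []
    for _ in range(n_strategies):
        r = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
        pairs.append((sharpe(r[:half]), sharpe(r[half:])))
    return pairs

def ols_slope(pairs):
    # Slope of the regression of SR_OOS on SR_IS.
    xs, ys = zip(*pairs)
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var
```

With these null parameters the fitted slope comes out near zero, mirroring the near-zero coefficient and adjusted R² reported on the figure.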
Scientific Validity
  • ✅ Appropriate Visualization Method: A scatter plot is an appropriate visualization method to show the relationship (or lack thereof) between two continuous variables like SR IS and SR OOS.
  • ✅ Clear Simulation Parameters: The parameters used (μ=0, σ=1, N=1000, T=1000) are clearly stated, allowing for reproducibility and understanding of the simulation context. The scenario (random walk, no true skill) is crucial for illustrating the baseline relationship before selection bias.
  • ✅ Strong Support for Claims: The visual representation (circular cloud) and the reported regression statistics (adjR2 ≈ 0) strongly support the claim made in the text and caption that, in the absence of compensation effects and before strategy selection, IS performance has no bearing on OOS performance for these skill-less strategies.
  • ✅ Establishes Important Baseline: The figure effectively establishes a baseline. It demonstrates what happens when strategies are generated randomly without any inherent skill, showing that good IS performance can occur by chance without translating to OOS success. This is a critical foundational point for the subsequent discussion on overfitting due to strategy selection.
  • 💡 Clarification of 'err' in Equation (Minor): The term 'err' in the regression equation is standard, but explicitly stating what it represents (e.g., residual error) might be helpful for a broader audience, though it's generally understood in this context.
  • 💡 Inclusion of p-values on Figure: The p-values for the regression coefficients, mentioned in the main text (page 464) as 0.6261 for the intercept and 0.7469 for SR a priori, are not shown on the figure itself. Including these p-values directly on the figure (or at least the p-value for the slope) would provide immediate statistical context for the non-significance of the relationship, further strengthening the visual message.
Communication
  • ✅ Clear Axis Labels and Title: The x-axis ('SR a priori') and y-axis ('SR a posteriori') are clearly labeled, though 'SR a priori' refers to IS performance and 'SR a posteriori' to OOS performance as clarified in the caption below the figure. The title 'OOS Perf. Degradation' above the plot is also informative.
  • ✅ Effective Visualization of Lack of Correlation: The scatter plot effectively visualizes the lack of correlation between in-sample and out-of-sample performance for the given parameters, which is the central message.
  • ✅ Inclusion of Regression Equation: The inclusion of the regression equation '[SR a posteriori]=0.01+0.01*[SR a priori]+err | adjR2=-0.0' directly on the plot is highly informative, quantifying the near-zero relationship.
  • ✅ Informative Detailed Caption: The caption below the figure clearly explains the parameters used (μ=0, σ=1, N=1000, T=1000) and the key takeaway (circular shape, no bearing of IS on OOS performance).
  • 💡 Point Density: The points are somewhat dense, especially around the center. Using a slightly smaller point size or introducing transparency (alpha blending) could improve visibility in the densest regions, though the current representation is still largely interpretable.
  • 💡 Consistency in Terminology on Axes: The terms 'SR a priori' and 'SR a posteriori' are used on the axes, while the main caption uses 'SR IS' and 'SR OOS'. While the detailed caption below the figure clarifies this, consistently using 'SR IS' and 'SR OOS' on the axes themselves would enhance immediate clarity and reduce potential confusion for readers glancing quickly at the plot.
Figure 4. Performance IS vs. performance OOS for one path after introducing...
Full Caption

Figure 4. Performance IS vs. performance OOS for one path after introducing strategy selection.

First Reference in Text
Figure 4 provides a graphical representation of what happens when we select the random walk with highest SR IS.
Description
  • Axes and Time Periods: The figure displays a line graph representing the cumulative performance of a single investment strategy path over time. The x-axis implicitly represents time, divided into two segments: 'In-Sample (IS)' from 0 to approximately 500 (midpoint between 400 and 600) and 'Out-Of-Sample (OOS)' from approximately 500 to 1000. The y-axis, labeled 'Performance', shows values ranging from -10 to 80, representing the cumulative profit or loss of the strategy.
  • Strategy Selection Context: This specific path was chosen because it exhibited the highest In-Sample Sharpe Ratio (SR IS) from a set of simulated random walks. The In-Sample (IS) period is the data timeframe used to identify this 'best' strategy. The Out-of-Sample (OOS) period represents subsequent, unseen data used to test the strategy's true performance. The Sharpe Ratio (SR) is a measure of risk-adjusted return; a higher SR is generally considered better.
  • In-Sample Performance: In the 'In-Sample (IS)' period, the performance line shows a strong and relatively consistent upward trend, starting from 0 and reaching a peak performance of approximately 75. This period yielded an In-Sample Sharpe Ratio (SR_IS) of 1.49, as indicated at the top of the graph.
  • Out-of-Sample Performance: In the subsequent 'Out-Of-Sample (OOS)' period, the performance of the selected strategy becomes much more volatile. It initially declines from its IS peak, then fluctuates, and ultimately ends at a performance level of around 50-55. This OOS period resulted in a negative Out-of-Sample Sharpe Ratio (SR_OOS) of -0.3, also indicated at the top.
  • Core Implication: Overfitting and Performance Degradation: The figure illustrates that a strategy selected for its excellent performance on past data (high SR IS = 1.49) can perform poorly on new data (negative SR OOS = -0.3). This is a classic example of overfitting, where the strategy selection process capitalized on random chance in the in-sample data rather than identifying a genuinely robust predictive pattern.
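The selection effect illustrated here can be sketched in the same spirit: generate many skill-less return series, keep the one with the best in-sample Sharpe ratio, and inspect its out-of-sample Sharpe ratio. The winner's SR_IS is high by construction, while its SR_OOS is just another draw from the null distribution, centered on zero. A minimal sketch, with illustrative names and parameters rather than the authors' simulation code:

```python
import random
from statistics import mean, stdev

def sharpe(returns):
    # Per-period Sharpe ratio: mean return over its standard deviation.
    return mean(returns) / stdev(returns)

def select_best_is(n_strategies=1000, n_obs=1000, seed=7):
    # Pick the skill-less strategy with the highest in-sample Sharpe
    # ratio and report its (per-period) IS and OOS Sharpe ratios.
    rng = random.Random(seed)
    half = n_obs // 2
    best = None
    for _ in range(n_strategies):
        r = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
        sr_is, sr_oos = sharpe(r[:half]), sharpe(r[half:])
        if best is None or sr_is > best[0]:
            best = (sr_is, sr_oos)
    return best
```

The selected SR_IS sits well above the null mean of zero (an artifact of picking the maximum of many trials), whereas the corresponding SR_OOS shows no such inflation, which is the degradation the figure depicts.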
Scientific Validity
  • ✅ Appropriate Visualization Method: The use of a line graph to show cumulative performance over distinct IS and OOS periods is an appropriate and standard way to visualize the phenomenon of overfitting and subsequent performance degradation.
  • ✅ Clearly Illustrates Overfitting Consequence: The figure clearly illustrates the central concept: selecting a strategy based on superior IS performance (especially when that performance is due to chance, as in a random walk) does not guarantee, and can even lead to, poor OOS performance. The SR_IS of 1.49 and SR_OOS of -0.3 quantify this effectively.
  • 💡 Single Path Representation and Context: The figure represents 'one path.' While highly illustrative, it is a single realization, and the strength of the argument depends on it being representative of what happens when the 'best' of many random trials is selected. The caption below the figure notes that 'in the absence of memory, there is no reason to expect overfitting to induce negative performance,' yet this path ends with a negative SR_OOS; that outcome is either a particularly illustrative random draw or the product of a simulation with some form of memory or constraint.
  • ✅ Supports Paper's Narrative: The transition from a high positive SR_IS (1.49) to a negative SR_OOS (-0.3) strongly supports the paper's narrative about the dangers of backtest overfitting when strategy selection is performed without accounting for the selection process itself.
  • 💡 Reliance on Implicit Parameters from Context: The parameters of the underlying random walk (e.g., mean, volatility, number of total paths from which this one was selected) are not detailed in this specific figure's immediate context but are implied from previous sections (e.g., Figure 3 used μ=0, σ=1). Assuming consistency, the results are interpretable within that framework.
Communication
  • ✅ Clear Title and Period Demarcation: The title is clear and concise. The labels for 'In-Sample (IS)' and 'Out-Of-Sample (OOS)' periods directly below the x-axis clearly demarcate the two phases of the performance path.
  • ✅ Direct Display of Key Metrics: Displaying the key metrics 'SR_IS=1.49 | SR_OOS=-0.3' directly on the plot is highly effective for immediate comprehension of the quantitative outcome.
  • ✅ Effective Use of Line Graph: The line graph effectively visualizes the trajectory of performance over time, making the contrast between the IS and OOS periods visually apparent.
  • 💡 Y-axis Label Specificity: The y-axis is labeled 'Performance'. While understandable, specifying 'Cumulative Performance' or 'Cumulative P&L' could offer slightly more precision, though the context makes it clear.
  • 💡 X-axis Unit Label: The x-axis tick labels (200, 400, 600, 800, 1000) represent time or observation counts. Adding a unit label to the x-axis itself (e.g., 'Time (Observations)' or 'Trading Days') would enhance clarity, although the context implies observation counts from the previous figure's T=1000 parameter.
  • ✅ Strong Visual Contrast: The visual distinction between the strong upward trend in the IS period and the volatile, ultimately negative trend in the OOS period is striking and clearly communicates the core message of performance degradation after selection.
Figure 5. Performance degradation after introducing strategy selection in...
Full Caption

Figure 5. Performance degradation after introducing strategy selection in absence of compensation effects.

Figure/Table Image (Page 7)
First Reference in Text
Figure 5 illustrates what happens once we add a “model selection” procedure.
Description
  • Axes and Variables: The figure is a scatter plot showing the relationship between in-sample and out-of-sample performance after a 'model selection' procedure has been applied. The x-axis, labeled 'SR a priori', represents the In-Sample Sharpe Ratio (SR IS), which is a measure of an investment strategy's risk-adjusted return calculated on the data used to select it. These SR IS values range from approximately 1.2 to 2.6. The y-axis, labeled 'SR a posteriori', represents the Out-of-Sample Sharpe Ratio (SR OOS), calculated on new, unseen data to test the strategy's true predictive power. These SR OOS values range from approximately -2.0 to 2.0.
  • Model Selection Context: The 'model selection' procedure means that only strategies that showed high performance (a high SR IS, specifically centered around 1.7 as per the caption) during the in-sample period were chosen for this analysis. This is why the x-axis values are concentrated in a higher positive range compared to a scenario without selection (like Figure 3).
  • Distribution of Points: Each point on the plot represents a strategy. Despite selecting strategies with good in-sample performance (x-values are high), the out-of-sample performance (y-values) is still widely dispersed and visually centered around zero. This indicates that the high in-sample performance did not translate into high out-of-sample performance.
  • Regression Analysis Results: A regression analysis is summarized by the equation '[SR a posteriori]=-0.01+0.0*[SR a priori]+err | adjR2=-0.0' displayed on the plot. The coefficient for 'SR a priori' (the in-sample performance) is 0.0, and the adjusted R-squared (adjR2) is -0.0 (effectively zero). This statistically confirms that the selected in-sample performance has virtually no predictive power for out-of-sample performance in this specific scenario.
  • Core Implication: Selection without Compensation Effects: The figure illustrates that, in the specific situation where there are no 'compensation effects' (biasing factors that might create a link, often negative, between in-sample and out-of-sample results), selecting strategies based on high in-sample performance (e.g., an average SR IS of 1.7) does not lead to improved out-of-sample performance. The out-of-sample Sharpe ratio remains centered around zero, implying the selected strategies were simply lucky in the in-sample period, as expected for fundamentally 'skill-less' strategies (those with a true underlying mean return of zero).
Scientific Validity
  • ✅ Appropriate Visualization Method: The scatter plot is an appropriate method to visualize the relationship between selected SR IS and resulting SR OOS.
  • ✅ Supports Claims Regarding Selection: The figure effectively demonstrates that even after selecting for high SR IS (centered at 1.7), the SR OOS remains centered around 0 when compensation effects are absent. This strongly supports the textual claim.
  • ✅ Quantitative Support via Regression: The inclusion of the regression equation with an adjR2 near zero provides quantitative support for the visual lack of correlation between SR IS and SR OOS post-selection.
  • ✅ Valid Illustration of a Specific Scenario: The key condition for this figure's interpretation is the 'absence of compensation effects.' The figure validly illustrates this specific scenario, which serves as a contrast to scenarios where compensation effects might be present (as discussed later in the paper).
  • 💡 Reliance on Broader Context for Parameters: The simulation parameters (e.g., μ=0, σ=1, N, T) are not explicitly restated for this figure but are implied from the context of previous figures and the overall study design focusing on skill-less strategies. This context is essential for interpretation.
  • 💡 Consider Adding P-values to Figure: The text (page 464) mentions p-values associated with the regression for this scenario (0.2146 for intercept, 0.2131 for slope). Including these p-values directly on the figure would provide immediate statistical evidence for the non-significance of the relationship, further strengthening the conclusion drawn from the adjR2 value.
Communication
  • ✅ Clear Axis Labels and Plot Title: The x-axis ('SR a priori') and y-axis ('SR a posteriori') are clearly labeled. The title 'OOS Perf. Degradation' above the plot effectively communicates the theme.
  • ✅ Direct Inclusion of Regression Statistics: The inclusion of the regression equation '[SR a posteriori]=-0.01+0.0*[SR a priori]+err | adjR2=-0.0' directly on the plot is highly beneficial, immediately quantifying the lack of predictive power of IS performance for OOS performance in this scenario.
  • ✅ Informative Detailed Caption: The detailed caption below the figure clearly explains the shift in the in-sample Sharpe Ratio (SR IS) range (1.2 to 2.6, centered at 1.7) due to model selection and the key finding that out-of-sample Sharpe Ratio (SR OOS) remains centered around 0.
  • ✅ Effective Visualization of Selection Outcome: The scatter plot effectively visualizes the outcome of the model selection procedure, showing a concentration of points at higher SR IS values but still a wide, centered spread for SR OOS.
  • 💡 Consistency in Axis Terminology: The terms 'SR a priori' and 'SR a posteriori' are used on the axes, while the main text often uses 'SR IS' and 'SR OOS'. Consistently using 'SR IS' (for x-axis) and 'SR OOS' (for y-axis) directly on the plot would enhance immediate clarity and reduce the need for readers to mentally map the terms.
  • ✅ Good Point Density and Clarity: The density of points is manageable. The visual message that SR OOS remains centered around zero despite selection for high SR IS is clear.
Figure 6. Performance degradation as a result of strategy selection under...
Full Caption

Figure 6. Performance degradation as a result of strategy selection under compensation effects (global constraint).

Figure/Table Image (Page 8)
First Reference in Text
Figure 6 shows that adding a single global constraint causes the OOS performance to be negative even though the underlying process was trendless.
Description
  • Axes and Variables: The figure is a scatter plot depicting the relationship between in-sample and out-of-sample performance of investment strategies after a selection process and under the influence of a 'global constraint'. The x-axis, labeled 'SR a priori', represents the In-Sample Sharpe Ratio (SR IS). The Sharpe Ratio is a measure of how much return an investment provides for the risk taken; 'In-Sample' refers to performance on the data used to select the strategy. SR IS values range from approximately 0.8 to 2.0. The y-axis, labeled 'SR a posteriori', represents the Out-of-Sample Sharpe Ratio (SR OOS), which measures performance on new, unseen data. SR OOS values are all negative, ranging from approximately -2.0 to -0.8.
  • Context: Global Constraint and Compensation Effects: The 'global constraint' mentioned is a type of 'compensation effect'. Compensation effects are factors or conditions in the data or simulation setup that cause past (in-sample) performance to influence future (out-of-sample) performance, often in a contrary way when strategies are selected based on extreme in-sample results. In this case, even though the underlying process for generating strategy returns was 'trendless' (meaning it had no inherent upward or downward bias, simulated with an average return μ=0), the global constraint forces a negative relationship.
  • Distribution of Points and Negative Correlation: Each point on the plot represents a strategy selected for its in-sample performance. The data points form a clear downward-sloping band, indicating a strong negative correlation: strategies with higher in-sample Sharpe Ratios tend to have lower (more negative) out-of-sample Sharpe Ratios.
  • Regression Analysis Results: A linear regression line is fitted to the data and displayed, along with its equation: '[SR a posteriori]=-0.03+-0.97*[SR a priori]+err | adjR2=0.85'. The coefficient for 'SR a priori' is -0.97, meaning that for each unit increase in the in-sample Sharpe Ratio, the out-of-sample Sharpe Ratio is expected to decrease by approximately 0.97 units. The adjusted R-squared (adjR2) value of 0.85 indicates that 85% of the variance in the out-of-sample Sharpe Ratios can be explained by the in-sample Sharpe Ratios in this model, signifying a very strong relationship.
  • Core Implication: Detrimental Effect of Selection Under Constraint: The core message is that when certain types of constraints or compensation effects are present (here, a 'global constraint'), selecting strategies based on high in-sample performance can lead to predictably poor and negative out-of-sample performance. The better a strategy looked in-sample, the worse it performed out-of-sample. This contrasts with Figure 5, where, in the absence of such effects, OOS performance was simply random around zero despite IS selection.
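One concrete way to realize such a constraint (our own construction for illustration; the paper may implement its global constraint differently) is to force each strategy's full-sample mean return to zero, so whatever is earned in-sample must be given back out-of-sample:

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 1000, 1000      # assumed simulation parameters (mu=0, sigma=1)
half = T // 2

returns = rng.normal(0.0, 1.0, size=(N, T))
# Hypothetical global constraint: each strategy's full-sample mean is forced
# to zero, so the IS and OOS halves must compensate each other exactly.
returns -= returns.mean(axis=1, keepdims=True)

def sharpe(a):
    return a.mean(axis=1) / a.std(axis=1, ddof=1)

sr_is = sharpe(returns[:, :half])
sr_oos = sharpe(returns[:, half:])

slope, intercept = np.polyfit(sr_is, sr_oos, 1)
print(f"slope: {slope:.2f}")
```

Under this hypothetical constraint the regression slope comes out near -1 with a very high R², mirroring the figure's strongly negative, tight relationship: the better a strategy looked in-sample, the worse it must do out-of-sample.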
Scientific Validity
  • ✅ Appropriate Visualization Method: The scatter plot with a fitted regression line is an appropriate and effective method for visualizing the strong linear relationship between SR IS and SR OOS under the specified global constraint.
  • ✅ Strong Support for Claims: The figure strongly supports the claim in the reference text and caption that the introduction of a global constraint induces negative OOS performance and a strong negative correlation with IS performance, even from an initially trendless process. The adjR2 of 0.85 is compelling.
  • ✅ Highlights Significant Overfitting Mechanism: The demonstration of how a global constraint can induce such a strong negative relationship is a significant finding, highlighting a specific mechanism for detrimental overfitting beyond simple selection bias on random data.
  • 💡 Definition of 'Global Constraint' Needed: The exact nature of the 'single global constraint' is critical for a full assessment of scientific validity and reproducibility. While the figure shows its effect, the paper needs to precisely define this constraint (e.g., how it was mathematically or procedurally implemented in the simulation) for the results to be fully interpretable and verifiable by others.
  • 💡 Consider Adding P-value for Slope to Figure: The text (page 465) states the p-value for the slope (-0.97) is 0. Including this p-value directly on the figure would immediately convey the statistical significance of this strong negative coefficient, complementing the high adjR2.
  • 💡 Assumption of Consistent Simulation Parameters: The underlying simulation parameters (e.g., N strategies from which selection occurred, T observations per strategy, σ for the trendless process) are assumed from the general context of the paper. Explicitly stating if these differ from previous figures would be beneficial, though consistency is implied.
Communication
  • ✅ Clear Axis Labels and Plot Title: The axis labels 'SR a priori' (x-axis) and 'SR a posteriori' (y-axis) are clear. The plot title 'OOS Perf. Degradation' effectively sets the context.
  • ✅ Direct Inclusion of Regression Statistics: The inclusion of the regression equation '[SR a posteriori]=-0.03+-0.97*[SR a priori]+err | adjR2=0.85' directly on the plot is excellent, providing immediate quantitative insight into the strong negative relationship and the model's fit.
  • ✅ Vivid Visualization of Negative Trend: The strong downward trend of the scatter points and the fitted regression line vividly communicates the negative relationship between in-sample and out-of-sample performance under these specific conditions.
  • ✅ Informative Detailed Caption: The detailed caption below the figure clearly articulates the main finding: the introduction of a global constraint leads to negative OOS performance, and a strong negative linear relationship emerges.
  • 💡 Consistent Terminology for Axis Labels: For consistency and immediate understanding, consider using 'SR IS' for the x-axis label and 'SR OOS' for the y-axis label directly on the plot, aligning with the terminology frequently used in the main text, rather than 'SR a priori' and 'SR a posteriori'.
Figure 7. Performance degradation as a result of strategy selection under...
Full Caption

Figure 7. Performance degradation as a result of strategy selection under compensation effects (first-order serial correlation).

Figure/Table Image (Page 9)
First Reference in Text
Figure 7 illustrates that a serially correlated performance introduces another form of compensation effects, just as we saw in the case of a global constraint.
Description
  • Axes, Variables, and Serial Correlation Context: The figure is a scatter plot that illustrates the relationship between in-sample and out-of-sample performance of investment strategies when the performance series itself exhibits 'serial correlation'. Serial correlation means that the performance at one point in time is related to its performance at previous points, like having momentum or a tendency to revert. The x-axis, 'SR a priori', represents the In-Sample Sharpe Ratio (SR IS), a measure of risk-adjusted return on the data used for strategy selection, ranging from approximately 0.5 to 1.1. The y-axis, 'SR a posteriori', shows the Out-of-Sample Sharpe Ratio (SR OOS) on new data, with values ranging from about -1.4 to 0.2.
  • Autoregressive Process and Compensation Effect: This serial correlation is introduced by modeling the strategy performance as a 'first-order autoregressive process' with a coefficient φ = 0.995 (mentioned in the accompanying text). This means each performance data point is strongly influenced by the immediately preceding data point, creating a 'memory' in the performance series. This setup is used to demonstrate another type of 'compensation effect,' where characteristics of the data generation process can distort the relationship between past and future performance after selection.
  • Distribution of Points and Negative Trend: The scatter points, each representing a selected strategy, generally trend downwards from left to right, indicating a negative relationship: strategies that had higher in-sample Sharpe Ratios tended to have lower (often negative) out-of-sample Sharpe Ratios. The spread of points is noticeable, suggesting the relationship is not perfectly deterministic.
  • Regression Analysis Results: A linear regression line is fitted to these points, with the equation '[SR a posteriori]=-0.04+-0.85*[SR a priori]+err | adjR2=0.08' displayed on the plot. The coefficient for 'SR a priori' is -0.85, implying that, on average, a one-unit increase in the in-sample Sharpe Ratio is associated with a 0.85-unit decrease in the out-of-sample Sharpe Ratio. The 'adjR2' (adjusted R-squared) value of 0.08 means that only 8% of the variation in out-of-sample performance is explained by in-sample performance in this model. The trend is statistically significant (the text reports a p-value of 0 for the slope) but explains far less of the OOS variance than the global constraint scenario in Figure 6 (adjR2=0.85).
  • Core Implication: Serial Correlation as a Compensation Effect: The figure demonstrates that even without a strict 'global constraint' like in Figure 6, inherent properties of the performance data, such as strong serial correlation, can act as a compensation effect. When strategies are selected based on high in-sample performance from such serially correlated series, their out-of-sample performance tends to be negatively related to their in-sample success. The implication is that what appears to be a good strategy in-sample might systematically underperform out-of-sample due to these data dynamics.
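This mechanism can be sketched as follows, under one reading of the setup (an assumption on our part: the AR(1) with φ = 0.995 drives the cumulative performance level, so in-sample gains slowly mean-revert out of sample; N and T are also assumed):

```python
import numpy as np

rng = np.random.default_rng(2)
N, T, phi = 1000, 1000, 0.995   # phi from the text; N and T assumed
half = T // 2

# Assumed construction: the cumulative performance level follows an AR(1),
# x_t = phi * x_{t-1} + eps_t, and per-period returns are its increments,
# so a run-up in the IS half tends to be reverted in the OOS half.
eps = rng.normal(0.0, 1.0, size=(N, T))
x = np.zeros((N, T + 1))
for t in range(1, T + 1):
    x[:, t] = phi * x[:, t - 1] + eps[:, t - 1]
r = np.diff(x, axis=1)

def sharpe(a):
    return a.mean(axis=1) / a.std(axis=1, ddof=1)

sr_is = sharpe(r[:, :half])
sr_oos = sharpe(r[:, half:])

slope, intercept = np.polyfit(sr_is, sr_oos, 1)
print(f"slope: {slope:.2f}")
```

The fitted slope is strongly negative, in the neighborhood of the figure's -0.85; the figure's much lower adjR² plausibly reflects that its regression runs only over strategies already selected for high SR IS, which restricts the in-sample range.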
Scientific Validity
  • ✅ Appropriate Visualization Method: The scatter plot with a regression line is a suitable method for visualizing the relationship between SR IS and SR OOS under the influence of first-order serial correlation in the performance series.
  • ✅ Supports Claim on Serial Correlation: The figure supports the claim that serial correlation in performance data can act as a compensation effect, leading to a negative relationship between selected IS performance and subsequent OOS performance. The negative slope of the regression is consistent with this.
  • 💡 Parameter Specificity: The specific parameters of the autoregressive process (μ=0, σ=1, φ=0.995, as stated in the text) are crucial for interpreting this figure. The high value of φ indicates strong serial dependence. The validity of the conclusion is tied to this specific parameterization.
  • ✅ Accurately Reflects Weaker (but Present) Effect: The adjusted R-squared of 0.08, while indicating a statistically significant trend (p-value for slope is 0 according to the text), shows that serial correlation, in this specific setup, explains a much smaller portion of OOS performance variance compared to the global constraint in Figure 6 (adjR2=0.85). This correctly reflects it as a 'less restrictive' example of a compensation effect, as stated in the text.
  • 💡 Consider Adding P-value for Slope to Figure: The text (page 465) states the p-value for the slope (-0.85) is 0. Including this p-value directly on the figure would immediately convey the statistical significance of the negative coefficient, which is important given the relatively low adjR2.
  • ✅ Important Insight on Data Dynamics: This figure valuably demonstrates that compensation effects leading to overfitting are not limited to explicit external constraints but can also arise from the inherent time-series properties of the performance data itself, which is an important insight.
Communication
  • ✅ Clear Axis Labels and Plot Title: The x-axis ('SR a priori') and y-axis ('SR a posteriori') are clearly labeled. The plot title 'OOS Perf. Degradation' effectively communicates the figure's theme.
  • ✅ Direct Inclusion of Regression Statistics: The inclusion of the regression equation '[SR a posteriori]=-0.04+-0.85*[SR a priori]+err | adjR2=0.08' directly on the plot provides immediate quantitative details about the observed relationship and its strength.
  • ✅ Visual Representation of Trend: The scatter plot with the overlaid regression line makes the negative trend between in-sample and out-of-sample performance discernible, even if the relationship is weaker than in Figure 6.
  • ✅ Informative Detailed Caption: The detailed caption below the figure clearly explains the context (serially correlated performance via an autoregressive process with φ = 0.995) and the key takeaway that this introduces another form of compensation effect.
  • 💡 Consistent Terminology for Axis Labels: For improved consistency and immediate clarity, consider using 'SR IS' (In-Sample Sharpe Ratio) for the x-axis label and 'SR OOS' (Out-of-Sample Sharpe Ratio) for the y-axis label directly on the plot, aligning with terminology frequently used in the main text, instead of 'SR a priori' and 'SR a posteriori'.
  • ✅ Reflects Weaker Relationship Appropriately: The visual spread of the data points is wider compared to Figure 6, which accurately reflects the lower adjusted R-squared (0.08). The regression line helps guide the eye, but the scatter indicates more unexplained variance, which is an important aspect of this specific compensation effect.
Figure 8. Backtested performance of a seasonal strategy (Example 6).
Figure/Table Image (Page 10)
First Reference in Text
Figure 8 plots the random series, as well as the performance associated with the optimal parameter combination: Entry_day = 11, Holding_period = 4, Stop_loss = -1 and Side = 1.
Description
  • Axes, Lines, and Variables: The figure presents a line graph with two y-axes. The x-axis displays time, from March 2000 to September 2004. The left y-axis, labeled 'Performance', ranges from 0 to 50 and corresponds to the blue line representing the cumulative profit/loss of a 'seasonal strategy'. A seasonal strategy is one that attempts to capitalize on patterns that repeat over certain time intervals. The right y-axis, labeled 'Prices', ranges from -30 to 30 and corresponds to the green line, representing the price movement of an 'underlying' financial series. The text clarifies this underlying series is a random walk, meaning its price movements are inherently unpredictable.
  • Strategy Performance Trend: The blue line ('Strategy') shows a generally upward trend over the entire period, starting near 0 and ending around a performance value of 45-50. This indicates that the backtested seasonal strategy, when applied to the historical random price series, would have appeared profitable.
  • Underlying Price Series: The green line ('Underlying') depicts the fluctuations of the random price series itself. The strategy's performance (blue line) is derived from applying specific trading rules to this underlying series.
  • Key Performance Metrics (SR, PSR, Freq): Key performance metrics for the strategy are displayed at the top of the plot: 'SR=1.27 PSR=2.83 Freq=57.7'. 'SR' stands for Sharpe Ratio, a measure of risk-adjusted return; a value of 1.27 is typically considered good. 'PSR' refers to the Probabilistic Sharpe Ratio statistic; a value of 2.83, as explained in the text, implies a very low probability (less than 1%) that the true Sharpe Ratio of the strategy is zero or negative, suggesting the observed SR of 1.27 is statistically significant. 'Freq=57.7' is another characteristic of the optimized strategy; its precise meaning (e.g., number of trades, or another optimized parameter) isn't fully defined by the figure alone but is part of the strategy's identified optimal parameters.
  • Optimal Strategy Parameters: The reference text specifies that this strategy's performance is the result of finding an 'optimal parameter combination' for a monthly trading rule involving: 'Entry_day = 11' (day of the month to enter a trade), 'Holding_period = 4' (days to hold the trade), 'Stop_loss = -1' (a parameter for setting a loss limit), and 'Side = 1' (likely indicating a 'buy' or 'long' strategy). These parameters were selected from 8,800 possible combinations.
  • Core Implication: Overfitting on Random Data: The core message of this figure, in the context of Example 6, is to demonstrate that even with a purely random underlying price series (which has no real exploitable patterns), it's possible to find a set of trading rule parameters that produce what appears to be a highly profitable and statistically significant strategy in a backtest (a simulation on past data). This highlights the danger of 'overfitting' – finding apparent patterns by chance due to extensive searching, rather than discovering a genuinely effective strategy.
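The parameter search behind Example 6 can be imitated in miniature. The sketch below is a simplified illustration of our own: it ignores the stop-loss mechanics, uses idealized 22-trading-day months, and therefore scans only a subset of the paper's 8,800 combinations, yet it still turns up a seemingly attractive in-sample Sharpe ratio on pure noise.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)
n_months, days = 56, 22                                    # ~4.5 years of 22-day months
prices = np.cumsum(rng.normal(0.0, 1.0, n_months * days))  # random-walk price level

def trade_pnl(entry_day, holding, side):
    """P&L of entering on the entry_day-th day of each month and holding for
    `holding` days (exit capped at month end); side = +1 long, -1 short."""
    pnl = []
    for m in range(n_months):
        e = m * days + entry_day - 1
        x = min(e + holding, (m + 1) * days - 1)
        pnl.append(side * (prices[x] - prices[e]))
    return np.array(pnl)

def sharpe(x):
    s = x.std(ddof=1)
    return x.mean() / s if s > 0 else 0.0

# Grid search over entry day, holding period, and side (stop-loss omitted)
best_sr, best_params = max(
    (sharpe(trade_pnl(e, h, s)), (e, h, s))
    for e, h, s in product(range(1, 23), range(1, 21), (-1, 1))
)
print(best_sr, best_params)
```

Because the underlying series is a random walk, any positive Sharpe found this way is a pure artifact of searching; annualizing the winning per-trade Sharpe (×√12 for monthly trades) puts it on the same order as the figure's SR = 1.27.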
Scientific Validity
  • ✅ Appropriate Visualization Method: The figure appropriately uses a dual-axis line chart to compare the strategy's derived performance against the underlying (random) price series, which is a standard way to visualize backtest results.
  • ✅ Effectively Demonstrates Overfitting: The figure effectively demonstrates the central point of Example 6: that by searching through many parameter combinations (8,800 as stated in the text for Example 6), one can identify a 'seasonal strategy' with a high in-sample Sharpe Ratio (1.27) and a statistically significant Probabilistic Sharpe Ratio (PSR-Stat = 2.83) even when the underlying data is a random walk (i.e., contains no true seasonality or predictability). This strongly supports the paper's thesis on backtest overfitting.
  • ✅ Highlights Deceptiveness of Spurious Results: The reported PSR-Stat of 2.83 is crucial. As the text explains, this implies a <1% probability that the true Sharpe ratio is below zero. This makes the spurious finding particularly deceptive, as it would pass common statistical significance tests, highlighting the inadequacy of relying solely on such metrics without considering the selection process.
  • ✅ Illustrates Type I Error from Multiple Testing: The figure represents the outcome for the 'optimal parameter combination' found after testing many. It's a clear illustration of a Type I error (false positive) in the context of multiple hypothesis testing (the multiple trials being the different parameter combinations).
  • 💡 Clarification of 'Stop_loss = -1' Parameter: The term 'Stop_loss = -1' is given as an optimal parameter. The exact interpretation of a negative stop-loss value within the authors' framework should be clearly defined in the methods or Example 6 description to ensure full reproducibility and understanding of the strategy mechanics. For instance, does it mean no stop-loss, or a stop-loss defined relative to some baseline in the opposite direction of a typical stop?
  • 💡 Definition/Derivation of 'Freq=57.7': The 'Freq=57.7' parameter is displayed but its derivation or meaning in the context of the four optimized parameters (Entry_day, Holding_period, Stop_loss, Side) is not immediately clear from the figure or its direct reference text. While Example 6 might provide more detail, a brief clarification associated with the figure would be helpful.
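For context on the PSR statistic quoted in the figure, it can be computed from the first four moments of the returns. The sketch below is an illustrative implementation (the function names psr_stat and psr are ours) of the Probabilistic Sharpe Ratio of Bailey and López de Prado, whose z-statistic is (ŜR − SR*)·√(n−1) / √(1 − γ̂₃·ŜR + ((γ̂₄−1)/4)·ŜR²), with γ̂₃ the sample skewness and γ̂₄ the sample kurtosis.

```python
import numpy as np
from math import erf, sqrt

def psr_stat(returns, sr_benchmark=0.0):
    """z-statistic of the Probabilistic Sharpe Ratio: larger values mean
    stronger evidence that the true Sharpe ratio exceeds the benchmark."""
    r = np.asarray(returns, dtype=float)
    n = len(r)
    mu = r.mean()
    sr = mu / r.std(ddof=1)
    g3 = ((r - mu) ** 3).mean() / r.std(ddof=0) ** 3   # sample skewness
    g4 = ((r - mu) ** 4).mean() / r.std(ddof=0) ** 4   # sample kurtosis
    return (sr - sr_benchmark) * sqrt(n - 1) / sqrt(1 - g3 * sr + (g4 - 1) / 4 * sr**2)

def psr(returns, sr_benchmark=0.0):
    """Probability that the true Sharpe ratio exceeds the benchmark: Phi(z)."""
    z = psr_stat(returns, sr_benchmark)
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))
```

A z-statistic of 2.83, as annotated on the figure, corresponds to Φ(2.83) ≈ 0.998, i.e. under a 1% probability that the true Sharpe ratio is zero or negative, which is exactly why such a spurious backtest looks statistically convincing.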
Communication
  • ✅ Effective Use of Dual Y-Axes: The use of dual y-axes for 'Performance' (left) and 'Prices' (right) is appropriate for comparing the strategy's cumulative profit/loss against the underlying price series. The labels are clear.
  • ✅ Clear Legend: The legend clearly distinguishes between the 'Strategy' performance line (blue) and the 'Underlying' price series line (green).
  • ✅ Direct Annotation of Key Metrics: Displaying key metrics (SR=1.27, PSR=2.83, Freq=57.7) directly on the plot is highly effective for immediate understanding of the backtested strategy's apparent quality.
  • ✅ Clear X-axis Timeframe: The x-axis labels representing dates (Mar 2000 to Sep 2004) clearly define the timeframe of the backtest.
  • ✅ Informative Title: The title is concise and informative, clearly stating the figure's content as the backtested performance of a seasonal strategy from Example 6.
  • 💡 Clarity of 'Freq' Parameter: The term 'Freq=57.7' is annotated on the plot. While its specific meaning isn't immediately obvious from the figure alone, the main text (Example 6) explains that four parameters (Entry_day, Holding_period, Stop_loss, Side) were optimized. 'Freq' might be a derived characteristic or another optimized parameter not detailed in the brief reference text for the figure. Clarifying its exact nature in the figure's immediate context or ensuring it's clearly defined in Example 6 is important.
