White’s reality check – The Financial Hacker

Build Better Strategies, Part 6: Evaluation

jcl — Thu, 05 Feb 2026 16:54:41 +0000

Developing a successful strategy is a process with many steps, described in the Build Better Strategies article series. At some point you have coded a first, raw version of the strategy. At that stage you’re usually experimenting with different functions for market detection or trade signals. The problem: How can you determine which indicator, filter, or machine learning method works best with which markets and which time frames? Manually testing all combinations is very time-consuming, close to impossible. Here’s a way to automate that process with a single mouse click.

A robust trading strategy has to meet several criteria:

It must exploit a real and significant market inefficiency. Random-walk markets cannot be algo traded.
It must work in all market situations. A trend follower must survive a mean reverting regime.
It must work under many different optimization settings and parameter ranges.
It must be unaffected by random events and price fluctuations.

There are metrics and algorithms to test all this. The robustness under different market situations can be determined through the R2 coefficient or the deviations between the walk forward cycles. The parameter range robustness can be tested with a WFO profile (aka cluster analysis), the price fluctuation robustness with oversampling. A Montecarlo analysis finds out whether the strategy is based on a real market inefficiency.

Some platforms, such as Zorro, have functions for all this. But they require dedicated code in the strategy, often more than for the algorithm itself. In this article I’m going to describe an evaluation framework – a ‘shell’ – that skips the coding part. The shell is included in the latest Zorro version. It can be simply attached to any strategy script. It makes all strategy variables accessible in a panel and adds stuff that’s common to all strategies – optimization, money management, support for multiple assets and algos, cluster and montecarlo analysis. It evaluates all strategy variants in an automated process and builds the optimal portfolio of combinations from different algorithms, assets, and timeframes.

The process involves these steps:

The first step of strategy evaluation is generating sets of parameter settings, named jobs. Any job is a variant of the strategy that you want to test and possibly include in the final portfolio. Parameters can be switches that select between different indicators, or variables (such as timeframes) with optimization ranges. All parameters can be edited in the user interface of the shell, then saved with a mouse click as a job.

The next step is an automated process that runs through all previously stored jobs, trains and tests any of them with different asset, algo, and time frame combinations, and stores their results in a summary. The summary is a CSV list with the performance metrics of all jobs. It is automatically sorted – the best performing job variants are at the top – and looks like this:

So you can see at a glance which parameter combinations work with which assets and time frames, and which are not worth examining further. You can repeat this step with different global settings, such as bar period or optimization method, and generate multiple summaries in this way.

The next step in the process is cluster analysis. Every job in a selected summary is optimized multiple times with different walk-forward settings. The result with any job variant is stored in WFO profiles or heatmaps:

After this process, you have likely ended up with a couple of survivors at the top of the summary. The surviving jobs all have a positive return, a steadily rising equity curve, shallow drawdowns, and robust parameter ranges, since they passed the cluster analysis. But any selection process generates selection bias. Your perfect portfolio will likely produce a great backtest, but will it perform equally well in live trading? To find out, you run a Montecarlo analysis, aka ‘Reality Check’.

This is the most important test of all, since it can determine whether your strategy exploits a real market inefficiency. If the Montecarlo analysis fails with the final portfolio, it will likely also fail with any other parameter combination, so you need to run it only close to the end. If your system passes Montecarlo with a p-value below 5%, you can be relatively confident that the system will return good and steady profit in live trading. Otherwise, back to the drawing board.

The evaluation shell is included in Zorro version 3.01 or above. Usage and details are described under https://zorro-project.com/manual/en/shell.htm. Attaching the shell to a strategy is described under https://zorro-project.com/manual/en/shell2.htm.

Why 90% of Backtests Fail

jcl — Mon, 04 Apr 2022 15:55:47 +0000

About 9 out of 10 backtests produce wrong or misleading results. This is the number one reason why carefully developed algorithmic trading systems often fail in live trading. Even with out-of-sample data and even with cross-validation or walk-forward analysis, backtest results are often way off to the optimistic side. The majority of trading systems with a positive backtest are in fact unprofitable. In this article I’ll discuss the cause of this phenomenon, and how to fix it.

Suppose you’re developing an algorithmic trading strategy, following all rules of proper system development. But you are not aware that your trading algorithm has no statistical edge. The strategy is worthless, the trading rules equivalent to random trading, the profit expectancy – aside from transaction costs – is zero. The problem: you will rarely get a zero result in a backtest. A random trading strategy will in 50% of cases produce a negative backtest result, in 50% a positive result. But if the result is negative, you’re normally tempted to tweak the code or select assets and time frames until you finally get a profitable backtest. That will happen relatively soon, even when applying only random modifications to the system. That’s why there are so many unprofitable strategies around that nevertheless show great backtest performance.

Does this mean that backtests are worthless? Not at all. But it is essential to know whether you can trust the test, or not.

The test-the-backtest experiment

There are several methods for verifying a backtest. None of them is perfect, but all give insights from different viewpoints. We’ll use the Zorro algo trading software, and run our experiments with the following test system that is optimized and backtested with walk-forward analysis:

function run()
{
  set(PARAMETERS,TESTNOW,PLOTNOW,LOGFILE);
  BarPeriod = 1440;
  LookBack = 100;
  StartDate = 2012;
  NumWFOCycles = 10;

  assetList("AssetsIB");
  asset("SPY");

  vars Signals = series(LowPass(seriesC(),optimize(10,2,20,2)));
  vars MMIFast = series(MMI(seriesC(),optimize(50,40,60,5)));
  vars MMISlow = series(LowPass(MMIFast,100));

  MaxLong = 1;
  if(falling(MMISlow)) {
    if(valley(Signals))
      enterLong();
    else if(peak(Signals))
      exitLong();
  }
}

This is a classic trend following algorithm. It uses a lowpass filter for trading at the peaks and valleys of the smoothed price curve, and an MMI filter (Market Meanness Index) for distinguishing trending from non-trending market periods. It only trades when the market has switched to trend regime, which is essential for profitable trend following systems. It opens only long positions. Lowpass and MMI filter periods are optimized, and the backtest is a walk-forward analysis with 10 cycles.
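The entry and exit logic described above can be sketched outside of Zorro. The following Python fragment is only an illustration: it assumes that peak and valley look for a turning point at the previous bar and that falling compares the last two values, which is how the script reads but is not taken from the Zorro sources. The series are plain chronological lists, and the smoothing and MMI calculations themselves are left out.

```python
def falling(series):
    """True if the latest value is below the previous one."""
    return len(series) >= 2 and series[-1] < series[-2]

def valley(series):
    """True if the previous bar was a local minimum (assumed definition)."""
    if len(series) < 3:
        return False
    a, b, c = series[-3:]
    return a > b and b < c

def peak(series):
    """True if the previous bar was a local maximum (assumed definition)."""
    if len(series) < 3:
        return False
    a, b, c = series[-3:]
    return a < b and b > c

def signals(smoothed, regime):
    """Replay the script's if/else logic bar by bar.

    smoothed: lowpass-filtered prices, regime: smoothed MMI values.
    Returns a list of (bar index, action) tuples.
    """
    actions = []
    for i in range(len(smoothed)):
        if not falling(regime[:i + 1]):
            continue  # no trading unless the regime indicator is falling
        if valley(smoothed[:i + 1]):
            actions.append((i, "enter_long"))
        elif peak(smoothed[:i + 1]):
            actions.append((i, "exit_long"))
    return actions
```

With a V-shaped smoothed curve and a steadily falling regime series, the sketch enters one bar after the low and exits one bar after the following high.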

The placebo trading system

It is standard for experiments to compare the real stuff with a placebo. For this we’re using a trading system that has obviously no edge, but was tweaked with the evil intention to appear profitable in a walk-forward analysis. This is our placebo system:

void run()
{
  set(PARAMETERS,TESTNOW,PLOTNOW,LOGFILE);
  BarPeriod = 1440;
  StartDate = 2012;
  setf(TrainMode,BRUTE);
  NumWFOCycles = 9;

  assetList("AssetsIB");
  asset("SPY");

  int Pause = optimize(5,1,15,1);
  LifeTime = optimize(5,1,15,1);

// trade after a pause...
  static int NextEntry;
  if(Init) NextEntry = 0;
  if(NextEntry-- <= 0) {
    NextEntry = LifeTime+Pause;
    enterLong();
  }
}

This system opens a position, keeps it a while, then closes it and pauses for a while. The trade and pause durations are walk-forward optimized between 1 day and 3 weeks. LifeTime is a predefined variable that closes the position after the given time. If you don’t believe in lucky trade patterns, you can rightfully assume that this system is equivalent to random trading. Let’s see how it fares in comparison to the trend trading system.
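To see what the counter logic does, here is a small Python replay of the entry schedule. It is a sketch under one assumption: the run function is called once per bar and the post-decrement in `NextEntry-- <= 0` behaves as in C. The function name and arguments are made up for this illustration.

```python
def placebo_entries(num_bars, life_time, pause):
    """Return the bar numbers at which the placebo system would enter."""
    next_entry = 0
    entries = []
    for bar in range(num_bars):
        fire = next_entry <= 0   # comparison uses the value before decrement
        next_entry -= 1          # post-decrement, as in NextEntry--
        if fire:
            next_entry = life_time + pause
            entries.append(bar)
    return entries
```

In this replay, entries are spaced LifeTime + Pause + 1 bars apart, independent of any price data. That is the whole point of the placebo: the trade pattern is fixed by two numbers, and the optimizer can only pick the luckiest pair.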

Trend trading vs. placebo trading

This is the equity curve with the trend trading system from a walk forward analysis from 2012 up to 3/2022:

The plot begins in 2015 because the preceding 3 years are used for the training and lookback periods. SPY follows the S&P500 index and rises in the long term, so we could expect some profit anyway with a long-only system. But this system, with profit factor 3 and R2 coefficient 0.65, appears a lot better than random trading. Let’s compare it with the placebo system:

The placebo system produced profit factor 2 and R2 coefficient 0.77. Slightly less than the real system, but in the same performance range. And this result was also from a walk-forward analysis, although with 9 cycles – therefore the later start of the test period. Aside from that, it seems impossible to determine solely from the equity curve and performance data which system is for real, and which is a placebo.

Checking the reality

Methods to verify backtest results are named ‘reality check’. They are specific to the asset and algorithm; in a multi-asset, multi-algo portfolio, you need to enable only the component you want to test. Let’s first see how the WFO split affects the backtest. In this way we can find out whether our backtest result was just due to lucky trading in a particular WFO cycle. We’re going to plot a WFO profile that displays the effect of the number of walk-forward cycles on the result. For this we comment out the NumWFOCycles = … line in the code, and run it in training mode with the WFOProfile.c script:

#define run strategy
#include "trend.c" // <= your script
#undef run
#define CYCLES 20 // max WFO cycles

function run()
{
  set(TESTNOW);
  NumTotalCycles = CYCLES-1;
  NumWFOCycles = TotalCycle+1;
  strategy();
}

function evaluate()
{
  var Perf = ifelse(LossTotal > 0,WinTotal/LossTotal,10);
  if(Perf > 1)
    plotBar("WFO+",NumWFOCycles,NumWFOCycles,Perf,BARS,BLACK);
  else
    plotBar("WFO-",NumWFOCycles,NumWFOCycles,Perf,BARS,RED);
}

We’re redefining the run function to a different name. This allows us to just include the tested script and train it with WFO cycles from 2 up to the number defined by CYCLES. A backtest is executed after training. If an evaluate function is present, Zorro runs it automatically after any backtest. It plots a histogram bar of the profit factor (y axis) from each number of WFO cycles. First, the WFO profile of the trend trading system:
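The performance figure in the evaluate function is a capped profit factor. As a rough Python equivalent, assuming that WinTotal is the sum of all winning trade results and LossTotal the absolute sum of all losing ones:

```python
def profit_factor(trade_results, cap=10.0):
    """Gross wins divided by gross losses; capped when there are no losses."""
    win_total = sum(r for r in trade_results if r > 0)
    loss_total = -sum(r for r in trade_results if r < 0)
    if loss_total > 0:
        return win_total / loss_total
    return cap  # same fallback value as in the script
```

For example, `profit_factor([30, -10, 20, -15])` gives 2.0. The cap avoids a division by zero for a cycle without losing trades and keeps the histogram scale readable.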

We can see that the performance rises with the number of cycles. This is typical for a system that adapts to the market. All results are positive with a profit factor > 1. Our arbitrary choice of 10 cycles produced a less than average result. So we can at least be sure that this backtest result was not caused by a particularly lucky number of WFO cycles.

The WFO profile of the placebo system:

This time the number of WFO cycles had a strong random effect on the performance. And it is now obvious why I used 9 WFO cycles for that system. For the same reason I used brute force optimization, since it increases WFO variance and thus the chance to get lucky WFO cycle numbers. That’s the opposite of what we normally do when developing algorithmic trading strategies.

WFO profiles give insight into WFO cycle dependency, but not into randomness or overfitting by other means. For this, more in-depth tests are required. Zorro supports two methods, the Montecarlo Reality Check (MRC) with randomized price curves, and White’s Reality Check (WRC) with detrended and bootstrapped equity curves of strategy variants. Both methods have their advantages and disadvantages. But since strategy variants from optimizing can only be created without walk-forward analysis, we’re using the MRC here.

The Montecarlo Reality Check

First we test both systems with random price curves. Randomizing removes short-term price correlations and market inefficiencies, but keeps the long-term trend. Then we compare our original backtest result with the randomized results. This yields a p-value, a metric of the probability that our test result was caused by randomness. The lower the p-Value, the more confidence we can have in the backtest result. In statistics we normally consider a result significant when its p-Value is below 5%.
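Why shuffling keeps the long-term trend can be shown with a few lines of Python. This is a simplified sketch, not Zorro's SHUFFLE implementation: it assumes that the price curve is rebuilt from its bar-to-bar differences in random order.

```python
import random

def shuffle_prices(prices, rng):
    """Rebuild a price curve from its price changes in random order."""
    changes = [b - a for a, b in zip(prices, prices[1:])]
    rng.shuffle(changes)          # permutation, i.e. without replacement
    curve = [prices[0]]
    for change in changes:
        curve.append(curve[-1] + change)
    return curve

prices = [100, 102, 101, 105, 110, 108]
shuffled = shuffle_prices(prices, random.Random(12345))
# Start and end price are unchanged, only the path in between differs.
```

Since every price change is used exactly once, the sum of the changes and therefore the overall rise or fall of the curve stays the same. Any pattern in the order of the changes is destroyed.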

The basic algorithm of the Montecarlo Reality Check (MRC):

1. Train your system and run a backtest. Store the profit factor (or any other performance metric that you want to compare).
2. Randomize the price curve by randomly swapping price changes (shuffle without replacement).
3. Train your system again with the randomized data and run a backtest. Store the performance metric.
4. Repeat steps 2 and 3 1000 times.
5. Determine the number N of randomized tests that have a better result than the original test. The p-Value is N/1000.
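The steps above can be sketched in Python, independent of any trading platform. In this sketch, `backtest` is a placeholder for "train and test the system on this price curve and return the performance metric"; it and `toy_momentum` are invented for illustration and have nothing to do with the systems tested here.

```python
import random

def mrc_p_value(prices, backtest, runs=1000, seed=12345):
    """Fraction of shuffled price curves that beat the original result."""
    rng = random.Random(seed)                 # same random sequence each run
    original = backtest(prices)               # step 1
    changes = [b - a for a, b in zip(prices, prices[1:])]
    better = 0
    for _ in range(runs):                     # step 4
        rng.shuffle(changes)                  # step 2
        curve = [prices[0]]
        for change in changes:
            curve.append(curve[-1] + change)
        if backtest(curve) > original:        # step 3
            better += 1
    return better / runs                      # step 5

def toy_momentum(prices):
    """Toy metric: hold for one bar whenever the previous bar went up."""
    changes = [b - a for a, b in zip(prices, prices[1:])]
    return sum(c for prev, c in zip(changes, changes[1:]) if prev > 0)

# Artificial curve with long up and down runs, so momentum has a real "edge".
prices = [100]
for _ in range(3):
    for step in [1] * 20 + [-1] * 20:
        prices.append(prices[-1] + step)

p_value = mrc_p_value(prices, toy_momentum, runs=200)
```

On this artificial curve the shuffled variants lose the long runs that the toy metric exploits, so practically none of them beats the original and the p-value ends up near zero.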

If our backtest result was affected by an overall upwards trending price curve, which is certainly the case for this SPY system, the randomized tests will be likewise affected. The MRC code:

#define run strategy
#include "trend.c" // <= your script
#undef run
#define CYCLES 1000

function run()
{
  set(PRELOAD,TESTNOW);
  NumTotalCycles = CYCLES;
  if(TotalCycle == 1) // first cycle = original
    seed(12345); // always same random sequence
  else
    Detrend = SHUFFLE;
  strategy();
  set(LOGFILE|OFF); // don't export files
}

function evaluate()
{
  static var OriginalProfit, Probability;
  var PF = ifelse(LossTotal > 0,WinTotal/LossTotal,10);
  if(TotalCycle == 1) {
    OriginalProfit = PF;
    Probability = 0;
  } else {
    if(PF < 2*OriginalProfit) // clip image at double range
      plotHistogram("Random",PF,OriginalProfit/50,1,RED);
    if(PF > OriginalProfit)
      Probability += 100./NumTotalCycles;
  }
  if(TotalCycle == NumTotalCycles) { // last cycle
    plotHistogram("Original",
     OriginalProfit,OriginalProfit/50,sqrt(NumTotalCycles),BLACK);
    printf("\n-------------------------------------------");
    printf("\nP-Value %.1f%%",Probability);
    printf("\nResult is ");
    if(Probability <= 1)
      printf("highly significant") ;
    else if(Probability <= 5)
      printf("significant");
    else if(Probability <= 15)
      printf("maybe significant");
    else
      printf("statistically insignificant");
    printf("\n-------------------------------------------");
  }
}

This code sets up the Zorro platform to train and test the system 1000 times. The seed setting ensures that you get the same result on any MRC run. From the second cycle on, the historical data is shuffled without replacement. For calculating the p-value and plotting a histogram of the MRC, we use the evaluate function again. It calculates the p-value by counting the backtests resulting in higher profit factors than the original system. Depending on the system, training and testing the strategy a thousand times will take several minutes with Zorro. The resulting MRC histogram of the trend following system:

The height of a red bar represents the number of shuffled backtests that ended at the profit factor shown on the x axis. The black bar on the right (height is irrelevant, only the x axis position matters) is the profit factor with the original price curve. We can see that most shuffled tests came out positive, due to the long-term upwards trend of the SPY price. But our test system came out even more positive. The p-Value is below 1%, meaning a high significance of our backtest. This gives us some confidence that the simple trend follower can achieve a similar result in real trading.

The same cannot be said of the MRC histogram of the placebo system:

The backtest profit factors now extend over a wider range, and many were more profitable than the original system. The backtest with the real price curve is indistinguishable from the randomized tests, with a p-value in the 40% area. The original backtest result of the placebo system, even though achieved with walk-forward analysis, is therefore meaningless.

It should be mentioned that the MRC cannot detect all invalid backtests. A system that was explicitly fitted to a particular price curve, for instance by knowing in advance its peaks and valleys, would get a low p-value by the MRC. No reality check could distinguish such a system from a system with a real edge. Therefore, neither MRC nor WRC can give absolute guarantee that a system works when it passes the check. But when it does not pass, you’re advised to better not trade it with real money.

I have uploaded the strategies to the 2022 script repository. The MRC and WFOProfile scripts are included in Zorro version 2.47.4 and above. You will need Zorro S for the brute force optimization of the placebo system.

Boosting Strategies with MMI

jcl — Mon, 28 Sep 2015 15:49:35 +0000

We will now repeat our experiment with the 900 trend trading strategies, but this time with trades filtered by the Market Meanness Index. In our first experiment we found many profitable strategies, some even with high profit factors, but none of them passed White’s Reality Check. So they all would probably fail in real trading in spite of their great results in the backtest. This time we hope that the MMI improves most systems by filtering out trades in non-trending market situations.

900 systems experiment revisited

I have been informed by readers that I committed two mistakes, or at least inaccuracies, in the previous experiment. First, I didn’t detrend the price data. Second, I used the equity curves instead of balance curves for determining the profit factor. I didn’t detrend the prices because the systems traded long/short in a symmetric way, and I supposed that this would eliminate any trend bias. But even if this was true back then, it is now not true anymore: filtering trades by MMI or other means can introduce asymmetry. Also, calculating the profit factor from the balance curve makes indeed more sense because you’re interested in the end profit of the trades, not in their interim behavior. Therefore and for the sake of comparable results I will now and in the future use detrended trade returns and balance curves for such experiments.

The original test, repeated with the modifications, produced a wider profit factor distribution due to eliminating intermediate returns. But the outcome of the experiment was the same. The statistic (including trade costs) did not change much, however the profit factor distribution (without trade costs) did. This is the new WRC histogram of the original 900 systems (best system vs. bootstrap-randomized returns of all systems):

900 trend systems (no MMI)

Although the best system (black bar, a system using ALMA) is at the right side of the distribution, still 11% of random systems were better. The system does not pass the WRC at the required 95% confidence level. This turned out very different when filtering trades with the MMI.

The MMI experiment

This is our script TrendMMI.c for the new experiment:

// helper function: remove systems that exceed the 4 months lookback period
int checkLookBack(int Period) 
{
  if(Period >= LookBack/TimeFrame) {
    StepNext = 0;	// abort optimization
    return LookBack/TimeFrame; // reduce the period
  } else
    return Period;
}

// calculate profit factor and remove systems with not enough trades 
var objective() 
{ 
  if(NumWinTotal < 30 || NumLossTotal < 30) { 
    StepNext = 0;     // abort optimization
    return 0;         // don't store this system
  } else
    return WinTotal/LossTotal; // Profit factor
}

var filter(var* Data,int Period);

void run()
{
  set(PARAMETERS|LOGFILE);
  Curves = "DailyBalance.bin";
  StartDate = 2010;
  BarPeriod = 15;
  LookBack = 80*4*24; // ~ 4 months
  Detrend = TRADES;   // detrend trade results
  while(asset(loop("EUR/USD","SPX500","XAG/USD")))
  while(algo(loop("MM15","MH1","MH4")))
  {
    TimeFrame = 1;
    if(Algo == "MH1")
      TimeFrame = 1*4;
    else if(Algo == "MH4")
      TimeFrame = 4*4;	

// no trade costs
    Spread = Commission = RollLong = RollShort = Slippage = 0;

    int Periods[10] = { 10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10000 };
    int Period = Periods[round(optimize(1,1,10,1),1)-1];
		
    var *Price = series(price());
    var *Smoothed = series(filter(Price,Period));

    bool DoTrade = true;
    int MMIPeriod = optimize(0,200,500,100);
    if(MMIPeriod) {
      MMIPeriod = checkLookBack(MMIPeriod);
      var *MMI_Raw = series(MMI(Price,MMIPeriod));
      var *MMI_Smooth = series(LowPass(MMI_Raw,MMIPeriod));
      DoTrade = falling(MMI_Smooth);
    }

    if(DoTrade) {
      if(valley(Smoothed))
        enterLong();
      else if(peak(Smoothed))
        enterShort();
    }
  } 
}

The 10 trend trading scripts with the 10 different indicators remain unchanged, aside from now including TrendMMI.c instead of Trend.c. Trading is now dependent on a boolean variable DoTrade. The length of the MMI range is varied between 200, 300, 400, and 500 bars. Like most parameters in a strategy, the MMI range is a compromise: It should be no less than 200 bars for getting any accuracy, but it should not be so long that different market regimes fall in the same MMI range. At the default range of 0, no MMI is applied and trading is not filtered. This way we’re including all the previous systems in the test. This is required for properly detecting Data Mining Bias, which must consider all systems that were discarded based on their result.

We’re running the MMI return value through a lowpass filter that uses the same period as the MMI range. This gives us a smooth MMI value that does not jump around. This value is now used for trade filtering: trades are opened and closed only when the smoothed MMI is falling, meaning that the market has entered trending mode within the last 200 to 500 bars. The MMI is only applied to one of the systems resulting from the prior period variation (the optimize function automatically selects the parameter of the “most robust” system before optimizing the next parameter). So now we’re in fact testing not 900, but 1260 systems: 900 without MMI, and 90 each with MMI ranges of 200, 300, 400, and 500 bars. The systems with not enough trades or a too-long lookback period are again removed from the pool, so the real number of tested systems is about 1100.
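The system count can be verified with a few lines of Python. The breakdown of the 900 systems into 10 indicators, 3 assets, 3 time frames, and 10 filter periods is read off the script above and is an assumption of this sketch.

```python
indicators, assets, time_frames, periods = 10, 3, 3, 10
mmi_ranges = [200, 300, 400, 500]

# Every indicator/asset/time frame combination is tested with all periods.
systems_without_mmi = indicators * assets * time_frames * periods

# The MMI is applied only to the selected period of each combination.
combinations = indicators * assets * time_frames
systems_with_mmi = combinations * len(mmi_ranges)

total_systems = systems_without_mmi + systems_with_mmi
```

This gives 900 systems without MMI, 360 with MMI, and 1260 in total.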

Depending on the speed of your PC, Zorro will need about 1 hour to test all systems. At the end of every system test, Zorro produces the parameter histograms. We now have two parameters. The histogram of the first one, the price smoothing filter period, looks as before because MMI was switched off during optimization. The second histogram displays the MMI range in combination with the best value from the first histogram. “Best” is here not the highest bar from the previous histogram, but the value that Zorro deems the most robust and least sensitive to market changes. A typical MMI histogram looks like this:

The first bar, marked “100”, is the best system without MMI. We can see that it is unprofitable: The profit factor (left scale) is only about 0.8. Using the MMI with a range of 200 and 300 makes the system in fact worse, and reduces the profit factor to 0.7. However the last two MMI ranges, 400 and 500, shift the system into the profit zone. This was just a random example, but how does the MMI affect all the other systems? Here are the statistics from the MMI experiment:

Asset, Period, Indicator	Success Rate	Winning	Losing
EUR/USD	46% (+8%)	154	185
S&P 500	4% (+3%)	15	318
Silver	27% (+7%)	87	240
15 Minutes	18% (+7%)	71	322
1 Hour	27% (+9%)	92	251
4 Hours	35% (+2%)	93	170
ALMA	22% (+6%)	22	79
Decycle	21% (+8%)	23	89
EMA	23% (+5%)	24	79
HMA	34% (+9%)	33	66
Laguerre	33% (+3%)	20	38
LinearReg	29% (+6%)	31	77
Lowpass	24% (+5%)	26	82
SMA	26% (+5%)	27	76
Smooth	26% (+7%)	23	67
ZMA	22% (+8%)	27	90

The Success Rate column shows the percentage of successful systems, and in parentheses the difference to the percentage without MMI. We can see that the MMI increased the number of successful systems in all markets, time frames, and indicators. However the numbers are not really representative: the MMI only affected a quarter of the tested systems, but the upper quarter, so some increase in the number of profitable systems was to be expected anyway. A more meaningful measure is the WRC. We’re using the same Bootstrap.c script as in the previous experiment; we only need to increase the CURVES number to 1260. This is the WRC histogram of systems with MMI (again, best system vs. bootstrapped returns of all systems):

900 trend systems (with MMI)

The MMI filter now shifted the best system (black) far to the right side of the histogram. It got a p-value of 0.02, meaning that it is better than 98% of the best randomized systems, and thus passes at the 95% confidence level. Using the MMI for filtering trades, the method of trading on curve peaks and valleys passed White’s Reality Check. In fact two of the 1260 systems passed at that level.

The best systems of the experiment had some things in common: They traded with silver and used either the ALMA or the lowpass filter. This is a surprising result, because neither silver nor ALMA and lowpass had the highest number of profitable systems. From the above table, one would assume that EUR/USD and the HMA or Laguerre filter are the most promising. They indeed produced many apparently good systems with profit factors above 2 (without trade costs), but none of them passed the WRC.

Conclusion

The MMI improved trend following systems by 5% to 10% on average with all tested markets, time frames, and indicators. The best systems were improved by more than 50%.
Trend following systems using the MMI can pass White’s Reality Check.
From the 10 tested smoothing indicators, ALMA produced the best results, although within a relatively small parameter range.
To do: Test more trend filters, for instance the Hurst Exponent or Ehlers’ Trend/Cycle decomposition.
To do: Create a real trading system by combining the best trend systems and adding the usual system components such as stop loss, trailing algorithm, profit lock, money management, and so on.

I’ve added the scripts to the 2015 scripts collection.

White’s Reality Check

jcl — Mon, 14 Sep 2015 08:48:17 +0000

This is the third part of the Trend Experiment article series. We now want to evaluate if the positive results from the 900 tested trend following strategies are for real, or just caused by Data Mining Bias. But what is Data Mining Bias, after all? And what is this ominous White’s Reality Check?

Suppose you want to trade by moon phases. But you’re not sure if you shall buy at full moon and sell at new moon, or the other way around. So you do a series of moon phase backtests and find out that the best system, which opens positions at any first quarter moon, achieves 30% annual return. Is this finally the proof that astrology works?

A trade system based on a nonexisting effect has normally a profit expectancy of zero (minus trade costs). But you won’t get zero when you backtest variants of such a system. Due to statistical fluctuations, some of them will produce a positive and some a negative return. When you now pick the best performer, such as the first quarter moon trading system, you might get a high return and an impressive equity curve in the backtest. Sadly, its test result is not necessarily caused by clever trading. It might be just by clever selecting the random best performer from a pool of useless systems.

For finding out if the 30% return by quarter moon trading is for real or just the fool’s gold of Data Mining Bias, Halbert White (1950-2012) invented a test method in 2000. White’s Reality Check (aka Bootstrap Reality Check) is explained in detail in the book ‘Evidence-Based Technical Analysis’ by David Aronson. It works this way:

1. Develop a strategy. During the development process, keep a record of all strategy variants that were tested and discarded because of their test results, including all abandoned algorithms, ideas, methods, and parameters. It does not matter if they were discarded by human decision or by a computer search or optimizing process.
2. Produce balance curves of all strategy variants, using detrended trade results and no trade costs. Note down the profit P of the best strategy.
3. Detrend all balance curves by subtracting the mean return per bar (not to be confused with detrending the trade results!). This way you get a series of curves with the same characteristics as the tested systems, but zero profit.
4. Randomize all curves by bootstrap with replacement. This produces new curves from the random returns of the old curves. Because the same bars can be selected multiple times, most new curves now produce losses or profits different from zero.
5. Select the best performer from the randomized curves, and note down its profit.
6. Repeat steps 4 and 5 a few thousand times.
7. You now have a list of several thousand best profits. The median M of that list is the Data Mining Bias of your strategy development process.
8. Check where the original best profit P appears in the list. The percentage of best bootstrap profits greater than P is the so-called p-Value of the best strategy. You want the p-Value to be as low as possible. If P is better than 95% of the best bootstrap profits, the best strategy has a real edge with 95% probability.
9. P minus M minus trade costs is the result to be expected when trading the best strategy for real.
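The core of the procedure, steps 2 to 8, can be sketched in Python. This is a bare-bones illustration and not the Bootstrap.c script: it works on per-bar returns instead of balance curves, expects all curves to have the same length, and draws the same random bars for all curves in a run, which is a design choice of this sketch. All names are made up.

```python
import random
from statistics import median

def whites_reality_check(curves, runs=2000, seed=1):
    """curves: per-bar return lists, one per strategy variant.

    Returns (best profit P, data mining bias M, p-value).
    """
    rng = random.Random(seed)
    n = len(curves[0])
    profits = [sum(c) for c in curves]
    best = max(profits)                                   # step 2
    # Step 3: subtract the mean return per bar, so every curve has zero profit.
    detrended = [[r - p / n for r in c] for c, p in zip(curves, profits)]
    best_random = []
    for _ in range(runs):                                 # step 6
        bars = [rng.randrange(n) for _ in range(n)]       # step 4
        best_random.append(
            max(sum(c[i] for i in bars) for c in detrended))  # step 5
    bias = median(best_random)                            # step 7
    p_value = sum(b > best for b in best_random) / runs   # step 8
    return best, bias, p_value

# Two mirror-image variants with zero profit: pure noise, no edge.
noise = [0.5 if i % 2 else -0.5 for i in range(100)]
mirrored = [-r for r in noise]
_, _, p_noise = whites_reality_check([noise, mirrored], runs=500)

# Adding a variant that gains about 1 per bar with little variation.
edge = [1.0 + (0.1 if i % 2 else -0.1) for i in range(100)]
best, bias, p_edge = whites_reality_check([noise, mirrored, edge], runs=500)
```

In the pure noise pool the best original profit is zero, and almost every bootstrap run produces a better "best" system, so the p-value is high. With the third variant in the pool, no resampled zero-profit curve can reach its profit, and the p-value drops to zero.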

The method is not really intuitive, but mathematically sound. However, it suffers from a couple of problems that make WRC difficult to use in real strategy development:

You can see the worst problem already in step 1. During strategy development you’re permanently testing ideas, adding or removing parameters, or checking out different assets and time frames. Putting aside all discarded variants and producing balance curves of all combinations of them is a cumbersome process. It gets even more difficult with machine learning algorithms that optimize weight factors and usually do not produce discarded variants. However, the good news is that you can easily apply the WRC when your strategy variants are produced by a transparent mechanical process with no human decisions involved. That’s fortunately the case for our trend experiment.
WRC tends to type II errors. That means it can reject strategies although they have an edge. When more irrelevant variants – systems with random trading and zero profit expectancy – are added to the pool, more positive results can be produced in steps 4 and 5, which reduces the probability that your selected strategy survives the test. WRC can determine rather well that a system is profitable, but cannot determine equally well that it is worthless.
It gets worse when variants have a negative expectancy. WRC can then over-estimate Data Mining Bias (see paper 2 at the end of the article). This could theoretically also happen with our trend systems, as some variants may suffer from a phase reversal due to the delay by the smoothing indicators, and thus in fact trade against the trend instead of with it.

The Experiment

First you need to collect daily return curves from all tested strategies. This requires adding a few lines to the Trend.c script from the previous article:

// some global variables
int Period;
var Daily[3000];
...
// in the run function, set all trading costs to zero
 Spread = Commission = RollLong = RollShort = Slippage = 0;
...
// store daily results in an equity curve
  Daily[Day] = Equity;
}
...
// in the objective function, save the curves in a file for later evaluation
string FileName = "Log\\TrendDaily.bin";
string Name = strf("%s_%s_%s_%i",Script,Asset,Algo,Period);
int Size = Day*sizeof(var); 
file_append(FileName,Name,strlen(Name)+1);
file_append(FileName,&Size,sizeof(int));
file_append(FileName,Daily,Size);

The second part of the above code stores the equity at the end of every day in the Daily array. The third part stores a string with the name of the strategy, the size of the curve in bytes, and the equity values themselves in a file named TrendDaily.bin in the Log folder. After running the 10 trend scripts, all 900 resulting curves are collected in that file.

The next part of our experiment is the Bootstrap.c script that applies White’s Reality Check. I’ll write it in two parts. The first part just reads the 900 curves from the TrendDaily.bin file, stores them for later evaluation, finds the best one and displays a histogram of the profit factors. Once we have that, we have already done 80% of the work for the Reality Check. This is the code:

void _plotHistogram(string Name,var Value,var Step,int Color)
{
  var Bucket = floor(Value/Step);
  plotBar(Name,Bucket,Step*Bucket,1,SUM+BARS+LBL2,Color);
}

typedef struct curve
{
  string Name;
  int Length;
  var *Values;
} curve;

curve Curve[900];
var Daily[3000];

void main()
{
  byte *Content = file_content("Log\\TrendDaily.bin");
  int i,j,N = 0;
  int MaxN = 0;
  var MaxPerf = 0.0;
	
  while(N<900 && *Content)
  {
// extract the next curve from the file
    string Name = Content;
    Content += strlen(Name)+1;
    int Size = *((int*)Content);
    int Length = Size/sizeof(var); // number of values
    Content += 4;
    var *Values = Content;
    Content += Size;

// store and plot the curve		
    Curve[N].Name = Name;
    Curve[N].Length = Length;
    Curve[N].Values = Values;
    var Performance = 1.0/ProfitFactor(Values,Length);
    printf("\n%s: %.2f",Name,Performance);
    _plotHistogram("Profit",Performance,0.005,RED);

// find the best curve		
    if(MaxPerf < Performance) {
      MaxN = N; MaxPerf = Performance;
    }
    N++;
  }
  printf("\n\nBenchmark: %s, %.2f",Curve[MaxN].Name,MaxPerf); 
}

Most of the code is just for reading and storing all the equity curves. The indicator ProfitFactor calculates the profit factor of the curve, the sum of all daily wins divided by the sum of all daily losses. However, here we need to consider the array order. Like many platforms, Zorro stores time series in reverse chronological order, with the most recent data at the beginning. But we stored the daily equity curve in straight chronological order. So the losses are actually wins and the wins actually losses, which is why we need to invert the profit factor. The curve with the best profit factor will be our benchmark for the test.
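
Why the wrong array order produces exactly the reciprocal can be checked with a small stand-alone C sketch. The function profit_factor below is a hypothetical stand-in that treats element 0 as the most recent value, as described above. It is not Zorro's ProfitFactor implementation, and returning 0 for a curve without losing days is a simplification made for the example.

```c
#include <assert.h>

/* Profit factor of a curve stored newest-first: index 0 is the most
   recent value, so the change on a day is data[i] - data[i+1].
   Hypothetical stand-in, not the Zorro indicator. */
double profit_factor(const double *data, int length)
{
    assert(length >= 2);
    double wins = 0.0, losses = 0.0;
    for (int i = 0; i < length - 1; i++) {
        double change = data[i] - data[i + 1];
        if (change > 0.0)
            wins += change;     /* sum of all daily wins */
        else
            losses -= change;   /* sum of all daily losses, as a positive number */
    }
    /* simplification: return 0 when there is no losing day */
    return losses > 0.0 ? wins / losses : 0.0;
}
```

Take the equity values 100, 110, 105, 120 in chronological order. The real daily changes are +10, -5, +15, so the real profit factor is 25/5 = 5. Fed to the function in this order, every change flips its sign; the function sees wins of 5 and losses of 25 and returns 0.2, and 1/0.2 gives the real profit factor of 5.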

This is the resulting histogram, the profit factors of all 900 (or rather, 705 due to the trade number minimum) equity curves:

Profit factor distribution (without trade costs)

Note that the profit factors are slightly different from the parameter charts of the previous article because they were now calculated from daily returns, not from trade results. We removed trading costs, so the histogram is centered at a profit factor of 1.0, aka zero profit. Only a few systems achieved a profit factor in the 1.2 range, the two best made about 1.3. Now we’ll see what White has to say about that. This is the rest of the main function in Bootstrap.c that finally applies his Reality Check:

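plotBar("Benchmark",MaxPerf/0.005,MaxPerf,50,BARS+LBL2,BLUE);
printf("\nBootstrap - please wait");
int Worse = 0, Better = 0;
for(i=0; i<1000; i++) {
  var MaxBootstrapPerf = 0;
  for(j=0; j<N; j++) {
// randomize the detrended daily returns of curve j
    randomize(BOOTSTRAP|DETREND,Daily,Curve[j].Values,Curve[j].Length);
    var Performance = 1.0/ProfitFactor(Daily,Curve[j].Length);
    MaxBootstrapPerf = max(MaxBootstrapPerf,Performance);
  }
// compare the benchmark with the best random curve of this sample
  if(MaxPerf > MaxBootstrapPerf)
    Better++;
  else
    Worse++;
  _plotHistogram("Profit",MaxBootstrapPerf,0.005,RED);
  progress(100*i/1000,0);
}
printf("\nBenchmark beats %.0f%% of samples!",
  (var)Better*100./(Better+Worse));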

This code needs about 3 minutes to run; we’re sampling the 705 curves 1000 times. The randomize function shuffles the daily returns by bootstrap with replacement; the DETREND flag tells it to subtract the mean return from all returns beforehand. The script counts how often the benchmark is better or worse than the best random curve of a sample, for printing the percentage at the end. The progress function moves the progress bar while Zorro grinds through the 705,000 curves. And this is the result:
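
For readers who want to see what such a resampling step does, here is a minimal sketch in plain C. It illustrates bootstrap with replacement on detrended daily changes under my own assumptions; it is not the code of Zorro's randomize function, and the function name and the use of rand() are made up for the example.

```c
#include <assert.h>
#include <stdlib.h>

/* Build a random curve 'out' from an equity curve 'in' with 'length'
   values: take the daily changes, subtract their mean (detrend), then
   draw changes at random with replacement and accumulate them.
   Hypothetical sketch, not Zorro's randomize(). */
void bootstrap_detrended(const double *in, double *out, int length)
{
    assert(length >= 2);
    int n = length - 1;                          /* number of daily changes */
    double mean = (in[length - 1] - in[0]) / n;  /* mean daily change */
    out[0] = in[0];                              /* same starting equity */
    for (int i = 1; i < length; i++) {
        /* pick any day, repeats allowed; rand() % n is slightly
           biased, which is acceptable for a sketch */
        int k = rand() % n;
        double change = in[k + 1] - in[k] - mean;
        out[i] = out[i - 1] + change;
    }
}
```

A curve that rises by the same amount every day has only identical changes, so after detrending all changes are zero and every bootstrap sample is a flat line. That is the purpose of detrending: the random curves have zero profit expectancy, so any profit they show is pure chance.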

Bootstrap results (red), benchmark system (black)

Hmm. We can see that the best system – the black bar – is at the right side of the histogram, indicating that it might be significant. But only with about 80% probability (the script gives a slightly different result every time due to the randomizing). 20% of the random curves achieve better profit factors than the best system from the experiment. The median of the randomized samples is about 1.26. Only the two best systems from the original distribution (first image) have profit factors above 1.26 – all the rest are at or below the bootstrap median.

So we have to conclude that this simple way of trend trading does not really work. Interestingly, one of those 900 tested systems is a system that I have used myself since 2012, although with additional filters and conditions. This system has produced good live returns and a positive result by the end of every year so far. And there’s still the fact that EUR/USD and silver in all variants produced better statistics than S&P500. This hints that some trend effect exists in their price curves, but the profit factors of the simple algorithms are not high enough to pass White’s Reality Check. We need a better approach for trend exploitation, for instance a filter that detects whether a trend is present or not. This will be the topic of the next article of this series. We will see that a filter can have a surprising effect on reality checks. Since we now have the Bootstrap script for applying White’s Reality Check, we can quickly do further experiments.

The Bootstrap.c script has been added to the 2015 script collection downloadable on the sidebar.

Conclusion

None of the 10 tested low-lag indicators, and none of the 3 tested markets shows significant positive expectancy with trend trading.
There is evidence of a trend effect in currencies and commodities, but it is too weak or too infrequent to be effectively exploited with simple trade signals from a filtered price curve.
We now have a useful code framework for comparing indicators and assets, and for further experiments with trade strategies.

Papers

Original paper by Dr. H. White: White2000
WRC modification by P. Hansen: Hansen2005
Stepwise WRC Testing by J. Romano, M. Wolf: RomanoWolf2005
Technical Analysis examined with WRC, by P. Hsu, C. Kuan: HsuKuan2006
WRC and its Extensions by V. Corradi, N. Swanson: CorradiSwanson2011