<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Walk forward analysis &#8211; The Financial Hacker</title>
	<atom:link href="https://financial-hacker.com/tag/walk-forward-analysis/feed/" rel="self" type="application/rss+xml" />
	<link>https://financial-hacker.com</link>
	<description>A new view on algorithmic trading</description>
	<lastBuildDate>Thu, 05 Feb 2026 17:08:53 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	

<image>
	<url>https://financial-hacker.com/wp-content/uploads/2017/07/cropped-mask-32x32.jpg</url>
	<title>Walk forward analysis &#8211; The Financial Hacker</title>
	<link>https://financial-hacker.com</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Build Better Strategies, Part 6: Evaluation</title>
		<link>https://financial-hacker.com/build-better-strategies-part-6-evaluation/</link>
					<comments>https://financial-hacker.com/build-better-strategies-part-6-evaluation/#comments</comments>
		
		<dc:creator><![CDATA[jcl]]></dc:creator>
		<pubDate>Thu, 05 Feb 2026 16:54:41 +0000</pubDate>
				<category><![CDATA[No Math]]></category>
		<category><![CDATA[System Development]]></category>
		<category><![CDATA[System Evaluation]]></category>
		<category><![CDATA[Experiment]]></category>
		<category><![CDATA[Indicator]]></category>
		<category><![CDATA[Walk forward analysis]]></category>
		<category><![CDATA[White's reality check]]></category>
		<category><![CDATA[Zorro]]></category>
		<guid isPermaLink="false">https://financial-hacker.com/?p=4901</guid>

					<description><![CDATA[Developing a successful strategy is a process with many steps, described in the Build Better Strategies article series. At some point you have coded a first, raw version of the strategy. At that stage you&#8217;re usually experimenting with different functions for market detection or trade signals. The problem: How can you determine which indicator, filter, &#8230; <a href="https://financial-hacker.com/build-better-strategies-part-6-evaluation/" class="more-link">Continue reading<span class="screen-reader-text"> "Build Better Strategies, Part 6: Evaluation"</span></a>]]></description>
										<content:encoded><![CDATA[<p>Developing a successful strategy is a process with many steps, described in the <a href="https://financial-hacker.com/build-better-strategies/">Build Better Strategies</a> article series. At some point you have coded a first, raw version of the strategy. At that stage you&#8217;re usually experimenting with different functions for market detection or trade signals. The problem: How can you determine which indicator, filter, or machine learning method works best with which markets and which time frames? Manually testing all combinations is very time consuming, close to impossible. Here&#8217;s a way to run that process automatically with a single mouse click.<span id="more-4901"></span></p>
<p>A robust trading strategy has to meet several criteria:</p>
<ul>
<li>It must exploit a real and significant market inefficiency. Random-walk markets cannot be algo traded.</li>
<li>It must work in all market situations. A trend follower must survive a mean reverting regime.</li>
<li>It must work under many different optimization settings and parameter ranges.</li>
<li>It must be unaffected by random events and price fluctuations.</li>
</ul>
<p>There are metrics and algorithms to test all this. The robustness under different market situations can be determined through the <strong>R2 coefficient</strong> or the deviations between the <strong>walk forward cycles</strong>. The parameter range robustness can be tested with a WFO profile (aka <strong>cluster analysis</strong>), the price fluctuation robustness with <strong><a href="https://financial-hacker.com/better-tests-with-oversampling/">oversampling</a></strong>. A <strong>Montecarlo analysis</strong> finds out whether the strategy is based on a real market inefficiency.</p>
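<p>To make the R2 metric concrete: in this context it measures how closely the equity curve follows a straight line. Below is a minimal plain-C sketch of that idea, for illustration only; it is not Zorro&#8217;s internal implementation.</p>

```c
#include <math.h>

/* R2 of an equity curve vs. its linear regression line.
   1.0 = perfectly straight growth, near 0 = no linear trend. */
double equityR2(const double *Equity, int N)
{
	double SumX = 0, SumY = 0, SumXY = 0, SumXX = 0;
	int i;
	for(i = 0; i < N; i++) {
		SumX += i; SumY += Equity[i];
		SumXY += i*Equity[i]; SumXX += (double)i*i;
	}
	double Slope = (N*SumXY - SumX*SumY)/(N*SumXX - SumX*SumX);
	double Icept = (SumY - Slope*SumX)/N;
	double Mean = SumY/N, SSRes = 0, SSTot = 0;
	for(i = 0; i < N; i++) {
		double Fit = Icept + Slope*i;
		SSRes += (Equity[i]-Fit)*(Equity[i]-Fit);
		SSTot += (Equity[i]-Mean)*(Equity[i]-Mean);
	}
	if(SSTot == 0) return 0;
	return 1. - SSRes/SSTot;
}
```

<p>A strategy whose equity grows in a straight line scores near 1; an equity curve that meanders scores near 0.</p>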
<p>Some platforms, such as <strong>Zorro</strong>, have functions for all this. But they require dedicated code in the strategy, often more than for the algorithm itself. In this article I&#8217;m going to describe an evaluation framework &#8211; a &#8216;shell&#8217; &#8211; that skips the coding part. The shell is included in the latest Zorro version. It can simply be attached to any strategy script. It makes all strategy variables accessible in a panel and adds features that are common to all strategies &#8211; optimization, money management, support for multiple assets and algos, cluster and Montecarlo analysis. It evaluates all strategy variants in an automated process and builds the optimal portfolio of combinations from different algorithms, assets, and timeframes.</p>
<p>The process involves these steps: </p>
<p><img decoding="async" src="https://financial-hacker.com/wp-content/uploads/2026/02/eval.png" alt="" /></p>
<p>The first step of strategy evaluation is generating sets of parameter settings, named <strong><span class="tast">jobs</span></strong>. Each job is a variant of the strategy that you want to test and possibly include in the final portfolio. Parameters can be switches that select between different indicators, or variables (such as timeframes) with optimization ranges. All parameters can be edited in the user interface of the shell, then saved with a mouse click as a job.</p>
<p>The next step is an automated process that runs through all previously stored jobs, trains and tests each of them with different asset, algo, and time frame combinations, and stores their results in a <strong><span class="tast">summary</span></strong>. The summary is a CSV list with the performance metrics of all jobs. It is automatically sorted &#8211; the best performing job variants are at the top &#8211; and looks like this:</p>
<p><img decoding="async" src="https://financial-hacker.com/wp-content/uploads/2026/02/shellsummary.png" alt="" /></p>
<p>So you can see at a glance which parameter combinations work with which assets and time frames, and which are not worth examining further. You can repeat this step with different global settings, such as bar period or optimization method, and generate multiple summaries in this way.</p>
<p>The next step in the process is <strong><span class="tast">cluster analysis</span></strong>. Every job in a selected summary is optimized multiple times with different walk-forward settings. The results of the job variants are stored in WFO profiles or heatmaps:</p>
<p><img decoding="async" src="https://financial-hacker.com/wp-content/uploads/2026/02/RangerMatrix.png" alt="" /></p>
<p>After this process, you have likely ended up with a couple of survivors at the top of the summary. The surviving jobs all have a positive return, a steadily rising equity curve, shallow drawdowns, and robust parameter ranges, since they passed the cluster analysis. But any selection process generates <strong><span class="tast">selection bias</span></strong>. Your perfect portfolio will likely produce a great backtest, but will it perform equally well in live trading? To find out, you run a <span class="tast"><strong>Montecarlo analysis</strong>, aka &#8216;Reality Check&#8217;</span>.</p>
<p><img decoding="async" src="https://financial-hacker.com/wp-content/uploads/2026/02/RealityCheck_s1.png" alt="" /></p>
<p>This is the most important test of all, since it can determine whether your strategy exploits a real market inefficiency. If the Montecarlo analysis fails with the final portfolio, it will likely also fail with any other parameter combination, so you need to run it only close to the end. If your system passes Montecarlo with a <strong><span class="tast">p-value</span></strong> below 5%, you can be relatively confident that the system will return good and steady profit in live trading. Otherwise, back to the drawing board.</p>
<p>The evaluation shell is included in Zorro version 3.01 or above. Usage and details are described under <a href="https://zorro-project.com/manual/en/shell.htm" target="_blank" rel="noopener">https://zorro-project.com/manual/en/shell.htm</a>.  Attaching the shell to a strategy is described under <a href="https://zorro-project.com/manual/en/shell2.htm" target="_blank" rel="noopener">https://zorro-project.com/manual/en/shell2.htm</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://financial-hacker.com/build-better-strategies-part-6-evaluation/feed/</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
			</item>
		<item>
		<title>Why 90% of Backtests Fail</title>
		<link>https://financial-hacker.com/why-90-of-backtests-fail/</link>
					<comments>https://financial-hacker.com/why-90-of-backtests-fail/#comments</comments>
		
		<dc:creator><![CDATA[jcl]]></dc:creator>
		<pubDate>Mon, 04 Apr 2022 15:55:47 +0000</pubDate>
				<category><![CDATA[System Development]]></category>
		<category><![CDATA[System Evaluation]]></category>
		<category><![CDATA[Backtest]]></category>
		<category><![CDATA[Montecarlo Methods]]></category>
		<category><![CDATA[SPY]]></category>
		<category><![CDATA[Walk forward analysis]]></category>
		<category><![CDATA[White's reality check]]></category>
		<guid isPermaLink="false">https://financial-hacker.com/?p=4373</guid>

					<description><![CDATA[About 9 out of 10 backtests produce wrong or misleading results. This is the number one reason why carefully developed algorithmic trading systems often fail in live trading. Even with out-of-sample data and even with cross-validation or walk-forward analysis, backtest results are often way off to the optimistic side. The majority of trading systems with &#8230; <a href="https://financial-hacker.com/why-90-of-backtests-fail/" class="more-link">Continue reading<span class="screen-reader-text"> "Why 90% of Backtests Fail"</span></a>]]></description>
										<content:encoded><![CDATA[<p>About 9 out of 10 backtests produce wrong or misleading results. This is the number one reason why carefully developed algorithmic trading systems often fail in live trading. Even with out-of-sample data and even with cross-validation or walk-forward analysis, backtest results are often way off to the optimistic side. The majority of trading systems with a positive backtest are in fact unprofitable. In this article I&#8217;ll discuss the cause of this phenomenon, and how to fix it.<span id="more-4373"></span></p>
<p>Suppose you&#8217;re developing an <a href="https://zorro-project.com/algotrading.php" target="_blank" rel="noopener">algorithmic trading strategy</a>, following all rules of proper <a href="https://financial-hacker.com/build-better-strategies-part-3-the-development-process/">system development</a>. But you are not aware that your trading algorithm has no statistical edge. The strategy is worthless, the trading rules equivalent to random trading, the profit expectancy – aside from transaction costs – is zero. The problem: you will rarely get a zero result in a backtest. A random trading strategy will in 50% of cases produce a negative backtest result, in 50% a positive result. But if the result is negative, you&#8217;re normally tempted to tweak the code or select assets and time frames until you finally get a profitable backtest. This will happen relatively soon, even when applying only random modifications to the system. That&#8217;s why there are so many unprofitable strategies around, with nevertheless great backtest performances.</p>
<p>Does this mean that backtests are worthless? Not at all. But it is essential to know whether you can trust the test, or not.</p>
<h3><strong>The test-the-backtest experiment<br />
</strong></h3>
<p>There are several methods for verifying a backtest. None of them is perfect, but all give insights from different viewpoints. We&#8217;ll use the <a href="https://zorro-project.com" target="_blank" rel="noopener">Zorro algo trading software</a>, and run our experiments with the following test system that is optimized and backtested with walk-forward analysis:</p>
<pre class="prettyprint">function run()
{
  set(PARAMETERS,TESTNOW,PLOTNOW,LOGFILE);
  BarPeriod = 1440;
  LookBack = 100;
  StartDate = 2012;
  NumWFOCycles = 10;

  assetList("AssetsIB");
  asset("SPY");

  vars Signals = series(LowPass(seriesC(),optimize(10,2,20,2)));
  vars MMIFast = series(MMI(seriesC(),optimize(50,40,60,5)));
  vars MMISlow = series(LowPass(MMIFast,100));

  MaxLong = 1;
  if(falling(MMISlow)) {
    if(valley(Signals))
      enterLong();
    else if(peak(Signals))
      exitLong();
  }
}</pre>
<p>This is a classic trend following algorithm. It uses a lowpass filter for trading at the peaks and valleys of the smoothed price curve, and a MMI filter (<a href="https://financial-hacker.com/the-market-meanness-index/">Market Meanness Index</a>) for distinguishing trending from non-trending market periods. It only trades when the market has switched to a trend regime, which is essential for profitable trend following systems. It opens only long positions. Lowpass and MMI filter periods are <a href="https://zorro-project.com/manual/en/optimize.htm" target="_blank" rel="noopener">optimized</a>, and the backtest is a <a href="https://zorro-project.com/manual/en/numwfocycles.htm" target="_blank" rel="noopener">walk-forward analysis</a> with 10 cycles.</p>
<h3><strong>The placebo trading system<br />
</strong></h3>
<p>It is standard for experiments to compare the real stuff with a placebo. For this we&#8217;re using a trading system that has obviously no edge, but was tweaked with the evil intention to appear profitable in a walk-forward analysis. This is our placebo system:</p>
<pre class="prettyprint">void run()
{
  set(PARAMETERS,TESTNOW,PLOTNOW,LOGFILE);
  BarPeriod = 1440;
  StartDate = 2012;
  setf(TrainMode,BRUTE);
  NumWFOCycles = 9;

  assetList("AssetsIB");
  asset("SPY");

  int Pause = optimize(5,1,15,1);
  LifeTime = optimize(5,1,15,1);

// trade after a pause...
  static int NextEntry;
  if(Init) NextEntry = 0;
  if(NextEntry-- &lt;= 0) {
    NextEntry = LifeTime+Pause;
    enterLong();
  }
}</pre>
<p>This system opens a position, keeps it for a while, then closes it and pauses for a while. The trade and pause durations are walk-forward optimized between 1 day and 3 weeks. <a href="https://zorro-project.com/manual/en/timewait.htm" target="_blank" rel="noopener">LifeTime</a> is a predefined variable that closes the position after the given time. If you don&#8217;t believe in lucky trade patterns, you can rightfully assume that this system is equivalent to random trading. Let&#8217;s see how it fares in comparison to the trend trading system.</p>
<h3><strong>Trend trading vs. placebo trading<br />
</strong></h3>
<p>This is the equity curve with the trend trading system from a walk forward analysis from 2012 up to 3/2022:</p>
<p><img decoding="async" src="https://financial-hacker.com/wp-content/uploads/2022/04/040422_1441_Why90ofBack1.png" alt="" /></p>
<p>The plot begins in 2015 because the preceding 3 years are used for the training and lookback periods. SPY follows the S&amp;P500 index and rises in the long term, so we could expect some profit anyway with a long-only system. But this system, with profit factor 3 and R2 coefficient 0.65, appears a lot better than random trading. Let&#8217;s compare it with the placebo system:</p>
<p><img decoding="async" src="https://financial-hacker.com/wp-content/uploads/2022/04/040422_1441_Why90ofBack2.png" alt="" /></p>
<p>The placebo system produced profit factor 2 and R2 coefficient 0.77. Slightly less than the real system, but in the same performance range. And this result was also from a walk-forward analysis, although with 9 cycles &#8211; therefore the later start of the test period. Aside from that, it seems impossible to determine solely from the equity curve and performance data which system is for real, and which is a placebo.</p>
<h3><strong>Checking the reality<br />
</strong></h3>
<p>Methods to verify backtest results are named &#8216;reality check&#8217;. They are specific to the asset and algorithm; in a multi-asset, multi-algo portfolio, you need to enable only the component you want to test. Let&#8217;s first see how the WFO split affects the backtest. In this way we can find out whether our backtest result was just due to lucky trading in a particular WFO cycle. We&#8217;re going to plot a <strong>WFO profile</strong> that displays the effect of the number of walk-forward cycles on the result. For this we comment out the <strong>NumWFOCycles = …</strong> line in the code, and run it in training mode with the <strong>WFOProfile.c</strong> script:</p>
<pre class="prettyprint">#define run strategy
#include "trend.c" // &lt;= your script
#undef run
#define CYCLES 20 // max WFO cycles

function run()
{
  set(TESTNOW);
  NumTotalCycles = CYCLES-1;
  NumWFOCycles = TotalCycle+1;
  strategy();
}

function evaluate()
{
  var Perf = ifelse(LossTotal &gt; 0,WinTotal/LossTotal,10);
  if(Perf &gt; 1)
    plotBar("WFO+",NumWFOCycles,NumWFOCycles,Perf,BARS,BLACK);
  else
    plotBar("WFO-",NumWFOCycles,NumWFOCycles,Perf,BARS,RED);
}</pre>
<p>We&#8217;re redefining the <strong>run</strong> function to a different name. This allows us to simply include the tested script and train it with WFO cycles from 2 up to the number defined by CYCLES. A backtest is executed after training. If an <strong>evaluate</strong> function is present, Zorro runs it automatically after any backtest. It plots a histogram bar of the profit factor (y axis) for each number of WFO cycles. First, the WFO profile of the trend trading system:</p>
<p><img decoding="async" src="https://financial-hacker.com/wp-content/uploads/2022/04/040422_1441_Why90ofBack3.png" alt="" /></p>
<p>We can see that the performance rises with the number of cycles. This is typical for a system that adapts to the market. All results are positive with a profit factor &gt; 1. Our arbitrary choice of 10 cycles produced a less than average result. So we can at least be sure that this backtest result was not caused by a particularly lucky number of WFO cycles.</p>
<p>The WFO profile of the placebo system:</p>
<p><img decoding="async" src="https://financial-hacker.com/wp-content/uploads/2022/04/040422_1441_Why90ofBack4.png" alt="" /></p>
<p>This time the number of WFO cycles had a strong random effect on the performance. And it is now obvious why I used 9 WFO cycles for that system. For the same reason I used brute force optimization, since it increases WFO variance and thus the chance to get lucky WFO cycle numbers. That&#8217;s the opposite of what we normally do when developing algorithmic trading strategies.</p>
<p>WFO profiles give insight into WFO cycle dependency, but not into randomness or overfitting by other means. For this, more in-depth tests are required. Zorro supports two methods, the Montecarlo Reality Check (MRC) with randomized price curves, and <a href="https://financial-hacker.com/whites-reality-check/">White&#8217;s Reality Check</a> (WRC) with detrended and bootstrapped equity curves of strategy variants. Both methods have their advantages and disadvantages. But since strategy variants from optimizing can only be created without walk-forward analysis, we&#8217;re using the MRC here.</p>
<h3><strong>The Montecarlo Reality Check<br />
</strong></h3>
<p>First we test both systems with random price curves. Randomizing removes short-term price correlations and market inefficiencies, but keeps the long-term trend. Then we compare our original backtest result with the randomized results. This yields a <strong>p-value</strong>, a metric of the probability that our test result was caused by randomness. The lower the p-value, the more confidence we can have in the backtest result. In statistics we normally consider a result significant when its p-value is below 5%.</p>
<p>The basic algorithm of the Montecarlo Reality Check (MRC):</p>
<ol>
<li>Train your system and run a backtest. Store the profit factor (or any other <a href="https://zorro-project.com/manual/en/performance.htm" target="_blank" rel="noopener">performance metric</a> that you want to compare).</li>
<li><a href="https://zorro-project.com/manual/en/detrend.htm" target="_blank" rel="noopener">Randomize</a> the price curve by randomly swapping price changes (shuffle without replacement).</li>
<li>Train your system again with the randomized data and run a backtest. Store the performance metric.</li>
<li>Repeat steps 2 and 3 1000 times.</li>
<li>Determine the number N of randomized tests that have a better result than the original test. The p-value is N/1000.</li>
</ol>
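<p>The core of steps 2 and 5 can be sketched in plain C. Shuffling the price changes without replacement is a Fisher-Yates shuffle on the bar-to-bar differences; note that it preserves the final price, and with it the long-term trend. This is a simplified illustration, not the code Zorro runs internally for Detrend = SHUFFLE.</p>

```c
#include <stdlib.h>

/* Step 2: randomize a price curve by shuffling its bar-to-bar
   changes without replacement (Fisher-Yates). The sum of the
   changes is unchanged, so the end price and the long-term
   trend survive the shuffle. */
void shufflePrices(double *Price, int N)
{
	int i, j;
	double *Change = malloc((N-1)*sizeof(double));
	for(i = 0; i < N-1; i++)
		Change[i] = Price[i+1] - Price[i];
	for(i = N-2; i > 0; i--) {
		j = rand() % (i+1);
		double Tmp = Change[i]; Change[i] = Change[j]; Change[j] = Tmp;
	}
	for(i = 0; i < N-1; i++)
		Price[i+1] = Price[i] + Change[i];
	free(Change);
}

/* Step 5: fraction of randomized results beating the original. */
double pValue(double Original, const double *Random, int N)
{
	int i, Better = 0;
	for(i = 0; i < N; i++)
		if(Random[i] > Original) Better++;
	return (double)Better/N;
}
```

<p>Since the shuffle keeps the first and last price, a long-only SPY system is compared against random curves with the same overall drift, which is exactly the point of the check.</p>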
<p>If our backtest result was affected by an overall upwards trending price curve, which is certainly the case for this SPY system, the randomized tests will be likewise affected. The MRC code:</p>
<pre class="prettyprint">#define run strategy
#include "trend.c" // &lt;= your script
#undef run
#define CYCLES 1000

function run()
{
  set(PRELOAD,TESTNOW);
  NumTotalCycles = CYCLES;
  if(TotalCycle == 1) // first cycle = original
    seed(12345); // always same random sequence
  else
    Detrend = SHUFFLE;
  strategy();
  set(LOGFILE|OFF); // don't export files
}

function evaluate()
{
  static var OriginalProfit, Probability;
  var PF = ifelse(LossTotal &gt; 0,WinTotal/LossTotal,10);
  if(TotalCycle == 1) {
    OriginalProfit = PF;
    Probability = 0;
  } else {
    if(PF &lt; 2*OriginalProfit) // clip image at double range
      plotHistogram("Random",PF,OriginalProfit/50,1,RED);
    if(PF &gt; OriginalProfit)
      Probability += 100./NumTotalCycles;
  }
  if(TotalCycle == NumTotalCycles) { // last cycle
    plotHistogram("Original",
     OriginalProfit,OriginalProfit/50,sqrt(NumTotalCycles),BLACK);
    printf("\n-------------------------------------------");
    printf("\nP-Value %.1f%%",Probability);
    printf("\nResult is ");
    if(Probability &lt;= 1)
      printf("highly significant") ;
    else if(Probability &lt;= 5)
      printf("significant");
    else if(Probability &lt;= 15)
      printf("maybe significant");
    else
      printf("statistically insignificant");
    printf("\n-------------------------------------------");
  }
}</pre>
<p>This code sets up the Zorro platform to train and test the system 1000 times. The <a href="https://zorro-project.com/manual/en/random.htm" target="_blank" rel="noopener"><strong>seed</strong></a> setting ensures that you get the same result on any MRC run. From the second cycle on, the historical data is shuffled without replacement. For calculating the p-value and plotting a histogram of the MRC, we use the <strong>evaluate</strong> function again. It calculates the p-value by counting the backtests resulting in higher profit factors than the original system. Depending on the system, training and testing the strategy a thousand times will take several minutes with Zorro. The resulting MRC histogram of the trend following system:</p>
<p><img decoding="async" src="https://financial-hacker.com/wp-content/uploads/2022/04/040422_1441_Why90ofBack5.png" alt="" /></p>
<p>The height of a red bar represents the number of shuffled backtests that ended at the profit factor shown on the x axis. The black bar on the right (height is irrelevant, only the x axis position matters) is the profit factor with the original price curve. We can see that most shuffled tests came out positive, due to the long-term upwards trend of the SPY price. But our test system came out even more positive. The p-Value is below 1%, meaning a high significance of our backtest. This gives us some confidence that the simple trend follower can achieve a similar result in real trading.</p>
<p>This cannot be said from the MRC histogram of the placebo system:</p>
<p><img decoding="async" src="https://financial-hacker.com/wp-content/uploads/2022/04/040422_1441_Why90ofBack6.png" alt="" /></p>
<p>The backtest profit factors now extend over a wider range, and many were more profitable than the original system. The backtest with the real price curve is indistinguishable from the randomized tests, with a p-value in the 40% area. The original backtest result of the placebo system, even though achieved with walk-forward analysis, is therefore meaningless.</p>
<p>It should be mentioned that the MRC cannot detect all invalid backtests. A system that was explicitly fitted to a particular price curve, for instance by knowing in advance its peaks and valleys, would still get a low p-value from the MRC. No reality check could distinguish such a system from a system with a real edge. Therefore, neither MRC nor WRC can give an absolute guarantee that a system works when it passes the check. But when it does not pass, you&#8217;re well advised not to trade it with real money.</p>
<p>I have uploaded the strategies to the 2022 script repository. The MRC and WFOProfile scripts are included in Zorro version 2.47.4 and above. You will need Zorro S for the brute force optimization of the placebo system.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://financial-hacker.com/why-90-of-backtests-fail/feed/</wfw:commentRss>
			<slash:comments>22</slash:comments>
		
		
			</item>
		<item>
		<title>Better Strategies 5: A Short-Term Machine Learning System</title>
		<link>https://financial-hacker.com/build-better-strategies-part-5-developing-a-machine-learning-system/</link>
					<comments>https://financial-hacker.com/build-better-strategies-part-5-developing-a-machine-learning-system/#comments</comments>
		
		<dc:creator><![CDATA[jcl]]></dc:creator>
		<pubDate>Fri, 12 Aug 2016 09:42:38 +0000</pubDate>
				<category><![CDATA[3 Most Clicked]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[System Development]]></category>
		<category><![CDATA[Autoencoder]]></category>
		<category><![CDATA[Boltzmann machine]]></category>
		<category><![CDATA[Classification]]></category>
		<category><![CDATA[Confusion matrix]]></category>
		<category><![CDATA[Data mining bias]]></category>
		<category><![CDATA[Deepnet]]></category>
		<category><![CDATA[Experiment]]></category>
		<category><![CDATA[Price action]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[Sharpe ratio]]></category>
		<category><![CDATA[Walk forward analysis]]></category>
		<guid isPermaLink="false">http://www.financial-hacker.com/build-better-strategies-part-3-the-development-process-copy/</guid>

					<description><![CDATA[It&#8217;s time for the 5th and final part of the Build Better Strategies series. In part 3 we&#8217;ve discussed the development process of a model-based system, and consequently we&#8217;ll conclude the series with developing a data-mining system. The principles of data mining and machine learning have been the topic of part 4. For our short-term &#8230; <a href="https://financial-hacker.com/build-better-strategies-part-5-developing-a-machine-learning-system/" class="more-link">Continue reading<span class="screen-reader-text"> "Better Strategies 5: A Short-Term Machine Learning System"</span></a>]]></description>
										<content:encoded><![CDATA[<p>It&#8217;s time for the 5th and final part of the <a href="http://www.financial-hacker.com/build-better-strategies/">Build Better Strategies</a> series. In <a href="http://www.financial-hacker.com/build-better-strategies-part-3-the-development-process/" target="_blank" rel="noopener">part 3</a> we&#8217;ve discussed the development process of a model-based system, and consequently we&#8217;ll conclude the series with developing a data-mining system. The principles of data mining and machine learning have been the topic of <a href="http://www.financial-hacker.com/build-better-strategies-part-4-machine-learning/">part 4</a>. For our short-term trading example we&#8217;ll use a <strong>deep learning algorithm</strong>, a stacked autoencoder, but it will work in the same way with many other machine learning algorithms. With today&#8217;s software tools, only about <strong>20 lines of code</strong> are needed for a machine learning strategy. I&#8217;ll try to explain all steps in detail. <span id="more-1872"></span></p>
<p>Our example will be a <strong>research project</strong> &#8211; a machine learning experiment for answering two questions. Does a more complex algorithm &#8211; such as more neurons and deeper learning &#8211; produce a better prediction? And are short-term price moves predictable by short-term price history? The last question came up due to my scepticism about <strong>price action trading</strong> in the <a href="http://www.financial-hacker.com/build-better-strategies-part-4-machine-learning/" target="_blank" rel="noopener">previous part</a> of this series. I got several emails asking about the &#8220;trading system generators&#8221; or similar price action tools that are praised on some websites. There is no hard evidence that such tools ever produced any profit (except for their vendors) &#8211; but does this mean that they all are garbage? We&#8217;ll see.</p>
<p>Our experiment is simple: We collect information from the last candles of a price curve, feed it in a deep learning neural net, and use it to predict the next candles. My hypothesis is that a few candles don&#8217;t contain any useful predictive information. Of course, a nonpredictive outcome of the experiment won&#8217;t mean that I&#8217;m right, since I could have used wrong parameters or prepared the data badly. But a predictive outcome would be a hint that I&#8217;m wrong and price action trading can indeed be profitable.</p>
<h3>Machine learning strategy development<br />
Step 1: The target variable</h3>
<p>To recap the <a href="http://www.financial-hacker.com/build-better-strategies-part-4-machine-learning/">previous part</a>: a supervised learning algorithm is trained with a set of <strong>features</strong> in order to predict a <strong>target variable</strong>. So the first thing to determine is what this target variable shall be. A popular target, used in most papers, is the sign of the price return at the next bar. Better suited for prediction, since it is less susceptible to randomness, is the price difference to a more distant <strong>prediction horizon</strong>, like 3 bars from now, or the same day next week. Like almost anything in trading systems, the prediction horizon is a compromise between the effects of randomness (fewer bars are worse) and predictability (fewer bars are better).</p>
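<p>For illustration, a sketch of such a sign target in plain C, working on a price array; the function and variable names are mine, not from the scripts in this series. The horizon H = 1 gives the popular next-bar target, larger H the more distant one.</p>

```c
/* Classification target: sign of the price move from bar t to a
   prediction horizon of H bars ahead. H = 1 is the popular
   next-bar return; larger H is less susceptible to randomness. */
int targetSign(const double *Close, int t, int H)
{
	if(Close[t+H] > Close[t]) return 1;   /* up */
	if(Close[t+H] < Close[t]) return -1;  /* down */
	return 0;                             /* flat */
}
```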
<p>Sometimes you&#8217;re not interested in directly predicting the price, but in predicting some other parameter &#8211; such as the current leg of a Zigzag indicator &#8211; that could otherwise only be determined in hindsight. Or you want to know if a certain <strong>market inefficiency</strong> will be present in the near future, especially when you&#8217;re using machine learning not directly for trading, but for filtering trades in a <a href="http://www.financial-hacker.com/build-better-strategies-part-3-the-development-process/" target="_blank" rel="noopener">model-based system</a>. Or you want to predict something entirely different, for instance the probability of a market crash tomorrow. All this is often easier to predict than the popular tomorrow&#8217;s return.</p>
<p>In our price action experiment we&#8217;ll use the return of a short-term price action trade as the target variable. Once the target is determined, the next step is selecting the features.</p>
<h3>Step 2: The features</h3>
<p>A price curve is the worst case for any machine learning algorithm. Not only does it carry <strong>little signal and mostly noise</strong>, it is also nonstationary and the signal/noise ratio changes all the time. The exact ratio of signal and noise depends on what is meant by &#8220;signal&#8221;, but it is normally too low for any known machine learning algorithm to produce anything useful. So we must derive features from the price curve that contain more signal and less noise. Signal, in that context, is any information that can be used to predict the target, whatever it is. All the rest is noise.</p>
<p>Thus, <strong>selecting the features is critical for success</strong> &#8211; even more critical than deciding which machine learning algorithm you&#8217;re going to use. There are two approaches to selecting features. The first and most common is extracting as much information from the price curve as possible. Since you do not know where the information is hidden, you just generate a wild collection of indicators with a wide range of parameters, and hope that at least a few of them will contain the information that the algorithm needs. This is the approach that you normally find in the literature. The problem with this method: any machine learning algorithm is easily confused by nonpredictive predictors. So it won&#8217;t do to just throw 150 indicators at it. You need some <strong>preselection algorithm</strong> that determines which of them carry useful information and which can be omitted. Without reducing the features this way to maybe eight or ten, even the deepest learning algorithm won&#8217;t produce anything useful.</p>
<p>The other approach, normally for experiments and research, is using only limited information from the price curve. This is the case here: Since we want to examine price action trading, we only use the last few prices as inputs, and must discard all the rest of the curve. This has the advantage that we don&#8217;t need any preselection algorithm since the number of features is limited anyway. Here are the two simple predictor functions that we use in our experiment (in C):</p>
<pre class="prettyprint">var change(int n)
{
	return scale((priceClose(0) - priceClose(n))/priceClose(0),100)/100;
}

var range(int n)
{
	return scale((HH(n) - LL(n))/priceClose(0),100)/100;
}</pre>
<p>The two functions are supposed to carry the necessary information for price action: per-bar movement and volatility. The <strong>change</strong> function is the difference of the current price to the price of <strong>n</strong> bars before, divided by the current price. The <strong>range</strong> function is the total high-low distance of the last <strong>n</strong> candles, also divided by the current price. The <strong>scale</strong> function centers and compresses the values to the <strong>+/-100</strong> range, so we divide them by 100 to normalize them to <strong>+/-1</strong>. Recall that normalization is needed for machine learning algorithms.</p>
<h3>Step 3: Preselecting predictors</h3>
<p>When you have selected a large number of indicators or other signals as features for your algorithm, you must determine which of them are useful and which are not. There are many methods for reducing the number of features, for instance:</p>
<ul style="list-style-type: square;">
<li>Determine the correlations between the signals. Remove those with a strong correlation to other signals, since they do not contribute to the information.</li>
<li>Compare the information content of signals directly, with algorithms like information entropy or decision trees.</li>
<li>Determine the information content indirectly by comparing the signals with randomized signals; there are some software libraries for this, such as the R Boruta package.</li>
<li>Use an algorithm like Principal Components Analysis (PCA) for generating a new signal set with reduced dimensionality.</li>
<li>Use genetic optimization for determining the most important signals just by the most profitable results from the prediction process. Great for curve fitting if you want to publish impressive results in a research paper.</li>
</ul>
<p>Reducing the number of features is important for most machine learning algorithms, including shallow neural nets. For deep learning it&#8217;s less important, since deep nets with many neurons are normally able to process huge feature sets and discard redundant features. For our experiment we do not preselect or preprocess the features, but you can find useful information about this in articles (1), (2), and (3) listed at the end of the page.</p>
<h3>Step 4: Select the machine learning algorithm</h3>
<p>R offers many different ML packages, and each of them offers many different algorithms with many different parameters. Even if you have already decided about the method &#8211; here, deep learning &#8211; you still have the choice among different approaches and different R packages. Most are quite new, and there is not much empirical information to help your decision. You have to try them all and gain experience with different methods. For our experiment we&#8217;ve chosen the <strong>Deepnet</strong> package, which is probably the simplest and easiest-to-use deep learning library. This keeps our code short. We&#8217;re using its <strong>Stacked Autoencoder</strong> (<strong>SAE</strong>) algorithm for pre-training the network. Deepnet also offers a <strong>Restricted Boltzmann Machine</strong> (<strong>RBM</strong>) for pre-training, but I could not get good results from it. There are other and more complex deep learning packages for R, so you can spend a lot of time checking out all of them.</p>
<p><em>How</em> pre-training works is easily explained, but <em>why</em> it works is a different matter. To my knowledge, no one has yet come up with a solid mathematical proof that it works at all. Anyway, imagine a large neural net with many hidden layers:</p>
<p><a href="http://www.financial-hacker.com/wp-content/uploads/2016/10/deepnet.png"><img fetchpriority="high" decoding="async" class="alignnone wp-image-2026 size-full" src="http://www.financial-hacker.com/wp-content/uploads/2016/10/deepnet.png" width="560" height="279" srcset="https://financial-hacker.com/wp-content/uploads/2016/10/deepnet.png 560w, https://financial-hacker.com/wp-content/uploads/2016/10/deepnet-300x149.png 300w" sizes="(max-width: 560px) 85vw, 560px" /></a></p>
<p>Training the net means setting up the connection weights between the neurons. The usual method is error backpropagation. But it turns out that the more hidden layers you have, the worse it works. The backpropagated error terms get smaller and smaller from layer to layer, causing the first layers of the net to learn almost nothing. This means that the predicted result becomes more and more dependent on the random initial state of the weights. This severely limited the complexity of layer-based neural nets and therefore the tasks that they could solve &#8211; at least until 10 years ago.</p>
<p>In 2006 scientists in Toronto first published the idea to pre-train the weights with an unsupervised learning algorithm, a restricted Boltzmann machine. This turned out to be a revolutionary concept. It boosted the development of artificial intelligence and enabled all sorts of new applications, from Go-playing machines to self-driving cars. Since then, several new improvements and algorithms for deep learning have been found. A stacked autoencoder works this way:</p>
<ol>
<li>Select the hidden layer to train; begin with the first hidden layer. Connect its outputs to a temporary output layer that has the same structure as the network&#8217;s input layer.</li>
<li>Feed the network with the training samples, but without the targets. Train it so that the first hidden layer reproduces the input signal &#8211; the features &#8211; at its outputs as exactly as possible. The rest of the network is ignored. During training, apply a &#8216;weight penalty term&#8217; so that as few connection weights as possible are used for reproducing the signal.</li>
<li>Now feed the outputs of the trained hidden layer to the inputs of the next untrained hidden layer, and repeat the training process so that the input signal is now reproduced at the outputs of the next layer.</li>
<li>Repeat this process until all hidden layers are trained. We have now a &#8216;sparse network&#8217; with very few layer connections that can reproduce the input signals.</li>
<li>Now train the network with backpropagation for learning the target variable, using the pre-trained weights of the hidden layers as a starting point.</li>
</ol>
<p>The hope is that the unsupervised pre-training process produces an internal noise-reduced abstraction of the input signals that can then be used for learning the target more easily. And this indeed appears to work. No one really knows why, but several theories &#8211; see paper (4) below &#8211; try to explain that phenomenon.</p>
<h3>Step 5: Generate a test data set</h3>
<p>We first need to produce a data set with features and targets so that we can test our prediction process and try out parameters. The features must be based on the same price data as in live trading, and for the target we must simulate a short-term trade. So it makes sense to generate the data not with R, but with our trading platform, which is a lot faster anyway. Here&#8217;s a small <a href="http://www.financial-hacker.com/hackers-tools-zorro-and-r/" target="_blank" rel="noopener">Zorro</a> script for this, <strong>DeepSignals.c</strong>:</p>
<pre class="prettyprint">function run()
{
	StartDate = 20140601; // start two years ago
	BarPeriod = 60; // use 1-hour bars
	LookBack = 100; // needed for scale()

	set(RULES);   // generate signals
	LifeTime = 3; // prediction horizon
	Spread = RollLong = RollShort = Commission = Slippage = 0;
	
	adviseLong(SIGNALS+BALANCED,0,
		change(1),change(2),change(3),change(4),
		range(1),range(2),range(3),range(4));
	enterLong(); 
}
</pre>
<p>We&#8217;re generating 2 years of data with features calculated by our above-defined <strong>change</strong> and <strong>range</strong> functions. Our target is the result of a trade with a 3-bar life time. Trading costs are set to zero, so in this case the result is equivalent to the sign of the price difference 3 bars in the future. The <strong>adviseLong</strong> function is described in the <a href="http://manual.zorro-project.com/advisor.htm" target="_blank" rel="noopener">Zorro manual</a>; it is a mighty function that automatically handles training and predicting and allows you to use any R-based machine learning algorithm just as if it were a simple indicator.</p>
<p>In our code, the function uses the next trade return as target, and the price changes and ranges of the last 4 bars as features. The <strong>SIGNALS</strong> flag tells it not to train the data, but to export it to a .csv file. The <strong>BALANCED</strong> flag makes sure that we get as many positive as negative returns; this is important for most machine learning algorithms. Run the script in [Train] mode with our usual test asset EUR/USD selected. It generates a spreadsheet file named <strong>DeepSignalsEURUSD_L.csv</strong> that contains the features in the first 8 columns, and the trade return in the last column.</p>
<h3>Step 6: Calibrate the algorithm</h3>
<p>Complex machine learning algorithms have many parameters to adjust. Some of them offer great opportunities to curve-fit the algorithm for publications. Still, we must calibrate parameters since the algorithm rarely works well with its default settings. For this, here&#8217;s an R script that reads the previously created data set and processes it with the deep learning algorithm (<strong>DeepSignal.r</strong>): </p>
<pre class="prettyprint">library('deepnet', quietly = T) 
library('caret', quietly = T)

neural.train = function(model,XY) 
{
  XY &lt;- as.matrix(XY)
  X &lt;- XY[,-ncol(XY)]
  Y &lt;- XY[,ncol(XY)]
  Y &lt;- ifelse(Y &gt; 0,1,0)
  Models[[model]] &lt;&lt;- sae.dnn.train(X,Y,
      hidden = c(50,100,50), 
      activationfun = "tanh", 
      learningrate = 0.5, 
      momentum = 0.5, 
      learningrate_scale = 1.0, 
      output = "sigm", 
      sae_output = "linear", 
      numepochs = 100, 
      batchsize = 100,
      hidden_dropout = 0, 
      visible_dropout = 0)
}

neural.predict = function(model,X) 
{
  if(is.vector(X)) X &lt;- t(X)
  return(nn.predict(Models[[model]],X))
}

neural.init = function()
{
  set.seed(365)
  Models &lt;&lt;- vector("list")
}

TestOOS = function() 
{
  neural.init()
  XY &lt;&lt;- read.csv('C:/Zorro/Data/DeepSignalsEURUSD_L.csv',header = F)
  splits &lt;- nrow(XY)*0.8
  XY.tr &lt;&lt;- head(XY,splits);
  XY.ts &lt;&lt;- tail(XY,-splits)
  neural.train(1,XY.tr)
  X &lt;&lt;- XY.ts[,-ncol(XY.ts)]
  Y &lt;&lt;- XY.ts[,ncol(XY.ts)]
  Y.ob &lt;&lt;- ifelse(Y &gt; 0,1,0)
  Y &lt;&lt;- neural.predict(1,X)
  Y.pr &lt;&lt;- ifelse(Y &gt; 0.5,1,0)
  confusionMatrix(Y.pr,Y.ob)
}</pre>
<p>We&#8217;ve defined three functions <strong>neural.train</strong>, <strong>neural.predict</strong>, and <strong>neural.init</strong> for training, predicting, and initializing the neural net. The function names are not arbitrary, but follow the convention used by Zorro&#8217;s advise(NEURAL,..) function. It doesn&#8217;t matter now, but will matter later when we use the same R script for training and trading the deep learning strategy. A fourth function, <strong>TestOOS</strong>, is used for testing our setup out of sample.</p>
<p>The function <strong>neural.init</strong> seeds the R random generator with a fixed value (365 is my personal lucky number). Otherwise we would get a slightly different result every time, since the neural net is initialized with random weights. It also creates a global R list named &#8220;Models&#8221;. Most R variable types don&#8217;t need to be created beforehand, but some do (don&#8217;t ask me why). The &#8216;&lt;&lt;-&#8217; operator is for accessing a global variable from within a function.</p>
<p>The function <strong>neural.train</strong> takes as input a model number and the data set to be trained. The model number identifies the trained model in the &#8220;<strong>Models</strong>&#8221; list. A list is not really needed for this test, but we&#8217;ll need it for more complex strategies that train more than one model. The matrix containing the features and target is passed to the function as the second parameter. If the <strong>XY</strong> data is not a proper matrix, which frequently happens in R depending on how you generated it, it is converted to one. Then it is split into the features (<strong>X</strong>) and the target (<strong>Y</strong>), and finally the target is converted to <strong>1</strong> for a positive trade outcome and <strong>0</strong> for a negative outcome.</p>
<p>The network parameters are then set up. Some are obvious, others are free to play around with:</p>
<ul style="list-style-type: square;">
<li>The network structure is given by the <strong>hidden</strong> vector: <strong>c(50,100,50)</strong> defines 3 hidden layers, the first with 50, the second with 100, and the third with 50 neurons. That&#8217;s the parameter that we&#8217;ll later modify for determining whether deeper is better.</li>
<li>The <strong>activation function </strong>converts the sum of neuron input values to the neuron output; most often used are <strong>sigmoid</strong> that saturates to 0 or 1, or <strong>tanh</strong> that saturates to -1 or +1.</li>
</ul>
<p><a href="http://www.financial-hacker.com/wp-content/uploads/2016/08/sigmoid_tanh.png"><img decoding="async" class="alignnone wp-image-2111 " src="http://www.financial-hacker.com/wp-content/uploads/2016/08/sigmoid_tanh.png" width="523" height="197" srcset="https://financial-hacker.com/wp-content/uploads/2016/08/sigmoid_tanh.png 960w, https://financial-hacker.com/wp-content/uploads/2016/08/sigmoid_tanh-300x113.png 300w, https://financial-hacker.com/wp-content/uploads/2016/08/sigmoid_tanh-768x289.png 768w" sizes="(max-width: 523px) 85vw, 523px" /></a></p>
<p>We use <strong>tanh</strong> here since our signals are also in the +/-1 range. The <strong>output</strong> of the network is a sigmoid function since we want a prediction in the 0..1 range. But the <strong>SAE output</strong> must be &#8220;linear&#8221; so that the Stacked Autoencoder can reproduce the analog input signals on the outputs. Recently, <strong>ReLUs</strong>, Rectified Linear Units, have come into fashion as activation functions for internal layers. ReLUs are faster and partially overcome the backpropagation problem mentioned above, but are not supported by deepnet.</p>
<ul style="list-style-type: square;">
<li>The <strong>learning rate</strong> controls the step size for the gradient descent in training; a lower rate means finer steps and possibly more precise prediction, but longer training time.</li>
<li><strong>Momentum</strong> adds a fraction of the previous step to the current one. It prevents the gradient descent from getting stuck at a tiny local minimum or saddle point.</li>
<li>The <strong>learning rate scale</strong> is a multiplication factor for changing the learning rate after each iteration (I am not sure for what this is good, but there may be tasks where a lower learning rate on higher epochs improves the training).</li>
<li>An <strong>epoch</strong> is a training iteration over the entire data set. Training will stop once the number of epochs is reached. More epochs mean better prediction, but longer training.</li>
<li>The <strong>batch size</strong> is a number of random samples &#8211; a <strong>mini batch</strong> &#8211; taken out of the data set for a single training run. Splitting the data into mini batches speeds up training since the weight gradient is then calculated from fewer samples. The higher the batch size, the better the training, but the more time it will take.</li>
<li>The <strong>dropout</strong> is a number of randomly selected neurons that are disabled during a mini batch. This way the net learns only with a part of its neurons. This seems a strange idea, but can effectively reduce overfitting.</li>
</ul>
<p>All these parameters are common for neural networks. Play around with them and check their effect on the result and the training time. Properly calibrating a neural net is not trivial and might be the topic of another article. The parameters are stored in the model together with the matrix of trained connection weights, so they need not be given again in the prediction function, <strong>neural.predict</strong>. It takes the model and a vector <strong>X</strong> of features, runs it through the layers, and returns the network output, the predicted target <strong>Y</strong>. Compared with training, prediction is pretty fast since it needs just one forward pass &#8211; on the order of ten thousand multiplications for our network. If <strong>X</strong> is a vector, it is transposed into a one-row matrix, since the <strong>nn.predict</strong> function otherwise won&#8217;t accept it.</p>
<p>Use RStudio or a similar environment for conveniently working with R. Edit the path to the <strong>.csv</strong> data in the file above, source it, install the required R packages (deepnet, e1071, and caret), then call the <strong>TestOOS</strong> function from the command line. If everything works, it should print something like this:</p>
<pre class="prettyprint">&gt; TestOOS()
begin to train sae ......
training layer 1 autoencoder ...
####loss on step 10000 is : 0.000079
training layer 2 autoencoder ...
####loss on step 10000 is : 0.000085
training layer 3 autoencoder ...
####loss on step 10000 is : 0.000113
sae has been trained.
begin to train deep nn ......
####loss on step 10000 is : 0.123806
deep nn has been trained.
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1231  808
         1  512  934
                                          
               Accuracy : 0.6212          
                 95% CI : (0.6049, 0.6374)
    No Information Rate : 0.5001          
    P-Value [Acc &gt; NIR] : &lt; 2.2e-16       
                                          
                  Kappa : 0.2424          
 Mcnemar's Test P-Value : 4.677e-16       
                                          
            Sensitivity : 0.7063          
            Specificity : 0.5362          
         Pos Pred Value : 0.6037          
         Neg Pred Value : 0.6459          
             Prevalence : 0.5001          
         Detection Rate : 0.3532          
   Detection Prevalence : 0.5851          
      Balanced Accuracy : 0.6212          
                                          
       'Positive' Class : 0               
                                          
&gt; </pre>
<p><strong>TestOOS</strong> first reads our data set from Zorro&#8217;s Data folder. It splits the data into 80% for training (<strong>XY.tr</strong>) and 20% for out-of-sample testing (<strong>XY.ts</strong>). The training set is trained and the result stored in the <strong>Models</strong> list at index 1. The test set is further split into features (<strong>X</strong>) and targets (<strong>Y</strong>). <strong>Y</strong> is converted to binary 0 or 1 and stored in <strong>Y.ob</strong>, our vector of observed targets. We then predict the targets from the test set, convert them again to binary 0 or 1, and store them in <strong>Y.pr</strong>. For comparing the observation with the prediction, we use the <strong>confusionMatrix</strong> function from the caret package.</p>
<p>A confusion matrix of a binary classifier is simply a 2&#215;2 matrix that tells how many 0&#8217;s and how many 1&#8217;s were predicted correctly and wrongly. A lot of metrics are derived from the matrix and printed in the lines above. The most important at the moment is the <strong>62% prediction accuracy</strong>. This may hint that I bashed price action trading a little prematurely. But of course the 62% might have been just luck. We&#8217;ll see that later when we run a WFO test.</p>
<p>A final piece of advice: R packages are occasionally updated, with the possible consequence that previous R code suddenly works differently, or not at all. This really happens, so test carefully after any update.</p>
<h3>Step 7: The strategy</h3>
<p>Now that we&#8217;ve tested our algorithm and got some prediction accuracy above 50% with a test data set, we can finally code our machine learning strategy. In fact we&#8217;ve already coded most of it; we just have to add a few lines to the above Zorro script that exported the data set. This is the final script for training, testing, and (theoretically) trading the system (<strong>DeepLearn.c</strong>):</p>
<pre class="prettyprint">#include &lt;r.h&gt;

function run()
{
	StartDate = 20140601;
	BarPeriod = 60;	// 1 hour
	LookBack = 100;

	WFOPeriod = 252*24; // 1 year
	DataSplit = 90;
	NumCores = -1;  // use all CPU cores but one

	set(RULES);
	Spread = RollLong = RollShort = Commission = Slippage = 0;
	LifeTime = 3;
	if(Train) Hedge = 2;
	
	if(adviseLong(NEURAL+BALANCED,0,
		change(1),change(2),change(3),change(4),
		range(1),range(2),range(3),range(4)) &gt; 0.5) 
		enterLong();
	if(adviseShort() &gt; 0.5) 
		enterShort();
}</pre>
<p>We&#8217;re using a WFO cycle of one year, split into a 90% training and a 10% out-of-sample test period. You might ask why I earlier used two years&#8217; data and a different split, 80/20, for calibrating the network in steps 5 and 6. This is for using differently composed data for calibrating and for walk forward testing. If we used exactly the same data, the calibration might overfit it and compromise the test.</p>
<p>The selected WFO parameters mean that the system is trained with about 225 days of data, followed by a 25-day test or trade period. Thus, in live trading the system would retrain every 25 days, using the prices from the previous 225 days. In the literature you&#8217;ll sometimes find the recommendation to retrain a machine learning system after every trade, or at least every day. But this does not make much sense to me. When you used almost one year&#8217;s data for training a system, it obviously cannot deteriorate after a single day. Or if it did, and only produced positive test results with daily retraining, I would strongly suspect that the results are artifacts of some coding mistake.</p>
<p>Training a deep network really takes a long time, in our case about 10 minutes for a network with 3 hidden layers and 200 neurons. In live trading this would be done by a second Zorro process that is automatically started by the trading Zorro. In the backtest, the system trains in every WFO cycle. Therefore using multiple cores is recommended for training many cycles in parallel. The <strong>NumCores</strong> variable at <strong>-1</strong> activates all CPU cores but one. Multiple cores are only available in Zorro S, so a complete walk forward test with all WFO cycles can take several hours with the free version.</p>
<p>In the script we now train both long and short trades. For this we have to allow hedging in Training mode, since long and short positions are open at the same time. Entering a position now depends on the return value of the <strong>advise</strong> function, which in turn calls either the <strong>neural.train</strong> or the <strong>neural.predict</strong> function from the R script. So we enter a position whenever the neural net predicts a result above 0.5.</p>
<p>The R script is now controlled by the Zorro script (for this it must have the same name, <strong>DeepLearn.r</strong>, only with a different extension). It is identical to our R script above since we&#8217;re using the same network parameters. Only one additional function is needed for supporting a WFO test:</p>
<pre class="prettyprint">neural.save = function(name)
{
  save(Models,file=name)  
}</pre>
<p>The <strong>neural.save</strong> function stores the <strong>Models</strong> list &#8211; it now contains two models, one for long and one for short trades &#8211; after every training run in Zorro&#8217;s Data folder. Since the models are stored for later use, we do not need to train them again for repeated test runs.</p>
<p>This is the WFO equity curve generated with the script above (EUR/USD, without trading costs):</p>
<figure id="attachment_2037" aria-describedby="caption-attachment-2037" style="width: 879px" class="wp-caption alignnone"><a href="http://www.financial-hacker.com/wp-content/uploads/2016/10/DeepLearn_EURUSD.png"><img decoding="async" class="wp-image-2037 size-full" src="http://www.financial-hacker.com/wp-content/uploads/2016/10/DeepLearn_EURUSD.png" width="879" height="341" srcset="https://financial-hacker.com/wp-content/uploads/2016/10/DeepLearn_EURUSD.png 879w, https://financial-hacker.com/wp-content/uploads/2016/10/DeepLearn_EURUSD-300x116.png 300w, https://financial-hacker.com/wp-content/uploads/2016/10/DeepLearn_EURUSD-768x298.png 768w" sizes="(max-width: 709px) 85vw, (max-width: 909px) 67vw, (max-width: 1362px) 62vw, 840px" /></a><figcaption id="caption-attachment-2037" class="wp-caption-text">EUR/USD equity curve with 50-100-50 network structure</figcaption></figure>
<p>Although not all WFO cycles get a positive result, it seems that there is some predictive effect. The curve is equivalent to an annual return of 89%, achieved with a 50-100-50 hidden layer structure. We&#8217;ll check in the next step how different network structures affect the result.</p>
<p>Since the <strong>neural.init</strong>, <strong>neural.train</strong>, <strong>neural.predict</strong>, and <strong>neural.save</strong> functions are automatically called by Zorro&#8217;s adviseLong/adviseShort functions, there are no R functions directly called in the Zorro script. Thus the script can remain unchanged when using a different machine learning method. Only the <strong>DeepLearn.r</strong> script must be modified and the neural net, for instance, replaced by a support vector machine. For trading such a machine learning system live on a VPS, make sure that R is also installed on the VPS, the needed R packages are installed, and the path to the R terminal set up in Zorro&#8217;s ini file. Otherwise you&#8217;ll get an error message when starting the strategy.</p>
<h3>Step 8: The experiment</h3>
<p>If our goal had been developing a strategy, the next steps would be the reality check, risk and money management, and preparing for live trading, just as described under <a href="http://www.financial-hacker.com/build-better-strategies-part-3-the-development-process/" target="_blank" rel="noopener">model-based strategy development</a>. But for our experiment we&#8217;ll now run a series of tests, with the number of neurons per layer increased from 10 to 100 in 3 steps, and 1, 2, or 3 hidden layers (deepnet does not support more than 3). So we&#8217;re looking into the following 9 network structures: c(10), c(10,10), c(10,10,10), c(30), c(30,30), c(30,30,30), c(100), c(100,100), c(100,100,100). For this experiment you need an afternoon even with a fast PC in multiple-core mode. Here are the results (SR = Sharpe ratio, R2 = slope linearity):</p>
<table cellspacing="0" cellpadding="0">
<tbody>
<tr style="height: 28px;">
<td style="width: 40px;"> </td>
<td>* 10 neurons</td>
<td>* 30 neurons</td>
<td>* 100 neurons</td>
</tr>
<tr style="height: 28.75px;">
<td style="width: 40px;">1</td>
<td>
<figure id="attachment_2047" aria-describedby="caption-attachment-2047" style="width: 300px" class="wp-caption alignnone"><a href="http://www.financial-hacker.com/wp-content/uploads/2016/10/DeepLearn_EURUSD-1.png"><img loading="lazy" decoding="async" class="wp-image-2047 size-medium" src="http://www.financial-hacker.com/wp-content/uploads/2016/10/DeepLearn_EURUSD-1-300x116.png" width="300" height="116" srcset="https://financial-hacker.com/wp-content/uploads/2016/10/DeepLearn_EURUSD-1-300x116.png 300w, https://financial-hacker.com/wp-content/uploads/2016/10/DeepLearn_EURUSD-1-768x298.png 768w, https://financial-hacker.com/wp-content/uploads/2016/10/DeepLearn_EURUSD-1.png 879w" sizes="auto, (max-width: 300px) 85vw, 300px" /></a><figcaption id="caption-attachment-2047" class="wp-caption-text">SR = 0.55 R2 = 0.00</figcaption></figure>
</td>
<td>
<figure id="attachment_2048" aria-describedby="caption-attachment-2048" style="width: 300px" class="wp-caption alignnone"><a href="http://www.financial-hacker.com/wp-content/uploads/2016/10/DeepLearn_EURUSD-2.png"><img loading="lazy" decoding="async" class="wp-image-2048 size-medium" src="http://www.financial-hacker.com/wp-content/uploads/2016/10/DeepLearn_EURUSD-2-300x116.png" width="300" height="116" srcset="https://financial-hacker.com/wp-content/uploads/2016/10/DeepLearn_EURUSD-2-300x116.png 300w, https://financial-hacker.com/wp-content/uploads/2016/10/DeepLearn_EURUSD-2-768x298.png 768w, https://financial-hacker.com/wp-content/uploads/2016/10/DeepLearn_EURUSD-2.png 879w" sizes="auto, (max-width: 300px) 85vw, 300px" /></a><figcaption id="caption-attachment-2048" class="wp-caption-text">SR = 1.02 R2 = 0.51</figcaption></figure>
</td>
<td>
<figure id="attachment_2049" aria-describedby="caption-attachment-2049" style="width: 300px" class="wp-caption alignnone"><a href="http://www.financial-hacker.com/wp-content/uploads/2016/10/DeepLearn_EURUSD-3.png"><img loading="lazy" decoding="async" class="wp-image-2049 size-medium" src="http://www.financial-hacker.com/wp-content/uploads/2016/10/DeepLearn_EURUSD-3-300x116.png" width="300" height="116" srcset="https://financial-hacker.com/wp-content/uploads/2016/10/DeepLearn_EURUSD-3-300x116.png 300w, https://financial-hacker.com/wp-content/uploads/2016/10/DeepLearn_EURUSD-3-768x298.png 768w, https://financial-hacker.com/wp-content/uploads/2016/10/DeepLearn_EURUSD-3.png 879w" sizes="auto, (max-width: 300px) 85vw, 300px" /></a><figcaption id="caption-attachment-2049" class="wp-caption-text">SR = 1.18 R2 = 0.84</figcaption></figure>
</td>
</tr>
<tr style="height: 28px;">
<td style="width: 40px;">2</td>
<td>
<figure id="attachment_2050" aria-describedby="caption-attachment-2050" style="width: 300px" class="wp-caption alignnone"><a href="http://www.financial-hacker.com/wp-content/uploads/2016/10/DeepLearn_EURUSD-5.png"><img loading="lazy" decoding="async" class="wp-image-2050 size-medium" src="http://www.financial-hacker.com/wp-content/uploads/2016/10/DeepLearn_EURUSD-5-300x116.png" width="300" height="116" /></a><figcaption id="caption-attachment-2050" class="wp-caption-text">SR = 0.98 R2 = 0.57</figcaption></figure>
</td>
<td>
<figure id="attachment_2052" aria-describedby="caption-attachment-2052" style="width: 300px" class="wp-caption alignnone"><a href="http://www.financial-hacker.com/wp-content/uploads/2016/10/DeepLearn_EURUSD-6.png"><img loading="lazy" decoding="async" class="wp-image-2052 size-medium" src="http://www.financial-hacker.com/wp-content/uploads/2016/10/DeepLearn_EURUSD-6-300x116.png" width="300" height="116" srcset="https://financial-hacker.com/wp-content/uploads/2016/10/DeepLearn_EURUSD-6-300x116.png 300w, https://financial-hacker.com/wp-content/uploads/2016/10/DeepLearn_EURUSD-6-768x298.png 768w, https://financial-hacker.com/wp-content/uploads/2016/10/DeepLearn_EURUSD-6.png 879w" sizes="auto, (max-width: 300px) 85vw, 300px" /></a><figcaption id="caption-attachment-2052" class="wp-caption-text">SR = 1.22 R2 = 0.70</figcaption></figure>
</td>
<td>
<figure id="attachment_2054" aria-describedby="caption-attachment-2054" style="width: 300px" class="wp-caption alignnone"><a href="http://www.financial-hacker.com/wp-content/uploads/2016/10/DeepLearn_EURUSD-8.png"><img loading="lazy" decoding="async" class="wp-image-2054 size-medium" src="http://www.financial-hacker.com/wp-content/uploads/2016/10/DeepLearn_EURUSD-8-300x116.png" width="300" height="116" srcset="https://financial-hacker.com/wp-content/uploads/2016/10/DeepLearn_EURUSD-8-300x116.png 300w, https://financial-hacker.com/wp-content/uploads/2016/10/DeepLearn_EURUSD-8-768x298.png 768w, https://financial-hacker.com/wp-content/uploads/2016/10/DeepLearn_EURUSD-8.png 879w" sizes="auto, (max-width: 300px) 85vw, 300px" /></a><figcaption id="caption-attachment-2054" class="wp-caption-text">SR = 0.84 R2 = 0.60</figcaption></figure>
</td>
</tr>
<tr style="height: 28px;">
<td style="width: 40px;">3</td>
<td>
<figure id="attachment_2051" aria-describedby="caption-attachment-2051" style="width: 300px" class="wp-caption alignnone"><a href="http://www.financial-hacker.com/wp-content/uploads/2016/10/DeepLearn_EURUSD-4.png"><img loading="lazy" decoding="async" class="wp-image-2051 size-medium" src="http://www.financial-hacker.com/wp-content/uploads/2016/10/DeepLearn_EURUSD-4-300x116.png" width="300" height="116" /></a><figcaption id="caption-attachment-2051" class="wp-caption-text">SR = 1.24 R2 = 0.79</figcaption></figure>
</td>
<td>
<figure id="attachment_2053" aria-describedby="caption-attachment-2053" style="width: 300px" class="wp-caption alignnone"><a href="http://www.financial-hacker.com/wp-content/uploads/2016/10/DeepLearn_EURUSD-7.png"><img loading="lazy" decoding="async" class="wp-image-2053 size-medium" src="http://www.financial-hacker.com/wp-content/uploads/2016/10/DeepLearn_EURUSD-7-300x116.png" width="300" height="116" srcset="https://financial-hacker.com/wp-content/uploads/2016/10/DeepLearn_EURUSD-7-300x116.png 300w, https://financial-hacker.com/wp-content/uploads/2016/10/DeepLearn_EURUSD-7-768x298.png 768w, https://financial-hacker.com/wp-content/uploads/2016/10/DeepLearn_EURUSD-7.png 879w" sizes="auto, (max-width: 300px) 85vw, 300px" /></a><figcaption id="caption-attachment-2053" class="wp-caption-text">SR = 1.28 R2 = 0.87</figcaption></figure>
</td>
<td>
<figure id="attachment_2060" aria-describedby="caption-attachment-2060" style="width: 300px" class="wp-caption alignnone"><a href="http://www.financial-hacker.com/wp-content/uploads/2016/10/DeepLearn_EURUSD-9.png"><img loading="lazy" decoding="async" class="wp-image-2060 size-medium" src="http://www.financial-hacker.com/wp-content/uploads/2016/10/DeepLearn_EURUSD-9-300x116.png" width="300" height="116" srcset="https://financial-hacker.com/wp-content/uploads/2016/10/DeepLearn_EURUSD-9-300x116.png 300w, https://financial-hacker.com/wp-content/uploads/2016/10/DeepLearn_EURUSD-9-768x298.png 768w, https://financial-hacker.com/wp-content/uploads/2016/10/DeepLearn_EURUSD-9.png 879w" sizes="auto, (max-width: 300px) 85vw, 300px" /></a><figcaption id="caption-attachment-2060" class="wp-caption-text">SR = 1.33 R2 = 0.83</figcaption></figure>
</td>
</tr>
</tbody>
</table>
<p>We see that a simple net with only 10 neurons in a single hidden layer won&#8217;t work well for short-term prediction. Network complexity clearly improves the performance, but only up to a certain point. A good result for our system is already achieved with 3 layers x 30 neurons. Even more neurons won&#8217;t help much and sometimes even produce a worse result. This is no real surprise: for processing only 8 inputs, 300 neurons can likely not do a better job than 100.</p>
<h3>Conclusion</h3>
<p>Our goal was to determine whether a few candles can have predictive power, and how the results are affected by the complexity of the algorithm. The results seem to suggest that short-term price movements can indeed sometimes be predicted by analyzing the changes and ranges of the last 4 candles. The prediction is not very accurate &#8211; it&#8217;s in the 58%..60% range, and most systems of the test series become unprofitable when trading costs are included. Still, I have to reconsider my opinion about price action trading. The fact that the prediction improves with network complexity is an especially convincing argument for short-term price predictability.</p>
<p>It would be interesting to look into the long-term stability of predictive price patterns. For this we would have to run another series of experiments and modify the training period (<strong>WFOPeriod</strong> in the script above) and the 90% IS/OOS split. This takes longer since it requires more historical data. I have done a few tests and found so far that one year indeed seems to be a good training period. The system deteriorates with training periods longer than a few years. Predictive price patterns, at least of EUR/USD, have a limited lifetime.</p>
<p>Where can we go from here? There&#8217;s a plethora of possibilities, for instance:</p>
<ul style="list-style-type: square;">
<li>Use inputs from more candles and process them with far bigger networks with thousands of neurons.</li>
<li>Use <a href="http://www.financial-hacker.com/better-tests-with-oversampling/">oversampling</a> for expanding the training data. Prediction always improves with more training samples.</li>
<li>Compress the time series, e.g. with spectral analysis, and analyze not the candles, but their frequency representation with machine learning methods.</li>
<li>Use inputs from many candles &#8211; such as 100 &#8211; and pre-process adjacent candles with one-dimensional convolutional network layers.</li>
<li>Use recurrent networks. Especially LSTMs could be very interesting for analyzing time series &#8211; and to my knowledge, they have rarely been used for financial prediction so far.</li>
<li>Use an ensemble of neural networks for prediction, such as Aronson&#8217;s &#8220;oracles&#8221; and &#8220;committees&#8221;.</li>
</ul>
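<p>To make the convolutional-layer idea above concrete, here is an illustrative Python toy (not part of the original C and R scripts): a one-dimensional kernel slides over adjacent candle values and compresses them into fewer features for the following network layers.</p>

```python
def conv1d(inputs, kernel, bias=0.0):
    """Slide a kernel over adjacent values (e.g. candle changes) and
    apply a ReLU - a minimal 1D convolutional layer pass that
    pre-compresses many candles into fewer features."""
    k = len(kernel)
    out = []
    for i in range(len(inputs) - k + 1):
        s = bias + sum(inputs[i + j] * kernel[j] for j in range(k))
        out.append(max(0.0, s))  # ReLU activation
    return out

# A [1, -1] kernel responds to falling adjacent candles:
features = conv1d([4.0, 3.0, 2.0, 1.0], [1.0, -1.0])  # -> [1.0, 1.0, 1.0]
```

In a real network the kernel weights would of course be trained, not hand-picked, and many kernels would run in parallel.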
<h3>Papers / Articles</h3>
<p>(1) <a href="http://home.iitk.ac.in/~ayushmn/mail/pre-train.pdf" target="_blank" rel="noopener">A.S.Sisodiya, Reducing Dimensionality of Data</a> <br />
(2) <a href="http://robotwealth.com/machine-learning-financial-prediction-david-aronson/" target="_blank" rel="noopener">K.Longmore, Machine Learning for Financial Prediction</a> <br />
(3) <a href="https://www.mql5.com/en/articles/2029" target="_blank" rel="noopener">V.Perervenko, Selection of Variables for Machine Learning</a> <br />
(4) <a href="http://jmlr.org/papers/volume11/erhan10a/erhan10a.pdf" target="_blank" rel="noopener">D.Erhan et al, Why Does Pre-training Help Deep Learning?</a></p>
<hr />
<p>I&#8217;ve added the C and R scripts to the 2016 script repository. You need both in Zorro&#8217;s Strategy folder. Zorro version 1.474 and R version 3.2.5 (64 bit) were used for the experiment, but it should also work with other versions. </p>
]]></content:encoded>
					
					<wfw:commentRss>https://financial-hacker.com/build-better-strategies-part-5-developing-a-machine-learning-system/feed/</wfw:commentRss>
			<slash:comments>114</slash:comments>
		
		
			</item>
		<item>
		<title>Build Better Strategies! Part 3: The Development Process</title>
		<link>https://financial-hacker.com/build-better-strategies-part-3-the-development-process/</link>
					<comments>https://financial-hacker.com/build-better-strategies-part-3-the-development-process/#comments</comments>
		
		<dc:creator><![CDATA[jcl]]></dc:creator>
		<pubDate>Mon, 22 Feb 2016 16:46:32 +0000</pubDate>
				<category><![CDATA[System Development]]></category>
		<category><![CDATA[Cycles]]></category>
		<category><![CDATA[Detrending]]></category>
		<category><![CDATA[Drawdown]]></category>
		<category><![CDATA[Medallion fund]]></category>
		<category><![CDATA[Money management]]></category>
		<category><![CDATA[Spectral filter]]></category>
		<category><![CDATA[Walk forward analysis]]></category>
		<guid isPermaLink="false">http://www.financial-hacker.com/?p=1191</guid>

					<description><![CDATA[This is the third part of the Build Better Strategies series. In the previous part we&#8217;ve discussed the 10 most-exploited market inefficiencies and gave some examples of their trading strategies. In this part we&#8217;ll analyze the general process of developing a model-based trading system. As almost anything, you can do trading strategies in (at least) &#8230; <a href="https://financial-hacker.com/build-better-strategies-part-3-the-development-process/" class="more-link">Continue reading<span class="screen-reader-text"> "Build Better Strategies! Part 3: The Development Process"</span></a>]]></description>
					<content:encoded><![CDATA[<p>This is the third part of the <a href="http://www.financial-hacker.com/build-better-strategies/">Build Better Strategies</a> series. In the previous part we&#8217;ve discussed the 10 most-exploited market inefficiencies and gave some examples of their trading strategies. In this part we&#8217;ll analyze the general process of developing a model-based trading system. As with almost anything, you can do trading strategies in (at least) two different ways: There&#8217;s the <strong>ideal way</strong>, and there&#8217;s the <strong>real way</strong>. We begin with the <strong>ideal development process</strong>, broken down to 10 steps.<span id="more-1191"></span></p>
<h3>The ideal model-based strategy development<br />
Step 1: The model</h3>
<p>Select one of the known market inefficiencies listed in the <a href="http://www.financial-hacker.com/build-better-strategies-part-2-model-based-systems/">previous part</a>, or discover a new one. You could eyeball through price curves and look for something suspicious that can be explained by a certain market behavior. Or the other way around, theorize about a behavior pattern and check if you can find it reflected in the prices. If you discover something new, feel invited to post it here! But be careful: Models of non-existing inefficiencies (such as <a href="http://www.financial-hacker.com/seventeen-popular-trade-strategies-that-i-dont-really-understand/">Elliott Waves</a>) already outnumber real inefficiencies by a large margin. It is not likely that a real inefficiency remains unknown to this day.</p>
<p>Once you&#8217;ve decided on a model, determine which <strong>price curve anomaly</strong> it would produce, and describe it with a quantitative formula or at least a qualitative criterion. You&#8217;ll need that for the next step. As an example we&#8217;re using the <strong>Cycles Model</strong> from the previous part:</p>
<span class="wp-katex-eq katex-display" data-display="true">y_t ~=~ \hat{y} + \sum_{i}{a_i \sin(2 \pi t/C_i+D_i)} + \epsilon</span>
<p>(Cycles are not to be underestimated. One of the most successful funds in history &#8211; Jim Simons&#8217; <strong>Renaissance Medallion</strong> fund &#8211; is rumored to exploit cycles in price curves by analyzing their lengths (<em><strong>C<sub>i</sub></strong></em>), phases (<em><strong>D<sub>i</sub></strong></em>) and amplitudes (<em><strong>a<sub>i</sub></strong></em>) with a Hidden Markov Model. Don&#8217;t worry, we&#8217;ll use a somewhat simpler approach in our example.)</p>
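<p>For illustration, the Cycles Model formula can be evaluated directly. This Python sketch (with made-up amplitudes, cycle lengths, and phases) generates a synthetic curve as a sum of sine cycles around a mean price:</p>

```python
import math

def cycles_model(t, mean, cycles, noise=0.0):
    """Price at time t per the Cycles Model: mean plus a sum of sine
    waves with amplitude a_i, cycle length C_i and phase D_i, plus noise."""
    y = mean
    for a, C, D in cycles:
        y += a * math.sin(2 * math.pi * t / C + D)
    return y + noise

# Hypothetical example: two superimposed cycles of 20 and 55 bars
curve = [cycles_model(t, 100.0, [(1.0, 20, 0.0), (0.5, 55, math.pi / 3)])
         for t in range(200)]
```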
<h3>Step 2: Research</h3>
<p>Find out if the hypothetical anomaly really appears in the price curves of the assets that you want to trade. For this you first need enough historical data of the traded assets &#8211; D1, M1, or Tick data, depending on the time frame of the anomaly. How far back? As far as possible, since you want to find out the lifetime of your anomaly and the market conditions under which it appears. Write a script to detect and display the anomaly in price data. For our Cycles Model, this would be the frequency spectrum:</p>
<figure id="attachment_1160" aria-describedby="caption-attachment-1160" style="width: 1599px" class="wp-caption alignnone"><a href="http://www.financial-hacker.com/wp-content/uploads/2015/11/spectrum.png"><img loading="lazy" decoding="async" class="wp-image-1160 size-full" src="http://www.financial-hacker.com/wp-content/uploads/2015/11/spectrum.png" alt="" width="1599" height="321" srcset="https://financial-hacker.com/wp-content/uploads/2015/11/spectrum.png 1599w, https://financial-hacker.com/wp-content/uploads/2015/11/spectrum-300x60.png 300w, https://financial-hacker.com/wp-content/uploads/2015/11/spectrum-1024x206.png 1024w, https://financial-hacker.com/wp-content/uploads/2015/11/spectrum-1200x241.png 1200w" sizes="auto, (max-width: 709px) 85vw, (max-width: 909px) 67vw, (max-width: 1362px) 62vw, 840px" /></a><figcaption id="caption-attachment-1160" class="wp-caption-text">EUR/USD frequency spectrum, cycle amplitude vs. cycle length in bars</figcaption></figure>
<p>Check out how the spectrum changes over the months and years. Compare with the spectrum of random data (with <a href="http://www.financial-hacker.com/hackers-tools-zorro-and-r/" target="_blank" rel="noopener noreferrer">Zorro</a> you can use the <a href="http://zorro-project.com/manual/en/detrend.htm" target="_blank" rel="noopener noreferrer">Detrend</a> function for randomizing price curves). If you find no clear signs of the anomaly, or no significant difference to random data, improve your detection method. And if you then still don&#8217;t succeed, go back to step 1.</p>
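<p>A detection script of this kind essentially boils down to a Fourier transform of the mean-removed prices. As a rough, platform-independent sketch in Python (illustrative only &#8211; not the Zorro script that produced the chart above):</p>

```python
import cmath, math

def amplitude_spectrum(prices):
    """Cycle amplitude per frequency bin of the mean-removed price
    series; bin k corresponds to a cycle length of N/k bars."""
    n = len(prices)
    mean = sum(prices) / n
    x = [p - mean for p in prices]
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) * 2 / n
            for k in range(1, n // 2)]

# A clean 25-bar cycle in a 100-bar window shows up as a single
# spectral line at bin k = 4 (cycle length 100/4 = 25 bars):
spectrum = amplitude_spectrum(
    [math.sin(2 * math.pi * t / 25) for t in range(100)])
```

Running the same function on a shuffled version of the series (as with the Detrend randomization) would show no such dominant line.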
<h3>Step 3: The algorithm</h3>
<p>Write an algorithm that generates the <strong>trade signals</strong> for buying in the direction of the anomaly. A market inefficiency normally has only a <strong>very weak effect</strong> on the price curve. So your algorithm must be really good at distinguishing it from random noise. At the same time it should be as simple as possible, and rely on as few free parameters as possible. In our example with the Cycles Model, the script reverses the position at every valley and peak of a sine curve that runs ahead of the dominant cycle:</p>
<pre class="prettyprint">function run()
{
  vars Price = series(price());
  var Phase = DominantPhase(Price,10);
  vars Signal = series(sin(Phase+PI/4));

  if(valley(Signal))
    reverseLong(1); 
  else if(peak(Signal))
    reverseShort(1);
}</pre>
<p>This is the core of the system. Now it&#8217;s time for a first backtest. The precise performance does not matter much at this point &#8211; just determine whether the algorithm has an edge or not. Can it produce a series of profitable trades at least in certain market periods or situations? If not, improve the algorithm or write another one that exploits the same anomaly with a different method. But do not yet use any stops, trailing, or other bells and whistles. They would only distort the result, and give you the illusion of profit where there is none. Your algorithm must be able to produce positive returns either with pure reversal, or at least with a timed exit.</p>
<p>In this step you must also decide about the <strong>backtest data</strong>. You normally need M1 or tick data for a realistic test. Daily data won&#8217;t do. The data amount depends on the lifetime (determined in step 2) and the nature of the price anomaly. Naturally, the longer the period, the better the test &#8211; but more is not always better. Normally it makes no sense to go further back than 10 years, at least not when your system exploits some real market behavior. Markets change extremely in a decade. Outdated historical price data can produce very misleading results. Most systems that had an edge 15 years ago will fail miserably on today&#8217;s markets. But they can deceive you with a seemingly profitable backtest.</p>
<h3>Step 4: The filter</h3>
<p>No market inefficiency exists all the time. Any market goes through periods of random behavior. It is essential for any system to have a filter mechanism that detects if the inefficiency is present or not. The filter is at least as important as the trade signal, if not more &#8211; but it&#8217;s often forgotten in trade systems. This is our example script with a filter:</p>
<pre class="prettyprint">function run()
{
  vars Price = series(price());
  var Phase = DominantPhase(Price,10);
  vars Signal = series(sin(Phase+PI/4));
  vars Dominant = series(BandPass(Price,rDominantPeriod,1));
  var Threshold = 1*PIP;
  ExitTime = 10*rDominantPeriod;
	
  if(Amplitude(Dominant,100) &gt; Threshold) {
    if(valley(Signal))
      reverseLong(1); 
    else if(peak(Signal))
      reverseShort(1);
  }
}</pre>
<p>We apply a bandpass filter centered at the dominant cycle period to the price curve and measure its amplitude. If the amplitude is above a threshold, we conclude that the inefficiency is there, and we trade. The trade duration is now also restricted to a maximum of 10 cycles since we found in step 2 that dominant cycles appear and disappear in relatively short time.</p>
<p>What can go wrong in this step is falling for the temptation to add a filter just because it improves the test result. Any filter must have a rational reason in the market behavior or in the used signal algorithm. If your algorithm only works by adding irrational filters: back to step 3.</p>
<h3>Step 5: Optimizing (but not too much!)</h3>
<p>All parameters of a system affect the result, but only a few directly determine entry and exit points of trades dependent on the price curve. These &#8216;adaptable&#8217; parameters should be identified and optimized. In the above example, trade entry is determined by the phase of the forerunning sine curve and by the filter threshold, and trade exit is determined by the exit time. Other parameters &#8211; such as the filter constants of the <strong>DominantPhase</strong> and the <strong>BandPass</strong> functions &#8211; need not be adapted since their values do not depend on the market situation.</p>
<p>Adaption is an optimization procedure, and a big opportunity to fail without even noticing it. Often, genetic or brute force methods are applied for finding the &#8220;best&#8221; parameter combination at a profit peak in the parameter space. Many platforms even have &#8220;optimizers&#8221; for this purpose. Although this method indeed produces the best backtest result, it won&#8217;t help at all for the live performance of the system. In fact, a recent study (<a href="http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2745220">Wiecki et al. 2016</a>) showed that the better you optimize your parameters, the worse your system will fare in live trading! The reason for this paradoxical effect is that optimizing to maximum profit fits your system mostly to the noise in the historical price curve, since noise affects result peaks much more than market inefficiencies.</p>
<p>Rather than generating top backtest results, correct optimizing has other purposes:</p>
<ul style="list-style-type: square;">
<li>It can determine the <strong>susceptibility</strong> of your system to its parameters. If the system is great with a certain parameter combination, but loses its edge when their values change a tiny bit: back to step 3.</li>
<li>It can identify the parameter&#8217;s <strong>sweet spots</strong>. The sweet spot is the area of highest parameter robustness, i.e. where small parameter changes have little effect on the return. They are not the peaks, but the centers of broad hills in the parameter space.</li>
<li>It can adapt the system to different assets, and enable it to trade a <strong>portfolio</strong> of assets with slightly different parameters. It can also extend the <strong>lifetime</strong> of the system by adapting it to the current market situation in regular time intervals, parallel to live trading.</li>
</ul>
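<p>The sweet-spot idea can be made concrete: instead of taking the global peak of the parameter scan, pick the center of the widest region where the result stays close to the best. A hypothetical Python helper for illustration (not the actual selection algorithm of any optimizer):</p>

```python
def sweet_spot(values, results, tolerance=0.8):
    """Return the parameter value at the center of the widest contiguous
    region whose result stays above tolerance * best result -
    the broad hill, not the sharp peak."""
    floor = tolerance * max(results)
    best = (0, 0)  # (width, start index) of widest qualifying region
    i = 0
    while i < len(results):
        if results[i] >= floor:
            j = i
            while j < len(results) and results[j] >= floor:
                j += 1
            if j - i > best[0]:
                best = (j - i, i)
            i = j
        else:
            i += 1
    return values[best[1] + best[0] // 2]

# Hypothetical profit factors over a parameter scan: a sharp peak
# at 0.9 and a broad hill around 1.2..1.6
params  = [0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7]
profits = [0.9, 1.0, 2.1, 0.8, 1.1, 1.8, 1.9, 1.9, 1.8, 1.8, 0.9]
# sweet_spot(params, profits) picks 1.4 inside the hill, not 0.9
```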
<p>This is our example script with entry parameter optimization:</p>
<pre class="prettyprint">function run()
{
  vars Price = series(price());
  var Phase = DominantPhase(Price,10);
  vars Signal = series(sin(Phase+optimize(1,0.7,2)*PI/4));
  vars Dominant = series(BandPass(Price,rDominantPeriod,1));
  ExitTime = 10*rDominantPeriod;
  var Threshold = optimize(1,0.7,2)*PIP;
	
  if(Amplitude(Dominant,100) &gt; Threshold) {
    if(valley(Signal))
      reverseLong(1); 
    else if(peak(Signal))
      reverseShort(1);
  }
}</pre>
<p>The two <a href="http://manual.zorro-project.com/optimize.htm" target="_blank" rel="noopener noreferrer">optimize</a> calls use a start value (<strong>1.0</strong> in both cases) and a range (<strong>0.7..2.0</strong>) for determining the sweet spots of the two essential parameters of the system. You can identify the spots in the profit factor curves (red bars) of the two parameters that are generated by the optimization process:</p>
<p><figure id="attachment_1377" aria-describedby="caption-attachment-1377" style="width: 391px" class="wp-caption alignnone"><a href="http://www.financial-hacker.com/wp-content/uploads/2016/02/CyclesDev_EURUSD_p1.png"><img loading="lazy" decoding="async" class="wp-image-1377 size-full" src="http://www.financial-hacker.com/wp-content/uploads/2016/02/CyclesDev_EURUSD_p1.png" alt="" width="391" height="321" srcset="https://financial-hacker.com/wp-content/uploads/2016/02/CyclesDev_EURUSD_p1.png 391w, https://financial-hacker.com/wp-content/uploads/2016/02/CyclesDev_EURUSD_p1-300x246.png 300w" sizes="auto, (max-width: 391px) 85vw, 391px" /></a><figcaption id="caption-attachment-1377" class="wp-caption-text">Sine phase in pi/4 units</figcaption></figure> <figure id="attachment_1376" aria-describedby="caption-attachment-1376" style="width: 391px" class="wp-caption alignnone"><a href="http://www.financial-hacker.com/wp-content/uploads/2016/02/CyclesDev_EURUSD_p2.png"><img loading="lazy" decoding="async" class="wp-image-1376 size-full" src="http://www.financial-hacker.com/wp-content/uploads/2016/02/CyclesDev_EURUSD_p2.png" alt="" width="391" height="321" srcset="https://financial-hacker.com/wp-content/uploads/2016/02/CyclesDev_EURUSD_p2.png 391w, https://financial-hacker.com/wp-content/uploads/2016/02/CyclesDev_EURUSD_p2-300x246.png 300w" sizes="auto, (max-width: 391px) 85vw, 391px" /></a><figcaption id="caption-attachment-1376" class="wp-caption-text">Amplitude threshold in pips</figcaption></figure></p>
<p>In this case the optimizer would select a parameter value of about 1.3 for the sine phase and about 1.0 (not the peak at 0.9) for the amplitude threshold for the current asset (EUR/USD). The exit time is not optimized in this step, as we&#8217;ll do that later together with the other exit parameters when risk management is implemented.</p>
<h3>Step 6: Out-of-sample analysis</h3>
<p>Of course the parameter optimization improved the backtest performance of the strategy, since the system was now better adapted to the price curve. So the test result so far is worthless. For getting an idea of the real performance, we first need to split the data into in-sample and out-of-sample periods. The in-sample periods are used for training, the out-of-sample periods for testing. The best method for this is <a href="http://manual.zorro-project.com/numwfocycles.htm" target="_blank" rel="noopener noreferrer">Walk Forward Analysis</a>. It uses a rolling window into the historical data for separating test and training periods.</p>
<p>Unfortunately, WFA adds two more parameters to the system: the training time and the test time of a WFA cycle. The test time should be long enough for trades to properly open and close, and short enough for the parameters to stay valid. The training time is more critical: too short a training period will not provide enough price data for effective optimization, too long a period will also produce bad results, since the market can already undergo changes within it. So the training time itself is a parameter that has to be optimized.</p>
<p>A five cycles walk forward analysis (add &#8220;<strong>NumWFOCycles = 5;</strong>&#8221; to the above script) reduces the backtest performance from 100% annual return to a more realistic 60%. For preventing that WFA still produces too optimistic results just by a lucky selection of test and training periods, it also makes sense to perform WFA several times with slightly different starting points of the simulation. If the system has an edge, the results should not be too different. If they vary wildly: back to step 3.</p>
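<p>The mechanics of the rolling window can be sketched as follows &#8211; an illustrative Python helper with a hypothetical 90% train share, not Zorro&#8217;s internal implementation. Each cycle trains on the first part of its window, tests on the rest, and the window then rolls forward by the test length:</p>

```python
def walk_forward_windows(n_bars, n_cycles, train_frac=0.9):
    """Rolling train/test windows for walk forward analysis.
    Returns (train_start, train_end, test_start, test_end) per cycle;
    each window rolls forward by the test length."""
    ratio = train_frac / (1.0 - train_frac)   # training bars per test bar
    test_len = round(n_bars / (n_cycles + ratio))
    train_len = round(test_len * ratio)
    windows = []
    start = 0
    for _ in range(n_cycles):
        windows.append((start, start + train_len,
                        start + train_len, start + train_len + test_len))
        start += test_len
    return windows

# 1400 bars, 5 cycles: 900-bar training window, 100-bar test window
windows = walk_forward_windows(1400, 5)
```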
<h3>Step 7: Reality Check</h3>
<p>Even though the test is now out-of-sample, the mere development process &#8211; selecting algorithms, assets, test periods and other ingredients by their performance &#8211; has added a lot of <strong>selection bias</strong> to the results. Are they caused by a real edge of the system, or just by biased development? Determining this with some certainty is the hardest part of strategy development.</p>
<p>The best way to find out is <a href="http://www.financial-hacker.com/whites-reality-check/">White&#8217;s Reality Check</a>. But it&#8217;s also the least practical because it requires strong discipline in parameter and algorithm selection. Other methods are not as good, but easier to apply:</p>
<ul>
<li><strong>Montecarlo</strong>. Randomize the price curve by shuffling without replacement, then train and test again. Repeat this many times. Plot a distribution of the results (an example of this method can be found in chapter 6 of the <a href="http://www.amazon.de/Das-B%C3%B6rsenhackerbuch-Finanziell-algorithmische-Handelssysteme/dp/1530310784" target="_blank" rel="noopener noreferrer">Börsenhackerbuch</a>). Randomizing removes all price anomalies, so you hope for significantly worse performance. But if the result from the real price curve lies not far east of the random distribution peak, it is probably also caused by randomness. That would mean: back to step 3.</li>
<li><strong>Variants.</strong> It&#8217;s the opposite of the Montecarlo method: Apply the trained system on variants of the price curve and hope for positive results. Variants that maintain most anomalies are <a href="http://www.financial-hacker.com/better-tests-with-oversampling/">oversampling</a>, detrending, or inverting the price curve. If the system stays profitable with those variants, but not with randomized prices, you might really have found a solid system.</li>
<li><strong>Really-out-of-sample (ROOS) Test</strong>. While developing the system, ignore the last year (2015) completely. Even delete all 2015 price history from your PC. Only when the system is completely finished, download the data and run a 2015 test. Since the 2015 data can be only used once this way and is then tainted, you cannot modify the system anymore if it fails in 2015. Just abandon it. Assemble all your mental strength and go back to step 1.</li>
</ul>
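<p>The Montecarlo check can be sketched in a few lines of Python (illustrative only &#8211; the real check would shuffle the prices, then retrain and re-test the full system): shuffle the price changes, which destroys any anomaly, re-run the strategy on each randomized curve, and estimate how often pure randomness matches or beats the real result.</p>

```python
import random

def montecarlo_pvalue(real_result, strategy, prices, runs=200, seed=7):
    """Fraction of randomized curves on which the strategy matches or
    beats the real result; near 1.0 means the result is likely luck."""
    random.seed(seed)
    changes = [b - a for a, b in zip(prices, prices[1:])]
    beat = 0
    for _ in range(runs):
        random.shuffle(changes)       # shuffling removes price anomalies
        curve = [prices[0]]
        for c in changes:
            curve.append(curve[-1] + c)
        if strategy(curve) >= real_result:
            beat += 1
    return beat / runs

# Sanity check with 'strategies' whose result ignores the curve:
prices = [float(p) for p in range(50)]
p_high = montecarlo_pvalue(1.0, lambda c: 2.0, prices)  # always beats -> 1.0
p_low = montecarlo_pvalue(2.0, lambda c: 1.0, prices)   # never beats -> 0.0
```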
<h3>Step 8: Risk management</h3>
<p>Your system has so far survived all tests. Now you can concentrate on reducing its risk and improving its performance. Do not touch the entry algorithm and its parameters anymore. You&#8217;re now optimizing the exit. Instead of the simple timed and reversal exits that we&#8217;ve used during the development phase, we can now apply various trailing stop mechanisms. For instance:</p>
<ul style="list-style-type: square;">
<li>Instead of exiting after a certain time, raise the stop loss by a certain amount per hour. This has the same effect, but will close unprofitable trades sooner and profitable trades later.</li>
<li>When a trade has won a certain amount, place the stop loss at a distance above the break even point. Even when locking a profit percentage does not improve the total performance, it&#8217;s good for your health. Seeing profitable trades wander back into the losing zone can cause serious ulcers.</li>
</ul>
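<p>The break-even lock from the second bullet is a small rule. An illustrative Python sketch (a hypothetical helper, not Zorro&#8217;s stop handling): once the trade has gained a trigger amount, move the stop just above break even, but never loosen an existing stop.</p>

```python
def breakeven_stop(entry, current_stop, price, lock_trigger, lock_distance,
                   is_long=True):
    """Once the open profit reaches lock_trigger, place the stop at
    lock_distance beyond the entry price; never loosen an existing stop."""
    if is_long:
        if price - entry >= lock_trigger:
            return max(current_stop, entry + lock_distance)
        return current_stop
    if entry - price >= lock_trigger:
        return min(current_stop, entry - lock_distance)
    return current_stop

# Long trade entered at 100, stop at 95: after a 6-point gain the
# stop is pulled up to 101, locking a small profit.
new_stop = breakeven_stop(100, 95, 106, lock_trigger=5, lock_distance=1)
```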
<p>This is our example script with the initial timed exit replaced by a stop loss limit that rises at every bar:</p>
<pre class="prettyprint">function run()
{
  vars Price = series(price());
  var Phase = DominantPhase(Price,10);
  vars Signal = series(sin(Phase+optimize(1,0.7,2)*PI/4));
  vars Dominant = series(BandPass(Price,rDominantPeriod,1));
  var Threshold = optimize(1,0.7,2)*PIP;

  Stop = ATR(100);
  for(open_trades)
    TradeStopLimit -= TradeStopDiff/(10*rDominantPeriod);
	
  if(Amplitude(Dominant,100) &gt; Threshold) {
    if(valley(Signal))
      reverseLong(1); 
    else if(peak(Signal))
      reverseShort(1);
  }
}</pre>
<p>The <strong>for(open_trades)</strong> loop increases the stop level of all open trades by a fraction of the initial stop loss distance at the end of every bar.</p>
<p>Of course you now have to optimize and run a walk forward analysis again with the exit parameters. If the performance doesn&#8217;t improve, think about better exit methods.</p>
<h3>Step 9: Money management</h3>
<p>Money management serves three purposes. First, reinvesting your profits. Second, distributing your capital among portfolio components. And third, quickly finding out if a trading book is useless. Open the &#8220;Money Management&#8221; chapter and read the author&#8217;s investment advice. If it&#8217;s &#8220;invest 1% of your capital per trade&#8221;, you know why he&#8217;s writing trading books. He probably has not yet earned money with real trading.</p>
<p>Suppose your trade volume at a given time <em><strong>t</strong></em> is <em><strong>V(t)</strong></em>. If your system is profitable, on average your capital <em><strong>C</strong></em> will rise proportionally to <b><i>V</i></b> with a growth factor <em><strong>c</strong></em>:</p>
<span class="wp-katex-eq katex-display" data-display="true">\frac{dC}{dt} = c V(t)~~\rightarrow~~ C(t) = C_0 + c \int_{0}^{t}{V(t) dt}</span>
<p>When you follow trading book advice and always invest a fixed percentage <em><strong>p</strong></em> of your capital, so that <em><strong>V(t) = p C(t)</strong></em>, your capital will grow exponentially with exponent <em><strong>p c</strong></em>:</p>
<span class="wp-katex-eq katex-display" data-display="true">\frac{dC}{dt} ~=~ c p C(t) ~~\rightarrow~~ C(t) ~=~ C_0 e^{p c t}</span>
<p>Unfortunately your capital will also undergo random fluctuations, called <strong>Drawdowns</strong>. Drawdowns are proportional to the trade volume <em><strong>V(t)</strong></em>. On leveraged accounts with no limit to drawdowns, it can be shown from statistical considerations that the maximum drawdown depth <em><strong>D<sub>max</sub></strong></em> grows proportional to the square root of time <em><strong>t</strong></em>:</p>
<span class="wp-katex-eq katex-display" data-display="true">{D_{max}}(t) ~=~ q V(t) \sqrt{t}</span>
<p>So, with the fixed percentage investment:</p>
<span class="wp-katex-eq katex-display" data-display="true">{D_{max}}(t) ~=~ q p C(t) \sqrt{t}</span>
<p>and at the time <em><strong>T = 1/(q p)<sup>2</sup></strong></em>:</p>
<span class="wp-katex-eq katex-display" data-display="true">{D_{max}}(T) ~=~ q p C(T) \frac{1}{q p} ~=~ C(T)</span>
<p>You can see that around the time <em><strong>T = 1/(q p)<sup>2</sup></strong></em> a drawdown will eat up all your capital <em><strong>C(T)</strong></em>, no matter how profitable your strategy is and how you&#8217;ve chosen <em><strong>p</strong></em>! That&#8217;s why the 1% rule is bad advice. And why I advise clients not to raise the trade volume proportionally to their accumulated profit, but to its square root &#8211; at least on leveraged accounts. Then, as long as the strategy does not deteriorate, they keep a safe distance from a margin call.</p>
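<p>A quick numeric illustration of that formula in Python, with made-up values for <strong>q</strong> and <strong>p</strong>:</p>

```python
import math

def ruin_time(q, p):
    """Time T at which the maximum drawdown q*p*C(T)*sqrt(T) equals
    the capital C(T) under fixed-fraction investing V = p*C."""
    return 1.0 / (q * p) ** 2

def relative_drawdown_fixed(q, p, t):
    """D_max / C for fixed-fraction sizing: grows with sqrt(t)."""
    return q * p * math.sqrt(t)

# With hypothetical q = 0.5 and p = 0.05 (5% of capital per trade),
# a full-capital drawdown is expected around t = 1600 time units,
# regardless of how profitable the strategy is.
T = ruin_time(0.5, 0.05)
```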
<p>Depending on whether you trade a single asset and algorithm or a portfolio of both, you can calculate the optimal investment with several methods. There&#8217;s the OptimalF formula by <strong>Ralph Vince</strong>, the Kelly formula by <strong>Ed Thorp</strong>, or <a href="http://www.financial-hacker.com/get-rich-slowly/">mean/variance optimization</a> by <strong>Harry Markowitz</strong>. Usually you won&#8217;t hard-code reinvesting in your strategy, but calculate the investment volume externally, since you might want to withdraw or deposit money from time to time. The overall volume is then set up manually, not by an automated process. A formula for proper reinvesting and withdrawing can be found in the Black Book.</p>
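<p>As an illustration of one of those methods, here&#8217;s the classic Kelly formula for a binary win/lose bet as a Python sketch. Real strategies have full return distributions rather than two outcomes, so treat this as a rough guide only; the function name and the example numbers are mine:</p>

```python
def kelly_fraction(win_rate, payoff_ratio):
    # Classic Kelly criterion for a binary bet:
    # f* = w - (1 - w) / b, with win probability w
    # and average-win / average-loss ratio b
    return win_rate - (1.0 - win_rate) / payoff_ratio

# Example: 40% winners, average winner 1.8x the average loser
f = kelly_fraction(0.40, 1.8)
print(round(f, 3))  # -> 0.067, i.e. risk about 6.7% of capital
```

<p>Note that full Kelly is aggressive; for the drawdown reasons above, practitioners commonly trade only a fraction of the Kelly result.</p>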
<h3>Step 10: Preparation for live trading</h3>
<p>You can now define the <strong>user interface</strong> of your trading system. Determine which parameters you want to change in real time, and which ones only at system start. Provide a method to control the trade volume, and a &#8216;Panic Button&#8217; for locking profit or cashing out in case of bad news. Display all trading relevant parameters in real time. Add buttons for re-training the system, and provide a method for comparing live results with backtest results, such as the <a href="http://www.financial-hacker.com/the-cold-blood-index/">Cold Blood Index</a>. Make sure that you can supervise the system from wherever you are, for instance through an online status page. Don&#8217;t be tempted to look at it every five minutes. But you can make a mighty impression when you pull out your mobile phone on the summit of Mt. Ararat and explain to your fellow climbers: &#8220;Just checking my trades.&#8221;</p>
<h3>The real strategy development</h3>
<p>So much for the theory. All fine and dandy, but how do you really develop a trading system? Everyone knows that there&#8217;s a huge gap between theory and practice. This is the real development process, as attested by many seasoned algo traders:</p>
<p><strong>Step 1.</strong> Visit trader forums and find the thread about the new indicator with the fabulous returns.</p>
<p><strong>Step 2.</strong> Get the indicator working with a test system after a long coding session. Ugh, the backtest result does not look that good. You must have made some coding mistake. Debug. Debug some more.</p>
<p><strong>Step 3.</strong> Still no good result, but you have more tricks up your sleeve. Add a trailing stop. The results already look better. Run a weekday analysis. Tuesday is a particularly bad day for this strategy? Add a filter that prevents trading on Tuesday. Add more filters that prevent trades between 10 and 12 am, and when the price is below $14.50, and at full moon except on Fridays. Wait a long time for the simulation to finish. Wow, finally the backtest is in the green!</p>
<p><strong>Step 4.</strong> Of course you&#8217;re not fooled by in-sample results. After optimizing all 23 parameters, run a walk forward analysis. Wait a long time for the simulation to finish. Ugh, the result does not look that good. Try different WFA cycles. Try different bar periods. Wait a long time for the simulation to finish. Finally, with a 19-minute bar period and 31 cycles, you get a sensational backtest result! And this time completely out of sample!</p>
<p><strong>Step 5.</strong> Trade the system live.</p>
<p><strong>Step 6.</strong> Ugh, the result does not look that good.</p>
<p><strong>Step 7.</strong> Wait a long time for your bank account to recover. In between, write a trading book.</p>
<hr />
<p>I&#8217;ve added the example script to the 2016 script repository. In the next part of this series we&#8217;ll look into the data mining approach with machine learning systems. We will examine price pattern detection, regression, neural networks, deep learning, decision trees, and support vector machines.</p>
<p style="text-align: right;"><strong>⇒ <a href="http://www.financial-hacker.com/build-better-strategies-part-4-machine-learning/">Build Better Strategies &#8211; Machine Learning</a></strong></p>
<p style="text-align: right;"><strong>⇒ <a href="https://financial-hacker.com/build-better-strategies-part-6-evaluation/">Build Better Strategies &#8211; Evaluation</a></strong></p>
]]></content:encoded>
					
					<wfw:commentRss>https://financial-hacker.com/build-better-strategies-part-3-the-development-process/feed/</wfw:commentRss>
			<slash:comments>61</slash:comments>
		
		
			</item>
		<item>
		<title>Better Tests with Oversampling</title>
		<link>https://financial-hacker.com/better-tests-with-oversampling/</link>
					<comments>https://financial-hacker.com/better-tests-with-oversampling/#comments</comments>
		
		<dc:creator><![CDATA[jcl]]></dc:creator>
		<pubDate>Mon, 23 Nov 2015 14:14:52 +0000</pubDate>
				<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[System Development]]></category>
		<category><![CDATA[System Evaluation]]></category>
		<category><![CDATA[Price action]]></category>
		<category><![CDATA[Time series oversampling]]></category>
		<category><![CDATA[Walk forward analysis]]></category>
		<category><![CDATA[Zorro]]></category>
		<guid isPermaLink="false">http://www.financial-hacker.com/?p=1055</guid>

					<description><![CDATA[The more data you use for testing or training your strategy, the less bias will affect the test result and the more accurate will be the training. The problem: price data is always in short supply. Even shorter since you must put aside some part for out-of-sample tests. Extending the test or training period far &#8230; <a href="https://financial-hacker.com/better-tests-with-oversampling/" class="more-link">Continue reading<span class="screen-reader-text"> "Better Tests with Oversampling"</span></a>]]></description>
										<content:encoded><![CDATA[<p>The more data you use for testing or training your strategy, the less bias will affect the test result and the more accurate will be the training. The problem: <strong>price data is always in short supply</strong>. Even shorter since you must put aside some part for out-of-sample tests. Extending the test or training period far into the past is not always a solution. The markets of the 1990s or 1980s were very different from today, so their price data can cause misleading results.<br />
   In this article I&#8217;ll describe a simple method to <strong>produce more trades</strong> for testing and training from the same amount of price data. As a side effect, you&#8217;ll get an additional metric for the robustness of your strategy. <span id="more-1055"></span></p>
<p>The price curve is normally divided into equal sections, named <strong>bars</strong>. Each bar has an associated <strong>candle</strong> with an open, close, high, low, and average price, which are used by the system for detecting patterns and generating trade signals. When the raw price data has a higher time resolution than a bar, which is normally the case, the candle prices are sampled like this:</p>
<p><span class="wp-katex-eq" data-display="false">Open ~=~ y_{t-dt} </span><br />
<span class="wp-katex-eq" data-display="false">High ~=~ max(y_{t-dt}~...~y_t)</span><br />
<span class="wp-katex-eq" data-display="false">Low ~=~ min(y_{t-dt}~...~y_t)</span><br />
<span class="wp-katex-eq" data-display="false">Close ~=~ y_t</span><br />
<span class="wp-katex-eq" data-display="false">Avg ~=~ 1/n \sum_{t-dt}^{t}{y_i}</span></p>
<p>where <em><strong>y<sub>t</sub></strong></em> is the raw price tick at time <em><strong>t</strong></em>, <em><strong>dt</strong></em> is the bar period, and <em><strong>n</strong></em> is the number of ticks per bar. If we now subdivide the bar period into <em><strong>m</strong></em> partitions and resample the bars with the time shifted by <em><strong>dt/m</strong></em>, we can produce <em><strong>m</strong></em> slightly different price curves from the same high resolution curve:</p>
<p><span class="wp-katex-eq" data-display="false">Open_j ~=~ y_{t-j/m dt-dt} </span><br />
<span class="wp-katex-eq" data-display="false">High_j ~=~ max(y_{t-j/m dt-dt}~...~y_{t-j/m dt})</span><br />
<span class="wp-katex-eq" data-display="false">Low_j ~=~ min(y_{t-j/m dt-dt}~...~y_{t-j/m dt})</span><br />
<span class="wp-katex-eq" data-display="false">Close_j ~=~ y_{t-j/m dt}</span><br />
<span class="wp-katex-eq" data-display="false">Avg_j ~=~ 1/n \sum_{t-j/m dt-dt}^{t-j/m dt}{y_i}</span></p>
<p><em><strong>m</strong></em> is the oversampling factor. The price curve index <em><strong>j</strong></em> runs from <em><strong>0</strong></em> to <em><strong>m-1</strong></em>. Each curve <em><strong>j</strong></em> obviously has the same overall properties as the original price curve, but slightly different candles. Some candles can even be extremely different in volatile market situations. Testing a strategy will thus normally produce a different result on each curve.</p>
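<p>The resampling above can be sketched in a few lines of Python. This is a toy illustration on a plain list of tick prices, not Zorro&#8217;s implementation; the function name and the tick-count parameters are my own assumptions:</p>

```python
def oversample(ticks, n, m):
    # Build m time-shifted OHLC bar series from one tick series.
    # n = ticks per bar, m = oversampling factor;
    # series j is shifted by j*n/m ticks, per the formulas above.
    shift = n // m
    series = []
    for j in range(m):
        bars = []
        for i in range(j * shift, len(ticks) - n + 1, n):
            window = ticks[i:i + n]
            bars.append({
                "open":  window[0],
                "high":  max(window),
                "low":   min(window),
                "close": window[-1],
                "avg":   sum(window) / n,
            })
        series.append(bars)
    return series

# Toy demo: 20 monotonically rising 'ticks', 4 ticks per bar, 2-fold oversampling
curves = oversample(list(range(20)), n=4, m=2)
print(curves[0][0])  # -> {'open': 0, 'high': 3, 'low': 0, 'close': 3, 'avg': 1.5}
print(curves[1][0]["open"], curves[1][0]["close"])  # -> 2 5
```

<p>Each of the <em><strong>m</strong></em> returned series covers the same data with bar boundaries shifted by <em><strong>dt/m</strong></em>, so a test run on each of them yields the slightly different results discussed above.</p>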
<p><strong>Time series oversampling</strong> can also be used with price-based bars, such as Renko bars or range bars, although the above equations then change to the composition formula of the specific bar type. I found that the profit factors of strategies can differ by up to 30% between oversampled price curves. A large variance in results hints that something with the strategy may be wrong &#8211; maybe it&#8217;s too sensitive to randomness and thus open to improvement. A strategy that produces large losses on some curves had better be discarded, even if the overall result is positive. And you can safely assume that live trading results are best represented by the worst of all the oversampled price curves.</p>
<h3>Price action example</h3>
<p>Time series oversampling is supported by the <a href="http://www.financial-hacker.com/hackers-tools-zorro-and-r/">Zorro platform</a>. This allows us to quickly check its pros and cons with example strategies. We&#8217;ll look into a simple price action strategy with candle patterns. This is a strategy of the <a href="http://www.financial-hacker.com/build-better-strategies/">data mining flavor</a>. It is not based on a market model, since no good model can explain a predictive power of candle patterns (if you know one, please let me know too!). This trading method therefore has an irrational touch, although it&#8217;s said to have worked for Japanese rice traders 300 years ago, maybe due to trading habits or behavior patterns of large market participants. Whatever the reason: while trading the old rice candle patterns in today&#8217;s markets cannot really be recommended, tests indeed hint at a <strong>weak and short-lived predictive power</strong> of 3-candle patterns in some currency pairs, such as EUR/USD. The emphasis is on short-lived: trading habits change, so predictive candle patterns expire within a few years, while new patterns emerge. </p>
<p>Here&#8217;s the Zorro script of such a strategy. In the training run it generates trading rules from 3-candle patterns that preceded profitable trades. In testing and live trading, a position is opened whenever a generated rule detects such a potentially profitable pattern. A walk forward test is used to prevent curve fitting, which is mandatory for all data mining systems:</p>
<p><!--?prettify linenums=true?--></p>
<pre class="prettyprint">function run()
{
  BarPeriod = 60; // 1-hour bars
  set(RULES+ALLCYCLES);
  NumYears = 10;
  NumWFOCycles = 10;

  if(Train) {
    Hedge = 2;	  // allow simultaneous long + short trades 
    Detrend = TRADES; // detrend on trade level
    MaxLong = MaxShort = 0; // no limit
  } else {
    Hedge = 1;	// long trade closes short and vice versa
    Detrend = 0;
    MaxLong = MaxShort = 1; // only 1 open position	
  }
	
  LifeTime = 3; // 3 hours trade time
  if(between(lhour(CET),9,13))  // European business hours
  {
    if(adviseLong(PATTERN+FAST+2,0, // train patterns with trade results
      priceHigh(2),priceLow(2),priceClose(2),
      priceHigh(1),priceLow(1),priceClose(1),
      priceHigh(1),priceLow(1),priceClose(1),
      priceHigh(0),priceLow(0),priceClose(0)) &gt; 50)
        enterLong();	
			
    if(adviseShort(PATTERN+FAST+2) &gt; 50)
      enterShort();
  }
}</pre>
<p>The core of the script is the adviseLong/adviseShort call, Zorro&#8217;s machine learning function (details are better explained in the <a href="http://manual.zorro-project.com/tutorial_pre.htm" target="_blank" rel="noopener noreferrer">Zorro tutorial</a>). The function is fed with patterns of 3 candles; the high, low, and close prices of adjacent candles are compared with each other (the open price is not used, as it&#8217;s identical to the previous close in 24-hour traded assets). The training target is the return of a 3-hour trade after the appearance of a pattern. We&#8217;re using a 3-hour trade time because the patterns consist of 3 bars, and it makes some sense to have a prediction horizon similar to the pattern length. Since we&#8217;re trading EUR/USD, we&#8217;re limiting the trades to European business hours. So the last trade must be entered at 13:00 for it to be closed at 16:00.</p>
<p>But when we train and test the above script with EUR/USD, we get no profitable strategy &#8211; at least not with realistic trading costs (an FXCM microlot account is simulated by default):</p>
<figure id="attachment_1139" aria-describedby="caption-attachment-1139" style="width: 879px" class="wp-caption alignnone"><a href="http://www.financial-hacker.com/wp-content/uploads/2015/11/priceaction2.png"><img loading="lazy" decoding="async" class="wp-image-1139 size-full" src="http://www.financial-hacker.com/wp-content/uploads/2015/11/priceaction2.png" alt="" width="879" height="321" srcset="https://financial-hacker.com/wp-content/uploads/2015/11/priceaction2.png 879w, https://financial-hacker.com/wp-content/uploads/2015/11/priceaction2-300x110.png 300w" sizes="auto, (max-width: 709px) 85vw, (max-width: 909px) 67vw, (max-width: 1362px) 62vw, 840px" /></a><figcaption id="caption-attachment-1139" class="wp-caption-text">Price action without oversampling, P&amp;L curve</figcaption></figure>
<p>We can see that the script seems to enter trades mostly at random, so the equity drops continuously at about the rate of the trading costs. The script performs a 10-year walk-forward test in 10 cycles. The default training/test split is 85%, so the test time is about 9 months after a training time of 4 years. 4 years correspond to roughly 4*250*24 = 24000 patterns to check. That&#8217;s apparently not enough to significantly distinguish profitable from random patterns.</p>
<p>The problem: <strong>We cannot simply extend the training time.</strong> When we do that, we&#8217;ll find that the result does not get better. The reason is the limited pattern lifetime. It makes no sense to train past the half-life of the found patterns. So this is not the solution. But what happens when we train and test the same strategy with 4-fold oversampling?<br />
<!--?prettify linenums=true?--></p>
<pre class="prettyprint">NumSampleCycles = 4;</pre>
<p>When we add this line to the script, the training process gets four times as many patterns. Although many of them are similar, the amount of data is now enough to distinguish profitable from random patterns with some accuracy. We can see this in the now positive P&amp;L curve over the likewise extended test periods:</p>
<figure id="attachment_1117" aria-describedby="caption-attachment-1117" style="width: 879px" class="wp-caption alignnone"><a href="http://www.financial-hacker.com/wp-content/uploads/2015/11/priceaction.png"><img loading="lazy" decoding="async" class="wp-image-1117 size-full" src="http://www.financial-hacker.com/wp-content/uploads/2015/11/priceaction.png" alt="" width="879" height="321" srcset="https://financial-hacker.com/wp-content/uploads/2015/11/priceaction.png 879w, https://financial-hacker.com/wp-content/uploads/2015/11/priceaction-300x110.png 300w" sizes="auto, (max-width: 709px) 85vw, (max-width: 909px) 67vw, (max-width: 1362px) 62vw, 840px" /></a><figcaption id="caption-attachment-1117" class="wp-caption-text">Price action with 4-fold oversampling, P&amp;L curve</figcaption></figure>
<p>You&#8217;ll also now get an additional section in the performance report, like this:</p>
<pre><code>Sample Cycles    Best    Worst    Avg  StdDev
Net Profit       5362$   4001$   4730$   4094$
Profit Factor     1.58    1.45    1.51    0.05
Num Trades        1256    1273    1240
Win Rate           41%     39%     40%</code></pre>
<p>Large deviations between the sample cycles will tell you that your strategy is unstable against random price curve fluctuations.</p>
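<p>The Best/Worst/Avg/StdDev columns of that report section are plain summary statistics over the per-cycle results, so you can run the same check on any metric yourself. A minimal Python sketch, with made-up per-cycle profit factors chosen only to resemble the report above:</p>

```python
from statistics import mean, pstdev

# Hypothetical profit factor of each oversampling cycle (illustrative numbers)
profit_factors = [1.58, 1.45, 1.49, 1.52]

best, worst = max(profit_factors), min(profit_factors)
spread = pstdev(profit_factors)  # population standard deviation
print(best, worst, round(mean(profit_factors), 2), round(spread, 3))  # -> 1.58 1.45 1.51 0.047

# Rule of thumb: a spread that is large relative to the mean flags
# a strategy that is unstable against random shifts of the price curve
```
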
<p>I&#8217;ve added the above script to the 2015 repository. But although it generates some profit, be aware that it&#8217;s for demonstration only and not of &#8216;industrial quality&#8217;. Sharpe ratio and R2 are not good, drawdowns are long, and essential ingredients such as stops, trailing, money management, portfolio diversification, filters, and DMB measurement are not included. So better don&#8217;t trade it live.</p>
<h3>Conclusion</h3>
<p>Admittedly the price action system is a drastic and somewhat dubious example of the benefits of oversampling. But I found that 4-fold or 6-fold oversampling improves optimization and training of almost all strategies, and also increases the quality of backtests by making them less susceptible to extreme candles and outliers. </p>
<p>Oversampling is certainly not a one-size-fits-all solution. It will not work when the system relies on a specific time for opening and closing trades, as in gap trading or systems based on daily bars. And it does not help either when single candles have little effect on the result, for instance when trade signals are generated from moving averages with very long time periods. But in most cases it noticeably improves testing and trading.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://financial-hacker.com/better-tests-with-oversampling/feed/</wfw:commentRss>
			<slash:comments>13</slash:comments>
		
		
			</item>
	</channel>
</rss>
