Data mining bias – The Financial Hacker

Better Strategies 5: A Short-Term Machine Learning System

jcl — Fri, 12 Aug 2016 09:42:38 +0000

It’s time for the 5th and final part of the Build Better Strategies series. In part 3 we’ve discussed the development process of a model-based system, and consequently we’ll conclude the series with developing a data-mining system. The principles of data mining and machine learning have been the topic of part 4. For our short-term trading example we’ll use a deep learning algorithm, a stacked autoencoder, but it will work in the same way with many other machine learning algorithms. With today’s software tools, only about 20 lines of code are needed for a machine learning strategy. I’ll try to explain all steps in detail.

Our example will be a research project – a machine learning experiment for answering two questions. Does a more complex algorithm – such as, more neurons and deeper learning – produce a better prediction? And are short-term price moves predictable by short-term price history? The last question came up due to my scepticism about price action trading in the previous part of this series. I got several emails asking about the “trading system generators” or similar price action tools that are praised on some websites. There is no hard evidence that such tools ever produced any profit (except for their vendors) – but does this mean that they all are garbage? We’ll see.

Our experiment is simple: We collect information from the last candles of a price curve, feed it in a deep learning neural net, and use it to predict the next candles. My hypothesis is that a few candles don’t contain any useful predictive information. Of course, a nonpredictive outcome of the experiment won’t mean that I’m right, since I could have used wrong parameters or prepared the data badly. But a predictive outcome would be a hint that I’m wrong and price action trading can indeed be profitable.

Machine learning strategy development
Step 1: The target variable

To recap the previous part: a supervised learning algorithm is trained with a set of features in order to predict a target variable. So the first thing to determine is what this target variable shall be. A popular target, used in most papers, is the sign of the price return at the next bar. Better suited for prediction, since less susceptible to randomness, is the price difference to a more distant prediction horizon, like 3 bars from now, or same day next week. Like almost anything in trading systems, the prediction horizon is a compromise between the effects of randomness (less bars are worse) and predictability (less bars are better).

Sometimes you’re not interested in directly predicting price, but in predicting some other parameter – such as the current leg of a Zigzag indicator – that could otherwise only be determined in hindsight. Or you want to know if a certain market inefficiency will be present in the next time, especially when you’re using machine learning not directly for trading, but for filtering trades in a model-based system. Or you want to predict something entirely different, for instance the probability of a market crash tomorrow. All this is often easier to predict than the popular tomorrow’s return.

In our price action experiment we’ll use the return of a short-term price action trade as target variable. Once the target is determined, next step is selecting the features.

Step 2: The features

A price curve is the worst case for any machine learning algorithm. Not only does it carry little signal and mostly noise, it is also nonstationary and the signal/noise ratio changes all the time. The exact ratio of signal and noise depends on what is meant with “signal”, but it is normally too low for any known machine learning algorithm to produce anything useful. So we must derive features from the price curve that contain more signal and less noise. Signal, in that context, is any information that can be used to predict the target, whatever it is. All the rest is noise.

Thus, selecting the features is critical for success – even more critical than deciding which machine learning algorithm you’re going to use. There are two approaches for selecting features. The first and most common is extracting as much information from the price curve as possible. Since you do not know where the information is hidden, you just generate a wild collection of indicators with a wide range of parameters, and hope that at least a few of them will contain the information that the algorithm needs. This is the approach that you normally find in the literature. The problem of this method: Any machine learning algorithm is easily confused by nonpredictive predictors. So it won’t do to just throw 150 indicators at it. You need some preselection algorithm that determines which of them carry useful information and which can be omitted. Without reducing the features this way to maybe eight or ten, even the deepest learning algorithm won’t produce anything useful.

The other approach, normally for experiments and research, is using only limited information from the price curve. This is the case here: Since we want to examine price action trading, we only use the last few prices as inputs, and must discard all the rest of the curve. This has the advantage that we don’t need any preselection algorithm since the number of features is limited anyway. Here are the two simple predictor functions that we use in our experiment (in C):

var change(int n)
{
	return scale((priceClose(0) - priceClose(n))/priceClose(0),100)/100;
}

var range(int n)
{
	return scale((HH(n) - LL(n))/priceClose(0),100)/100;
}

The two functions are supposed to carry the necessary information for price action: per-bar movement and volatility. The change function is the difference of the current price to the price of n bars before, divided by the current price. The range function is the total high-low distance of the last n candles, also in divided by the current price. And the scale function centers and compresses the values to the +/-100 range, so we divide them by 100 for getting them normalized to +/-1. We remember that normalizing is needed for machine learning algorithms.

Step 3: Preselecting predictors

When you have selected a large number of indicators or other signals as features for your algorithm, you must determine which of them is useful and which not. There are many methods for reducing the number of features, for instance:

Determine the correlations between the signals. Remove those with a strong correlation to other signals, since they do not contribute to the information.
Compare the information content of signals directly, with algorithms like information entropy or decision trees.
Determine the information content indirectly by comparing the signals with randomized signals; there are some software libraries for this, such as the R Boruta package.
Use an algorithm like Principal Components Analysis (PCA) for generating a new signal set with reduced dimensionality.
Use genetic optimization for determining the most important signals just by the most profitable results from the prediction process. Great for curve fitting if you want to publish impressive results in a research paper.

Reducing the number of features is important for most machine learning algorithms, including shallow neural nets. For deep learning it’s less important, since deep nets with many neurons are normally able to process huge feature sets and discard redundant features. For our experiment we do not preselect or preprocess the features, but you can find useful information about this in articles (1), (2), and (3) listed at the end of the page.

Step 4: Select the machine learning algorithm

R offers many different ML packages, and any of them offers many different algorithms with many different parameters. Even if you already decided about the method – here, deep learning – you have still the choice among different approaches and different R packages. Most are quite new, and you can find not many empirical information that helps your decision. You have to try them all and gain experience with different methods. For our experiment we’ve choosen the Deepnet package, which is probably the simplest and easiest to use deep learning library. This keeps our code short. We’re using its Stacked Autoencoder (SAE) algorithm for pre-training the network. Deepnet also offers a Restricted Boltzmann Machine (RBM) for pre-training, but I could not get good results from it. There are other and more complex deep learning packages for R, so you can spend a lot of time checking out all of them.

How pre-training works is easily explained, but why it works is a different matter. As to my knowledge, no one has yet come up with a solid mathematical proof that it works at all. Anyway, imagine a large neural net with many hidden layers:

Training the net means setting up the connection weights between the neurons. The usual method is error backpropagation. But it turns out that the more hidden layers you have, the worse it works. The backpropagated error terms get smaller and smaller from layer to layer, causing the first layers of the net to learn almost nothing. Which means that the predicted result becomes more and more dependent of the random initial state of the weights. This severely limited the complexity of layer-based neural nets and therefore the tasks that they can solve. At least until 10 years ago.

In 2006 scientists in Toronto first published the idea to pre-train the weights with an unsupervised learning algorithm, a restricted Boltzmann machine. This turned out a revolutionary concept. It boosted the development of artificial intelligence and allowed all sorts of new applications from Go-playing machines to self-driving cars. Meanwhile, several new improvements and algorithms for deep learning have been found. A stacked autoencoder works this way:

Select the hidden layer to train; begin with the first hidden layer. Connect its outputs to a temporary output layer that has the same structure as the network’s input layer.
Feed the network with the training samples, but without the targets. Train it so that the first hidden layer reproduces the input signal – the features – at its outputs as exactly as possible. The rest of the network is ignored. During training, apply a ‘weight penalty term’ so that as few connection weights as possible are used for reproducing the signal.
Now feed the outputs of the trained hidden layer to the inputs of the next untrained hidden layer, and repeat the training process so that the input signal is now reproduced at the outputs of the next layer.
Repeat this process until all hidden layers are trained. We have now a ‘sparse network’ with very few layer connections that can reproduce the input signals.
Now train the network with backpropagation for learning the target variable, using the pre-trained weights of the hidden layers as a starting point.

The hope is that the unsupervised pre-training process produces an internal noise-reduced abstraction of the input signals that can then be used for easier learning the target. And this indeed appears to work. No one really knows why, but several theories – see paper (4) below – try to explain that phenomenon.

Step 5: Generate a test data set

We first need to produce a data set with features and targets so that we can test our prediction process and try out parameters. The features must be based on the same price data as in live trading, and for the target we must simulate a short-term trade. So it makes sense to generate the data not with R, but with our trading platform, which is anyway a lot faster. Here’s a small Zorro script for this, DeepSignals.c:

function run()
{
	StartDate = 20140601; // start two years ago
	BarPeriod = 60; // use 1-hour bars
	LookBack = 100; // needed for scale()

	set(RULES);   // generate signals
	LifeTime = 3; // prediction horizon
	Spread = RollLong = RollShort = Commission = Slippage = 0;
	
	adviseLong(SIGNALS+BALANCED,0,
		change(1),change(2),change(3),change(4),
		range(1),range(2),range(3),range(4));
	enterLong(); 
}

We’re generating 2 years of data with features calculated by our above defined change and range functions. Our target is the result of a trade with 3 bars life time. Trading costs are set to zero, so in this case the result is equivalent to the sign of the price difference at 3 bars in the future. The adviseLong function is described in the Zorro manual; it is a mighty function that automatically handles training and predicting and allows to use any R-based machine learning algorithm just as if it were a simple indicator.

In our code, the function uses the next trade return as target, and the price changes and ranges of the last 4 bars as features. The SIGNALS flag tells it not to train the data, but to export it to a .csv file. The BALANCED flag makes sure that we get as many positive as negative returns; this is important for most machine learning algorithms. Run the script in [Train] mode with our usual test asset EUR/USD selected. It generates a spreadsheet file named DeepSignalsEURUSD_L.csv that contains the features in the first 8 columns, and the trade return in the last column.

Step 6: Calibrate the algorithm

Complex machine learning algorithms have many parameters to adjust. Some of them offer great opportunities to curve-fit the algorithm for publications. Still, we must calibrate parameters since the algorithm rarely works well with its default settings. For this, here’s an R script that reads the previously created data set and processes it with the deep learning algorithm (DeepSignal.r):

library('deepnet', quietly = T) 
library('caret', quietly = T)

neural.train = function(model,XY) 
{
  XY <- as.matrix(XY)
  X <- XY[,-ncol(XY)]
  Y <- XY[,ncol(XY)]
  Y <- ifelse(Y > 0,1,0)
  Models[[model]] <<- sae.dnn.train(X,Y,
      hidden = c(50,100,50), 
      activationfun = "tanh", 
      learningrate = 0.5, 
      momentum = 0.5, 
      learningrate_scale = 1.0, 
      output = "sigm", 
      sae_output = "linear", 
      numepochs = 100, 
      batchsize = 100,
      hidden_dropout = 0, 
      visible_dropout = 0)
}

neural.predict = function(model,X) 
{
  if(is.vector(X)) X <- t(X)
  return(nn.predict(Models[[model]],X))
}

neural.init = function()
{
  set.seed(365)
  Models <<- vector("list")
}

TestOOS = function() 
{
  neural.init()
  XY <<- read.csv('C:/Zorro/Data/DeepSignalsEURUSD_L.csv',header = F)
  splits <- nrow(XY)*0.8
  XY.tr <<- head(XY,splits);
  XY.ts <<- tail(XY,-splits)
  neural.train(1,XY.tr)
  X <<- XY.ts[,-ncol(XY.ts)]
  Y <<- XY.ts[,ncol(XY.ts)]
  Y.ob <<- ifelse(Y > 0,1,0)
  Y <<- neural.predict(1,X)
  Y.pr <<- ifelse(Y > 0.5,1,0)
  confusionMatrix(Y.pr,Y.ob)
}

We’ve defined three functions neural.train, neural.predict, and neural.init for training, predicting, and initializing the neural net. The function names are not arbitrary, but follow the convention used by Zorro’s advise(NEURAL,..) function. It doesn’t matter now, but will matter later when we use the same R script for training and trading the deep learning strategy. A fourth function, TestOOS, is used for out-of-sample testing our setup.

The function neural.init seeds the R random generator with a fixed value (365 is my personal lucky number). Otherwise we would get a slightly different result any time, since the neural net is initialized with random weights. It also creates a global R list named “Models”. Most R variable types don’t need to be created beforehand, some do (don’t ask me why). The ‘<<-‘ operator is for accessing a global variable from within a function.

The function neural.train takes as input a model number and the data set to be trained. The model number identifies the trained model in the “Models” list. A list is not really needed for this test, but we’ll need it for more complex strategies that train more than one model. The matrix containing the features and target is passed to the function as second parameter. If the XY data is not a proper matrix, which frequently happens in R depending on how you generated it, it is converted to one. Then it is split into the features (X) and the target (Y), and finally the target is converted to 1 for a positive trade outcome and 0 for a negative outcome.

The network parameters are then set up. Some are obvious, others are free to play around with:

The network structure is given by the hidden vector: c(50,100,50) defines 3 hidden layers, the first with 50, second with 100, and third with 50 neurons. That’s the parameter that we’ll later modify for determining whether deeper is better.
The activation function converts the sum of neuron input values to the neuron output; most often used are sigmoid that saturates to 0 or 1, or tanh that saturates to -1 or +1.

We use tanh here since our signals are also in the +/-1 range. The output of the network is a sigmoid function since we want a prediction in the 0..1 range. But the SAE output must be “linear” so that the Stacked Autoencoder can reproduce the analog input signals on the outputs. Recently in fashion came RLUs, Rectified Linear Units, as activation functions for internal layers. RLUs are faster and partially overcome the above mentioned backpropagation problem, but are not supported by deepnet.

The learning rate controls the step size for the gradient descent in training; a lower rate means finer steps and possibly more precise prediction, but longer training time.
Momentum adds a fraction of the previous step to the current one. It prevents the gradient descent from getting stuck at a tiny local minimum or saddle point.
The learning rate scale is a multiplication factor for changing the learning rate after each iteration (I am not sure for what this is good, but there may be tasks where a lower learning rate on higher epochs improves the training).
An epoch is a training iteration over the entire data set. Training will stop once the number of epochs is reached. More epochs mean better prediction, but longer training.
The batch size is a number of random samples – a mini batch – taken out of the data set for a single training run. Splitting the data into mini batches speeds up training since the weight gradient is then calculated from fewer samples. The higher the batch size, the better is the training, but the more time it will take.
The dropout is a number of randomly selected neurons that are disabled during a mini batch. This way the net learns only with a part of its neurons. This seems a strange idea, but can effectively reduce overfitting.

All these parameters are common for neural networks. Play around with them and check their effect on the result and the training time. Properly calibrating a neural net is not trivial and might be the topic of another article. The parameters are stored in the model together with the matrix of trained connection weights. So they need not to be given again in the prediction function, neural.predict. It takes the model and a vector X of features, runs it through the layers, and returns the network output, the predicted target Y. Compared with training, prediction is pretty fast since it only needs a couple thousand multiplications. If X was a row vector, it is transposed and this way converted to a column vector, otherwise the nn.predict function won’t accept it.

Use RStudio or some similar environment for conveniently working with R. Edit the path to the .csv data in the file above, source it, install the required R packages (deepnet, e1071, and caret), then call the TestOOS function from the command line. If everything works, it should print something like that:

> TestOOS()
begin to train sae ......
training layer 1 autoencoder ...
####loss on step 10000 is : 0.000079
training layer 2 autoencoder ...
####loss on step 10000 is : 0.000085
training layer 3 autoencoder ...
####loss on step 10000 is : 0.000113
sae has been trained.
begin to train deep nn ......
####loss on step 10000 is : 0.123806
deep nn has been trained.
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1231  808
         1  512  934
                                          
               Accuracy : 0.6212          
                 95% CI : (0.6049, 0.6374)
    No Information Rate : 0.5001          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.2424          
 Mcnemar's Test P-Value : 4.677e-16       
                                          
            Sensitivity : 0.7063          
            Specificity : 0.5362          
         Pos Pred Value : 0.6037          
         Neg Pred Value : 0.6459          
             Prevalence : 0.5001          
         Detection Rate : 0.3532          
   Detection Prevalence : 0.5851          
      Balanced Accuracy : 0.6212          
                                          
       'Positive' Class : 0               
                                          
>

TestOOS reads first our data set from Zorro’s Data folder. It splits the data in 80% for training (XY.tr) and 20% for out-of-sample testing (XY.ts). The training set is trained and the result stored in the Models list at index 1. The test set is further split in features (X) and targets (Y). Y is converted to binary 0 or 1 and stored in Y.ob, our vector of observed targets. We then predict the targets from the test set, convert them again to binary 0 or 1 and store them in Y.pr. For comparing the observation with the prediction, we use the confusionMatrix function from the caret package.

A confusion matrix of a binary classifier is simply a 2×2 matrix that tells how many 0’s and how many 1’s had been predicted wrongly and correctly. A lot of metrics are derived from the matrix and printed in the lines above. The most important at the moment is the 62% prediction accuracy. This may hint that I bashed price action trading a little prematurely. But of course the 62% might have been just luck. We’ll see that later when we run a WFO test.

A final advice: R packages are occasionally updated, with the possible consequence that previous R code suddenly might work differently, or not at all. This really happens, so test carefully after any update.

Step 7: The strategy

Now that we’ve tested our algorithm and got some prediction accuracy above 50% with a test data set, we can finally code our machine learning strategy. In fact we’ve already coded most of it, we just must add a few lines to the above Zorro script that exported the data set. This is the final script for training, testing, and (theoretically) trading the system (DeepLearn.c):

#include 

function run()
{
	StartDate = 20140601;
	BarPeriod = 60;	// 1 hour
	LookBack = 100;

	WFOPeriod = 252*24; // 1 year
	DataSplit = 90;
	NumCores = -1;  // use all CPU cores but one

	set(RULES);
	Spread = RollLong = RollShort = Commission = Slippage = 0;
	LifeTime = 3;
	if(Train) Hedge = 2;
	
	if(adviseLong(NEURAL+BALANCED,0,
		change(1),change(2),change(3),change(4),
		range(1),range(2),range(3),range(4)) > 0.5) 
		enterLong();
	if(adviseShort() > 0.5) 
		enterShort();
}

We’re using a WFO cycle of one year, split in a 90% training and a 10% out-of-sample test period. You might ask why I have earlier used two year’s data and a different split, 80/20, for calibrating the network in step 5. This is for using differently composed data for calibrating and for walk forward testing. If we used exactly the same data, the calibration might overfit it and compromise the test.

The selected WFO parameters mean that the system is trained with about 225 days data, followed by a 25 days test or trade period. Thus, in live trading the system would retrain every 25 days, using the prices from the previous 225 days. In the literature you’ll sometimes find the recommendation to retrain a machine learning system after any trade, or at least any day. But this does not make much sense to me. When you used almost 1 year’s data for training a system, it can obviously not deteriorate after a single day. Or if it did, and only produced positive test results with daily retraining, I would strongly suspect that the results are artifacts by some coding mistake.

Training a deep network takes really a long time, in our case about 10 minutes for a network with 3 hidden layers and 200 neurons. In live trading this would be done by a second Zorro process that is automatically started by the trading Zorro. In the backtest, the system trains at any WFO cycle. Therefore using multiple cores is recommended for training many cycles in parallel. The NumCores variable at -1 activates all CPU cores but one. Multiple cores are only available in Zorro S, so a complete walk forward test with all WFO cycles can take several hours with the free version.

In the script we now train both long and short trades. For this we have to allow hedging in Training mode, since long and short positions are open at the same time. Entering a position is now dependent on the return value from the advise function, which in turn calls either the neural.train or the neural.predict function from the R script. So we’re here entering positions when the neural net predicts a result above 0.5.

The R script is now controlled by the Zorro script (for this it must have the same name, DeepLearn.r, only with different extension). It is identical to our R script above since we’re using the same network parameters. Only one additional function is needed for supporting a WFO test:

neural.save = function(name)
{
  save(Models,file=name)  
}

The neural.save function stores the Models list – it now contains 2 models for long and for short trades – after every training run in Zorro’s Data folder. Since the models are stored for later use, we do not need to train them again for repeated test runs.

This is the WFO equity curve generated with the script above (EUR/USD, without trading costs):

EUR/USD equity curve with 50-100-50 network structure

Although not all WFO cycles get a positive result, it seems that there is some predictive effect. The curve is equivalent to an annual return of 89%, achieved with a 50-100-50 hidden layer structure. We’ll check in the next step how different network structures affect the result.

Since the neural.init, neural.train, neural.predict, and neural.save functions are automatically called by Zorro’s adviseLong/adviseShort functions, there are no R functions directly called in the Zorro script. Thus the script can remain unchanged when using a different machine learning method. Only the DeepLearn.r script must be modified and the neural net, for instance, replaced by a support vector machine. For trading such a machine learning system live on a VPS, make sure that R is also installed on the VPS, the needed R packages are installed, and the path to the R terminal set up in Zorro’s ini file. Otherwise you’ll get an error message when starting the strategy.

Step 8: The experiment

If our goal had been developing a strategy, the next steps would be the reality check, risk and money management, and preparing for live trading just as described under model-based strategy development. But for our experiment we’ll now run a series of tests, with the number of neurons per layer increased from 10 to 100 in 3 steps, and 1, 2, or 3 hidden layers (deepnet does not support more than 3). So we’re looking into the following 9 network structures: c(10), c(10,10), c(10,10,10), c(30), c(30,30), c(30,30,30), c(100), c(100,100), c(100,100,100). For this experiment you need an afternoon even with a fast PC and in multiple core mode. Here are the results (SR = Sharpe ratio, R2 = slope linearity):

	* 10 neurons	* 30 neurons	* 100 neurons
1	SR = 0.55 R2 = 0.00	SR = 1.02 R2 = 0.51	SR = 1.18 R2 = 0.84
2	SR = 0.98 R2 = 0.57	SR = 1.22 R2 = 0.70	SR = 0.84 R2 = 0.60
3	SR = 1.24 R2 = 0.79	SR = 1.28 R2 = 0.87	SR = 1.33 R2 = 0.83

We see that a simple net with only 10 neurons in a single hidden layer won’t work well for short-term prediction. Network complexity clearly improves the performance, however only up to a certain point. A good result for our system is already achieved with 3 layers x 30 neurons. Even more neurons won’t help much and sometimes even produce a worse result. This is no real surprise, since for processing only 8 inputs, 300 neurons can likely not do a better job than 100.

Conclusion

Our goal was determining if a few candles can have predictive power and how the results are affected by the complexity of the algorithm. The results seem to suggest that short-term price movements can indeed be predicted sometimes by analyzing the changes and ranges of the last 4 candles. The prediction is not very accurate – it’s in the 58%..60% range, and most systems of the test series become unprofitable when trading costs are included. Still, I have to reconsider my opinion about price action trading. The fact that the prediction improves with network complexity is an especially convincing argument for short-term price predictability.

It would be interesting to look into the long-term stability of predictive price patterns. For this we had to run another series of experiments and modify the training period (WFOPeriod in the script above) and the 90% IS/OOS split. This takes longer time since we must use more historical data. I have done a few tests and found so far that a year seems to be indeed a good training period. The system deteriorates with periods longer than a few years. Predictive price patterns, at least of EUR/USD, have a limited lifetime.

Where can we go from here? There’s a plethora of possibilities, for instance:

Use inputs from more candles and process them with far bigger networks with thousands of neurons.
Use oversampling for expanding the training data. Prediction always improves with more training samples.
Compress time series f.i. with spectal analysis and analyze not the candles, but their frequency representation with machine learning methods.
Use inputs from many candles – such as, 100 – and pre-process adjacent candles with one-dimensional convolutional network layers.
Use recurrent networks. Especially LSTM could be very interesting for analyzing time series – and as to my knowledge, they have been rarely used for financial prediction so far.
Use an ensemble of neural networks for prediction, such as Aronson’s “oracles” and “comitees”.

Papers / Articles

(1) A.S.Sisodiya, Reducing Dimensionality of Data
(2) K.Longmore, Machine Learning for Financial Prediction
(3) V.Perervenko, Selection of Variables for Machine Learning
(4) D.Erhan et al, Why Does Pre-training Help Deep Learning?

I’ve added the C and R scripts to the 2016 script repository. You need both in Zorro’s Strategy folder. Zorro version 1.474, and R version 3.2.5 (64 bit) was used for the experiment, but it should also work with other versions.

Better Strategies 4: Machine Learning

jcl — Thu, 31 Mar 2016 13:43:37 +0000

Deep Blue was the first computer that won a chess world championship. That was 1996, and it took 20 years until another program, AlphaGo, could defeat the best human Go player. Deep Blue was a model based system with hardwired chess rules. AlphaGo is a data-mining system, a deep neural network trained with thousands of Go games. Not improved hardware, but a breakthrough in software was essential for the step from beating top Chess players to beating top Go players.
In this 4th part of the mini-series we’ll look into the data mining approach for developing trading strategies. This method does not care about market mechanisms. It just scans price curves or other data sources for predictive patterns. Machine learning or “Artificial Intelligence” is not always involved in data-mining strategies. In fact the most popular – and surprisingly profitable – data mining method works without any fancy neural networks or support vector machines.

Machine learning principles

A learning algorithm is fed with data samples, normally derived in some way from historical prices. Each sample consists of n variables x₁ .. x_n, commonly named predictors, features, signals, or simply input. These predictors can be the price returns of the last n bars, or a collection of classical indicators, or any other imaginable functions of the price curve (I’ve even seen the pixels of a price chart image used as predictors for a neural network!). Each sample also normally includes a target variable y, like the return of the next trade after taking the sample, or the next price movement. In the literature you can find y also named label or objective. In a training process, the algorithm learns to predict the target y from the predictors x₁ .. x_n. The learned ‘memory’ is stored in a data structure named model that is specific to the algorithm (not to be confused with a financial model for model based strategies!). A machine learning model can be a function with prediction rules in C code, generated by the training process. Or it can be a set of connection weights of a neural network.

Training: x₁ .. x_n, y => model

Prediction: x₁ .. x_n, model => y

The predictors, features, or whatever you call them, must carry information sufficient to predict the target y with some accuracy. They must also often fulfill two formal requirements. First, all predictor values should be in the same range, like -1 .. +1 (for most R algorithms) or -100 .. +100 (for Zorro or TSSB algorithms). So you need to normalize them in some way before sending them to the machine. Second, the samples should be balanced, i.e. equally distributed over all values of the target variable. So there should be about as many winning as losing samples. If you do not observe these two requirements, you’ll wonder why you’re getting bad results from the machine learning algorithm.

Regression algorithms predict a numeric value, like the magnitude and sign of the next price move. Classification algorithms predict a qualitative sample class, for instance whether it’s preceding a win or a loss. Some algorithms, such as neural networks, decision trees, or support vector machines, can be run in both modes.

A few algorithms learn to divide samples into classes without needing any target y. That’s unsupervised learning, as opposed to supervised learning using a target. Somewhere inbetween is reinforcement learning, where the system trains itself by running simulations with the given features, and using the outcome as training target. AlphaZero, the successor of AlphaGo, used reinforcement learning by playing millions of Go games against itself. In finance there are few applications for unsupervised or reinforcement learning. 99% of machine learning strategies use supervised learning.

Whatever signals we’re using for predictors in finance, they will most likely contain much noise and little information, and will be nonstationary on top of it. Therefore financial prediction is one of the hardest tasks in machine learning. More complex algorithms do not necessarily achieve better results. The selection of the predictors is critical to the success. It is no good idea to use lots of predictors, since this simply causes overfitting and failure in out of sample operation. Therefore data mining strategies often apply a preselection algorithm that determines a small number of predictors out of a pool of many. The preselection can be based on correlation between predictors, on significance, on information content, or simply on prediction success with a test set. Practical experiments with feature selection can be found in a recent article on the Robot Wealth blog.

Here’s a list of the most popular data mining methods used in finance.

1. Indicator soup

Most trading systems we’re programming for clients are not based on a financial model. The client just wanted trade signals from certain technical indicators, filtered with other technical indicators in combination with more technical indicators. When asked how this hodgepodge of indicators could be a profitable strategy, he normally answered: “Trust me. I’m trading it manually, and it works.”

It did indeed. At least sometimes. Although most of those systems did not pass a WFA test (and some not even a simple backtest), a surprisingly large number did. And those were also often profitable in real trading. The client had systematically experimented with technical indicators until he found a combination that worked in live trading with certain assets. This way of trial-and-error technical analysis is a classical data mining approach, just executed by a human and not by a machine. I can not really recommend this method – and a lot of luck, not to speak of money, is probably involved – but I can testify that it sometimes leads to profitable systems.

2. Candle patterns

Not to be confused with those Japanese Candle Patterns that had their best-before date long, long ago. The modern equivalent is price action trading. You’re still looking at the open, high, low, and close of candles. You’re still hoping to find a pattern that predicts a price direction. But you’re now data mining contemporary price curves for collecting those patterns. There are software packages for that purpose. They search for patterns that are profitable by some user-defined criterion, and use them to build a specific pattern detection function. It could look like this one (from Zorro’s pattern analyzer):

int detect(double* sig)
{
  if(sig[1]
This C function returns 1 when the signals match one of the patterns, otherwise 0. You can see from the lengthy code that this is not the fastest way to detect patterns. A better method, used by Zorro when the detection function needs not be exported, is sorting the signals by their magnitude and checking the sort order. An example of such a system can be found here.
Can price action trading really work? Just like the indicator soup, it’s not based on any rational financial model. One can at best imagine that sequences of price movements cause market participants to react in a certain way, this way establishing a temporary predictive pattern. However the number of patterns is quite limited when you only look at sequences of a few adjacent candles. The next step is comparing candles that are not adjacent, but arbitrarily selected within a longer time period. This way you’re getting an almost unlimited number of patterns – but at the cost of finally leaving the realm of the rational. It is hard to imagine how a price move can be predicted by some candle patterns from weeks ago.
Still, a lot effort is going into that. A fellow blogger, Daniel Fernandez, runs a subscription website (Asirikuy) specialized on data mining candle patterns. He refined pattern trading down to the smallest details, and if anyone would ever achieve any profit this way, it would be him. But to his subscribers’ disappointment, trading his patterns live (QuriQuant) produced very different results than his wonderful backtests. If profitable price action systems really exist, apparently no one has found them yet.
3. Linear regression
The simple basis of many complex machine learning algorithms: Predict the target variable y by a linear combination of the predictors  x₁.. x_n.
[latex]y = a_0 + a_1 x_1 + … + a_n x_n[/latex]
The coefficients a_n are the model. They are calculated for minimizing the sum of squared differences between the true y values from the training samples and their predicted y from the above formula:
[latex]Minimize(\sum (y_i-\hat{y_i})^2)[/latex]
For normal distributed samples, the minimizing is possible with some matrix arithmetic, so no iterations are required. In the case n = 1 – with only one predictor variable x – the regression formula is reduced to
[latex]y = a + b x[/latex]
which is simple linear regression, as opposed to multivariate linear regression where n > 1. Simple linear regression is available in most trading platforms, f.i. with the LinReg indicator in the TA-Lib. With y = price and x = time it’s often used as an alternative to a moving average. Multivariate linear regression is available in the R platform through the lm(..) function that comes with the standard installation. A variant is polynomial regression. Like simple regression it uses only one predictor variable x, but also its square and higher degrees, so that x_n == xⁿ:
[latex]y = a_0 + a_1 x + a_2 x^2 + … + a_n x^n[/latex]
With n = 2 or n = 3, polynomial regression is often used to predict the next average price from the smoothed prices of the last bars. The polyfit function of MatLab, R, Zorro, and many other platforms can be used for polynomial regression.
4. Perceptron
Often referred to as a neural network with only one neuron. In fact a perceptron is a regression function like above, but with a binary result, thus called logistic regression. It’s not regression though, it’s a classification algorithm. Zorro’s advise(PERCEPTRON, …) function generates C code that returns either 100 or -100, dependent on whether the predicted result is above a threshold or not:


int predict(double* sig)
{
  if(-27.99*sig[0] + 1.24*sig[1] - 3.54*sig[2] > -21.50)
    return 100;
  else
    return -100;
}
You can see that the sig array is equivalent to the features x_n in the regression formula, and the numeric factors are the coefficients a_n.
5. Neural networks
Linear or logistic regression can only solve linear problems. Many do not fall into this category – a famous example is predicting the output of a simple XOR function. And most likely also predicting prices or trade returns. An artificial neural network (ANN) can tackle nonlinear problems. It’s a bunch of perceptrons that are connected together in an array of layers. Any perceptron is a neuron of the net. Its output goes to the inputs of all neurons of the next layer, like this:

Like the perceptron, a neural network also learns by determining the coefficients that minimize the error between sample prediction and sample target. But this requires now an approximation process, normally with backpropagating the error from the output to the inputs, optimizing the weights on its way. This process imposes two restrictions. First, the neuron outputs must now be continuously differentiable functions instead of the simple perceptron threshold. Second, the network must not be too deep – it must not have too many ‘hidden layers’ of neurons between inputs and output. This second restriction limits the complexity of problems that a standard neural network can solve.
When using a neural network for predicting trades, you have a lot of parameters with which you can play around and, if you’re not careful, produce a lot of selection bias:

Number of hidden layers
Number of neurons per hidden layer
Number of backpropagation cycles, named epochs
Learning rate, the step width of an epoch
Momentum, an inertia factor for the weights adaption
Activation function

The activation function emulates the perceptron threshold. For the backpropagation you need a continuously differentiable function that generates a ‘soft’ step at a certain x value. Normally a sigmoid, tanh, or softmax function is used. Sometimes it’s also a linear function that just returns the weighted sum of all inputs. In this case the network can be used for regression, for predicting a numeric value instead of a binary outcome.
Neural networks are available in the standard R installation (nnet, a single hidden layer network) and in many packages, for instance RSNNS and FCNN4R.
6. Deep learning
Deep learning methods use neural networks with many hidden layers and thousands of neurons, which could not be effectively trained anymore by conventional backpropagation. Several methods became popular in the last years for training such huge networks. They usually pre-train the hidden neuron layers for achieving a more effective learning process. A Restricted Boltzmann Machine (RBM) is an unsupervised classification algorithm with a special network structure that has no connections between the hidden neurons. A Sparse Autoencoder (SAE) uses a conventional network structure, but pre-trains the hidden layers in a clever way by reproducing the input signals on the layer outputs with as few active connections as possible. Those methods allow very complex networks for tackling very complex learning tasks. Such as beating the world’s best human Go player.
Deep learning networks are available in the deepnet and darch R packages. Deepnet provides an autoencoder, Darch a restricted Boltzmann machine. I have not yet experimented with Darch, but here’s an example R script using the Deepnet autoencoder with 3 hidden layers for trade signals through Zorro’s neural() function:
library('deepnet', quietly = T) 
library('caret', quietly = T)

# called by Zorro for training
neural.train = function(model,XY) 
{
  XY <- as.matrix(XY)
  X <- XY[,-ncol(XY)] # predictors
  Y <- XY[,ncol(XY)]  # target
  Y <- ifelse(Y > 0,1,0) # convert -1..1 to 0..1
  Models[[model]] <<- sae.dnn.train(X,Y,
      hidden = c(50,100,50), 
      activationfun = "tanh", 
      learningrate = 0.5, 
      momentum = 0.5, 
      learningrate_scale = 1.0, 
      output = "sigm", 
      sae_output = "linear", 
      numepochs = 100, 
      batchsize = 100,
      hidden_dropout = 0, 
      visible_dropout = 0)
}

# called by Zorro for prediction
neural.predict = function(model,X) 
{
  if(is.vector(X)) X <- t(X) # transpose horizontal vector
  return(nn.predict(Models[[model]],X))
}

# called by Zorro for saving the models
neural.save = function(name)
{
  save(Models,file=name) # save trained models
}

# called by Zorro for initialization
neural.init = function()
{
  set.seed(365)
  Models <<- vector("list")
}

# quick OOS test for experimenting with the settings
Test = function() 
{
  neural.init()
  XY <<- read.csv('C:/Project/Zorro/Data/signals0.csv',header = F)
  splits <- nrow(XY)*0.8
  XY.tr <<- head(XY,splits) # training set
  XY.ts <<- tail(XY,-splits) # test set
  neural.train(1,XY.tr)
  X <<- XY.ts[,-ncol(XY.ts)]
  Y <<- XY.ts[,ncol(XY.ts)]
  Y.ob <<- ifelse(Y > 0,1,0)
  Y <<- neural.predict(1,X)
  Y.pr <<- ifelse(Y > 0.5,1,0)
  confusionMatrix(Y.pr,Y.ob) # display prediction accuracy
}
7. Support vector machines
Like a neural network, a support vector machine (SVM) is another extension of linear regression. When we look at the regression formula again,
[latex]y = a_0 + a_1 x_1 + … + a_n x_n[/latex]
we can interpret the features x_n as coordinates of a n-dimensional feature space. Setting the target variable y to a fixed value determines a plane in that space, called a hyperplane since it has more than two (in fact, n-1) dimensions. The hyperplane separates the samples with y > o from the samples with y < 0. The a_n coefficients can be calculated in a way that the distances of the plane to the nearest samples – which are called the ‘support vectors’ of the plane, hence the algorithm name – is maximum. This way we have a binary classifier with optimal separation of winning and losing samples.
The problem: normally those samples are not linearly separable – they are scattered around irregularly in the feature space. No flat plane can be squeezed between winners and losers. If it could, we had simpler methods to calculate that plane, f.i. linear discriminant analysis. But for the common case we need the SVM trick: Adding more dimensions to the feature space. For this the SVM algorithm produces more features with a kernel function that combines any two existing predictors to a new feature. This is analogous to the step above from the simple regression to polynomial regression, where also more features are added by taking the sole predictor to the n-th power. The more dimensions you add, the easier it is to separate the samples with a flat hyperplane. This plane is then transformed back to the original n-dimensional space, getting wrinkled and crumpled on the way. By clever selecting the kernel function, the process can be performed without actually computing the transformation.
Like neural networks, SVMs can be used not only for classification, but also for regression. They also offer some parameters for optimizing and possibly overfitting the prediction process:

Kernel function. You normally use a RBF kernel (radial basis function, a symmetric kernel), but you also have the choice of other kernels, such as sigmoid, polynomial, and linear.
Gamma, the width of the RBF kernel
Cost parameter C, the ‘penalty’ for wrong classifications in the training samples

An often used SVM is the libsvm library. It’s also available in R in the e1071 package. In the next and final part of this series I plan to describe a trading strategy using this SVM.
8. K-Nearest neighbor
Compared with the heavy ANN and SVM stuff, that’s a nice simple algorithm with a unique property: It needs no training. So the samples are the model. You could use this algorithm for a trading system that learns permanently by simply adding more and more samples. The nearest neighbor algorithm computes the distances in feature space from the current feature values to the k nearest samples. A distance in n-dimensional space between two feature sets (x₁ .. x_n) and (y₁ .. y_n) is calculated just as in 2 dimensions:
[latex display=”true”]D = \sqrt{(x_1-y_1)^2 + (x_2-y_2)^2 + … + (x_n-y_n)^2}[/latex]
The algorithm simply predicts the target from the average of the k target variables of the nearest samples, weighted by their inverse distances. It can be used for classification as well as for regression. Software tricks borrowed from computer graphics, such as an adaptive binary tree (ABT), can make the nearest neighbor search pretty fast. In my past life as computer game programmer, we used such methods in games for tasks like self-learning enemy intelligence. You can call the knn function in R for nearest neighbor prediction – or write a simple function in C for that purpose.
9. K-Means
This is an approximation algorithm for unsupervised classification. It has some similarity, not only its name, to k-nearest neighbor. For classifying the samples, the algorithm first places k random points in the feature space. Then it assigns to any of those points all the samples with the smallest distances to it. The point is then moved to the mean of these nearest samples. This will generate a new samples assignment, since some samples are now closer to another point. The process is repeated until the assignment does not change anymore by moving the points, i.e. each point lies exactly at the mean of its nearest samples. We now have k classes of samples, each in the neighborhood of one of the k points.
This simple algorithm can produce surprisingly good results. In R, the kmeans function does the trick. An example of the k-means algorithm for classifying candle patterns can be found here: Unsupervised candlestick classification for fun and profit.
10. Naive Bayes
This algorithm uses Bayes’ Theorem for classifying samples of non-numeric features (i.e. events), such as the above mentioned candle patterns. Suppose that an event X (for instance, that the Open of the previous bar is below the Open of the current bar) appears in 80% of all winning samples. What is then the probability that a sample is winning when it contains event X? It’s not 0.8 as you might think. The probability can be calculated with Bayes’ Theorem:
[latex display=”true”]P(Y \vert X) = \frac{P(X \vert Y) P(Y)}{P(X)}[/latex]
P(Y|X) is the probability that event Y (f.i. winning) occurs in all samples containing event X (in our example, Open(1) < Open(0)). According to the formula, it is equal to the probability of X occurring in all winning samples (here, 0.8), multiplied by the probability of Y in all samples (around 0.5 when you were following my above advice of balanced samples) and divided by the probability of X in all samples.
If we are naive and assume that all events X are independent of each other, we can calculate the overall probability that a sample is winning by simply multiplying the probabilities P(X|winning) for every event X. This way we end up with this formula:
[latex display=”true”]P(Y | X_1 .. X_n) ~=~ s~P(Y) \prod_{i}{P(X_i | Y)}[/latex]
with a scaling factor s. For the formula to work, the features should be selected in a way that they are as independent as possible, which imposes an obstacle for using Naive Bayes in trading. For instance, the two events Close(1) < Close(0) and Open(1) < Open(0) are most likely not independent of each other. Numerical predictors can be converted to events by dividing the number into separate ranges.
The Naive Bayes algorithm is available in the ubiquitous e1071 R package.
11. Decision and regression trees
Those trees predict an outcome or a numeric value based on a series of yes/no decisions, in a structure like the branches of a tree. Any decision is either the presence of an event or not (in case of non-numerical features) or a comparison of a feature value with a fixed threshold. A typical tree function, generated by Zorro’s tree builder, looks like this:
int tree(double* sig)
{
  if(sig[1] <= 12.938) {
    if(sig[0] <= 0.953) return -70;
    else {
      if(sig[2] <= 43) return 25;
      else {
        if(sig[3] <= 0.962) return -67;
        else return 15;
      }
    }
  }
  else {
    if(sig[3] <= 0.732) return -71;
    else {
      if(sig[1] > 30.61) return 27;
      else {
          if(sig[2] > 46) return 80;
          else return -62;
      }
    }
  }
}
How is such a tree produced from a set of samples? There are several methods; Zorro uses the Shannon information entropy, which already had an appearance on this blog in the Scalping article. At first it checks one of the features, let’s say x₁. It places a hyperplane with the plane formula x₁ = t into the feature space. This hyperplane separates the samples with x₁ > t from the samples with x₁ < t. The dividing threshold t is selected so that the information gain – the difference of information entropy of the whole space, to the sum of information entropies of the two divided sub-spaces – is maximum. This is the case when the samples in the subspaces are more similar to each other than the samples in the whole space.
This process is then repeated with the next feature x₂ and two hyperplanes splitting the two subspaces. Each split is equivalent to a comparison of a feature with a threshold. By repeated splitting, we soon get a huge tree with thousands of threshold comparisons. Then the process is run backwards by pruning the tree and removing all decisions that do not lead to substantial information gain. Finally we end up with a relatively small tree as in the code above.
Decision trees have a wide range of applications. They can produce excellent predictions superior to those of neural networks or support vector machines. But they are not a one-fits-all solution, since their splitting planes are always parallel to the axes of the feature space. This somewhat limits their predictions. They can be used not only for classification, but also for regression, for instance by returning the percentage of samples contributing to a certain branch of the tree. Zorro’s tree is a regression tree. The best known classification tree algorithm is C5.0, available in the C50 package for R.
For improving the prediction even further or overcoming the parallel-axis-limitation, an ensemble of trees can be used, called a random forest. The prediction is then generated by averaging or voting the predictions from the single trees. Random forests are available in R packages randomForest, ranger and Rborist.
Conclusion
There are many different data mining and machine learning methods at your disposal. The critical question: what is better, a model-based or a machine learning strategy? There is no doubt that machine learning has a lot of advantages. You don’t need to care about market microstructure, economy, trader psychology, or similar soft stuff. You can concentrate on pure mathematics. Machine learning is a much more elegant, more attractive way to generate trade systems. It has all advantages on its side but one. Despite all the enthusiastic threads on trader forums, it tends to mysteriously fail in live trading.
Every second week a new paper about trading with machine learning methods is published (a few can be found below). Please take all those publications with a grain of salt. According to some papers, phantastic win rates in the range of 70%, 80%, or even 85% have been achieved. Although win rate is not the only relevant criterion – you can lose even with a high win rate – 85% accuracy in predicting trades is normally equivalent to a profit factor above 5. With such a system the involved scientists should be billionaires meanwhile. Unfortunately I never managed to reproduce those win rates with the described method, and didn’t even come close. So maybe a lot of selection bias went into the results. Or maybe I’m just too stupid.
Compared with model based strategies, I’ve seen not many successful machine learning systems so far. And from what one hears about the algorithmic methods by successful hedge funds, machine learning seems still rarely to be used. But maybe this will change in the future with the availability of more processing power and the upcoming of new algorithms for deep learning.
Papers

Classification using deep neural networks: Dixon.et.al.2016
Predicting price direction using ANN & SVM:  Kara.et.al.2011
Empirical comparison of learning algorithms: Caruana.et.al.2006
Mining stock market tendency using GA & SVM: Yu.Wang.Lai.2005

The next part of this series will deal with the practical development of a machine learning strategy.
⇒ Build Better Strategies – Part 5

The Cold Blood Index

jcl — Mon, 26 Oct 2015 12:50:51 +0000

You’ve developed a new trading system. All tests produced impressive results. So you started it live. And are down by $2000 after 2 months. Or you have a strategy that worked for 2 years, but revently went into a seemingly endless drawdown. Situations are all too familiar to any algo trader. What now? Carry on in cold blood, or pull the brakes in panic?
Several reasons can cause a strategy to lose money right from the start. It can be already expired since the market inefficiency disappeared. Or the system is worthless and the test falsified by some bias that survived all reality checks. Or it’s a normal drawdown that you just have to sit out. In this article I propose an algorithm for deciding very early whether or not to abandon a system in such a situation.

When you start a trading strategy, you’re almost always under water for some time. This is a normal consequence of equity curve volatility. It is the very reason why you need initial capital at all for trading (aside from covering margins and transaction costs). Here you can see the typical bumpy start of a trading system:

CHF grid trader, initial live equity curve

You can estimate from the live equity curve that this system was rather profitable (it was a grid trader exploiting the CHF price cap). It started in July 2013 and had earned about 750 pips in January 2014, 7 months later. Max drawdown was ~400 pips from September until November. So the raw return of that system was about 750/400 ~= 180%. Normally an excellent value for a trade system. But you can also see from the curve that you were down 200 pips about six weeks into trading, and thus had lost almost half of your minimum initial capital. And if you had started the system in September, you had even stayed under water for more than 3 months! This is a psychologically difficult situation. Many traders panic, pull out, and this way lose money even with highly profitable systems. Algo trading unaffected by emotions? Not true.

Not so out of sample

The basic problem: you can never fully trust your test results. No matter how out-of-sample you test it, a strategy still suffers from a certain amount of Data-Snooping Bias. The standard method of measuring bias – White’s Reality Check – works well for simple mechanically generated systems, as in the Trend Experiment. But all human decisions about algorithms, asset selection, filters, training targets, stop/takeprofit mechanisms, WFO windows, money management and so on add new bias, since they are normally affected by testing. The out-of-sample data is then not so out-of-sample anymore. While the bias by training or optimization can be measured and even eliminated with walk forward methods, the bias introduced by the mere development process is unknown. The strategy might still be profitable, or not anymore, or not at all. You can only find out by comparing live results permanently with test results.

You could do that with no risk by trading on a demo account. But if the system is really profitable, demo time is sacrificed profit and thus expensive. Often very expensive, as you must demo trade a long time for some result significancy, and many strategies have a limited lifetime anyway. So you normally demo trade a system only a few weeks for making sure that the script is bug-free, then you go live with real money.

Pull-out conditions

The simplest method of comparing live results is based on the maximum drawdown in the test. This is the pull-out inequality:

[pmath size=18]E ~<~ C + G t/y – D[/pmath]

E = Current account equity
C = Initial account capital
G = Test profit
t = Live trading period
y = Test period
D = Test maximum drawdown

This formula means simply that you should pull out when the live trading drawdown exceeds the maximum drawdown from the test. Traders often check their live results this way, but there are many problems involved with this method:

The maximum backtest drawdown is more or less random.
Drawdowns grow with the test period, thus longer test periods produce worse maximum drawdowns and later pull-out signals.
The drawdown time is not considered.
The method does not work when profits are reinvested by some money management algorithm.
The method does not consider the unlikeliness that the maximum drawdown happens already at live trading start.

For those reasons, the above pullout inequality is often modified for taking the drawdown length and growth into account. The maximum drawdown is then assumed to grow with the square root of time, leading to this modified formula:

[pmath size=18]E ~<~ C + G t/y – D sqrt{{t+l}/y}[/pmath]

E = Current account equity
C = Initial account capital
G = Test profit
t = Live trading period
y = Test period
D = Maximum drawdown depth
l = Maximum drawdown length

This was in fact the algorithm that I often suggested to clients for supervising their live results. It puts the drawdown in relation to the test period and also considers the drawdown length, as the probability of being inside the worst drawdown right at live trading start is l/y. Still, the method does not work with a profit reinvesting system. And it is dependent on the rather random test drawdown. You could address the latter issue by taking the drawdown from a Montecarlo shuffled equity curve, but this produces new problems since trading results have often serial correlation.

After this lenghty introduction for motivation, here’s the proposed algorithm that overcomes the mentioned issues.

Keeping cold blood

For finding out if we really must immediately stop a strategy, we calculate the deviation of the current live trading situation from the strategy behavior in the test. For this we do not use the maximum drawdown, but the backtest equity or balance curve:

Determine a time window of length l (in days) that you want to check. It’s normally the length of the current drawdown; if your system is not in a drawdown, you’re probably in cold blood anyway. Determine the drawdown depth D, i.e. the net loss during that time.
Place a time window of same size l at the start of the test balance curve.
Determine the balance difference G from end to start of the window. Increase a counter N when G <= D.
Move the window forward by 1 day.
Repeat steps 3 and 4 until the window arrived at the end of the balance curve. Count the steps with a counter M.

Any window movement takes a sample out of the curve. We have N samples that are similar or worse, and M-N samples that are better than the current trading situation. The probability to not encounter such a drawdown in T out of M samples is a simple combinatorial equation:

[pmath size=18]1-P ~=~ {(M-N)!(M-T)! }/ {M!(M-N-T)!}[/pmath]

N = Number of G <= D occurrences
M = Total samples = y-l+1
l = Window length in days
y = Test time in days
T = Samples taken = t-l+1
t = Live trading time in days

P is the cold blood index – the similarity of the live situation with the backtest. As long as P stays above 0.1 or 0.2, probably all is still fine. But if P is very low or zero, either the backtest was strongly biased or the market has significantly changed. The system can still be profitable, just less profitable as in the test. But when the current loss D is large in comparison to the gains so far, we should stop.

Often we want to calculate P soon after the begin of live trading. The window size l is then identical to our trading time t, hence T == 1. This simplifies the equation to:

[pmath size=18]P ~=~ N/M[/pmath]

In such a situation I’d give up and pull out of a painful drawdown as soon as P drops below 5%.

The slight disadvantage of this method is that you must perform a backtest with the same capital allocation, and store its balance or equity curve in a file for later evaluation during live trading. However this should only take a few lines of code in a strategy script.

Here’s a small example script for Zorro that calculates P (in percent) from a stored balance curve when a trading time t and drawdown of length l and depth D is given:

int TradeDays = 40;    // t, Days since live start
int DrawDownDays = 30; // l, Days since you're in drawdown
var DrawDown = 100;    // D, Current drawdown depth in $

string BalanceFile = "Log\\BalanceDaily.dbl"; // stored double array

var logsum(int n)
{
  if(n <= 1) return 0;
  return log(n)+logsum(n-1);
}

void main()
{
  int CurveLength = file_length(BalanceFile)/sizeof(var);
  var *Balances = file_content(BalanceFile);

  int M = CurveLength - DrawDownDays + 1;
  int T = TradeDays - DrawDownDays + 1;
 
  if(T < 1 || M <= T) {
    printf("Not enough samples!");
    return;
  }
 
  var GMin=0., N=0.;
  int i=0;
  for(; i < M-1; i++)
  {
    var G = Balances[i+DrawDownDays] - Balances[i];
    if(G <= -DrawDown) N += 1.;
    if(G < GMin) GMin = G;
  } 

  var P;
  if(TradeDays > DrawDownDays)
    P = 1. - exp(logsum(M-N)+logsum(M-T)-logsum(M)-logsum(M-N-T));
  else
    P = N/M;

  printf("\nTest period: %i days",CurveLength);
  printf("\nWorst test drawdown: %.f",-GMin);
  printf("\nM: %i N: %i T: %i",M,(int)N,T);
  printf("\nCold Blood Index: %.1f%%",100*P);
}

Since my computer is unfortunately not good enough for calculating the factorials of some thousand samples, I’ve summed up the logarithms instead – therefore the strange logsum function in the script.

Conclusion

Finding out early whether a live trading drawdown is ‘normal’ or not can be essential for your wallet.
The backtest drawdown is a late and inaccurate criteria.
The Cold Blood Index calculates the precise probability of such a drawdown based on the backtest balance curve.

I’ve added the script above to the 2015 scripts collection. I also have suggested to the Zorro developers to implement this method for automatically analyzing drawdowns while live trading, and issue warnings when P gets dangerously low. This can also be done separately for components in a portfolio system. This feature will probably appear in a future Zorro version.

Boosting Strategies with MMI

jcl — Mon, 28 Sep 2015 15:49:35 +0000

We will now repeat our experiment with the 900 trend trading strategies, but this time with trades filtered by the Market Meanness Index. In our first experiment we found many profitable strategies, some even with high profit factors, but none of them passed White’s Reality Check. So they all would probably fail in real trading in spite of their great results in the backtest. This time we hope that the MMI improves most systems by filtering out trades in non-trending market situations.

900 systems experiment revisited

I have been informed by readers that I committed two mistakes, or at least inaccuracies, in the previous experiment. First, I didn’t detrend the price data. Second, I used the equity curves instead of balance curves for determining the profit factor. I didn’t detrend the prices because the systems traded long/short in a symmetric way, and I supposed that this would eliminate any trend bias. But even if this was true back then, it is now not true anymore: filtering trades by MMI or other means can introduce asymmetry. Also, calculating the profit factor from the balance curve makes indeed more sense because you’re interested in the end profit of the trades, not in their interim behavior. Therefore and for the sake of comparable results I will now and in the future use detrended trade returns and balance curves for such experiments.

The original test, repeated with the modifications, produced a wider profit factor distribution due to eliminating intermediate returns. But the outcome of the experiment was the same. The statistic (including trade costs) did not change much, however the profit factor distribution (without trade costs) did. This is the new WRC histogram of the original 900 systems (best system vs. bootstrap-randomized returns of all systems):

900 trend systems (no MMI)

Although the best system (black bar, a system using ALMA) is at the right side of the distribution, still 11% of random systems were better. The system does not pass the WRC at the required 95% confidence level. This turned out very different when filtering trades with the MMI.

The MMI experiment

This is our script TrendMMI.c for the new experiment:

// helper function: remove systems that exceed the 4 months lookback period
int checkLookBack(int Period) 
{
  if(Period >= LookBack/TimeFrame) {
    StepNext = 0;	// abort optimization
    return LookBack/TimeFrame; // reduce the period
  } else
    return Period;
}

// calculate profit factor and remove systems with not enough trades 
var objective() 
{ 
  if(NumWinTotal < 30 || NumLossTotal < 30) { 
    StepNext = 0;     // abort optimization
    return 0;         // don't store this system
  } else
      return WinTotal/LossTotal; // Profit factor
}

var filter(var* Data,int Period);

void run()
{
  set(PARAMETERS|LOGFILE);
  Curves = "DailyBalance.bin";
  StartDate = 2010;
  BarPeriod = 15;
  LookBack = 80*4*24; // ~ 4 months
  Detrend = TRADES;   // detrend trade results
  while(asset(loop("EUR/USD","SPX500","XAG/USD")))
  while(algo(loop("MM15","MH1","MH4")))
  {
    TimeFrame = 1;
    if(Algo == "MH1")
      TimeFrame = 1*4;
    else if(Algo == "MH4")
      TimeFrame = 4*4;	

// no trade costs
    Spread = Commission = RollLong = RollShort = Slippage = 0;

    int Periods[10] = { 10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10000 };
    int Period = Periods[round(optimize(1,1,10,1),1)-1];
		
    var *Price = series(price());
    var *Smoothed = series(filter(Price,Period));

    bool DoTrade = true;
    int MMIPeriod = optimize(0,200,500,100);
    if(MMIPeriod) {
      MMIPeriod = checkLookBack(MMIPeriod);
      var *MMI_Raw = series(MMI(Price,MMIPeriod));
      var *MMI_Smooth = series(LowPass(MMI_Raw,MMIPeriod));
      DoTrade = falling(MMI_Smooth);
    }

    if(DoTrade) {
      if(valley(Smoothed))
        enterLong();
      else if(peak(Smoothed))
        enterShort();
    }
  } 
}

The 10 trend trading scripts with the 10 different indicators remain unchanged, aside from now including TrendMMI.c instead of Trend.c. Trading is now dependent on a boolean variable DoTrade. The length of the MMI range is varied between 200, 300, 400, and 500 bars. As most parameters in a strategy, the MMI range is a compromise: It should be no less than 200 bars for getting any accuracy, but it should not be too long for preventing that different market regimes fall in the same MMI range. At the default range of 0, no MMI is applied and trading is not filtered. This way we’re including all the previous systems in the test. This is required for properly detecting Data Mining Bias, which must consider all systems that were discarded based on their result.

We’re running the MMI return value through a lowpass filter that uses the same period as the MMI range. This gives us a smooth MMI value that does not jump around. This value is now used for trade filtering: trades are opened and closed only when the smoothed MMI is falling, meaning that the market has entered trending mode within the last 200 to 500 bars. The MMI is only applied to one of the systems resulting from the prior period variation (the optimize function automatically selects the parameter of the “most robust” system before optimizing the next parameter). So now we’re testing in fact not 900, but 1260 systems: 900 without MMI and each 90 with MMI ranges of 200, 300, 400, and 500 bars. The systems with not enough trades or a too-long lookback period are again removed from the pool, so the real number of tested systems is about 1100.

Depending on the speed of your PC, Zorro will need about 1 hour to test all systems. At the end of every system test, Zorro produces the parameter histograms. We have now two parameters. The histogram of the first one, the price smoothing filter period, looks as before because MMI was switched off during optimization. The second histogram displays the MMI range in combination with the best value from the first histogram. “Best” is here not the highest bar from the previous histogram, but the value that Zorro deems the most robust and least sensitive to market changes. A typical MMI histograms look like this:

The first bar, marked “100”, is the best system without MMI. We can see that it is unprofitable: The profit factor (left scale) is only about 0.8. Using the MMI with a range of 200 and 300 makes the system in fact worse, and reduces the profit factor t0 0.7. However the last two MMI ranges, 400 and 500, shift the system into the profit zone. This was just a random example, but how does the MMI affect all the other systems? Here are the statistics from the MMI experiment:

Asset, Period, Indicator	Success Rate	Winning	Losing
EUR/USD	46% (+8%)	154	185
S&P 500	4% (+3%)	15	318
Silver	27% (+7%)	87	240
15 Minutes	18% (+7%)	71	322
1 Hour	27% (+9%)	92	251
4 Hours	35% (+2%)	93	170
ALMA	22% (+6)%	22	79
Decycle	21% (+8%)	23	89
EMA	23% (+5%)	24	79
HMA	34% (+9%)	33	66
Laguerre	33% (+3%)	20	38
LinearReg	29% (+6%)	31	77
Lowpass	24% (+5%)	26	82
SMA	26% (+5%)	27	76
Smooth	26% (+7%)	23	67
ZMA	22% (+8)%	27	90

The Rate column shows the percentage of successful systems, and in parentheses the difference to the percentage without MMI. We can see that the MMI increased the number of successful systems in all markets, time frames, and indicators. However the numbers are not really representative: the MMI only affected a quarter of the tested systems, but the upper quarter, so some increase in the number of profitable systems was to be expected anyway. A more meaningful measure is the WRC. We’re using the same Bootstrap.c script as in the previous experiment, we only need to increase the CURVES number to 1260. This is the WRC histogram of systems with MMI (again, best system vs. bootstrapped returns of all systems):

900 trend systems (with MMI)

The MMI filter now shifted the best system (black) far to the right side of the histogram. It got a p-value of 0.02, meaning that it is better than 98% of the best randomized systems, and thus well above the 95% significance level. Using the MMI for filtering trades, the method of trading on curve peaks and valleys passed White’s Reality Check. In fact two of the 1260 systems got p-values above the significance level.

The best systems of the experiment had some things in common: They traded with silver and used either the ALMA or the lowpass filter. This is a surprising result, because neither silver nor ALMA and lowpass had the highest number of profitable systems. From the above table, one would assume that EUR/USD and the HMA or Laguerre filter are the most promising. They indeed produced many apparently good systems with profit factors above 2 (without trade costs), but none of them passed the WRC.

Conclusion

The MMI improved trend following systems by 5%…10% average with all tested markets, time frames, and indicators. Best systems were improved by more than 50%.
Trend following systems using the MMI can pass White’s Reality Check.
From the 10 tested smoothing indicators, ALMA produced the best results, although within a relatively small parameter range.
To do: Test more trend filters, f.i. the Hurst Exponent or Ehlers’ Trend/ Cycle decomposition.
To do: Create a real trading system by combining the best trend systems and adding the usual system components such as stop loss, trailing algorithm, profit lock, money management, and so on.

I’ve added the scripts to the 2015 scripts collection.

White’s Reality Check

jcl — Mon, 14 Sep 2015 08:48:17 +0000

This is the third part of the Trend Experiment article series. We now want to evaluate if the positive results from the 900 tested trend following strategies are for real, or just caused by Data Mining Bias. But what is Data Mining Bias, after all? And what is this ominous White’s Reality Check?

Suppose you want to trade by moon phases. But you’re not sure if you shall buy at full moon and sell at new moon, or the other way around. So you do a series of moon phase backtests and find out that the best system, which opens positions at any first quarter moon, achieves 30% annual return. Is this finally the proof that astrology works?

A trade system based on a nonexisting effect has normally a profit expectancy of zero (minus trade costs). But you won’t get zero when you backtest variants of such a system. Due to statistical fluctuations, some of them will produce a positive and some a negative return. When you now pick the best performer, such as the first quarter moon trading system, you might get a high return and an impressive equity curve in the backtest. Sadly, its test result is not necessarily caused by clever trading. It might be just by clever selecting the random best performer from a pool of useless systems.

For finding out if the 30% return by quarter moon trading are for real or just the fool’s gold of Data Mining Bias, Halbert White (1959-2012) invented a test method in 2000. White’s Reality Check (aka Bootstrap Reality Check) is explained in detail in the book ‘Evidence-Based Technical Analysis’ by David Aronson. It works this way:

Develop a strategy. During the development process, keep a record of all strategy variants that were tested and discarded because of their test results, including all abandoned algorithms, ideas, methods, and parameters. It does not matter if they were discarded by human decision or by a computer search or optimizing process.
Produce balance curves of all strategy variants, using detrended trade results and no trade costs. Note down the profit P of the best strategy.
Detrend all balance curves by subtracting the mean return per bar (not to be confused with detrending the trade results!). This way you get a series of curves with the same characteristics of the tested systems, but zero profit.
Randomize all curves by bootstrap with replacement. This produces new curves from the random returns of the old curves. Because the same bars can be selected multiple times, most new curves now produce losses or profits different from zero.
Select the best performer from the randomized curves, and note down its profit.
Repeat steps 4 and 5 a couple 1000 times.
You now have a list of several 1000 best profits. The median M of that list is the Data Mining Bias by your strategy development process.
Check where the original best profit P appears in the list. The percentage of best bootstrap profits greater than P is the so-called p-Value of the best strategy. You want the p-Value to be as low as possible. If P is better than 95% of the best bootstrap profits, the best strategy has a real edge with 95% probability.
P minus M minus trade costs is the result to be expected in real trading the best strategy.

The method is not really intuitive, but mathematically sound. However, it suffers from a couple problems that makes WRC difficult to use in real strategy development:

You can see the worst problem already in step 1. During strategy development you’re permanently testing ideas, adding or removing parameters, or checking out different assets and time frames. Putting aside all discarded variants and producing balance curves of all combinations of them is a cumbersome process. It gets even more diffcult with machine learning algorithms that optimize weight factors and usually do not produce discarded variants. However, the good news are that you can easily apply the WRC when your strategy variants are produced by a transparent mechanical process with no human decisions involved. That’s fortunately the case for our trend experiment.
WRC tends to type II errors. That means it can reject strategies although they have an edge. When more irrelevant variants – systems with random trading and zero profit expectancy – are added to the pool, more positive results can be produced in steps 4 and 5, which reduces the probability that your selected strategy survives the test. WRC can determine rather good that a system is profitable, but can not as well determine that it is worthless.
It gets worse when variants have a negative expectancy. WRC can then over-estimate Data Mining Bias (see paper 2 at the end of the article). This could theoretically also happen with our trend systems, as some variants may suffer from a phase reversal due to the delay by the smoothing indicators, and thus in fact trade against the trend instead of with it.

The Experiment

First you need to collect daily return curves from all tested strategies. This requires adding a few lines to the Trend.c script from the previous article:

// some global variables
int Period;
var Daily[3000];
...
// in the run function, set all trading costs to zero
 Spread = Commission = RollLong = RollShort = Slippage = 0;
...
// store daily results in an equity curve
  Daily[Day] = Equity;
}
...
// in the objective function, save the curves in a file for later evaluation
string FileName = "Log\\TrendDaily.bin";
string Name = strf("%s_%s_%s_%i",Script,Asset,Algo,Period);
int Size = Day*sizeof(var); 
file_append(FileName,Name,strlen(Name)+1);
file_append(FileName,&Size,sizeof(int));
file_append(FileName,Daily,Size);

The second part of the above code stores the equity at the end of any day in the Daily array. The third part stores a string with the name of the strategy, the length of the curve, and the equity values itself in a file named TrendDaily.bin in the Log folder. After running the 10 trend scripts, all 900 resulting curves are collected together in the file.

The next part of our experiment is the Bootstrap.c script that applies White’s Reality Check. I’ll write it in two parts. The first part just reads the 900 curves from the TrendDaily.bin file, stores them for later evaluation, finds the best one and displays a histogram of the profit factors. Once we got that, we did already 80% of the work for the Reality Check. This is the code:

void _plotHistogram(string Name,var Value,var Step,int Color)
{
  var Bucket = floor(Value/Step);
  plotBar(Name,Bucket,Step*Bucket,1,SUM+BARS+LBL2,Color);
}

typedef struct curve
{
  string Name;
  int Length;
  var *Values;
} curve;

curve Curve[900];
var Daily[3000];

void main()
{
  byte *Content = file_content("Log\\TrendDaily.bin");
  int i,j,N = 0;
  int MaxN = 0;
  var MaxPerf = 0.0;
	
  while(N<900 && *Content)
  {
// extract the next curve from the file
    string Name = Content;
    Content += strlen(Name)+1;
    int Size = *((int*)Content);
    int Length = Size/sizeof(var); // number of values
    Content += 4;
    var *Values = Content;
    Content += Size;

// store and plot the curve		
    Curve[N].Name = Name;
    Curve[N].Length = Length;
    Curve[N].Values = Values;
    var Performance = 1.0/ProfitFactor(Values,Length);
    printf("\n%s: %.2f",Name,Performance);
    _plotHistogram("Profit",Performance,0.005,RED);

// find the best curve		
    if(MaxPerf < Performance) {
      MaxN = N; MaxPerf = Performance;
    }
    N++;
  }
  printf("\n\nBenchmark: %s, %.2f",Curve[MaxN].Name,MaxPerf); 
}

Most of the code is just for reading and storing all the equity curves. The indicator ProfitFactor calculates the profit factor of the curve, the sum of all daily wins divided by the sum of all daily losses. However, here we need to consider the array order. Like many platforms, Zorro stores time series in reverse chronological order, with the most recent data at the begin. However we stored the daily equity curve in straight chronological order. So the losses are actually wins and the wins actually losses, which is why we need to inverse the profit factor. The curve with the best profit factor will be our benchmark for the test.

This is the resulting histogram, the profit factors of all 900 (or rather, 705 due to the trade number minimum) equity curves:

Profit factor distribution (without trade costs)

Note that the profit factors are slightly different to the parameter charts of the previous article because they were now calculated from daily returns, not from trade results. We removed trading costs, so the histogram is centered at a profit factor 1.0, aka zero profit. Only a few systems achieved a profit factor in the 1.2 range, the two best made about 1.3. Now we’ll see what White has to say to that. This is the rest of the main function in Bootstrap.c that finally applies his Reality Check:

plotBar("Benchmark",MaxPerf/0.005,MaxPerf,50,BARS+LBL2,BLUE);	
printf("\nBootstrap - please wait");
int Worse = 0, Better = 0;
for(i=0; i<1000; i++) {
  var MaxBootstrapPerf = 0;
  for(j=0; j MaxBootstrapPerf)
    Better++;
  else
    Worse++;
  _plotHistogram("Profit",MaxBootstrapPerf,0.005,RED);
  progress(100*i/SAMPLES,0);
}
printf("\nBenchmark beats %.0f%% of samples!",
  (var)Better*100./(Better+Worse));

This code needs about 3 minutes to run; we’re sampling the 705 curves 1000 times. The randomize function will shuffle the daily returns by bootstrap with replacement; the DETREND flag tells it to subtract the mean return from all returns before. The number of random curves that are better and worse than the benchmark is stored, for printing the percentage at the end. The progress function moves the progress bar while Zorro grinds through the 705,000 curves. And this is the result:

Bootstrap results (red), benchmark system (black)

Hmm. We can see that the best system – the black bar – is at the right side of the histogram, indicating that it might be significant. But only with about 80% probability (the script gives a slightly different result every time due to the randomizing). 20% of the random curves achieve better profit factors than the best system from the experiment. The median of the randomized samples is about 1.26. Only the two best systems from the original distribution (first image) have profit factors above 1.26 – all the rest is at or below the bootstrap median.

So we have to conclude that this simple way of trend trading does not really work. Interestingly, one of those 900 tested systems is a system that I use for myself since 2012, although with additional filters and conditions. This system has produced good live returns and a positive result by the end of every year so far. And there’s still the fact that EUR/USD and silver in all variants produced better statistics than S&P500. This hints that some trend effect exists in their price curves, but the profit factors by the simple algorithms are not high enough to pass White’s Reality Check. We need a better approach for trend exploitation. For instance, a filter that detects if trend is there or not. This will be the topic of the next article of this series. We will see that a filter can have a surprising effect on reality checks. Since we now have the Bootstrap script for applying White’s Reality Check, we can quickly do further experiments.

The Bootstrap.c script has been added to the 2015 script collection downloadable on the sidebar.

Conclusion

None of the 10 tested low-lag indicators, and none of the 3 tested markets shows significant positive expectancy with trend trading.
There is evidence of a trend effect in currencies and commodities, but it is too weak or too infrequent for being effectively exploited with simple trade signals by a filtered price curve.
We have now a useful code framework for comparing indicators and assets, and for further experiments with trade strategies.

Papers

Original paper by Dr. H. White: White2000
WRC modification by P. Hansen: Hansen2005
Stepwise WRC Testing by J. Romano, M. Wolf: RomanoWolf2005
Technical Analysis examined with WRC, by P. Hsu, C. Kuan: HsuKuan2006
WRC and its Extensions by V. Corradi, N. Swanson: CorradiSwanson2011

Data mining bias – The Financial Hacker

Better Strategies 5: A Short-Term Machine Learning System

Machine learning strategy development Step 1: The target variable

Step 2: The features

Step 3: Preselecting predictors

Step 4: Select the machine learning algorithm

Step 5: Generate a test data set

Step 6: Calibrate the algorithm

Step 7: The strategy

Step 8: The experiment

Conclusion

Papers / Articles

Better Strategies 4: Machine Learning

Machine learning principles

1. Indicator soup

2. Candle patterns

3. Linear regression

4. Perceptron

5. Neural networks

6. Deep learning

7. Support vector machines

8. K-Nearest neighbor

9. K-Means

10. Naive Bayes

11. Decision and regression trees

Conclusion

Papers

The Cold Blood Index

Not so out of sample

Pull-out conditions

Keeping cold blood

Conclusion

Boosting Strategies with MMI

900 systems experiment revisited

The MMI experiment

Conclusion

White’s Reality Check

The Experiment

Conclusion

Papers

Machine learning strategy development
Step 1: The target variable