As most of you know I'm running a fully automated futures trading system. This system uses a number of different signals to forecast price movements, but is mostly a trend following system.
I've had quite a few requests for a simulated back-test of this system. Although the system has performed very well since I began trading in April 2014, this has been a period when most trend following systems have done very well. In particular, the flagship fund of quant shop AHL (where I used to work) made about 34% last year.
(I'll be providing a more thorough review of my own performance after I've completed a full year of trading, in a few weeks' time.)
So there is natural curiosity as to whether 2009 to 2013, a much worse period for trend following, would also have been bad for my system.
This will also be an educational exercise, as I'll talk through some of the issues involved in making a backtest as realistic as possible, and avoiding the deathly curse of "overfitting". Overfitted back-tests can look amazing, but they are unlikely to do well in actual trading.
Futures markets
I will use data from 43 futures markets to simulate the model. These have been chosen to cover a wide range of asset classes, and also on the basis of factors like trading cost and data availability. One slight wrinkle is that I don't have a long price history for all my instruments. The data I get from my broker only goes back to late 2013, when I started collecting prices, and although www.quandl.com has been a great resource for backfilling longer price histories, it doesn't cover every market. If anyone knows of another site offering (free) historical daily price data for individual futures contracts, preferably one that provides .csv files or an API, I'd love to hear about it.
Here you can see how many markets I have data for over time:
Notice the big jump in 2013 when I started getting broker data. This should mean the backtest is a little conservative, since you get better performance from more markets (I know this from simulating performance of similar systems in my old job where I had access to much more data).
However it also means I've needed to take care that the weights given to different instruments in the portfolio are rescaled and refitted properly as new data series arrive.
Trading rules
I have four main kinds of trading rules (a minimal sketch of a trend following rule follows this list):
- Trend following
- Carry
- Relative value (within an asset class)
- Selling volatility (this 'rule' just amounts to a modest short bias on the VIX and V2X markets)
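To give a flavour of the first category, here is a minimal sketch of a typical trend following variation: a fast/slow exponentially weighted moving average crossover, normalised by volatility and capped. The function name, the spans, the scaling factor and the cap are illustrative assumptions, not a description of my exact rule.

```python
import pandas as pd

def ewmac_forecast(price: pd.Series, fast_span: int = 16, slow_span: int = 64) -> pd.Series:
    """Toy trend following forecast: fast EWMA minus slow EWMA, scaled by
    recent price volatility and capped at +/-20. All constants here are
    illustrative, not production values."""
    fast = price.ewm(span=fast_span).mean()
    slow = price.ewm(span=slow_span).mean()
    raw = fast - slow
    # normalise by a rolling estimate of daily price volatility
    vol = price.diff().ewm(span=36).std()
    forecast = 10.0 * raw / vol
    return forecast.clip(-20, 20)
```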
How do we decide how much weight to give to each trading rule? I use a technique called non-parametric bootstrapping to do my portfolio optimisation. Bootstrapping automatically adjusts the weights according to how different the underlying data is from random noise, so it produces less extreme portfolios.
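For those who like to see things concretely, here is a stripped-down sketch of the idea: resample the historical returns with replacement many times, optimise on each resample, and average the resulting weights, which pulls the portfolio towards equal weights when the data is too noisy to distinguish the assets. The Sharpe-maximising optimiser, the long-only constraint and the parameter choices are illustrative assumptions, not my production code.

```python
import numpy as np
import pandas as pd
from scipy.optimize import minimize

def one_shot_weights(returns: pd.DataFrame) -> np.ndarray:
    """Maximise the Sharpe ratio on one (resampled) history; long-only, weights sum to one."""
    n = returns.shape[1]
    mean, cov = returns.mean().values, returns.cov().values

    def neg_sharpe(w):
        return -(w @ mean) / np.sqrt(w @ cov @ w)

    result = minimize(neg_sharpe, np.ones(n) / n, method="SLSQP",
                      bounds=[(0.0, 1.0)] * n,
                      constraints=({"type": "eq", "fun": lambda w: w.sum() - 1.0},))
    return result.x

def bootstrapped_weights(returns: pd.DataFrame, n_draws: int = 100) -> pd.Series:
    """Average the optimised weights over many resampled histories."""
    rng = np.random.default_rng(0)
    draws = []
    for _ in range(n_draws):
        rows = rng.integers(0, len(returns), size=len(returns))  # sample with replacement
        draws.append(one_shot_weights(returns.iloc[rows]))
    return pd.Series(np.mean(draws, axis=0), index=returns.columns)
```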
The fitting is done on an expanding, out of sample window. For example, to trade in 1987 I use weights fitted on data from 1978 to 1986; for 2015 I use data from 1978 to 2014. So I'm only ever using the past, never forward-looking data.
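Concretely, the expanding window is just a sequence of (fit period, trade period) pairs; the little helper below is an illustrative sketch rather than my actual code.

```python
def expanding_windows(first_year: int = 1978, last_year: int = 2015):
    """Yield (fit_start, fit_end, trade_year): fit on everything up to the
    end of the previous year, then trade the following year."""
    for trade_year in range(first_year + 1, last_year + 1):
        yield first_year, trade_year - 1, trade_year

# e.g. (1978, 1986, 1987), ..., (1978, 2014, 2015)
```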
To avoid overfitting I pool the pre-cost returns across all the instruments for which I have data. I've rarely found enough consistent evidence that different trading rules work better pre-cost on different kinds of instrument to justify doing anything else, especially given the paucity of available data in the past.
I then work out after-cost returns, so on expensive markets there is likely to be less weight on faster trading rules.
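As a rough sketch of what that adjustment could look like (the 256 business day convention and the idea of expressing costs in Sharpe Ratio units are my illustrative assumptions): estimate an annual cost for each instrument and rule combination, convert it into a daily return drag, and subtract it from the pooled pre-cost returns before fitting the weights.

```python
import pandas as pd

def after_cost_returns(pre_cost: pd.DataFrame, annual_cost_sr: pd.Series) -> pd.DataFrame:
    """pre_cost: daily pre-cost returns, one column per (instrument, rule) pair.
    annual_cost_sr: estimated annual trading cost of each column in Sharpe Ratio
    units, higher for faster rules on more expensive instruments."""
    daily_vol = pre_cost.std()
    # annual cost in return units = cost in SR units * annualised volatility,
    # then spread evenly over ~256 business days
    daily_drag = annual_cost_sr * daily_vol * (256 ** 0.5) / 256.0
    return pre_cost - daily_drag
```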
Overfitting and data mining
Other than making sure you account properly for the effect of costs, the main issue to worry about is overfitting, also known as data mining. As you can see I am quite careful not to use forward-looking information, and bootstrapping ensures we don't overfit on the basis of limited data.
However I can't get away from the fact that I am using trading rules that I already know have worked, based on my own experience and general market knowledge. So there will be some implicit data mining going on before the backtest is even run.
This issue is discussed briefly in this blog. It will be discussed more thoroughly in my forthcoming book (details to follow, but hopefully out later this year), where there will also be more information about backtesting and fitting generally.
But my rules are generally simple, and having a number of variations for each rule should minimise the bias this causes. Still, I wouldn't expect to realise the Sharpe Ratio I see in this back-test (this is also because future asset returns are unlikely to be as high as in the simulated period, when a secular decline in inflation caused large one-off repricing gains). But it's much more realistic than an overfitted version would be.
A portfolio of futures
I then use a similar procedure to get weights for the instruments in my portfolio, with a few tweaks. I use weekly returns; otherwise the correlations are unrealistically low due to different market closing times (all other work is done with daily data). Obviously I don't pool data from different instruments together!
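The weekly-returns point is easy to illustrate: compound the daily returns up to weekly before estimating the correlation matrix. A minimal pandas sketch, assuming a DataFrame of daily percentage returns with a DatetimeIndex:

```python
import pandas as pd

def weekly_correlations(daily_returns: pd.DataFrame) -> pd.DataFrame:
    """Estimate instrument correlations from weekly returns. Weekly sampling
    avoids the artificially low correlations caused by markets closing at
    different times of day."""
    weekly = (1.0 + daily_returns).resample("W").prod() - 1.0
    return weekly.corr()
```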
However, if I don't have at least a year of data for an instrument when I start trading it, I use the average returns of the rest of the asset class, plus enough noise that the new instrument is on average 80% correlated with the other instruments in the same group. This gives me reasonable weights until I have enough data to fit them more precisely.
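Here is roughly what that means in code, as a hedged sketch; only the 80% correlation target comes from the text, while the function name, the equal-volatility noise and the blending formula are my illustrative assumptions.

```python
import numpy as np
import pandas as pd

def synthetic_returns(asset_class_avg: pd.Series, target_corr: float = 0.8,
                      seed: int = 0) -> pd.Series:
    """Proxy return series for an instrument with too little history: the asset
    class average blended with independent noise of the same volatility, so the
    result is roughly target_corr correlated with that average."""
    rng = np.random.default_rng(seed)
    noise = pd.Series(rng.normal(0.0, asset_class_avg.std(), len(asset_class_avg)),
                      index=asset_class_avg.index)
    # corr(a*X + b*N, X) = a when X and N have equal volatility and a^2 + b^2 = 1
    a = target_corr
    b = np.sqrt(1.0 - target_corr ** 2)
    return a * asset_class_avg + b * noise
```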
I also don't take pre-cost performance into account (again there isn't much evidence that this is statistically different between markets), although because I'm bootstrapping it wouldn't change the weights much anyway.
Here are the final weights from the bootstrapping procedure, for each asset class:
Agricultural: 21.5%
Bonds and STIR: 17.5%
Equity index, including volatility: 17.3%
FX: 19.1%
Metals: 16.7%
Oil and Gas: 8.3%
These are nice and even.
Risk targeting
I assume here that we start with £500,000, and are targeting risk such that our annualised returns have an average volatility of 25% of this, or £125,000 (this is the same percentage risk target as mine, but not the same size of portfolio).
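To put that in day-to-day terms, a quick back-of-the-envelope calculation (the 256 business day convention is my assumption):

```python
capital = 500_000            # starting capital in GBP
annual_vol_target = 0.25     # target annualised volatility of returns

annual_cash_vol = capital * annual_vol_target     # 125,000 GBP per year
daily_cash_vol = annual_cash_vol / (256 ** 0.5)   # roughly 7,800 GBP per day
print(round(annual_cash_vol), round(daily_cash_vol))
```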
It's imperative that we know we're getting this right. Here is an estimate of the realised rolling annualised volatility of returns. Higher peaks mean that we have strong forecasts from our trading rules, or that correlations are particularly high, or that the markets were more volatile than we expected when we originally put on our positions. However the average is about right; if anything it is a little lower, and more conservative, than it should be.
(This is to do with a risk management overlay that I use in my model, which reduces risk when it thinks there is potential for large losses.)
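For anyone who wants to reproduce this kind of check, a minimal sketch, assuming a series of daily profits and losses in GBP (the window length and business day count are illustrative choices):

```python
import pandas as pd

def rolling_annualised_vol(daily_pnl: pd.Series, window: int = 125) -> pd.Series:
    """Rolling annualised cash volatility of daily P&L, for comparison
    against the GBP 125,000 target."""
    return daily_pnl.rolling(window).std() * (256 ** 0.5)
```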
And the winner is...
Here is what you've all been waiting for - the veritable money shot. You can see that the last year has been exceptionally good. Overall though this is a good, but not unbelievable, performance. It would have been very easy to get a much better curve by fitting in sample, and by using more aggressive fitting techniques. But that would prove nothing, and I'd probably be doing much worse in real trading.
Some statistics:
Sharpe Ratio: 0.88
Realised annualised standard deviation: 19%
Average drawdown: 9.2%
Ratio of average winning day to average losing day returns: 1.006
Proportion of winning days: 54%
Worst drawdown: 33%
Proportion of days spent in drawdown: 94%
Note that without costs the Sharpe Ratio would be higher, at around 0.94. So I'm paying about 0.06 SR in costs. This is an outcome of how I excluded faster trading rules on more expensive instruments.
These returns assume we maintain the same risk target. However, all traders should reduce their risk when they lose money, and most will also want to increase exposure as their account value grows. In the latter case the returns shown above are effectively a log graph of what your returns would be. Since the system makes 16% a year on average over 32 years, the compounded returns would be pretty good.
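As a very rough indication only (naively compounding the 16% arithmetic average, then again with a crude volatility drag adjustment for the 19% realised standard deviation):

```python
mean, vol, years = 0.16, 0.19, 32

naive = (1 + mean) ** years                    # roughly 115x the starting capital
drag_adjusted_rate = mean - 0.5 * vol ** 2     # crude geometric return approximation
adjusted = (1 + drag_adjusted_rate) ** years   # roughly 70x
print(round(naive, 1), round(adjusted, 1))
```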
I reduce my capital when I make losses, but keep it at a capped maximum when I am at my high water mark. This would slightly increase the Sharpe shown above and reduce the drawdowns, at the expense of a lower total gain.
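The exact mechanics aren't spelled out here, but the flavour is something like the sketch below, where trading capital falls pound-for-pound with the drawdown and is capped at its original level whenever the account is at its high water mark (the pound-for-pound rule is my simplifying assumption).

```python
def trading_capital(equity_curve, max_capital=500_000):
    """Capital used for position sizing, given a list of daily account values:
    reduced in drawdowns, capped at max_capital at the high water mark."""
    capital, high_water_mark = [], equity_curve[0]
    for equity in equity_curve:
        high_water_mark = max(high_water_mark, equity)
        drawdown = high_water_mark - equity
        capital.append(max(0.0, max_capital - drawdown))
    return capital
```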
Here are returns we get from the different styles of trading (don't worry about the units on the y-axis):
You can see that trend following (which contributes about 60% of my risk), as has been well documented, did poorly from 2011 to 2013. However the other trading rules saved the day, in particular carry. On the other hand 2014 was a great year for trend following, and this is reflected in my overall performance and in that of large funds with similar styles, such as AHL, Bluetrend, Winton and Cantab.
Note that in calculating profits I always lag my trades by one day, and assume they are done at the next day's closing price, paying half the usual spread on the market, plus the normal commission. This is all fairly conservative.
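In code terms, the cost side of that assumption looks something like this sketch; the argument names are mine, and in practice the spread, point value and commission would come from each instrument's cost estimates.

```python
def execution_cost(quantity: int, spread: float, point_value: float,
                   commission: float) -> float:
    """Cash cost of executing `quantity` contracts, assumed filled at the next
    day's close: half the bid-ask spread per contract (converted to cash via
    the value of one price point) plus commission per contract."""
    per_contract = 0.5 * spread * point_value + commission
    return abs(quantity) * per_contract
```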
These simulated returns don't include interest charges, gains or losses on converting FX for margin payments, or data fees. In my annual review of actual performance I'll give you some idea of how large these elements are (sneak preview, not that large).
If you'd like any more detail or stats, then please comment on this post. I hope this has been interesting.