
EVALUATING A SYSTEM STATISTICALLY
Now that some of the basics are out of the way, let us look at how statistics
are used when developing and evaluating a trading system. The examples below
employ a system that was optimized on one sample of data (the in-sample data)
and then run (tested) on another sample of data (the out-of-sample data). The
out-of-sample evaluation of this system will be discussed before the in-sample
one because the statistical analysis was simpler for the former (which is
equivalent to the evaluation of an unoptimized trading system) in that no
corrections for multiple tests or optimization were required. The system is a
lunar model that trades the S&P 500; it was published in an article we wrote
(see Katz with McCormick, June 1997). The TradeStation code for this system is
shown below:




Example 1: Evaluating the Out-of-Sample Test
Evaluating an optimized system on a set of out-of-sample data that was never
used during the optimization process is identical to evaluating an unoptimized
system. In both cases, one test is run without adjusting any parameters.
Table 4-1 illustrates the use of statistics to evaluate an unoptimized system:
It contains the out-of-sample or verification results together with a variety
of statistics. Remember, in this test, a fresh set of data was used; this data
was not used as the basis for adjustments in the system's parameters.
The parameters of the trading model have already been set. A sample of data
was drawn from a period in the past, in this specific case, 1/1/95 through
1/1/97; this is the out-of-sample or verification data. The model was then run
on this out-of-sample data, and it generated simulated trades. Forty-seven
trades were taken. This set of trades can itself be considered a sample of
trades, one drawn from the population of all trades that the system took in the
past or will take in the future; i.e., it is a sample of trades taken from the
universe or population of all trades for that system. At this point, some
inference must be made regarding the average profit per trade in the population
as a whole, based on the sample of trades. Could the performance obtained in
the sample be due to chance alone? To find the answer, the system must be
statistically evaluated.
To begin statistically evaluating this system, the sample mean (average) for
n (the number of trades or sample size) must first be calculated. The mean is
simply the sum of the profit/loss figures for the trades generated divided by n
(in this case, 47). The sample mean was $974.47 per trade. The standard
deviation (the variability in the trade profit/loss figures) is then computed
by subtracting the sample mean from each of the profit/loss numbers for all 47
trades in the sample; this results in 47 (n) deviations. Each of the deviations
is then squared, and then all squared deviations are added together. The sum of
the squared deviations is divided by n - 1 (in this case, 46). By taking the
square root of the resultant number (the mean squared deviation), the sample
standard deviation is obtained. Using the sample standard deviation, the
expected standard deviation of the mean is computed: The sample standard
deviation (in this case, $6,091.10) is divided by the square root of the sample
size. For this example, the expected standard deviation of the mean was
$888.48.
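The arithmetic just described can be sketched in a few lines of Python. This is only an illustration: the function name `mean_sd_se` and the five-trade list are hypothetical, not the lunar system's data. The last lines recompute the expected standard deviation of the mean from the summary figures reported in the text, and the result should land close to the $888.48 quoted above.

```python
import math

def mean_sd_se(pnl):
    """Return (mean, sample SD, expected SD of the mean) for trade P/L figures."""
    n = len(pnl)
    mean = sum(pnl) / n
    # Sum of squared deviations from the sample mean, divided by n - 1
    variance = sum((x - mean) ** 2 for x in pnl) / (n - 1)
    sd = math.sqrt(variance)
    # Expected SD of the mean: sample SD divided by the square root of n
    se = sd / math.sqrt(n)
    return mean, sd, se

# Hypothetical trades, for illustration only
example_trades = [650.0, -2500.0, 6025.0, -2500.0, 4400.0]
mean, sd, se = mean_sd_se(example_trades)
print(f"mean={mean:.2f}  sd={sd:.2f}  se={se:.2f}")

# Recompute the expected SD of the mean from the figures reported in the text
reported_sd = 6091.1028
n_trades = 47
reported_se = reported_sd / math.sqrt(n_trades)
print(f"expected SD of mean = {reported_se:.4f}")
```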
To determine the likelihood that the observed profitability is due to chance
alone, a simple t-test is calculated. Since the sample profitability is being
compared with no profitability, zero is subtracted from the sample mean trade
profit/loss (computed earlier). The resultant number is then divided by the
expected standard deviation of the mean to obtain the value of the t-statistic,
which in this case worked out to be 1.0968. Finally, the probability of getting
such a large t-statistic by chance alone (under the assumption that the system
was not profitable in the population from which the sample was drawn) is
calculated: The cumulative t-distribution for that t-statistic is computed with
the appropriate degrees of freedom, which in this case was n - 1, or 46.
CHAPTER 4 Statistics




TABLE 4-1

Trades from the S&P 500 Data Sample on Which the Lunar Model Was Verified

Entry Date   Exit Date   Profit/Loss   Cumulative
950207       950221              650        88825
950221       950223            -2500        86325
950309       950323             6025        92350
950323       950324            -2500        89850
950407       950419            -2500        87350
950421       950424            -2500        84850
950508       950516            -2500        82350
950523       950524            -2500        79850
950606       950609            -2500        77350
950620       950622            -2500        74850
950704       950718             4400        79250
950719       950725            -2500        76750
950803       950818             2575        79325
950818       950901               25        79350
950901       950916            10475        89825
950918       950929            -2500        87325
951002       951003            -2500        84825
951017       951018            -2550        82275
951031       951114             3150        85425
951114       951116            -2500        82925
951128       951214             6750        89675
951214       951228             5250        94925
951228       960109            -2500        92425
960112       960117            -2500        89925
960128       960213            18700       108625
960213       960213            -2500       106125
960227       960227            -2500       103625

Additional rows follow but are not shown in the table.

Statistical Analyses of Mean Profit/Loss

Sample Size                        47.0000
Sample Mean                       974.4681
Sample Standard Deviation        6091.1028
Expected SD of Mean               888.4787
T Statistic (P/L > 0)               1.0968
Probability (Significance)          0.1392

Serial Correlation (lag=1)          0.2120
Associated T Statistic              1.4301
Probability (Significance)          0.1572

Number of Wins                     16.0000
Percentage of Wins                  0.3404
Upper 99% Bound                     0.5319
Lower 99% Bound                     0.1702




FIGURE 4-1

Frequency and Cumulative Distribution for In-Sample Trades

(Microsoft's Excel spreadsheet provides a function to obtain probabilities
based on the t-distribution. Numerical Recipes in C provides the incomplete
beta function, which is very easily used to calculate probabilities based on a
variety of distributions, including Student's t.) The cumulative t-distribution
calculation yields a figure that represents the probability that the results
obtained from the trading system were due to chance. Since this figure was
small, it is unlikely that the results were due to capitalization on random
features of the sample. The smaller the number, the more likely the system
performed the way it did for reasons other than chance. In this instance, the
probability was 0.1392; i.e., if a system with a true (population) profit of $0
was repeatedly tested on independent samples, only about 14% of the time would
it show a profit as high as that actually observed.
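A minimal sketch of the t-test calculation follows, using only the summary figures reported in the text. The t-statistic is the sample mean divided by the expected standard deviation of the mean. For the probability, this sketch integrates the t-density numerically with Simpson's rule rather than using the incomplete beta function mentioned above; that substitution, and the function names `t_pdf` and `t_upper_tail`, are assumptions made for illustration. The results should come out close to the 1.0968 and 0.1392 reported.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def t_upper_tail(t, df, steps=2000):
    """One-tailed probability P(T >= t) for t >= 0, via Simpson's rule on [0, t]."""
    if t < 0:
        raise ValueError("this sketch handles t >= 0 only")
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_0_to_t = total * h / 3
    # Half of the distribution lies above zero; remove the part between 0 and t
    return 0.5 - area_0_to_t

# Summary figures reported in the text
sample_mean = 974.4681
se_of_mean = 888.4787
n_trades = 47

# Compare the sample mean with zero (no profitability)
t_stat = (sample_mean - 0.0) / se_of_mean
p_value = t_upper_tail(t_stat, n_trades - 1)
print(f"t = {t_stat:.4f}, one-tailed p = {p_value:.4f}")
```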
Although the t-test was, in this example, calculated for a sample of trade
profit/loss figures, it could just as easily have been computed for a sample of
daily returns. Daily returns were employed in this way to calculate the
probabilities referred to in discussions of the substantive tests that appear
in later chapters. In fact, the annualized risk-to-reward ratio (ARRR) that
appears in many of the tables and discussions is nothing more than a rescaled
t-statistic based on daily returns.
Finally, a confidence interval on the probability of winning is estimated. In
the example, there were 16 wins in a sample of 47 trades, which yielded a
percentage of wins equal to 0.3404. Using a particular inverse of the
cumulative binomial distribution, upper 99% and lower 99% boundaries are
calculated. There is a 99% probability that the percentage of wins in the
population as a whole is between 0.1702 and 0.5319. In Excel, the CRITBINOM
function may be used in the calculation of confidence intervals on percentages.
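The interval can be sketched as follows. The helper `critbinom` is a hypothetical stand-in that mimics the documented behavior of Excel's CRITBINOM (the smallest count whose cumulative binomial probability reaches a given level). The text does not state which probability levels were used; placing 0.5% in each tail is an assumption made here, chosen because it yields a two-sided 99% interval and appears to reproduce the bounds quoted above.

```python
import math

def critbinom(n, p, alpha):
    """Smallest k such that P(X <= k) >= alpha for X ~ Binomial(n, p)."""
    cumulative = 0.0
    for k in range(n + 1):
        cumulative += math.comb(n, k) * p ** k * (1 - p) ** (n - k)
        if cumulative >= alpha:
            return k
    return n

n_trades = 47
wins = 16
win_rate = wins / n_trades

# Assumption: 0.5% in each tail, giving a two-sided 99% interval
lower_k = critbinom(n_trades, win_rate, 0.005)
upper_k = critbinom(n_trades, win_rate, 0.995)

# Convert the critical win counts to percentages of wins
lower_bound = lower_k / n_trades
upper_bound = upper_k / n_trades
print(f"win rate {win_rate:.4f}, bounds {lower_bound:.4f} to {upper_bound:.4f}")
```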
The various statistics and probabilities computed above should provide the
system developer with important information regarding the behavior of the
trading model, that is, if the assumptions of normality and independence are
met and if the sample is representative. Most likely, however, the assumptions
underlying the t-tests and other statistics are violated; market data deviates
seriously from the normal distribution, and trades are usually not independent.
In addition, the sample might not be representative. Does this mean that the
statistical evaluation just discussed is worthless? Let's consider the cases.

What if the Distribution Is Not Normal? An assumption in the t-test is that the
underlying distribution of the data is normal. However, the distribution of
profit/loss figures of a trading system is anything but normal, especially if there
are stops and profit targets, as can be seen in Figure 4-1, which shows the
distribution of profits and losses for trades taken by the lunar system. Think of it for a
moment. Rarely will a profit greater than the profit target occur. In fact, a lot
of trades are going to bunch up with a profit equal to that of the profit target. Other
trades are going to bunch up where the stop loss is set, with losses equal to that;
and there will be trades that will fall somewhere in between, depending on the exit
method. The shape of the distribution will not be that of the bell curve that describes
the normal distribution. This is a violation of one of the assumptions underlying the
t-test. In this case, however, the Central Limit Theorem comes to the rescue. It states
that as the number of cases in the sample increases, the distribution of the sample
mean approaches normal. By the time there is a sample size of 10, the errors result-
ing from the violation of the normality assumption will be small, and with sample
sizes greater than 20 or 30, they will have little practical significance for inferences
regarding the mean. Consequently, many statistics can be applied with reasonable
assurance that the results will be meaningful, as long as the sample size is adequate,
as was the case in the example above, which had an n of 47.

What if There Is Serial Dependence? A more serious violation, which makes
the above-described application of the t-test not quite cricket, is serial
dependence, which is when cases constituting a sample (e.g., trades) are not
statistically independent of one another. Trades come from a time series. When
a series of trades that occurred over a given span of dates is used as a
sample, it is not quite a random sample. A truly random sample would mean that,
say, 100 trades were randomly taken from the period when the contract for the
market started (e.g., 1983 for the S&P 500) to far into the future; such a
sample would not only be less likely to suffer from serial dependence, but be
more representative of the population from which it was drawn. However, when
developing trading systems, sampling is usually done from one narrow point in
time; consequently, each trade may be correlated with those adjacent to it and
so would not be independent.
The practical effect of this statistically is to reduce the effective sample
size. When trying to make inferences, if there is substantial serial
dependence, it may be as if the sample contained only half or even one-fourth
of the actual number of trades or data points observed. To top it off, the
extent of serial dependence cannot definitively be determined. A rough
“guesstimate,” however, can be made. One such guesstimate may be obtained by
computing a simple lag/lead serial correlation: A correlation is computed
between the profit and loss for Trade i and the profit and loss for Trade
i + 1, with i ranging from 1 to n - 1. In the example, the serial correlation
was 0.2120, not very high, but a lower number would be preferable. An
associated t-statistic may then be calculated along with a statistical
significance for the correlation. In the current case, these statistics reveal
that if there really were no serial correlation in the population, a
correlation as large as the one obtained from the sample would only occur in
about 16% of such tests.
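The lag/lead serial correlation can be sketched as below. The function names and the six-trade list are hypothetical. The t-statistic uses the usual textbook formula for testing a correlation, which is an assumption: the text does not say which formula produced the figures in Table 4-1, so this sketch may not reproduce them exactly.

```python
import math

def lag1_serial_correlation(pnl):
    """Pearson correlation between Trade i and Trade i + 1 (lag = 1)."""
    a = pnl[:-1]   # trades 1 .. n-1
    b = pnl[1:]    # trades 2 .. n
    m = len(a)
    mean_a = sum(a) / m
    mean_b = sum(b) / m
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def correlation_t_stat(r, pairs):
    """t-statistic for a correlation r computed from `pairs` pairs (df = pairs - 2).

    Assumption: the standard formula; the text does not state its own.
    """
    return r * math.sqrt((pairs - 2) / (1 - r * r))

# Hypothetical profit/loss figures, for illustration only
example_trades = [1.0, 3.0, 2.0, 4.0, 3.0, 5.0]
r = lag1_serial_correlation(example_trades)
t = correlation_t_stat(r, len(example_trades) - 1)
print(f"serial correlation = {r:.4f}, t = {t:.4f}")
```

A positive correlation means winners tend to follow winners and losers tend to follow losers, which is the situation that shrinks the effective sample size.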
Serial dependence is a serious problem. If there is a substantial amount of
it, it would need to be compensated for by treating the sample as if it were
smaller than it actually is. Another way to deal with the effect of serial
dependence is to draw a random sample of trades from a larger sample of trades
computed over a longer period of time. This would also tend to make the sample
of trades more representative of the population.

What if the Markets Change? When developing trading systems, a third
assumption of the t-test may be inadvertently violated. There are no
precautions that can be taken to prevent it from happening or to compensate for
its occurrence. The reason is that the population from which the development or
verification sample was drawn may be different from the population from which
future trades may be taken. This would happen if the market underwent some real
structural or other change. As mentioned before, the population of trades of a
system operating on the S&P 500 before 1983 would be different from the
population after that year since, in 1983, the options and futures started
trading on the S&P 500 and the market changed. This sort of thing can devastate
any method of evaluating a trading system. No matter how much a system is
back-tested, if the market changes before trading begins, the trades will not
be taken from the same market for which the system was developed and tested;
the system will fall apart. All systems, even currently profitable ones, will
eventually succumb to market change. Regardless of the market, change is
inevitable. It is just a question of when it will happen. Despite this grim
fact, the use of statistics to evaluate systems remains essential, because if
the market does not change substantially shortly after trading of the system
commences, or if the change is not sufficient to grossly affect the system's
performance, then a reasonable estimate of expected probabilities and returns
can be calculated.

Example 2: Evaluating the In-Sample Tests
How can a system that has been fit to a data sample by the repeated adjustment of
parameters (i.e., an optimized system) be evaluated? Traders frequently optimize
systems to obtain good results. In this instance, the use of statistics is more impor-
tant than ever since the results can be analyzed, compensating for the multiplicity
of tests being performed as part of the process of optimization. Table 4-2 contains
the profit/loss figures and a variety of statistics for the in-sample trades (those
taken on the data sample used to optimize the system). The system was optimized
on data from 1/1/90 through 1/2/95.
TABLE 4-2

Trades from the S&P 500 Data Sample on Which the Lunar Model Was Optimized

Entry Date   Exit Date   Profit/Loss   Cumulative
900417       900501             5750         5750
900501       900516            11700        17450
900516       900522            -2500        14950
                                 150        15100
900615       900702             2300        17400
900702       900716             4550        21950
             900731             6675        28625
900731       900802            -2500        26125
900814       900828             9500        35625
900828       900911              575        36200
900911                          7225        43425
                               -2500        40925
901010                         -2875        38050
901028       901028            -2500        35550
901109       901112            -2700        32850
901128       901211             8125        40975
901211       901225             -875        40100
901225       910102            -2500        37600
910108       910108            -2500        35100

910208                         -2500
910221                          4550

910322                          5600
910408                         -2500
910423                         -2500
910507                          3800

Most of the statistics in Table 4-2 are of the same kind as those in Table 4-1,
which was associated with Example 1. Two additional statistics (that differ
from those in the first example) are labeled “Optimization Tests Run” and
“Adjusted for Optimization.” The first statistic is simply the number of
different parameter combinations tried, i.e., the total number of times the
system was run on the data, each time using a different set of parameters.
Since the lunar system parameter, LI, was stepped from 1 to 20 in increments of
1, 20 tests were performed; consequently, there were 20 t-statistics, one for
each test. The number of tests run is used to make an adjustment to the
probability or significance obtained from the best t-statistic computed on the
sample: Take 1, and subtract from it the statistical significance obtained for
the best-performing test. Take the resultant number and raise it to the mth
power (where m = the number of tests run). Then subtract that number from 1.
This provides the probability of finding, in a sample of m tests (in this case,
20), at least one t-statistic as good as the one actually obtained for the
optimized solution. The uncorrected probability that the profits observed for
the best solution were due to chance was less than 2%, a fairly significant
result. Once adjusted for multiple tests, i.e., optimization, the statistical
significance does not appear anywhere near as good. Results at the level of
those observed could have been obtained for such an optimized system 31% of the
time by chance alone. However, things are not quite as bad as they seem. The
adjustment was extremely conservative and assumed that every test was
completely independent of every other test. In actual fact, there will be a
high serial correlation between most tests since, in many trading systems,
small changes in the parameters produce relatively small changes in the
results. This is exactly like serial dependence in data samples: It reduces the
effective population size, in this case, the effective number of tests run.
Because many of the tests are correlated, the 20 actual tests probably
correspond to about 5 to 10 independent tests. If the serial dependence among
tests is considered, the adjusted-for-optimization probability would most
likely be around 0.15, instead of the 0.3104 actually calculated. The nature
and extent of serial dependence in the multiple tests are never known, and
therefore, a less conservative adjustment for optimization cannot be directly
calculated, only roughly reckoned.
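The adjustment described above amounts to computing 1 - (1 - p)^m. A minimal sketch follows; the function name `adjust_for_optimization` is hypothetical, and the input significance of 0.0184 is the rounded figure reported for the best-performing test, so the result may differ from 0.3104 in the last digit. The loop at the end shows what the same formula gives if the 20 correlated tests are treated as only 5 or 10 independent ones.

```python
def adjust_for_optimization(p_best, tests):
    """Probability that at least one of `tests` independent tests does as well.

    Take 1 minus the significance, raise it to the number of tests,
    then subtract the result from 1.
    """
    return 1.0 - (1.0 - p_best) ** tests

p_best = 0.0184   # significance of the best-performing test (rounded)
tests_run = 20    # parameter stepped from 1 to 20

adjusted = adjust_for_optimization(p_best, tests_run)
print(f"{tests_run} independent tests: {adjusted:.4f}")

# If the correlated tests behave like only 5 to 10 independent tests
effective = {m: adjust_for_optimization(p_best, m) for m in (5, 10)}
for m, p in effective.items():
    print(f"{m} effective tests: {p:.4f}")
```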
Under certain circumstances, such as in multiple regression models, there are
exact mathematical formulas for calculating statistics that incorporate the fact that
parameters are being fit, i.e., that optimization is occurring, making corrections for
optimization unnecessary.

Interpreting the Example Statistics
In Example 1, the verification test was presented. The in-sample optimization run
was presented in Example 2. In the discussion of results, we are returning to the nat-
ural order in which the tests were run, i.e., optimization first, verification second.

Optimization Results. Table 4-2 shows the results for the in-sample period.
Over the 5 years of data on which the system was optimized, there were 118
trades (n = 118). The mean or average trade yielded about $740.97, and the
trades were highly variable, with a sample standard deviation of around
±$3,811; i.e., there were many trades that lost several thousand dollars, as
well as trades that made many thousands. The degree of profitability can easily
be seen by looking at the profit/loss column, which contains many $2,500 losses
(the stop got hit) and a significant number of wins, many greater than $5,000,
some even greater than $10,000. The expected standard deviation of the mean
suggests that if samples of this kind were repeatedly taken, the mean would
vary only about one-tenth as much as the individual trades, and that many of
the samples would have mean profitabilities in the range of $740 ± $350.
The t-statistic for the best-performing system from the set of optimization
runs was 2.1118, which has a statistical significance of 0.0184. This was a
fairly strong result. If only one test had been run (no optimizing), this good
a result would have been obtained (by chance alone) only twice in 100 tests,
indicating that the system is probably capturing some real market inefficiency
and has some chance of holding up. However, be warned: This analysis was for
the best of 20 sets of parameter values tested. If corrected for the fact that
20 combinations of parameter values were tested, the adjusted statistical
significance would only be about 0.31, not very good; the performance of the
system could easily have been due to chance. Therefore, although the system may
hold up, it could also, rather easily, fail.
The serial correlation between trades was only 0.0479, a value small enough
in the present context, with a significance of only 0.6083. These results strongly
suggest that there was no meaningful serial correlation between trades and that the
statistical analyses discussed above are likely to be correct.
There were 58 winning trades in the sample, which represents about a 49%
win rate. The upper 99% confidence boundary was approximately 61% and the
lower 99% confidence boundary was approximately 37%, suggesting that the true
percentage of wins in the population has a 99% likelihood of being found between
those two values. In truth, the confidence region should have been broadened by
correcting for optimization; this was not done because we were not very con-
cerned about the percentage of wins.

Verification Results. Table 4-1, presented earlier, contains the data and
statistics for the out-of-sample test for the model. Since all parameters were
already fixed, and only one test was conducted, there was no need to consider
optimization or its consequences in any manner. In the period from 1/1/95 to
1/1/97, there were 47 trades. The average trade in this sample yielded about
$974, which is a greater average profit per trade than in the optimization
sample! The system apparently did maintain profitable behavior.
At slightly over $6,000, the sample standard deviation was almost double
that of the standard deviation in the optimization sample. Consequently, the
standard deviation of the sample mean was around $890, a fairly large standard
error of estimate; together with the small sample size, this yielded a lower
t-statistic than found in the optimization sample and, therefore, a lowered
statistical significance of only about 14%. These results were neither very
good nor very bad: There is better than an 80% chance that the system is
capitalizing on some real (non-chance) market inefficiency. The serial
correlation in the test sample, however, was quite a bit higher than in the
optimization sample and was significant, with a probability of 0.1572; i.e., as
large a serial correlation as this would only be expected about 16% of the time
by chance alone, if no true (population) serial correlation was present.
Consequently, the t-test on the profit/loss figures has likely overstated the
statistical significance to some degree (maybe between 20 and 30%). If the
sample size was adjusted downward the right amount, the t-test probability
would most likely be around 0.18, instead of the 0.1392 that was calculated.
The confidence interval for the percentage of wins in the population ranged
from about 17% to about 53%.
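To see how a reduced effective sample size moves the t-test probability, the sketch below recomputes the test with the sample size shrunk by 20% and by 30%. The text gives no formula for the adjustment, so treating the 20 to 30% figure as a direct reduction in n is purely an assumption for illustration, as are the function names; the t-distribution tail is again obtained by Simpson's rule. Under that assumption the probability rises from about 0.14 toward the neighborhood of the 0.18 mentioned above.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def t_upper_tail(t, df, steps=2000):
    """One-tailed probability P(T >= t) for t >= 0, via Simpson's rule on [0, t]."""
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

# Summary figures reported for the verification sample
sample_mean = 974.4681
sample_sd = 6091.1028
n_trades = 47

results = {}
for shrink in (0.0, 0.2, 0.3):
    # Assumption: serial dependence is handled by shrinking n directly
    n_eff = n_trades * (1.0 - shrink)
    se = sample_sd / math.sqrt(n_eff)
    t_stat = sample_mean / se
    results[shrink] = t_upper_tail(t_stat, n_eff - 1)
    print(f"shrink {shrink:.0%}: n_eff = {n_eff:.1f}, "
          f"t = {t_stat:.4f}, p = {results[shrink]:.4f}")
```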
Overall, the assessment is that the system is probably going to hold up in the
future, but not with a high degree of certainty. Considering there were two
independent tests, one showing about a 31% probability (corrected for
optimization) that the profits were due to chance, the other showing a
statistical significance of approximately 14% (corrected to 18% due to the
serial correlation), there is a good chance that the average population trade
is profitable and, consequently, that the system will remain profitable in the
future.


OTHER STATISTICAL TECHNIQUES AND THEIR USE
The following section is intended only to acquaint the reader with some other sta-
tistical techniques that are available. We strongly suggest that a more thorough study
be undertaken by those serious about developing and evaluating trading systems.


Genetically Evolved Systems
We develop many systems using genetic algorithms. A popular fitness function
(criterion used to determine whether a model is producing the desired outcome) is the
total net profit of the system. However, net profit is not the best measure of system
