
Now that some of the basics are out of the way, let us look at how statistics are

used when developing and evaluating a trading system. The examples below

employ a system that was optimized on one sample of data (the in-sample data)

and then run (tested) on another sample of data (the out-of-sample data). The out-

of-sample evaluation of this system will be discussed before the in-sample one

because the statistical analysis was simpler for the former (which is equivalent to

the evaluation of an unoptimized trading system) in that no corrections for mul-

tiple tests or optimization were required. The system is a lunar model that trades

the S&P 500; it was published in an article we wrote (see Katz with McCormick,

June 1997). The TradeStation code for this system is shown below:


Example 1: Evaluating the Out-of-Sample Test

Evaluating an optimized system on a set of out-of-sample data that was never used

during the optimization process is identical to evaluating an unoptimized system.

In both cases, one test is run without adjusting any parameters. Table 4-1 illus-

trates the use of statistics to evaluate an unoptimized system: It contains the out-

of-sample or verification results together with a variety of statistics. Remember, in

this test, a fresh set of data was used; this data was not used as the basis for

adjustments in the system's parameters.

The parameters of the trading model have already been set. A sample of data

was drawn from a period in the past, in this specific case, 1/1/95 through 1/1/97;

this is the out-of-sample or verification data. The model was then run on this out-

of-sample data, and it generated simulated trades. Forty-seven trades were taken.

This set of trades can itself be considered a sample of trades, one drawn from the

population of all trades that the system took in the past or will take in the future;

i.e., it is a sample of trades taken from the universe or population of all trades for

that system. At this point, some inference must be made regarding the average

profit per trade in the population as a whole, based on the sample of trades. Could

the performance obtained in the sample be due to chance alone? To find the

answer, the system must be statistically evaluated.

To begin statistically evaluating this system, the sample mean (average) for

n (the number of trades or sample size) must first be calculated. The mean is

simply the sum of the profit/loss figures for the trades generated divided by n (in

this case, 47). The sample mean was $974.47 per trade. The standard deviation

(the variability in the trade profit/loss figures) is then computed by subtracting

the sample mean from each of the profit/loss numbers for all 47 trades in the

sample; this results in 47 (n) deviations. Each of the deviations is then squared,

and then all squared deviations are added together. The sum of the squared deviations is divided by n - 1 (in this case, 46). By taking the square root of the

resultant number (the mean squared deviation), the sample standard deviation is

obtained. Using the sample standard deviation, the expected standard deviation

of the mean is computed: The sample standard deviation (in this case, $6,091.10)

is divided by the square root of the sample size. For this example, the expected

standard deviation of the mean was $888.48.

To determine the likelihood that the observed profitability is due to chance

alone, a simple t-test is calculated. Since the sample profitability is being compared

with no profitability, zero is subtracted from the sample mean trade profit/loss (computed earlier). The resultant number is then divided by the expected standard deviation of the mean to obtain the value of the t-statistic, which in this case worked out to be 1.0968.

Finally the probability of getting such a large t-statistic by chance alone (under the

assumption that the system was not profitable in the population from which the sam-

ple was drawn) is calculated: The cumulative t-distribution for that t-statistic is computed with the appropriate degrees of freedom, which in this case was n - 1, or 46.
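Using the summary figures quoted in the text (a mean of $974.4681 and an expected standard deviation of the mean of $888.4787), the t-statistic and its one-tailed probability can be sketched as follows. The tail probability is obtained here by brute-force numerical integration of the t density; production code would use the incomplete beta function instead:

```python
import math

def t_tail_probability(t, df, steps=200_000, upper=60.0):
    """One-tailed P(T > t) for Student's t with df degrees of freedom,
    by trapezoidal integration of the density (a sketch; a real
    implementation would use the incomplete beta function)."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    h = (upper - t) / steps
    area = 0.0
    for i in range(steps + 1):
        x = t + i * h
        weight = 0.5 if i in (0, steps) else 1.0  # trapezoid end weights
        area += weight * c * (1.0 + x * x / df) ** (-(df + 1) / 2)
    return area * h

mean, sem, n = 974.4681, 888.4787, 47
t_stat = (mean - 0.0) / sem           # zero subtracted: testing against no profit
p = t_tail_probability(t_stat, n - 1)  # close to the 0.1392 reported in the text
```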

CHAPTER 4 Statistics

TABLE 4-1

Trades from the S&P 500 Data Sample on Which the Lunar Model Was
Verified

[Columns: Entry Date, Exit Date, Profit/Loss, Cumulative; individual trade rows not reproduced here. Additional rows follow in the original.]

Statistical Analyses of Mean Profit/Loss

Sample Size                    47.0000
Sample Mean                   974.4681
Sample Standard Deviation    6091.1028
Expected SD of Mean           888.4787
T Statistic (P/L > 0)           1.0968
Probability (Significance)      0.1392
Serial Correlation (lag=1)      0.2120
Associated T Statistic          1.4301
Probability (Significance)      0.1572
Number of Wins                 16.0000
Percentage of Wins              0.3404
Upper 99% Bound                 0.5319
Lower 99% Bound                 0.1702

(Microsoft's Excel spreadsheet provides a function to obtain probabilities based on the t-distribution. Numerical Recipes in C provides the incomplete beta function, which is very easily used to calculate probabilities based on a variety of distributions, including Student's t.) The cumulative t-distribution calculation yields a figure

that represents the probability that the results obtained from the trading system were

due to chance. Since this figure was small, it is unlikely that the results were due to

capitalization on random features of the sample. The smaller the number, the more

likely the system performed the way it did for reasons other than chance. In this

instance, the probability was 0.1392; i.e., if a system with a true (population) profit of $0 was repeatedly tested on independent samples, only about 14% of the time would it show a profit as high as that actually observed.

FIGURE 4-1

Frequency and Cumulative Distribution for In-Sample Trades

Although the t-test was, in this example, calculated for a sample of trade prof-

it/loss figures, it could just as easily have been computed for a sample of daily

returns. Daily returns were employed in this way to calculate the probabilities

referred to in discussions of the substantive tests that appear in later chapters. In

fact, the annualized risk-to-reward ratio (ARRR) that appears in many of the tables

and discussions is nothing more than a rescaled t-statistic based on daily returns.
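That equivalence can be sketched directly. The annualization convention below (roughly 252 trading days per year, mean daily return over its standard deviation) is an assumption for illustration; the two quantities then differ only by a fixed scale factor:

```python
import math

def annualized_risk_reward(daily_returns):
    """Mean daily return over its standard deviation, scaled to a yearly
    horizon (assumes ~252 trading days per year; an illustrative convention)."""
    n = len(daily_returns)
    m = sum(daily_returns) / n
    sd = math.sqrt(sum((r - m) ** 2 for r in daily_returns) / (n - 1))
    return (m / sd) * math.sqrt(252)

def t_statistic(daily_returns):
    """t-statistic testing whether the mean daily return exceeds zero."""
    n = len(daily_returns)
    m = sum(daily_returns) / n
    sd = math.sqrt(sum((r - m) ** 2 for r in daily_returns) / (n - 1))
    return m / (sd / math.sqrt(n))

# Hypothetical daily returns; the two measures agree up to sqrt(n / 252).
rets = [0.012, -0.004, 0.003, 0.008, -0.006, 0.010, 0.001]
arrr = annualized_risk_reward(rets)
t = t_statistic(rets)
```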

Finally, a confidence interval on the probability of winning is estimated. In

the example, there were 16 wins in a sample of 47 trades, which yielded a per-

centage of wins equal to 0.3404. Using a particular inverse of the cumulative bino-

mial distribution, upper 99% and lower 99% boundaries are calculated. There is a

99% probability that the percentage of wins in the population as a whole is

between 0.1702 and 0.5319. In Excel, the CRITBINOM function may be used in

the calculation of confidence intervals on percentages.
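A rough sketch of that calculation in Python, mimicking the behavior of Excel's CRITBINOM (the smallest count whose cumulative binomial probability reaches the given threshold); the exact bounds can differ slightly from the text's because of rounding in the win percentage:

```python
from math import comb

def critbinom(n, p, alpha):
    """Smallest k such that P(X <= k) >= alpha for X ~ Binomial(n, p);
    a sketch of what Excel's CRITBINOM returns."""
    cdf = 0.0
    for k in range(n + 1):
        cdf += comb(n, k) * p**k * (1 - p) ** (n - k)
        if cdf >= alpha:
            return k
    return n

n_trades, wins = 47, 16
p_hat = wins / n_trades                                  # 0.3404
lower = critbinom(n_trades, p_hat, 0.005) / n_trades     # lower 99% bound
upper = critbinom(n_trades, p_hat, 0.995) / n_trades     # upper 99% bound
# Compare with the 0.1702 and 0.5319 reported in the text.
```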

The various statistics and probabilities computed above should provide the

system developer with important information regarding the behavior of the trading model; that is, if the assumptions of normality and independence are met and


if the sample is representative. Most likely, however, the assumptions underlying

the t-tests and other statistics are violated; market data deviates seriously from the

normal distribution, and trades are usually not independent. In addition, the sam-

ple might not be representative. Does this mean that the statistical evaluation just

discussed is worthless? Let's consider the cases.

What if the Distribution Is Not Normal? An assumption in the t-test is that the

underlying distribution of the data is normal. However, the distribution of

profit/loss figures of a trading system is anything but normal, especially if there

are stops and profit targets, as can be seen in Figure 4-1, which shows the distribution of profits and losses for trades taken by the lunar system. Think of it for a

moment. Rarely will a profit greater than the profit target occur. In fact, a lot

of trades are going to bunch up with a profit equal to that of the profit target. Other

trades are going to bunch up where the stop loss is set, with losses equal to that;

and there will be trades that will fall somewhere in between, depending on the exit

method. The shape of the distribution will not be that of the bell curve that describes

the normal distribution. This is a violation of one of the assumptions underlying the

t-test. In this case, however, the Central Limit Theorem comes to the rescue. It states

that as the number of cases in the sample increases, the distribution of the sample

mean approaches normal. By the time there is a sample size of 10, the errors result-

ing from the violation of the normality assumption will be small, and with sample

sizes greater than 20 or 30, they will have little practical significance for inferences

regarding the mean. Consequently, many statistics can be applied with reasonable

assurance that the results will be meaningful, as long as the sample size is adequate,

as was the case in the example above, which had an n of 47.
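The Central Limit Theorem's rescue act is easy to demonstrate by simulation. The sketch below draws trades from a deliberately non-normal, two-spike distribution (losses bunched at a hypothetical $2,500 stop, wins at a hypothetical $5,000 target) and shows that means of 47-trade samples nevertheless cluster tightly around the true expectation:

```python
import random
import statistics

random.seed(42)

# Hypothetical two-spike trade distribution: nothing like a bell curve.
def one_trade():
    return -2500.0 if random.random() < 0.6 else 5000.0
# True expectation: 0.6 * (-2500) + 0.4 * 5000 = +500 per trade.

# Distribution of the MEAN of 47-trade samples.
sample_means = [statistics.fmean(one_trade() for _ in range(47))
                for _ in range(2000)]
center = statistics.fmean(sample_means)  # clusters near +500
spread = statistics.stdev(sample_means)  # far smaller than single-trade spread
```

Even though individual trades take only two values, the sample means pile up in a roughly bell-shaped heap around the population mean, which is what licenses the t-test at this sample size.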

What if There Is Serial Dependence? A more serious violation, which makes

the above-described application of the t-test not quite cricket, is serial depen-

dence, which is when cases constituting a sample (e.g., trades) are not statistical-

ly independent of one another. Trades come from a time series. When a series of

trades that occurred over a given span of dates is used as a sample, it is not quite

a random sample. A truly random sample would mean that the 100 trades were

randomly taken from the period when the contract for the market started (e.g.,

1983 for the S&P 500) to far into the future; such a sample would not only be less

likely to suffer from serial dependence, but be more representative of the popula-

tion from which it was drawn. However, when developing trading systems, sam-

pling is usually done from one narrow point in time; consequently, each trade may

be correlated with those adjacent to it and so would not be independent.

The practical effect of this statistically is to reduce the effective sample size.

When trying to make inferences, if there is substantial serial dependence, it may

be as if the sample contained only half or even one-fourth of the actual number of

trades or data points observed. To top it off, the extent of serial dependence cannot definitively be determined. A rough "guestimate," however, can be made. One

such guestimate may be obtained by computing a simple lag/lead serial correla-

tion: A correlation is computed between the profit and loss for Trade i and the

profit and loss for Trade i + 1, with i ranging from 1 to n - 1. In the example, the

serial correlation was 0.2120, not very high, but a lower number would be prefer-

able. An associated t-statistic may then be calculated along with a statistical significance for the correlation. In the current case, these statistics reveal that if there

really were no serial correlation in the population, a correlation as large as the one

obtained from the sample would only occur in about 16% of such tests.
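The lag-1 serial-correlation calculation can be sketched as follows. The profit/loss series is hypothetical, and the t-statistic formula shown is one common form for testing a correlation; the book's exact formula is not given:

```python
import math

def lag1_serial_correlation(pnl):
    """Pearson correlation between the P/L of Trade i and Trade i+1,
    for i = 1 .. n-1."""
    a, b = pnl[:-1], pnl[1:]
    m = len(a)
    mean_a, mean_b = sum(a) / m, sum(b) / m
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def correlation_t(r, m):
    """t-statistic for a correlation r computed from m pairs
    (a common form; an assumption here, not the book's exact formula)."""
    return r * math.sqrt((m - 2) / (1.0 - r * r))

# Hypothetical trade P/L series for illustration.
pnl = [650.0, -2500.0, 2350.0, -2500.0, 4400.0, 2575.0, 25.0, 10475.0, -2500.0]
r = lag1_serial_correlation(pnl)
t_r = correlation_t(r, len(pnl) - 1)
```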

Serial dependence is a serious problem. If there is a substantial amount of it,

it would need to be compensated for by treating the sample as if it were smaller

than it actually is. Another way to deal with the effect of serial dependence is to

draw a random sample of trades from a larger sample of trades computed over a

longer period of time. This would also tend to make the sample of trades more representative of the population.

What if the Markets Change? When developing trading systems, a third assumption of the t-test may be inadvertently violated. There are no precautions that can

be taken to prevent it from happening or to compensate for its occurrence. The rea-

son is that the population from which the development or verification sample was

drawn may be different from the population from which future trades may be taken.

This would happen if the market underwent some real structural or other change.

As mentioned before, the population of trades of a system operating on the S&P

500 before 1983 would be different from the population after that year since, in

1983, the options and futures started trading on the S&P 500 and the market

changed. This sort of thing can devastate any method of evaluating a trading sys-

tem. No matter how much a system is back-tested, if the market changes before

trading begins, the trades will not be taken from the same market for which the sys-

tem was developed and tested; the system will fall apart. All systems, even cur-

rently profitable ones, will eventually succumb to market change. Regardless of the

market, change is inevitable. It is just a question of when it will happen. Despite

this grim fact, the use of statistics to evaluate systems remains essential, because if

the market does not change substantially shortly after trading of the system commences, or if the change is not sufficient to grossly affect the system's performance, then a reasonable estimate of expected probabilities and returns can be calculated.

Example 2: Evaluating the In-Sample Tests

How can a system that has been fit to a data sample by the repeated adjustment of

parameters (i.e., an optimized system) be evaluated? Traders frequently optimize

systems to obtain good results. In this instance, the use of statistics is more impor-

tant than ever since the results can be analyzed, compensating for the multiplicity

of tests being performed as part of the process of optimization. Table 4-2 contains

the profit/loss figures and a variety of statistics for the in-sample trades (those

taken on the data sample used to optimize the system). The system was optimized

on data from 1/1/90 through 1/2/95.

Most of the statistics in Table 4-2 are identical to those in Table 4-1, which

was associated with Example 1. Two additional statistics (that differ from those in

the first example) are labeled "Optimization Tests Run" and "Adjusted for Optimization." The first statistic is simply the number of different parameter combinations tried, i.e., the total number of times the system was run on the data, each

time using a different set of parameters. Since the lunar system parameter, L1, was

stepped from 1 to 20 in increments of 1, 20 tests were performed; consequently,

there were 20 t-statistics, one for each test. The number of tests run is used to make

an adjustment to the probability or significance obtained from the best t-statistic

TABLE 4-2

Trades from the S&P 500 Data Sample on Which the Lunar Model
Was Optimized

[Columns: Entry Date, Exit Date, Profit/Loss, Cumulative; individual trade rows not reproduced here.]

computed on the sample: Take 1, and subtract from it the statistical significance

obtained for the best-performing test. Take the resultant number and raise it to the

mth power (where m = the number of tests run). Then subtract that number from

1. This provides the probability of finding, in a sample of m tests (in this case, 20),

at least one t-statistic as good as the one actually obtained for the optimized solu-

tion. The uncorrected probability that the profits observed for the best solution were

due to chance was less than 2%, a fairly significant result. Once adjusted for multiple tests, i.e., optimization, the statistical significance does not appear anywhere

near as good. Results at the level of those observed could have been obtained for

such an optimized system 31% of the time by chance alone. However, things are

not quite as bad as they seem. The adjustment was extremely conservative and

assumed that every test was completely independent of every other test. In actual

fact, there will be a high serial correlation between most tests since, in many trad-

ing systems, small changes in the parameters produce relatively small changes in

the results. This is exactly like serial dependence in data samples: It reduces the

effective population size, in this case, the effective number of tests run. Because

many of the tests are correlated, the 20 actual tests probably correspond to about 5

to 10 independent tests. If the serial dependence among tests is considered, the

adjusted-for-optimization probability would most likely be around 0.15, instead of

the 0.3104 actually calculated. The nature and extent of serial dependence in the

multiple tests are never known, and therefore, a less conservative adjustment for

optimization cannot be directly calculated, only roughly reckoned.
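The multiple-test adjustment described above amounts to one line of arithmetic. The sketch below applies it to the chapter's own numbers (best-test significance of 0.0184, 20 tests run):

```python
def adjust_for_optimization(p_best, m):
    """Probability that the best of m independent tests would look at
    least this good by chance alone: 1 - (1 - p)^m."""
    return 1.0 - (1.0 - p_best) ** m

p_adj = adjust_for_optimization(0.0184, 20)  # roughly 0.31, as in the text
```

Because adjacent parameter settings give correlated results, m = 20 overstates the number of truly independent tests, which is why the text treats this correction as conservative.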

Under certain circumstances, such as in multiple regression models, there are

exact mathematical formulas for calculating statistics that incorporate the fact that

parameters are being fit, i.e., that optimization is occurring, making corrections for

optimization unnecessary.

Interpreting the Example Statistics

In Example 1, the verification test was presented. The in-sample optimization run

was presented in Example 2. In the discussion of results, we are returning to the nat-

ural order in which the tests were run, i.e., optimization first, verification second.

Optimization Results. Table 4-2 shows the results for the in-sample period. Over

the 5 years of data on which the system was optimized, there were 118 trades (n

= 118). The mean or average trade yielded about $740.97, and the trades were

highly variable, with a sample standard deviation of around ±$3,811; i.e., there

were many trades that lost several thousand dollars, as well as trades that made

many thousands. The degree of profitability can easily be seen by looking at the

profit/loss column, which contains many $2,500 losses (the stop got hit) and a sig-

nificant number of wins, many greater than $5,000, some even greater than

$10,000. The expected standard deviation of the mean suggests that if samples of

this kind were repeatedly taken, the mean would vary only about one-tenth as

much as the individual trades, and that many of the samples would have mean

profitabilities in the range of $740 ± $350.

The t-statistic for the best-performing system from the set of optimization

runs was 2.1118, which has a statistical significance of 0.0184. This was a fairly

strong result. If only one test had been run (no optimizing), this good a result would

have been obtained (by chance alone) only twice in 100 tests, indicating that the

system is probably capturing some real market inefficiency and has some chance of

holding up. However, be warned: This analysis was for the best of 20 sets of para-

meter values tested. If corrected for the fact that 20 combinations of parameter val-

ues were tested, the adjusted statistical significance would only be about 0.31, not

very good; the performance of the system could easily have been due to chance.

Therefore, although the system may hold up, it could also, rather easily, fail.

The serial correlation between trades was only 0.0479, a value small enough

in the present context, with a significance of only 0.6083. These results strongly

suggest that there was no meaningful serial correlation between trades and that the

statistical analyses discussed above are likely to be correct.

There were 58 winning trades in the sample, which represents about a 49%

win rate. The upper 99% confidence boundary was approximately 61% and the

lower 99% confidence boundary was approximately 37%, suggesting that the true

percentage of wins in the population has a 99% likelihood of being found between

those two values. In truth, the confidence region should have been broadened by

correcting for optimization; this was not done because we were not very con-

cerned about the percentage of wins.

Verification Results. Table 4-1, presented earlier, contains the data and statistics for the out-of-sample test for the model. Since all parameters were already fixed, and only one test was conducted, there was no need to consider optimization or its consequences in any manner. In the period from 1/1/95 to 1/1/97, there were 47

trades. The average trade in this sample yielded about $974, which is a greater

average profit per trade than in the optimization sample! The system apparently

did maintain profitable behavior.

At slightly over $6,000, the sample standard deviation was almost double

that of the standard deviation in the optimization sample. Consequently, the standard deviation of the sample mean was around $890, a fairly large standard error

of estimate; together with the small sample size, this yielded a lower t-statistic

than found in the optimization sample and, therefore, a lowered statistical signifi-

cance of only about 14%. These results were neither very good nor very bad:

There is better than an 80% chance that the system is capitalizing on some real

(non-chance) market inefficiency. The serial correlation in the test sample, however,

was quite a bit higher than in the optimization sample and was significant, with a

probability of 0.1572; i.e., as large a serial correlation as this would only be

expected about 16% of the time by chance alone, if no true (population) serial cor-

relation was present. Consequently, the t-test on the profit/loss figures has likely


overstated the statistical significance to some degree (maybe between 20 and

30%). If the sample size was adjusted downward the right amount, the t-test prob-

ability would most likely be around 0.18, instead of the 0.1392 that was calculat-

ed. The confidence interval for the percentage of wins in the population ranged

from about 17% to about 53%.

Overall, the assessment is that the system is probably going to hold up in the

future, but not with a high degree of certainty. Considering that there were two independent tests, one showing about a 31% probability (corrected for optimization) that the profits were due to chance and the other a statistical significance of approximately 14% (corrected to 18% due to the serial correlation), there is a good

chance that the average population trade is profitable and, consequently, that the

system will remain profitable in the future.

OTHER STATISTICAL TECHNIQUES AND THEIR USE

The following section is intended only to acquaint the reader with some other sta-

tistical techniques that are available. We strongly suggest that a more thorough study

be undertaken by those serious about developing and evaluating trading systems.

Genetically Evolved Systems

We develop many systems using genetic algorithms. A popular fitness function (criterion used to determine whether a model is producing the desired outcome) is the

total net profit of the system. However, net profit is not the best measure of system
