TSSTATEST(1)                                                      TSSTATEST(1)

NAME
       tsstatest - make a statistical estimation of a time series

SYNOPSIS
       tsstatest [-c n] [-d] [-e m] [-f o] [-i] [-p] [-v] [filename]

DESCRIPTION
       Tsstatest makes a statistical estimation of a time series. Given the
       maximum error estimate and the required confidence level, the number
       of samples required is computed for both the standard deviation and
       the mean.

       The input file is a text file consisting of records, in temporal
       order, one record per time series sample. Blank records are ignored,
       and comment records are signified by a '#' character as the first
       non-whitespace character in the record. Data records must contain at
       least one field, which is the data value of the sample, but may
       contain many fields; if the record contains many fields, then the
       first field is regarded as the sample's time, and the last field as
       the sample's value at that time.

       Consider the following formula for the determination of the Shannon
       probability, P, of an equity market time series, using the average
       and root mean square of the normalized increments, avg and rms,
       respectively:

               avg
               --- + 1
               rms
           P = -------
                  2

       which is useful in the determination of the optimal fraction of
       capital, f, to invest in a stock, by:

           f = 2P - 1

       The objective is to estimate how large the data set has to be to
       determine P to a given accuracy, possibly using statistical
       estimates of how many data points are required for a given
       confidence level that the error is less than a specific value.

       Suppose we have a confidence level, 0 < c < 1, that a value is
       within, plus or minus, an error level, e. What this means, for
       example with c = 0.9 and e = 0.1, is that in 90% of the cases the
       value will be within the limits of +/- e; 5% of the time, on
       average, it will be less than -e, and 5% of the time more than +e.
       The error level for avg, at a given confidence level, will be:

           e    = k (rms / sqrt (n))
            avg

       where n is the number of records in the data set, and k is a
       function of the confidence level for a normal distribution. The
       error level for rms, at the same confidence level, will be:

           e    = k (rms / sqrt (2n))
            rms

       where k is identical in both cases. Also, the number of records
       required for a given error level would be:

                                  2
           n    = ((rms * k) / e  )
            avg                 avg

       and:

                  1               2
           n    = - ((rms * k) / e  )
            rms   2               rms

       where k is the same as above.

       For equity market indices, a typical value for rms would be 0.01,
       and 0.0003 for avg. This is probably typical of many stocks;
       however, high gain stocks in a "bull" market can have an rms of
       0.04, and an avg of 0.005.

       The value of k can be determined from standard statistical tables:

               c        sigma level
           -------------------------
              50            0.67
              68.27         1.00
              80            1.28
              90            1.64
              95            1.96
              95.45         2.00
              99            2.58
              99.73         3.00

       where k = sigma level, for a confidence level, c, expressed in
       percent.

       Note that for a given confidence level:

           avg   avg +/- k (rms / sqrt (n))
           --- = ---------------------------
           rms   rms +/- k (rms / sqrt (2n))

                 avg          1
                 --- +/- k --------
                 rms       sqrt (n)
               = ---------------------
                           1
                 1 +/- k ------------
                         4 * sqrt (n)

       Now, consider the specific example of avg and rms for an exponential
       function. In this specific case, avg = rms, and avg / rms = 1. Since
       k is assumed to be a function of a normally distributed random
       variable, the error in the ratio avg / rms as a function of the data
       set size, n, can be found by superposition, adding the contributing
       error values, as a function of n, for both rms and avg:

                   2          2
           sqrt (1  + (1 / 4) ) = 1.030776406

       or:

           avg   avg            1
           --- ~ --- +/- 1.03 * -------- * k
           rms   rms            sqrt (n)

                 avg        1
               ~ --- +/- -------- * k
                 rms     sqrt (n)

       where k is determined from the table, above.
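       The sample-size formulas above can be sketched in a few lines of
       Python (a minimal illustration, not the program's implementation;
       the function names are ours, and the sigma level k is obtained from
       the standard library's normal distribution rather than the table):

           ```python
           from statistics import NormalDist

           def k_for_confidence(c):
               # Two-sided confidence level c -> sigma level k,
               # e.g. c = 0.90 -> k ~ 1.64, matching the table above.
               return NormalDist().inv_cdf(0.5 + c / 2.0)

           def n_for_avg(rms, e_avg, c):
               # Records required so the error in avg is below e_avg.
               k = k_for_confidence(c)
               return ((rms * k) / e_avg) ** 2

           def n_for_rms(rms, e_rms, c):
               # Records required so the error in rms is below e_rms.
               k = k_for_confidence(c)
               return 0.5 * ((rms * k) / e_rms) ** 2

           # Typical index values from the text: rms = 0.01, avg = 0.0003.
           print(k_for_confidence(0.90))
           print(n_for_avg(0.01, 0.0003, 0.90))
           ```

       Note that pinning down avg to within its own typical magnitude
       requires a data set several orders of magnitude larger than pinning
       down rms to the same absolute error, since rms is much larger than
       avg.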
       In this specific case, where avg = rms:

           avg   avg              1
           --- ~ --- * (1 +/- -------- * k)
           rms   rms          sqrt (n)

       An interpretation of what this means is that, given a data set size,
       n, and a confidence level of, say, 90%, then 90% of the time our
       measurements of avg / rms would fall within an error level of
       +/- 1.64 * 1 / sqrt (n), ie., 5% of the time it would be greater
       than the error value, and 5% of the time it would be lower than the
       error value.

       In general, the concern is with the lower error value, since, from
       the equation:

               avg
               --- + 1
               rms
           P = -------
                  2

       (at least in this specific case where avg = rms,) a 90% confidence
       level would imply that there is a 5% chance of the real value of
       avg / rms being zero where:

               k
           -------- = 1
           sqrt (n)

       or:

             1.64
           -------- = 1
           sqrt (n)

       or n = 2.6896 ~ 3. What this means is that, if we repeat, many
       times, the experiment of finding 3 records in a row that have
       rms = avg, with neither equal to zero, we would lose money in 5% of
       the cases, making the measured Shannon probability, P, unity, and
       the estimated Shannon probability 0.95, eg., we should consider the
       Shannon probability to be 0.95 in this specific case. That is, it
       would be ill-advised to invest all of the capital in such a
       scenario, since, sooner or later, all of the capital would be lost,
       (on average, by the 20th game.)

       This implies a simple methodology. Measure avg and rms, and compute
       the Shannon probability. Decrease that probability by a factor, ie.,
       one minus the confidence level, divided by two, that the wager could
       be a losing proposition, based on the estimate that avg could be
       zero, (which is a function of the confidence level, and the number
       of records in the data set.) This, conceivably, could provide a
       quantitative estimate of the number of records required in a data
       set.
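       The methodology above can be sketched as follows (an illustrative
       sketch with names of our choosing, not the program's code; the
       confidence that avg is nonzero is derived from how many standard
       errors, k = (avg / rms) * sqrt (n), the measured ratio sits above
       zero):

           ```python
           import math
           from statistics import NormalDist

           def shannon_p(avg, rms):
               # Shannon probability from the normalized increments.
               return (avg / rms + 1.0) / 2.0

           def confidence_avg_nonzero(avg, rms, n):
               # Two-sided confidence that the true avg is above zero,
               # given n records.
               k = (avg / rms) * math.sqrt(n)
               return 2.0 * NormalDist().cdf(k) - 1.0

           def estimated_p(avg, rms, n):
               # Decrease the measured P by (1 - c) / 2, the chance that
               # the wager is a losing proposition.
               c = confidence_avg_nonzero(avg, rms, n)
               return shannon_p(avg, rms) * (1.0 - (1.0 - c) / 2.0)

           # The example from the text: avg = rms over n ~ 2.69 records
           # gives a measured P of unity, and an estimated P near 0.95.
           print(estimated_p(1.0, 1.0, 2.6896))
           ```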
       Note that if avg / rms is measured at 0.9, then:

             1.64
           -------- = 0.9
           sqrt (n)

       for the same confidence level of 0.9, or n = 3.32, and:

           avg
           ---        n        p    p measured
           rms
           ------------------------------------
           1.0       2.7     1.00      0.95
           0.9       3.3     0.95      0.90
           0.8       4.2     0.90      0.86
           0.7       5.5     0.85      0.81
           0.6       7.5     0.80      0.76
           0.5      10.8     0.75      0.71
           0.4      16.8     0.70      0.67
           0.3      29.9     0.65      0.62
           0.2      67.2     0.60      0.57
           0.1     268.9     0.55      0.52
           0.05   1075.8     0.53      0.50

       for the same confidence level of 0.9. What the table means is that
       if you have a stock price time series of 67 records, then the
       minimum measured Shannon probability must be at least 0.6, the
       wagering strategy should use the Shannon probability of 0.57, and
       the minimum number of records used to measure avg and rms is 67.
       Additionally, a stock time series with a Shannon probability of 0.53
       should be measured using not less than 1076 records, and no wager
       should be made unless the measurements involve substantially more
       than 1076 records. The Shannon probabilities of almost all stock
       time series fall, inclusively, in this range. 67 business days is,
       approximately, 13.4 weeks, or a little more than a calendar quarter.
       1076 business days is slightly longer than four calendar years.

       Note that Edgar E. Peters, in "Chaos and Order in the Capital
       Markets," John Wiley & Sons, New York, New York, 1991, pp. 83,
       referencing "Fractals," Jens Feder, Plenum Press, New York, New
       York, 1988, pp. 179, makes the claim that 2500 records is the
       minimum size of the data set for using fractal analytical
       methodologies. Note that a data set of this size would have, with an
       avg / rms of 0.5, which is "typical" for a stock time series, a
       Shannon probability error level that is approximately 1%, since it
       lies between 2 and 3 sigma, and c would be approximately 0.99.
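       The table above can be regenerated directly from the formulas
       already given (a sketch; k = 1.64 is the sigma level for a
       confidence level of 0.9, and the last column is p decreased by
       (1 - 0.9) / 2):

           ```python
           k = 1.64  # sigma level for c = 0.90

           for r in (1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05):
               n = (k / r) ** 2        # records required: k / sqrt(n) = avg/rms
               p = (r + 1.0) / 2.0     # Shannon probability for this ratio
               p_wager = p * 0.95      # decreased by (1 - c) / 2 for wagering
               print(f"{r:5.2f} {n:8.1f} {p:6.2f} {p_wager:6.2f}")
           ```

       For example, the avg / rms = 0.2 row gives n = (1.64 / 0.2)^2 =
       67.24, p = 0.60, and a wagering probability of 0.57, matching the
       table.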
       This would seem to be consistent with the empirical arguments of
       both Feder and Peters, although Peters implies that less could be
       used if the system being analyzed is "chaotic" in nature, and one
       "cycle" of the system's, apparently, "strange attractor" is less
       than 2500 time units.

       This analysis would seem to be consistent with the observations of
       these authors, provided that it is a requirement that the measured
       Shannon probability be used to calculate the optimum wager fraction.
       What this analysis would tend to suggest is that, although Feder's
       and Peters's arguments seem to be confirmed, there may also be other
       viable solutions for data sets, (or fragments thereof,) that are
       very much smaller, provided that the measured Shannon probability of
       the data set, or segment, is sufficiently large. For example, a
       stock that has a time series fragment with 5 out of 6 upward
       movements may prove to be a viable investment opportunity at a
       measured Shannon probability that is greater than 0.85, (5 / 6 = a
       Shannon probability of 0.833 ~ 0.85,) if played at a Shannon
       probability as high as 0.8, but no higher.

       For example, using a Shannon probability, P, of 0.51 for the
       tscoins(1) and tsfraction(1) programs, to provide an input fractal
       time series for the tsstatest(1) program, and iterating, indicates
       that for a standard deviation of 0.020000, with a confidence level
       of 0.960784 that the error did not exceed 0.020000, 3 samples would
       be required.
       Since the Shannon probability is calculated directly from the
       standard deviation, (ie., rms = root mean square of the normalized
       increments,) the maximum error can be calculated:

           0.5
           ---- = 0.980392157
           0.51

       which means there is a confidence level of 0.960784314 that the
       error level in the standard deviation is less than 0.02, because
       standard deviation = rms = 0.02, and 0.02 - 0.02 = 0, which would
       correspond to a Shannon probability, P, of 0.5; since half the
       errors outside the range of 0.02 would be negative, (and the other
       half positive,) the confidence level required would be 1 - ((1 -
       0.980392157) * 2). What this means is that ((1 - 0.960784314) / 2) *
       100 percent of the time, the actual rms value will be sufficiently
       small to make P equal to, or less than, 0.5. This means that P must
       be decreased by 1.9607843 percent. The reasoning is that, after many
       iterations, the measured P would be too small 1.9607843% of the
       time, on average, making the measured P, over all of the iterations,
       0.5.

       This suggests a dynamic rule: do not wager unless the Shannon
       probability, P, is strictly greater than 0.51, as measured on
       strictly more than 3 time units. Interestingly, the graph of the
       Hurst coefficient, as measured by the tshurst(1) program, of a
       random walk, Brownian motion, or fractional Brownian motion fractal
       indicates that there are significant near term correlations for 4 or
       fewer time units. This suggests a dynamic trading methodology for
       equities.

       Similar reasoning would indicate that using a value of P = 0.6 for
       the tscoins(1) and tsfraction(1) programs to provide input to the
       tsstatest(1) program, with a confidence level of 0.8 and an error of
       0.12, (ie., 10% of the time the value of P would be less than 0.9 *
       0.6 = 0.54, where 0.2 - 0.12 = 0.08, and 0.54 = (0.08 + 1) / 2,)
       would require a minimum of 3 records. The fraction of capital
       wagered should be 2 * 0.54 - 1 = 0.08.
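       The arithmetic of the worked example above can be verified as a
       sketch (variable names are ours): P = 0.51 gives rms = 2P - 1 =
       0.02, and an rms error of 0.02 would drop P back to 0.5.

           ```python
           P = 0.51
           P_prime = 0.50

           C = P_prime / P                     # 0.980392..., one-sided
           c = 1.0 - (1.0 - C) * 2.0           # 0.960784..., two-sided
           decrease = ((1.0 - c) / 2.0) * 100  # percent P must be decreased

           print(C, c, decrease)
           ```

       which reproduces the confidence level of 0.960784 and the 1.9607843
       percent decrease quoted above.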
       To review what has been presented so far: we really are not
       confident that we know the value of the Shannon probability, P,
       until we have sufficiently many records, n. One way of addressing
       this issue is to wait to make a wager until we do. But this strategy
       has an "opportunity cost," since, approximately 50% of the time, we
       would not have made an investment when we should have.

       Note that since investing in equities is not a 100% assured
       proposition, we only invest a fraction of our capital, f, where
       f = 2P - 1. Since investing with a data set size that is
       insufficient, ie., n is too small, lowers the probability of the
       wins, the Shannon probability, P, will have to be lowered to
       maintain the optimum wager fraction of the capital. We can compute
       the value to which the Shannon probability, P, must be lowered to
       account for this.

       The relationship between the Shannon probability, P, and the root
       mean square of the normalized increments of a time series, rms, is:

               rms + 1
           P = -------
                  2

       Let the error, e, in rms created by an insufficient data set size
       be:

           e = rms - rms'

       where 0 < rms' < rms. This means that although rms was measured, it
       could be as low as rms'. The confidence level that rms is not less
       than rms' can be found by statistical estimate. The Shannon
       probability, P', associated with rms' is:

                rms' + 1
           P' = --------
                   2

       P' is the Shannon probability if the root mean square value of the
       normalized increments of the time series is rms'.
       Since we want to alter the measured Shannon probability, P, to
       accommodate the error created by an insufficient data set size, we
       multiply P by the confidence level that the real value of P is not
       less than P', or the confidence level, C, is:

               P'
           C = --
               P

       The reasoning is that a value of C of, say, 0.9, means that the root
       mean square value of the increments could be below the measured
       value, rms, by an amount e for 5% of the time, and above rms by an
       amount e for 5% of the time, so that:

           P' = CP

       Substituting:

                rms' + 1
           CP = --------
                    2

       and solving for rms':

           rms' = 2CP - 1

       or:

           e = rms - (2CP - 1)

             = rms - 2CP + 1

       and substituting for rms, where rms = 2P - 1:

           e = 2P - 1 - 2CP + 1

             = 2P - 2CP

             = 2P(1 - C)

       and substituting P' = CP:

           e = 2P - 2P'

             = 2(P - P')

       C now has to be adjusted, because we are only concerned with the
       values of rms' that are less than rms, where:

           c = 1 - 2(1 - C)

             = 1 - 2 + 2C

             = 2C - 1

       but since C = P' / P:

               2P'
           c = --- - 1
                P

       So we have:

           e = 2(P - P')

       and:

               2P'
           c = --- - 1
                P

       which are the two general equations for use of this program for
       trading equities.

       Making a plot of these equations, of P' vs. n for various P,
       presents an interesting conjecture. The graph can be crudely
       approximated by a single pole filter, with a pole at 0.033, ie.,
       using the program tscoins(1) with a -p 0.6 argument to simulate an
       equity value time series, and the program tsinstant(1), with the -s
       option, to calculate the instantaneous Shannon probability of the
       time series, followed by the program tspole(1) with a -p 0.033
       argument, would output, approximately, P'. The approximated P' tends
       to under wager for t < 7, and over wager for t > 7. The
       approximation is simple, but crude. Interestingly, using the program
       tshurst(1) on the same time series indicates that there is good
       correlation for t < 5, and if this temporal range is of interest,
       this simple solution may prove adequate for non-rigorous
       requirements.
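       The two general equations above can be sketched directly (names are
       illustrative): given the measured Shannon probability P and the
       lowered probability P', compute the error e in rms, and the
       confidence level c required of the statistical estimate.

           ```python
           def error_and_confidence(P, P_prime):
               # e = 2(P - P') is the error in rms; c = 2P'/P - 1 is the
               # confidence level adjusted for the one-sided concern.
               e = 2.0 * (P - P_prime)
               c = 2.0 * P_prime / P - 1.0
               return e, c

           # The earlier example: P = 0.51 and P' = 0.5 give e = 0.02 and
           # c = 0.960784..., consistent with the tscoins(1) example.
           e, c = error_and_confidence(0.51, 0.50)
           print(e, c)
           ```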
       Additionally, perhaps using the tsmath(1) program, the output of the
       tspole(1) program could have 0.5 subtracted, be multiplied by, say,
       0.85, and then have the 0.5 re-added, to extend the usefulness to
       approximately 100 business days. The accuracy over this range is
       approximately +/- 0.01 out of 0.55. Naturally, after very many days,
       for example if P = 0.6, P' would still be 0.585, creating a long
       term error in rms of 0.2 - 0.17 = 0.03. Note that the error created
       in the exponential growth of the capital would be 0.04 - 0.0289, a
       substantial long term error.

       Alternately, perhaps a recursive feed-forward technique could be
       implemented that would allow the pole frequency to be selected for
       far term compatibility with the statistical estimate, while at the
       same time approximating the near term. Naturally, this, also, should
       not be considered a substitute for statistical estimates, but using
       statistical estimates would probably require a recursive procedure,
       and that is a formidable proposition.

       This program requires finding the value of the normal function,
       given the standard deviation. The method used is Romberg/trapezoid
       integration to numerically solve for the value. This program also
       requires finding the functional inverse of the normal, ie.,
       Gaussian, function. The method used is Romberg/trapezoid integration
       to numerically solve the equation:

                           x
                           |       1
           F(x) = integral |  ----------- * exp (-t^2 / 2) dt + 0.5
                           |  sqrt (2 pi)
                           0

       which has the derivative:

                       1
           f(x) = ----------- * exp (-x^2 / 2)
                  sqrt (2 pi)

       Since F(x) is known, and it is desired to find x:

                                  x
                                  |       1
           P(x) = F(x) - integral |  ----------- * exp (-t^2 / 2) dt - 0.5
                                  |  sqrt (2 pi)
                                  0

                = 0

       and the Newton-Raphson method of finding roots would be:

                        P(x )
                           n
           x    = x  - ------
            n+1    n    f(x )
                           n

       As a reference on the Newton-Raphson method of root finding, see
       "Numerical Recipes in C: The Art of Scientific Computing," William
       H. Press, Brian P. Flannery, Saul A. Teukolsky, William T.
       Vetterling, Cambridge University Press, New York, 1988, ISBN
       0-521-35465-X, pp. 270.
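       A minimal sketch of the numerical approach just described (a plain
       trapezoid rule standing in for the program's Romberg/trapezoid
       integration, with function names of our choosing):

           ```python
           import math

           def f(t):
               # Gaussian density, the derivative of F.
               return math.exp(-t * t / 2.0) / math.sqrt(2.0 * math.pi)

           def F(x, steps=1000):
               # F(x) = integral from 0 to x of f(t) dt + 0.5, by the
               # trapezoid rule.
               h = x / steps
               s = 0.5 * (f(0.0) + f(x))
               for i in range(1, steps):
                   s += f(i * h)
               return s * h + 0.5

           def inverse_F(target, iterations=20):
               # Newton-Raphson: solve P(x) = F(x) - target = 0, using
               # f(x) as the derivative.
               x = 0.0
               for _ in range(iterations):
                   x -= (F(x) - target) / f(x)
               return x

           # F(1.64) is about 0.95, so inverse_F(0.95) recovers the sigma
           # level for a 90% confidence level in the table above.
           print(inverse_F(0.95))
           ```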
       As a reference on Romberg integration, see "Numerical Recipes in C:
       The Art of Scientific Computing," William H. Press, Brian P.
       Flannery, Saul A. Teukolsky, William T. Vetterling, Cambridge
       University Press, New York, 1988, ISBN 0-521-35465-X, page 124.

       As a reference on trapezoid iteration, see "Numerical Recipes in C:
       The Art of Scientific Computing," William H. Press, Brian P.
       Flannery, Saul A. Teukolsky, William T. Vetterling, Cambridge
       University Press, New York, 1988, ISBN 0-521-35465-X, page 120.

       As a reference on polynomial interpolation, see "Numerical Recipes
       in C: The Art of Scientific Computing," William H. Press, Brian P.
       Flannery, Saul A. Teukolsky, William T. Vetterling, Cambridge
       University Press, New York, 1988, ISBN 0-521-35465-X, page 90.

OPTIONS
       -c n   Confidence level, 0.0 < n < 1.0.

       -d     Print the number of samples required as a float.

       -e m   Maximum absolute error estimate, 0.0 < m.

       -f o   Maximum fraction error estimate in standard deviation and
              mean.

       -i     Input is the integration of a Gaussian variable.

       -p     Only print the number of samples required for the mean and
              standard deviation.

       -v     Print the version and copyright banner of the program.

       filename
              Input filename.

WARNINGS
       There is little or no provision for handling numerical exceptions.
SEE ALSO
       tsderivative(1), tshcalc(1), tshurst(1), tsintegrate(1),
       tslogreturns(1), tslsq(1), tsnormal(1), tsshannon(1), tsblack(1),
       tsbrownian(1), tsdlogistic(1), tsfBm(1), tsfractional(1),
       tsgaussian(1), tsintegers(1), tslogistic(1), tspink(1),
       tsunfairfractional(1), tswhite(1), tscoin(1), tsunfairbrownian(1),
       tsfraction(1), tsshannonmax(1), tschangewager(1), tssample(1),
       tsrms(1), tscoins(1), tsavg(1), tsXsquared(1), tsstockwager(1),
       tsshannonwindow(1), tsmath(1), tsavgwindow(1), tspole(1), tsdft(1),
       tsbinomial(1), tsdeterministic(1), tsnumber(1), tsrmswindow(1),
       tsshannonstock(1), tsmarket(1), tsstock(1), tsstatest(1),
       tsunfraction(1), tsshannonaggregate(1), tsinstant(1),
       tsshannonvolume(1), tsstocks(1), tsshannonfundamental(1),
       tstrade(1), tstradesim(1), tsrunlength(1), tsunshannon(1),
       tsrootmean(1), tsrunmagnitude(1), tskurtosis(1),
       tskurtosiswindow(1), tsrootmeanscale(1), tsscalederivative(1),
       tsgain(1), tsgainwindow(1), tscauchy(1), tslognormal(1),
       tskalman(1), tsroot(1), tslaplacian(1)

DIAGNOSTICS
       Error messages for incompatible arguments, failure to allocate
       memory, inaccessible files, and opening and closing files.

AUTHORS
       ----------------------------------------------------------------
       A license is hereby granted to reproduce this software source code
       and to create executable versions from this source code for
       personal, non-commercial use. The copyright notice included with
       the software must be maintained in all copies produced.

       THIS PROGRAM IS PROVIDED "AS IS". THE AUTHOR PROVIDES NO WARRANTIES
       WHATSOEVER, EXPRESSED OR IMPLIED, INCLUDING WARRANTIES OF
       MERCHANTABILITY, TITLE, OR FITNESS FOR ANY PARTICULAR PURPOSE. THE
       AUTHOR DOES NOT WARRANT THAT USE OF THIS PROGRAM DOES NOT INFRINGE
       THE INTELLECTUAL PROPERTY RIGHTS OF ANY THIRD PARTY IN ANY COUNTRY.

       Copyright (c) 1994-2006, John Conover, All Rights Reserved.
       Comments and/or bug reports should be addressed to:

           john@email.johncon.com (John Conover)
       ----------------------------------------------------------------

January 18, 2006                                                  TSSTATEST(1)