TSSTATEST(1) TSSTATEST(1)
NAME
tsstatest - make a statistical estimation of a time series
SYNOPSIS
tsstatest [-c n] [-d] [-e m] [-f o] [-i] [-p] [-v] [filename]
DESCRIPTION
Tsstatest makes a statistical estimation of a time series: given a maximum error estimate and a required confidence level, it computes the number of samples needed for both the standard deviation and the mean.
The input file structure is a text file consisting of records, in temporal order, one record per time series sample. Blank records are ignored, and comment records are signified by a '#' character as the first non-whitespace character in the record. Data records must contain at least one field, which is the data value of the sample, but may contain many fields; if the record contains many fields, then the first field is regarded as the sample's time, and the last field as the sample's value at that time.
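As a sketch of this record format in Python (the helper name read_series is illustrative, not part of this program):

```python
def read_series(lines):
    """Parse tsstatest-style records: blank records and records whose
    first non-whitespace character is '#' are skipped; the last field
    of each remaining record is taken as the sample value."""
    values = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0].startswith('#'):
            continue
        # One field: the value.  Several fields: the first is the
        # sample's time, the last is the value; others are ignored.
        values.append(float(fields[-1]))
    return values
```

For example, read_series(open("data")) would return the sample values of the file "data".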
Consider the following formula for the determination of the Shannon probability, P, of an equity market time series, using the average and root mean square of the normalized increments, avg and rms, respectively:
    P = ((avg / rms) + 1) / 2
which is useful in the determination of the optimal fraction of capital, f, to invest in a stock, by:
f = 2P - 1
The objective is to estimate how large the data set has to be for
determining P to a given accuracy, possibly using statistical estimates
of how many data points are required for a given confidence level that
the error is less than a specific value.
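The two formulas above can be sketched in Python; this assumes avg is the simple arithmetic mean and rms the root mean square of the normalized increments (the function name is illustrative):

```python
import math

def shannon_probability(increments):
    """Compute the Shannon probability P = ((avg / rms) + 1) / 2 from
    the normalized increments of a time series, and the optimal wager
    fraction f = 2P - 1."""
    n = len(increments)
    avg = sum(increments) / n
    rms = math.sqrt(sum(x * x for x in increments) / n)
    P = ((avg / rms) + 1) / 2
    f = 2 * P - 1
    return P, f
```

For example, increments of [0.01, -0.01, 0.01, 0.01] give avg = 0.005, rms = 0.01, so P = 0.75 and f = 0.5.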
Suppose we have a confidence level, 0 < c < 1, that a value is within,
plus or minus, an error level, e. What this means, for example if c =
0.9, and e = 0.1, is that for 90% of the cases, the value will be
within the limits of +/- e, or, 5% of the time, on the average, it will
be less than -e, and 5% of the time more than +e.
The error level for avg, for a given confidence level, will be:

    e_avg = k * (rms / sqrt (n))

where n is the number of records in the data set, and k is a function involving a normal distribution. The error level for rms, for the same given confidence level, will be:

    e_rms = k * (rms / sqrt (2n))

where k is identical in both cases. Also, the number of records required for a given error level would be:

    n_avg = ((rms * k) / e_avg)^2

and:

    n_rms = (1 / 2) * ((rms * k) / e_rms)^2

where k is the same as above.
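The sample-size formulas above can be sketched directly (the helper names are illustrative, not part of this program):

```python
def n_for_avg(rms, e_avg, k):
    """Records required so the error in avg is below e_avg:
    n_avg = ((rms * k) / e_avg)^2."""
    return ((rms * k) / e_avg) ** 2

def n_for_rms(rms, e_rms, k):
    """Records required so the error in rms is below e_rms:
    n_rms = (1 / 2) * ((rms * k) / e_rms)^2."""
    return 0.5 * ((rms * k) / e_rms) ** 2
```

For example, with rms = 0.01, an error level of 0.001, and k = 1.64 (a 90% confidence level, from the table below), n_for_avg gives about 269 records, and n_for_rms about half that.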
For equity market indices, a typical value for rms would be 0.01, and 0.0003 for avg. This is probably typical for many stocks; however, high gain stocks, in a "bull" market, can have an rms of 0.04, and an avg of 0.005.
The value of k can be determined from standard statistical tables:

    c (%)      sigma level
    -----------------------
    50             0.67
    68.27          1.00
    80             1.28
    90             1.64
    95             1.96
    95.45          2.00
    99             2.58
    99.73          3.00
where k = sigma level, for a confidence level, c. Note that for a given
confidence level:
    avg / rms = (avg +/- k * rms / sqrt (n)) / (rms +/- k * rms / sqrt (2n))

              = ((avg / rms) +/- k / sqrt (n)) / (1 +/- k / (4 * sqrt (n)))
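The k values in the table above are roundings of the standard normal quantile, and can be checked with Python's statistics.NormalDist (assumed available, Python 3.8 or later); for a two-sided confidence level c, k is the quantile at (1 + c) / 2:

```python
from statistics import NormalDist

def k_of_c(c):
    """Sigma level k for a two-sided confidence level c, 0 < c < 1:
    the standard normal quantile at (1 + c) / 2, since half of the
    excursions beyond +/- k sigma fall on each side."""
    return NormalDist().inv_cdf((1 + c) / 2)
```

For example, k_of_c(0.90) is about 1.64 and k_of_c(0.95) about 1.96, matching the table.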
Now, consider the specific example of avg and rms for an exponential function. In this specific case, avg = rms, and avg / rms = 1. Since k is assumed to be a function of a normally distributed random variable, the error in the ratio avg / rms as a function of the data set size, n, can be found by superposition, adding the contributing error coefficients for avg and rms in root-mean-square fashion, or:

    sqrt (1^2 + (1 / 4)^2) = 1.030776406
or:

    avg / rms ~ (avg / rms) +/- 1.03 * (k / sqrt (n))

              ~ (avg / rms) +/- k / sqrt (n)

where k is determined from the table, above. In this specific case, where avg = rms:

    avg / rms ~ (avg / rms) * (1 +/- k / sqrt (n))
An interpretation of what this means is that, given a data set size, n,
and a confidence level of, say 90%, then 90% of the time, our measure-
ments of avg / rms, would fall within an error level of +/- 1.64 * 1 /
sqrt (n), ie., 5% of the time it would be greater than the error value,
and 5% of the time, it would be lower than the error value. In general,
the concern is with the lower error value since from the equation:
    P = ((avg / rms) + 1) / 2
(at least in this specific case where avg = rms,) a 90% confidence level would imply that there is a 5% chance of the real value of avg / rms being zero, which occurs where:

    k / sqrt (n) = 1

or:

    1.64 / sqrt (n) = 1

or n = 2.6896 ~ 3.
What this means is that, if we repeat the experiment of finding 3 records in a row that have rms = avg, with neither equal to zero, many times, we would lose money in 5% of the cases, making the measured Shannon probability, P, unity, and the estimated Shannon probability, 0.95; eg., we should consider the Shannon probability as 0.95 in this specific case. That is, it would be ill-advised to invest all of the capital in such a scenario, since, sooner or later, all of the capital would be lost, (on average, by the 20th game.)
This implies a simple methodology. Measure avg and rms, and compute the Shannon probability. Decrease that probability by a factor, ie., one minus the confidence level, divided by two, that the wager could be a losing proposition, based on the estimate that avg could be zero, (which is a function of the confidence level, and the number of records in the data set.) This, conceivably, could provide a quantitative estimate of the number of records required in a data set.
Note that if avg / rms is measured at 0.9, then:

    1.64 / sqrt (n) = 0.9

for the same confidence level of 0.9, or:

    n = 3.32
and:

    avg / rms       n         p      p measured
    -------------------------------------------
      1.0           2.7      1.00       0.95
      0.9           3.3      0.95       0.90
      0.8           4.2      0.90       0.86
      0.7           5.5      0.85       0.81
      0.6           7.5      0.80       0.76
      0.5          10.8      0.75       0.71
      0.4          16.8      0.70       0.67
      0.3          29.9      0.65       0.62
      0.2          67.2      0.60       0.57
      0.1         268.9      0.55       0.52
      0.05       1075.8      0.53       0.50
for the same confidence level of 0.9. What the table means is that, if you have a stock price time series of 67 records, then the minimum measured Shannon probability must be at least 0.6, the wagering strategy should use the Shannon probability of 0.57, and the minimum number of records used to measure avg and rms is 67. Additionally, a stock time series with a Shannon probability of 0.53 should be measured using not less than 1076 records, and no wager should be made unless the measurements involve substantially more than 1076 records. In general, the Shannon probability of almost all stock time series falls, inclusively, in this range. 67 business days is, approximately, 13.4 weeks, or a little more than a calendar quarter. 1076 business days is slightly longer than four calendar years.
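The rows of the table above can be reproduced in Python; the last column is assumed, consistent with the rows shown, to be p de-rated by the factor (1 - (1 - c) / 2) = 0.95 for c = 0.9 (the function name is illustrative):

```python
def table_row(r, k=1.64, c=0.9):
    """One row of the table: for a measured avg / rms ratio r,
    n = (k / r)^2 records, p = (r + 1) / 2, and p de-rated by the
    factor (1 - (1 - c) / 2), assumed from the 0.95 in the rows."""
    n = (k / r) ** 2
    p = (r + 1) / 2
    p_measured = p * (1 - (1 - c) / 2)
    return n, p, p_measured
```

For example, table_row(0.2) gives n of about 67.2, p = 0.60, and a de-rated probability of 0.57, matching the corresponding row.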
Note that in "Chaos and Order in the Capital Markets," John Wiley & Sons, New York, New York, 1991, pp. 83, Edgar E. Peters, referencing "Fractals," Jens Feder, Plenum Press, New York, New York, 1988, pp. 179, makes the claim that 2500 records is the minimum size of the data set for using fractal analytical methodologies. Note that a data set of this size would have, with an avg / rms of 0.5, which is "typical" for a stock time series, a Shannon probability error level that is approximately 1%, since it lies between 2 and 3 sigma, and c would be approximately 0.99. This would seem to be consistent with the empirical arguments of both Feder and Peters, although Peters implies that less could be used if the system being analyzed is "chaotic" in nature, and one "cycle" of the system's, apparently, "strange attractor" is less than 2500 time units. This analysis would seem to be consistent with the observations of these authors, provided that it is a requirement that the measured Shannon probability be used to calculate the optimum wager fraction.
What this analysis would tend to suggest is that, although Feder's and Peters's arguments seem to be confirmed, there may also be other viable solutions for data sets, (or fragments thereof,) that are very much smaller, provided that the measured Shannon probability of the data set, or segment, is sufficiently large. For example, a stock that has a time series fragment with 5 out of 6 upward movements may prove to be a viable investment opportunity at a measured Shannon probability that is greater than 0.85, (5 / 6 = a Shannon probability of 0.833 ~ 0.85,) if played at a Shannon probability as high as 0.8, but no higher.
For example, using a Shannon probability, P, of 0.51 for the tscoins(1)
and tsfraction(1) programs, to provide an input fractal time series for
the tsstatest(1) program, and iterating, indicates that for a standard
deviation of 0.020000, with a confidence level of 0.960784 that the
error did not exceed 0.020000, 3 samples would be required.
Since the Shannon probability is calculated directly from the standard deviation, (ie., rms, the root mean square of the normalized increments,) the maximum error can be calculated:

    0.5 / 0.51 = 0.980392157

which means a confidence level of 0.960784314 that the error level in the standard deviation is less than 0.02, because a standard deviation of rms = 0.02 - 0.02 = 0 would correspond to a Shannon probability, P, of 0.5, and since half the errors outside the range of 0.02 would be negative, (and the other half positive,) the confidence level required would be 1 - ((1 - 0.980392157) * 2).
What this means is that ((1 - 0.960784314) / 2) * 100 percent of the time, the actual rms value will be sufficiently small to make P equal to, or less than, 0.5. This means that P must be decreased by 1.9607843 percent. The reasoning is that, after many iterations, the measured P would be too small 1.9607843% of the time, on average, making the measured P, over all of the iterations, 0.5.
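The arithmetic of this example can be sketched as a small Python function (the name is illustrative): the two-sided confidence C is the ratio P' / P, and the one-sided confidence required is c = 2C - 1:

```python
def derate_confidence(P, P_prime):
    """Confidence level c that the true Shannon probability is not
    below P_prime, given the measured P: C = P' / P two-sided,
    c = 2C - 1 one-sided (errors above P do not concern us)."""
    C = P_prime / P
    return 2 * C - 1
```

For the example above, derate_confidence(0.51, 0.50) reproduces the confidence level 0.960784314.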
This suggests a dynamic rule: do not wager unless the Shannon probability, P, is strictly greater than 0.51, as measured on strictly more than 3 time units. Interestingly, a graph of the Hurst coefficient, as measured by the tshurst(1) program, of a random walk, Brownian motion, or fractional Brownian motion fractal indicates that there are significant near term correlations for 4 or less time units. This suggests a dynamic trading methodology for equities.
Similar reasoning would indicate that using a value of P = 0.6 for the
tscoins(1) and tsfraction(1) programs to provide input to the tsstat-
est(1) program with a confidence level of 0.8, and an error of 0.12,
(ie., 10% of the time the value of P would be less than 0.9 * 0.6 =
0.54, where 0.2 - 0.12 = 0.08, and 0.54 = (0.08 + 1) / 2) would require
a minimum of 3 records. The fraction of capital wagered should be 2 *
0.54 - 1 = 0.08.
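The record count in this example can be checked against the n_rms formula, above; a sketch in Python, using statistics.NormalDist for k and rounding up to whole records (the rounding-up is an assumption, and the function name is illustrative):

```python
import math
from statistics import NormalDist

def min_records(P, e, c):
    """Minimum records so that, at confidence level c, the error in
    rms = 2P - 1 stays below e: n = (1 / 2) * ((rms * k) / e)^2,
    with k the standard normal quantile at (1 + c) / 2, rounded up."""
    rms = 2 * P - 1
    k = NormalDist().inv_cdf((1 + c) / 2)
    return math.ceil(0.5 * ((rms * k) / e) ** 2)
```

For P = 0.6, an error of 0.12, and a confidence level of 0.8, this gives the minimum of 3 records cited above.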
To review what has been presented so far, we really are not confident
that we know the value of the Shannon probability, P, until we have
sufficiently many records, n. One way of addressing this issue is to
wait to make a wager until we do. But this strategy has an "opportunity
cost," since, approximately 50% of the time, we would not have made an
investment when we should have. Note that since investing in equities
is not a 100% assured proposition, we only invest a fraction of our
capital, f, where f = 2P - 1. Since investing with a data set size that
is insufficient, ie., n is too small, lowers the probability of the
wins, the Shannon probability, P, will have to be lowered to maintain
the optimum wager fraction of the capital. We can compute the value to which the Shannon probability, P, must be lowered to account for this.
The relationship between the Shannon probability, P, and the root mean
square of the normalized increments of a time series, rms, is:
    P = (rms + 1) / 2
Let the error, e, in rms created by an insufficient data set size be:
e = rms - rms'
where 0 < rms' < rms. This means that although rms was measured it
could be as low as rms'. The confidence level that rms is not less than
rms' can be found by statistical estimate. The Shannon probability, P',
associated with rms' is:
    P' = (rms' + 1) / 2
P' is the Shannon probability if the root mean square value of the
normalized increments of the time series is rms'.
Since we want to alter the measured Shannon probability, P, to accommodate the error created by an insufficient data set size, we multiply P by the confidence level that the real value of P is not less than P', or the confidence level, C, is:
    C = P' / P
The reasoning is that a value of C, say 0.9, means that the root mean
square value of the increments could be below the measured value, rms,
by an amount e for 5% of the time, and above rms by an amount e for 5%
of the time, so that:
P' = CP
Substituting:
    CP = (rms' + 1) / 2
and solving for rms':
rms' = 2CP - 1
or:
e = rms - (2CP - 1) = rms - 2CP + 1
and substituting for rms, where rms = 2P - 1:
e = 2P - 1 - 2CP + 1 = 2P - 2CP = 2P(1 - C)
and substituting P' = CP:
e = 2P - 2P' = 2(P - P')
C now has to be adjusted because we are only concerned with the values
of rms' that are less than rms, where:
c = 1 - 2(1 - C) = 1 - 2 + 2C = 2C - 1
but since C = P' / P:

    c = (2P' / P) - 1
or we have:
e = 2(P - P')
and:

    c = (2P' / P) - 1
which are the two general equations for use of this program for trading
equities.
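The two general equations above can be sketched as one Python function (the name is illustrative, not part of this program):

```python
def wager_parameters(P, P_prime):
    """The two general equations: the error e = 2 * (P - P') in rms
    implied by de-rating the measured P to P', and the one-sided
    confidence c = (2P' / P) - 1 required of the estimate."""
    e = 2 * (P - P_prime)
    c = (2 * P_prime / P) - 1
    return e, c
```

For the earlier example of P = 0.6 de-rated to P' = 0.54, this gives e = 0.12 and c = 0.8, matching the figures above.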
Making a plot of these equations, of P' vs. n for various P, presents an interesting conjecture. The graph can be crudely approximated by a single pole filter, with a pole at 0.033; ie., using the program tscoins(1) with a -p 0.6 argument to simulate an equity value time series, and the program tsinstant(1), with the -s option, to calculate the instantaneous Shannon probability of the time series, followed by the program tspole(1) with a -p 0.033 argument, would output, approximately, P'. The approximated P' tends to under wager for t < 7, and over wager for t > 7. The approximation is simple, but crude. Interestingly, using the program tshurst(1) on the same time series indicates that there is good correlation for t < 5, and if this temporal range is of interest, this simple solution may prove adequate for non-rigorous requirements. Additionally, perhaps using the tsmath(1) program, the output of the tspole(1) program could have 0.5 subtracted, be multiplied by, say, 0.85, and then have the 0.5 re-added, to extend the usefulness to approximately 100 business days. The accuracy over this range is approximately +/- 0.01 out of 0.55. Naturally, after very many days, for example, if P = 0.6, P' would still be 0.585, creating a long term error in rms of 0.2 - 0.17 = 0.03. Note that the error created in the exponential growth of the capital would be 0.04 - 0.0289, a substantial long term error.
Alternatively, perhaps a recursive feed-forward technique could be implemented that would allow the pole frequency to be selected for far term compatibility with the statistical estimate, while at the same time approximating the near term. Naturally, this, also, should not be considered a substitute for statistical estimates, but using statistical estimates would probably require a recursive procedure, and that is a formidable proposition.
This program requires finding the value of the normal function, given the standard deviation. The method used is Romberg/trapezoid integration, to numerically solve for the value.
This program also requires finding the functional inverse of the normal, ie., Gaussian, function. The method used is Romberg/trapezoid integration, to numerically solve the equation:

    F(x) = integral from 0 to x of (1 / sqrt (2 * pi)) * e^(-t^2 / 2) dt + 0.5
which has the derivative:

    f(x) = (1 / sqrt (2 * pi)) * e^(-x^2 / 2)
Since the value of F(x) is known, and it is desired to find x, let:

    P(x) = F(x) - (integral from 0 to x of (1 / sqrt (2 * pi)) * e^(-t^2 / 2) dt + 0.5) = 0
and the Newton-Raphson method of finding roots would be:

    x_(n+1) = x_n - P(x_n) / f(x_n)
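The method can be sketched in Python; this sketch substitutes a fixed-step trapezoid rule for the program's Romberg/trapezoid integration, and iterates Newton-Raphson on x (all names are illustrative, not part of this program):

```python
import math

def normal_cdf(x, steps=1000):
    """F(x): trapezoid integration of the standard normal density
    from 0 to x, plus 0.5 (the program uses Romberg refinement;
    a fixed step count is used here for simplicity)."""
    if x == 0.0:
        return 0.5
    density = lambda t: math.exp(-t * t / 2) / math.sqrt(2 * math.pi)
    h = x / steps
    total = (density(0.0) + density(x)) / 2
    for i in range(1, steps):
        total += density(i * h)
    return total * h + 0.5

def inverse_normal(p, tol=1e-10):
    """Solve F(x) = p by Newton-Raphson, using the normal density
    f(x) as the derivative of F(x)."""
    x = 0.0
    for _ in range(100):
        f = math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
        step = (normal_cdf(x) - p) / f
        x -= step
        if abs(step) < tol:
            break
    return x
```

For example, inverse_normal(0.95) returns approximately 1.64, the sigma level for a 90% two-sided confidence level in the table above.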
As a reference on Newton-Raphson Method of root finding, see "Numerical
Recipes in C: The Art of Scientific Computing," William H. Press, Brian
P. Flannery, Saul A. Teukolsky, William T. Vetterling, Cambridge
University Press, New York, 1988, ISBN 0-521-35465-X, page 270.
As a reference on Romberg integration, see "Numerical Recipes in C: The
Art of Scientific Computing," William H. Press, Brian P. Flannery, Saul
A. Teukolsky, William T. Vetterling, Cambridge University Press, New
York, 1988, ISBN 0-521-35465-X, page 124.
As a reference on trapezoid iteration, see "Numerical Recipes in C: The
Art of Scientific Computing," William H. Press, Brian P. Flannery, Saul
A. Teukolsky, William T. Vetterling, Cambridge University Press, New
York, 1988, ISBN 0-521-35465-X, page 120.
As a reference on polynomial interpolation, see "Numerical Recipes in
C: The Art of Scientific Computing," William H. Press, Brian P. Flan-
nery, Saul A. Teukolsky, William T. Vetterling, Cambridge University
Press, New York, 1988, ISBN 0-521-35465-X, page 90.
OPTIONS
-c n Confidence level, 0.0 < n < 1.0.
-d Print number of samples required as a float.
-e m Maximum absolute error estimate, 0.0 < m.
-f o Maximum fraction error estimate in standard deviation and mean.
-i Input is the integration of a Gaussian variable.
-p Only print number of samples required for mean and standard
deviation.
-v Print the version and copyright banner of the program.
filename
Input filename.
WARNINGS
There is little or no provision for handling numerical exceptions.
SEE ALSO
tsderivative(1), tshcalc(1), tshurst(1), tsintegrate(1), tslogreturns(1), tslsq(1), tsnormal(1), tsshannon(1), tsblack(1), tsbrownian(1), tsdlogistic(1), tsfBm(1), tsfractional(1), tsgaussian(1), tsintegers(1), tslogistic(1), tspink(1), tsunfairfractional(1), tswhite(1), tscoin(1), tsunfairbrownian(1), tsfraction(1), tsshannonmax(1), tschangewager(1), tssample(1), tsrms(1), tscoins(1), tsavg(1), tsXsquared(1), tsstockwager(1), tsshannonwindow(1), tsmath(1), tsavgwindow(1), tspole(1), tsdft(1), tsbinomial(1), tsdeterministic(1), tsnumber(1), tsrmswindow(1), tsshannonstock(1), tsmarket(1), tsstock(1), tsstatest(1), tsunfraction(1), tsshannonaggregate(1), tsinstant(1), tsshannonvolume(1), tsstocks(1), tsshannonfundamental(1), tstrade(1), tstradesim(1), tsrunlength(1), tsunshannon(1), tsrootmean(1), tsrunmagnitude(1), tskurtosis(1), tskurtosiswindow(1), tsrootmeanscale(1), tsscalederivative(1), tsgain(1), tsgainwindow(1), tscauchy(1), tslognormal(1), tskalman(1), tsroot(1), tslaplacian(1)
DIAGNOSTICS
Error messages are issued for incompatible arguments, failure to allocate memory, inaccessible files, and errors opening and closing files.
AUTHORS
----------------------------------------------------------------------
A license is hereby granted to reproduce this software source code and
to create executable versions from this source code for personal,
non-commercial use. The copyright notice included with the software
must be maintained in all copies produced.
THIS PROGRAM IS PROVIDED "AS IS". THE AUTHOR PROVIDES NO WARRANTIES
WHATSOEVER, EXPRESSED OR IMPLIED, INCLUDING WARRANTIES OF
MERCHANTABILITY, TITLE, OR FITNESS FOR ANY PARTICULAR PURPOSE. THE
AUTHOR DOES NOT WARRANT THAT USE OF THIS PROGRAM DOES NOT INFRINGE THE
INTELLECTUAL PROPERTY RIGHTS OF ANY THIRD PARTY IN ANY COUNTRY.
Copyright (c) 1994-2006, John Conover, All Rights Reserved.
Comments and/or bug reports should be addressed to:
john@email.johncon.com (John Conover)
----------------------------------------------------------------------
January 18, 2006 TSSTATEST(1)