TSSTATEST(1)                                                      TSSTATEST(1)

NAME
       tsstatest - make a statistical estimation of a time series

SYNOPSIS
       tsstatest [-c n] [-d] [-e m] [-f o] [-i] [-p] [-v] [filename]

DESCRIPTION
       Tsstatest makes a statistical estimation of a time series. Given the
       maximum error estimate and the required confidence level, the number
       of samples required is computed for both the standard deviation and
       the mean.

       The input file is a text file consisting of records, in temporal
       order, one record per time series sample. Blank records are ignored,
       and comment records are signified by a '#' character as the first
       non-whitespace character in the record. Data records must contain at
       least one field, which is the data value of the sample, but may
       contain many fields; if the record contains many fields, then the
       first field is regarded as the sample's time, and the last field as
       the sample's value at that time.

       Consider the following formula for the determination of the Shannon
       probability, P, of an equity market time series, using the average
       and root mean square of the normalized increments, avg and rms,
       respectively:

               avg
               --- + 1
               rms
           P = -------
                  2

       which is useful in the determination of the optimal fraction of
       capital, f, to invest in a stock, by:

           f = 2P - 1

       The objective is to estimate how large the data set has to be to
       determine P to a given accuracy, possibly using statistical
       estimates of how many data points are required for a given
       confidence level that the error is less than a specific value.

       Suppose we have a confidence level, 0 < c < 1, that a value is
       within, plus or minus, an error level, e. What this means, for
       example with c = 0.9 and e = 0.1, is that in 90% of the cases the
       value will be within the limits of +/- e; 5% of the time, on
       average, it will be less than -e, and 5% of the time more than +e.
       The error level for avg, at a given confidence level, will be:

           e    = k (rms / sqrt (n))
            avg

       where n is the number of records in the data set, and k is a
       function of the confidence level for a normal distribution. The
       error level for rms, at the same confidence level, will be:

           e    = k (rms / sqrt (2n))
            rms

       where k is identical in both cases. Also, the number of records
       required for a given error level would be:

                                  2
           n    = ((rms * k) / e  )
            avg                 avg

       and:

                  1               2
           n    = - ((rms * k) / e  )
            rms   2               rms

       where k is the same as above.

       For equity market indices, a typical value for rms would be 0.01,
       and 0.0003 for avg. This is probably typical of many stocks;
       however, high gain stocks in a "bull" market can have an rms of
       0.04, and an avg of 0.005.

       The value of k can be determined from standard statistical tables:

               c        sigma level
           -------------------------
              50            0.67
              68.27         1.00
              80            1.28
              90            1.64
              95            1.96
              95.45         2.00
              99            2.58
              99.73         3.00

       where k = sigma level, for a confidence level, c, expressed in
       percent.

       Note that for a given confidence level:

           avg   avg +/- k (rms / sqrt (n))
           --- = ---------------------------
           rms   rms +/- k (rms / sqrt (2n))

                 avg          1
                 --- +/- k --------
                 rms       sqrt (n)
               = ---------------------
                           1
                 1 +/- k ------------
                         4 * sqrt (n)

       Now, consider the specific example of avg and rms for an exponential
       function. In this specific case, avg = rms, and avg / rms = 1. Since
       k is assumed to be a function of a normally distributed random
       variable, the error in the ratio avg / rms as a function of the data
       set size, n, can be found by superposition, adding the contributing
       error values, as a function of n, for both rms and avg:

                   2          2
           sqrt (1  + (1 / 4) ) = 1.030776406

       or:

           avg   avg            1
           --- ~ --- +/- 1.03 * -------- * k
           rms   rms            sqrt (n)

                 avg        1
               ~ --- +/- -------- * k
                 rms     sqrt (n)

       where k is determined from the table, above.
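       The sample-size formulas above can be sketched in a few lines of
       Python (a minimal illustration, not the program's implementation;
       the function names are ours, and the sigma level k is obtained from
       the standard library's normal distribution rather than the table):

           ```python
           from statistics import NormalDist

           def k_for_confidence(c):
               # Two-sided confidence level c -> sigma level k,
               # e.g. c = 0.90 -> k ~ 1.64, matching the table above.
               return NormalDist().inv_cdf(0.5 + c / 2.0)

           def n_for_avg(rms, e_avg, c):
               # Records required so the error in avg is below e_avg.
               k = k_for_confidence(c)
               return ((rms * k) / e_avg) ** 2

           def n_for_rms(rms, e_rms, c):
               # Records required so the error in rms is below e_rms.
               k = k_for_confidence(c)
               return 0.5 * ((rms * k) / e_rms) ** 2

           # Typical index values from the text: rms = 0.01, avg = 0.0003.
           print(k_for_confidence(0.90))
           print(n_for_avg(0.01, 0.0003, 0.90))
           ```

       Note that pinning down avg to within its own typical magnitude
       requires a data set several orders of magnitude larger than pinning
       down rms to the same absolute error, since rms is much larger than
       avg.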
       In this specific case, where avg = rms:

           avg   avg              1
           --- ~ --- * (1 +/- -------- * k)
           rms   rms          sqrt (n)

       An interpretation of what this means is that, given a data set size,
       n, and a confidence level of, say, 90%, then 90% of the time our
       measurements of avg / rms would fall within an error level of
       +/- 1.64 * 1 / sqrt (n), ie., 5% of the time it would be greater
       than the error value, and 5% of the time it would be lower than the
       error value.

       In general, the concern is with the lower error value, since, from
       the equation:

               avg
               --- + 1
               rms
           P = -------
                  2

       (at least in this specific case where avg = rms,) a 90% confidence
       level would imply that there is a 5% chance of the real value of
       avg / rms being zero where:

               k
           -------- = 1
           sqrt (n)

       or:

             1.64
           -------- = 1
           sqrt (n)

       or n = 2.6896 ~ 3. What this means is that, if we repeat, many
       times, the experiment of finding 3 records in a row that have
       rms = avg, with neither equal to zero, we would lose money in 5% of
       the cases, making the measured Shannon probability, P, unity, and
       the estimated Shannon probability 0.95, eg., we should consider the
       Shannon probability to be 0.95 in this specific case. That is, it
       would be ill-advised to invest all of the capital in such a
       scenario, since, sooner or later, all of the capital would be lost,
       (on average, by the 20th game.)

       This implies a simple methodology. Measure avg and rms, and compute
       the Shannon probability. Decrease that probability by a factor, ie.,
       one minus the confidence level, divided by two, that the wager could
       be a losing proposition, based on the estimate that avg could be
       zero, (which is a function of the confidence level, and the number
       of records in the data set.) This, conceivably, could provide a
       quantitative estimate of the number of records required in a data
       set.
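       The methodology above can be sketched as follows (an illustrative
       sketch with names of our choosing, not the program's code; the
       confidence that avg is nonzero is derived from how many standard
       errors, k = (avg / rms) * sqrt (n), the measured ratio sits above
       zero):

           ```python
           import math
           from statistics import NormalDist

           def shannon_p(avg, rms):
               # Shannon probability from the normalized increments.
               return (avg / rms + 1.0) / 2.0

           def confidence_avg_nonzero(avg, rms, n):
               # Two-sided confidence that the true avg is above zero,
               # given n records.
               k = (avg / rms) * math.sqrt(n)
               return 2.0 * NormalDist().cdf(k) - 1.0

           def estimated_p(avg, rms, n):
               # Decrease the measured P by (1 - c) / 2, the chance that
               # the wager is a losing proposition.
               c = confidence_avg_nonzero(avg, rms, n)
               return shannon_p(avg, rms) * (1.0 - (1.0 - c) / 2.0)

           # The example from the text: avg = rms over n ~ 2.69 records
           # gives a measured P of unity, and an estimated P near 0.95.
           print(estimated_p(1.0, 1.0, 2.6896))
           ```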
       Note that if avg / rms is measured at 0.9, then:

             1.64
           -------- = 0.9
           sqrt (n)

       for the same confidence level of 0.9, or n = 3.32, and:

           avg
           ---        n        p    p measured
           rms
           ------------------------------------
           1.0       2.7     1.00      0.95
           0.9       3.3     0.95      0.90
           0.8       4.2     0.90      0.86
           0.7       5.5     0.85      0.81
           0.6       7.5     0.80      0.76
           0.5      10.8     0.75      0.71
           0.4      16.8     0.70      0.67
           0.3      29.9     0.65      0.62
           0.2      67.2     0.60      0.57
           0.1     268.9     0.55      0.52
           0.05   1075.8     0.53      0.50

       for the same confidence level of 0.9. What the table means is that
       if you have a stock price time series of 67 records, then the
       minimum measured Shannon probability must be at least 0.6, the
       wagering strategy should use the Shannon probability of 0.57, and
       the minimum number of records used to measure avg and rms is 67.
       Additionally, a stock time series with a Shannon probability of 0.53
       should be measured using not less than 1076 records, and no wager
       should be made unless the measurements involve substantially more
       than 1076 records. The Shannon probabilities of almost all stock
       time series fall, inclusively, in this range. 67 business days is,
       approximately, 13.4 weeks, or a little more than a calendar quarter.
       1076 business days is slightly longer than four calendar years.

       Note that Edgar E. Peters, in "Chaos and Order in the Capital
       Markets," John Wiley & Sons, New York, New York, 1991, pp. 83,
       referencing "Fractals," Jens Feder, Plenum Press, New York, New
       York, 1988, pp. 179, makes the claim that 2500 records is the
       minimum size of the data set for using fractal analytical
       methodologies. Note that a data set of this size would have, with an
       avg / rms of 0.5, which is "typical" for a stock time series, a
       Shannon probability error level that is approximately 1%, since it
       lies between 2 and 3 sigma, and c would be approximately 0.99.
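       The table above can be regenerated directly from the formulas
       already given (a sketch; k = 1.64 is the sigma level for a
       confidence level of 0.9, and the last column is p decreased by
       (1 - 0.9) / 2):

           ```python
           k = 1.64  # sigma level for c = 0.90

           for r in (1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05):
               n = (k / r) ** 2        # records required: k / sqrt(n) = avg/rms
               p = (r + 1.0) / 2.0     # Shannon probability for this ratio
               p_wager = p * 0.95      # decreased by (1 - c) / 2 for wagering
               print(f"{r:5.2f} {n:8.1f} {p:6.2f} {p_wager:6.2f}")
           ```

       For example, the avg / rms = 0.2 row gives n = (1.64 / 0.2)^2 =
       67.24, p = 0.60, and a wagering probability of 0.57, matching the
       table.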
       This would seem to be consistent with the empirical arguments of
       both Feder and Peters, although Peters implies that less could be
       used if the system being analyzed is "chaotic" in nature, and one
       "cycle" of the system's, apparently, "strange attractor" is less
       than 2500 time units.

       This analysis would seem to be consistent with the observations of
       these authors, provided that it is a requirement that the measured
       Shannon probability be used to calculate the optimum wager fraction.
       What this analysis would tend to suggest is that, although Feder's
       and Peters's arguments seem to be confirmed, there may also be other
       viable solutions for data sets, (or fragments thereof,) that are
       very much smaller, provided that the measured Shannon probability of
       the data set, or segment, is sufficiently large. For example, a
       stock that has a time series fragment with 5 out of 6 upward
       movements may prove to be a viable investment opportunity at a
       measured Shannon probability that is greater than 0.85, (5 / 6 = a
       Shannon probability of 0.833 ~ 0.85,) if played at a Shannon
       probability as high as 0.8, but no higher.

       For example, using a Shannon probability, P, of 0.51 for the
       tscoins(1) and tsfraction(1) programs, to provide an input fractal
       time series for the tsstatest(1) program, and iterating, indicates
       that for a standard deviation of 0.020000, with a confidence level
       of 0.960784 that the error did not exceed 0.020000, 3 samples would
       be required.
       Since the Shannon probability is calculated directly from the
       standard deviation, (ie., rms = root mean square of the normalized
       increments,) the maximum error can be calculated:

           0.5
           ---- = 0.980392157
           0.51

       which means there is a confidence level of 0.960784314 that the
       error level in the standard deviation is less than 0.02, because
       standard deviation = rms = 0.02, and 0.02 - 0.02 = 0, which would
       correspond to a Shannon probability, P, of 0.5; since half the
       errors outside the range of 0.02 would be negative, (and the other
       half positive,) the confidence level required would be 1 - ((1 -
       0.980392157) * 2). What this means is that ((1 - 0.960784314) / 2) *
       100 percent of the time, the actual rms value will be sufficiently
       small to make P equal to, or less than, 0.5. This means that P must
       be decreased by 1.9607843 percent. The reasoning is that, after many
       iterations, the measured P would be too small 1.9607843% of the
       time, on average, making the measured P, over all of the iterations,
       0.5.

       This suggests a dynamic rule: do not wager unless the Shannon
       probability, P, is strictly greater than 0.51, as measured on
       strictly more than 3 time units. Interestingly, the graph of the
       Hurst coefficient, as measured by the tshurst(1) program, of a
       random walk, Brownian motion, or fractional Brownian motion fractal
       indicates that there are significant near term correlations for 4 or
       fewer time units. This suggests a dynamic trading methodology for
       equities.

       Similar reasoning would indicate that using a value of P = 0.6 for
       the tscoins(1) and tsfraction(1) programs to provide input to the
       tsstatest(1) program, with a confidence level of 0.8 and an error of
       0.12, (ie., 10% of the time the value of P would be less than 0.9 *
       0.6 = 0.54, where 0.2 - 0.12 = 0.08, and 0.54 = (0.08 + 1) / 2,)
       would require a minimum of 3 records. The fraction of capital
       wagered should be 2 * 0.54 - 1 = 0.08.
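       The arithmetic of the worked example above can be verified as a
       sketch (variable names are ours): P = 0.51 gives rms = 2P - 1 =
       0.02, and an rms error of 0.02 would drop P back to 0.5.

           ```python
           P = 0.51
           P_prime = 0.50

           C = P_prime / P                     # 0.980392..., one-sided
           c = 1.0 - (1.0 - C) * 2.0           # 0.960784..., two-sided
           decrease = ((1.0 - c) / 2.0) * 100  # percent P must be decreased

           print(C, c, decrease)
           ```

       which reproduces the confidence level of 0.960784 and the 1.9607843
       percent decrease quoted above.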
       To review what has been presented so far: we really are not
       confident that we know the value of the Shannon probability, P,
       until we have sufficiently many records, n. One way of addressing
       this issue is to wait to make a wager until we do. But this strategy
       has an "opportunity cost," since, approximately 50% of the time, we
       would not have made an investment when we should have.

       Note that since investing in equities is not a 100% assured
       proposition, we only invest a fraction of our capital, f, where
       f = 2P - 1. Since investing with a data set size that is
       insufficient, ie., n is too small, lowers the probability of the
       wins, the Shannon probability, P, will have to be lowered to
       maintain the optimum wager fraction of the capital. We can compute
       the value to which the Shannon probability, P, must be lowered to
       account for this.

       The relationship between the Shannon probability, P, and the root
       mean square of the normalized increments of a time series, rms, is:

               rms + 1
           P = -------
                  2

       Let the error, e, in rms created by an insufficient data set size
       be:

           e = rms - rms'

       where 0 < rms' < rms. This means that although rms was measured, it
       could be as low as rms'. The confidence level that rms is not less
       than rms' can be found by statistical estimate. The Shannon
       probability, P', associated with rms' is:

                rms' + 1
           P' = --------
                   2

       P' is the Shannon probability if the root mean square value of the
       normalized increments of the time series is rms'.
       Since we want to alter the measured Shannon probability, P, to
       accommodate the error created by an insufficient data set size, we
       multiply P by the confidence level that the real value of P is not
       less than P', or the confidence level, C, is:

               P'
           C = --
               P

       The reasoning is that a value of C of, say, 0.9, means that the root
       mean square value of the increments could be below the measured
       value, rms, by an amount e for 5% of the time, and above rms by an
       amount e for 5% of the time, so that:

           P' = CP

       Substituting:

                rms' + 1
           CP = --------
                    2

       and solving for rms':

           rms' = 2CP - 1

       or:

           e = rms - (2CP - 1)

             = rms - 2CP + 1

       and substituting for rms, where rms = 2P - 1:

           e = 2P - 1 - 2CP + 1

             = 2P - 2CP

             = 2P(1 - C)

       and substituting P' = CP:

           e = 2P - 2P'

             = 2(P - P')

       C now has to be adjusted, because we are only concerned with the
       values of rms' that are less than rms, where:

           c = 1 - 2(1 - C)

             = 1 - 2 + 2C

             = 2C - 1

       but since C = P' / P:

               2P'
           c = --- - 1
                P

       So we have:

           e = 2(P - P')

       and:

               2P'
           c = --- - 1
                P

       which are the two general equations for use of this program for
       trading equities.

       Making a plot of these equations, of P' vs. n for various P,
       presents an interesting conjecture. The graph can be crudely
       approximated by a single pole filter, with a pole at 0.033, ie.,
       using the program tscoins(1) with a -p 0.6 argument to simulate an
       equity value time series, and the program tsinstant(1), with the -s
       option, to calculate the instantaneous Shannon probability of the
       time series, followed by the program tspole(1) with a -p 0.033
       argument, would output, approximately, P'. The approximated P' tends
       to under wager for t < 7, and over wager for t > 7. The
       approximation is simple, but crude. Interestingly, using the program
       tshurst(1) on the same time series indicates that there is good
       correlation for t < 5, and if this temporal range is of interest,
       this simple solution may prove adequate for non-rigorous
       requirements.
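       The two general equations above can be sketched directly (names are
       illustrative): given the measured Shannon probability P and the
       lowered probability P', compute the error e in rms, and the
       confidence level c required of the statistical estimate.

           ```python
           def error_and_confidence(P, P_prime):
               # e = 2(P - P') is the error in rms; c = 2P'/P - 1 is the
               # confidence level adjusted for the one-sided concern.
               e = 2.0 * (P - P_prime)
               c = 2.0 * P_prime / P - 1.0
               return e, c

           # The earlier example: P = 0.51 and P' = 0.5 give e = 0.02 and
           # c = 0.960784..., consistent with the tscoins(1) example.
           e, c = error_and_confidence(0.51, 0.50)
           print(e, c)
           ```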
       Additionally, perhaps using the tsmath(1) program, the output of the
       tspole(1) program could have 0.5 subtracted, be multiplied by, say,
       0.85, and then have the 0.5 re-added, to extend the usefulness to
       approximately 100 business days. The accuracy over this range is
       approximately +/- 0.01 out of 0.55. Naturally, after very many days,
       for example if P = 0.6, P' would still be 0.585, creating a long
       term error in rms of 0.2 - 0.17 = 0.03. Note that the error created
       in the exponential growth of the capital would be 0.04 - 0.0289, a
       substantial long term error.

       Alternately, perhaps a recursive feed-forward technique could be
       implemented that would allow the pole frequency to be selected for
       far term compatibility with the statistical estimate, while at the
       same time approximating the near term. Naturally, this, also, should
       not be considered a substitute for statistical estimates, but using
       statistical estimates would probably require a recursive procedure,
       and that is a formidable proposition.

       This program requires finding the value of the normal function,
       given the standard deviation. The method used is Romberg/trapezoid
       integration to numerically solve for the value. This program also
       requires finding the functional inverse of the normal, ie.,
       Gaussian, function. The method used is Romberg/trapezoid integration
       to numerically solve the equation:

                           x
                           |       1
           F(x) = integral |  ----------- * exp (-t^2 / 2) dt + 0.5
                           |  sqrt (2 pi)
                           0

       which has the derivative:

                       1
           f(x) = ----------- * exp (-x^2 / 2)
                  sqrt (2 pi)

       Since F(x) is known, and it is desired to find x:

                                  x
                                  |       1
           P(x) = F(x) - integral |  ----------- * exp (-t^2 / 2) dt - 0.5
                                  |  sqrt (2 pi)
                                  0

                = 0

       and the Newton-Raphson method of finding roots would be:

                        P(x )
                           n
           x    = x  - ------
            n+1    n    f(x )
                           n

       As a reference on the Newton-Raphson method of root finding, see
       "Numerical Recipes in C: The Art of Scientific Computing," William
       H. Press, Brian P. Flannery, Saul A. Teukolsky, William T.
       Vetterling, Cambridge University Press, New York, 1988, ISBN
       0-521-35465-X, pp. 270.
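       A minimal sketch of the numerical approach just described (a plain
       trapezoid rule standing in for the program's Romberg/trapezoid
       integration, with function names of our choosing):

           ```python
           import math

           def f(t):
               # Gaussian density, the derivative of F.
               return math.exp(-t * t / 2.0) / math.sqrt(2.0 * math.pi)

           def F(x, steps=1000):
               # F(x) = integral from 0 to x of f(t) dt + 0.5, by the
               # trapezoid rule.
               h = x / steps
               s = 0.5 * (f(0.0) + f(x))
               for i in range(1, steps):
                   s += f(i * h)
               return s * h + 0.5

           def inverse_F(target, iterations=20):
               # Newton-Raphson: solve P(x) = F(x) - target = 0, using
               # f(x) as the derivative.
               x = 0.0
               for _ in range(iterations):
                   x -= (F(x) - target) / f(x)
               return x

           # F(1.64) is about 0.95, so inverse_F(0.95) recovers the sigma
           # level for a 90% confidence level in the table above.
           print(inverse_F(0.95))
           ```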
       As a reference on Romberg integration, see "Numerical Recipes in C:
       The Art of Scientific Computing," William H. Press, Brian P.
       Flannery, Saul A. Teukolsky, William T. Vetterling, Cambridge
       University Press, New York, 1988, ISBN 0-521-35465-X, page 124.

       As a reference on trapezoid iteration, see "Numerical Recipes in C:
       The Art of Scientific Computing," William H. Press, Brian P.
       Flannery, Saul A. Teukolsky, William T. Vetterling, Cambridge
       University Press, New York, 1988, ISBN 0-521-35465-X, page 120.

       As a reference on polynomial interpolation, see "Numerical Recipes
       in C: The Art of Scientific Computing," William H. Press, Brian P.
       Flannery, Saul A. Teukolsky, William T. Vetterling, Cambridge
       University Press, New York, 1988, ISBN 0-521-35465-X, page 90.

OPTIONS
       -c n   Confidence level, 0.0 < n < 1.0.

       -d     Print the number of samples required as a float.

       -e m   Maximum absolute error estimate, 0.0 < m.

       -f o   Maximum fraction error estimate in standard deviation and
              mean.

       -i     Input is the integration of a Gaussian variable.

       -p     Only print the number of samples required for the mean and
              standard deviation.

       -v     Print the version and copyright banner of the program.

       filename
              Input filename.

WARNINGS
       There is little or no provision for handling numerical exceptions.
SEE ALSO
       tsderivative(1), tshcalc(1), tshurst(1), tsintegrate(1),
       tslogreturns(1), tslsq(1), tsnormal(1), tsshannon(1), tsblack(1),
       tsbrownian(1), tsdlogistic(1), tsfBm(1), tsfractional(1),
       tsgaussian(1), tsintegers(1), tslogistic(1), tspink(1),
       tsunfairfractional(1), tswhite(1), tscoin(1), tsunfairbrownian(1),
       tsfraction(1), tsshannonmax(1), tschangewager(1), tssample(1),
       tsrms(1), tscoins(1), tsavg(1), tsXsquared(1), tsstockwager(1),
       tsshannonwindow(1), tsmath(1), tsavgwindow(1), tspole(1), tsdft(1),
       tsbinomial(1), tsdeterministic(1), tsnumber(1), tsrmswindow(1),
       tsshannonstock(1), tsmarket(1), tsstock(1), tsstatest(1),
       tsunfraction(1), tsshannonaggregate(1), tsinstant(1),
       tsshannonvolume(1), tsstocks(1), tsshannonfundamental(1),
       tstrade(1), tstradesim(1), tsrunlength(1), tsunshannon(1),
       tsrootmean(1), tsrunmagnitude(1), tskurtosis(1),
       tskurtosiswindow(1), tsrootmeanscale(1), tsscalederivative(1),
       tsgain(1), tsgainwindow(1), tscauchy(1), tslognormal(1),
       tskalman(1), tsroot(1), tslaplacian(1)

DIAGNOSTICS
       Error messages for incompatible arguments, failure to allocate
       memory, inaccessible files, and opening and closing files.

AUTHORS
       ----------------------------------------------------------------
       A license is hereby granted to reproduce this software source code
       and to create executable versions from this source code for
       personal, non-commercial use. The copyright notice included with
       the software must be maintained in all copies produced.

       THIS PROGRAM IS PROVIDED "AS IS". THE AUTHOR PROVIDES NO WARRANTIES
       WHATSOEVER, EXPRESSED OR IMPLIED, INCLUDING WARRANTIES OF
       MERCHANTABILITY, TITLE, OR FITNESS FOR ANY PARTICULAR PURPOSE. THE
       AUTHOR DOES NOT WARRANT THAT USE OF THIS PROGRAM DOES NOT INFRINGE
       THE INTELLECTUAL PROPERTY RIGHTS OF ANY THIRD PARTY IN ANY COUNTRY.

       Copyright (c) 1994-2006, John Conover, All Rights Reserved.
       Comments and/or bug reports should be addressed to:

           john@email.johncon.com (John Conover)
       ----------------------------------------------------------------

January 18, 2006                                                  TSSTATEST(1)