From: John Conover <john@email.johncon.com>
Subject: Re: portfolio management
Date: Sat, 19 Sep 1998 18:20:49 -0700
John Conover writes:
>
> BTW, the tsinvest program uses the algorithms out of the
> tsshannoneffective program to avoid making bad investment
> recommendations when searching the ticker for the best set of stocks
> for an optimal growth portfolio. The algorithms used are general, and
> can be included in any program. The sources are in
> http://www.johncon.com/ntropix/archive/tsinvest.tar.gz. Traditional
> statistical estimation is only one method that it uses; it turns out
> that statistical estimation is grossly optimistic with fractal data
> sets. For example, using only statistical estimation, one would
> expect to be able to time the market 51.9635% of the time, which
> would seem to be a very workable agenda, with a significant payoff.
> Not so, however. If one attempts such an agenda with only a 48.2783%
> likelihood of succeeding, one will win sometimes, but in the long
> run, lose the entire portfolio, (at a rate of 0.999406857 per day, on
> average.) The tsinvest program can be programmed to attempt to do
> market timing, and then simulations can be run using the NYSE
> historical data CDs of every stock in the NYSE since 1966. The
> simulations verify that the 48.3% number is, indeed, valid. (That's
> the main use of the tsinvest program: to simulate trading strategies
> before it's turned loose with live data.) The 48.2783% number is,
> also, fairly close to empirical market metrics from formal studies
> run in the mutual fund industry.
>
The tsshannoneffective program calculates the effective Shannon
probability, given the average, root mean square, data set size, and
data set duration, of the normalized increments of a time series.
Bottom line, it is for programmed trading (PT) of stocks. The C
sources are freely available as Open Source software at
http://www.johncon.com/ntropix/archive/tsinvest.tar.gz.
A fragment, specific to this discussion, of the manual page is
attached ...
John
--
John Conover, john@email.johncon.com, http://www.johncon.com/
DESCRIPTION
DATA SET SIZE CONSIDERATIONS
This program addresses the question "is there reasonable
evidence to justify investment in an equity based on data
set size?"
The Shannon probability of a time series is the likelihood
that the value of the time series will increase in the
next time interval. The Shannon probability is measured
using the average, avg, and root mean square, rms, of the
normalized increments of the time series. Using the rms to
compute the Shannon probability, P:
P = (rms + 1) / 2 ...............................(1.1)
However, there is an error associated with the measurement
of rms due to the size of the data set, N, (i.e., the number
of records in the time series,) used in the calculation of
rms. The confidence level, c, is the likelihood that this
error is less than some error level, e.
Over the many time intervals represented in the time
series, the error will be greater than the error level, e,
(1 - c) * 100 percent of the time, requiring that the
Shannon probability, P, be reduced by a factor of c to
accommodate the measurement error:
Pc = (rms - e + 1) / 2 ..........................(1.2)
where the error level, e, and the confidence level, c, are
calculated using statistical estimates, and the product P
times c is the effective Shannon probability that should
be used in the calculation of optimal wagering strategies.
The error, e, expressed in terms of the standard deviation
of the measurement error due to an insufficient data set
size, esigma, is:
esigma = (e / rms) * sqrt (2N) ..................(1.3)
where N is the data set size = number of records. From
this, the confidence level can be calculated from the
cumulative sum, (i.e., integration) of the normal
distribution, i.e.:
c (%)    esigma
-----    ------
50       0.67
68.27    1.00
80       1.28
90       1.64
95       1.96
95.45    2.00
99       2.58
99.73    3.00
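The pairs in the table agree with the two-sided area of the normal curve, c = erf (esigma / sqrt (2)). That identification is inferred from the numbers and is not stated in the manual page. A minimal Python sketch that regenerates the table under this assumption (the function name is illustrative only):

```python
import math


def two_sided_confidence(esigma):
    """Percent of a normal distribution within +/- esigma standard deviations."""
    return 100.0 * math.erf(esigma / math.sqrt(2.0))


if __name__ == "__main__":
    # Print the same two columns as the table: c (%) and esigma.
    for esigma in (0.67, 1.00, 1.28, 1.64, 1.96, 2.00, 2.58, 3.00):
        print(f"{two_sided_confidence(esigma):6.2f}  {esigma:4.2f}")
```

The printed values match the table to within the rounding of the esigma column.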
Note that the equation:
Pc = (rms - e + 1) / 2 ..........................(1.4)
will require an iterated solution since the cumulative
normal distribution is transcendental. For convenience,
let F(esigma) be the function that given esigma, returns
c, (i.e., performs the table operation, above,) then:
P * F(esigma) = (rms - e + 1) / 2
  = (rms - rms * esigma / sqrt (2N) + 1) / 2 ....(1.5)
Then:
((rms + 1) / 2) * F(esigma)
  = (rms - rms * esigma / sqrt (2N) + 1) / 2 ....(1.6)
or:
(rms + 1) * F(esigma)
  = rms - rms * esigma / sqrt (2N) + 1 ..........(1.7)
Letting a decision variable, decision, be the iteration
error created by this equation not being balanced:
decision = rms - rms * esigma / sqrt (2N) + 1
  - (rms + 1) * F(esigma) .......................(1.8)
which can be iterated to find F(esigma), which is the
confidence level, c.
Note that from the equation:
Pc = (rms - e + 1) / 2
and solving for rms - e, the effective value of rms
compensated for accuracy of measurement by statistical
estimation:
rms - e = (2 * P * c) - 1 ......................(1.9)
and substituting the equation:
P = (rms + 1) / 2
gives:
rms - e = ((rms + 1) * c) - 1 .................(1.10)
and defining the effective value of rms as rmseff:
rmseff = rms - e ..............................(1.11)
It can be seen that if optimality exists, i.e., f = 2P - 1,
then:
avg = rms^2 ...................................(1.12)
and, for the effective values:
avgeff = rmseff^2 .............................(1.13)
As an example of this algorithm, if the Shannon
probability, P, is 0.51, corresponding to an rms of 0.02,
then the confidence level, c, would be 0.996298, or the
error level, e, would be 0.003776, for a data set size, N,
of 100.
Likewise, if P is 0.6, corresponding to an rms of 0.2 then
the confidence level, c, would be 0.941584, or the error
level, e, would be 0.070100, for a data set size of 10.
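The iteration on equation (1.8) can be sketched in Python as a bisection on esigma. This is not the program's own search, which works over a lookup table. It assumes that F is the one-sided cumulative normal, whereas the printed table is two-sided. That choice is made only because it lands close to the figures in the two examples above. All names are illustrative.

```python
import math


def normal_cdf(x):
    """One-sided cumulative normal; an assumed stand-in for F(esigma)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def rms_confidence(rms, n, limit=40.0, steps=200):
    """Bisect equation (1.8) for esigma; return (c, e). Requires rms > 0."""
    root_2n = math.sqrt(2.0 * n)

    def decision(esigma):
        return (rms - rms * esigma / root_2n + 1.0
                - (rms + 1.0) * normal_cdf(esigma))

    # decision(0) = (rms + 1) / 2 > 0, decision(limit) < 0, and the
    # function decreases in between, so bisection brackets the root.
    lo, hi = 0.0, limit
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        if decision(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    esigma = 0.5 * (lo + hi)
    c = normal_cdf(esigma)
    e = rms * esigma / root_2n  # equation (1.3) solved for e
    return c, e


if __name__ == "__main__":
    print(rms_confidence(0.02, 100))
    print(rms_confidence(0.2, 10))
```

Bisection is used instead of a Newton iteration because the decision variable decreases monotonically in esigma, so convergence does not depend on a starting guess.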
Robustness is an issue in algorithms that, potentially,
operate in real time. The traditional means of implementing
statistical estimates is to use an integration process
inside of a loop that calculates the cumulative of the
normal distribution, controlled by, perhaps, a Newton's
method approximation using the derivative of the cumulative
of the normal distribution, i.e., the formula for the
normal distribution:
f(x) = (1 / sqrt (2 * PI)) * e^(-x^2 / 2) .....(1.14)
Numerical stability and convergence are concerns in such
processes.
The Shannon probability of a time series is the likelihood
that the value of the time series will increase in the
next time interval. The Shannon probability is measured
using the average, avg, and root mean square, rms, of the
normalized increments of the time series. Using the avg to
compute the Shannon probability, P:
P = (sqrt (avg) + 1) / 2 ......................(1.15)
However, there is an error associated with the measurement
of avg due to the size of the data set, N, (i.e., the number
of records in the time series,) used in the calculation of
avg. The confidence level, c, is the likelihood that this
error is less than some error level, e.
Over the many time intervals represented in the time
series, the error will be greater than the error level, e,
(1 - c) * 100 percent of the time, requiring that the
Shannon probability, P, be reduced by a factor of c to
accommodate the measurement error:
Pc = (sqrt (avg - e) + 1) / 2 .................(1.16)
where the error level, e, and the confidence level, c, are
calculated using statistical estimates, and the product P
times c is the effective Shannon probability that should
be used in the calculation of optimal wagering strategies.
The error, e, expressed in terms of the standard deviation
of the measurement error due to an insufficient data set
size, esigma, is:
esigma = (e / rms) * sqrt (N) .................(1.17)
where N is the data set size = number of records. From
this, the confidence level can be calculated from the
cumulative sum, (i.e., integration) of the normal
distribution, i.e.:
c (%)    esigma
-----    ------
50       0.67
68.27    1.00
80       1.28
90       1.64
95       1.96
95.45    2.00
99       2.58
99.73    3.00
Note that the equation:
Pc = (sqrt (avg - e) + 1) / 2 .................(1.18)
will require an iterated solution since the cumulative
normal distribution is transcendental. For convenience,
let F(esigma) be the function that given esigma, returns
c, (i.e., performs the table operation, above,) then:
P * F(esigma) = (sqrt (avg - e) + 1) / 2
  = (sqrt [avg - rms * esigma / sqrt (N)] + 1) / 2
  ..............................................(1.19)
Then:
((sqrt (avg) + 1) / 2) * F(esigma)
  = (sqrt [avg - rms * esigma / sqrt (N)] + 1) / 2
  ..............................................(1.20)
or:
(sqrt (avg) + 1) * F(esigma)
  = sqrt [avg - rms * esigma / sqrt (N)] + 1 ...(1.21)
Letting a decision variable, decision, be the iteration
error created by this equation not being balanced:
decision = sqrt [avg - rms * esigma / sqrt (N)] + 1
  - (sqrt (avg) + 1) * F(esigma) ...............(1.22)
which can be iterated to find F(esigma), which is the
confidence level, c.
There are two radicals that have to be protected from
numerical floating point exceptions. The sqrt (avg) can be
protected by requiring that avg >= 0, (and returning a
confidence level of 0.5, or possibly zero, in this
instance; a negative avg is not an interesting solution for
the case at hand.) The other radical:
sqrt [avg - rms * esigma / sqrt (N)] ..........(1.23)
and substituting:
esigma = (e / rms) * sqrt (N) .................(1.24)
which is:
sqrt [avg - (rms * (e / rms) * sqrt (N)) / sqrt (N)]
  ..............................................(1.25)
and reducing:
sqrt [avg - e] ................................(1.26)
requiring that:
avg >= e ......................................(1.27)
Note that if e > avg, then Pc < 0.5, which is not an
interesting solution for the case at hand. Keeping
avg >= e requires:
esigma <= (avg / rms) * sqrt (N) ..............(1.28)
The search algorithm must be prohibited from searching
for, (i.e., testing for,) a solution in the space where
e > avg.
The solution is to limit the search of the confidence
array to values that are equal to or less than:
(avg / rms) * sqrt (N) ........................(1.29)
which can be accomplished by setting the integer variable,
top, usually set to sigma_limit - 1, to this value.
Note that from the equation:
Pc = (sqrt (avg - e) + 1) / 2
and solving for avg - e, the effective value of avg
compensated for accuracy of measurement by statistical
estimation:
avg - e = ((2 * P * c) - 1)^2 .................(1.30)
and substituting the equation:
P = (sqrt (avg) + 1) / 2
gives:
avg - e = (((sqrt (avg) + 1) * c) - 1)^2 ......(1.31)
and defining the effective value of avg as avgeff:
avgeff = avg - e ..............................(1.32)
It can be seen that if optimality exists, i.e., f = 2P - 1,
then:
avg = rms^2 ...................................(1.33)
and, for the effective values:
rmseff = sqrt (avgeff) ........................(1.34)
As an example of this algorithm, if the Shannon
probability, P, is 0.52, corresponding to an avg of 0.0016,
and an rms of 0.04, then the confidence level, c, would be
0.987108, or the error level, e, would be 0.000893, for a
data set size, N, of 10000.
Likewise, if P is 0.6, corresponding to an rms of 0.2, and
an avg of 0.04, then the confidence level, c, would be
0.922759, or the error level, e, would be 0.028484, for a
data set size of 100.
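The avg-based iteration, including the protection of both radicals, can be sketched the same way. The sketch below bisects equation (1.22) on the interval bounded by equation (1.29). It assumes that F is the one-sided cumulative normal. That is an assumption, chosen because it lands near the two examples above. The program itself searches a lookup table, so the trailing digits differ. Names are illustrative.

```python
import math


def normal_cdf(x):
    """One-sided cumulative normal; an assumed stand-in for F(esigma)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def avg_confidence(avg, rms, n, steps=200):
    """Bisect equation (1.22) for esigma; return (c, e)."""
    if avg < 0.0 or rms <= 0.0:
        return 0.5, 0.0  # guard for sqrt (avg); e = 0.0 is only a placeholder
    root_n = math.sqrt(n)
    top = avg / rms * root_n  # equation (1.29): keeps avg - e >= 0

    def decision(esigma):
        inner = max(0.0, avg - rms * esigma / root_n)  # absorb rounding at top
        return (math.sqrt(inner) + 1.0
                - (math.sqrt(avg) + 1.0) * normal_cdf(esigma))

    lo, hi = 0.0, top
    if decision(hi) > 0.0:
        lo = hi  # no balance point below the bound; settle on the bound
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        if decision(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    esigma = 0.5 * (lo + hi)
    return normal_cdf(esigma), rms * esigma / root_n


if __name__ == "__main__":
    print(avg_confidence(0.0016, 0.04, 10000))
    print(avg_confidence(0.04, 0.2, 100))
```

Restricting the search to [0, top] means the second radical never sees a negative argument, so no floating point exception handling is needed inside the loop.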
The Shannon probability of a time series is the likelihood
that the value of the time series will increase in the
next time interval. The Shannon probability is measured
using the average, avg, and root mean square, rms, of the
normalized increments of the time series. Using both the
avg and the rms to compute the Shannon probability, P:
P = (avg / rms + 1) / 2 .......................(1.35)
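To compare the three estimates of P given by equations (1.1), (1.15) and (1.35), the following Python sketch computes avg and rms from a short series. It assumes that a normalized increment is (x[t] - x[t-1]) / x[t-1]. This fragment does not spell that definition out, and the function name is illustrative.

```python
import math


def shannon_estimates(series):
    """Return avg, rms and the three estimates of P for a price series."""
    # Assumed definition of the normalized increments.
    incs = [(b - a) / a for a, b in zip(series, series[1:])]
    avg = sum(incs) / len(incs)
    rms = math.sqrt(sum(v * v for v in incs) / len(incs))
    p_rms = (rms + 1.0) / 2.0                                    # (1.1)
    p_avg = (math.sqrt(avg) + 1.0) / 2.0 if avg >= 0.0 else None  # (1.15)
    p_both = (avg / rms + 1.0) / 2.0 if rms > 0.0 else None       # (1.35)
    return avg, rms, p_rms, p_avg, p_both


if __name__ == "__main__":
    # Six moves of +20% and four of -20%: avg = 0.04, rms = 0.2.
    series = [1.0]
    for up in (1, 1, 0, 1, 0, 1, 1, 0, 1, 0):
        series.append(series[-1] * (1.2 if up else 0.8))
    print(shannon_estimates(series))
```

For this series avg = rms^2, so all three estimates come out at 0.6. On measured data they generally differ.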
However, there is an error associated with both the
measurement of avg and rms due to the size of the data set,
N, (i.e., the number of records in the time series,) used
in the calculation of avg and rms. The confidence level, c,
is the likelihood that this error is less than some error
level, e.
Over the many time intervals represented in the time
series, the error will be greater than the error level, e,
(1 - c) * 100 percent of the time, requiring that the
Shannon probability, P, be reduced by a factor of c to
accommodate the measurement error:
P * ca * cr = ((avg - ea) / (rms + er) + 1) / 2
  ..............................................(1.36)
where the error level, ea, and the confidence level, ca,
are calculated using statistical estimates, for avg, and
the error level, er, and the confidence level, cr, are
calculated using statistical estimates for rms, and the
product P * ca * cr is the effective Shannon probability
that should be used in the calculation of optimal wagering
strategies, (which is the product of the Shannon
probability, P, times the superposition of the two
confidence levels, ca, and cr, i.e., P * ca * cr = Pc;
the assumption is made that the error in avg and the error
in rms are independent.)
The error, er, expressed in terms of the standard
deviation of the measurement error due to an insufficient
data set size, esigmar, is:
esigmar = (er / rms) * sqrt (2N) ..............(1.37)
where N is the data set size = number of records. From
this, the confidence level can be calculated from the
cumulative sum, (i.e., integration) of the normal
distribution, i.e.:
cr (%)   esigmar
------   -------
50       0.67
68.27    1.00
80       1.28
90       1.64
95       1.96
95.45    2.00
99       2.58
99.73    3.00
Note that the equation:
P * cr = (avg / (rms + er) + 1) / 2 ...........(1.38)
will require an iterated solution since the cumulative
normal distribution is transcendental. For convenience,
let F(esigmar) be the function that given esigmar, returns
cr, (i.e., performs the table operation, above,) then:
P * F(esigmar) = (avg / (rms + er) + 1) / 2
  = (avg / (rms + esigmar * rms / sqrt (2N)) + 1) / 2
  ..............................................(1.39)
Then:
((avg / rms + 1) / 2) * F(esigmar)
  = (avg / (rms + esigmar * rms / sqrt (2N)) + 1) / 2
  ..............................................(1.40)
or:
(avg / rms + 1) * F(esigmar)
  = avg / (rms + esigmar * rms / sqrt (2N)) + 1
  ..............................................(1.41)
Letting a decision variable, decision, be the iteration
error created by this equation not being balanced:
decision = avg / (rms + esigmar * rms / sqrt (2N)) + 1
  - (avg / rms + 1) * F(esigmar) ...............(1.42)
which can be iterated to find F(esigmar), which is the
confidence level, cr.
The error, ea, expressed in terms of the standard
deviation of the measurement error due to an insufficient
data set size, esigmaa, is:
esigmaa = (ea / rms) * sqrt (N) ...............(1.43)
where N is the data set size = number of records. From
this, the confidence level can be calculated from the
cumulative sum, (i.e., integration) of the normal
distribution, i.e.:
ca (%)   esigmaa
------   -------
50       0.67
68.27    1.00
80       1.28
90       1.64
95       1.96
95.45    2.00
99       2.58
99.73    3.00
Note that the equation:
P * ca = ((avg - ea) / rms + 1) / 2 ...........(1.44)
will require an iterated solution since the cumulative
normal distribution is transcendental. For convenience,
let F(esigmaa) be the function that given esigmaa, returns
ca, (i.e., performs the table operation, above,) then:
P * F(esigmaa) = ((avg - ea) / rms + 1) / 2
  = ((avg - esigmaa * rms / sqrt (N)) / rms + 1) / 2
  ..............................................(1.45)
Then:
((avg / rms + 1) / 2) * F(esigmaa)
  = ((avg - esigmaa * rms / sqrt (N)) / rms + 1) / 2
  ..............................................(1.46)
or:
(avg / rms + 1) * F(esigmaa)
  = (avg - esigmaa * rms / sqrt (N)) / rms + 1
  ..............................................(1.47)
Letting a decision variable, decision, be the iteration
error created by this equation not being balanced:
decision = (avg - esigmaa * rms / sqrt (N)) / rms + 1
  - (avg / rms + 1) * F(esigmaa) ...............(1.48)
which can be iterated to find F(esigmaa), which is the
confidence level, ca.
Note that from the equation:
P * cr = (avg / (rms + er) + 1) / 2
and solving for rms + er, the effective value of rms
compensated for accuracy of measurement by statistical
estimation:
rms + er = avg / ((2 * P * cr) - 1) ...........(1.49)
and substituting the equation:
P = (avg / rms + 1) / 2
gives:
rms + er = avg / (((avg / rms + 1) * cr) - 1) .(1.50)
and defining the effective value of rms as rmseff:
rmseff = rms +/- er ...........................(1.51)
Note that from the equation:
P * ca = ((avg - ea) / rms + 1) / 2
and solving for avg - ea, the effective value of avg
compensated for accuracy of measurement by statistical
estimation:
avg - ea = ((2 * P * ca) - 1) * rms ...........(1.52)
and substituting the equation:
P = (avg / rms + 1) / 2
gives:
avg - ea = (((avg / rms + 1) * ca) - 1) * rms .(1.53)
and defining the effective value of avg as avgeff:
avgeff = avg - ea .............................(1.54)
As an example of this algorithm, if the Shannon
probability, P, is 0.51, corresponding to an rms of 0.02,
then the confidence level, c, would be 0.983847, or the
error level in avg, ea, would be 0.000306, and the error
level in rms, er, would be 0.001254, for a data set size,
N, of 20000.
Likewise, if P is 0.6, corresponding to an rms of 0.2 then
the confidence level, c, would be 0.947154, or the error
level in avg, ea, would be 0.010750, and the error level
in rms, er, would be 0.010644, for a data set size of 10.
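The two independent iterations, equation (1.42) for cr and equation (1.48) for ca, can be sketched in Python as two bisections whose results are multiplied. The sketch follows the equations as printed and assumes that F is the one-sided cumulative normal. With avg = 0.0004, rms = 0.02 and N = 20000 the avg side lands near the ea quoted above. No claim is made that the rms side reproduces the program's er, which may rest on implementation details not shown in this fragment. All names are illustrative.

```python
import math


def normal_cdf(x):
    """One-sided cumulative normal; an assumed stand-in for F."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bisect(decision, lo, hi, steps=200):
    """Root of a decreasing function with decision(lo) > 0 > decision(hi)."""
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        if decision(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def combined_confidence(avg, rms, n, limit=40.0):
    """Solve (1.42) and (1.48) separately; return ca, cr, ea, er and c."""
    if avg <= 0.0 or rms <= 0.0:
        raise ValueError("this sketch only handles avg > 0 and rms > 0")
    root_n, root_2n = math.sqrt(n), math.sqrt(2.0 * n)
    two_p = avg / rms + 1.0  # 2 * P from equation (1.35)

    def decision_r(x):  # equation (1.42)
        return avg / (rms + x * rms / root_2n) + 1.0 - two_p * normal_cdf(x)

    def decision_a(x):  # equation (1.48)
        return (avg - x * rms / root_n) / rms + 1.0 - two_p * normal_cdf(x)

    xr = bisect(decision_r, 0.0, limit)
    xa = bisect(decision_a, 0.0, limit)
    ca, cr = normal_cdf(xa), normal_cdf(xr)
    return {"ca": ca, "cr": cr, "c": ca * cr,
            "ea": xa * rms / root_n,    # equation (1.43) solved for ea
            "er": xr * rms / root_2n}   # equation (1.37) solved for er


if __name__ == "__main__":
    print(combined_confidence(0.0004, 0.02, 20000))
```

Multiplying ca and cr reflects the stated assumption that the errors in avg and rms are independent.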
As a final discussion to this section, consider the time
series for an equity. Suppose that the data set size is
finite, and avg and rms have both been measured, and have
been found to both be positive. The question that needs to
be resolved concerns the confidence, not only in these
measurements, but the actual process that produced the
time series. For example, suppose, although there was no
knowledge of the fact, that the time series was actually
produced by a Brownian motion fractal mechanism, with a
Shannon probability of exactly 0.5. We would expect a
"growth" phenomenon for extended time intervals [Sch91, pp.
152], in the time series, (in point of fact, we would
expect the cumulative distribution of the length of such
intervals to be proportional to erf (1 / sqrt (t)).) Note
that, inadvertently, such a time series would potentially
justify investment. What the methodology outlined in this
section does is to preclude such scenarios by effectively
lowering the Shannon probability to accommodate such
issues. In such scenarios, the lowered Shannon probability
will cause data sets with larger sizes to be "favored,"
unless the avg and rms of a smaller data set size are
"strong" enough in relation to the Shannon probabilities
of the other equities in the market. Note that if the data
set sizes of all equities in the market are small, none
will be favored, since they would all be lowered by the
same amount, (if they were all statistically similar.)
To reiterate, in the equation avg = rms * (2P - 1), the
Shannon probability, P, can be compensated by the size of
the data set, ie., Peff, and used in the equation avgeff =
rms * (2Peff - 1), where rms is the measured value of the
root mean square of the normalized increments, and avgeff
is the effective, or compensated value, of the average of
the normalized increments.
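As a small worked instance of avgeff = rms * (2Peff - 1), the Python sketch below takes Peff = P * c, as defined earlier in this section. The function name is illustrative only.

```python
def effective_average(rms, p, c):
    """avgeff = rms * (2 * Peff - 1), with Peff = P * c."""
    peff = p * c
    return rms * (2.0 * peff - 1.0)


if __name__ == "__main__":
    # With c = 1 there is no compensation and avgeff equals avg.
    print(effective_average(0.2, 0.6, 1.0))
    # A confidence level below 1 lowers the effective average.
    print(effective_average(0.2, 0.6, 0.95))
```

With rms = 0.2 and P = 0.6, a confidence level of 0.95 lowers the average used for wagering from 0.04 to 0.028.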
DATA SET DURATION CONSIDERATIONS
An additional accuracy issue, besides data set size, is
the time interval over which the data was obtained. There
is some possibility that the data set was taken during an
extended run length, either negative or positive, and the
Shannon probability will have to be compensated to
accommodate this measurement error. The chance that a run
length will exceed time, t, is:
erf (1 / sqrt (t)) ............................(1.55)
or the Shannon probability, P, will have to be compensated
by a factor of:
1 - erf (1 / sqrt (t)) ........................(1.56)
giving a compensated Shannon probability, Pcomp:
Pcomp = Peff * (1 - erf (1 / sqrt (t))) .......(1.57)
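Equation (1.57) can be evaluated directly with the error function from the Python standard library. In this sketch t is taken to be measured in time intervals of the series, which is an assumption, and the function name is illustrative.

```python
import math


def duration_compensated(peff, t):
    """Pcomp = Peff * (1 - erf (1 / sqrt (t))), equation (1.57)."""
    if t <= 0.0:
        raise ValueError("t must be positive")
    return peff * (1.0 - math.erf(1.0 / math.sqrt(t)))


if __name__ == "__main__":
    for t in (10.0, 100.0, 1000.0, 10000.0):
        print(t, duration_compensated(0.6, t))
```

The compensation factor approaches 1 as t grows, so a long data set duration leaves Peff nearly unchanged while a short one is penalized heavily.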
Fortunately, since confidence levels are calculated from
the normal probability function, the same lookup table
used for confidence calculations (i.e., the cumulative of a
normal distribution,) can be used to calculate the
associated error function.
To use the value of the normal probability function to
calculate the error function, erf (N), (where N is here
the argument of the error function, not the data set
size,) proceed as follows; since erf (X / sqrt (2))
represents the error function associated with the normal
curve:
A) X = N * sqrt (2).
B) Look up the value of X in the normal probability
function.
C) Subtract 0.5 from this value.
D) And, multiply by 2.
or:
erf (N) = 2 * (normal (N * sqrt (2)) - 0.5) ...(1.58)
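Steps A through D can be sketched in Python with a lookup table standing in for the normal probability function. The table here is built by trapezoidal integration of the density in equation (1.14). Its step size, its range and the linear interpolation are assumptions of this sketch, not details of the program's table.

```python
import math

STEP = 0.001   # assumed table resolution
LIMIT = 6.0    # assumed upper end of the table, in standard deviations


def build_normal_table():
    """Cumulative normal on [0, LIMIT] by trapezoidal integration."""
    def density(x):
        return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

    table = [0.5]  # half of the area lies below zero
    for i in range(1, int(round(LIMIT / STEP)) + 1):
        a, b = (i - 1) * STEP, i * STEP
        table.append(table[-1] + 0.5 * STEP * (density(a) + density(b)))
    return table


TABLE = build_normal_table()


def normal(x):
    """Table lookup for x >= 0 with linear interpolation; clamps at LIMIT."""
    pos = min(x, LIMIT) / STEP
    i = min(int(pos), len(TABLE) - 2)
    frac = pos - i
    return TABLE[i] + frac * (TABLE[i + 1] - TABLE[i])


def erf_from_normal(n):
    """erf (N) for N >= 0 via steps A through D."""
    x = n * math.sqrt(2.0)       # step A
    value = normal(x)            # step B
    return 2.0 * (value - 0.5)   # steps C and D


if __name__ == "__main__":
    for n in (0.1, 0.5, 1.0):
        print(n, erf_from_normal(n), math.erf(n))
```

The library function math.erf is used only as a cross-check on the table-based result.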