# An approximation for data set size risk

From: John Conover <john@email.johncon.com>
Subject: An approximation for data set size risk
Date: 9 Jun 2003 03:16:21 -0000

```

Let avg be the average of the marginal increments of a time series,
and rms be their root mean square, (the deviation.) Then, from
http://www.johncon.com/ntropix/,

P = (avg / rms + 1) / 2

which is the likelihood of an up movement in the time series, and:

G = (1 + rms)^P * (1 - rms)^(1 - P)

where G is the gain in the time series.

Note that rms - avg is the risk. (Also, note that the avg is not the
average gain! G is.)

Sanity check. Suppose we use the above formulas on a savings
account.  Then, avg = rms, and P = 1. G = 1 + rms = 1 +
avg. Check, where avg = rms = the interest rate.

When doing metrics on a time series, the measurement of rms converges
very quickly; avg very slowly. In point of fact, we will probably
never have a good idea what avg is for equities. For example, for the
daily closing median values of all equities on the US exchanges in the
Twentieth Century, avg = 0.0004 and rms = 0.02:

tsshannoneffective 0.0004 0.02 12000
For P = (sqrt (avg) + 1) / 2:
P = 0.510000
Peff = 0.501674
For P = (rms + 1) / 2:
P = 0.510000
Peff = 0.509410
For P = (avg / rms + 1) / 2:
P = 0.510000
Peff = 0.499942

which means that to even determine that the typical stock had an even
likelihood of increasing on any day would require about 12,000 / 253 =
47 calendar years of daily trading data!

What this means is that there is a significant probability that, by
serendipity, the metrics were taken when the stock is in a "bubble",
and we would be misled.

Sanity check. What are the chances of a stock's value being in a
"bubble" for at least 12,000 days? It's erf (1 / sqrt (12000)),
which is about 1 / sqrt (t) for t >> 1, or about
0.00912870929, and 1 - 1 / sqrt (12000) = 0.990871291. Or we
have a risk, (note that term again,) of about 1% on a data set
size of 12,000 days. Note that 0.990871291 * 0.51 = 0.505344358,
which is a very close approximation, within about 1%, of the results
given by the tsshannoneffective program, (which does things quite
formally; it's the same code that is in tsinvest.) So, it checks.

But we know that the rms measurement settles quite quickly, and we can
add the risk of the data set size being too small into the risk of the
investment. Letting P' be P compensated for data set size:

P' = P * (1 - erf (1 / sqrt (t)))

which, as above, can be approximated by:

P' = P * (1 - 1 / sqrt (t))

where P' is the investment risk AND the risk due to limited data set
size, combined, and requires only avg and rms, measured over t many
time intervals, where t >> 1; it can be used directly in the equation,
above, for G.

You can almost work it out in your head.

John

BTW, how many days does one have to measure a "typical" stock's
performance to have a reasonable idea that it is capable of sustained
growth?

0.5 = 0.51 * (1 - (1 / sqrt (t)))

or t > 2601 trading days, or about 10 years of daily data. Using a
data set size smaller than that, one will lose as much as one makes,
(although you can make a lot before you lose it.)

Now, consider a portfolio of ten of the same stocks, with equal
investments maintained in each.

P = (sqrt (10) * (avg / rms) + 1) / 2 = 0.531622776

and:

0.5 = 0.531622776 * (1 - (1 / sqrt (t)))

or t > 283 trading days, or a little more than a calendar year.

You get the drift.

--

John Conover, john@email.johncon.com, http://www.johncon.com/

```
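The arithmetic in the post can be reproduced with a few lines of Python. This is only a minimal sketch: the function names are hypothetical and are not taken from tsinvest or tsshannoneffective. It uses the 1 / sqrt (t) approximation from the post rather than erf, and the `n_stocks` parameter assumes that the sqrt (10) factor in the ten-stock example generalizes to sqrt (n) for n equally weighted stocks.

```python
import math


def up_probability(avg, rms, n_stocks=1):
    """P = (sqrt(n) * avg / rms + 1) / 2, the likelihood of an up movement.

    n_stocks > 1 is an assumed generalization of the post's ten-stock example.
    """
    return (math.sqrt(n_stocks) * avg / rms + 1.0) / 2.0


def gain(p, rms):
    """G = (1 + rms)^P * (1 - rms)^(1 - P), the gain per time interval."""
    return (1.0 + rms) ** p * (1.0 - rms) ** (1.0 - p)


def compensated_probability(p, t):
    """P' = P * (1 - 1 / sqrt(t)), P compensated for a data set of t intervals."""
    return p * (1.0 - 1.0 / math.sqrt(t))


def min_days(p):
    """Solve 0.5 = P * (1 - 1 / sqrt(t)) for t, giving t = (P / (P - 0.5))^2."""
    if p <= 0.5:
        raise ValueError("P must exceed 0.5 for P' to ever reach 0.5")
    return (p / (p - 0.5)) ** 2


if __name__ == "__main__":
    avg, rms = 0.0004, 0.02  # figures quoted in the post

    p = up_probability(avg, rms)
    print("P  =", p)
    print("P' =", compensated_probability(p, 12000))
    print("G  =", gain(compensated_probability(p, 12000), rms))
    print("days needed, one stock  :", min_days(p))

    p10 = up_probability(avg, rms, n_stocks=10)
    print("days needed, ten stocks :", min_days(p10))
```

The closed form in `min_days` is just the post's equation, 0.5 = P * (1 - (1 / sqrt (t))), rearranged for t, so it should land on the 2601 and 283 day figures given above to within rounding.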

Copyright © 2003 John Conover, john@email.johncon.com. All Rights Reserved.
Last modified: Sun Jun 8 20:53:59 PDT 2003 \$Id: 030608201634.4528.html,v 1.0 2003/06/09 04:08:30 conover Exp \$