# Data Set Sizes

From: John Conover <john@email.johncon.com>
Subject: Data Set Sizes
Date: Mon, 8 Aug 2005 21:06:15 -0700

```

I'm getting a lot of questions on data set sizes required for equity
investments. After the US dot-bomb "bubble" burst in 2000, one would
think that we would have all learned not to buy into "bubbles," with
decisions based on small data set sizes taken inside the "bubble."
But the way things are shaping up, 2009 in the Chinese technology
markets will be the sequel to the US markets of 2000.

The way tsinvest calculates the uncertainty, or risk, due to inadequate
data set sizes is quite complicated, (it's controlled with the -c and
-C command line options,) but you can approximate it.

Consider a political poll: we want to know how much of the population
will vote for candidate X; we wish to determine that by sampling the
population, so there will be a small error due to the sampling. For
that we would use a "Statistical Estimate" to determine the "margin of
error," which is twice the standard deviation, and is 1 / sqrt (n),
where n is the number of opinions sampled. The probability of falling
within two standard deviations, (double sided,) is
0.954499736103641736, and it is also called the "95% confidence
level". (All of this comes from Bernoulli P-Trials, in case you want
to pursue the subject further.)

Suppose the sample size, n, is 1000, (a typical number in political
polls.) Then the margin of error would be 1 / sqrt (n) = 0.0316, or
3%.

What that means is that if we repeat like polls, (i.e., of a thousand
samples, each,) a hundred times, about 5 of the times the real value
would differ from the sampled value by more than +/- 3%. That's where
the "Candidate X is favored by 60% over Candidate Y, with a margin of
error of 3%" comes from on the 6 o'clock talking head shows.

The same thing can be used in equity selection, too.

The probability of an up movement, P, in an equity's value is:

P = ((avg / rms) + 1) / 2

(See:
http://www.johncon.com/john/correspondence/020213233852.26478.html
for particulars,) where avg is the average, and, rms is the
root-mean-square of the marginal increments of the equity's price.

The margin of error in measuring rms is rms / sqrt (n), for n many samples in
the time series, and so is the error in avg; i.e., the rms converges to
a fair accuracy very quickly, and the error in avg does not. Most of
the error in measuring P is in the measurement of avg, since
avg << rms, and they both have the same magnitude of error.

So, as a fairly accurate approximation, instead of using the measured
avg, we reduce avg by an amount rms / sqrt (n), to calculate an
"effective P," P', and then use that in the equation:

G = ((1 + rms)^P') * ((1 - rms)^(1 - P'))

to select which equities have the best proforma. Note that:

P' = (((avg / rms) - (1 / sqrt (n))) + 1) / 2

so really, all that happened was to subtract 1 / sqrt (n) from avg /
rms.

But what does that have to do with "bubbles" in equity prices?

Looking at it from a different angle, the chance of a zero-free
interval of at least n many time units in a Brownian motion fractal is
erf (1 / sqrt (n)), which for n >> 1 is approximately 1 / sqrt (n); in
other words, the chance, (or probability,) of a "bubble" lasting
through an entire measurement of n many samples, (so that every
sample is taken inside the "bubble,") is about 1 / sqrt (n), which,
(since it's an uncertainty,) could be subtracted from P to obtain P',
as above. (The previous calculation ignored data size effects in rms,
so its error term is 1 / 2 the error term of the latter.)

Where does the error term come from? It's the error in the market's
assessment of a fair value for an equity. A "bubble" is the assessment
process, i.e., the process of determining a fair market value.

An example: a typical equity's avg is 0.0004, and rms is 0.02,
measured on daily closes for the US equity markets. How many days,
minimum, must be included in the time series for an analysis?

Obviously, avg - rms / sqrt (n) > 0 for P' > 0.5, so rms / sqrt (n) <
avg, or sqrt (n) > 0.02 / 0.0004 = 50, or n > 50^2 = 2500, or about 10
years of 253 trading days per year, (half that if you prefer the first
method.) This would ensure that one was not buying into a "bubble."

However, note that for better performing equities that have larger avg
values, the data set size requirements are much smaller, (the one used
in the example, above, only gains about 5-10% a year, which is typical
for all equities in the US markets.)

John

--

John Conover, john@email.johncon.com, http://www.johncon.com/

```
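The arithmetic in the message can be checked with a short Python sketch. It is only an illustration of the formulas quoted above, not tsinvest's actual calculation (which the message describes as considerably more complicated). The function names are hypothetical, and compounding the daily G over 253 trading days to get a yearly figure is an assumption made here to compare against the "5-10% gain a year" remark.

```python
import math


def margin_of_error(n):
    """Approximate 95% margin of error for n samples: 1 / sqrt(n)."""
    return 1.0 / math.sqrt(n)


def up_probability(avg, rms):
    """P = ((avg / rms) + 1) / 2."""
    return ((avg / rms) + 1.0) / 2.0


def effective_probability(avg, rms, n):
    """P' = (((avg / rms) - (1 / sqrt(n))) + 1) / 2."""
    return (((avg / rms) - margin_of_error(n)) + 1.0) / 2.0


def gain(rms, p):
    """G = ((1 + rms)^p) * ((1 - rms)^(1 - p))."""
    return ((1.0 + rms) ** p) * ((1.0 - rms) ** (1.0 - p))


def minimum_samples(avg, rms):
    """Data set size at which rms / sqrt(n) equals avg: n = (rms / avg)^2."""
    return (rms / avg) ** 2


# Example values from the message: daily closes, US equity markets.
avg, rms = 0.0004, 0.02

print("poll margin of error, n = 1000:", margin_of_error(1000))
print("P:", up_probability(avg, rms))
print("minimum n:", minimum_samples(avg, rms))
print("P' at n = 10000:", effective_probability(avg, rms, 10000))

# Assumption: a year is 253 trading days and the daily G compounds.
print("yearly gain factor:", gain(rms, up_probability(avg, rms)) ** 253)
```

At the minimum data set size, P' is 0.5 and G drops below 1, so an equity only becomes a candidate once n is larger than (rms / avg)^2; a larger avg shrinks that requirement quadratically.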