Saturday, September 11, 2010

Forest Gump and the Normal Distribution

"I may not be a smart man, but I know what love is." Forrest is a fictional character and was deemed by the experts as having a well below average IQ; however as far as fiction goes, I think he was a lot more intelligent than Jenny. Maybe he listened to his Mama. Psychologist Alfred Binet, said:


"It seems to us that in intelligence there is a fundamental faculty, the alteration or the lack of which, is of the utmost importance for practical life. This faculty is judgment, otherwise called good sense, practical sense, initiative, the faculty of adapting one's self to circumstances. A person may be a moron or an imbecile if he is lacking in judgment; but with good judgment he can never be either. Indeed the rest of the intellectual faculties seem of little importance in comparison with judgment" (Binet & Simon, 1916, 1973, pp.42-43).



So, what does Forrest have to do with the normal distribution? IQ scores are said to follow a bell-shaped curve, or normal distribution. Technically, the normal distribution does not really work, because there is a chance that a test score could be negative, unless you truncate (fix the distribution so as not to allow negative scores). However, the normal distribution works well in describing the distribution of standardized test scores, for example IQ, SAT, etc.



A Little History


The Stanford Binet Scale is widely known for measuring intelligence levels of individuals between the ages of 2 to 85+ years. This test was the result of the revision by Lewis Terman in 1916 to the Binet-Simon Scale. Since 1916 there have been many revisions to the test. It was the first published intelligence test to supply detailed administration and scoring procedures was the first American test to use the idea of the intelligence quotient (IQ). (Laurent, Swerdlik, & Ryburn, 1992.) Currently, it is in its Fifth Edition. The Stanford-Binet metric is based on an average or mean score of 100.

Importance of the Normal Distribution

The term "normal" possibly arose because of the various attempts made to establish this distribution as the underlying rule governing all continuous variables. These attempts were based on false premises and consequently failed. Nonetheless, the normal distribution rightly occupies a preeminent place in the field of probability. In addition to portraying the distribution of many types of natural and physical phenomena (such as the heights of men, diameters of machined parts, etc.), it also serves as a convenient approximation of many other distributions which are less tractable. Most importantly, it describes the manner in which certain estimators of population characteristics vary from sample to sample and, thereby, serves as the foundation upon which much statistical inference from a random sample to population are made.

About 68% of values drawn from a normal distribution are within one standard deviation σ > 0 away from the mean μ; about 95% of the values are within two standard deviations and about 99.7% lie within three standard deviations. This is known as the 68-95-99.7 rule, or the empirical rule, or the 3-sigma rule.





IQ Scores and the Normal Distribution


IQ Scores are said to follow a normal distribution with a mean of 100 and a standard deviation of 10 (approximately). Thus, if you have a IQ score between 90 and 110, you are within 0ne standard deviation of the mean. (If you have taken any standardized test, you have probably seen or heard these terms before.) About 15% of the population have scores greater than 110 or less that 90. Approximately 2% of the population have IQ scores above 120 or below 80. Less than 1% have scores greater than 130 or less than 70.


Now, I am not saying that I agree with IQ scores. If we take fictional Forrest as a representative of some actual person, and apply Alfred Binet's definition of intelligence, IQ scores may not tell the whole story about intelligence.

Incidentally, What was Forrest Gump's IQ score?

Thursday, September 9, 2010

Random Number Generators

Random numbers are used in things like the lottery and encryption, but this article is about their use in simulation.

Random number are critical building block for introducing variation is simulation experiments. They are used for providing variation between simulation runs, and within each run of a simulation, by providing the basis for producing random variates (samples values or outcomes from particular probability distributions).

Random numbers in simulation provide real, uniformly generated numbers between 0 and 1, sometimes written as Uniform ~ (0, 1). A real Uniform ~ (0, 1) probability distribution would appear as a solid rectangle from 0 to 1 with a height of 1, graphically. The distribution would have a mean of 0.5 and a variance of approximately 0.0833.

Classical uniform random number generators have some major defects, such as, short period length and lack of higher dimension uniformity. However, nowadays there are a class of rather complex generators which is as efficient as the classical generators while enjoy the property of a much longer period and of a higher dimension uniformity.

Computer programs that generate "random" numbers use an algorithm. That means if you know the algorithm and the seed-values you can predict what numbers will result. Because you can predict the numbers they are not truly random - they are pseudo-random. For statistical purposes "good" pseudo-random numbers generators are good enough.

Below, I have evaluated four Real Uniform (0, 1) random number generators using the correlation test and frequency test, as well as calculating estimates for their means and variances. The four random number generators are the Additive Congruence Generator, the Linear Congruential Generator, the Excel 2007 (based on Algorithm AS 183, Appl. Statist (1982) vol. 31, no. 2), and the ExtendSim 8 (a trademark of Imagine That Inc.) random number generator. The results of the tests for all four generators are generally, "good enough" for most discrete event simulation needs ("most" is relative and subject to debate). The ExtendSim 8 random number generator performs particularly well for more complex systems simulation. I will caveat this with the fact that I did not examine cycles or conduct a runs test. Furthermore, I did not perform more rigorous goodness of fit test, such as the Anderson-Darling or Kolmogorov-Smirnov tests.

Additive Congruence Generator

For n := 5 and k := 99991

Statistics

The mean and variance for 1000 random numbers generated by this formula as very close to the "actual" mean and variance of the Uniform distribution. Again, we would have to perform more rigorous tests to determine if these estimates are "good enough," from the perspective of statistical significance.

Mean = 0.483182

Variance = 0.087944

Correlation

Correlation indicates the degree to which the data are linearly related. A correlation of -1 indicates the there is a negative linear relationship, or that the data points approximately fit a line that slopes downward from left to right. A correlation of 1 indicates that the data approximately fits a line sloping upward from left to right. A correlation near zero indicates that the data is not linearly related. Graphically, that would look like the figure below.


w i

w i+1

w i

1


w i+1

0.083997

1

Frequency

The frequency distribution shows how many data point fall within specified ranges called bin. I have selected bins from 0 to 1 with an interval of 0.1 for each bin. The data consisted of 1000 pseudo-random numbers from a uniform distribution. If the data is uniformly distributed between 0 and 1 the histogram (bar-chart_ would look like a rectangle. The graph below is approximately the shape of a rectangle and is a good indication that the data is uniformly distributed. A more rigorous statistical test (a hypothesis test for a goodness of fit) would be required to determine that the data is not uniformly distributed between 0 and 1.



Linear Congruential Generator
Where a, c, and m determine the statistical quality of the generator

Good parameter choices:
a = 16807 (IBM), or 630360016 (Simscript)
c = 0
m = 231
-1=2147483647
x0 = 123457

Statistics

Mean = 0.490559

Variance = 0.081979

Correlation


w i

w i+1

w i

1


w i+1

0.131844

1

Frequency

Random Function in Excel

N = 1000

Statistics

Mean = 0.501519

Variance = 0.080906


Correlation


w i

w i+1

w i

1


w i+1

-0.01739

1

Frequency

See http://support.microsoft.com/kb/828795

real function random()
c
c Algorithm AS 183 Appl. Statist. (1982) vol.31, no.2, FORTRAN Code c
c Returns a pseudo-random numbers with rectangular distribution.
c between 0 and 1. The cycle length is 6.95E+12 (See page 123
c of Applied Statistics (1984) vol.33), not as claimed in the
c original article.
c
c IX, IY and IZ should be set to integer values between 1 and
c 30000 before the first entry.
c
c Integer arithmetic up to 30323 is required.
c
integer ix, iy, iz
common /randc/ ix, iy, iz
c
ix = 171 * mod(ix, 177) - 2 * (ix / 177)
iy = 172 * mod(iy, 176) - 35 * (iy / 176)
iz = 170 * mod(iz, 178) - 63 * (iz / 178)
c
if (ix .lt. 0) ix = ix + 30269
if (iy .lt. 0) iy = iy + 30307
if (iz .lt. 0) iz = iz + 30323
c
c If integer arithmetic up to 5212632 is available, the preceding
c 6 statements may be replaced by:
c
c ix = mod(171 * ix, 30269)
c iy = mod(172 * iy, 30307)
c iz = mod(170 * iz, 30323)
c
random = mod(float(ix) / 30269. + float(iy) / 30307. + float(iz) / 30323., 1.0)
return
end
c
c

Random Numbers in ExtendSim 8

N = 1000

Statistics

Mean = 0.497612

Variance = 0.08407

Correlation


w i

w i+1

w i

1


w i+1

-0.02746

1

Frequency

ExtendSim 8 Distribution Plot Example (Real Uniform)

References & Further Readings:
Aiello W., S. Rajagopalan, and R. Venkatesan, Design of practical and provably good random number generators, Journal of Algorithms, 29, 358-389, 1998.
Dagpunar J., Principles of Random Variate Generation, Clarendon, 1988.
Fishman G., Monte Carlo, Springer, 1996.
James, Fortran version of L'Ecuyer generator, Comput. Phys. Comm., 60, 329-344, 1990.
Knuth D., The Art of Computer Programming, Vol. 2, Addison-Wesley, 1998.
L'Ecuyer P., Efficient and portable combined random number generators, Comm. ACM, 31, 742-749, 774, 1988.
L'Ecuyer P., Uniform random number generation, Ann. Op. Res., 53, 77-120, 1994.
L'Ecuyer P., Random number generation. In Handbook on Simulation, J. Banks (ed.), Wiley, 1998.
Maurer U., A universal statistical test for random bit generators, J. Cryptology, 5, 89-105, 1992.
Sobol' I., and Y. Levitan, A pseudo-random number generator for personal computers, Computers & Mathematics with Applications, 37(4), 33-40, 1999.
Tsang W-W., A decision tree algorithm for squaring the histogram in random number generation, Ars Combinatoria, 23A, 291-301, 1987