IE 642 Simulation of Manufacturing Systems


The Kolmogorov-Smirnov Test

The Kolmogorov-Smirnov test is designed to test the hypothesis that a given data set could have been drawn from a given distribution. Unlike the chi-square test, it is primarily intended for use with continuous distributions and is independent of arbitrary computational choices such as bin width.

Suppose that we had collected four data points and sorted them into increasing order to get the data set {1.2, 3.1, 5.1, 6.7}. From the pattern of our data alone, we might guess that, if we continued to collect data from this process, 0-25% of our observations would be less than or equal to 1.2, 25-50% would be less than or equal to 3.1, etc.

Perhaps we would like to compare this empirical pattern to the pattern we would expect to observe if the data points were drawn from a given theoretical distribution; say, an exponential distribution with a mean of 5 (i.e., λ = 1/5 = 0.2). If data points were drawn from this exponential distribution, what fraction would we expect to see below 1.2? Below 3.1? These figures can be computed from the cumulative distribution function for the exponential distribution, F(x) = 1 - e^{-λx}:

F(1.2) = 1 - e^{-(0.2)(1.2)} = 0.21

F(3.1) = 1 - e^{-(0.2)(3.1)} = 0.46

F(5.1) = 1 - e^{-(0.2)(5.1)} = 0.64

F(6.7) = 1 - e^{-(0.2)(6.7)} = 0.74
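The four values above can be checked with a few lines of Python. This is only a minimal sketch; the function name exponential_cdf and the variable names are illustrative choices, not part of the course material.

```python
import math

def exponential_cdf(x, rate):
    """Exponential CDF: F(x) = 1 - exp(-rate * x) for x >= 0, and 0 otherwise."""
    if x < 0:
        return 0.0
    return 1.0 - math.exp(-rate * x)

data = [1.2, 3.1, 5.1, 6.7]
rate = 1 / 5  # lambda = 0.2, i.e., a mean of 5

# Round to two decimals, as in the hand calculation.
cdf_values = [round(exponential_cdf(x, rate), 2) for x in data]
print(cdf_values)  # [0.21, 0.46, 0.64, 0.74]
```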

 

We compare this to our empirical pattern in Figure 1. The first three rows contain the data points and our lowest and highest estimates of the fraction of the data that would fall below each point. The fourth row contains the result of plugging the data points into the theoretical distribution under consideration (in this case, the exponential distribution with a mean of 5). These values are the theoretical estimate of what fraction should fall below each data point. The fifth row is obtained by comparing the fourth row to the second and third rows. Is 0.21 near 0? Near 0.25? We compute the absolute deviation from each and take the larger of the two. For example, in the first column, we get

|0 - 0.21| = 0.21

|0.25 - 0.21| = 0.04

so the larger deviation is 0.21. This gives an idea of how far our empirical pattern is from our theoretical pattern at this data point.

FIGURE 1: Computing D for the Kolmogorov-Smirnov test.

Row 1: Data point xi                                                1.2    3.1    5.1    6.7
Row 2: Empirical fraction falling below data point (low estimate)   0      0.25   0.50   0.75
Row 3: Empirical fraction falling below data point (high estimate)  0.25   0.50   0.75   1.0
Row 4: F(xi)                                                        0.21   0.46   0.64   0.74
Row 5: Largest deviation                                            0.21   0.21   0.14   0.26
Row 6: Overall largest deviation (D)                                                     0.26

Next, we look over the fifth row to find the largest overall deviation (D). The largest error, 0.26, is the value of our test statistic. Is this measure of "error" large or small for this situation? To make this judgment, we compare our computed value of this test statistic to a critical value from the table in Appendix A. Setting α = 0.1 and noting that our sample size is n = 4, we get a critical value of D_{4,0.1} = 0.565. Since our test statistic, D = 0.26, is less than 0.565, we do not reject the hypothesis that our data set was drawn from the exponential distribution with a mean of 5.
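As a rough check on the arithmetic, the rows of Figure 1 and the comparison against the critical value can be reproduced in Python. This is a sketch under the example's assumptions (exponential distribution with a mean of 5, critical value 0.565 taken from the text); the variable names are illustrative. The code works with unrounded values of F and rounds only for display, which gives the same two-decimal results here.

```python
import math

data = sorted([1.2, 3.1, 5.1, 6.7])  # Row 1: data in increasing order
n = len(data)
rate = 0.2                            # lambda = 1/5 (not estimated from the data)

low = [i / n for i in range(n)]                      # Row 2: (i-1)/n for i = 1..n
high = [(i + 1) / n for i in range(n)]               # Row 3: i/n for i = 1..n
theory = [1.0 - math.exp(-rate * x) for x in data]   # Row 4: F(xi)

# Row 5: the larger of the two absolute deviations at each data point.
deviation = [max(abs(f - lo), abs(hi - f))
             for lo, hi, f in zip(low, high, theory)]

d_stat = max(deviation)               # Row 6: test statistic D

critical_value = 0.565                # quoted in the text for n = 4, alpha = 0.1
reject = d_stat > critical_value

print([round(d, 2) for d in deviation])  # [0.21, 0.21, 0.14, 0.26]
print(round(d_stat, 2), reject)          # 0.26 False
```

Passing a different list for data, or a different expression for theory, adapts the sketch to other samples and hypothesized distributions, provided a matching critical value is looked up.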

In general, we use the Kolmogorov-Smirnov test to compare a data set to a given theoretical distribution by filling in a table as follows:

• Row 1: Data set sorted into increasing order and denoted as xi, where i=1,...,n.

• Row 2: Smallest empirical estimate of fraction of points falling below xi, computed as (i-1)/n for i=1,...,n (e.g., if n=4, this row contains 0, 0.25, 0.50, and 0.75).

• Row 3: Largest empirical estimate of fraction of points falling below xi and computed as i/n for i=1,...,n (e.g., if n=4, this row contains 0.25, 0.50, 0.75, and 1.0).

• Row 4: Theoretical estimate of fraction of points falling below xi and computed as F(xi), where F(x) is the theoretical distribution function being tested.

• Row 5: Absolute value of difference between row 2 and row 4 or between row 3 and row 4, whichever is larger. This is a measure of "error" for this data point.

• Row 6: The largest "error" from row 5, which gives the test statistic D.

Once this table has been completed, the test statistic D can be compared to the critical value from a statistical table. If the test statistic is larger than the critical value, then we reject the hypothesis that the data set was drawn from the given theoretical distribution; otherwise we do not reject the hypothesis.*
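The table-filling procedure can also be summarized in one expression. The following is a compact restatement of rows 2 through 6, assuming the data points x_1 ≤ ... ≤ x_n are already sorted; D_{n,α} is used here as shorthand for the tabulated critical value at sample size n and significance level α.

```latex
D = \max_{1 \le i \le n} \; \max\left\{ \left| F(x_i) - \frac{i-1}{n} \right|,\; \left| \frac{i}{n} - F(x_i) \right| \right\},
\qquad \text{reject the hypothesis if } D > D_{n,\alpha}.
```

The inner maximum corresponds to row 5 (the larger deviation at a single data point), and the outer maximum corresponds to row 6.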

In the preceding example, the parameter of the theoretical distribution (i.e., λ = 1/5) was not estimated from the data set. In most cases, the critical values in the Kolmogorov-Smirnov table are only valid for testing distributions whose parameters have not been estimated from the data. Had we, for example, used a maximum-likelihood estimate formula to compute λ before testing the fit of the distribution, we could not have used the Kolmogorov-Smirnov test with these tabulated critical values. Modified versions of the Kolmogorov-Smirnov test have been developed for testing the fit of a few theoretical distributions in the case where parameter values have been estimated from the data. While we will not cover these modified tests, the ideas behind them are similar to those we have discussed, and many popular statistical software packages perform them.
