Performance Validation: Distinguishing Small Samples

Again with the question...

Suppose that we have two small samples of performance measurements, perhaps as follows:

Measurement #   Sample 1   Sample 2
1               84.5       90.9
2               91.0       88.2
3               87.6       89.7
4               89.5       89.9
5               72.4       90.4

Each sample's five measurements are in transactions per second, and the two samples were taken using two slightly different software implementations running on identical hardware systems. What can we conclude from these samples? Did the difference in software implementation result in a significant change in performance?

One approach would be to enlist normal statistics to test whether there is a significant difference between the two samples. This would ordinarily be a job for Student's t test or Welch's t test, but our sample sizes are too small for these tests to be very trustworthy.
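
For reference, the sketch below shows what a conventional Welch's t test might look like in Python, assuming SciPy is available; but again, with only five measurements per sample, the resulting p-value should be viewed with considerable skepticism.

# Illustrative sketch only: a conventional Welch's t test on these samples.
# With only five measurements per sample, treat the p-value with skepticism.
from scipy import stats

sample1 = [84.5, 91.0, 87.6, 89.5, 72.4]
sample2 = [90.9, 88.2, 89.7, 89.9, 90.4]

# equal_var=False selects Welch's t test rather than Student's t test.
t_statistic, p_value = stats.ttest_ind(sample1, sample2, equal_var=False)
print(t_statistic, p_value)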

Here are some things we might do in this situation:

  1. Gather more data, so that we have at least thirty data points in each sample. This is normally your first and best option, and is by far the best approach in almost every situation. Unfortunately, the above data was collected from large and expensive database servers, and each run consumed considerable time, so collecting that much data was not an option in this case.
  2. Wring what information you can out of the data sets using the usual statistical methods. This is covered in a later section.
  3. Analyze quantiles, which, with this small of a sample, degenerates into analysis of medians. This is also covered in a later section.

The Usual Statistical Methods

To carry out the usual statistical methods, let's augment the above table with analysis based on means and standard deviations.

Measurement #        Sample 1  vs. Sample 2  Sample 2  vs. Sample 1
1                      84.5       -5.22        90.9        0.79
2                      91.0        1.16        88.2        0.43
3                      87.6       -2.18        89.7        0.63
4                      89.5       -0.31        89.9        0.66
5                      72.4      -17.11        90.4        0.72
Mean                   85.0                    89.8
Standard Deviation      7.45                    1.02

Here we have added the means and standard deviations in the last two rows, along with a pair of columns comparing each measurement to the other sample's mean and standard deviation. We can be quite confident that measurements #1 and #5 from Sample 1 are from a different population than Sample 2, especially measurement #5, which lies more than seventeen of Sample 2's standard deviations from Sample 2's mean.
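
The comparison columns can be reproduced with a few lines of code. The following Python sketch (the variable names are simply illustrative) computes each sample's mean and sample standard deviation, then expresses each measurement as a number of the other sample's standard deviations away from the other sample's mean:

import statistics

sample1 = [84.5, 91.0, 87.6, 89.5, 72.4]
sample2 = [90.9, 88.2, 89.7, 89.9, 90.4]

# statistics.stdev() uses the n-1 divisor, matching the table above.
mean1, stdev1 = statistics.mean(sample1), statistics.stdev(sample1)
mean2, stdev2 = statistics.mean(sample2), statistics.stdev(sample2)

# Express each measurement as a number of the *other* sample's
# standard deviations away from the *other* sample's mean.
vs_sample2 = [(x - mean2) / stdev2 for x in sample1]
vs_sample1 = [(x - mean1) / stdev1 for x in sample2]

print(mean1, stdev1, vs_sample2)
print(mean2, stdev2, vs_sample1)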

The analysis in the other direction is not so straightforward, given that all of Sample 2's measurements are quite close to Sample 1's mean. However, Sample 2's measurements are even closer to each other, all falling between 0.43 and 0.79 of Sample 1's standard deviations above Sample 1's mean. Perhaps paradoxically, this tight clustering is what enables us to make headway. It turns out that the probability of a measurement falling within a given range can be expressed in terms of the so-called error function, which is closely related to the cumulative distribution function of the normal distribution.

The definition of the error function is a bit involved, but if you install maxima, you can simply type:

bfloat(0.5*(1+erf(0.79/sqrt(2)))-0.5*(1+erf(0.43/sqrt(2))));

which yields approximately 0.119, or about 11.9%: the probability that any single measurement falls within this range. However, all five of Sample 2's measurements fall within this range, so we raise 0.119 to the fifth power to obtain 2.37e-5, or about 0.00237%.
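
If Maxima is not handy, the same computation can be expressed in a few lines of Python using the standard library's math.erf(), as in this sketch:

from math import erf, sqrt

def normal_cdf(x):
    # Cumulative distribution function of the standard normal
    # distribution, expressed in terms of the error function.
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# Probability that a single measurement lands between 0.43 and 0.79
# standard deviations above the mean.
p_range = normal_cdf(0.79) - normal_cdf(0.43)

# Probability that all five of Sample 2's measurements do so.
p_all_five = p_range ** 5

print(p_range, p_all_five)  # roughly 0.119 and 2.37e-5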

These two results give high confidence that the two samples really are different. That said, we are not allowing for the effect of measurement errors on the sample means and standard deviations. With the very small sample sizes we are using, each additional measurement has a substantial effect on the means and standard deviations, which casts doubt on our results and is exactly why you should use larger sample sizes when the samples overlap. This is but one reason why it is usually better to collect enough data to allow the mainstream statistical tests to be used, but, again, it was too expensive to collect that much data in this case. It therefore makes sense to look at this data set from other viewpoints, as is done in the next section.

Analysis of Medians

The trick here is to sort the two samples together, retaining a record of which sample each measurement came from:

Sample   Measurement
1        72.4
1        84.5
1        87.6
2        88.2
1        89.5
2        89.7
2        89.9
2        90.4
2        90.9
1        91.0

The important thing to note is that all of Sample 2's measurements fall above Sample 1's median. The probability of this happening is the fifth power of one half, or about 3.1%, which gives us better than 95% confidence that the difference between the two samples is not simply random chance.
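
This reasoning can be checked mechanically. The Python sketch below computes Sample 1's median, counts how many of Sample 2's measurements exceed it, and evaluates the probability of all of them doing so under the assumption that each measurement is equally likely to land on either side of that median:

import statistics

sample1 = [84.5, 91.0, 87.6, 89.5, 72.4]
sample2 = [90.9, 88.2, 89.7, 89.9, 90.4]

median1 = statistics.median(sample1)            # 87.6
above = sum(1 for x in sample2 if x > median1)  # all five

# If the two samples came from the same population, each of Sample 2's
# measurements would land above Sample 1's median with probability 1/2.
p_all_above = 0.5 ** len(sample2)               # about 3.1%

print(median1, above, p_all_above)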

Unfortunately, Sample 1's median is itself a random variable, and might well change if we added measurements to Sample 1. To see this, consider what happens to Sample 1's median if we remove any one of its measurements. If we remove either of the two smallest measurements, the median becomes 88.6, which exceeds Sample 2's smallest measurement. If we remove the middle measurement, the median becomes 87.0. If we remove either of the two largest measurements, the median becomes 86.1.

The traditional way to handle multiple estimates is to average them, which gives (2*88.6 + 87.0 + 2*86.1)/5 = 87.3. This is again less than the smallest measurement from Sample 2, once more giving us better than 95% confidence that the difference between the two samples is not simply random chance.
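
The leave-one-out medians and their average can also be computed directly, as in the sketch below (which uses the unrounded medians, so its result lands a little below the 87.3 obtained from the rounded values above):

import statistics

sample1 = [84.5, 91.0, 87.6, 89.5, 72.4]

# Median of Sample 1 with each measurement removed in turn.
loo_medians = [statistics.median(sample1[:i] + sample1[i+1:])
               for i in range(len(sample1))]

# Average the leave-one-out medians, as in the text above.
print(loo_medians, statistics.mean(loo_medians))  # about 87.2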

There are a number of variations on this theme which come under the heading of non-parametric statistics.
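
For example, the Wilcoxon rank-sum test (also known as the Mann-Whitney U test) formalizes this sort of rank-based comparison. If SciPy is available, it might be invoked as in the following sketch:

from scipy import stats

sample1 = [84.5, 91.0, 87.6, 89.5, 72.4]
sample2 = [90.9, 88.2, 89.7, 89.9, 90.4]

# One-sided test of whether Sample 2 tends to exceed Sample 1.
u_statistic, p_value = stats.mannwhitneyu(sample2, sample1,
                                          alternative='greater')
print(u_statistic, p_value)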

What Can We Conclude?

A number of non-standard analyses tell us that Samples 1 and 2 differ, which should give us some confidence that the difference is real.

The next step, as noted earlier, is to apply common sense and knowledge of the underlying hardware and software. That added knowledge is that the Sample 1 software was suffering from lock contention, which often results in erratic performance. The Sample 2 software greatly reduced lock contention, so that the smaller variances are expected behavior.

Given the above analyses and this information about the software, we should be quite comfortable concluding that there is a real difference between the two samples.