Performance Validation: How Much Is an Order of Magnitude Worth?


Suppose that one implementation of a program takes ten seconds to complete, while another implementation takes only one second to complete. What can we conclude about the relative performance of the two programs?

If that is all the information we have, then in theory, we cannot conclude very much at all. Even in practice, there are cases where we cannot conclude anything. For example, suppose that each program used two threads that communicated intensively. In this case, we could easily see an order-of-magnitude difference between running both threads on the same core and running each thread in a separate socket. To reiterate, we could see this order-of-magnitude difference even if both implementations used identical code.
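One way to rule out this particular source of variation is to fix the thread placement explicitly. The following is a minimal sketch, assuming a Linux system with GNU libc; the CPU numbers and the trivial worker function are placeholders that would need to be adapted to the actual workload and to the machine's topology.

    /* Sketch: create two threads, each pinned to a specific CPU, so that
     * repeated runs measure the same thread placement.  CPU numbers 0 and 1
     * are placeholders; whether they share a core or a socket depends on
     * the hardware topology (see lscpu). */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Placeholder for the communicating work whose runtime is measured. */
    static void *worker(void *arg)
    {
        return NULL;
    }

    int main(void)
    {
        int cpus[2] = { 0, 1 };  /* placeholder CPU numbers */
        pthread_t tid[2];

        for (int i = 0; i < 2; i++) {
            pthread_attr_t attr;
            cpu_set_t set;

            CPU_ZERO(&set);
            CPU_SET(cpus[i], &set);
            pthread_attr_init(&attr);
            /* Bind the thread to its CPU before it starts running. */
            if (pthread_attr_setaffinity_np(&attr, sizeof(set), &set) != 0)
                abort();
            if (pthread_create(&tid[i], &attr, worker, NULL) != 0)
                abort();
            pthread_attr_destroy(&attr);
        }
        for (int i = 0; i < 2; i++)
            pthread_join(tid[i], NULL);
        printf("Both threads ran with fixed CPU placement.\n");
        return 0;
    }

Pinning the threads in this way at least ensures that repeated runs measure the same placement, whether or not that placement is the one you ultimately care about.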

Figuring out whether this sort of difference represents a real difference in performance is the domain of statistics. However, the old saying "lies, damn lies, and statistics" should provide you with a healthy level of skepticism. Furthermore, what follows is in no way a substitute for taking a good statistics class.

A key point is that statistics does not normally deal with single measurements, which is why it cannot do much with our earlier measurements of ten seconds on the one hand and one second on the other. On the other hand, if we had run the first program hundreds of times, and its runtime had always been between 9.5 and 10.5 seconds, then the probability that the one-second runtime was due to random chance would be vanishingly small.

To see just how small, suppose instead that the program was much less consistent, with run-to-run variation ten times larger than before, so that its runtimes lie between 5 and 15 seconds, perhaps as follows:

    Run    Duration (seconds)
      1        10.0
      2         9.3
      3        11.6
      4         8.8
      5        10.6
      6         9.1
      7        11.4
      8         8.2
      9        12.7
     10        15.0
     11         5.9
     12        14.2
     13         6.0
     14         5.2
     15        13.0
     16         4.9
     17        13.0
     18        10.5

This rather ugly data set has an arithmetic mean of 10 seconds and a standard deviation of about 3 seconds. The difference between the 10-second mean and the 1-second measurement is nine seconds, or roughly three standard deviations. Assuming the runtimes follow anything resembling a normal distribution, the probability of measuring a result this far from the mean by chance is well under one percent, giving good reason to believe that the measurement was not a fluke, and that there is therefore a real difference between the two programs.
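As a concrete check, the following short program computes the mean and standard deviation of the data set above and reports how many standard deviations the one-second measurement lies from the mean. The data values are taken from the table; the use of the sample (n - 1) standard deviation and the output format are illustrative choices, not the only reasonable ones.

    /* Sketch: mean, sample standard deviation, and the distance of the
     * one-second measurement from the mean, using the table above. */
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        static const double runtimes[] = {
            10.0,  9.3, 11.6,  8.8, 10.6,  9.1, 11.4,  8.2, 12.7,
            15.0,  5.9, 14.2,  6.0,  5.2, 13.0,  4.9, 13.0, 10.5,
        };
        const int n = sizeof(runtimes) / sizeof(runtimes[0]);
        double sum = 0.0, sumsq = 0.0;

        for (int i = 0; i < n; i++) {
            sum += runtimes[i];
            sumsq += runtimes[i] * runtimes[i];
        }

        double mean = sum / n;
        /* Sample standard deviation (divide by n - 1). */
        double stddev = sqrt((sumsq - n * mean * mean) / (n - 1));
        /* Distance of the 1-second measurement from the mean, in
         * units of standard deviations. */
        double z = (mean - 1.0) / stddev;

        printf("mean = %.1f s, stddev = %.1f s, "
               "1-second run is %.1f stddevs from the mean\n",
               mean, stddev, z);
        return 0;
    }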

The smaller the standard deviation, the more easily you can distinguish real differences in behavior. By the same token, the larger the difference in behavior between the two implementations, the less work is required to characterize each of them.

All that aside, if your experiment produces a data set with this much dispersion, your first thought should be that your experiment is in need of a redesign. The excessive levels of dispersion might be caused by any number of errors, including:

  1. Bugs in the code being measured.
  2. Bugs in the code reducing the data from the measurements.
  3. Failure to account for cache-locality or process-migration effects.
  4. Failure to account for thread-placement effects.
  5. Cold-start issues, for example, where the first run reads data from disk and subsequent runs pull it from in-memory caches.
  6. Interference from other work on the system (e.g., system utilities and daemons).
  7. For experiments involving networking, interference from other loads on the network.
  8. For experiments involving filesystems, differing file data and metadata layouts on the underlying storage media.
  9. Write amplification on flash-memory devices.

And these are but a few of the effects that have the potential to confound your experiment. Although statistics can be a very powerful tool, it is not and never will be a substitute for common sense, nor for a deep understanding of the underlying layers of software and hardware.

Nevertheless, a good understanding of basic statistics can be very helpful to you when analyzing experimental data.