What is a baseline and what is a benchmark? What is the best definition for each, and how do you baseline one set of numbers and benchmark another?
Correct me if I'm wrong, but I believe "baseline" refers to a known good state, while "benchmark" refers to the current state. You would do a benchmark and compare it to the baseline.
Interesting definitions from SPR (Software Productivity Research):
Baseline and benchmark are similar but distinct activities.
Figuratively, a baseline is a "line in the sand" for an organization whereby it measures important performance characteristics for future reference.
This is not necessarily a "good" state, just a reference.
A benchmark is best understood by way of the original derivation of the word itself:
Tradesmen engaged in repetitive tasks, such as sawing lumber to consistent lengths, often placed notches on their workbenches to indicate placement of boards prior to cutting. Literally, a benchmark became a standard for comparison and an indicator of past success.
Basically:
Hi Gagneet, I'm on the Windows performance team; here is how we use these terms.
A baseline is a measurement of a known configuration that is used as a reference for subsequent measurements. For a baseline, we characterize the thing being measured. Let's take cold boot time as an example. Here we have a set of machines that are well characterized: we know how they work, we have good drivers for them, and the hardware isn't broken or flawed.
On this hardware, we have several baseline measurements, such as XP-RTM, XP-SP2, Vista-RTM, Vista-SP1, Vista-SP2, etc.
For each of these baselines, we have a set of well-characterized and well-understood measurements, including all the phases of boot, the amount of CPU, disk, and memory utilization, the number of DLL loads, etc.
After a baseline is established, we can take other measurements and compare them to it. For example, we are currently working on Windows 7. For each daily build we run a set of boot time tests and compare all the characteristics of that build to the baseline measurements, which include all the previous Windows 7 builds. This lets us see where the differences lie and helps us drill into the problem areas. Here are some more details.
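To make the compare-to-baseline step concrete, here is a minimal Python sketch. It is not the Windows team's tooling: the phase names, the timings, and the 10% tolerance are all made up for illustration.

```python
# Hypothetical boot-phase timings in seconds. The phase names and the
# numbers are invented for illustration; they are not real measurements.
baseline = {"kernel_init": 4.0, "driver_load": 6.5, "logon": 3.0}
current_build = {"kernel_init": 4.1, "driver_load": 8.2, "logon": 2.9}


def compare_to_baseline(baseline, current, tolerance=0.10):
    """Return the phases that are slower than the baseline by more than
    `tolerance`, expressed as a fraction of the baseline value."""
    regressions = {}
    for phase, base_value in baseline.items():
        delta = (current[phase] - base_value) / base_value
        if delta > tolerance:
            regressions[phase] = round(delta, 3)
    return regressions


regressions = compare_to_baseline(baseline, current_build)
print(regressions)  # phases worth drilling into
```

With real data you would usually repeat each measurement several times and compare distributions rather than single numbers, since boot times vary from run to run.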
In scientific research, a benchmark is a kind of test and a baseline is a kind of result.
Let's look at an example of a benchmark test: we might take a collection of 5,000 sentences in English and use the lab's four-core Dell machine to translate them into Spanish using various algorithms. Because we've kept the data and the machine constant, we can meaningfully compare the time taken by the different algorithms to complete the task, as well as their relative accuracy (measured against gold-standard human translations).
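A benchmark harness along these lines might look like the sketch below. The two "algorithms" are toy stand-ins, the two-sentence data set is hypothetical, and exact-match accuracy is a simplification of whatever metric you would really use.

```python
import time

# Hypothetical fixed test data: source sentences and gold-standard references.
sentences = ["the cat", "the dog"]
references = ["el gato", "el perro"]

# Toy stand-ins for real translation algorithms (assumptions, not real systems).
PHRASE_TABLE = {"the cat": "el gato"}
algorithms = {
    "phrase_lookup": lambda s: PHRASE_TABLE.get(s, s),
    "copy_input": lambda s: s,
}


def run_benchmark(algorithms, sentences, references):
    """Run every algorithm on the same data and record time and accuracy."""
    results = {}
    for name, translate in algorithms.items():
        start = time.perf_counter()
        outputs = [translate(s) for s in sentences]
        elapsed = time.perf_counter() - start
        # Simplified accuracy: fraction of outputs that match the reference exactly.
        correct = sum(out == ref for out, ref in zip(outputs, references))
        results[name] = {"seconds": elapsed, "accuracy": correct / len(sentences)}
    return results


results = run_benchmark(algorithms, sentences, references)
for name, stats in results.items():
    print(name, stats)
```

The point is that the data and the harness stay fixed, so only the algorithm changes between rows of the results.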
To find a baseline for this benchmark test, we might write a very naive translation algorithm that just finds the commonest translation for each individual word, with no regard for the context. Measuring the accuracy of this algorithm against our human translations gives us an idea of the minimum score - the baseline - that the others must beat, and gives us a feel for what level of accuracy counts as "good".
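Here is a rough sketch of such a naive baseline, assuming you already have word-aligned training pairs. The tiny data set and the position-by-position word accuracy metric are assumptions chosen to keep the example short.

```python
from collections import Counter, defaultdict

# Hypothetical word-aligned (English, Spanish) training pairs.
aligned_pairs = [
    ("the", "el"), ("the", "la"), ("the", "el"),
    ("cat", "gato"), ("sleeps", "duerme"), ("house", "casa"),
]


def train_baseline(pairs):
    """Map each source word to its most frequent translation, ignoring context."""
    counts = defaultdict(Counter)
    for src, tgt in pairs:
        counts[src][tgt] += 1
    return {src: c.most_common(1)[0][0] for src, c in counts.items()}


def translate(sentence, table):
    """Translate word by word; unknown words are copied through unchanged."""
    return " ".join(table.get(word, word) for word in sentence.lower().split())


def word_accuracy(hypothesis, reference):
    """Fraction of positions where the hypothesis word equals the reference word."""
    hyp, ref = hypothesis.split(), reference.split()
    matches = sum(h == r for h, r in zip(hyp, ref))
    return matches / max(len(hyp), len(ref))


table = train_baseline(aligned_pairs)
test_set = [("the cat sleeps", "el gato duerme"), ("the house", "la casa")]
scores = [word_accuracy(translate(src, table), ref) for src, ref in test_set]
baseline_score = sum(scores) / len(scores)
print(baseline_score)
```

The baseline gets "the house" wrong ("el casa" instead of "la casa") precisely because it ignores context, which is the kind of error the better algorithms are supposed to fix.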
At the other end of the scale from a baseline, an upper bound is a useful yardstick too. In the translation example, we might find the upper bound by measuring the accuracy of one of our human translations with respect to the others. This gives us an idea of how high a score on our "accuracy" measure can go before hitting the ceiling of human disagreement. We expect our machine translation algorithms to perform at a level between the baseline and the upper bound.
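One way to estimate that ceiling is to score each human translator against each of the others and average the results. The sketch below does this for three hypothetical translators, using a simple position-by-position word accuracy as the metric. Averaging over all pairs is a variation on scoring just one translator against the rest.

```python
def word_accuracy(hypothesis, reference):
    """Fraction of positions where the hypothesis word equals the reference word."""
    hyp, ref = hypothesis.split(), reference.split()
    matches = sum(h == r for h, r in zip(hyp, ref))
    return matches / max(len(hyp), len(ref))


# Hypothetical data: three human translators, the same two sentences each.
human_translations = {
    "translator_a": ["el gato duerme", "la casa grande"],
    "translator_b": ["el gato duerme", "la casa enorme"],
    "translator_c": ["el gato dormita", "la casa grande"],
}


def upper_bound(translations):
    """Average accuracy of each translator measured against every other one."""
    scores = []
    for name, hypotheses in translations.items():
        for other, refs in translations.items():
            if other == name:
                continue  # never score a translator against themselves
            scores.extend(word_accuracy(h, r) for h, r in zip(hypotheses, refs))
    return sum(scores) / len(scores)


ceiling = upper_bound(human_translations)
print(round(ceiling, 3))
```

If the humans only agree with each other about 78% of the time on this metric, a machine score much above that tells you more about the metric than about translation quality.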