When I used to program embedded systems and early 8/16-bit PCs (6502, 68K, 8086), I had a pretty good handle on exactly how long (in nanoseconds or microseconds) each instruction took to execute.
It took almost 11 years, but I have an estimate. Your loop is about 10 ops * 100 million iterations, so approximately 1 billion ops. On a 2.3 GHz machine, I would estimate on the order of 0.4 seconds. When I tested it, I actually got 1.2 seconds. So it's within one order of magnitude.
Just take your core frequency, estimate the number of ops, and divide. This gives a very rough estimate, and I've never been more than an order of magnitude off whenever I've tested empirically. Just make sure your op estimates are reasonable.
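For concreteness, here's a minimal sketch of that estimate-then-measure workflow. The loop body, the 10-ops-per-iteration guess, and the 2.3 GHz clock are illustrative assumptions carried over from the numbers above, not measurements of your machine; substitute your own.

```c
/* Back-of-envelope estimate vs. an actual measurement.
 * Assumed numbers: ~10 ops per iteration, 2.3 GHz clock. */
#include <stdio.h>
#include <time.h>

#define ITERATIONS 100000000UL  /* 100 million */
#define OPS_PER_ITER 10.0       /* rough guess at ops in the loop body */
#define CLOCK_HZ 2.3e9          /* assumed core frequency */

int main(void) {
    printf("estimate: %.2f s\n", ITERATIONS * OPS_PER_ITER / CLOCK_HZ);

    volatile unsigned long sum = 0;  /* volatile keeps the loop alive under -O */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (unsigned long i = 0; i < ITERATIONS; i++)
        sum += i ^ (i >> 3);         /* a handful of cheap ops per iteration */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double measured = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("measured: %.2f s (sum=%lu)\n", measured, sum);
    return 0;
}
```

If the two numbers disagree by more than a factor of ten, your op estimate (or your mental model of the loop) is usually the thing that's wrong.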
Modern processors do even more tricky things.
Out-of-order execution. If it is possible to do so without affecting correct behavior, the processor may execute instructions in a different order than they are listed in your program. This can hide the latency of long-running instructions.
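This is easy to see empirically. Here's a hedged sketch (the constants and iteration counts are arbitrary): it performs the same number of multiply-adds twice, once as a single dependency chain and once as four independent chains that an out-of-order core can overlap. The second version typically runs severalfold faster even though it does the same work.

```c
/* Same op count, different dependency structure. */
#include <stdio.h>
#include <time.h>

#define N 100000000UL

static double now(void) {
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return t.tv_sec + t.tv_nsec / 1e9;
}

int main(void) {
    volatile unsigned long sink;

    /* One chain: every multiply waits on the previous result. */
    unsigned long a = 1;
    double t0 = now();
    for (unsigned long i = 0; i < N; i++)
        a = a * 3 + 1;
    printf("serial chain:       %.2f s\n", now() - t0);
    sink = a;

    /* Four chains: several multiplies can be in flight at once. */
    unsigned long b0 = 1, b1 = 1, b2 = 1, b3 = 1;
    t0 = now();
    for (unsigned long i = 0; i < N; i += 4) {
        b0 = b0 * 3 + 1;
        b1 = b1 * 3 + 1;
        b2 = b2 * 3 + 1;
        b3 = b3 * 3 + 1;
    }
    printf("four indep. chains: %.2f s\n", now() - t0);
    sink = b0 ^ b1 ^ b2 ^ b3;
    (void)sink;
    return 0;
}
```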
Register renaming. Processors often have more physical registers than addressable registers in their instruction set (so-called "architectural" registers). This can be either for backward compatibility, or simply to enable efficient instruction encodings. As a program runs, the processor will "rename" the architectural registers it uses to whatever physical registers are free. This allows the processor to realize more parallelism than existed in the original program.
For instance, if you have a long sequence of operations on EAX and ECX, followed by instructions that re-initialize EAX and ECX to new values and perform another long sequence of operations, the processor can use different physical registers for both tasks, and execute them in parallel.
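In C terms the same scenario looks like the sketch below (purely illustrative: the variable names stand in for EAX/ECX, and which registers the compiler actually picks is its business):

```c
/* A false (write-after-write) register reuse that renaming makes harmless. */
unsigned two_tasks(unsigned a, unsigned c) {
    unsigned r1, r2;

    /* First sequence of operations on "EAX"/"ECX". */
    a = a * 3 + c;  c ^= a;  a += c >> 2;
    r1 = a + c;

    /* Re-initialize to new values. These writes don't read the old
     * a and c, so the CPU can map them to fresh physical registers
     * and start this sequence before the first one has retired. */
    a = 7;  c = 11;
    a = a * 5 + c;  c ^= a;  a += c >> 2;
    r2 = a + c;

    return r1 ^ r2;
}
```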
The Intel P6 microarchitecture does both out-of-order execution and register renaming. The Core 2 architecture is the latest derivative of the P6.
To actually answer your question: it is basically impossible for you to determine performance by hand in the face of all these architectural optimizations.
As Doug already noted, the best case is zero (superscalar processor, multiple execution units, data already in L1 cache).
The worst case is up to several milliseconds (when the OS handles a page fault and has to fetch the data/instruction from the disk). Excluding disk/swapping, it still depends on whether you have a NUMA machine, what kind of topology it has, which memory node the data lies in, whether there is concurrent access from another CPU (bus locking and cache synchronization protocols), etc.
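That spread is easy to demonstrate for loads alone. The rough sketch below (the array size and the LCG constants are arbitrary choices) issues the same number of loads twice: sequentially, where hardware prefetch keeps the data in cache, and pseudo-randomly over a working set far larger than cache, where most accesses go all the way to memory.

```c
/* Same number of loads, very different cost per load. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define COUNT (16UL * 1024 * 1024)   /* 16M ints = 64 MiB, bigger than cache */

static double now(void) {
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return t.tv_sec + t.tv_nsec / 1e9;
}

int main(void) {
    int *data = malloc(COUNT * sizeof *data);
    if (!data) return 1;
    for (unsigned long i = 0; i < COUNT; i++) data[i] = (int)i;

    volatile long sink = 0;

    double t0 = now();
    for (unsigned long i = 0; i < COUNT; i++)
        sink += data[i];                       /* sequential: prefetch-friendly */
    printf("sequential: %.3f s\n", now() - t0);

    unsigned idx = 12345;
    t0 = now();
    for (unsigned long i = 0; i < COUNT; i++) {
        idx = idx * 1664525u + 1013904223u;    /* Numerical-Recipes LCG */
        sink += data[idx % COUNT];             /* random: mostly cache misses */
    }
    printf("random:     %.3f s\n", now() - t0);

    free(data);
    return 0;
}
```

On a typical desktop the random pass runs many times slower, even before NUMA effects or swapping enter the picture.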