When I used to program embedded systems and early 8/16-bit PCs (6502, 68K, 8086) I had a pretty good handle on exactly how long (in nanoseconds or microseconds) each instruction took to execute.
I recommend downloading the AMD software optimization guide.
It's not that simple. The timing for your two instructions won't help you gauge performance of a larger set of instructions much. That's because modern processors can execute many operations in parallel, and have large caches so "moving a value to memory" happens at a time quite removed from the instruction's execution.
So, best case is zero (when executed in parallel with other instructions). But how does that help you?
This web page shows some benchmarks, including some %MIPS/MHz results. As you can see, on many benchmarks there are multiple instructions executed per clock cycle. The charts also show the effects of cache size and memory speed.
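To make the cache point concrete, here's a small sketch of my own (not taken from that page; the array size and stride are arbitrary choices) that sums the same array twice, once sequentially and once with a cache-hostile stride. The adds are identical, but on most machines the strided version is several times slower simply because of where the data has to come from:

public class CacheSumDemo {
    static long sumSequential(int[] a) {
        long s = 0;
        for (int i = 0; i < a.length; i++) s += a[i];   // walks memory a cache line at a time
        return s;
    }

    static long sumStrided(int[] a, int stride) {
        long s = 0;
        // Same elements, same adds, but jumping 'stride' ints at a time so that
        // almost every access lands on a cache line that isn't resident yet.
        for (int start = 0; start < stride; start++) {
            for (int i = start; i < a.length; i += stride) s += a[i];
        }
        return s;
    }

    public static void main(String[] args) {
        int[] a = new int[1 << 24];              // 64 MB: much bigger than a typical L3 cache
        for (int i = 0; i < a.length; i++) a[i] = i;

        sumSequential(a); sumStrided(a, 16);     // warm up the JIT before timing

        long t0 = System.nanoTime();
        long s1 = sumSequential(a);
        long t1 = System.nanoTime();
        long s2 = sumStrided(a, 16);             // 16 ints = 64 bytes = one cache line per access
        long t2 = System.nanoTime();

        System.out.println("sequential: " + (t1 - t0) / 1e6 + " ms (sum " + s1 + ")");
        System.out.println("strided:    " + (t2 - t1) / 1e6 + " ms (sum " + s2 + ")");
    }
}

Same instructions, very different times, purely because of where the data happens to be.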
A lot of good answers on this thread already, but one topic is so far unmentioned: branch misprediction.
Because all modern processors are pipelined, when the instruction decoder runs into an instruction like "jump if equal", it has no idea which way the branch will go, so it just guesses. It then continues feeding instructions into the pipeline based on that guess. If the prediction was correct, the throughput and latency of the jump instruction are essentially zero. If it guesses wrong, the throughput and latency of the same jump instruction could be 50 or 100 cycles.
Note that the same instruction can have that "zero cost" the first time it's executed in a loop and a really huge cost the next time the very same instruction is executed!
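Here's a rough sketch of my own (not from any of the answers above; the array size, threshold, and warm-up counts are arbitrary) that shows this from Java: counting elements above a threshold in a random array versus a sorted copy of it. The JIT may turn the branch into a conditional move on some setups, in which case the gap disappears, but where it stays a real branch the sorted (predictable) input is typically much faster:

import java.util.Arrays;
import java.util.Random;

public class BranchDemo {
    static long countAbove(int[] data, int threshold) {
        long count = 0;
        for (int v : data) {
            if (v >= threshold) count++;   // data-dependent branch
        }
        return count;
    }

    public static void main(String[] args) {
        int n = 1 << 24;
        int[] data = new int[n];
        Random rnd = new Random(42);
        for (int i = 0; i < n; i++) data[i] = rnd.nextInt(256);

        int[] sorted = data.clone();
        Arrays.sort(sorted);               // makes the branch predictable: all "no", then all "yes"

        for (int i = 0; i < 5; i++) {      // warm up the JIT before timing
            countAbove(data, 128);
            countAbove(sorted, 128);
        }

        long t0 = System.nanoTime();
        countAbove(data, 128);             // roughly 50% of the branches are mispredicted
        long t1 = System.nanoTime();
        countAbove(sorted, 128);           // almost perfectly predicted
        long t2 = System.nanoTime();

        System.out.println("random: " + (t1 - t0) / 1e6 + " ms");
        System.out.println("sorted: " + (t2 - t1) / 1e6 + " ms");
    }
}

Same jump instruction in both runs; only the predictability of its outcome changes.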
Using a description largely based on the Intel Pentium architecture, to cut a very, very long story short:
Since the timing of an instruction depends on the surrounding instructions, in practice it's usually best to time a representative piece of code rather than worry about individual instructions. However, each instruction does have a documented latency (roughly, cycles until its result is available) and throughput (roughly, how often a new instance of it can be started), and these give you a ballpark:
So for example, if, say, floating point add and multiply instructions each have a throughput of 2 and a latency of 5 (actually, for multiply it's a bit greater, I think), that means that adding a register to itself or multiplying it by itself will likely take two clock cycles (since there are no other dependent values), whereas adding it to the result of a previous multiplication will take something like, or a bit less than, 2+5 clock cycles, depending on where you start/finish timing and on all sorts of other things. (During some of those clock cycles, another add/multiply operation could be taking place, so it's arguable how many cycles you actually attribute to the individual add/multiply instructions anyway...)
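As a sketch of that latency-versus-throughput point (my own illustration, with arbitrary array size and repetition counts, not code from any manual): summing with a single accumulator forms a chain of dependent adds and pays the full add latency on every iteration, whereas summing with four independent accumulators lets the pipeline overlap the adds. On most out-of-order cores the second version runs noticeably faster even though it performs the same number of additions:

public class LatencyThroughputDemo {
    static double sumOneChain(double[] a) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) {
            s += a[i];                     // each add waits for the previous one: full add latency per element
        }
        return s;
    }

    static double sumFourChains(double[] a) {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (int i = 0; i + 3 < a.length; i += 4) {
            s0 += a[i];                    // these four adds are independent of each other,
            s1 += a[i + 1];                // so the pipeline can work on them in parallel
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        return s0 + s1 + s2 + s3;
    }

    public static void main(String[] args) {
        double[] a = new double[2048];     // 16 KB: small enough to stay in L1, so memory isn't the bottleneck
        java.util.Arrays.fill(a, 1.000001);

        int reps = 200000;
        double sink = 0;
        for (int i = 0; i < reps; i++) sink += sumOneChain(a) + sumFourChains(a);   // JIT warm-up

        long t0 = System.nanoTime();
        for (int i = 0; i < reps; i++) sink += sumOneChain(a);
        long t1 = System.nanoTime();
        for (int i = 0; i < reps; i++) sink += sumFourChains(a);
        long t2 = System.nanoTime();

        System.out.println("one chain:   " + (t1 - t0) / 1e6 + " ms");
        System.out.println("four chains: " + (t2 - t1) / 1e6 + " ms");
        System.out.println("(ignore: " + sink + ")");    // keep the results live so nothing is optimized away
    }
}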
Oh, and just as a concrete example: for the following Java code
public void runTest(double[] data, double randomVal) {
    for (int i = data.length - 1; i >= 0; i--) {
        data[i] = data[i] + randomVal;
    }
}
Hotspot 1.6.12 JIT-compiles the inner loop sequence to the following Intel code, consisting of a load-add-store for each position in the array (with 'randomVal' being held in XMM0a in this case):
0b3 MOVSD XMM1a,[EBP + #16]
0b8 ADDSD XMM1a,XMM0a
0bc MOVSD [EBP + #16],XMM1a
0c1 MOVSD XMM1a,[EBP + #8]
0c6 ADDSD XMM1a,XMM0a
0ca MOVSD [EBP + #8],XMM1a
...
each group of load-add-store appears to take 5 clock cycles.
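If you want to verify a figure like that yourself, the practical route is the one mentioned earlier in the thread: time a representative chunk of code. A minimal harness for the method above might look like the following sketch (the array size and repetition counts are arbitrary choices of mine). Dividing the resulting nanoseconds-per-element by your clock period gives an approximate cycles-per-load-add-store figure, which is roughly how an estimate like "5 clock cycles" is arrived at:

public class RunTestTiming {
    public void runTest(double[] data, double randomVal) {
        for (int i = data.length - 1; i >= 0; i--) {
            data[i] = data[i] + randomVal;
        }
    }

    public static void main(String[] args) {
        RunTestTiming t = new RunTestTiming();
        double[] data = new double[1 << 14];    // 128 KB: small enough to stay cache-resident

        for (int i = 0; i < 10000; i++) {       // let the JIT compile and optimize runTest first
            t.runTest(data, 0.5);
        }

        int reps = 10000;
        long t0 = System.nanoTime();
        for (int i = 0; i < reps; i++) t.runTest(data, 0.5);
        long t1 = System.nanoTime();

        double nsPerElement = (t1 - t0) / (double) reps / data.length;
        System.out.println("~" + nsPerElement + " ns per load-add-store");
    }
}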
I don't think the worst case is bounded on some platforms. When you have multiple cores and processors vying for the same or adjacent memory locations, you can see all kinds of degradation in performance. Cache lines have to get moved around from processor to processor. I haven't seen a good worst-case number for memory operations on modern platforms.
All you need is in the appropriate CPU manuals. Both AMD and Intel have PDFs available on their websites describing the latencies of every instruction.
Just keep in mind the complexity of modern CPUs. They don't execute one instruction at a time; they can load 3-4 instructions per cycle, and almost all instructions are pipelined, so by the time the next instructions are loaded, the current ones are nowhere near finished. They also reorder instructions to allow for more efficient scheduling. A modern CPU can easily have 50 instructions in progress at a time.
So you're asking the wrong question. The time taken for a single instruction varies wildly depending on how and when you measure. It depends on how busy the instruction decoder is, on the branch predictor, on scheduling and on which other instructions are being scheduled, in addition to simpler issues like caching.
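One quick way to see this for yourself: time the exact same loop several times in a row (a sketch of my own; array size and run count are arbitrary). The early runs execute interpreted code against cold caches, the later ones are JIT-compiled and warm, and the reported times can differ substantially even though the instructions in the source never change:

public class MeasurementDemo {
    static long work(int[] a) {
        long s = 0;
        for (int v : a) s += v * 31L;           // a trivial, fixed piece of work
        return s;
    }

    public static void main(String[] args) {
        int[] a = new int[1 << 20];
        for (int i = 0; i < a.length; i++) a[i] = i;

        for (int run = 0; run < 10; run++) {
            long t0 = System.nanoTime();
            long s = work(a);
            long t1 = System.nanoTime();
            System.out.println("run " + run + ": " + (t1 - t0) / 1e6 + " ms (sum " + s + ")");
        }
    }
}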