When I used to program embedded systems and early 8/16-bit PCs (6502, 68K, 8086) I had a pretty good handle on exactly how long (in nanoseconds or microseconds) each instruction took to execute.
I recommend downloading the AMD software optimization guide.
It's not that simple. The timing for your two instructions won't help you gauge performance of a larger set of instructions much. That's because modern processors can execute many operations in parallel, and have large caches so "moving a value to memory" happens at a time quite removed from the instruction's execution.
So, best case is zero (when executed in parallel with other instructions). But how does that help you?
This web page shows some benchmarks, including some %MIPS/MHz results. As you can see, on many benchmarks there are multiple instructions executed per clock cycle. The charts also show the effects of cache size and memory speed.
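To make the cache point concrete, here's a small sketch of my own (not taken from that page; the array size and stride are arbitrary choices) that sums the same array twice, once sequentially and once with a cache-hostile stride. The adds are identical, but on most machines the strided version is several times slower simply because of where the data has to come from:

public class CacheSumDemo {
    static long sumSequential(int[] a) {
        long s = 0;
        for (int i = 0; i < a.length; i++) s += a[i];   // walks memory a cache line at a time
        return s;
    }

    static long sumStrided(int[] a, int stride) {
        long s = 0;
        // Same elements, same adds, but jumping 'stride' ints at a time so that
        // almost every access lands on a cache line that isn't resident yet.
        for (int start = 0; start < stride; start++) {
            for (int i = start; i < a.length; i += stride) s += a[i];
        }
        return s;
    }

    public static void main(String[] args) {
        int[] a = new int[1 << 24];              // 64 MB: much bigger than a typical L3 cache
        for (int i = 0; i < a.length; i++) a[i] = i;

        sumSequential(a); sumStrided(a, 16);     // warm up the JIT before timing

        long t0 = System.nanoTime();
        long s1 = sumSequential(a);
        long t1 = System.nanoTime();
        long s2 = sumStrided(a, 16);             // 16 ints = 64 bytes = one cache line per access
        long t2 = System.nanoTime();

        System.out.println("sequential: " + (t1 - t0) / 1e6 + " ms (sum " + s1 + ")");
        System.out.println("strided:    " + (t2 - t1) / 1e6 + " ms (sum " + s2 + ")");
    }
}

Same instructions, very different times, purely because of where the data happens to be.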
A lot of good answers on this thread already, but one topic is so far unmentioned: branch misprediction.
Because all modern processors are pipelined, when the instruction decoder runs into an instruction like "jump if equal", it has no idea which way the branch will go, so it just guesses. It then continues feeding instructions into the pipeline based on that guess. If the prediction was correct, the throughput and latency of the jump instruction are essentially zero. If it guesses wrong, the throughput and latency of the same jump instruction could be 50 or 100 cycles.
Note that the same instruction can have that "zero cost" the first time it's executed in a loop and a really huge cost the next time the very same instruction is executed!
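Here's a rough sketch of my own (not from any of the answers above; the array size, threshold, and warm-up counts are arbitrary) that shows this from Java: counting elements above a threshold in a random array versus a sorted copy of it. The JIT may turn the branch into a conditional move on some setups, in which case the gap disappears, but where it stays a real branch the sorted (predictable) input is typically much faster:

import java.util.Arrays;
import java.util.Random;

public class BranchDemo {
    static long countAbove(int[] data, int threshold) {
        long count = 0;
        for (int v : data) {
            if (v >= threshold) count++;   // data-dependent branch
        }
        return count;
    }

    public static void main(String[] args) {
        int n = 1 << 24;
        int[] data = new int[n];
        Random rnd = new Random(42);
        for (int i = 0; i < n; i++) data[i] = rnd.nextInt(256);

        int[] sorted = data.clone();
        Arrays.sort(sorted);               // makes the branch predictable: all "no", then all "yes"

        for (int i = 0; i < 5; i++) {      // warm up the JIT before timing
            countAbove(data, 128);
            countAbove(sorted, 128);
        }

        long t0 = System.nanoTime();
        countAbove(data, 128);             // roughly 50% of the branches are mispredicted
        long t1 = System.nanoTime();
        countAbove(sorted, 128);           // almost perfectly predicted
        long t2 = System.nanoTime();

        System.out.println("random: " + (t1 - t0) / 1e6 + " ms");
        System.out.println("sorted: " + (t2 - t1) / 1e6 + " ms");
    }
}

Same jump instruction in both runs; only the predictability of its outcome changes.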
Using a description largely based on the Intel Pentium architecture, to cut a very, very long story short:
Since the timing of an instruction depends on the surrounding instructions, in practice it's usually best to time a representative piece of code rather than worry about individual instructions. However, each instruction does have a documented latency (roughly, cycles until its result is available) and throughput (roughly, how often a new instance of it can be started), and these give you a ballpark:
So for example, if, say, floating point add and multiply instructions each have a throughput of 2 and a latency of 5 (actually, for multiply it's a bit greater, I think), that means that adding a register to itself or multiplying it by itself will likely take two clock cycles (since there are no other dependent values), whereas adding it to the result of a previous multiplication will take something like, or a bit less than, 2+5 clock cycles, depending on where you start/finish timing and on all sorts of other things. (During some of those clock cycles, another add/multiply operation could be taking place, so it's arguable how many cycles you actually attribute to the individual add/multiply instructions anyway...)
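As a sketch of that latency-versus-throughput point (my own illustration, with arbitrary array size and repetition counts, not code from any manual): summing with a single accumulator forms a chain of dependent adds and pays the full add latency on every iteration, whereas summing with four independent accumulators lets the pipeline overlap the adds. On most out-of-order cores the second version runs noticeably faster even though it performs the same number of additions:

public class LatencyThroughputDemo {
    static double sumOneChain(double[] a) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) {
            s += a[i];                     // each add waits for the previous one: full add latency per element
        }
        return s;
    }

    static double sumFourChains(double[] a) {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (int i = 0; i + 3 < a.length; i += 4) {
            s0 += a[i];                    // these four adds are independent of each other,
            s1 += a[i + 1];                // so the pipeline can work on them in parallel
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        return s0 + s1 + s2 + s3;
    }

    public static void main(String[] args) {
        double[] a = new double[2048];     // 16 KB: small enough to stay in L1, so memory isn't the bottleneck
        java.util.Arrays.fill(a, 1.000001);

        int reps = 200000;
        double sink = 0;
        for (int i = 0; i < reps; i++) sink += sumOneChain(a) + sumFourChains(a);   // JIT warm-up

        long t0 = System.nanoTime();
        for (int i = 0; i < reps; i++) sink += sumOneChain(a);
        long t1 = System.nanoTime();
        for (int i = 0; i < reps; i++) sink += sumFourChains(a);
        long t2 = System.nanoTime();

        System.out.println("one chain:   " + (t1 - t0) / 1e6 + " ms");
        System.out.println("four chains: " + (t2 - t1) / 1e6 + " ms");
        System.out.println("(ignore: " + sink + ")");    // keep the results live so nothing is optimized away
    }
}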
Oh, and just as a concrete example: for the following Java code
public void runTest(double[] data, double randomVal) {
    for (int i = data.length - 1; i >= 0; i--) {
        data[i] = data[i] + randomVal;
    }
}
Hotspot 1.6.12 JIT-compiles the inner loop sequence to the following Intel code, consisting of a load-add-store for each position in the array (with 'randomVal' being held in XMM0a in this case):
0b3 MOVSD XMM1a,[EBP + #16]
0b8 ADDSD XMM1a,XMM0a
0bc MOVSD [EBP + #16],XMM1a
0c1 MOVSD XMM1a,[EBP + #8]
0c6 ADDSD XMM1a,XMM0a
0ca MOVSD [EBP + #8],XMM1a
...
each group of load-add-store appears to take 5 clock cycles.
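If you want to verify a figure like that yourself, the practical route is the one mentioned earlier in the thread: time a representative chunk of code. A minimal harness for the method above might look like the following sketch (the array size and repetition counts are arbitrary choices of mine). Dividing the resulting nanoseconds-per-element by your clock period gives an approximate cycles-per-load-add-store figure, which is roughly how an estimate like "5 clock cycles" is arrived at:

public class RunTestTiming {
    public void runTest(double[] data, double randomVal) {
        for (int i = data.length - 1; i >= 0; i--) {
            data[i] = data[i] + randomVal;
        }
    }

    public static void main(String[] args) {
        RunTestTiming t = new RunTestTiming();
        double[] data = new double[1 << 14];    // 128 KB: small enough to stay cache-resident

        for (int i = 0; i < 10000; i++) {       // let the JIT compile and optimize runTest first
            t.runTest(data, 0.5);
        }

        int reps = 10000;
        long t0 = System.nanoTime();
        for (int i = 0; i < reps; i++) t.runTest(data, 0.5);
        long t1 = System.nanoTime();

        double nsPerElement = (t1 - t0) / (double) reps / data.length;
        System.out.println("~" + nsPerElement + " ns per load-add-store");
    }
}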
I don't think the worst case is bounded on some platforms. When you have multiple cores and processors vying for the same or adjacent memory locations, you can see all kinds of degradation in performance. Cache lines have to get moved around from processor to processor. I haven't seen a good worst-case number for memory operations on modern platforms.
All you need is in the appropriate CPU manuals. Both AMD and Intel have PDFs available on their websites describing the latencies of every instruction.
Just keep in mind the complexity of modern CPUs. They don't execute one instruction at a time; they can load 3-4 instructions per cycle, and almost all instructions are pipelined, so by the time the next instructions are loaded, the current ones are nowhere near finished. They also reorder instructions to allow for more efficient scheduling. A modern CPU can easily have 50 instructions in progress at a time.
So you're asking the wrong question. The time taken for a single instruction varies wildly depending on how and when you measure. It depends on how busy the instruction decoder is, on the branch predictor, on scheduling and on which other instructions are being scheduled, in addition to simpler issues like caching.
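One quick way to see this for yourself: time the exact same loop several times in a row (a sketch of my own; array size and run count are arbitrary). The early runs execute interpreted code against cold caches, the later ones are JIT-compiled and warm, and the reported times can differ substantially even though the instructions in the source never change:

public class MeasurementDemo {
    static long work(int[] a) {
        long s = 0;
        for (int v : a) s += v * 31L;           // a trivial, fixed piece of work
        return s;
    }

    public static void main(String[] args) {
        int[] a = new int[1 << 20];
        for (int i = 0; i < a.length; i++) a[i] = i;

        for (int run = 0; run < 10; run++) {
            long t0 = System.nanoTime();
            long s = work(a);
            long t1 = System.nanoTime();
            System.out.println("run " + run + ": " + (t1 - t0) / 1e6 + " ms (sum " + s + ")");
        }
    }
}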