Exactly how “fast” are modern CPUs?

深忆病人 2020-12-23 22:41

When I used to program embedded systems and early 8/16-bit PCs (6502, 68K, 8086) I had a pretty good handle on exactly how long (in nanoseconds or microseconds) each instruction took to execute.

15 Answers
  • 2020-12-23 23:02

    It is nearly impossible to provide the kind of accurate timing information you are expecting in a way that will be USEFUL to you.

    The following concepts affect instruction timing; some can vary from moment to moment:

    • Micro-op decomposition
    • Operation pipelining
    • Super-scalar execution
    • Out of order execution
    • SMT / SMP execution
    • Floating point mode
    • Branch prediction / pre-fetch
    • Cache latency
    • Memory latency
    • Clock speed throttling
    • etc

    Consult a book on modern computer architecture if you need any further explanation on the above concepts.

    The best way to measure the speed of your code is (surprise!) to measure the speed of your code running the same workload and under the same conditions as you expect it to when "in the real world".
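    As a rough illustration of that advice (the workload function here is a placeholder; substitute your real code and realistic input data), a minimal measurement harness might look like:

```python
import timeit

def workload(n=1000):
    # Placeholder for the code you actually care about; replace it
    # with your real function, fed representative input data.
    return sum(i * i for i in range(n))

# Run the workload many times per trial and take the best of several
# trials; the minimum is the least contaminated by OS scheduling noise.
times = timeit.repeat(lambda: workload(), number=100, repeat=5)
best = min(times) / 100  # seconds per call
print(f"best: {best * 1e6:.2f} microseconds per call")
```

    Measuring under load, on the target machine, matters more than the harness itself: caches, frequency scaling, and contention all change the answer.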

  • 2020-12-23 23:03

    An interesting quote from Alan Kay in 2004:

    Just as an aside, to give you an interesting benchmark—on roughly the same system, roughly optimized the same way, a benchmark from 1979 at Xerox PARC runs only 50 times faster today. Moore’s law has given us somewhere between 40,000 and 60,000 times improvement in that time. So there’s approximately a factor of 1,000 in efficiency that has been lost by bad CPU architectures.

    The implication seems to be that CPU performance enhancements focus on areas where they have relatively little impact on the software we really write.

  • 2020-12-23 23:07

    This only answers part of your question, but I found this table from Wikipedia on locality of reference helpful. It describes the speed of access to and amount of memory in different levels of the memory hierarchy, using approximate 2006 times:

    • CPU registers (8-32 registers) – immediate access (0-1 clock cycles)
    • L1 CPU caches (32 KiB to 128 KiB) – fast access (3 clock cycles)
    • L2 CPU caches (128 KiB to 12 MiB) – slightly slower access (10 clock cycles)
    • Main physical memory (RAM) (256 MiB to 4 GiB) – slow access (100 clock cycles)
    • Disk (file system) (1 GiB to 1 TiB) – very slow (10,000,000 clock cycles)
    • Remote Memory (such as other computers or the Internet) (Practically unlimited) – speed varies
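    To get a feel for what those cycle counts mean in wall-clock time, here is a small arithmetic sketch, assuming a hypothetical 3 GHz core (the clock rate and the cycle counts are the approximate figures from the list above, not measurements):

```python
# Convert the approximate cycle counts above into wall-clock time,
# assuming a hypothetical 3 GHz core (one cycle is ~0.33 ns).
CLOCK_HZ = 3e9

def cycles_to_ns(cycles):
    return cycles / CLOCK_HZ * 1e9

latencies = {
    "register": 1,
    "L1 cache": 3,
    "L2 cache": 10,
    "RAM": 100,
    "disk": 10_000_000,
}

for level, cycles in latencies.items():
    print(f"{level:>9}: {cycles:>10,} cycles = {cycles_to_ns(cycles):>14.2f} ns")
```

    The spread is the point: the bottom of the hierarchy is not a little slower, it is several orders of magnitude slower.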
  • Modern processors such as the Core 2 Duo you mention are both superscalar and pipelined. They have multiple execution units per core and are actually working on more than one instruction at a time per core; this is the superscalar part. The pipelined part means that there is a latency from when an instruction is read in and "issued" to when it completes execution, and this time varies depending on the dependencies between that instruction and the others moving through the other execution units at the same time. So, in effect, the timing of any given instruction varies depending on what is around it and what it depends on. This means that a given instruction has a best-case and a worst-case execution time based on a number of factors. Because of the multiple execution units you can actually have more than one instruction completing execution per core clock, but sometimes there are several clocks between completions if the pipeline has to stall waiting for memory or for dependencies between instructions.

    All of the above is just from the view of the CPU core itself. Then you have interactions with the caches and contention for bandwidth with the other cores. The Bus Interface Unit of the CPU deals with getting instructions and data fed into the core and putting results back out of the core through the caches to memory.

    Rough order of magnitude rules of thumb to be taken with a grain of salt:

    • Register-to-register operations take 1 core clock to execute. This should generally be conservative, especially as more of these appear in sequence.
    • Memory-related load and store operations take 1 memory-bus clock to execute. This should be very conservative. With a high cache hit rate it will be more like 2 CPU bus clocks, which is the clock rate of the bus between the CPU core and the cache, but not necessarily the core's clock.
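    Applying those two rules of thumb, a back-of-envelope cost estimate might be sketched as follows (the 3 GHz core clock and 1 GHz memory-bus clock are assumed figures for illustration, not properties of any particular chip):

```python
# Back-of-envelope estimate using the two rules of thumb above,
# with assumed clock rates: 3 GHz core, 1 GHz memory bus.
CORE_HZ = 3e9
MEM_BUS_HZ = 1e9

def estimate_ns(reg_ops, mem_ops):
    """Rough lower-bound time: 1 core clock per register operation,
    1 memory-bus clock per load/store."""
    return (reg_ops / CORE_HZ + mem_ops / MEM_BUS_HZ) * 1e9

# e.g. a loop body with 8 register ops and 2 memory accesses:
print(f"~{estimate_ns(8, 2):.2f} ns per iteration")
```

    Even in this crude model, the two memory accesses cost about as much as all eight register operations combined, which is the grain-of-salt lesson the rules are meant to convey.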
  • 2020-12-23 23:11

    You can download the Intel 64 and IA-32 manuals here.

    But what you really need is the stuff from Agner Fog.

    He has a lot of additional information, for example his manual "Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel and AMD CPUs".

    Or test programs for counting clock cycles (he uses the time stamp counter).
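    Agner Fog's test programs read the x86 time stamp counter (RDTSC) directly from assembly. A rough Python analogue of the same before/after measurement pattern, using a high-resolution clock instead of the TSC (so it is not cycle-accurate), might look like:

```python
import time

def best_elapsed_ns(fn, repeats=1000):
    # Timestamp-counter-style measurement: read a high-resolution
    # clock before and after the call, subtract, and keep the minimum
    # over many runs to filter out interrupts and scheduling noise.
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter_ns()
        fn()
        end = time.perf_counter_ns()
        best = min(best, end - start)
    return best

print(best_elapsed_ns(lambda: None), "ns overhead for an empty call")
```

    Note that in an interpreted language the call overhead dwarfs any single machine instruction; for true per-instruction counts you need Fog's native tools or performance counters.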

  • 2020-12-23 23:14

    The kind of prediction you're asking for is hopeless.

    If you want a rule of thumb, here are some rules of thumb:

    • In the time it takes to get a word from level 2 cache, a processor can execute at least 10 instructions. So worry about memory access, not instruction counts---computation in registers is almost free.

    • In the time it takes to get a word from RAM, a processor can execute thousands of instructions (this number varies by a couple of orders of magnitude depending on the details of your hardware). Make sure this happens only on a cold cache; otherwise nothing else matters.

    • If you're running on x86 CPUs, there aren't enough registers. Try not to have more than 5 live variables in your code at any moment. Or better yet, move to AMD64 (x86_64) and double the number of registers. With 16 registers, and parameters passed in registers, you can quit worrying about registers.

    There was a time when every year I would ask an architect what rules of thumb I should use to predict the cost of the code my compilers generate. I've stopped, because the last time I received a useful answer was in 1999. (The answer was "make sure your loops fit in the reorder buffer". All those who know what a reorder buffer is may now raise your hands. Bonus points if you can discover the size of the reorder buffer on any computer you are currently using.)
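    The arithmetic behind those rules of thumb can be sketched directly; all the input figures here are assumptions chosen for illustration (3 GHz clock, a peak of 4 instructions per cycle, ~10-cycle L2 latency, ~60 ns RAM latency), not properties of any specific CPU:

```python
# Back-of-envelope arithmetic behind the rules of thumb above.
# Assumed figures: 3 GHz clock, up to 4 instructions per cycle,
# ~10-cycle L2 latency, ~60 ns RAM latency.
CLOCK_HZ = 3e9
IPC = 4  # peak instructions per cycle for a wide superscalar core

l2_cycles = 10
ram_ns = 60

# Instructions the core could have retired while waiting on each fetch.
instrs_per_l2_fetch = l2_cycles * IPC
instrs_per_ram_fetch = round(ram_ns * 1e-9 * CLOCK_HZ * IPC)

print(f"instructions forgone per L2 fetch:  ~{instrs_per_l2_fetch}")
print(f"instructions forgone per RAM fetch: ~{instrs_per_ram_fetch}")
```

    With these assumptions an L2 hit costs tens of potential instructions and a RAM fetch costs hundreds, which is why the advice above is to count memory accesses, not instructions.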
