Why is uint_least16_t faster than uint_fast16_t for multiplication in x86_64?

醉梦人生 — asked 2021-01-11 14:28 (5 answers, 1663 views)

The C standard is quite unclear about the uint_fast*_t family of types. On a gcc-4.4.4 Linux x86_64 system, why is multiplication with uint_least16_t faster than with uint_fast16_t?

5 Answers
  •  借酒劲吻你
     answered 2021-01-11 14:51

    Actual runtime performance is a very complicated topic, with many factors ranging from RAM, hard disks, and the OS to many processor-specific quirks. But this should give you a rough rundown:

    N_fastX_t

    • The optimal size for the processor to perform most operations (addition and subtraction) efficiently. This is hardware specific: for example, a 32-bit variable may be native to the hardware and therefore faster than a 16-bit one, and so it gets used instead.
    • Because it does not benefit from cache-line hits as well as N_leastX_t, it should be used mainly when a single variable is needed as fast as possible, and not inside a large array (how large the array must be before the trade-off flips is, sadly, platform dependent).
    • Note that fast vs. least has several quirky cases, mainly multiplication and division, and those are platform specific. However, if most of your operations are additions, subtractions, ORs, and ANDs, it is generally safe to assume fast is faster (once again, subject to CPU cache and other quirks).

    N_leastX_t

    • The smallest variable the hardware allows that is at least X bits in size. (Some platforms cannot address variables below 4 bytes, for example; in fact, most of your BOOL variables take up at least a byte, not a bit.)
    • May be computed via costly software emulation if hardware support does not exist.
    • May incur a performance penalty due to partial hardware support (compared to fast) on a per-operation basis.
    • HOWEVER: because it takes less space, it hits the cache lines a lot more frequently. This is much more prominent in arrays, and in such cases it will be faster (memory cost > CPU cost). See http://en.wikipedia.org/wiki/CPU_cache for more details.
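
    A minimal sketch to see the trade-off on your own machine: print the storage the compiler actually picks for each variant. On glibc x86_64, uint_fast16_t is typically 8 bytes while uint_least16_t is 2 bytes, but those exact values are platform dependent; the standard only guarantees the minimum widths.

    ```c
    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* Platform dependent: commonly 8 vs. 2 on glibc x86_64. */
        printf("sizeof(uint_fast16_t)  = %zu\n", sizeof(uint_fast16_t));
        printf("sizeof(uint_least16_t) = %zu\n", sizeof(uint_least16_t));

        /* The standard only guarantees at least 16 value bits in each. */
        assert(sizeof(uint_fast16_t)  * 8 >= 16);
        assert(sizeof(uint_least16_t) * 8 >= 16);
        return 0;
    }
    ```

    An array of a million uint_least16_t elements is then 2 MB instead of 8 MB, which is the cache-line advantage described above.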

    The Multiplication problem?

    Now, why would the larger fastX variable be slower in multiplication? The cause lies in the very nature of multiplication, which works much like the long multiplication you were taught in school:

    http://en.wikipedia.org/wiki/Binary_multiplier

    //Assuming 4-bit ints
       0011 (3 in decimal)
     x 0101 (5 in decimal)
     ======
       0011 ("0011 x 1")
      0000- ("0011 x 0")
     0011-- ("0011 x 1")
    0000--- ("0011 x 0")
    =======
       1111 (15 in decimal)
    

    However, it is important to know that a computer is a "logical idiot". While it is obvious to us humans to skip the all-zero rows, the computer will still work them out (it is cheaper than checking the condition and then working them out anyway). Hence this creates a quirk for a larger-size variable holding the same value:

       //Assuming 8-bit ints
          0000 0011 (3 in decimal)
        x 0000 0101 (5 in decimal)
        ===========
          0000 0011 ("0011 x 1")
        0 0000 000- ("0011 x 0")
       00 0000 11-- ("0011 x 1")
      000 0000 0--- ("0011 x 0")
     0000 0000 ---- (And the remaining rows of zeros)
     -------------- (Will all be worked out)
     ==============
          0000 1111 (15 in decimal)
    

    While I did not write out all the remaining 0x0 additions in the multiplication process, it is important to note that the computer will "get them done". Hence it is natural that multiplying a larger variable takes longer than its smaller counterpart (which is why it is always good to avoid multiplication and division whenever possible).
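
    The schoolbook procedure above can be sketched as a shift-and-add loop. This is not how the hardware is actually wired, just the same algorithm in C; it makes visible that the loop runs once per bit of the operand width, zero rows included, so a wider operand means more rounds of work:

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Shift-and-add multiplication: one partial product per bit of the
     * operand width, including the all-zero rows a human would skip. */
    static uint32_t shift_add_mul(uint16_t a, uint16_t b)
    {
        uint32_t acc = 0;
        for (int bit = 0; bit < 16; bit++) {   /* always 16 rounds for 16 bits */
            if (b & (1u << bit))
                acc += (uint32_t)a << bit;     /* add the shifted partial product */
        }
        return acc;
    }

    int main(void)
    {
        assert(shift_add_mul(3, 5) == 15);     /* the worked example above */
        assert(shift_add_mul(255, 255) == 65025);
        return 0;
    }
    ```

    A 64-bit operand would take 64 rounds of the same loop, which mirrors why the larger fast type can lose on multiplication.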

    However, here comes the second quirk, and it may not apply to all processors. All CPU operations are counted in CPU cycles, and within each cycle dozens (or more) of such small addition operations can be performed, as seen above. As a result, an 8-bit addition may take the same amount of time as an 8-bit multiplication, and so on, due to various optimizations and CPU-specific quirks.

    If it concerns you that much, refer to the Intel architecture manuals: http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html


    Additional mention about CPU vs RAM

    CPUs have advanced, per Moore's law, to be several times faster than your DDR3 RAM.

    This can result in situations where more time is spent looking up the variable in RAM than the CPU takes to "compute" it. This is most prominent in long pointer chains.

    So while a CPU cache exists on most processors to reduce "RAM look-up" time, its usefulness is limited to specific cases (where cache lines benefit the most). For the cases where the data does not fit, note that RAM look-up time > CPU processing time (excluding multiplication/division and some quirks).
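
    As a hypothetical illustration (the types and function names here are invented for the sketch), compare a dependent pointer chain with a flat array walk over the same values. Each step of the chain must wait for the previous load to finish before the next address is even known, so RAM latency dominates on cache misses, while the array walk streams through consecutive cache lines:

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical node type for a pointer chain. */
    struct node { struct node *next; uint_least16_t value; };

    /* Every iteration depends on the load from the previous one, so the
     * CPU stalls on memory latency whenever a node misses the cache. */
    static unsigned long sum_chain(const struct node *n)
    {
        unsigned long sum = 0;
        for (; n != NULL; n = n->next)
            sum += n->value;
        return sum;
    }

    /* Same values in a flat array: consecutive uint_least16_t elements
     * share cache lines, and the next address is known in advance. */
    static unsigned long sum_array(const uint_least16_t *a, size_t len)
    {
        unsigned long sum = 0;
        for (size_t i = 0; i < len; i++)
            sum += a[i];
        return sum;
    }

    int main(void)
    {
        enum { N = 1000 };
        static struct node nodes[N];
        static uint_least16_t flat[N];

        for (int i = 0; i < N; i++) {
            flat[i] = (uint_least16_t)i;
            nodes[i].value = (uint_least16_t)i;
            nodes[i].next = (i + 1 < N) ? &nodes[i + 1] : NULL;
        }

        /* Both walks compute the same sum; only the access pattern differs. */
        assert(sum_chain(&nodes[0]) == sum_array(flat, N));
        assert(sum_array(flat, N) == (unsigned long)N * (N - 1) / 2);
        return 0;
    }
    ```

    The two functions are equivalent in result; the difference only shows up as wall-clock time once the working set outgrows the cache.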
