When is assembly faster than C?


One of the stated reasons for knowing assembler is that, on occasion, it can be employed to write code that will be more performant than writing that code in a higher-level language.

30 Answers
  • 2020-12-02 03:49

    It all depends on your workload.

    For day-to-day work, C and C++ are just fine, but certain workloads, particularly video transforms (compression, decompression, image effects, and so on), pretty much require assembly to be performant.

    They also usually involve using CPU-specific instruction-set extensions (MMX, SSE, and the like) that are tuned for those kinds of operations.

  • 2020-12-02 03:49

    Given the right programmer, assembler programs can always be made faster than their C counterparts (at least marginally). It would be difficult to write a C program where the compiler's output couldn't be improved by removing at least one instruction.

  • 2020-12-02 03:50

    Although C is "close" to the low-level manipulation of 8-, 16-, 32-, and 64-bit data, there are a few mathematical operations not supported by C that can often be performed elegantly in certain assembly instruction sets:

    1. Fixed-point multiplication: The product of two 16-bit numbers is a 32-bit number. But the rules of C say that the product of two 16-bit numbers is a 16-bit number, and the product of two 32-bit numbers is a 32-bit number (the bottom half in both cases). If you want the top half of a 16x16 multiply or a 32x32 multiply, you have to play games with the compiler. The general method is to cast to a wider type, multiply, shift down, and cast back:

      #include <stdint.h>   // int16_t, int32_t

      int16_t x, y;
      // set x and y to something
      int16_t prod = (int16_t)(((int32_t)x * y) >> 16);  // top half of the 16x16 product
      

      In this case the compiler may be smart enough to know that you're really just trying to get the top half of a 16x16 multiply and use the machine's native 16x16 multiply. Or it may be stupid and emit a library call for the 32x32 multiply, which is way overkill because you only need 16 bits of the product; the C standard doesn't give you any way to express that intent.

    2. Certain bitshifting operations (rotation/carries):

      #include <stdint.h>

      // 256-bit value, stored as 32 bytes, shifted right in its entirety:
      uint8_t x[32];
      for (int i = 32; --i > 0; )
      {
         // the low bit of x[i-1] becomes the high bit of x[i]
         x[i] = (uint8_t)((x[i] >> 1) | (x[i-1] << 7));
      }
      x[0] >>= 1;
      

      This is not too inelegant in C, but again, unless the compiler is smart enough to realize what you are doing, it's going to do a lot of "unnecessary" work. Many assembly instruction sets let you rotate or shift right with the result going through the carry flag, so you could accomplish the above in 34 instructions: load a pointer to the beginning of the array, clear the carry, and perform 32 8-bit right-shifts through carry, using auto-increment on the pointer.

      For another example, linear feedback shift registers (LFSRs) are performed elegantly in assembly: take a chunk of N bits (8, 16, 32, 64, 128, etc.), shift the whole thing right by 1 (see the algorithm above), and then, if the resulting carry is 1, XOR in a bit pattern that represents the polynomial (a C sketch follows).
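
      A minimal C sketch of one such LFSR step (Galois form, 32-bit); the polynomial constant here is just an example tap set, and the shifted-out bit plays the role the carry flag would play in assembly:

      #include <stdint.h>

      // One right-shift step of a 32-bit Galois LFSR (state must be nonzero).
      uint32_t lfsr_step(uint32_t state)
      {
         uint32_t carry = state & 1u;   // the bit that falls off the right end
         state >>= 1;                   // shift the whole register right by 1
         if (carry)
            state ^= 0xB4BCD35Cu;       // XOR in an example feedback polynomial
         return state;
      }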

    Having said that, I wouldn't resort to these techniques unless I had serious performance constraints. As others have said, assembly is much harder to document/debug/test/maintain than C code: the performance gain comes with some serious costs.

    edit: 3. Overflow detection is possible in assembly but not really in portable C; this makes some algorithms much easier.
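
      For illustration, a sketch using the GCC/Clang checked-arithmetic builtin (a compiler extension, not portable C), which exposes the same overflow flag an assembly version would test:

      #include <stdint.h>
      #include <stdbool.h>

      // Stores the wrapped result in *sum and returns true if the addition
      // overflowed; compiles to an add plus an overflow-flag check.
      bool checked_add(int32_t a, int32_t b, int32_t *sum)
      {
         return __builtin_add_overflow(a, b, sum);
      }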

  • 2020-12-02 03:51

    Here is a real-world example: fixed-point multiplies on old compilers.

    These don't just come in handy on devices without floating point; they also shine when it comes to precision, as they give you 32 bits of precision with a predictable error (a float has only 24 significant bits, 23 stored explicitly, and it's harder to predict precision loss), i.e. uniform absolute precision over the entire range instead of the close-to-uniform relative precision of float.


    Modern compilers optimize this fixed-point example nicely, so for more modern examples that still need compiler-specific code, see

    • Getting the high part of 64 bit integer multiplication: a portable version that builds the result from 32x32 => 64-bit partial multiplies using uint64_t fails to optimize on a 64-bit CPU, so you need intrinsics or __int128 for efficient code on 64-bit systems (see the sketch after this list).
    • _umul128 on Windows 32 bits: MSVC doesn't always do a good job when multiplying 32-bit integers cast to 64, so intrinsics helped a lot.
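
    As a sketch of the __int128 approach (a GCC/Clang extension available on 64-bit targets, not standard C):

    #include <stdint.h>

    // High 64 bits of a 64x64 => 128-bit multiply. On x86-64 this compiles
    // to a single widening MUL (high half in RDX); AArch64 uses UMULH.
    uint64_t mulhi64(uint64_t a, uint64_t b)
    {
        return (uint64_t)(((unsigned __int128)a * b) >> 64);
    }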

    C doesn't have a full-multiplication operator (2N-bit result from N-bit inputs). The usual way to express it in C is to cast the inputs to the wider type and hope the compiler recognizes that the upper bits of the inputs aren't interesting:

    // On a 32-bit machine, int holds a 32-bit fixed-point value.
    inline int FixedPointMul(int a, int b)
    {
        long long a_long = a;              // widen to 64 bit
        long long product = a_long * b;    // 64-bit multiply
        return (int)(product >> 16);       // shift out the fixed-point bias
    }
    

    The problem with this code is that we do something that can't be directly expressed in the C language. We want to multiply two 32-bit numbers and get a 64-bit result, of which we return the middle 32 bits. However, in C this multiply does not exist. All you can do is promote the integers to 64 bits and do a 64*64 = 64 multiply.

    x86 (and ARM, MIPS and others) can, however, do the multiply in a single instruction. Some compilers used to ignore this fact and generate code that calls a runtime library function to do the multiply. The shift by 16 was also often done by a library routine, even though x86 can do such shifts in hardware.

    So we're left with one or two library calls just for a multiply. This has serious consequences: not only is the shift slower, registers must be preserved across the function calls, and it doesn't help inlining or loop unrolling either.

    If you rewrite the same code in (inline) assembler you can gain a significant speed boost.

    In addition to this: using asm is not the best way to solve the problem. Most compilers let you use some assembler instructions in intrinsic form when you can't express them in C. The VS.NET 2008 compiler, for example, exposes the 32*32 = 64-bit multiply as __emul and the 64-bit shift as __ll_rshift.

    Using intrinsics, you can rewrite the function in a way that gives the C compiler a chance to understand what's going on. This allows the code to be inlined and register-allocated, and common subexpression elimination and constant propagation can be applied as well. You'll get a huge performance improvement over hand-written assembler that way.

    For reference: The end-result for the fixed-point mul for the VS.NET compiler is:

    inline int FixedPointMul(int a, int b)
    {
        return (int)__ll_rshift(__emul(a, b), 16);
    }
    

    The performance difference for fixed-point divides is even bigger. I got improvements of up to a factor of 10 for division-heavy fixed-point code by writing a couple of asm lines.


    Using Visual C++ 2013 gives the same assembly code for both versions.

    GCC 4.1 from 2007 also optimizes the pure-C version nicely. (The Godbolt compiler explorer doesn't have any earlier versions of GCC installed, but presumably even older GCC versions could do this without intrinsics.)

    See source + asm for x86 (32-bit) and ARM on the Godbolt compiler explorer. (Unfortunately it doesn't have any compilers old enough to produce bad code from the simple pure C version.)


    Modern CPUs can do things C doesn't have operators for at all, like popcnt or bit-scan to find the first or last set bit. (POSIX has an ffs() function, but its semantics don't match x86 bsf / bsr. See https://en.wikipedia.org/wiki/Find_first_set.)

    Some compilers can sometimes recognize a loop that counts the number of set bits in an integer and compile it to a popcnt instruction (if enabled at compile time), but it's much more reliable to use __builtin_popcount in GNU C, or, on x86 if you're only targeting hardware with SSE4.2, _mm_popcnt_u32 from <immintrin.h>.
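
    A minimal sketch contrasting the two approaches, assuming GNU C:

    #include <stdint.h>

    // The naive loop the compiler has to pattern-match:
    unsigned popcount_loop(uint32_t x)
    {
        unsigned n = 0;
        for (; x != 0; x >>= 1)
            n += x & 1u;
        return n;
    }

    // The builtin: a single popcnt instruction when compiled with
    // -mpopcnt (or -msse4.2), otherwise a bit-trick fallback.
    unsigned popcount_builtin(uint32_t x)
    {
        return (unsigned)__builtin_popcount(x);
    }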

    Or in C++, assign to a std::bitset<32> and use .count(). (This is a case where the language has found a way to portably expose an optimized implementation of popcount through the standard library, in a way that will always compile to something correct, and can take advantage of whatever the target supports.) See also https://en.wikipedia.org/wiki/Hamming_weight#Language_support.

    Similarly, ntohl can compile to bswap (x86 32-bit byte swap for endian conversion) on some C implementations that have it.
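
    The GCC/Clang byte-swap builtin (an extension, not standard C) expresses the same thing directly:

    #include <stdint.h>

    // Compiles to a single bswap instruction on x86.
    uint32_t swap32(uint32_t x)
    {
        return __builtin_bswap32(x);
    }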


    Another major area for intrinsics or hand-written asm is manual vectorization with SIMD instructions. Compilers are not bad with simple loops like dst[i] += src[i] * 10.0;, but often do badly or don't auto-vectorize at all when things get more complicated. For example, you're unlikely to get anything like "How to implement atoi using SIMD?" generated automatically by the compiler from scalar code.
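
    As a sketch of what manual vectorization looks like with SSE2 intrinsics, for the simple loop above (assuming n is a multiple of 2 and 16-byte-aligned pointers):

    #include <stddef.h>
    #include <emmintrin.h>   // SSE2 intrinsics

    void scale_add(double *dst, const double *src, size_t n)
    {
        const __m128d ten = _mm_set1_pd(10.0);      // broadcast 10.0 into both lanes
        for (size_t i = 0; i < n; i += 2) {
            __m128d s = _mm_load_pd(src + i);       // load two doubles
            __m128d d = _mm_load_pd(dst + i);
            d = _mm_add_pd(d, _mm_mul_pd(s, ten));  // dst[i] += src[i] * 10.0
            _mm_store_pd(dst + i, d);               // store two doubles back
        }
    }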

  • 2020-12-02 03:53

    Only when using special-purpose instruction sets that the compiler doesn't support.

    To maximize the computing power of a modern CPU with multiple pipelines and branch prediction, you need to structure the assembly program in a way that makes it (a) almost impossible for a human to write and (b) even harder to maintain.

    Also, better algorithms, data structures and memory management will give you at least an order of magnitude more performance than the micro-optimizations you can do in assembly.

  • 2020-12-02 03:53

    I think the general case where assembler is faster is when a smart assembly programmer looks at the compiler's output, says "this is a critical path for performance and I can write this to be more efficient", and then tweaks that assembly or rewrites it from scratch.
