Why does an inline function have lower efficiency than an in-built function?

挽巷 2021-01-03 20:59

I was trying a question on arrays in InterviewBit. In this question I made an inline function returning the absolute value of an integer. But I was told that my algorithm was not efficient.

3 Answers
  • 2021-01-03 21:19

    Your solution might arguably be "cleaner" by the textbook if you used the standard library version, but I think the evaluation is wrong. There is no truly good, justifiable reason for your code being rejected.

    This is one of those cases where someone is formally correct (by the textbook), but insists on the one "correct" solution in a rather obtuse way instead of accepting an alternative and saying "...but this here would be best practice, you know".

    Technically, it's a correct, practical approach to say "use the standard library, that's what it is for, and it's likely optimized as much as can be". Even though the "optimized as much as can be" part can, in some situations, very well be wrong due to some constraints that the standard puts onto certain algorithms and/or containers.

    Now, opinions, best practice, and religion aside. Factually, if you compare the two approaches...

    int main(int argc, char**)
    {
      40f360:       53                      push   %rbx
      40f361:       48 83 ec 20             sub    $0x20,%rsp
      40f365:       89 cb                   mov    %ecx,%ebx
      40f367:       e8 a4 be ff ff          callq  40b210 <__main>
    return std::abs(argc);
      40f36c:       89 da                   mov    %ebx,%edx
      40f36e:       89 d8                   mov    %ebx,%eax
      40f370:       c1 fa 1f                sar    $0x1f,%edx
      40f373:       31 d0                   xor    %edx,%eax
      40f375:       29 d0                   sub    %edx,%eax
    //}
    
    int main(int argc, char**)
    {
      40f360:       53                      push   %rbx
      40f361:       48 83 ec 20             sub    $0x20,%rsp
      40f365:       89 cb                   mov    %ecx,%ebx
      40f367:       e8 a4 be ff ff          callq  40b210 <__main>
    return (argc > 0) ? argc : -argc;
      40f36c:       89 da                   mov    %ebx,%edx
      40f36e:       89 d8                   mov    %ebx,%eax
      40f370:       c1 fa 1f                sar    $0x1f,%edx
      40f373:       31 d0                   xor    %edx,%eax
      40f375:       29 d0                   sub    %edx,%eax
    //}
    

    ... they result in exactly the same, identical instructions.

    But even if the compiler did use a compare followed by a conditional move (which it may do in more complicated "branching assignments", and which it will do for example in the case of min/max), that's maybe one CPU cycle or so slower than the bit hacks. So unless you do this several million times, the statement "not efficient" is rather doubtful anyway.
    One cache miss, and you have a hundred times the penalty of a conditional move.
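
    If you really want numbers, a quick-and-dirty benchmark along these lines will do (names are just for illustration). It includes <cstdlib> for the integer overload, needs a C++14 compiler for the generic lambda, and the loop is memory-bound anyway, which is rather the point:

    #include <chrono>
    #include <cstdio>
    #include <cstdlib>
    #include <vector>

    static int my_abs(int v) { return (v > 0) ? v : -v; }

    int main()
    {
        std::vector<int> data(10'000'000);
        for (std::size_t i = 0; i < data.size(); ++i)
            data[i] = static_cast<int>(i) - 5'000'000;

        // Run a functor over the whole array and print the sum so the work
        // cannot be optimized away entirely.
        auto run = [&](const char* name, auto f) {
            auto t0 = std::chrono::steady_clock::now();
            long long sum = 0;
            for (int v : data)
                sum += f(v);
            auto t1 = std::chrono::steady_clock::now();
            auto us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
            std::printf("%s: sum=%lld, %lld us\n", name, sum, static_cast<long long>(us));
        };

        run("std::abs", [](int v) { return std::abs(v); });
        run("my_abs  ", [](int v) { return my_abs(v); });
    }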

    There are valid arguments for and against either approach, which I won't discuss in length. My point is, turning down the OP's solution as "totally wrong" because of such a petty, unimportant detail is rather narrow-minded.

    EDIT:

    (Fun trivia)

    I just tried, for fun and no profit, on my Linux Mint box which uses a somewhat older version of GCC (5.4 as compared to 7.1 above).

    Because I included <cmath> without much thought (hey, a function like abs very clearly belongs to math, doesn't it?) rather than <cstdlib>, which hosts the integer overload, the result was, well... surprising. Calling the library function was much inferior to the single-expression wrapper.

    Now, in defense of the standard library, if you include <cstdlib>, then, again, the produced output is exactly identical in either case.

    For reference, the test code looked like:

    #ifdef DRY
      #include <cmath>
      int main(int argc, char**)
      {
         return std::abs(argc);
      }
    #else
      int abs(int v) noexcept { return (v >= 0) ? v : -v; }
      int main(int argc, char**)
      {
         return abs(argc);
      }
    #endif
    

    ...resulting in

    4004f0: 89 fa                   mov    %edi,%edx
    4004f2: 89 f8                   mov    %edi,%eax
    4004f4: c1 fa 1f                sar    $0x1f,%edx
    4004f7: 31 d0                   xor    %edx,%eax
    4004f9: 29 d0                   sub    %edx,%eax
    4004fb: c3                      retq 
    

    Now, it is apparently quite easy to fall into the trap of unwittingly using the wrong standard library function (I demonstrated how easy it is myself!), and all that without the slightest warning from the compiler, such as "hey, you know, you're using a double overload on an integer value" (well, obviously there's no warning; it's a valid conversion).
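
    One cheap way to protect against that trap is to pin the overload down at compile time. Just a sketch (and on newer standard libraries <cmath> may well declare the integer overload anyway, in which case the assertion simply passes):

    #include <cmath>          // deliberately not <cstdlib>
    #include <type_traits>

    int main(int argc, char**)
    {
        // If std::abs(argc) picks a floating-point overload, this fails to compile
        // instead of silently converting int -> double and back.
        static_assert(std::is_same<decltype(std::abs(argc)), int>::value,
                      "std::abs(int) did not resolve to the integer overload");
        return std::abs(argc);
    }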

    With that in mind, there may be yet another "justification" why the OP's own one-liner wasn't all that terribly bad and wrong. After all, he could have made that same mistake.

  • 2021-01-03 21:21

    Your abs branches on a condition, while the built-in variant just strips the sign without branching, most likely using only a couple of instructions. A possible assembly example (taken from here):

    cdq
    xor eax, edx
    sub eax, edx
    

    The cdq instruction copies the sign of the register eax into register edx. For example, if it is a positive number, edx will be zero; otherwise, edx will be 0xFFFFFFFF, which denotes -1. The xor operation with the original number changes nothing if it is a positive number (any number xor 0 is left unchanged). However, when eax is negative, eax xor 0xFFFFFFFF yields (not eax). The final step is to subtract edx from eax. Again, if eax is positive, edx is zero and the final value stays the same. For negative values, (~eax) - (-1) = ~eax + 1 = -eax, which is the wanted value.

    As you can see this approach uses only three simple arithmetic instructions and no conditional branching at all.
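
    The same trick can be spelled out directly in C++. A small sketch (the function name is just for illustration), assuming a 32-bit int, two's complement and an arithmetic right shift for negative values (implementation-defined before C++20, but what every mainstream compiler does); like abs itself it is undefined for INT_MIN:

    inline int branchless_abs(int v)
    {
        int mask = v >> 31;        // 0 for non-negative v, -1 (all bits set) for negative v
        return (v ^ mask) - mask;  // v unchanged when mask == 0; ~v + 1 == -v when mask == -1
    }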

    Edit: After some research it turned out that many built-in implementations of abs use the same approach, return __x >= 0 ? __x : -__x;, and such a pattern is an obvious target for compiler optimization to avoid unnecessary branching.

    However, that does not justify the use of a custom abs implementation, as it violates the DRY principle and no one can guarantee that your implementation will be just as good for more sophisticated scenarios and/or unusual platforms. Typically one should think about rewriting some of the library functions only when there is a definite performance problem or some other defect detected in the existing implementation.

    Edit2: Just switching from int to float shows considerable performance degradation:

    float libfoo(float x)
    {
        return ::std::fabs(x);
    }
    
    andps   xmm0, xmmword ptr [rip + .LCPI0_0]
    

    And a custom version:

    inline float my_fabs(float x)
    {
        return x>0.0f?x:-x;
    }
    
    float myfoo(float x)
    {
        return my_fabs(x);
    }
    
    movaps  xmm1, xmmword ptr [rip + .LCPI1_0] # xmm1 = [-0.000000e+00,-0.000000e+00,-0.000000e+00,-0.000000e+00]
    xorps   xmm1, xmm0
    xorps   xmm2, xmm2
    cmpltss xmm2, xmm0
    andps   xmm0, xmm2
    andnps  xmm2, xmm1
    orps    xmm0, xmm2
    

    (Part of the reason is that x > 0.0f ? x : -x is not exactly equivalent to fabs: for x == +0.0f the comparison is false and the result is -0.0f, and the sign of a NaN is flipped rather than cleared, so the compiler cannot legally reduce the ternary to a single sign-bit mask.)

    online compiler
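
    A hand-written float version that clears the sign bit directly, instead of branching on a comparison, should get the single-instruction code back. A sketch assuming C++20 (for std::bit_cast) and 32-bit IEEE-754 floats; the name is just for illustration:

    #include <bit>
    #include <cstdint>

    // Clears the IEEE-754 sign bit, like fabs does; with optimizations enabled this
    // typically compiles down to a single andps (or equivalent) again.
    inline float my_fabs2(float x)
    {
        return std::bit_cast<float>(std::bit_cast<std::uint32_t>(x) & 0x7fffffffu);
    }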

  • 2021-01-03 21:23

    I don't agree with their verdict. They are clearly wrong.

    On current optimizing compilers, both solutions produce exactly the same output. And even if they didn't, they would produce code just as efficient as the library version (it could be a little surprising that everything matches: the algorithm, even the registers used. Maybe the actual library implementation is the same as the OP's?).

    No sane optimizing compiler will create a branch in your abs() code (if it can be done without a branch), as the other answer suggests. If the compiler is not optimizing, then it may not inline the library abs(), so that won't be fast either.

    Optimizing abs() is one of the easiest things for a compiler to do (just add an entry for it in the peephole optimizer, and done).

    Furthermore, I've seen library implementations in the past where abs() was implemented as a non-inline library function (that was a long time ago, though).

    Proof that both implementations are the same:

    GCC:

    myabs:
        mov     edx, edi    ; argument passed in EDI by System V AMD64 calling convention
        mov     eax, edi
        sar     edx, 31
        xor     eax, edx
        sub     eax, edx
        ret
    
    libabs:
        mov     edx, edi    ; argument passed in EDI by System V AMD64 calling convention
        mov     eax, edi
        sar     edx, 31
        xor     eax, edx
        sub     eax, edx
        ret
    

    Clang:

    myabs:
        mov     eax, edi    ; argument passed in EDI by System V AMD64 calling convention
        neg     eax
        cmovl   eax, edi
        ret
    
    libabs:
        mov     eax, edi    ; argument passed in EDI by System V AMD64 calling convention
        neg     eax
        cmovl   eax, edi
        ret
    

    Visual Studio (MSVC):

    libabs:
        mov      eax, ecx    ; argument passed in ECX by Windows 64-bit calling convention 
        cdq
        xor      eax, edx
        sub      eax, edx
        ret      0
    
    myabs:
        mov      eax, ecx    ; argument passed in ECX by Windows 64-bit calling convention 
        cdq
        xor      eax, edx
        sub      eax, edx
        ret      0
    

    ICC:

    myabs:
        mov       eax, edi    ; argument passed in EDI by System V AMD64 calling convention 
        cdq
        xor       edi, edx
        sub       edi, edx
        mov       eax, edi
        ret      
    
    libabs:
        mov       eax, edi    ; argument passed in EDI by System V AMD64 calling convention 
        cdq
        xor       edi, edx
        sub       edi, edx
        mov       eax, edi
        ret      
    

    See for yourself on Godbolt Compiler Explorer, where you can inspect the machine code generated by various compilers. (Link kindly provided by Peter Cordes.)
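
    For reference, a pair of functions along these lines, compiled with optimizations enabled, should reproduce the listings above:

    #include <cstdlib>

    int myabs(int x)
    {
        return x >= 0 ? x : -x;
    }

    int libabs(int x)
    {
        return std::abs(x);
    }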
