Why is memcmp(a, b, 4) only sometimes optimized to a uint32 comparison?

后端 未结 4 934
暗喜
暗喜 2021-02-03 16:36

Given this code:

#include 

int equal4(const char* a, const char* b)
{
    return memcmp(a, b, 4) == 0;
}

int less4(const char* a, const char* b         


        
4条回答
  •  挽巷
    挽巷 (楼主)
    2021-02-03 17:32

    As discussed in other answers/comments, using memcmp(a,b,4) < 0 is equivalent to an unsigned comparison between big-endian integers. It couldn't inline as efficiently as == 0 on little-endian x86.

    More importantly, the current version of this behaviour in gcc7/8 only looks for memcmp() == 0 or != 0. Even on a big-endian target where this could inline just as efficiently for < or >, gcc won't do it. (Godbolt's newest big-endian compilers are PowerPC 64 gcc6.3, and MIPS/MIPS64 gcc5.4. mips is big-endian MIPS, while mipsel is little-endian MIPS.) If testing this with future gcc, use a = __builtin_assume_align(a, 4) to make sure gcc doesn't have to worry about unaligned-load performance/correctness on non-x86. (Or just use const int32_t* instead of const char*.)

    If/when gcc learns to inline memcmp for cases other than EQ/NE, maybe gcc will do it on little-endian x86 when its heuristics tell it the extra code size will be worth it. e.g. in a hot loop when compiling with -fprofile-use (profile-guided optimization).


    If you want compilers to do a good job for this case, you should probably assign to a uint32_t and use an endian-conversion function like ntohl. But make sure you pick one that can actually inline; apparently Windows has an ntohl that compiles to a DLL call. See other answers on that question for some portable-endian stuff, and also someone's imperfect attempt at a portable_endian.h, and this fork of it. I was working on a version for a while, but never finished/tested it or posted it.

    The pointer-casting may be Undefined Behaviour, depending on how you wrote the bytes and what the char* points to. If you're not sure about strict-aliasing and/or alignment, memcpy into abytes. Most compilers are good at optimizing away small fixed-size memcpy.

    // I know the question just wonders why gcc does what it does,
    // not asking for how to write it differently.
    // Beware of alignment performance or even fault issues outside of x86.
    
    #include 
    #include 
    
    int equal4_optim(const char* a, const char* b) {
        uint32_t abytes = *(const uint32_t*)a;
        uint32_t bbytes = *(const uint32_t*)b;
    
        return abytes == bbytes;
    }
    
    
    int less4_optim(const char* a, const char* b) {
        uint32_t a_native = be32toh(*(const uint32_t*)a);
        uint32_t b_native = be32toh(*(const uint32_t*)b);
    
        return a_native < b_native;
    }
    

    I checked on Godbolt, and that compiles to efficient code (basically identical to what I wrote in asm below), especially on big-endian platforms, even with old gcc. It also makes much better code than ICC17, which inlines memcmp but only to a byte-compare loop (even for the == 0 case.


    I think this hand-crafted sequence is an optimal implementation of less4() (for the x86-64 SystemV calling convention, like used in the question, with const char *a in rdi and b in rsi).

    less4:
        mov   edi, [rdi]
        mov   esi, [rsi]
        bswap edi
        bswap esi
        # data loaded and byte-swapped to native unsigned integers
        xor   eax,eax    # solves the same problem as gcc's movzx, see below
        cmp   edi, esi
        setb  al         # eax=1 if *a was Below(unsigned) *b, else 0
        ret
    

    Those are all single-uop instructions on Intel and AMD CPUs since K8 and Core2 (http://agner.org/optimize/).

    Having to bswap both operands has an extra code-size cost vs. the == 0 case: we can't fold one of the loads into a memory operand for cmp. (That saves code size, and uops thanks to micro-fusion.) This is on top the two extra bswap instructions.

    On CPUs that support movbe, it can save code size: movbe ecx, [rsi] is a load + bswap. On Haswell, it's 2 uops, so presumably it decodes to the same uops as mov ecx, [rsi] / bswap ecx. On Atom/Silvermont, it's handled right in the load ports, so it's fewer uops as well as smaller code-size.

    See the setcc part of my xor-zeroing answer for more about why xor/cmp/setcc (which clang uses) is better than cmp/setcc/movzx (typical for gcc).

    In the usual case where this inlines into code that branches on the result, the setcc + zero-extend are replaced with a jcc; the compiler optimizes away creating a boolean return value in a register. This is yet another advantage of inlining: the library memcmp does have to create an integer boolean return value which the caller tests, because no x86 ABI/calling convention allows for returning boolean conditions in flags. (I don't know of any non-x86 calling conventions that do that either). For most library memcmp implementations, there's also significant overhead in choosing a strategy depending on length, and maybe alignment checking. That can be pretty cheap, but for size 4 it's going to be more than the cost of all the real work.

提交回复
热议问题