Why is memcmp(a, b, 4) only sometimes optimized to a uint32 comparison?


Given this code:

#include <string.h>

int equal4(const char* a, const char* b)
{
    return memcmp(a, b, 4) == 0;
}

int less4(const char* a, const char* b)
{
    return memcmp(a, b, 4) < 0;
}
4 Answers

    As discussed in other answers/comments, using memcmp(a,b,4) < 0 is equivalent to an unsigned comparison between big-endian integers. It couldn't inline as efficiently as == 0 on little-endian x86.
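
    To see that equivalence concretely, here's a small sketch (my example, not part of the original answer) where lexicographic memcmp order and little-endian native integer order disagree:

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    int main(void) {
        // Bytes chosen so byte-wise (big-endian / lexicographic) order and
        // little-endian native order disagree.
        const unsigned char a[4] = {0x01, 0x00, 0x00, 0x00};
        const unsigned char b[4] = {0x00, 0x00, 0x00, 0x02};

        uint32_t ai, bi;
        memcpy(&ai, a, 4);
        memcpy(&bi, b, 4);

        // memcmp compares byte-by-byte: a[0]=0x01 > b[0]=0x00, so memcmp(a,b,4) > 0,
        // but on little-endian x86, ai = 0x00000001 < bi = 0x02000000.
        printf("memcmp(a,b,4) > 0: %d\n", memcmp(a, b, 4) > 0);
        printf("native a < b:      %d\n", ai < bi);
        return 0;
    }

    On a little-endian machine both lines print 1: memcmp says a > b while the native uint32_t comparison says a < b.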

    More importantly, the current version of this behaviour in gcc7/8 only looks for memcmp() == 0 or != 0. Even on a big-endian target where this could inline just as efficiently for < or >, gcc won't do it. (Godbolt's newest big-endian compilers are PowerPC 64 gcc6.3 and MIPS/MIPS64 gcc5.4; mips is big-endian MIPS, while mipsel is little-endian MIPS.) If testing this with future gcc, use a = __builtin_assume_aligned(a, 4) to make sure gcc doesn't have to worry about unaligned-load performance/correctness on non-x86. (Or just use const int32_t* instead of const char*.)
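
    As a minimal sketch of that alignment hint (assuming GCC/Clang's __builtin_assume_aligned; this function is not part of the original question):

    #include <string.h>

    int equal4_assume_aligned(const char* a, const char* b)
    {
        // Promise the compiler both pointers are 4-byte aligned, so it can use
        // plain word loads on targets where unaligned access is slow or faults.
        a = __builtin_assume_aligned(a, 4);
        b = __builtin_assume_aligned(b, 4);
        return memcmp(a, b, 4) == 0;
    }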

    If/when gcc learns to inline memcmp for cases other than EQ/NE, maybe gcc will do it on little-endian x86 when its heuristics tell it the extra code size will be worth it. e.g. in a hot loop when compiling with -fprofile-use (profile-guided optimization).


    If you want compilers to do a good job for this case, you should probably assign to a uint32_t and use an endian-conversion function like ntohl. But make sure you pick one that can actually inline; apparently Windows has an ntohl that compiles to a DLL call. See other answers on that question for some portable-endian stuff, and also someone's imperfect attempt at a portable_endian.h, and this fork of it. I was working on a version for a while, but never finished/tested it or posted it.

    The pointer-casting may be Undefined Behaviour, depending on how you wrote the bytes and what the char* points to. If you're not sure about strict-aliasing and/or alignment, memcpy into abytes and bbytes instead (see the memcpy-based sketch further down). Most compilers are good at optimizing away small fixed-size memcpy.

    // I know the question just wonders why gcc does what it does,
    // not asking for how to write it differently.
    // Beware of alignment performance or even fault issues outside of x86.
    
    #include <stdint.h>
    #include <endian.h>    // be32toh: glibc/BSD, not ISO C
    
    int equal4_optim(const char* a, const char* b) {
        uint32_t abytes = *(const uint32_t*)a;
        uint32_t bbytes = *(const uint32_t*)b;
    
        return abytes == bbytes;
    }
    
    
    int less4_optim(const char* a, const char* b) {
        uint32_t a_native = be32toh(*(const uint32_t*)a);
        uint32_t b_native = be32toh(*(const uint32_t*)b);
    
        return a_native < b_native;
    }
    

    I checked on Godbolt, and that compiles to efficient code (basically identical to what I wrote in asm below), especially on big-endian platforms, even with old gcc. It also makes much better code than ICC17, which inlines memcmp but only to a byte-compare loop (even for the == 0 case).
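
    For reference, here's a strict-aliasing-safe variant using memcpy instead of pointer casts, as suggested above (my own sketch; the function name is mine, and be32toh assumes glibc/BSD <endian.h>). Compilers typically optimize the fixed-size memcpy into a single load, so it usually compiles to the same asm as the cast version:

    #include <stdint.h>
    #include <string.h>
    #include <endian.h>    // be32toh: glibc/BSD, not ISO C

    int less4_memcpy(const char* a, const char* b) {
        uint32_t abytes, bbytes;
        memcpy(&abytes, a, 4);   // no alignment or strict-aliasing assumptions
        memcpy(&bbytes, b, 4);
        // Convert from big-endian (memcmp's byte order) to native before comparing.
        return be32toh(abytes) < be32toh(bbytes);
    }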


    I think this hand-crafted sequence is an optimal implementation of less4() (for the x86-64 System V calling convention, as used in the question, with const char *a in rdi and b in rsi).

    less4:
        mov   edi, [rdi]
        mov   esi, [rsi]
        bswap edi
        bswap esi
        # data loaded and byte-swapped to native unsigned integers
        xor   eax,eax    # solves the same problem as gcc's movzx, see below
        cmp   edi, esi
        setb  al         # eax=1 if *a was Below(unsigned) *b, else 0
        ret
    

    Those are all single-uop instructions on Intel and AMD CPUs since K8 and Core2 (http://agner.org/optimize/).

    Having to bswap both operands has an extra code-size cost vs. the == 0 case: we can't fold one of the loads into a memory operand for cmp. (That saves code size, and uops thanks to micro-fusion.) This is on top of the two extra bswap instructions.

    On CPUs that support movbe, it can save code size: movbe ecx, [rsi] is a load + bswap. On Haswell, it's 2 uops, so presumably it decodes to the same uops as mov ecx, [rsi] / bswap ecx. On Atom/Silvermont, it's handled right in the load ports, so it's fewer uops as well as smaller code-size.

    See the setcc part of my xor-zeroing answer for more about why xor/cmp/setcc (which clang uses) is better than cmp/setcc/movzx (typical for gcc).

    In the usual case where this inlines into code that branches on the result, the setcc + zero-extend are replaced with a jcc; the compiler optimizes away creating a boolean return value in a register. This is yet another advantage of inlining: the library memcmp does have to create an integer boolean return value which the caller tests, because no x86 ABI/calling convention allows for returning boolean conditions in flags. (I don't know of any non-x86 calling conventions that do that either). For most library memcmp implementations, there's also significant overhead in choosing a strategy depending on length, and maybe alignment checking. That can be pretty cheap, but for size 4 it's going to be more than the cost of all the real work.
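
    To illustrate that last point, here's a hypothetical caller (not from the original answer): when equal4_optim inlines into it, gcc/clang can branch directly on flags (cmp/jne) instead of materializing a 0/1 value with setcc + movzx:

    void store_if_equal4(const char* a, const char* b, int* out, int val) {
        // With equal4_optim inlined, the boolean return value never exists:
        // the compiler emits two loads, a cmp, and a conditional branch.
        if (equal4_optim(a, b))
            *out = val;
    }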
