Given this code:

```c
#include <string.h>

int equal4(const char* a, const char* b)
{
    return memcmp(a, b, 4) == 0;
}

int less4(const char* a, const char* b)
{
    return memcmp(a, b, 4) < 0;
}
```
As discussed in other answers/comments, using `memcmp(a,b,4) < 0` is equivalent to an `unsigned` comparison between big-endian integers. It couldn't inline as efficiently as `== 0` on little-endian x86.
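To see why byte order matters here: `memcmp` compares bytes starting at the lowest address, which is the most-significant byte of a big-endian integer. A quick illustration (the byte values are just made-up examples):

```c
static const unsigned char x[4] = {0x01, 0xFF, 0x00, 0x00};  /* big-endian 0x01FF0000 */
static const unsigned char y[4] = {0x02, 0x00, 0x00, 0x00};  /* big-endian 0x02000000 */
/* memcmp(x, y, 4) < 0, matching 0x01FF0000u < 0x02000000u.
   A little-endian 32-bit load would give 0x0000FF01 vs. 0x00000002,
   which orders the other way, so a plain load + cmp can't implement < 0 there. */
```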
More importantly, the current version of this optimization in gcc7/8 only looks for `memcmp() == 0` or `!= 0`. Even on a big-endian target where this could inline just as efficiently for `<` or `>`, gcc won't do it. (Godbolt's newest big-endian compilers are PowerPC64 gcc6.3 and MIPS/MIPS64 gcc5.4. `mips` is big-endian MIPS, while `mipsel` is little-endian MIPS.) If testing this with future gcc, use `a = __builtin_assume_aligned(a, 4)` to make sure gcc doesn't have to worry about unaligned-load performance/correctness on non-x86. (Or just use `const int32_t*` instead of `const char*`.)
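A test function along those lines could look like this (just a sketch of the idea):

```c
#include <string.h>

// Promise 4-byte alignment so unaligned-load concerns don't affect
// whether gcc is willing to inline the memcmp.
int less4_aligned(const char* a, const char* b)
{
    a = __builtin_assume_aligned(a, 4);
    b = __builtin_assume_aligned(b, 4);
    return memcmp(a, b, 4) < 0;
}
```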
If/when gcc learns to inline `memcmp` for cases other than EQ/NE, maybe gcc will do it on little-endian x86 when its heuristics tell it the extra code size will be worth it, e.g. in a hot loop when compiling with `-fprofile-use` (profile-guided optimization).
If you want compilers to do a good job for this case, you should probably assign to a `uint32_t` and use an endian-conversion function like `ntohl`. But make sure you pick one that can actually inline; apparently Windows has an `ntohl` that compiles to a DLL call. See other answers on that question for some portable-endian stuff, and also someone's imperfect attempt at a `portable_endian.h`, and this fork of it. I was working on a version for a while, but never finished/tested it or posted it.

The pointer-casting may be Undefined Behaviour, depending on how you wrote the bytes and what the `char*` points to. If you're not sure about strict-aliasing and/or alignment, `memcpy` into `abytes`. Most compilers are good at optimizing away small fixed-size `memcpy`.
```c
// I know the question just wonders why gcc does what it does,
// not asking for how to write it differently.
// Beware of alignment performance or even fault issues outside of x86.
#include <stdint.h>
#include <endian.h>   // be32toh: glibc/BSD, not standard C

int equal4_optim(const char* a, const char* b) {
    uint32_t abytes = *(const uint32_t*)a;
    uint32_t bbytes = *(const uint32_t*)b;
    return abytes == bbytes;
}

int less4_optim(const char* a, const char* b) {
    uint32_t a_native = be32toh(*(const uint32_t*)a);
    uint32_t b_native = be32toh(*(const uint32_t*)b);
    return a_native < b_native;
}
```
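If you want to dodge the strict-aliasing/alignment questions from above entirely, a `memcpy` version would look something like this (my sketch of the same idea, not separately benchmarked):

```c
#include <stdint.h>
#include <string.h>
#include <endian.h>   // be32toh; or use ntohl from <arpa/inet.h>

// memcpy sidesteps strict-aliasing and alignment UB; compilers optimize a
// fixed-size 4-byte memcpy into a single load where that's safe to do.
int less4_memcpy(const char* a, const char* b) {
    uint32_t abytes, bbytes;
    memcpy(&abytes, a, sizeof(abytes));
    memcpy(&bbytes, b, sizeof(bbytes));
    return be32toh(abytes) < be32toh(bbytes);
}
```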
I checked on Godbolt, and that compiles to efficient code (basically identical to what I wrote in asm below), especially on big-endian platforms, even with old gcc. It also makes much better code than ICC17, which inlines `memcmp` but only to a byte-compare loop (even for the `== 0` case).
I think this hand-crafted sequence is an optimal implementation of `less4()` (for the x86-64 System V calling convention, as used in the question, with `const char *a` in `rdi` and `b` in `rsi`).
```asm
less4:
    mov   edi, [rdi]
    mov   esi, [rsi]
    bswap edi
    bswap esi
    # data loaded and byte-swapped to native unsigned integers
    xor   eax, eax       # solves the same problem as gcc's movzx, see below
    cmp   edi, esi
    setb  al             # eax=1 if *a was Below(unsigned) *b, else 0
    ret
```
Those are all single-uop instructions on Intel and AMD CPUs since K8 and Core2 (http://agner.org/optimize/).

Having to bswap both operands has an extra code-size cost vs. the `== 0` case: we can't fold one of the loads into a memory operand for `cmp`. (That saves code size, and uops thanks to micro-fusion.) This is on top of the two extra `bswap` instructions.
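For comparison, a sketch of what the `== 0` case can look like (roughly what gcc emits for `equal4`, reproduced from memory rather than copied from a specific compiler version):

```asm
equal4:
    mov   eax, [rdi]      # load 4 bytes of *a
    cmp   eax, [rsi]      # load of *b folded into cmp's memory operand (micro-fused)
    sete  al              # byte order doesn't matter for pure equality
    movzx eax, al         # materialize the 0/1 int return value
    ret
```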
On CPUs that support `movbe`, it can save code size: `movbe ecx, [rsi]` is a load + bswap. On Haswell, it's 2 uops, so presumably it decodes to the same uops as `mov ecx, [rsi]` / `bswap ecx`. On Atom/Silvermont, it's handled right in the load ports, so it's fewer uops as well as smaller code-size.
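So on a target where `movbe` is available, the sequence could be written like this (sketch, same logic as above):

```asm
less4_movbe:                 # requires the movbe extension (Atom/Silvermont, Haswell+)
    movbe edi, [rdi]         # load + bswap in one instruction
    movbe esi, [rsi]
    xor   eax, eax
    cmp   edi, esi
    setb  al
    ret
```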
See the setcc part of my xor-zeroing answer for more about why xor/cmp/setcc (which clang uses) is better than cmp/setcc/movzx (typical for gcc).
In the usual case where this inlines into code that branches on the result, the setcc + zero-extend are replaced with a `jcc`; the compiler optimizes away creating a boolean return value in a register. This is yet another advantage of inlining: the library `memcmp` does have to create an integer boolean return value which the caller tests, because no x86 ABI/calling convention allows for returning boolean conditions in flags. (I don't know of any non-x86 calling conventions that do that either.) For most library `memcmp` implementations, there's also significant overhead in choosing a strategy depending on length, and maybe alignment checking. That can be pretty cheap, but for size 4 it's going to be more than the cost of all the real work.
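For instance, after inlining into a hypothetical caller the flag result can feed a branch directly, so the `setb`/zero-extend disappear (sketch, not real compiler output):

```asm
# if (less4(a, b)) { ... }   -- pointers already in rdi/rsi
    mov   edi, [rdi]
    mov   esi, [rsi]
    bswap edi
    bswap esi
    cmp   edi, esi
    jae   .Lskip             # jump if *a is not Below *b; no setb/movzx needed
    # ... body of the if ...
.Lskip:
```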