问题
Motivated by this question, I compared three different functions for checking if 8 bytes pointed to by the argument are zeros (note that in the original question, characters are compared with '0'
, not 0
):
bool f1(const char *ptr)
{
for (int i = 0; i < 8; i++)
if (ptr[i])
return false;
return true;
}
bool f2(const char *ptr)
{
bool res = true;
for (int i = 0; i < 8; i++)
res &= (ptr[i] == 0);
return res;
}
bool f3(const char *ptr)
{
static const char tmp[8]{};
return !std::memcmp(ptr, tmp, 8);
}
Though I would expect the same assembly outcome with enabled optimizations, only the memcmp
version was translated into a single cmp
instruction on x64. Both f1
and f2
were translated into either a winded or unwinded loop. Moreover, this holds for all GCC, Clang, and Intel compilers with -O3
.
Is there any reason why f1
and f2
cannot be optimized into a single compare instruction? It seem to be a pretty straightforward optimization to me.
Live demo: https://godbolt.org/z/j48366
回答1:
First of all, f1
stops reading at the first non-zero byte, so there are cases where it won't fault if you pass it a pointer to a shorter object near the end of a page, and the next page is unmapped. Unconditionally reading 8 bytes can fault in cases where f1
doesn't encounter UB, as @bruno points out. (Is it safe to read past the end of a buffer within the same page on x86 and x64?). The compiler doesn't know that you're never going to use it this way; it has to make code that works for every possible non-UB case for any hypothetical caller.
You can fix that by making the function arg const char ptr[static 8]
(but that's a C99 feature, not C++) to guarantee that it's safe to touch all 8 bytes even if the C abstract machine wouldn't. Then the compiler can safely invent reads. (A pointer to a struct {char buf[8]};
would also work, but wouldn't be strict-aliasing safe if the actual pointed-to object wasn't that.)
GCC and clang can't auto-vectorize loops whose trip-count isn't known before the first iteration. So that rules out all search loops like f1
, even if made it check a static array of known size or something. (ICC can vectorize some search loops like a naive strlen implementation, though.)
Your f2
could have been optimized the same as f3
, to a qword cmp
, without overcoming that major compiler-internals limitations because it always does 8 iterations. In fact, current nightly builds of clang do optimize f2, thanks @Tharwen for spotting that.
Recognizing loop patterns is not that simple, and takes compile time to look for. IDK how valuable this optimization would be in practice; that's what compiler devs need trade off against when considering writing more code to look for such patterns. (Maintenance cost of code, and compile-time cost.)
The value depends on how much real world code actually has patterns like this, as well as how big a saving it is when you find it. In this case it's a very nice saving, so it's not crazy for clang to look for it, especially if they have the infrastructure to turn a loop over 8 bytes into an 8-byte integer operation in general.
In practice, just use memcmp
if that's what you want; apparently most compilers don't spend time looking for patterns like f2
. Modern compilers do reliably inline it, especially for x86-64 where unaligned loads are known to be safe and efficient in asm.
Or use memcpy
to do an aliasing-safe unaligned load and compare that, if you think your compiler is more likely to have a builtin memcpy than memcmp.
Or in GNU C++, use a typedef to express unaligned may-alias loads:
bool f4(const char *ptr) {
typedef uint64_t aliasing_unaligned_u64 __attribute__((aligned(1), may_alias));
auto val = *(const aliasing_unaligned_u64*)ptr;
return val != 0;
}
Compiles on Godbolt with GCC10 -O3:
f4(char const*):
cmp QWORD PTR [rdi], 0
setne al
ret
Casting to uint64_t*
would potentially violate alignof(uint64_t)
, and probably violate the strict-aliasing rule unless the actual object pointed to by the char*
was compatible with uint64_t
.
And yes, alignment does matter on x86-64 because the ABI allows compilers to make assumptions based on it. A faulting movaps
or other problems can happen with real compilers in corner cases.
https://trust-in-soft.com/blog/2020/04/06/gcc-always-assumes-aligned-pointers/
Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?
Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior? is another example of using
may_alias
(withoutaligned(1)
in that case because implicit-length strings could end at any point, so you need to do aligned loads to make sure that your chunk that contains at least 1 valid string byte doesn't cross a page boundary.) Also Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?
回答2:
Is there any reason why f1 and f2 cannot be optimized into a single compare instruction (possibly with additional unaligned load)? It seem to be a pretty straightforward optimization to me.
In f1 the loop stops when ptr[i]
is true, so it is not always equivalent of to consider 8 elements as it is the case with the two other functions or directly comparing a 8 bytes word if the size of the array is less than 8 (the compiler does not know the size of the array) :
f1("\000\001"); // no access out of the array
f2("\000\001"); // access out of the array
f3("\000\001"); // access out of the array
For f2 I agree that can be replaced by a 8 bytes comparison under the condition the CPU allows to read a word of 8 bytes from any address alignment which is the case of the x64 but that can introduce unusual situation as explained in Unusual situations where this wouldn't be safe in x86 asm
回答3:
You need to help your compiler a bit to get exactly what you want... If you want to compare 8 bytes in one CPU operation, you'll need to change your char pointer so it points to something that's actually 8 bytes long, like a uint64_t pointer.
If your compiler does not support uint64_t, you can use unsigned long long*
instead:
#include <cstdint>
inline bool EightBytesNull(const char *ptr)
{
return *reinterpret_cast<const uint64_t*>(ptr) == 0;
}
Note that this will work on x86, but will not on ARM, which requires strict integer memory alignment.
来源:https://stackoverflow.com/questions/63392584/test-of-8-subsequent-bytes-isnt-translated-into-a-single-compare-instruction