Test of 8 subsequent bytes isn't translated into a single compare instruction

久未见 提交于 2020-08-27 22:12:07

问题


Motivated by this question, I compared three different functions for checking if 8 bytes pointed to by the argument are zeros (note that in the original question, characters are compared with '0', not 0):

bool f1(const char *ptr)
{    
  for (int i = 0; i < 8; i++)
    if (ptr[i])
      return false;
  return true;
}

bool f2(const char *ptr)
{  
  bool res = true;
  for (int i = 0; i < 8; i++)
    res &= (ptr[i] == 0);
  return res;
}

bool f3(const char *ptr)
{  
  static const char tmp[8]{};
  return !std::memcmp(ptr, tmp, 8);
}

Though I would expect the same assembly outcome with enabled optimizations, only the memcmp version was translated into a single cmp instruction on x64. Both f1 and f2 were translated into either a winded or unwinded loop. Moreover, this holds for all GCC, Clang, and Intel compilers with -O3.

Is there any reason why f1 and f2 cannot be optimized into a single compare instruction? It seem to be a pretty straightforward optimization to me.

Live demo: https://godbolt.org/z/j48366


回答1:


First of all, f1 stops reading at the first non-zero byte, so there are cases where it won't fault if you pass it a pointer to a shorter object near the end of a page, and the next page is unmapped. Unconditionally reading 8 bytes can fault in cases where f1 doesn't encounter UB, as @bruno points out. (Is it safe to read past the end of a buffer within the same page on x86 and x64?). The compiler doesn't know that you're never going to use it this way; it has to make code that works for every possible non-UB case for any hypothetical caller.

You can fix that by making the function arg const char ptr[static 8] (but that's a C99 feature, not C++) to guarantee that it's safe to touch all 8 bytes even if the C abstract machine wouldn't. Then the compiler can safely invent reads. (A pointer to a struct {char buf[8]}; would also work, but wouldn't be strict-aliasing safe if the actual pointed-to object wasn't that.)


GCC and clang can't auto-vectorize loops whose trip-count isn't known before the first iteration. So that rules out all search loops like f1, even if made it check a static array of known size or something. (ICC can vectorize some search loops like a naive strlen implementation, though.)

Your f2 could have been optimized the same as f3, to a qword cmp, without overcoming that major compiler-internals limitations because it always does 8 iterations. In fact, current nightly builds of clang do optimize f2, thanks @Tharwen for spotting that.

Recognizing loop patterns is not that simple, and takes compile time to look for. IDK how valuable this optimization would be in practice; that's what compiler devs need trade off against when considering writing more code to look for such patterns. (Maintenance cost of code, and compile-time cost.)

The value depends on how much real world code actually has patterns like this, as well as how big a saving it is when you find it. In this case it's a very nice saving, so it's not crazy for clang to look for it, especially if they have the infrastructure to turn a loop over 8 bytes into an 8-byte integer operation in general.


In practice, just use memcmp if that's what you want; apparently most compilers don't spend time looking for patterns like f2. Modern compilers do reliably inline it, especially for x86-64 where unaligned loads are known to be safe and efficient in asm.

Or use memcpy to do an aliasing-safe unaligned load and compare that, if you think your compiler is more likely to have a builtin memcpy than memcmp.

Or in GNU C++, use a typedef to express unaligned may-alias loads:

bool f4(const char *ptr) {
   typedef uint64_t aliasing_unaligned_u64 __attribute__((aligned(1), may_alias));
    auto val = *(const aliasing_unaligned_u64*)ptr;
    return val != 0;
}

Compiles on Godbolt with GCC10 -O3:

f4(char const*):
        cmp     QWORD PTR [rdi], 0
        setne   al
        ret

Casting to uint64_t* would potentially violate alignof(uint64_t), and probably violate the strict-aliasing rule unless the actual object pointed to by the char* was compatible with uint64_t.

And yes, alignment does matter on x86-64 because the ABI allows compilers to make assumptions based on it. A faulting movaps or other problems can happen with real compilers in corner cases.

  • https://trust-in-soft.com/blog/2020/04/06/gcc-always-assumes-aligned-pointers/

  • Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?

  • Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior? is another example of using may_alias (without aligned(1) in that case because implicit-length strings could end at any point, so you need to do aligned loads to make sure that your chunk that contains at least 1 valid string byte doesn't cross a page boundary.) Also Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?




回答2:


Is there any reason why f1 and f2 cannot be optimized into a single compare instruction (possibly with additional unaligned load)? It seem to be a pretty straightforward optimization to me.

In f1 the loop stops when ptr[i] is true, so it is not always equivalent of to consider 8 elements as it is the case with the two other functions or directly comparing a 8 bytes word if the size of the array is less than 8 (the compiler does not know the size of the array) :

f1("\000\001"); // no access out of the array
f2("\000\001"); // access out of the array
f3("\000\001"); // access out of the array

For f2 I agree that can be replaced by a 8 bytes comparison under the condition the CPU allows to read a word of 8 bytes from any address alignment which is the case of the x64 but that can introduce unusual situation as explained in Unusual situations where this wouldn't be safe in x86 asm




回答3:


You need to help your compiler a bit to get exactly what you want... If you want to compare 8 bytes in one CPU operation, you'll need to change your char pointer so it points to something that's actually 8 bytes long, like a uint64_t pointer.

If your compiler does not support uint64_t, you can use unsigned long long* instead:

#include <cstdint>

inline bool EightBytesNull(const char *ptr)
{  
    return *reinterpret_cast<const uint64_t*>(ptr) == 0;
}

Note that this will work on x86, but will not on ARM, which requires strict integer memory alignment.



来源:https://stackoverflow.com/questions/63392584/test-of-8-subsequent-bytes-isnt-translated-into-a-single-compare-instruction

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!