How can a literal 0 and 0 as a variable yield different behavior with the function __builtin_clz?


Question


There's only one circumstance where __builtin_clz gives the wrong answer. I'm curious what's causing that behavior.

When I use the literal value 0 I always get 32, as expected. But 0 stored in a variable yields 31. Why does the way the value 0 is stored matter?

I've taken an architecture class but don't understand the diffed assembly. It looks like, when given the literal value 0, the assembly has the correct answer of 32 hard-coded even without optimizations, and the method for counting leading zeros is different when using -march=native.

This post about emulating __builtin_clz with _BitScanReverse and the line bsrl %eax, %eax seem to imply that bit scan reverse doesn't work for 0.

+-------------------+-------------+--------------+
|      Compile      | literal.cpp | variable.cpp |
+-------------------+-------------+--------------+
| g++               |          32 |           31 |
| g++ -O            |          32 |           32 |
| g++ -march=native |          32 |           32 |
+-------------------+-------------+--------------+

literal.cpp

#include <iostream>

int main(){
    int i = 0;
    std::cout << __builtin_clz(0) << std::endl;
}

variable.cpp

#include <iostream>

int main(){
    int i = 0;
    std::cout << __builtin_clz(i) << std::endl;
}

Diff for g++ -S [in name] -o [out name]

1c1
<       .file   "literal.cpp"
---
>       .file   "variable.cpp"
23c23,26
<       movl    $32, %esi
---
>       movl    -4(%rbp), %eax
>       bsrl    %eax, %eax
>       xorl    $31, %eax
>       movl    %eax, %esi

Diff for g++ -march=native -S [in name] -o [out name]

1c1
<       .file   "literal.cpp"
---
>       .file   "variable.cpp"
23c23,25
<       movl    $32, %esi
---
>       movl    -4(%rbp), %eax
>       lzcntl  %eax, %eax
>       movl    %eax, %esi

Diff for g++ -O -S [in name] -o [out name]

1c1
<       .file   "literal.cpp"
---
>       .file   "variable.cpp"

Answer 1:


When you compile with optimization disabled, the compiler doesn't do constant propagation across statements. That part is a duplicate of Why does integer division by -1 (negative one) result in FPE? (read my answer there) and/or Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?

This is why a literal zero can behave differently from a variable whose value is 0: only the variable, compiled with optimization disabled, results in a runtime bsr + xor $31, %reg.
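
A quick way to see that folding (my sketch, not part of the original answer): with a constant argument, GCC evaluates the builtin at compile time, so a non-zero constant even works in a constant expression.

// Sketch, assuming GCC (or Clang). With a constant argument the builtin
// is folded at compile time; with a runtime variable it must emit a real
// instruction, and the source = 0 case is then left to the hardware.
static_assert(__builtin_clz(1u) == 31, "folded at compile time");

int main(){}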


As documented in the GCC manual for __builtin_clz:

Returns the number of leading 0-bits in x, starting at the most significant bit position. If x is 0, the result is undefined.

This allows clz / ctz to compile to 31-bsr or bsf instructions respectively on x86. The 31-bsr is implemented with bsr + xor $31, %reg thanks to the magic of 2's complement. (BSR produces the index of the highest set bit, not the leading-zero count.)
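
To see why the xor works (a check I'm adding, not from the answer): 31 is 0b11111, so subtracting a 5-bit index from it can never borrow, and 31 - n produces the same bit pattern as 31 ^ n.

#include <cassert>

int main(){
    // 31 is 0b11111: subtracting n (0..31) from an all-ones pattern cannot
    // borrow, so 31 - n just flips the low 5 bits, i.e. it equals 31 ^ n.
    for (int n = 0; n <= 31; ++n)
        assert(31 - n == (31 ^ n));
}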

Note that it only says result, not behaviour. It's not C++ UB (where the whole program could do absolutely anything); the undefinedness is limited to that result, just like in x86 asm. But anyway, it seems that when the input is a compile-time-constant 0, GCC produces the type width (32 for unsigned int), like x86 lzcnt and like the clz instructions on other ISAs. (This probably happens in target-independent GIMPLE tree optimization, where constant propagation through operations including builtins is done.)
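
If you need the zero case to be well-defined on every target, a common guard pattern (my addition, not from the original answer; the name clz32_safe is mine) is to check explicitly and let the compiler strip the check when it can:

#include <iostream>

// Sketch of a zero-safe wrapper. Built with -mlzcnt (or an -march that
// implies it), GCC can compile the whole expression to a single lzcnt,
// since lzcnt already returns 32 for input 0; otherwise it emits a
// compare plus the bsr / xor $31 sequence.
static inline int clz32_safe(unsigned x){
    return x ? __builtin_clz(x) : 32;
}

int main(){
    std::cout << clz32_safe(0u) << std::endl; // 32, well-defined
    std::cout << clz32_safe(1u) << std::endl; // 31
}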


Intel documents bsf/bsr as: "If the content of the source operand is 0, the content of the destination operand is undefined." In real life, Intel hardware implements the same behaviour AMD documents: leave the destination unmodified in that case.

But since Intel refuses to document it, compilers won't let you write code that takes advantage of it. GCC does not know or care about that behaviour, and provides no way to exploit it. Neither does MSVC, even though its intrinsic takes an output pointer arg and so could easily have worked that way. See VS: unexpected optimization behavior with _BitScanReverse64 intrinsic
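
For illustration only (my sketch, not from the answer; the name clz32_bsr_preload is hypothetical): if you did want to exploit the unmodified-destination behaviour, you'd have to write the asm yourself, relying on something only AMD puts on paper.

#include <cstdint>
#include <iostream>

// clz that yields 32 for input 0 by preloading the bsr destination with -1
// and relying on "source == 0 leaves the destination unmodified". AMD
// documents that; Intel hardware does it too but doesn't guarantee it,
// so no compiler will generate this for you.
static inline int clz32_bsr_preload(uint32_t x){
    int idx = -1;                               // survives when x == 0
    asm("bsrl %1, %0" : "+r"(idx) : "rm"(x) : "cc");
    return 31 - idx;                            // x == 0 -> 31 - (-1) == 32
}

int main(){
    std::cout << clz32_bsr_preload(0) << std::endl; // 32 on real hardware
    std::cout << clz32_bsr_preload(1) << std::endl; // 31
}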


With -march=native, GCC can use BMI1 lzcnt directly, which is well-defined for every possible input bit pattern including 0. It directly produces a leading-zero count instead of the index of the first set bit.
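
You can also ask for it explicitly rather than going through __builtin_clz. A minimal sketch using the real _lzcnt_u32 intrinsic (build with -mlzcnt or a suitable -march):

#include <x86intrin.h>   // _lzcnt_u32; build with e.g. g++ -mlzcnt
#include <iostream>

int main(){
    volatile unsigned i = 0;  // volatile keeps it a runtime value
    // Unlike __builtin_clz, _lzcnt_u32 is defined for every input:
    // for 0 it returns the operand width, 32.
    std::cout << _lzcnt_u32(i) << std::endl; // prints 32
}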

(This is why BSR/BSF don't make sense for input=0; there is no set bit for them to find an index of. Fun fact: bsr %eax, %eax does "work" for eax=0. In asm, the instructions also set ZF according to whether the input was zero, so you can detect the "undefined" case without a separate test+branch before bsr. Or, on AMD and everything else in real life, you can rely on the destination being left unmodified.)


Further reading about bsf vs. lzcnt and false dependencies

On Intel CPUs before Skylake, lzcnt / tzcnt have a false dependency on the output register, even though the result never depends on it. IIRC, Coffee Lake also fixed the false dependency for popcnt. (All of these run on the same execution unit as BSR/BSF.)

  • Why does breaking the "output dependency" of LZCNT matter?
  • VS: unexpected optimization behavior with _BitScanReverse64 intrinsic
  • How can x86 bsr/bsf have fixed latency, not data dependent? Doesn't it loop over bits like the pseudocode shows?


Source: https://stackoverflow.com/questions/59014816/how-can-a-literal-0-and-0-as-a-variable-yield-different-behavior-with-the-functi
