Question
I have this memchr code that I'm trying to make non-branching:
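.globl memchr
memchr:
    mov %rdx, %rcx        # rcx = byte count (third argument)
    mov %sil, %al         # al = byte to search for (second argument)
    cld                   # scan forward (DF = 0)
    repne scasb           # compare al with (%rdi), advancing rdi, until a match or rcx == 0
    lea -1(%rdi), %rax    # rdi points one past the last byte compared
    test %rcx, %rcx
    cmove %rcx, %rax      # if rcx == 0, return 0 (NULL)
    ret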
I'm unsure whether or not cmove is a branching instruction. Is it? If so, how do I rearrange my code so it doesn't branch?
Answer 1:
No, it's not a branch; that's the whole point of cmovcc.
It's an ALU select that has a data dependency on both inputs, not a control dependency. (With a memory source, it unconditionally loads the memory source, unlike ARM predicated load instructions that are truly NOPed. So you can't use it with maybe-bad pointers for branchless bounds or NULL checks. That's maybe the clearest illustration that it's definitely not a branch.)
But anyway, it's not predicted or speculated in any way; as far as the CPU scheduler is concerned it's just like an adc instruction: 2 integer inputs + FLAGS, and 1 integer output. (The only difference from adc/sbb is that it doesn't write FLAGS. And of course it runs on an execution unit with different internals.)
Whether that's good or bad depends entirely on the use-case. See also "gcc optimization flag -O3 makes code slower than -O2" for much more about the upsides and downsides of cmov.
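To make the "select with a data dependency on both inputs" idea concrete, here is a minimal C sketch. The helper names select_u64 and load_or are hypothetical, and whether a compiler actually emits cmov for either function is an assumption that depends on the compiler and options. The first function computes both candidates and picks one with a mask; the second shows the usual workaround for the memory-source caveat: select a pointer that is always valid first, then load unconditionally.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns a when cond is nonzero, else b, using only
   ALU operations. Both a and b are always inputs, as with cmov. */
static inline uint64_t select_u64(int cond, uint64_t a, uint64_t b) {
    uint64_t mask = (uint64_t)0 - (uint64_t)(cond != 0); /* all ones or all zeros */
    return (a & mask) | (b & ~mask);
}

/* Hypothetical helper: returns *p, or fallback when p is NULL.
   The address is selected before the load, so the load itself is
   unconditional and always reads valid memory. */
static inline int load_or(const int *p, int fallback) {
    static const int dummy = 0;
    const int *safe = p ? p : &dummy;  /* select a pointer that is always valid */
    int loaded = *safe;                /* unconditional load, cannot fault */
    return p ? loaded : fallback;      /* select the final result */
}
```

Writing `p ? *p : fallback` directly would need a conditional load, which cmov with a memory source cannot provide, so a compiler would have to branch (or do something like the above) to stay correct.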
Note that repne scasb is not fast. "Fast Strings" only works for rep stos / movs.
repne scasb runs about 1 count per clock cycle on modern CPUs, i.e. typically about 16x worse than a simple SSE2 pcmpeqb / pmovmskb / test+jnz loop. And with clever optimization you can go even faster, up to 2 vectors per clock, saturating the load ports.
(e.g. see glibc's memchr for ORing pcmpeqb results for a whole cache line together to feed one pmovmskb, IIRC. Then go back and sort out where the actual hit was.)
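The ORing trick can be sketched with SSE2 intrinsics in C. This is an illustration of the idea only, not glibc's actual code: find_in_64 is a hypothetical helper, it assumes exactly 64 readable bytes at p, and __builtin_ctzll is a GCC/Clang builtin. The common "no hit" path costs four compares, three ORs, and a single pmovmskb plus branch; only on a hit does it take per-vector masks to locate the first match.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: index (0..63) of the first byte equal to c in the
   64 bytes at p, or -1 if there is none. Assumes 64 readable bytes. */
int find_in_64(const unsigned char *p, unsigned char c) {
    const __m128i needle = _mm_set1_epi8((char)c);
    __m128i e0 = _mm_cmpeq_epi8(_mm_loadu_si128((const __m128i *)(p +  0)), needle);
    __m128i e1 = _mm_cmpeq_epi8(_mm_loadu_si128((const __m128i *)(p + 16)), needle);
    __m128i e2 = _mm_cmpeq_epi8(_mm_loadu_si128((const __m128i *)(p + 32)), needle);
    __m128i e3 = _mm_cmpeq_epi8(_mm_loadu_si128((const __m128i *)(p + 48)), needle);

    /* Combine all four compare results so one pmovmskb covers 64 bytes. */
    __m128i any = _mm_or_si128(_mm_or_si128(e0, e1), _mm_or_si128(e2, e3));
    if (_mm_movemask_epi8(any) == 0)
        return -1;

    /* Rare path: build a 64-bit mask to sort out where the first hit was. */
    unsigned long long bits =
          (unsigned long long)(unsigned)_mm_movemask_epi8(e0)
        | (unsigned long long)(unsigned)_mm_movemask_epi8(e1) << 16
        | (unsigned long long)(unsigned)_mm_movemask_epi8(e2) << 32
        | (unsigned long long)(unsigned)_mm_movemask_epi8(e3) << 48;
    return __builtin_ctzll(bits);
}
```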
repne scasb also has startup overhead, but microcode branching is different from regular branching: it's not branch-predicted on Intel CPUs. So this can't mispredict, but is total garbage for performance with anything but very small buffers.
SSE2 is baseline for x86-64, and efficient unaligned loads + pmovmskb make it a no-brainer for memchr, where you can check for length >= 16 to avoid crossing into an unmapped page.
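A minimal sketch of that simple loop in C with SSE2 intrinsics, assuming GCC or Clang on x86-64. The name memchr_sse2 is hypothetical, and this is not tuned the way a libc implementation would be. It only issues a 16-byte load while at least 16 bytes remain, so it never reads outside the buffer, and it handles the short tail one byte at a time.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Hypothetical name: same contract as memchr. */
void *memchr_sse2(const void *buf, int c, size_t n) {
    const unsigned char *p = (const unsigned char *)buf;
    const __m128i needle = _mm_set1_epi8((char)c);

    /* Vector loop: pcmpeqb + pmovmskb + test/branch per 16 bytes. */
    while (n >= 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)p);  /* unaligned load */
        unsigned mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(chunk, needle));
        if (mask != 0)
            return (void *)(p + __builtin_ctz(mask));  /* lowest set bit = first match */
        p += 16;
        n -= 16;
    }

    /* Tail: fewer than 16 bytes left, so fall back to a scalar loop. */
    for (; n != 0; ++p, --n) {
        if (*p == (unsigned char)c)
            return (void *)p;
    }
    return NULL;
}
```

The branch on mask is an ordinary predicted branch; for a search loop that is usually the right trade-off, since it is not taken until the match is found.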
Fast strlen:
- Why is this code 6.5x slower with optimizations enabled? shows a simple not-unrolled strlen for 16-byte-aligned inputs using SSE2.
- Why does glibc's strlen need to be so complicated to run quickly? links to some more stuff about hand-optimized asm strlen functions in glibc. (And how to make a bithack strlen in GNU C avoid strict-aliasing UB.)
- https://codereview.stackexchange.com/a/213558 scalar bithack strlen, including the same 4-byte-at-a-time bithack that the glibc question was about. Better than byte-at-a-time but pointless with SSE2 (which x86-64 guarantees). However, @CodyGray's tutorial-style answer may be useful for beginners. Note that it doesn't take into account the issue discussed in "Is it safe to read past the end of a buffer within the same page on x86 and x64?"
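For reference, the 4-byte-at-a-time bithack mentioned in the last item can be sketched in C as follows. This is an illustration under stated assumptions, not the code from any of the linked answers: strlen_bithack is a hypothetical name, memcpy is used for the word load to sidestep strict-aliasing problems, and the word loop may read up to 3 bytes past the terminator. Those reads are 4-byte aligned, so they stay within the same page, but they are still out of bounds as far as ISO C is concerned.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero if and only if at least one byte of v is zero. */
static inline uint32_t has_zero_byte(uint32_t v) {
    return (v - 0x01010101u) & ~v & 0x80808080u;
}

/* Hypothetical name: strlen that checks 4 bytes per iteration. */
size_t strlen_bithack(const char *s) {
    const char *p = s;

    /* Byte-at-a-time until p is 4-byte aligned. */
    while ((uintptr_t)p % 4 != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        ++p;
    }

    /* Aligned word loop: stop at the first word containing a zero byte. */
    for (;;) {
        uint32_t w;
        memcpy(&w, p, sizeof w);  /* compiles to a plain load, no aliasing UB */
        if (has_zero_byte(w))
            break;
        p += 4;
    }

    /* Locate the terminator inside that word. */
    while (*p != '\0')
        ++p;
    return (size_t)(p - s);
}
```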
Source: https://stackoverflow.com/questions/57524415/is-cmovcc-considered-a-branching-instruction