Question
I noticed that if we know there is a good chance a condition will be true or false, we can tell the compiler. For instance, the Linux kernel uses likely and unlikely in many places, which are actually implemented with GCC's __builtin_expect. I wanted to find out how this works, so I checked the assembly:
20:branch_prediction_victim.cpp **** if (array_aka[j] >= 128)
184 .loc 3 20 0 is_stmt 1
185 00f1 488B85D0 movq -131120(%rbp), %rax
185 FFFDFF
186 00f8 8B8485F0 movl -131088(%rbp,%rax,4), %eax
186 FFFDFF
187 00ff 83F87F cmpl $127, %eax
188 0102 7E17 jle .L13
Then with __builtin_expect:
20:branch_prediction_victim.cpp **** if (__builtin_expect((array_aka[j] >= 128), 1))
184 .loc 3 20 0 is_stmt 1
185 00f1 488B85D0 movq -131120(%rbp), %rax
185 FFFDFF
186 00f8 8B8485F0 movl -131088(%rbp,%rax,4), %eax
186 FFFDFF
187 00ff 83F87F cmpl $127, %eax
188 0102 0F9FC0 setg %al
189 0105 0FB6C0 movzbl %al, %eax
190 0108 4885C0 testq %rax, %rax
191 010b 7417 je .L13
- 188 setg: "set if greater", but set if greater than what here?
- 189 movzbl: "move with zero-extension from byte to long"; I know this one moves %al to %eax.
- 190 testq: bitwise OR, then sets the ZF and CF flags, is this right?
I want to know how these affect branch prediction and improve performance. Three extra instructions should mean more cycles, right?
Answer 1:
setcc reads FLAGS, in this case set by the cmp right before. Read the manual.
This looks like you forgot to enable optimization, so __builtin_expect is just creating a 0 / 1 boolean value in a register and branching on it being non-zero, instead of branching on the original FLAGS condition. Don't look at un-optimized code; it's always going to suck.
The clues are: the braindead booleanizing as part of likely, and loading j from the stack using RBP as a frame pointer with movq -131120(%rbp), %rax.
likely generally doesn't improve runtime branch prediction; it improves code layout to minimize the number of taken branches when things go the way the source code said they would (i.e. the fast case). So it improves I-cache locality for the common case: e.g. the compiler will lay things out so that the common case is a not-taken conditional branch that just falls through. This makes things easier for the front-end in superscalar pipelined CPUs that fetch/decode multiple instructions at once; continuing to fetch in a straight line is easiest.
likely can actually get the compiler to use a branch instead of a cmov for cases that you know are predictable, even if compiler heuristics (without profile-guided optimization) would have gotten it wrong. Related: gcc optimization flag -O3 makes code slower than -O2
Source: https://stackoverflow.com/questions/61030543/how-to-understand-macro-likely-affecting-branch-prediction