What's the best practice for using a `switch` statement vs using an `if` statement for 30 unsigned enumerations where about 10 have an ex
Compilers are really good at optimizing `switch`. Recent gcc is also good at optimizing a bunch of conditions in an `if`.
I made some test cases on godbolt.
When the `case` values are grouped close together, gcc, clang, and icc are all smart enough to use a bitmap to check if a value is one of the special ones.
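For reference, the two source shapes being compared look roughly like the sketch below. The counter and the body of `fire_special_event` are placeholders, and the case values (1, 7, 10, 16, 21, 22, 32) are the ones implied by the bitmap constants in the compiler output in this answer, not necessarily the exact test cases used on godbolt.

```c
/* Hypothetical stand-in for the real handler; it just counts calls. */
int special_events_fired = 0;
void fire_special_event(void) { special_events_fired++; }

/* switch version: every special value gets a case label. */
void errhandler_switch(unsigned errtype) {
    switch (errtype) {
    case 1: case 7: case 10: case 16:
    case 21: case 22: case 32:
        fire_special_event();
        break;
    default:
        break;
    }
}

/* if version: the same set written as a chain of comparisons. */
void errhandler_if(unsigned errtype) {
    if (errtype == 1 || errtype == 7 || errtype == 10 || errtype == 16 ||
        errtype == 21 || errtype == 22 || errtype == 32) {
        fire_special_event();
    }
}
```

Both functions have the same behavior, so a compiler is free to turn either one into the same bitmap test.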
e.g. gcc 5.2 -O3 compiles the `switch` to this (and the `if` to something very similar):
errhandler_switch(errtype): # gcc 5.2 -O3
cmpl $32, %edi
ja .L5
movabsq $4301325442, %rax # highest set bit is bit 32 (the 33rd bit)
btq %rdi, %rax
jc .L10
.L5:
rep ret
.L10:
jmp fire_special_event()
Notice that the bitmap is immediate data, so there's no potential data-cache miss from loading it, and no jump table to load from either.
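Written back in C, the gcc 5.2 output amounts to a range check followed by a single bit test against a 64-bit constant. This is only a sketch: `is_special_bitmap` and `SPECIAL_MAP` are names made up for illustration, and the function returns a flag where the real code tail-calls `fire_special_event`.

```c
#include <stdint.h>

/* Bit n is set when n is a special value: 1, 7, 10, 16, 21, 22, 32.
   Same constant as the movabsq immediate (4301325442 == 0x100610482). */
#define SPECIAL_MAP UINT64_C(0x100610482)

/* Hypothetical helper: returns 1 for special values, 0 otherwise. */
int is_special_bitmap(unsigned errtype) {
    if (errtype > 32)                             /* cmpl $32, %edi ; ja .L5 */
        return 0;
    return (int)((SPECIAL_MAP >> errtype) & 1);   /* btq %rdi, %rax */
}
```

The range check has to come first: shifting a 64-bit value by 64 or more is undefined in C, and `btq` with a register operand would silently wrap the bit index.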
gcc 4.9.2 -O3 compiles the `switch` to a bitmap, but does the `1U <<` shift with separate mov and shift instructions instead of using `bt`. It compiles the `if` version to a series of branches.
errhandler_switch(errtype): # gcc 4.9.2 -O3
leal -1(%rdi), %ecx
cmpl $31, %ecx # cmpl $32, %edi wouldn't have to wait an extra cycle for lea's output.
# However, register read ports are limited on pre-SnB Intel
ja .L5
movl $1, %eax
salq %cl, %rax # with -march=haswell, it will use BMI2's shlx to avoid moving the shift count into ecx
testl $2150662721, %eax
jne .L10
.L5:
rep ret
.L10:
jmp fire_special_event()
Note how it subtracts 1 from `errNumber` (with `lea` to combine that operation with a move). That lets it fit the bitmap into a 32-bit immediate, avoiding the 64-bit-immediate `movabsq`, which takes more instruction bytes.
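The subtract-1 trick can be sketched in C as follows. The names `is_special_biased` and `SPECIAL_MAP32` are invented for this example, and the function returns a flag rather than calling the handler. The point is that unsigned wraparound makes the single `idx > 31` comparison reject both 0 and everything above 32.

```c
#include <stdint.h>

/* Bit (n - 1) is set when n is special, so bit 31 stands for the value 32.
   2150662721 == 0x80308241 is the 64-bit map shifted right by one. */
#define SPECIAL_MAP32 UINT32_C(0x80308241)

/* Hypothetical helper: returns 1 for special values, 0 otherwise. */
int is_special_biased(unsigned errtype) {
    unsigned idx = errtype - 1;   /* leal -1(%rdi), %ecx ; 0 wraps to UINT_MAX */
    if (idx > 31)                 /* cmpl $31, %ecx ; ja .L5 */
        return 0;
    /* movl $1 ; salq %cl ; testl $2150662721 */
    return ((UINT32_C(1) << idx) & SPECIAL_MAP32) != 0;
}
```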
A shorter (in machine code) sequence would be:
dec %edi # bias by 1 before the range check, so errtype 0 wraps to 0xFFFFFFFF and is rejected
cmpl $31, %edi
ja .L5
mov $2150662721, %eax # movabsq and btq is fewer instructions / fewer Intel uops, but this saves several bytes
bt %edi, %eax
jc fire_special_event
.L5:
ret
(The failure to use `jc fire_special_event`, i.e. a conditional branch as the tail call, is omnipresent, and is a compiler bug.)
`rep ret` is used in branch targets, and following conditional branches, for the benefit of old AMD K8 and K10 (pre-Bulldozer); see "What does `rep ret` mean?". Without it, branch prediction doesn't work as well on those obsolete CPUs.
`bt` (bit test) with a register arg is fast. It combines the work of left-shifting a 1 by `errNumber` bits and doing a `test`, but is still 1 cycle latency and only a single Intel uop. It's slow with a memory arg because of its way-too-CISC semantics: with a memory operand for the "bit string", the address of the byte to be tested is computed based on the other arg (divided by 8), and isn't limited to the 1, 2, 4, or 8-byte chunk pointed to by the memory operand.
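The difference between the two forms of `bt` can be modeled in C. This is a simplified sketch with made-up helper names: the register form wraps the bit index modulo the operand size, while the memory form uses the index to pick a byte anywhere past the base address. Negative bit offsets, which the memory form also accepts, are left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Register form (32-bit operand): the bit index wraps modulo 32,
   so only the low 5 bits of the index matter. */
int bt_reg32(uint32_t bits, uint32_t index) {
    return (int)((bits >> (index & 31)) & 1);
}

/* Memory form, non-negative indices only: the index selects a byte
   at base + index/8, which may be far outside the first 4 bytes. */
int bt_mem(const uint8_t *base, size_t index) {
    return (base[index / 8] >> (index % 8)) & 1;
}
```

The wrapping in the register form is also why the bitmap code needs its range check before the bit test: an out-of-range index would otherwise alias onto a valid bit.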
According to Agner Fog's instruction tables, a variable-count shift instruction is slower than a `bt` on recent Intel (2 uops instead of 1, and shift doesn't do everything else that's needed).