After asking this SO question, I received a very interesting comment from @AndonM.Coleman that I had to verify.
This depends on a lot of things, but mostly what (if anything) you tell the compiler to optimize for.
If the compiler is set to optimize for size (smallest machine code), then it will sometimes use `XOR` in seemingly strange places. For instance, thanks to the variable-length instruction encoding x86 uses, a register can be set to 0 by XORing it with itself in fewer bytes of code than the equivalent `MOV` instruction requires.
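As a minimal sketch of this, assuming the standard 32-bit x86 encodings (the function name is mine, purely for illustration):

```c
unsigned int zero_register(void)
{
    /* A compiler optimizing for size will typically emit:
     *     XOR eax,eax    ; 2 bytes (31 C0)
     * rather than:
     *     MOV eax,0      ; 5 bytes (B8 00 00 00 00)
     * Both leave eax == 0, but the XOR form saves 3 bytes.
     */
    return 0;
}
```

You can observe this yourself by compiling with size optimization enabled (e.g., `gcc -Os -S`) and inspecting the generated assembly.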
XOR:

```c
if ( (val ^ ~0U) == 0 ) /* 3 bytes to negate and test (x86) */
```

`XOR eax,0FFFFFFFFh` requires 3 bytes AND sets/clears the Zero Flag (ZF).

NOT:

```c
if ( (~val) == 0 ) /* 4 bytes to negate and test (x86) */
```

`NOT eax` is encoded as a 2-byte instruction, but does not affect CPU flags. `TEST eax,eax` adds another 2 bytes, and is necessary to set/clear the Zero Flag (ZF).
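To make the comparison concrete, here is a small compilable sketch that puts both forms side by side; the function names are mine, and the comments describe the literal per-statement translation discussed above (a modern optimizer may well fold both tests into a single `CMP` against -1):

```c
#include <stdio.h>

/* XOR variant: one 3-byte XOR eax,0FFFFFFFFh that also
 * sets/clears ZF, so the branch needs no extra TEST. */
static int all_bits_set_xor(unsigned int val)
{
    return (val ^ ~0U) == 0;
}

/* NOT variant: a 2-byte NOT eax that leaves the flags
 * alone, plus a 2-byte TEST eax,eax to set/clear ZF. */
static int all_bits_set_not(unsigned int val)
{
    return (~val) == 0;
}

int main(void)
{
    unsigned int v = 0xFFFFFFFFu;
    printf("%d %d\n", all_bits_set_xor(v), all_bits_set_not(v)); /* 1 1 */
    return 0;
}
```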
`NOT` is also a simple instruction, but since it does not affect any CPU flags, you must issue a `TEST` instruction afterwards in order to branch on the result, as seen in your code. This actually produces larger machine code, so a smart compiler set to optimize for size would probably avoid `NOT`. How many cycles the two instructions take together varies between CPU generations, and a smart compiler would also factor this into its decision making when told to optimize for speed.
If you find out later that this branch really is a bottleneck and there is no higher-level way around the problem, then you could do some low-level tuning. However, this is such a trivial thing to focus on these days unless you are targeting something like a low-power embedded CPU or memory limited device. The only places I have ever squeezed out enough performance by hand-tuning to make it worthwhile were in algorithms that benefited from data parallelism and where the compiler was not smart enough to effectively utilize specialized instruction sets like MMX/SSE.
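As a hypothetical illustration of that kind of hand-tuning, here is a sketch of a four-wide float sum using the standard SSE intrinsics from `<xmmintrin.h>`; the function name and the alignment/length assumptions are mine, not from the original comment:

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Sum n floats four lanes at a time. Assumes n is a multiple
 * of 4 and data is 16-byte aligned (required by _mm_load_ps). */
float sum_sse(const float *data, size_t n)
{
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_load_ps(data + i));

    /* Fold the four lanes down to a single scalar. */
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

On 32-bit targets you may need to enable SSE explicitly (e.g., `gcc -msse`); on x86-64 it is part of the baseline instruction set.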