Bitwise operators, not vs xor use in branching

既然无缘 2021-01-27 16:09

After asking this SO question, I received a very interesting comment from @AndonM.Coleman that I had to verify.

Since your disassembled code is written fo

1 Answer
  •  陌清茗 (original poster)
    2021-01-27 16:27

    This depends on a lot of things, but mostly what (if anything) you tell the compiler to optimize for.

    If the compiler is set to optimize for size (smallest machine code), it will sometimes use XOR in seemingly strange places. For instance, because x86 uses a variable-length instruction encoding, XOR'ing a register with itself sets it to 0 in fewer bytes than the equivalent MOV instruction requires (2 bytes for XOR eax,eax versus 5 bytes for MOV eax,0).

    Consider the code that uses XOR:

    if ( (val ^ ~0U) == 0 )  /* 3 bytes to negate and test (x86) */
    

        XOR eax,0FFFFFFFFh requires 3 bytes (using the sign-extended 8-bit immediate form) and sets/clears the Zero Flag (ZF)

    Now, consider the code that uses NOT:

    if ( (~val) == 0 )       /* 4 bytes to negate and test (x86) */
    

        NOT eax is encoded as a 2-byte instruction, but does not affect CPU flags.

        TEST eax,eax adds another 2 bytes, and is necessary to set/clear the Zero Flag (ZF)

    NOT is also a simple instruction, but since it does not affect any CPU flags, you must issue a TEST instruction afterwards to use it for branching, as seen in your code. This actually produces larger machine code, so a smart compiler set to optimize for size would probably try to avoid using NOT. How many cycles the two instructions together take to complete varies between CPU generations, and a smart compiler would also factor this into its decision making when told to optimize for speed.


    If you are not writing hand-tuned assembly, it is best to use whatever is clearest to a human and trust the compiler to choose different instructions, scheduling, etc. to optimize for size or speed as requested at compile time. Compilers use a sophisticated set of heuristics to choose and schedule instructions, and they know more about the target CPU architecture than the average coder.

    If you find out later that this branch really is a bottleneck and there is no higher-level way around the problem, then you could do some low-level tuning. However, this is a trivial thing to focus on these days unless you are targeting something like a low-power embedded CPU or a memory-limited device. The only places where I have ever squeezed out enough performance by hand-tuning to make it worthwhile were algorithms that benefited from data parallelism, where the compiler was not smart enough to make effective use of specialized instruction sets like MMX/SSE.
