When I am down to squeezing the last bit of performance out of a kernel, I usually find that replacing the logical operators (&&
and
Bitwise operations can be carried out directly in registers at the hardware level. Register operations are the fastest there are, especially when the data fits in a register. Logical operators such as && have to short-circuit, so evaluating them may involve a conditional branch rather than a single register-to-register instruction. Typically &, |, ^, >>, ... are among the fastest operations and are widely used in high-performance code.
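To make the difference concrete, here is a minimal sketch in C. The function names (`in_range_logical`, `in_range_bitwise`, `check_equivalence`) are made up for illustration and are not from the original post. It assumes both operands are plain comparisons that yield 0 or 1 and have no side effects, which is the only case where swapping && for & is safe.

```c
#include <assert.h>
#include <stdbool.h>

/* Short-circuit version: the second comparison is evaluated only when the
   first one is true, which a compiler may implement with a branch. */
static bool in_range_logical(int x, int lo, int hi) {
    return x >= lo && x <= hi;
}

/* Bitwise version: both comparisons are always evaluated, and their 0/1
   results are combined with a single AND. */
static bool in_range_bitwise(int x, int lo, int hi) {
    return (x >= lo) & (x <= hi);
}

/* Sanity check: the two versions agree for values below, inside and above
   the range [0, 10]. */
static void check_equivalence(void) {
    for (int x = -5; x <= 15; ++x) {
        assert(in_range_logical(x, 0, 10) == in_range_bitwise(x, 0, 10));
    }
}
```

Whether the bitwise form is actually faster depends on the compiler and the target; optimizers often turn the short-circuit form into branch-free code on their own when the operands are cheap and side-effect-free, so it is worth measuring before and after.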