I was curious if java.lang.Integer.rotateLeft
gets optimized by using a rotation instruction and wrote a benchmark for it. The results were inconclusive: It was
I did not know that gcc and the java jit were capable of recognizing that a sequence of SHIFT and OR operators can be reduced to a ROTATE instruction, very interesting.
The g++ compiler unrolls your loops and uses SHIFT immediate
and ROTATE immediate
instructions (since you shift and rotate by constant values).
Here's the six instruction sequence that is repeated in the TimeShift loop unroll case:
movq %rax, %rbx
salq $13, %rbx
leaq (%rbp,%rbx), %rbx
movq %rdi, %rbp
sarq $27, %rbp
xorq %rbx, %rdx
Here's the six instruction sequence that is repeated in the TimeRotate loop unroll case:
movq %rdx, %rbx
rorq $45, %rbx
leaq (%rbp,%rbx), %rbx
movq %r8, %rbp
rorq $49, %rbp
xorq %rbx, %r9
They differ mainly in the use of salq/sarq for SHIFT
and rorq for ROTATE
so you are correct in wondering why the timing differs.
The answer lies deep in the micro-architecture of Sandy Bridge (your Core i5 processor) and is found in INTEL® 64 and IA-32 Processor Architectures Optimization Reference Manual
The latest is Order Number: 248966-026 April 2012
The SHIFT
instruction has 1 cycle latency whether you use the by 1
opcode or by immediate
. It can dispatch from either Port 0
or Port 1
and for this reason has a 0.5 cycle throughput - the processor can dispatch and retire two SHIFT immediate
instructions per cycle. The ROTATE
instruction needs three micro-ops if the results of the condition flags are needed (they aren't in the code generated by gcc) and two micro-ops if not (so two micro-ops in your case). The ROTATE
instruction, however, can only be dispatched from Port 1
and therefore has a 1 cycle throughput - the processor can dispatch and retire only one ROTATE immediate
per cycle.
I've copied the relevant image and section below.
3.5.1.5 Bitwise Rotation
Bitwise rotation can choose between rotate with count specified in the CL register, an immediate constant and by 1 bit. Generally, The rotate by immediate and rotate by register instructions are slower than rotate by 1 bit. The rotate by 1 instruction has the same latency as a shift. Assembly/Compiler Coding Rule 35. (ML impact, L generality) Avoid ROTATE by register or ROTATE by immediate instructions. If possible, replace with a ROTATE by 1 instruction. In Intel microarchitecture code name Sandy Bridge, ROL/ROR by immediate has 1- cycle throughput, SHLD/SHRD using the same register as source and destination by an immediate constant has 1-cycle latency with 0.5 cycle throughput. The “ROL/ROR reg, imm8” instruction has two micro-ops with the latency of 1-cycle for the rotate register result and 2-cycles for the flags, if used. In Intel microarchitecture code name Ivy Bridge, The “ROL/ROR reg, imm8” instruction with immediate greater than 1, is one micro-op with one-cycle latency when the overflow flag result is used. When the immediate is one, dependency on the overflow flag result of ROL/ROR by a subsequent instruction will see the ROL/ROR instruction with two-cycle latency.
2.4.4.2 Execution Units and Issue Ports
At each cycle, the core may dispatch µops to one or more of four issue ports. At the microarchitecture level, store operations are further divided into two parts: store data and store address operations. The four ports through which μops are dispatched to execution units and to load and store operations are shown in Figure 2-6. Some ports can dispatch two µops per clock. Those execution units are marked Double Speed.
Port 0. In the first half of the cycle, port 0 can dispatch either one floating-point move µop (a floating-point stack move, floating-point exchange or floating-point store data) or one arithmetic logical unit (ALU) µop (arithmetic, logic, branch or store data). In the second half of the cycle, it can dispatch one similar ALU µop.
Port 1. In the first half of the cycle, port 1 can dispatch either one floating-point execution (all floating-point operations except moves, all SIMD operations) µop or one normal-speed integer (multiply, shift and rotate) µop or one ALU (arithmetic) µop. In the second half of the cycle, it can dispatch one similar ALU µop.
Port 2. This port supports the dispatch of one load operation per cycle.
Port 3. This port supports the dispatch of one store address operation per cycle.
The total issue bandwidth can range from zero to six µops per cycle. Each pipeline contains several execution units. The µops are dispatched to the pipeline that corresponds to the correct type of operation. For example, an integer arithmetic logic unit and the floating-point execution units (adder, multiplier, and divider) can share a pipeline.
When you are looking at micro-benchmarks, you have to consider that the JIT will optimise common patterns e.g. shift, it recognises more efficiently than uncommon patterns e.g. rotate (or ones it does not recognise) This can means that two operations which should take the same amount of time can perform quite differently because one is more heavily optimised than the other. e.g. with more loop unrolling or dead code removal.
Even simple instructions can interact to produce different and unexpected results. In other words you cannot look at a single instruction and assume it tell you very much about how it will perform when more instructions are used. Context is important at such a low level.
I suggest you try comparing these operations in a realistic program and I suspect you will have great difficulty finding a measurable difference.
According to this benchmark, the shifts and rotate both have the same latency on your CPU, but rotates have a lower throughput (results listed there as "T" are reciprocal throughput, which is more easily comparable with latencies). That could have precisely the kind of result you're seeing - the lower throughput sort of gets in the way a little, but you weren't completely saturating the execution units so it doesn't show the full factor of 2 difference. Testing that yourself is not easy, especially not if you have to fight a compiler to make it emit the instructions your want.