When compiling the benchmark code below with -O3
I was impressed by the difference it made in latency, so I began to wonder whether the compiler is not "cheatin
It can be very difficult to benchmark what you think you are measuring. In the case of the inner loop:
    for (int j = 0; j < load; ++j)
        if (i % 4 == 0)
            x += (i % 4) * (i % 8);
        else
            x -= (i % 16) * (i % 32);
A shrewd compiler might be able to see through that and change the code to something like:
    x = load * 174; // example only
I know that isn't equivalent, but nothing in the loop body depends on j, so every iteration changes x by the same amount, and there is a fairly simple expression which can replace that loop.
The way to be sure is to use gcc's -S compiler option and look at the assembly code it generates.