I'm trying to optimize some code that's supposed to read single precision floats from memory and perform arithmetic on them in double precision. This is becoming a signif
This is actually an optimization. CVTSS2SD from memory leaves the high 64 bits of the destination register unchanged. This means that a partial-register update occurs, which can incur a significant stall and greatly reduce ILP in many circumstances. MOVSS, on the other hand, zeros the unused bits of the register, which is dependency-breaking, and avoids the risk of the stall.
You may well have a bottleneck on conversion to double, but this isn't it.
I'll expand a little bit on exactly why the partial register update is a performance hazard.
I have no idea what computation is actually being performed, but let's suppose that it looks like this very simple example:
double accumulator, x;
float y[n];
for (size_t i=0; i<n; ++i) {
    accumulator += x*(double)y[i];
}
The "obvious" codegen for the loop looks something like this:
loop_begin:
cvtss2sd xmm0, [y + 4*i]
mulsd xmm0, x
addsd accumulator, xmm0
// some loop arithmetic that I'll ignore; it isn't important.
Naively, the only loop-carried dependency is in the accumulator update, so asymptotically the loop should run at a rate of 1/(addsd latency) iterations per cycle, which is 3 cycles per loop iteration on current "typical" x86 cores (see Agner Fog's tables or Intel's Optimization Manual for more details).
However, if we actually look at the operation of these instructions, we see that the high 64 bits of xmm0, even though they have no effect on the result we are interested in, form a second loop-carried dependency chain. Each cvtss2sd instruction cannot begin until the result of the preceding loop iteration's mulsd is available; this bounds the actual speed of the loop to 1/(cvtss2sd latency + mulsd latency), or 7 cycles per loop iteration on typical x86 cores (the good news is that you only pay the reg-reg conversion latency, because the conversion operation is cracked into two µops, and the load µop does not have a dependency on xmm0, so it can be hoisted).
We can write out the operation of this loop as follows to make it a bit clearer (I'm ignoring the load half of the cvtss2sd, as those µops are nearly unconstrained and can happen more-or-less whenever):
cycle  iteration 1    iteration 2    iteration 3
------------------------------------------------
0      cvtss2sd
1      .
2      mulsd
3      .
4      .
5      .
6      . --- xmm0[64:127]-->
7      addsd          cvtss2sd(*)
8      .              .
9      .-- accum -+   mulsd
10                |   .
11                |   .
12                |   .
13                |   . --- xmm0[64:127]-->
14                +-> addsd          cvtss2sd
15                    .              .
(*) I'm actually simplifying things a bit; we need to consider not only latency but also port utilization in order to make this accurate. Considering only latency suffices to illustrate the stall in question, however, so I'm keeping it simple. Pretend we are running on a machine with infinite ILP resources.
Now suppose that we write the loop like this instead:
loop_begin:
movss xmm0, [y + 4*i]
cvtss2sd xmm0, xmm0
mulsd xmm0, x
addsd accumulator, xmm0
// some loop arithmetic that I'll ignore; it isn't important.
Because movss from memory zeros bits [32:127] of xmm0, there is no longer a loop-carried dependency on xmm0, so we are bound by accumulation latency, as expected; execution at steady state looks something like this:
cycle  iteration i    iteration i+1  iteration i+2
------------------------------------------------
0      cvtss2sd       .
1      .              .
2      mulsd          .              movss
3      .              cvtss2sd       .
4      .              .              .
5      .              mulsd          .
6      .              .              cvtss2sd
7      addsd          .              .
8      .              .              mulsd
9      .              .              .
10     . -- accum --> addsd          .
11                    .              .
12                    .              .
13                    . -- accum --> addsd
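The two bounds discussed above (7 cycles with the false dependency, 3 without) can be reproduced with a tiny latency-only model. This is just a sketch under the same "infinite ILP resources" simplification; the function name cycles_per_iteration and its parameters are hypothetical, and the latencies you pass in (3 for addsd, 2 for reg-reg cvtss2sd, 5 for mulsd, as read off the diagrams) are assumptions that vary between cores.

```c
/* Hypothetical latency-only model: steady-state cycles per loop iteration
   when the loop is limited purely by loop-carried dependency chains.
   Port pressure and load latency are deliberately ignored. */
int cycles_per_iteration(int addsd_latency, int cvtss2sd_latency,
                         int mulsd_latency, int has_false_dependency)
{
    /* Chain 1: accumulator -> addsd -> accumulator (always present). */
    int accum_chain = addsd_latency;

    /* Chain 2: high bits of xmm0 -> cvtss2sd -> mulsd -> next cvtss2sd.
       It only exists when cvtss2sd merges into the old register value. */
    int xmm0_chain = has_false_dependency
                         ? cvtss2sd_latency + mulsd_latency
                         : 0;

    /* The longest loop-carried chain sets the pace. */
    return accum_chain > xmm0_chain ? accum_chain : xmm0_chain;
}
```

Calling cycles_per_iteration(3, 2, 5, 1) models the cvtss2sd-from-memory loop, and passing 0 as the last argument models the movss + cvtss2sd version.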
Note that in my toy example, there's still a lot more to be done to optimize the code in question after eliminating the partial-register-update stall. It can be vectorized, and multiple accumulators can be used (at the cost of changing the specific rounding that occurs) to minimize the effect of the loop-carried accumulate-to-accumulate latency.
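As a rough illustration of the multiple-accumulator idea, here is a minimal C sketch of the toy loop with four independent partial sums. The function names and the choice of four accumulators are my own assumptions for the example, not something taken from the original code; whether a compiler keeps the accumulators in separate registers, or vectorizes the loop, depends on the compiler and flags.

```c
#include <stddef.h>

/* Baseline: a single accumulator, so every addition waits on the previous one. */
double scaled_sum_single(double x, const float *y, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i)
        acc += x * (double)y[i];
    return acc;
}

/* Four independent accumulators: the four addition chains do not depend on
   each other, so their latencies can overlap. The additions are performed
   in a different order, so the rounding may differ from the baseline. */
double scaled_sum_multi(double x, const float *y, size_t n)
{
    double acc0 = 0.0, acc1 = 0.0, acc2 = 0.0, acc3 = 0.0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        acc0 += x * (double)y[i];
        acc1 += x * (double)y[i + 1];
        acc2 += x * (double)y[i + 2];
        acc3 += x * (double)y[i + 3];
    }

    /* Handle the remaining 0 to 3 elements. */
    for (; i < n; ++i)
        acc0 += x * (double)y[i];

    return (acc0 + acc1) + (acc2 + acc3);
}
```

For inputs whose partial sums are all exactly representable the two functions agree; in general they can differ in the last few bits, which is the rounding change mentioned above.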