Why don't GCC and Clang use cvtss2sd [memory]?

北海茫月 2020-12-15 09:59

I'm trying to optimize some code that's supposed to read single precision floats from memory and perform arithmetic on them in double precision. This is becoming a signif

1 answer
  • 2020-12-15 10:38

    This is actually an optimization. CVTSS2SD from memory writes only the low 64 bits of the destination register and leaves the high 64 bits unchanged. That makes it a partial-register update: the instruction carries a false dependency on the previous contents of the destination, which can incur a significant stall and greatly reduce ILP in many circumstances. MOVSS from memory, on the other hand, zeros the unused bits of the register, which is dependency-breaking and avoids the risk of the stall.

    You may well have a bottleneck on conversion to double, but this isn't it.


    I'll expand a little bit on exactly why the partial register update is a performance hazard.

    I have no idea what computation is actually being performed, but let's suppose that it looks like this very simple example:

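    #include <stddef.h>

    double scaled_sum(double x, const float *y, size_t n) {
        double accumulator = 0.0;
        for (size_t i = 0; i < n; ++i) {
            accumulator += x * (double)y[i];  // widen each float, then accumulate in double
        }
        return accumulator;
    }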
    

    The "obvious" codegen for the loop looks something like this:

    loop_begin:
      cvtss2sd xmm0, [y + 4*i]
      mulsd    xmm0,  x
      addsd    accumulator, xmm0
      // some loop arithmetic that I'll ignore; it isn't important.
    

    Naively, the only loop-carried dependency is the accumulator update, so asymptotically the loop should complete one iteration per addsd latency, which is 3 cycles per loop iteration on current "typical" x86 cores (see Agner Fog's tables or Intel's Optimization Manual for more details).

    However, if we actually look at the operation of these instructions, we see that the high 64 bits of xmm0, even though they have no effect on the result we are interested in, form a second loop-carried dependency chain. Each cvtss2sd instruction cannot begin until the result of the preceding loop iteration's mulsd is available; this bounds the actual speed of the loop to one iteration per (cvtss2sd latency + mulsd latency), or 7 cycles per loop iteration on typical x86 cores. The good news is that you only pay the reg-reg conversion latency: the conversion operation is cracked into two µops, and the load µop does not have a dependency on xmm0, so it can be hoisted.

    We can write out the operation of this loop as follows to make it a bit more clear (I'm ignoring the load-half of the cvtss2sd, as those µops are nearly unconstrained and can happen more-or-less whenever):

    cycle  iteration 1    iteration 2    iteration 3
    ------------------------------------------------
    0      cvtss2sd
    1      .
    2      mulsd
    3      .
    4      .
    5      .
    6      . --- xmm0[64:127]-->
    7      addsd          cvtss2sd(*)
    8      .              .
    9      .-- accum -+   mulsd
    10                |   .
    11                |   .
    12                |   .
    13                |   . --- xmm0[64:127]-->
    14                +-> addsd          cvtss2sd
    15                    .              .
    

    (*) I'm actually simplifying things a bit; we need to consider not only latency but also port utilization in order to make this accurate. Considering only latency suffices to illustrate the stall in question, however, so I'm keeping it simple. Pretend we are running on a machine with infinite ILP resources.

    Now suppose that we write the loop like this instead:

    loop_begin:
       movss    xmm0, [y + 4*i]
       cvtss2sd xmm0,  xmm0
       mulsd    xmm0,  x
       addsd    accumulator, xmm0
       // some loop arithmetic that I'll ignore; it isn't important.
    

    Because movss from memory zeros bits [32:127] of xmm0, there is no longer a loop-carried dependency on xmm0, so we are bound by accumulation latency, as expected; execution at steady state looks something like this:

    cycle  iteration i    iteration i+1  iteration i+2
    ------------------------------------------------
    0      cvtss2sd       .
    1      .              .
    2      mulsd          .              movss 
    3      .              cvtss2sd       .
    4      .              .              .
    5      .              mulsd          .
    6      .              .              cvtss2sd
    7      addsd          .              .
    8      .              .              mulsd
    9      .              .              .
    10     . -- accum --> addsd          .
    11                    .              .
    12                    .              .
    13                    . -- accum --> addsd
    

    Note that in my toy example, there's still a lot more to be done to optimize the code in question after eliminating the partial-register-update stall. It can be vectorized, and multiple accumulators can be used (at the cost of changing the specific rounding that occurs) to minimize the effect of the loop-carried accumulate-to-accumulate latency.
