Floating point division vs floating point multiplication

醉话见心 2020-11-22 11:37

Is there any (non-microoptimization) performance gain by coding

float f1 = 200f / 2

in comparison to

float f2 = 200f * 0.5

7 Answers
  •  伪装坚强ぢ
    2020-11-22 12:08

    Be very careful with division, and avoid it when possible. For example, hoist float inverse = 1.0f / divisor; out of a loop and multiply by inverse inside the loop (if the rounding error in inverse is acceptable).

    Usually 1.0/x will not be exactly-representable as a float or double. It will be exact when x is a power of 2. This lets compilers optimize x / 2.0f to x * 0.5f without any change in the result.
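    This exactness claim is easy to check directly. A minimal C sketch (the 5.0f / 3.0f case is my own illustration, not from the answer): dividing by a power of 2 matches multiplying by its reciprocal bit-for-bit, while multiplying by the rounded 1.0f/3.0f does not always match a true divide.

    ```c
    #include <assert.h>

    int main(void) {
        /* 0.5f represents 1/2 exactly, so these two expressions are
           bit-identical; the compiler is allowed to do this rewrite
           even without -ffast-math. */
        float x = 200.0f;
        assert(x / 2.0f == x * 0.5f);

        /* 1/3 is not representable: 1.0f/3.0f rounds to ~0.333333343f.
           Multiplying by that rounded reciprocal can differ from a true
           divide in the last bit, e.g. for x = 5.0f: */
        float inv3 = 1.0f / 3.0f;
        assert(5.0f / 3.0f != 5.0f * inv3);
        return 0;
    }
    ```

    This last-bit difference is exactly the rounding error the answer says you must decide is acceptable before hoisting a reciprocal out of a loop.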

    To let the compiler do this optimization for you even when the result won't be exact (or with a runtime-variable divisor), you need options like gcc -O3 -ffast-math. Specifically, -freciprocal-math (enabled by -funsafe-math-optimizations enabled by -ffast-math) lets the compiler replace x / y with x * (1/y) when that's useful. Other compilers have similar options, and ICC may enable some "unsafe" optimization by default (I think it does, but I forget).

    -ffast-math is often important to allow auto-vectorization of FP loops, especially reductions (e.g. summing an array into one scalar total), because FP math is not associative. Why doesn't GCC optimize a*a*a*a*a*a to (a*a*a)*(a*a*a)?

    Also note that C++ compilers can fold + and * into an FMA in some cases (when compiling for a target that supports it, like -march=haswell), but they can't do that with /.
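    One way to see what fusing + and * into an FMA buys (the numeric values here are my own illustration): C99's fmaf computes a*b + c with a single rounding at the end, so it can even recover the rounding error that a separate multiply throws away.

    ```c
    #include <math.h>
    #include <assert.h>

    int main(void) {
        /* a*a = 1 + 2^-11 + 2^-24 needs 25 mantissa bits, so a separate
           multiply must round (down to 1 + 2^-11, by round-to-nearest-even). */
        float a = 1.0f + 0x1p-12f;
        float p = a * a;               /* rounded product */

        /* fmaf(a, a, -p) computes a*a - p with one rounding, exposing
           exactly the bit the plain multiply dropped: */
        float err = fmaf(a, a, -p);
        assert(err == 0x1p-24f);
        return 0;
    }
    ```

    No sequence of separate multiply and add instructions can produce that intermediate unrounded product, which is why the compiler needs FMA hardware (e.g. -march=haswell) to contract a*b + c, and why there is no analogous contraction for /.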


    Division has worse latency than multiplication or addition (or FMA) by a factor of 2 to 4 on modern x86 CPUs, and worse throughput by a factor of 6 to 40 (see footnote 1) for a tight loop doing only division instead of only multiplication.

    The divide / sqrt unit is not fully pipelined, for reasons explained in @NathanWhitehead's answer. The worst ratios are for 256b vectors, because (unlike other execution units) the divide unit is usually not full-width, so wide vectors have to be done in two halves. A not-fully-pipelined execution unit is so unusual that Intel CPUs have an arith.divider_active hardware performance counter to help you find code that bottlenecks on divider throughput instead of the usual front-end or execution-port bottlenecks. (Or, more often, memory bottlenecks or long latency chains that limit instruction-level parallelism and keep instruction throughput below ~4 per clock.)

    However, FP division and sqrt on Intel and AMD CPUs (other than KNL) are implemented as a single uop, so division doesn't necessarily have a big throughput impact on surrounding code. The best case for division is when out-of-order execution can hide the latency, and when there are lots of multiplies and adds (or other work) that can happen in parallel with the divide.

    (Integer division is microcoded as multiple uops on Intel, so it always has more impact on surrounding code than integer multiply. There's less demand for high-performance integer division, so the hardware support isn't as fancy. Related: microcoded instructions like idiv can cause alignment-sensitive front-end bottlenecks.)

    So for example, this will be really bad:

    for (int i = 0; i < n; i++)
        a[i] = b[i] / scale;  // division throughput bottleneck
    
    // Instead, hoist the reciprocal out of the loop:
    float inv = 1.0f / scale;  // one rounding step, outside the loop
    for (int i = 0; i < n; i++)
        a[i] = b[i] * inv;  // multiply (or store) throughput bottleneck
    

    All you're doing in the loop is load/divide/store, and they're independent so it's throughput that matters, not latency.

    A reduction like accumulator /= b[i] would bottleneck on divide or multiply latency, rather than throughput. But with multiple accumulators that you divide or multiply at the end, you can hide the latency and still saturate the throughput. Note that sum += a[i] / b[i] bottlenecks on add latency or div throughput, but not div latency because the division isn't on the critical path (the loop-carried dependency chain).
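    A sketch of the multiple-accumulator idea for a multiply reduction (the 4-way split and the name product4 are my own illustration): each chain carries only one multiply per four elements, so the four chains' latencies overlap instead of serializing.

    ```c
    #include <assert.h>

    /* Four independent loop-carried dependency chains; out-of-order
       execution can run them in parallel. n is assumed to be a
       multiple of 4 for brevity (a real version needs a cleanup loop). */
    static float product4(const float *a, int n) {
        float p0 = 1.0f, p1 = 1.0f, p2 = 1.0f, p3 = 1.0f;
        for (int i = 0; i < n; i += 4) {
            p0 *= a[i];
            p1 *= a[i + 1];
            p2 *= a[i + 2];
            p3 *= a[i + 3];
        }
        return (p0 * p1) * (p2 * p3);  /* combine the accumulators once, at the end */
    }

    int main(void) {
        float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        assert(product4(a, 8) == 40320.0f);  /* 8! is exact in float */
        return 0;
    }
    ```

    Note that this reassociates the reduction, so the last bits can differ from a strictly serial loop; that's the same trade -ffast-math makes when it auto-vectorizes a reduction for you.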


    But in something like this (approximating a function like log(x) with a ratio of two polynomials), the divide can be pretty cheap:

    for (int i = 0; i < n; i++) {
        // (not shown: extracting the exponent / mantissa)
        float p = polynomial(b[i], 1.23f, -4.56f, ...);  // FMA chain for a polynomial
        float q = polynomial(b[i], 3.21f, -6.54f, ...);
        a[i] = p / q;
    }
    

    For log() over the range of the mantissa, a ratio of two polynomials of order N has much less error than a single polynomial with 2N coefficients, and evaluating the two in parallel gives you some instruction-level parallelism within a single loop body instead of one massively long dependency chain, making things a LOT easier for out-of-order execution.
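    One way the polynomial(...) helper above could look (a sketch; the name poly3, the degree, and the coefficient values are mine, not a real log() approximation): Horner's rule, where each step is one fmaf, so each polynomial is one short FMA chain and the two chains depend only on x, not on each other.

    ```c
    #include <math.h>
    #include <assert.h>

    /* Degree-3 Horner evaluation: c0 + x*(c1 + x*(c2 + x*c3)),
       one fused multiply-add per step. */
    static float poly3(float x, float c0, float c1, float c2, float c3) {
        float r = fmaf(c3, x, c2);
        r = fmaf(r, x, c1);
        return fmaf(r, x, c0);
    }

    int main(void) {
        /* With all coefficients 1 this is 1 + x + x^2 + x^3; at x = 2 that's 15. */
        assert(poly3(2.0f, 1.0f, 1.0f, 1.0f, 1.0f) == 15.0f);

        /* p and q are independent of each other, so the CPU can overlap
           their FMA chains before the single divide: */
        float x = 0.75f;
        float p = poly3(x, 1.0f, 0.5f, -0.25f, 0.125f);   /* placeholder coefficients */
        float q = poly3(x, 2.0f, -1.0f, 0.5f, -0.125f);
        float ratio = p / q;
        (void)ratio;
        return 0;
    }
    ```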

    In this case, we don't bottleneck on divide latency because out-of-order execution can keep multiple iterations of the loop over the arrays in flight.

    We don't bottleneck on divide throughput as long as our polynomials are big enough that we only have one divide for every 10 FMA instructions or so. (And in a real log() use case, there'd be a bunch of work extracting the exponent / mantissa and combining things back together again, so there's even more work to do between divides.)


    When you do need to divide, usually it's best to just divide instead of rcpps

    x86 has an approximate-reciprocal instruction (rcpps), which only gives you 12 bits of precision. (AVX512F has 14 bits, and AVX512ER has 28 bits.)

    You can use this to do x / y = x * approx_recip(y) without using an actual divide instruction. (rcpps itself is fairly fast; usually a bit slower than multiplication. It uses a table lookup from a table internal to the CPU. The divider hardware may use the same table for a starting point.)

    For most purposes, x * rcpps(y) is too inaccurate, and a Newton-Raphson iteration to double the precision is required. But that costs you 2 multiplies and 2 FMAs, and has latency about as high as an actual divide instruction. If all you're doing is division, then it can be a throughput win. (But you should avoid that kind of loop in the first place if you can, maybe by doing the division as part of another loop that does other work.)
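    The rcpps + Newton-Raphson recipe can be sketched with SSE intrinsics (x86-only; the helper name recip_nr is mine). One NR step for a reciprocal is r' = r * (2 - d*r), which roughly doubles the ~12-bit approximation to near-full float precision:

    ```c
    #include <xmmintrin.h>
    #include <math.h>
    #include <assert.h>

    /* Approximate 1/d without a divide instruction: rcpss gives a
       ~12-bit estimate, one Newton-Raphson iteration refines it. */
    static float recip_nr(float d) {
        __m128 vd = _mm_set_ss(d);
        __m128 r  = _mm_rcp_ss(vd);                 /* ~12-bit approximation */
        __m128 two = _mm_set_ss(2.0f);
        /* NR step: r = r * (2 - d*r) */
        r = _mm_mul_ss(r, _mm_sub_ss(two, _mm_mul_ss(vd, r)));
        return _mm_cvtss_f32(r);
    }

    int main(void) {
        /* ~24 bits after refinement: very close to, but not guaranteed
           to be bit-identical to, a correctly rounded divide. */
        float r = recip_nr(3.0f);
        assert(fabsf(r - 1.0f / 3.0f) < 1e-6f);
        return 0;
    }
    ```

    As the answer says, this whole sequence (rcpps + 2 multiplies + a subtract here, or 2 mul + 2 FMA for a full x/y) has about the same latency as a real divps, so it only wins on throughput in divide-heavy loops.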

    But if you're using division as part of a more complex function, the rcpps itself + the extra mul + FMA usually makes it faster to just divide with a divps instruction, except on CPUs with very low divps throughput.

    (For example Knight's Landing, see below. KNL supports AVX512ER, so for float vectors the VRCP28PS result is already accurate enough to just multiply without a Newton-Raphson iteration. float mantissa size is only 24 bits.)


    Specific numbers from Agner Fog's tables:

    Unlike every other ALU operation, division latency/throughput is data-dependent on some CPUs. Again, this is because it's so slow and not fully pipelined. Out-of-order scheduling is easier with fixed latencies, because it avoids write-back conflicts (when the same execution port tries to produce 2 results in the same cycle, e.g. from running a 3 cycle instruction and then two 1-cycle operations).

    Generally, the fastest cases are when the divisor is a "round" number like 2.0 or 0.5 (i.e. the base-2 float representation has lots of trailing zeros in the mantissa).

    float latency (cycles) / throughput (cycles per instruction, running just that back to back with independent inputs):

                       scalar & 128b vector        256b AVX vector
                       divss      |  mulss
                       divps xmm  |  mulps           vdivps ymm | vmulps ymm
    
    Nehalem          7-14 /  7-14 | 5 / 1           (No AVX)
    Sandybridge     10-14 / 10-14 | 5 / 1        21-29 / 20-28 (3 uops) | 5 / 1
    Haswell         10-13 / 7     | 5 / 0.5       18-21 /   14 (3 uops) | 5 / 0.5
    Skylake            11 / 3     | 4 / 0.5          11 /    5 (1 uop)  | 4 / 0.5
    
    Piledriver       9-24 / 5-10  | 5-6 / 0.5      9-24 / 9-20 (2 uops) | 5-6 / 1 (2 uops)
    Ryzen              10 / 3     | 3 / 0.5         10  /    6 (2 uops) | 3 / 1 (2 uops)
    
     Low-power CPUs:
    Jaguar(scalar)     14 / 14    | 2 / 1
    Jaguar             19 / 19    | 2 / 1            38 /   38 (2 uops) | 2 / 2 (2 uops)
    
    Silvermont(scalar)    19 / 17    | 4 / 1
    Silvermont      39 / 39 (6 uops) | 5 / 2            (No AVX)
    
    KNL(scalar)     27 / 17 (3 uops) | 6 / 0.5
    KNL             32 / 20 (18uops) | 6 / 0.5        32 / 32 (18 uops) | 6 / 0.5  (AVX and AVX512)
    

    double latency (cycles) / throughput (cycles per instruction):

                       scalar & 128b vector        256b AVX vector
                       divsd      |  mulsd
                       divpd xmm  |  mulpd           vdivpd ymm | vmulpd ymm
    
    Nehalem         7-22 /  7-22 | 5 / 1        (No AVX)
    Sandybridge    10-22 / 10-22 | 5 / 1        21-45 / 20-44 (3 uops) | 5 / 1
    Haswell        10-20 /  8-14 | 5 / 0.5      19-35 / 16-28 (3 uops) | 5 / 0.5
    Skylake        13-14 /     4 | 4 / 0.5      13-14 /     8 (1 uop)  | 4 / 0.5
    
    Piledriver      9-27 /  5-10 | 5-6 / 1       9-27 / 9-18 (2 uops)  | 5-6 / 1 (2 uops)
    Ryzen           8-13 /  4-5  | 4 / 0.5       8-13 /  8-9 (2 uops)  | 4 / 1 (2 uops)
    
     Low-power CPUs:
    Jaguar            19 /   19  | 4 / 2            38 /  38 (2 uops)  | 4 / 2 (2 uops)
    
    Silvermont(scalar) 34 / 32    | 5 / 2
    Silvermont         69 / 69 (6 uops) | 5 / 2           (No AVX)
    
    KNL(scalar)      42 / 42 (3 uops) | 6 / 0.5   (Yes, Agner really lists scalar as slower than packed, but fewer uops)
    KNL              32 / 20 (18uops) | 6 / 0.5        32 / 32 (18 uops) | 6 / 0.5  (AVX and AVX512)
    

    Ivybridge and Broadwell are different too, but I wanted to keep the table small. (Core2 (before Nehalem) has better divider performance, but its max clock speeds were lower.)

    Atom, Silvermont, and even Knight's Landing (Xeon Phi based on Silvermont) have exceptionally low divide performance, and even a 128b vector is slower than scalar. AMD's low-power Jaguar CPU (used in some consoles) is similar. A high-performance divider takes a lot of die area. Xeon Phi has low power per-core, and packing lots of cores on a die gives it tighter die-area constraints than Skylake-AVX512. It seems that AVX512ER rcp28ps / pd is what you're "supposed" to use on KNL.

    (See this InstLatx64 result for Skylake-AVX512 aka Skylake-X. Numbers for vdivps zmm: 18c / 10c, so half the throughput of ymm.)


    Long latency chains become a problem when they're loop-carried, or when they're so long that they stop out-of-order execution from finding parallelism with other independent work.


    Footnote 1: how I made up those div vs. mul performance ratios:

    FP divide vs. multiply performance ratios are even worse than that in low-power CPUs like Silvermont and Jaguar, and even in Xeon Phi (KNL, where you should use AVX512ER).

    Actual divide/multiply throughput ratios for scalar (non-vectorized) double: 8 on Ryzen and Skylake with their beefed-up dividers, but 16-28 on Haswell (data-dependent, and more likely towards the 28 cycle end unless your divisors are round numbers). These modern CPUs have very powerful dividers, but their 2-per-clock multiply throughput blows it away. (Even more so when your code can auto-vectorize with 256b AVX vectors). Also note that with the right compiler options, those multiply throughputs also apply to FMA.

    Numbers from http://agner.org/optimize/ instruction tables for Intel Haswell/Skylake and AMD Ryzen, for SSE scalar (not including x87 fmul / fdiv) and for 256b AVX SIMD vectors of float or double. See also the x86 tag wiki.
