Efficiently computing (a - K) / (a + K) with improved accuracy

抹茶落季 2021-02-18 15:44

In various contexts, for example in argument reduction for mathematical functions, one needs to compute (a - K) / (a + K), where a is a positive v

6 answers
  •  北恋 (original poster)
    2021-02-18 16:21

    Since my goal is to merely widen the interval on which accurate results are achieved, rather than to find a solution that works for all possible values of a, making use of double-float arithmetic for all intermediate computation seems too costly.

    Thinking some more about the problem, it is clear that the computation of the remainder of the division, e in the code from my question, is the crucial part of achieving a more accurate result. Mathematically, the remainder is (a-K) - q * (a+K). In my code, I simply used m to represent (a-K) and represented (a+K) as m + 2*K, as this delivers numerically superior results to the straightforward representation.

    With relatively small additional computational cost, (a+K) can be represented as a double-float, that is, a head-tail pair p:plo, which leads to the following modified version of my original code:

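    /* Compute q = (a - K) / (a + K) with improved accuracy. Variant 2 */
    m = a - K;                 /* numerator, rounded */
    p = a + K;                 /* head of denominator */
    r = 1.0f / p;              /* approximate reciprocal */
    q = m * r;                 /* preliminary quotient */
    mx = fmaxf (a, K);
    mn = fminf (a, K);
    plo = (mx - p) + mn;       /* tail of denominator: a + K = p + plo */
    t = fmaf (q, -p, m);       /* m - q * p */
    e = fmaf (q, -plo, t);     /* residual m - q * (p + plo) */
    q = fmaf (r, e, q);        /* corrected quotient */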
    

    Testing shows that this delivers nearly correctly rounded results for a in [K/2, 2^24*K), allowing for a substantial increase to the upper bound of the interval on which accurate results are achieved.

    Widening the interval at the lower end requires a more accurate representation of (a-K). We can compute this as a double-float head-tail pair m:mlo, which leads to the following code variant:

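    /* Compute q = (a - K) / (a + K) with improved accuracy. Variant 3 */
    m = a - K;                                             /* head of numerator */
    p = a + K;                                             /* head of denominator */
    r = 1.0f / p;                                          /* approximate reciprocal */
    q = m * r;                                             /* preliminary quotient */
    plo = (a < K) ? ((K - p) + a) : ((a - p) + K);         /* tail of denominator */
    mlo = (a < K) ? (a - (K + m)) : ((a - m) - K);         /* tail of numerator */
    t = fmaf (q, -p, m);                                   /* m - q * p */
    e = fmaf (q, -plo, t);                                 /* m - q * (p + plo) */
    e = e + mlo;                                           /* residual (m + mlo) - q * (p + plo) */
    q = fmaf (r, e, q);                                    /* corrected quotient */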
    

    Exhaustive testing shows that this delivers nearly correctly rounded results for a in the interval [K/2^24, K*2^24). Unfortunately, this comes at a cost of ten additional operations compared to the code in my question, which is a steep price to pay to get the maximum error from around 1.625 ulps with the naive computation down to near 0.5 ulp.

    As in my original code from the question, one can express (a+K) in terms of (a-K), thus eliminating the computation of the tail of p, plo. This approach results in the following code:

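    /* Compute q = (a - K) / (a + K) with improved accuracy. Variant 4 */
    m = a - K;                                             /* head of numerator */
    p = a + K;                                             /* denominator, rounded */
    r = 1.0f / p;                                          /* approximate reciprocal */
    q = m * r;                                             /* preliminary quotient */
    mlo = (a < K) ? (a - (K + m)) : ((a - m) - K);         /* tail of numerator */
    t = fmaf (q, -2.0f*K, m);                              /* m - q * 2K */
    t = fmaf (q, -m, t);                                   /* m - q * (m + 2K) */
    e = fmaf (q - 1.0f, -mlo, t);                          /* add (1 - q) * mlo */
    q = fmaf (r, e, q);                                    /* corrected quotient */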
    

    This turns out to be advantageous if the main focus is decreasing the lower limit of the interval, which is my particular focus as explained in the question. Exhaustive testing of the single-precision case shows that when K = 2^n, nearly correctly rounded results are produced for values of a in the interval [K/2^24, 4.23*K]. With a total of 14 or 15 operations (depending on whether an architecture supports full predication or just conditional moves), this requires seven to eight more operations than my original code.

    Lastly, one might base the residual computation directly on the original variable a to avoid the error inherent in the computation of m and p. This leads to the following code that, for K = 2^n, computes nearly correctly rounded results for a in the interval [K/2^24, K/3):

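    /* Compute q = (a - K) / (a + K) with improved accuracy. Variant 5 */
    m = a - K;                     /* numerator, rounded */
    p = a + K;                     /* denominator, rounded */
    r = 1.0f / p;                  /* approximate reciprocal */
    q = m * r;                     /* preliminary quotient */
    t = fmaf (q + 1.0f, -K, a);    /* a - (q + 1) * K */
    e = fmaf (q, -a, t);           /* residual (a - K) - q * (a + K) */
    q = fmaf (r, e, q);            /* corrected quotient */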
    
