Vectorizable implementation of complementary error function erfcf()

后悔当初 2021-01-05 03:52

The complementary error function, erfc, is a special function closely related to the standard normal distribution. It is frequently used in statistics and the natural sciences. How can erfcf(), the single-precision version, be implemented accurately in a way that allows compilers to vectorize the code?

1 Answer
  •  清酒与你
    2021-01-05 04:42

    After looking into various approaches, the one that seems most suitable is the algorithm proposed in the following paper:

    M. M. Shepherd and J. G. Laframboise, "Chebyshev Approximation of (1 + 2x) exp(x²) erfc x in 0 ≤ x < ∞," Mathematics of Computation, Vol. 36, No. 153, January 1981, pp. 249-253

    The basic idea of the paper is to create an approximation to (1 + 2x) exp(x²) erfc(x), from which we can compute erfcx(x) by simply dividing by (1 + 2x), and erfc(x) by then multiplying with exp(-x²). The tightly bounded range of the function, with function values roughly in [1, 1.3], and its general "flatness" lend themselves well to polynomial approximation. The numerical properties of this approach are further improved by narrowing the approximation interval: the original argument x is transformed by q = (x - K) / (x + K), where K is a suitably chosen constant, followed by computing p(q), where p is a polynomial.
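
    To make the overall structure concrete before the numerical refinements discussed below, here is a deliberately naive sketch of this pipeline, using K = 2 and a hypothetical placeholder poly() for the core polynomial; it omits all of the error compensation and special-case handling developed in the remainder of this answer:

    #include <math.h>

    float poly (float q); /* hypothetical: minimax polynomial of the core approximation */

    /* naive sketch, for exposition only */
    float erfcf_naive (float x)
    {
        float q = (x - 2.0f) / (x + 2.0f); /* map argument range to [-1, 0.66818] */
        float p = poly (q);                /* ~ (1+2x)*exp(x*x)*erfc(x) */
        float y = p / (1.0f + 2.0f * x);   /* erfcx(x) = exp(x*x)*erfc(x) */
        return y * expf (-x * x);          /* erfc(x) */
    }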

    Since erfc(-x) = 2 - erfc(x), we only need to consider the interval [0, ∞), which is mapped to the interval [-1, 1] by this transformation. For IEEE-754 single precision, erfcf() vanishes (becomes zero) for x > 10.0546875, so one needs to consider only x ∈ [0, 10.0546875]. What is the "optimal" value of K for this range? I know of no mathematical analysis that would provide the answer; the paper suggests K = 3.75 based on experiments.
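
    As a quick sanity check of that cutoff, one can compare against double-precision erfc() from the C math library: near x = 10.0546875 the true result is on the order of the smallest subnormal float, so for larger arguments the correctly rounded single-precision result is zero.

    #include <math.h>
    #include <stdio.h>

    int main (void)
    {
        /* near the cutoff, erfc(x) is of the order of the smallest
           subnormal single-precision number 0x1p-149f (about 1.4e-45) */
        printf ("erfc(10.0546875) = %.6e\n", erfc (10.0546875));
        printf ("0x1p-149         = %.6e\n", (double)0x1p-149f);
        return 0;
    }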

    One can readily establish that for single-precision computation, a minimax polynomial approximation of degree 9 is sufficient for various values of K in that general vicinity. Systematically generating such approximations with the Remez algorithm, with K varying between 1.5 and 4 in steps of 1/16, the lowest approximation error is observed for K = {2, 2.625, 3.3125}. Of these, K = 2 is the most advantageous choice, since it lends itself to very accurate computation of (x - K) / (x + K), as shown in this question.
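
    Comparing such candidates requires measuring the worst-case error of each resulting implementation. A minimal sketch of one common way to do this, scanning every float in the relevant interval against double-precision erfc() from the C math library as the reference (the candidate my_erfcf() is the function shown further below):

    #include <math.h>
    #include <stdio.h>

    float my_erfcf (float);

    int main (void)
    {
        double max_ulp = 0.0;
        float worst = 0.0f;
        for (float x = 0.0f; x <= 10.0546875f; x = nextafterf (x, INFINITY)) {
            double ref = erfc ((double)x);
            /* spacing of single-precision numbers around the reference */
            double ulp = nextafterf ((float)ref, INFINITY) - (float)ref;
            double err = fabs ((double)my_erfcf (x) - ref) / ulp;
            if (err > max_ulp) { max_ulp = err; worst = x; }
        }
        printf ("max ulp error = %.5f at x = %.8e\n", max_ulp, worst);
        return 0;
    }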

    The value K = 2 and the input domain for x would suggest that it is necessary to use variant 4 from my answer; however, one can demonstrate experimentally that the less expensive variant 5 achieves the same accuracy here. This is likely due to the very shallow slope of the approximated function for q > -0.5, which causes any error in the argument q to be reduced by roughly a factor of ten.

    Since the computation of erfc() requires post-processing steps in addition to the initial approximation, it is clear that both of these computations must be performed with high accuracy in order to achieve a sufficiently accurate final result. Error-correcting techniques must be used.

    One observes that the most significant coefficient in the polynomial approximation of (1 + 2x) exp(x²) erfc(x) is of the form (1 + s), where s < 0.5. This means we can represent the leading coefficient more accurately by splitting off the 1 and using only s in the polynomial. So instead of computing a polynomial p(q) and then multiplying by the reciprocal r = 1 / (1 + 2x), it is mathematically equivalent but numerically advantageous to compute the core approximation as p(q) + 1 and use p to compute fmaf (p, r, r).
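
    In isolation, the difference between the two forms looks as follows; fmaf() computes p*r + r with a single rounding, so the low-order bits of p, which would be lost when rounding 1 + p to 24 bits, still contribute to the result:

    #include <math.h>

    /* numerically preferable way to compute (p + 1) * r */
    static float scaled_core (float p, float r)
    {
        return fmaf (p, r, r); /* instead of: (1.0f + p) * r */
    }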

    The accuracy of the division can be enhanced by computing an initial quotient q from the reciprocal r, computing the residual e = (p + 1) - q * (1 + 2x) with the help of an FMA, and then using e to apply the correction q = q + (e * r), again using an FMA.
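
    Written as a standalone helper, this residual-corrected division scheme looks roughly as follows; a sketch assuming a hardware FMA behind fmaf() (the code below adapts the scheme so that the numerator p + 1 is never formed explicitly):

    #include <math.h>

    /* quotient n / d refined by one residual correction step */
    static float div_refined (float n, float d)
    {
        float r = 1.0f / d;        /* reciprocal */
        float q = n * r;           /* initial quotient */
        float e = fmaf (q, -d, n); /* residual n - q*d, with a single rounding */
        return fmaf (e, r, q);     /* corrected quotient q + e*r */
    }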

    Exponentiation has error-magnification properties, therefore the computation of e^(-x²) must be performed carefully. The availability of FMA trivially allows the computation of -x² as a double-float s_high:s_low. e^x is its own derivative, so one can compute e^(s_high:s_low) as e^(s_high) + e^(s_high) * s_low. This computation can be combined with the multiplication by the previous intermediate result r to yield r = r * e^(s_high) + r * e^(s_high) * s_low. By use of FMA, one ensures that the most significant term r * e^(s_high) is computed as accurately as possible.
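
    Isolated from the rest of the computation, this step corresponds to the following sketch (my_expf() as defined further below, or any sufficiently accurate expf()):

    #include <math.h>

    float my_expf (float);

    /* multiply r by exp(-a*a) while compensating the rounding error of a*a:
       with s = fl(a*a), t = fmaf(a, -a, s) recovers the rounding error
       exactly, so -a*a = -s + t and exp(-a*a) = exp(-s)*exp(t), which is
       approximately exp(-s)*(1 + t) since t is tiny */
    static float mul_by_exp_neg_squared (float r, float a)
    {
        float s = a * a;
        float e = my_expf (-s);
        float t = fmaf (a, -a, s);
        return fmaf (r, e, r * e * t); /* r*exp(-s) + r*exp(-s)*t */
    }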

    Combining the steps above with a few simple selections to handle exceptional cases and negative arguments, one arrives at the following C code:

    float my_expf (float);
    
    /*  
     * Based on: M. M. Shepherd and J. G. Laframboise, "Chebyshev Approximation of 
     * (1+2x)exp(x^2)erfc x in 0 <= x < INF", Mathematics of Computation, Vol. 36,
     * No. 153, January 1981, pp. 249-253.  
     */  
    float my_erfcf (float x)
    {
        float a, d, e, m, p, q, r, s, t;
    
        a = fabsf (x); 
    
        /* Compute q = (a-2)/(a+2) accurately. [0, 10.0546875] -> [-1, 0.66818] */
        m = a - 2.0f;
        p = a + 2.0f;
        r = 1.0f / p;
        q = m * r;
        t = fmaf (q + 1.0f, -2.0f, a); 
        e = fmaf (q, -a, t); 
        q = fmaf (r, e, q); 
    
        /* Approximate (1+2*a)*exp(a*a)*erfc(a) as p(q)+1 for q in [-1, 0.66818] */
        p =             -0x1.a48024p-12f;  // -4.01020574e-4
        p = fmaf (p, q, -0x1.42a172p-10f); // -1.23073824e-3
        p = fmaf (p, q,  0x1.585784p-10f); //  1.31355994e-3
        p = fmaf (p, q,  0x1.1ade24p-07f); //  8.63243826e-3
        p = fmaf (p, q, -0x1.081b72p-07f); // -8.05991236e-3
        p = fmaf (p, q, -0x1.bc0b94p-05f); // -5.42047396e-2
        p = fmaf (p, q,  0x1.4ffc40p-03f); //  1.64055347e-1
        p = fmaf (p, q, -0x1.540840p-03f); // -1.66031361e-1
        p = fmaf (p, q, -0x1.7bf612p-04f); // -9.27639678e-2
        p = fmaf (p, q,  0x1.1ba03ap-02f); //  2.76978403e-1
    
        /* Divide (1+p) by (1+2*a) ==> exp(a*a)*erfc(a) */
        t = a + a;
        d = t + 1.0f;
        r = 1.0f / d;
        q = fmaf (p, r, r); // q = (p+1)/(1+2*a)
        e = (p - q) + fmaf (q, -t, 1.0f); // (p+1) - q*(1+2*a)
        r = fmaf (e, r, q);
    
        /* Multiply by exp(-a*a) ==> erfc(a) */
        s = a * a; 
        e = my_expf (-s);  
        t = fmaf (a, -a, s);
        r = fmaf (r, e, r * e * t);
    
        /* Handle NaN arguments to erfc() */
        if (!(a <= 0x1.fffffep127f)) r = x + x;
    
        /* Clamp result for large arguments */
        if (a > 10.0546875f) r = 0.0f;
    
        /* Handle negative arguments to erfc() */
        if (x < 0.0f) r = 2.0f - r; 
    
        return r;
    }
    
    /* Compute exponential base e. Maximum ulp error = 0.87161 */
    float my_expf (float a)
    {
        float c, f, r;
        int i;
    
        // exp(a) = exp(i + f); i = rint (a / log(2)) 
        c = 0x1.800000p+23f; // 1.25829120e+7
        r = fmaf (0x1.715476p+0f, a, c) - c; // 1.44269502e+0
        f = fmaf (r, -0x1.62e400p-01f, a); // -6.93145752e-1 // log_2_hi 
        f = fmaf (r, -0x1.7f7d1cp-20f, f); // -1.42860677e-6 // log_2_lo
        i = (int)r;
        // approximate r = exp(f) on interval [-log(2)/2,+log(2)/2]
        r =             0x1.6a98dap-10f;  // 1.38319808e-3
        r = fmaf (r, f, 0x1.1272cap-07f); // 8.37550033e-3
        r = fmaf (r, f, 0x1.555a20p-05f); // 4.16689515e-2
        r = fmaf (r, f, 0x1.55542ep-03f); // 1.66664466e-1
        r = fmaf (r, f, 0x1.fffff6p-02f); // 4.99999851e-1
        r = fmaf (r, f, 0x1.000000p+00f); // 1.00000000e+0
        r = fmaf (r, f, 0x1.000000p+00f); // 1.00000000e+0
        // exp(a) = 2**i * exp(f);
        r = ldexpf (r, i);
        // handle special cases
        if (!(fabsf (a) < 104.0f)) {
            r = a + a; // handle NaNs
            if (a < 0.0f) r = 0.0f;
            if (a > 0.0f) r = 1e38f * 1e38f; // + INF
        }
        return r;
    }
    

    I used my own implementation of expf() in the above code to isolate my work from differences in the expf() implementations on different compute platforms, but any implementation of expf() whose maximum error is close to 0.5 ulp should work well. When using my_expf() as shown above, my_erfcf() has a maximum error of 2.65712 ulps. Provided a vectorizable expf() is available, the code above should vectorize without problem. I did a quick check with the Intel compiler 13.1.3.198: I put a call to my_erfcf() in a loop, added #include <math.h>, replaced the call to my_expf() with a call to expf(), then compiled using these command line switches:

    /Qstd=c99 /O3 /QxCORE-AVX2 /fp:precise /Qfma /Qimf-precision:high:expf /Qvec_report=2
    

    The Intel compiler reported that the loop had been vectorized, which I double-checked by inspection of the disassembled binary code.
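
    The loop in question was of the following shape; a sketch, with the harness details being incidental:

    #include <math.h>

    float my_erfcf (float);

    /* loop of the kind used for the vectorization check: the body is
       straight-line floating-point code once my_erfcf() is inlined */
    void erfcf_array (const float *in, float *out, int n)
    {
        for (int i = 0; i < n; i++) {
            out[i] = my_erfcf (in[i]);
        }
    }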

    Since my_erfcf() only uses reciprocals rather than full divisions, it is amenable to the use of fast reciprocal implementations, provided they deliver almost correctly rounded results. For processors that provide a fast single-precision reciprocal approximation in hardware, this can easily be achieved by coupling that approximation with a Halley iteration with cubic convergence. A (scalar) example of this approach for x86 processors is:

    #include <math.h>      /* fmaf */
    #include <xmmintrin.h> /* __m128, _mm_set_ss, _mm_rcp_ss, _mm_store_ss */

    /* Compute 1.0f / a almost correctly rounded. Halley iteration with cubic convergence */
    float fast_recipf (float a)
    {
        __m128 t;
        float e, r;
        t = _mm_set_ss (a);
        t = _mm_rcp_ss (t);     /* hardware reciprocal approximation */
        _mm_store_ss (&r, t);
        e = fmaf (r, -a, 1.0f); /* residual e = 1 - r*a */
        e = fmaf (e, e, e);     /* correction term e + e*e */
        r = fmaf (e, r, r);     /* r = r + r*(e + e*e) */
        return r;
    }
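
    Inside my_erfcf(), this drops in for the two reciprocal computations, for example:

        r = fast_recipf (p); /* instead of: r = 1.0f / p; */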
    
