Optimizations for pow() with const non-integer exponent?

旧时难觅i 2020-12-04 09:17

I have hot spots in my code where I'm doing pow() taking up around 10-20% of my execution time.

My input to pow(x,y) is very specific, so

10 answers
  • 2020-12-04 09:33

    I shall answer the question you really wanted to ask, which is how to do fast sRGB <-> linear RGB conversion. To do this precisely and efficiently we can use polynomial approximations. The following polynomial approximations have been generated with sollya, and have a worst case relative error of 0.0144%.

    inline double poly7(double x, double a, double b, double c, double d,
                                  double e, double f, double g, double h) {
        double ab, cd, ef, gh, abcd, efgh, x2, x4;
        x2 = x*x; x4 = x2*x2;
        ab = a*x + b; cd = c*x + d;
        ef = e*x + f; gh = g*x + h;
        abcd = ab*x2 + cd; efgh = ef*x2 + gh;
        return abcd*x4 + efgh;
    }
    
    inline double srgb_to_linear(double x) {
        if (x <= 0.04045) return x / 12.92;
    
        // Polynomial approximation of ((x+0.055)/1.055)^2.4.
        return poly7(x, 0.15237971711927983387,
                       -0.57235993072870072762,
                        0.92097986411523535821,
                       -0.90208229831912012386,
                        0.88348956209696805075,
                        0.48110797889132134175,
                        0.03563925285274562038,
                        0.00084585397227064120);
    }
    
    inline double linear_to_srgb(double x) {
        if (x <= 0.0031308) return x * 12.92;
    
        // Piecewise polynomial approximation (divided by x^3)
        // of 1.055 * x^(1/2.4) - 0.055.
        if (x <= 0.0523) return poly7(x, -6681.49576364495442248881,
                                          1224.97114922729451791383,
                                          -100.23413743425112443219,
                                             6.60361150127077944916,
                                             0.06114808961060447245,
                                            -0.00022244138470139442,
                                             0.00000041231840827815,
                                            -0.00000000035133685895) / (x*x*x);
    
        return poly7(x, -0.18730034115395793881,
                         0.64677431008037400417,
                        -0.99032868647877825286,
                         1.20939072663263713636,
                         0.33433459165487383613,
                        -0.01345095746411287783,
                         0.00044351684288719036,
                        -0.00000664263587520855) / (x*x*x);
    }
    

    And the sollya input used to generate the polynomials:

    suppressmessage(174);
    f = ((x+0.055)/1.055)^2.4;
    p0 = fpminimax(f, 7, [|D...|], [0.04045;1], relative);
    p = fpminimax(f/(p0(1)+1e-18), 7, [|D...|], [0.04045;1], relative);
    print("relative:", dirtyinfnorm((f-p)/f, [0.04045;1]));
    print("absolute:", dirtyinfnorm((f-p), [0.04045;1]));
    print(canonical(p));
    
    s = 0.0523;
    z = 3;
    f = 1.055 * x^(1/2.4) - 0.055;
    
    p = fpminimax(1.055 * (x^(z+1/2.4) - 0.055*x^z/1.055), 7, [|D...|], [0.0031308;s], relative)/x^z;
    print("relative:", dirtyinfnorm((f-p)/f, [0.0031308;s]));
    print("absolute:", dirtyinfnorm((f-p), [0.0031308;s]));
    print(canonical(p));
    
    p = fpminimax(1.055 * (x^(z+1/2.4) - 0.055*x^z/1.055), 7, [|D...|], [s;1], relative)/x^z;
    print("relative:", dirtyinfnorm((f-p)/f, [s;1]));
    print("absolute:", dirtyinfnorm((f-p), [s;1]));
    print(canonical(p));
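
    To sanity-check the quoted error bound without sollya, one option is to compare the polynomial against std::pow on a dense grid. The sketch below copies poly7 and srgb_to_linear from this answer; srgb_to_linear_ref and max_relative_error are hypothetical helper names introduced only for this check, and a sampled maximum is only a lower bound on the true worst case.

```cpp
#include <algorithm>
#include <cmath>

// Copy of the answer's degree-7 evaluator (Estrin-style pairing of terms).
inline double poly7(double x, double a, double b, double c, double d,
                    double e, double f, double g, double h) {
    double x2 = x * x, x4 = x2 * x2;
    double abcd = (a * x + b) * x2 + (c * x + d);
    double efgh = (e * x + f) * x2 + (g * x + h);
    return abcd * x4 + efgh;
}

// Copy of the answer's approximation.
inline double srgb_to_linear(double x) {
    if (x <= 0.04045) return x / 12.92;
    return poly7(x, 0.15237971711927983387,
                   -0.57235993072870072762,
                    0.92097986411523535821,
                   -0.90208229831912012386,
                    0.88348956209696805075,
                    0.48110797889132134175,
                    0.03563925285274562038,
                    0.00084585397227064120);
}

// Reference implementation using pow() (hypothetical helper).
inline double srgb_to_linear_ref(double x) {
    return x <= 0.04045 ? x / 12.92 : std::pow((x + 0.055) / 1.055, 2.4);
}

// Largest relative error over `steps` evenly spaced samples in (0.04045, 1]
// (hypothetical helper).
double max_relative_error(int steps) {
    double worst = 0.0;
    for (int i = 1; i <= steps; ++i) {
        double x = 0.04045 + (1.0 - 0.04045) * i / steps;
        double ref = srgb_to_linear_ref(x);
        worst = std::max(worst, std::fabs(srgb_to_linear(x) - ref) / ref);
    }
    return worst;
}
```

    Calling max_relative_error(10000) and printing the result gives a figure to hold against the 0.0144% claim; the same harness can be pointed at linear_to_srgb by swapping the two functions and the interval.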
    
  • 2020-12-04 09:36

    In the IEEE 754 hacking vein, here is another solution which is faster and less "magical." It achieves an error margin of .08% in about a dozen clock cycles (for the case of p=2.4, on an Intel Merom CPU).

    Floating point numbers were originally invented as an approximation to logarithms, so you can use the integer value as an approximation of log2. This is somewhat-portably achievable by applying the convert-from-integer instruction to a floating-point value, to obtain another floating-point value.

    To complete the pow computation, you can multiply by a constant factor and convert the logarithm back with the convert-to-integer instruction. On SSE, the relevant instructions are cvtdq2ps and cvtps2dq.

    It's not quite so simple, though. The exponent field in IEEE 754 is stored with a bias: a field value of 127 represents an exponent of zero. This bias must be removed before you multiply the logarithm, and re-added before you exponentiate. Furthermore, bias adjustment by subtraction won't work on zero. Fortunately, both adjustments can be achieved by multiplying by a constant factor beforehand.

    x^p
    = exp2( p * log2( x ) )
    = exp2( p * ( log2( x ) + 127 - 127 ) - 127 + 127 )
    = cvtps2dq( p * ( log2( x ) + 127 - 127 + 127 / p ) )
    = cvtps2dq( p * ( log2( x ) + 127 - log2( exp2( 127 - 127 / p ) ) ) )
    = cvtps2dq( p * ( log2( x * exp2( 127 / p - 127 ) ) + 127 ) )
    = cvtps2dq( p * ( cvtdq2ps( x * exp2( 127 / p - 127 ) ) ) )
    

    exp2( 127 / p - 127 ) is the constant factor. This function is rather specialized: it won't work with small fractional exponents, because the constant factor grows exponentially with the inverse of the exponent and will overflow. It won't work with negative exponents. Large exponents lead to high error, because the mantissa bits are mingled with the exponent bits by the multiplication.

    But, it's just 4 fast instructions long. Pre-multiply, convert from "integer" (to logarithm), power-multiply, convert to "integer" (from logarithm). Conversions are very fast on this implementation of SSE. We can also squeeze an extra constant coefficient into the first multiplication.

    template< unsigned expnum, unsigned expden, unsigned coeffnum, unsigned coeffden >
    __m128 fastpow( __m128 arg ) {
            __m128 ret = arg;
    //      std::printf( "arg = %,vg\n", ret );
            // Apply a constant pre-correction factor.
            ret = _mm_mul_ps( ret, _mm_set1_ps( exp2( 127. * expden / expnum - 127. )
                    * pow( 1. * coeffnum / coeffden, 1. * expden / expnum ) ) );
    //      std::printf( "scaled = %,vg\n", ret );
            // Reinterpret arg as integer to obtain logarithm.
            asm ( "cvtdq2ps %1, %0" : "=x" (ret) : "x" (ret) );
    //      std::printf( "log = %,vg\n", ret );
            // Multiply logarithm by power.
            ret = _mm_mul_ps( ret, _mm_set1_ps( 1. * expnum / expden ) );
    //      std::printf( "powered = %,vg\n", ret );
            // Convert back to "integer" to exponentiate.
            asm ( "cvtps2dq %1, %0" : "=x" (ret) : "x" (ret) );
    //      std::printf( "result = %,vg\n", ret );
            return ret;
    }
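
    For readers without GCC-style inline asm, the same four steps can be sketched in portable scalar code. This is an illustration under assumptions, not the author's routine: fastpow_scalar is a hypothetical name, memcpy stands in for the register reinterpretation, and the plain float/int casts stand in for cvtdq2ps and cvtps2dq. It is only meant for positive exponents and for inputs where the intermediate value stays inside the int32 range.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Scalar sketch of the four-step trick for x^p, with x >= 0 and p > 0.
float fastpow_scalar(float x, double p) {
    // 1. Pre-multiply by exp2(127/p - 127), folding both bias corrections
    //    into a single constant.
    float scaled = x * static_cast<float>(std::exp2(127.0 / p - 127.0));

    // 2. Treat the bit pattern as an integer and convert it to float:
    //    this is roughly 2^23 * (log2(scaled) + 127).
    std::int32_t bits;
    std::memcpy(&bits, &scaled, sizeof bits);
    float logish = static_cast<float>(bits);

    // 3. Multiply the "logarithm" by the power.
    float powered = logish * static_cast<float>(p);

    // 4. Convert back to an integer and reinterpret those bits as a float.
    //    The value is already integral at this magnitude, so the cast is exact;
    //    it must stay below 2^31 for the cast to be defined.
    std::int32_t out_bits = static_cast<std::int32_t>(powered);
    float result;
    std::memcpy(&result, &out_bits, sizeof result);
    return result;
}
```

    With p = 2.4 the result is crude, with errors of several percent in line with the 4-instruction figures in the statistics further down, which is why the rest of this answer combines two estimates.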
    

    A few trials with exponent = 2.4 show this consistently overestimates by about 5%. (The routine is always guaranteed to overestimate.) You could simply multiply by 0.95, but a few more instructions will get us about 4 decimal digits of accuracy, which should be enough for graphics.

    The key is to match the overestimate with an underestimate, and take the average.

    • Compute x^0.8: four instructions, error ~ +3%.
    • Compute x^-0.4: one rsqrtps. (This is quite accurate enough, but does sacrifice the ability to work with zero.)
    • Compute x^0.4: one mulps.
    • Compute x^-0.2: one rsqrtps.
    • Compute x^2: one mulps.
    • Compute x^3: one mulps.
    • x^2.4 = x^2 * x^0.4: one mulps. This is the overestimate.
    • x^2.4 = x^3 * x^-0.4 * x^-0.2: two mulps. This is the underestimate.
    • Average the above: one addps, one mulps.

    Instruction tally: fourteen, including two conversions with latency = 5 and two reciprocal square root estimates with throughput = 4.

    To properly take the average, we want to weight the estimates by their expected errors. The underestimate raises the error to a power of 0.6 vs 0.4, so we expect it to be 1.5x as erroneous. Weighting doesn't add any instructions; it can be done in the pre-factor. Calling the coefficient a: a^0.5 = 1.5 a^-0.75, and a = 1.38316186.
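
    The two constants used in the code can be reproduced from that equation. The sketch below assumes the x^0.8 estimate is pre-scaled by a, so the overestimate carries a factor a^0.5 and the underestimate a factor a^-0.75; the function names are made up for this example.

```cpp
#include <cmath>

// Solve a^0.5 = 1.5 * a^-0.75, i.e. a^1.25 = 1.5, for the weighting coefficient.
double averaging_coefficient() {
    return std::pow(1.5, 1.0 / 1.25);
}

// With an exact input, the two scaled estimates sum to
// (a^0.5 + a^-0.75) * x^2.4, so this is the divisor that restores x^2.4.
double averaging_normalizer() {
    double a = averaging_coefficient();
    return std::pow(a, 0.5) + std::pow(a, -0.75);
}
```

    These evaluate to roughly 1.38316186 and 1.960131704, the values passed to fastpow and used in the final multiplication.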

    The final error is about .015%, or 2 orders of magnitude better than the initial fastpow result. The runtime is about a dozen cycles for a busy loop with volatile source and destination variables… although it's overlapping the iterations, real-world usage will also see instruction-level parallelism. Considering SIMD, that's a throughput of one scalar result per 3 cycles!

    int main() {
            __m128 const x0 = _mm_set_ps( 0.01, 1, 5, 1234.567 );
            std::printf( "Input: %,vg\n", x0 );
    
            // Approx 5% accuracy from one call. Always an overestimate.
            __m128 x1 = fastpow< 24, 10, 1, 1 >( x0 );
            std::printf( "Direct x^2.4: %,vg\n", x1 );
    
            // Lower exponents provide lower initial error, but too low causes overflow.
            __m128 xf = fastpow< 8, 10, int( 1.38316186 * 1e9 ), int( 1e9 ) >( x0 );
            std::printf( "1.38 x^0.8: %,vg\n", xf );
    
            // Imprecise 4-cycle sqrt is still far better than fastpow, good enough.
            __m128 xfm4 = _mm_rsqrt_ps( xf );
            __m128 xf4 = _mm_mul_ps( xf, xfm4 );
    
            // Precisely calculate x^2 and x^3
            __m128 x2 = _mm_mul_ps( x0, x0 );
            __m128 x3 = _mm_mul_ps( x2, x0 );
    
            // Overestimate of x^2 * x^0.4
            x2 = _mm_mul_ps( x2, xf4 );
    
            // Get x^-0.2 from x^0.4. Combine with x^-0.4 into x^-0.6 and x^2.4.
            __m128 xfm2 = _mm_rsqrt_ps( xf4 );
            x3 = _mm_mul_ps( x3, xfm4 );
            x3 = _mm_mul_ps( x3, xfm2 );
    
            std::printf( "x^2 * x^0.4: %,vg\n", x2 );
            std::printf( "x^3 / x^0.6: %,vg\n", x3 );
            x2 = _mm_mul_ps( _mm_add_ps( x2, x3 ), _mm_set1_ps( 1/ 1.960131704207789 ) );
            // Final accuracy about 0.015%, 200x better than x^0.8 calculation.
            std::printf( "average = %,vg\n", x2 );
    }
    

    Well… sorry I wasn't able to post this sooner. And extending it to x^1/2.4 is left as an exercise ;v) .


    Update with stats

    I implemented a little test harness and two x^(5/12) cases corresponding to the above.

    #include <cstdio>
    #include <emmintrin.h> // SSE2: __m128i and _mm_add_epi32 (also pulls in <xmmintrin.h>)
    #include <cmath>
    #include <cfloat>
    #include <algorithm>
    using namespace std;
    
    template< unsigned expnum, unsigned expden, unsigned coeffnum, unsigned coeffden >
    __m128 fastpow( __m128 arg ) {
        __m128 ret = arg;
    //  std::printf( "arg = %,vg\n", ret );
        // Apply a constant pre-correction factor.
        ret = _mm_mul_ps( ret, _mm_set1_ps( exp2( 127. * expden / expnum - 127. )
            * pow( 1. * coeffnum / coeffden, 1. * expden / expnum ) ) );
    //  std::printf( "scaled = %,vg\n", ret );
        // Reinterpret arg as integer to obtain logarithm.
        asm ( "cvtdq2ps %1, %0" : "=x" (ret) : "x" (ret) );
    //  std::printf( "log = %,vg\n", ret );
        // Multiply logarithm by power.
        ret = _mm_mul_ps( ret, _mm_set1_ps( 1. * expnum / expden ) );
    //  std::printf( "powered = %,vg\n", ret );
        // Convert back to "integer" to exponentiate.
        asm ( "cvtps2dq %1, %0" : "=x" (ret) : "x" (ret) );
    //  std::printf( "result = %,vg\n", ret );
        return ret;
    }
    
    __m128 pow125_4( __m128 arg ) {
        // Lower exponents provide lower initial error, but too low causes overflow.
        __m128 xf = fastpow< 4, 5, int( 1.38316186 * 1e9 ), int( 1e9 ) >( arg );
    
        // Imprecise 4-cycle sqrt is still far better than fastpow, good enough.
        __m128 xfm4 = _mm_rsqrt_ps( xf );
        __m128 xf4 = _mm_mul_ps( xf, xfm4 );
    
        // Precisely calculate x^2 and x^3
        __m128 x2 = _mm_mul_ps( arg, arg );
        __m128 x3 = _mm_mul_ps( x2, arg );
    
        // Overestimate of x^2 * x^0.4
        x2 = _mm_mul_ps( x2, xf4 );
    
        // Get x^-0.2 from x^0.4. Combine with x^-0.4 into x^-0.6 and x^2.4.
        __m128 xfm2 = _mm_rsqrt_ps( xf4 );
        x3 = _mm_mul_ps( x3, xfm4 );
        x3 = _mm_mul_ps( x3, xfm2 );
    
        return _mm_mul_ps( _mm_add_ps( x2, x3 ), _mm_set1_ps( 1/ 1.960131704207789 * 0.9999 ) );
    }
    
    __m128 pow512_2( __m128 arg ) {
        // 5/12 is too small, so compute the sqrt of 10/12 instead.
        __m128 x = fastpow< 5, 6, int( 0.992245 * 1e9 ), int( 1e9 ) >( arg );
        return _mm_mul_ps( _mm_rsqrt_ps( x ), x );
    }
    
    __m128 pow512_4( __m128 arg ) {
        // 5/12 is too small, so compute the 4th root of 20/12 instead.
        // 20/12 = 5/3 = 1 + 2/3 = 2 - 1/3. 2/3 is a suitable argument for fastpow.
        // weighting coefficient: a^-1/2 = 2 a; a = 2^-2/3
        __m128 xf = fastpow< 2, 3, int( 0.629960524947437 * 1e9 ), int( 1e9 ) >( arg );
        __m128 xover = _mm_mul_ps( arg, xf );
    
        __m128 xfm1 = _mm_rsqrt_ps( xf );
        __m128 x2 = _mm_mul_ps( arg, arg );
        __m128 xunder = _mm_mul_ps( x2, xfm1 );
    
        // sqrt2 * over + 2 * sqrt2 * under
        __m128 xavg = _mm_mul_ps( _mm_set1_ps( 1/( 3 * 0.629960524947437 ) * 0.999852 ),
                                    _mm_add_ps( xover, xunder ) );
    
        xavg = _mm_mul_ps( xavg, _mm_rsqrt_ps( xavg ) );
        xavg = _mm_mul_ps( xavg, _mm_rsqrt_ps( xavg ) );
        return xavg;
    }
    
    __m128 mm_succ_ps( __m128 arg ) {
        return (__m128) _mm_add_epi32( (__m128i) arg, _mm_set1_epi32( 4 ) );
    }
    
    void test_pow( double p, __m128 (*f)( __m128 ) ) {
        __m128 arg;
    
        for ( arg = _mm_set1_ps( FLT_MIN / FLT_EPSILON );
                ! isfinite( _mm_cvtss_f32( f( arg ) ) );
                arg = mm_succ_ps( arg ) ) ;
    
        for ( ; _mm_cvtss_f32( f( arg ) ) == 0;
                arg = mm_succ_ps( arg ) ) ;
    
        std::printf( "Domain from %g\n", _mm_cvtss_f32( arg ) );
    
        int n;
        int const bucket_size = 1 << 25;
        do {
            float max_error = 0;
            double total_error = 0, cum_error = 0;
            for ( n = 0; n != bucket_size; ++ n ) {
                float result = _mm_cvtss_f32( f( arg ) );
    
                if ( ! isfinite( result ) ) break;
    
                float actual = ::powf( _mm_cvtss_f32( arg ), p );
    
                float error = ( result - actual ) / actual;
                cum_error += error;
                error = std::abs( error );
                max_error = std::max( max_error, error );
                total_error += error;
    
                arg = mm_succ_ps( arg );
            }
    
            std::printf( "error max = %8g\t" "avg = %8g\t" "|avg| = %8g\t" "to %8g\n",
                        max_error, cum_error / n, total_error / n, _mm_cvtss_f32( arg ) );
        } while ( n == bucket_size );
    }
    
    int main() {
        std::printf( "4 insn x^12/5:\n" );
        test_pow( 12./5, & fastpow< 12, 5, 1059, 1000 > );
        std::printf( "14 insn x^12/5:\n" );
        test_pow( 12./5, & pow125_4 );
        std::printf( "6 insn x^5/12:\n" );
        test_pow( 5./12, & pow512_2 );
        std::printf( "14 insn x^5/12:\n" );
        test_pow( 5./12, & pow512_4 );
    }
    

    Output:

    4 insn x^12/5:
    Domain from 1.36909e-23
    error max =      inf    avg =      inf  |avg| =      inf    to 8.97249e-19
    error max =  2267.14    avg =  139.175  |avg| =  139.193    to 5.88021e-14
    error max = 0.123606    avg = -0.000102963  |avg| = 0.0371122   to 3.85365e-09
    error max = 0.123607    avg = -0.000108978  |avg| = 0.0368548   to 0.000252553
    error max =  0.12361    avg = 7.28909e-05   |avg| = 0.037507    to  16.5513
    error max = 0.123612    avg = -0.000258619  |avg| = 0.0365618   to 1.08471e+06
    error max = 0.123611    avg = 8.70966e-05   |avg| = 0.0374369   to 7.10874e+10
    error max =  0.12361    avg = -0.000103047  |avg| = 0.0371122   to 4.65878e+15
    error max = 0.123609    avg =      nan  |avg| =      nan    to 1.16469e+16
    14 insn x^12/5:
    Domain from 1.42795e-19
    error max =      inf    avg =      nan  |avg| =      nan    to 9.35823e-15
    error max = 0.000936462 avg = 2.0202e-05    |avg| = 0.000133764 to 6.13301e-10
    error max = 0.000792752 avg = 1.45717e-05   |avg| = 0.000129936 to 4.01933e-05
    error max = 0.000791785 avg = 7.0132e-06    |avg| = 0.000129923 to  2.63411
    error max = 0.000787589 avg = 1.20745e-05   |avg| = 0.000129347 to   172629
    error max = 0.000786553 avg = 1.62351e-05   |avg| = 0.000132397 to 1.13134e+10
    error max = 0.000785586 avg = 8.25205e-06   |avg| = 0.00013037  to 6.98147e+12
    6 insn x^5/12:
    Domain from 9.86076e-32
    error max = 0.0284339   avg = 0.000441158   |avg| = 0.00967327  to 6.46235e-27
    error max = 0.0284342   avg = -5.79938e-06  |avg| = 0.00897913  to 4.23516e-22
    error max = 0.0284341   avg = -0.000140706  |avg| = 0.00897084  to 2.77556e-17
    error max = 0.028434    avg = 0.000440504   |avg| = 0.00967325  to 1.81899e-12
    error max = 0.0284339   avg = -6.11153e-06  |avg| = 0.00897915  to 1.19209e-07
    error max = 0.0284298   avg = -0.000140597  |avg| = 0.00897084  to 0.0078125
    error max = 0.0284371   avg = 0.000439748   |avg| = 0.00967319  to      512
    error max = 0.028437    avg = -7.74294e-06  |avg| = 0.00897924  to 3.35544e+07
    error max = 0.0284369   avg = -0.000142036  |avg| = 0.00897089  to 2.19902e+12
    error max = 0.0284368   avg = 0.000439183   |avg| = 0.0096732   to 1.44115e+17
    error max = 0.0284367   avg = -7.41244e-06  |avg| = 0.00897923  to 9.44473e+21
    error max = 0.0284366   avg = -0.000141706  |avg| = 0.00897088  to 6.1897e+26
    error max = 0.485129    avg = -0.0401671    |avg| = 0.048422    to 4.05648e+31
    error max = 0.994932    avg = -0.891494 |avg| = 0.891494    to 2.65846e+36
    error max = 0.999329    avg =      nan  |avg| =      nan    to       -0
    14 insn x^5/12:
    Domain from 2.64698e-23
    error max =  0.13556    avg = 0.00125936    |avg| = 0.00354677  to 1.73472e-18
    error max = 0.000564988 avg = 2.51458e-06   |avg| = 0.000113709 to 1.13687e-13
    error max = 0.000565065 avg = -1.49258e-06  |avg| = 0.000112553 to 7.45058e-09
    error max = 0.000565143 avg = 1.5293e-06    |avg| = 0.000112864 to 0.000488281
    error max = 0.000565298 avg = 2.76457e-06   |avg| = 0.000113713 to       32
    error max = 0.000565453 avg = -1.61276e-06  |avg| = 0.000112561 to 2.09715e+06
    error max = 0.000565531 avg = 1.42628e-06   |avg| = 0.000112866 to 1.37439e+11
    error max = 0.000565686 avg = 2.71505e-06   |avg| = 0.000113715 to 9.0072e+15
    error max = 0.000565763 avg = -1.56586e-06  |avg| = 0.000112415 to 1.84467e+19
    

    I suspect accuracy of the more accurate 5/12 is being limited by the rsqrt operation.

  • 2020-12-04 09:36

    Ian Stephenson wrote this code which he claims outperforms pow(). He describes the idea as follows:

    Pow is basically implemented using logs: pow(a,b) = x^(log_x(a) * b), so we need a fast log and a fast exponent. It doesn't matter what x is, so we use 2. The trick is that a floating point number is already in a log-style format:

    a = M * 2^E
    

    Taking the log of both sides gives:

    log2(a)=log2(M)+E
    

    or more simply:

    log2(a)~=E
    

    In other words if we take the floating point representation of a number, and extract the Exponent we've got something that's a good starting point as its log. It turns out that when we do this by massaging the bit patterns, the Mantissa ends up giving a good approximation to the error, and it works pretty well.

    This should be good enough for simple lighting calculations, but if you need something better, you can then extract the Mantissa, and use that to calculate a quadratic correction factor which is pretty accurate.
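
    The decomposition in the quote can be checked with the standard library rather than bit manipulation. This is only an illustrative sketch: the struct and function names are assumptions, and it is not Stephenson's code, which works on the bit pattern directly.

```cpp
#include <cmath>

struct Decomposed {
    double mantissa;  // M, in [1, 2)
    int exponent;     // E, so that a = M * 2^E
};

// Split a positive finite number into mantissa and exponent.
Decomposed decompose(double a) {
    int e;
    double m = std::frexp(a, &e);   // frexp returns m in [0.5, 1)
    return {m * 2.0, e - 1};        // rescale so that M lies in [1, 2)
}

// Exact identity: log2(a) = log2(M) + E.
double log2_from_parts(double a) {
    Decomposed d = decompose(a);
    return std::log2(d.mantissa) + d.exponent;
}

// Crude approximation from the quote: log2(a) ~= E.
int crude_log2(double a) {
    return decompose(a).exponent;
}
```

    For a = 10 this gives M = 1.25 and E = 3, so the crude estimate is 3 against a true log2 of about 3.32; the mantissa term log2(1.25) accounts for the difference.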

  • 2020-12-04 09:36

    Binomial series does account for a constant exponent, but you will be able to use it only if you can normalize all your input to the range [1,2). (Note that it computes (1+x)^a). You'll have to do some analysis to decide how many terms you need for your desired accuracy.

  • 2020-12-04 09:37

    So traditionally the powf(x, p) = x^p is solved by rewriting x as x=2^(log2(x)) making powf(x,p) = 2^(p*log2(x)), which transforms the problem into two approximations exp2() & log2(). This has the advantage of working with larger powers p, however the downside is that this is not the optimal solution for a constant power p and over a specified input bound 0 ≤ x ≤ 1.

    When the power p > 1, the answer is a trivial minimax polynomial over the bound 0 ≤ x ≤ 1, which is the case for p = 12/5 = 2.4 as can be seen below:

    float pow12_5(float x){
        float mp;
        // Minimax Horner polynomials for x^(12/5). Note: choose the accuracy required, then implement with fma() [fused multiply-add]
        // mp = 0x4.a84a38p-12 + x * (-0xd.e5648p-8 + x * (0xa.d82fep-4 + x * 0x6.062668p-4)); // 1.13705697e-3
        mp = 0x1.117542p-12 + x * (-0x5.91e6ap-8 + x * (0x8.0f50ep-4 + x * (0xa.aa231p-4 + x * (-0x2.62787p-4))));  // 2.6079002e-4
        // mp = 0x5.a522ap-16 + x * (-0x2.d997fcp-8 + x * (0x6.8f6d1p-4 + x * (0xf.21285p-4 + x * (-0x7.b5b248p-4 + x * 0x2.32b668p-4))));  // 8.61377e-5
        // mp = 0x2.4f5538p-16 + x * (-0x1.abcdecp-8 + x * (0x5.97464p-4 + x * (0x1.399edap0 + x * (-0x1.0d363ap0 + x * (0xa.a54a3p-4 + x * (-0x2.e8a77cp-4))))));  // 3.524655e-5
        return(mp);
    }
    

    However, when p < 1 the minimax approximation over the bound 0 ≤ x ≤ 1 does not appropriately converge to the desired accuracy. One option [not really] is to rewrite the problem as y = x^p = x^(p+m)/x^m, where m = 1, 2, 3, ... is a positive integer, so that the approximated power p + m exceeds 1; but this introduces a division, which is inherently slower.

    There's however another option which is to decompose the input x as its floating point exponent and mantissa form:

    x = mx* 2^(ex) where 1 ≤ mx < 2
    y = x^(5/12) = mx^(5/12) * 2^((5/12)*ex), let ey = floor(5*ex/12), k = (5*ex) % 12
      = mx^(5/12) * 2^(k/12) * 2^(ey)
    

    The minimax approximation of mx^(5/12) over 1 ≤ mx < 2 now converges much faster than before, without division, but requires 12 point LUT for the 2^(k/12). The code is below:

    #include <math.h>    // floorf
    #include <stdint.h>  // uint32_t
    float powk_12LUT[] = {0x1.0p0, 0x1.0f38fap0, 0x1.1f59acp0,  0x1.306fep0, 0x1.428a3p0, 0x1.55b81p0, 0x1.6a09e6p0, 0x1.7f910ep0, 0x1.965feap0, 0x1.ae89fap0, 0x1.c823ep0, 0x1.e3437ep0};
    float pow5_12(float x){
        union{float f; uint32_t u;} v, e2;
        float poff, m, e, ei;
        int xe;
    
        v.f = x;
        xe = ((v.u >> 23) - 127);
    
        if(xe <= -127) return(0.0f);    // exponent field 0: zero and denormals map to 0
    
        // Calculate remainder k in 2^(k/12) to find LUT
        e = xe * (5.0f/12.0f);
        ei = floorf(e);
        poff = powk_12LUT[(int)(12.0f * (e - ei))];
    
        e2.u = ((int)ei + 127) << 23;   // Calculate the exponent
        v.u = (v.u & ~(0xFFuL << 23)) | (0x7FuL << 23); // Normalize exponent to zero
    
        // Approximate mx^(5/12) on [1,2), with appropriate degree minimax
        // m = 0x8.87592p-4 + v.f * (0x8.8f056p-4 + v.f * (-0x1.134044p-4));    // 7.6125e-4
        // m = 0x7.582138p-4 + v.f * (0xb.1666bp-4 + v.f * (-0x2.d21954p-4 + v.f * 0x6.3ea0cp-8));  // 8.4522726e-5
        m = 0x6.9465cp-4 + v.f * (0xd.43015p-4 + v.f * (-0x5.17b2a8p-4 + v.f * (0x1.6cb1f8p-4 + v.f * (-0x2.c5b76p-8))));   // 1.04091259e-5
        // m = 0x6.08242p-4 + v.f * (0xf.352bdp-4 + v.f * (-0x7.d0c1bp-4 + v.f * (0x3.4d153p-4 + v.f * (-0xc.f7a42p-8 + v.f * 0x1.5d840cp-8))));    // 1.367401e-6
    
        return(m * poff * e2.f);
    }
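
    The exponent split can be verified independently of the polynomial and the float LUT. The sketch below is an assumption-laden check rather than a fast path: it uses std::pow and std::exp2 for the mantissa and 2^(k/12) factors, works in double, and the name pow5_12_split is made up for this example.

```cpp
#include <cmath>

// Reference for the identity
//   x^(5/12) = mx^(5/12) * 2^(k/12) * 2^ey,
// with x = mx * 2^ex, 1 <= mx < 2, ey = floor(5*ex/12), k = (5*ex) mod 12.
double pow5_12_split(double x) {
    int ex;
    double mx = std::frexp(x, &ex) * 2.0;  // frexp yields [0.5, 1); rescale to [1, 2)
    ex -= 1;

    int n = 5 * ex;
    int ey = n / 12;
    if (n % 12 < 0) --ey;                  // C++ division truncates; adjust to floor
    int k = n - 12 * ey;                   // remainder in 0..11, the LUT index

    return std::pow(mx, 5.0 / 12.0) * std::exp2(k / 12.0) * std::ldexp(1.0, ey);
}
```

    Computing ey and k with integer arithmetic, as here, sidesteps the float rounding that the (int)(12.0f * (e - ei)) index in the code above depends on.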
    
  • 2020-12-04 09:42

    This might not answer your question.

    The 2.4f and 1/2.4f make me very suspicious, because those are exactly the powers used to convert between sRGB and a linear RGB color space. So you might actually be trying to optimize that, specifically. I don't know, which is why this might not answer your question.

    If this is the case, try using a lookup table. Something like:

    __attribute__((aligned(64)))
    static const unsigned short SRGB_TO_LINEAR[256] = { ... };
    __attribute__((aligned(64)))
    static const unsigned short LINEAR_TO_SRGB[256] = { ... };
    
    void apply_lut(const unsigned short lut[256], unsigned char *src, ...
    

    If you are using 16-bit data, change as appropriate. I would make the table 16 bits anyway so you can dither the result if necessary when working with 8-bit data. This obviously won't work very well if your data is floating point to begin with -- but it doesn't really make sense to store sRGB data in floating point, so you might as well convert to 16-bit / 8-bit first and then do the conversion from linear to sRGB.

    (The reason sRGB doesn't make sense as floating point is that HDR should be linear, and sRGB is only convenient for storing on disk or displaying on screen, but not convenient for manipulation.)
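
    One way to fill such a table, sketched under assumptions: the answer elides both the table contents and the apply_lut signature, so the function names, the 16-bit full-scale value of 65535 and the 8-bit-in / 16-bit-out layout below are choices made for this example only.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Build a 256-entry sRGB -> linear table with 16-bit outputs (hypothetical helper).
std::array<std::uint16_t, 256> make_srgb_to_linear_lut() {
    std::array<std::uint16_t, 256> lut{};
    for (int i = 0; i < 256; ++i) {
        double s = i / 255.0;
        double lin = (s <= 0.04045) ? s / 12.92
                                    : std::pow((s + 0.055) / 1.055, 2.4);
        lut[i] = static_cast<std::uint16_t>(std::lround(lin * 65535.0));
    }
    return lut;
}

// Convert n 8-bit sRGB samples to 16-bit linear samples (hypothetical signature).
void apply_lut_u8_to_u16(const std::array<std::uint16_t, 256>& lut,
                         const unsigned char* src, std::uint16_t* dst,
                         std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = lut[src[i]];  // one table read per sample, no pow() at run time
    }
}
```

    The pow() calls happen once at table construction, so the per-pixel cost is a single indexed load. The reverse direction needs a larger table or interpolation if the linear data has more than 8 bits.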
