Integer cube root

后端 未结 6 1827
一生所求
一生所求 2021-02-05 07:18

I\'m looking for fast code for 64-bit (unsigned) cube roots. (I\'m using C and compiling with gcc, but I imagine most of the work required will be language- and compiler-agnost

6条回答
  •  温柔的废话
    2021-02-05 07:57

    I've adapted the algorithm presented in 1.5.2 (the kth root) in Modern Computer Arithmetic (Brent and Zimmerman). For the case of (k == 3), and given a 'relatively' accurate over-estimate of the initial guess - this algorithm seems to out-perform the 'Hacker's Delight' code above.

    Not only that, but MCA as a text provides theoretical background as well as a proof of correctness and terminating criteria.

    Provided that we can produce a 'relatively' good initial over-estimate, I haven't been able to find a case that exceeds (7) iterations. (Is this effectively related to 64-bit values having 2^6 bits?) Either way, it's an improvement over the (21) iterations in the HacDel code - with linear O(b) convergence, despite having a loop body that is evidently much faster.

    The initial estimate I've used is based on a 'rounding up' of the number of significant bits in the value (x). Given (b) significant bits in (x), we can say: 2^(b - 1) <= x < 2^b. I state without proof (though it should be relatively easy to demonstrate) that: 2^ceil(b / 3) > x^(1/3)


    static inline uint32_t u64_cbrt (uint64_t x)
    {
        uint64_t r0 = 1, r1;
    
        /* IEEE-754 cbrt *may* not be exact. */
    
        if (x == 0) /* cbrt(0) : */
            return (0);
    
        int b = (64) - __builtin_clzll(x);
        r0 <<= (b + 2) / 3; /* ceil(b / 3) */
    
        do /* quadratic convergence: */
        {
            r1 = r0;
            r0 = (2 * r1 + x / (r1 * r1)) / 3;
        }
        while (r0 < r1);
    
        return ((uint32_t) r1); /* floor(cbrt(x)); */
    }
    

    A crbt call probably isn't all that useful - unlike the sqrt call which can be efficiently implemented on modern hardware. That said, I've seen gains for sets of values under 2^53 (exactly represented in IEEE-754 doubles), which surprised me.

    The only downside is the division by: (r * r) - this can be slow, as the latency of integer division continues to fall behind other advances in ALUs. The division by a constant: (3) is handled by reciprocal methods on any modern optimising compiler.

    It's interesting that Intel's 'Icelake' microarchitecture will significantly improve integer division - an operation that seems to have been neglected for a long time. I simply won't trust the 'Hacker's Delight' answer until I can find a sound theoretical basis for it. And then I have to work out which variant is the 'correct' answer.

提交回复
热议问题