Integer cube root

后端未结

关注

 6  1827

一生所求 2021-02-05 07:18

I\'m looking for fast code for 64-bit (unsigned) cube roots. (I\'m using C and compiling with gcc, but I imagine most of the work required will be language- and compiler-agnost

6条回答

温柔的废话 (楼主)

2021-02-05 07:57
I've adapted the algorithm presented in 1.5.2 (the kth root) in Modern Computer Arithmetic (Brent and Zimmerman). For the case of (k == 3), and given a 'relatively' accurate over-estimate of the initial guess - this algorithm seems to out-perform the 'Hacker's Delight' code above.

Not only that, but MCA as a text provides theoretical background as well as a proof of correctness and terminating criteria.

Provided that we can produce a 'relatively' good initial over-estimate, I haven't been able to find a case that exceeds (7) iterations. (Is this effectively related to 64-bit values having 2^6 bits?) Either way, it's an improvement over the (21) iterations in the HacDel code - with linear O(b) convergence, despite having a loop body that is evidently much faster.

The initial estimate I've used is based on a 'rounding up' of the number of significant bits in the value (x). Given (b) significant bits in (x), we can say: 2^(b - 1) <= x < 2^b. I state without proof (though it should be relatively easy to demonstrate) that: 2^ceil(b / 3) > x^(1/3)
```
static inline uint32_t u64_cbrt (uint64_t x)
{
    uint64_t r0 = 1, r1;

    /* IEEE-754 cbrt *may* not be exact. */

    if (x == 0) /* cbrt(0) : */
        return (0);

    int b = (64) - __builtin_clzll(x);
    r0 <<= (b + 2) / 3; /* ceil(b / 3) */

    do /* quadratic convergence: */
    {
        r1 = r0;
        r0 = (2 * r1 + x / (r1 * r1)) / 3;
    }
    while (r0 < r1);

    return ((uint32_t) r1); /* floor(cbrt(x)); */
}
```
A crbt call probably isn't all that useful - unlike the sqrt call which can be efficiently implemented on modern hardware. That said, I've seen gains for sets of values under 2^53 (exactly represented in IEEE-754 doubles), which surprised me.

The only downside is the division by: (r * r) - this can be slow, as the latency of integer division continues to fall behind other advances in ALUs. The division by a constant: (3) is handled by reciprocal methods on any modern optimising compiler.

It's interesting that Intel's 'Icelake' microarchitecture will significantly improve integer division - an operation that seems to have been neglected for a long time. I simply won't trust the 'Hacker's Delight' answer until I can find a sound theoretical basis for it. And then I have to work out which variant is the 'correct' answer.
0 讨论(0)

查看其它6个回答
发布评论:

提交评论
- 加载中...