Integer cube root

一生所求 2021-02-05 07:18

I'm looking for fast code for 64-bit (unsigned) cube roots. (I'm using C and compiling with gcc, but I imagine most of the work required will be language- and compiler-agnost

6 answers
  • 2021-02-05 07:53
    // On my pc: Math.Sqrt 35 ns, cbrt64 <70ns, cbrt32 <25 ns, (cbrt12 < 10ns)
    
    // cbrt64(ulong x) is a C# version of:
    // http://www.hackersdelight.org/hdcodetxt/acbrt.c.txt     (acbrt1)
    
    // cbrt32(uint x) is a C# version of:
    // http://www.hackersdelight.org/hdcodetxt/icbrt.c.txt     (icbrt1)
    
    // Union in C#:
    // http://www.hanselman.com/blog/UnionsOrAnEquivalentInCSairamasTipOfTheDay.aspx
    
    using System.Runtime.InteropServices;  
    [StructLayout(LayoutKind.Explicit)]  
    public struct fu_32   // float <==> uint
    {
        [FieldOffset(0)]
        public float f;
        [FieldOffset(0)]
        public uint u;
    }
    
    private static uint cbrt64(ulong x)
    {
        if (x >= 18446724184312856125) return 2642245;
        float fx = (float)x;
        fu_32 fu32 = new fu_32();
        fu32.f = fx;
        uint uy = fu32.u / 4;
        uy += uy / 4;
        uy += uy / 16;
        uy += uy / 256;
        uy += 0x2a5137a0;
        fu32.u = uy;
        float fy = fu32.f;
        fy = 0.33333333f * (fx / (fy * fy) + 2.0f * fy);
        int y0 = (int)                                      
            (0.33333333f * (fx / (fy * fy) + 2.0f * fy));    
        uint y1 = (uint)y0;                                 
    
        ulong y2, y3;
        if (y1 >= 2642245)
        {
            y1 = 2642245;
            y2 = 6981458640025;
            y3 = 18446724184312856125;
        }
        else
        {
            y2 = (ulong)y1 * y1;
            y3 = y2 * y1;
        }
        if (y3 > x)
        {
            y1 -= 1;
            y2 -= 2 * y1 + 1;
            y3 -= 3 * y2 + 3 * y1 + 1;
            while (y3 > x)
            {
                y1 -= 1;
                y2 -= 2 * y1 + 1;
                y3 -= 3 * y2 + 3 * y1 + 1;
            }
            return y1;
        }
        do
        {
            y3 += 3 * y2 + 3 * y1 + 1;
            y2 += 2 * y1 + 1;
            y1 += 1;
        }
        while (y3 <= x);
        return y1 - 1;
    }
    
    private static uint cbrt32(uint x)
    {
        uint y = 0, z = 0, b = 0;
        int s = x < 1u << 24 ? x < 1u << 12 ? x < 1u << 06 ? x < 1u << 03 ? 00 : 03 :
                                                             x < 1u << 09 ? 06 : 09 :
                                              x < 1u << 18 ? x < 1u << 15 ? 12 : 15 :
                                                             x < 1u << 21 ? 18 : 21 :
                               x >= 1u << 30 ? 30 : x < 1u << 27 ? 24 : 27;
        do
        {
            y *= 2;
            z *= 4;
            b = 3 * y + 3 * z + 1 << s;
            if (x >= b)
            {
                x -= b;
                z += 2 * y + 1;
                y += 1;
            }
            s -= 3;
        }
        while (s >= 0);
        return y;
    }
    
    private static uint cbrt12(uint x) // x < ~255
    {
        uint y = 0, a = 0, b = 1, c = 0;
        while (a < x)
        {
            y++;
            b += c;
            a += b;
            c += 6;
        }
        if (a != x) y--;
        return y;
    } 
    
  • 2021-02-05 07:55

    You could try a Newton's step to fix your rounding errors:

    ulong r = (ulong)pow(n, 1.0/3);
    if(r==0) return r; /* avoid divide by 0 later on */
    ulong r3 = r*r*r;
    ulong slope = 3*r*r;
    
    ulong r1 = r+1;
    ulong r13 = r1*r1*r1;
    
    /* making sure to handle unsigned arithmetic correctly */
    if(n >= r13) r+= (n - r3)/slope;
    if(n < r3)   r-= (r3 - n)/slope;
    

    A single Newton step ought to be enough, but you may have off-by-one (or possibly more?) errors. You can check/fix those using a final check-and-increment step, as in your original question:

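    while(r*r*r > n) --r;
    /* 2642245 is the largest r whose cube fits in 64 bits; the guard keeps (r+1)^3 from wrapping */
    while(r < 2642245 && (r+1)*(r+1)*(r+1) <= n) ++r;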
    

    or some such.

    (I admit I'm lazy; the right way to do it is to carefully check to determine which (if any) of the check&increment things is actually necessary...)
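The check-and-increment step above can be isolated into a helper that accepts any estimate and corrects it. This is a minimal sketch, not the answer author's code: the name `icbrt_fixup` and the constant `CBRT_U64_MAX` are made up for illustration, and the clamp is an added assumption that keeps every cube inside 64 bits.

```c
#include <stdint.h>

/* Largest r such that r*r*r still fits in an unsigned 64-bit integer. */
#define CBRT_U64_MAX 2642245u

/* Turn an estimate r of cbrt(n) into exactly floor(cbrt(n)).
   (icbrt_fixup is a hypothetical name used only in this sketch.) */
static uint32_t icbrt_fixup(uint64_t n, uint64_t r)
{
    if (r > CBRT_U64_MAX)
        r = CBRT_U64_MAX;            /* clamp so r*r*r cannot wrap */

    while (r * r * r > n)            /* estimate was too high */
        --r;

    /* estimate was too low; the bound keeps (r+1)^3 from wrapping */
    while (r < CBRT_U64_MAX && (r + 1) * (r + 1) * (r + 1) <= n)
        ++r;

    return (uint32_t)r;
}
```

A caller would pass the floating-point estimate, for example `icbrt_fixup(n, (uint64_t)pow(n, 1.0/3))`. Each loop runs only as many times as the estimate is off, so a good estimate costs a few multiplications.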

  • 2021-02-05 07:57

    The book "Hacker's Delight" has algorithms for this and many other problems. The code is online here. EDIT: That code doesn't work properly with 64-bit ints, and the instructions in the book on how to fix it for 64-bit are somewhat confusing. A proper 64-bit implementation (including test case) is online here.

    I doubt that your squareroot function works "correctly" - it should be ulong a for the argument, not n :) (but the same approach would work using cbrt instead of sqrt, although not all C math libraries have cube root functions).

  • 2021-02-05 07:57

    I've adapted the algorithm presented in section 1.5.2 (the kth root) of Modern Computer Arithmetic (Brent and Zimmermann). For the case k == 3, and given a 'relatively' accurate over-estimate as the initial guess, this algorithm seems to out-perform the 'Hacker's Delight' code above.

    Not only that, but MCA as a text provides theoretical background as well as a proof of correctness and terminating criteria.

    Provided that we can produce a 'relatively' good initial over-estimate, I haven't been able to find a case that exceeds 7 iterations. (Is this effectively related to 64-bit values having 2^6 bits?) Either way, it's an improvement over the 21 iterations of the HacDel code, which converges only linearly, O(b), even though its loop body is evidently much faster.

    The initial estimate I've used is based on a 'rounding up' of the number of significant bits in the value (x). Given (b) significant bits in (x), we can say: 2^(b - 1) <= x < 2^b. I state without proof (though it should be relatively easy to demonstrate) that: 2^ceil(b / 3) > x^(1/3)
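The bound stated above without proof follows in one line from the definition of b. A short derivation, using the same symbols x and b:

```latex
x < 2^{b} \le 2^{3\lceil b/3 \rceil}
\quad\Longrightarrow\quad
x^{1/3} < 2^{\lceil b/3 \rceil}
```

The first inequality is the definition of b as the bit length of x, the second holds because 3·ceil(b/3) >= b, and taking cube roots preserves the strict inequality. So a starting value of 2^ceil(b/3) is always a strict over-estimate of the root.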


    #include <stdint.h>

    static inline uint32_t u64_cbrt (uint64_t x)
    {
        uint64_t r0 = 1, r1;
    
        /* IEEE-754 cbrt *may* not be exact. */
    
        if (x == 0) /* cbrt(0) : */
            return (0);
    
        int b = (64) - __builtin_clzll(x);
        r0 <<= (b + 2) / 3; /* ceil(b / 3) */
    
        do /* quadratic convergence: */
        {
            r1 = r0;
            r0 = (2 * r1 + x / (r1 * r1)) / 3;
        }
        while (r0 < r1);
    
        return ((uint32_t) r1); /* floor(cbrt(x)); */
    }
    

    A cbrt call probably isn't all that useful - unlike the sqrt call which can be efficiently implemented on modern hardware. That said, I've seen gains for sets of values under 2^53 (exactly represented in IEEE-754 doubles), which surprised me.

    The only downside is the division by: (r * r) - this can be slow, as the latency of integer division continues to fall behind other advances in ALUs. The division by a constant: (3) is handled by reciprocal methods on any modern optimising compiler.

    It's interesting that Intel's 'Ice Lake' microarchitecture will significantly improve integer division - an operation that seems to have been neglected for a long time. I simply won't trust the 'Hacker's Delight' answer until I can find a sound theoretical basis for it. And then I have to work out which variant is the 'correct' answer.
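Short of a proof, one way to build confidence in any candidate routine is to test it at the places where an integer cube root is most likely to be off by one: k^3 and k^3 - 1. The sketch below is an illustration under assumptions, not something from the answer: `check_cbrt`, `cbrt_ref` and `CBRT_U64_MAX` are hypothetical names, and `cbrt_ref` is just a slow binary search used as a sample candidate.

```c
#include <stdint.h>

/* Largest r such that r*r*r still fits in an unsigned 64-bit integer. */
#define CBRT_U64_MAX 2642245u

/* Sample candidate: binary search for the largest r with r^3 <= x. */
static uint32_t cbrt_ref(uint64_t x)
{
    uint32_t lo = 0, hi = CBRT_U64_MAX;
    while (lo < hi) {
        uint32_t mid = lo + (hi - lo + 1) / 2;   /* round up so lo advances */
        if ((uint64_t)mid * mid * mid <= x)
            lo = mid;
        else
            hi = mid - 1;
    }
    return lo;
}

/* Count mismatches of f at k^3 and k^3 - 1 for k = 1, 1 + step, ...
   plus the two extremes 0 and UINT64_MAX. Returns 0 if f passed. */
static unsigned check_cbrt(uint32_t (*f)(uint64_t), uint32_t step)
{
    unsigned bad = 0;
    for (uint64_t k = 1; k <= CBRT_U64_MAX; k += step) {
        uint64_t cube = k * k * k;               /* fits: k <= CBRT_U64_MAX */
        bad += f(cube) != (uint32_t)k;
        bad += f(cube - 1) != (uint32_t)(k - 1);
    }
    bad += f(0) != 0;
    bad += f(UINT64_MAX) != CBRT_U64_MAX;
    return bad;
}
```

Passing `step = 1` checks every cube boundary (about 2.6 million values of k); a larger step gives a quicker spot check. Any function with the signature `uint32_t f(uint64_t)` can be passed in place of `cbrt_ref`.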

  • 2021-02-05 07:57

    I would research how to do it by hand, and then translate that into a computer algorithm, working in base 2 rather than base 10.

    We end up with an algorithm something like (pseudocode):

    Find the largest n such that (1 << 3n) <= input (assuming input >= 1).
    result = 1 << n.
    For i in (n-1)..0:
        if ((result | 1 << i)**3) <= input:
            result |= 1 << i.
    

    We can optimize the calculation of (result | 1 << i)**3 by observing that the bitwise-or is equivalent to addition here (bit i of result is still clear). Writing d = 1 << i, the cube expands to result**3 + 3 * d * result**2 + 3 * d**2 * result + d**3; we can cache the values of result**3 and result**2 between iterations, and since d is a power of two every multiplication by d becomes a shift.
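The bit-at-a-time method can be sketched in C as follows. This is one possible realization, not the answer author's code: instead of caching result**3 it keeps the remainder input - result**3 in `x`, which is the form the Hacker's Delight routine mentioned elsewhere in this thread also takes. The name `icbrt_bitwise` is hypothetical, and comparing against the shifted input is an added precaution against 64-bit overflow.

```c
#include <stdint.h>

/* floor(cbrt(x)), producing one bit of the root per iteration.
   (icbrt_bitwise is a hypothetical name used only in this sketch.) */
static uint32_t icbrt_bitwise(uint64_t x)
{
    uint32_t y = 0;                      /* root bits found so far */

    for (int s = 63; s >= 0; s -= 3) {   /* s = 3 * (index of the bit being tried) */
        y <<= 1;                         /* make room for the next bit */

        /* (y+1)^3 - y^3 = 3*y*(y+1) + 1: what setting the new bit adds
           to the cube, in units of 2^s */
        uint64_t b = 3 * (uint64_t)y * (y + 1) + 1;

        /* x >= b * 2^s, tested on the shifted input because b << s
           itself could overflow 64 bits */
        if ((x >> s) >= b) {
            x -= b << s;                 /* safe: b << s <= x here */
            y |= 1;
        }
    }
    return y;
}
```

The loop always runs 22 times for a 64-bit input and uses no division, which is the trade-off against Newton-style methods that need fewer but more expensive iterations.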

  • 2021-02-05 08:04

    If pow is too expensive, you can use a count-leading-zeros instruction to get an approximation to the result, then use a lookup table, then some Newton steps to finish it.

    int k = __builtin_clzll(n); // counts # of leading zeros in a 64-bit n (often a single assembly insn); undefined for n == 0
    int b = 64 - k;             // # of bits in n
    int top8 = b >= 8 ? n >> (b - 8) : n << (8 - b);  // top 8 bits of n (top bit is always 1)
    int approx = table[b][top8 & 0x7f];
    

    Given b and top8, you can use a lookup table (in my code, 8K entries) to find a good approximation to cuberoot(n). Use some Newton steps (see comingstorm's answer) to finish it.
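The answer does not show how its table is built, so the following is only a sketch under stated assumptions. Each entry holds the cube root of the largest value in its bucket, which makes it an over-estimate that the Newton loop can refine from above. The names `cbrt_table`, `cbrt_table_init`, `cbrt_slow` and `cbrt_lookup` are hypothetical, the table is filled at startup by a slow binary search rather than precomputed, and `__builtin_clzll` assumes GCC or Clang.

```c
#include <stdint.h>

/* Largest r such that r*r*r still fits in an unsigned 64-bit integer. */
#define CBRT_U64_MAX 2642245u

/* Hypothetical layout: [bit length of n][7 bits below the leading 1].
   Row 0 is unused; 64 * 128 = 8K entries are live. */
static uint32_t cbrt_table[65][128];

/* Slow reference used only to fill the table. */
static uint32_t cbrt_slow(uint64_t x)
{
    uint32_t lo = 0, hi = CBRT_U64_MAX;
    while (lo < hi) {
        uint32_t mid = lo + (hi - lo + 1) / 2;
        if ((uint64_t)mid * mid * mid <= x)
            lo = mid;
        else
            hi = mid - 1;
    }
    return lo;
}

/* Store floor(cbrt(largest n in the bucket)), so an entry never
   underestimates floor(cbrt(n)) for any n that maps to it. */
static void cbrt_table_init(void)
{
    for (int b = 1; b <= 64; ++b) {
        for (unsigned t = 0; t < 128; ++t) {
            uint64_t top8 = 0x80u | t;
            uint64_t largest = b >= 8
                ? (top8 << (b - 8)) | ((UINT64_C(1) << (b - 8)) - 1)
                : top8 >> (8 - b);
            cbrt_table[b][t] = cbrt_slow(largest);
        }
    }
}

static uint32_t cbrt_lookup(uint64_t n)
{
    if (n == 0)
        return 0;                               /* clz is undefined for 0 */

    int b = 64 - __builtin_clzll(n);            /* bit length, 1..64 */
    unsigned top8 = (unsigned)(b >= 8 ? n >> (b - 8) : n << (8 - b));
    uint64_t r0 = cbrt_table[b][top8 & 0x7f], r1;

    do {                                        /* Newton steps from above */
        r1 = r0;
        r0 = (2 * r1 + n / (r1 * r1)) / 3;
    } while (r0 < r1);

    return (uint32_t)r1;                        /* floor(cbrt(n)) */
}
```

`cbrt_table_init()` must run once before the first `cbrt_lookup()` call. Because the starting value is already close to the root, the loop should exit after very few divisions, which is the point of spending memory on the table.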
