Picking good first estimates for Goldschmidt division

一生所求
一生所求 2021-02-07 08:06

I'm calculating fixed-point reciprocals in Q22.10 with Goldschmidt division for use in my software rasterizer on ARM.

This is done by just setting the numerator to 1, i.

3 answers
  • 2021-02-07 08:25

    A couple of ideas for you, though none that solve your problem directly as stated.

    1. Why this algo for division? Most divides I've seen on ARM use some variant of
      
            adcs hi, den, hi, lsl #1
            subcc hi, hi, den
            adcs lo, lo, lo
      

    repeated n times (once per quotient bit), with a binary search off of the clz to determine where to start. That's pretty dang fast.

    2. If precision is a big problem, you are not limited to 32/64 bits for your fixed-point representation. It'll be a bit slower, but you can use add/adc or sub/sbc to carry values across registers. mul/mla are also designed for this kind of work.

    Again, not direct answers for you, but possibly a few ideas to go forward with. Seeing the actual ARM code would probably help me a bit as well.
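
The shift-and-subtract scheme from point 1 can be sketched in C roughly as follows. This is not the ARM routine quoted above, only a plain restoring division that uses the difference of the two clz values to pick the starting bit directly instead of a binary search. `udiv_shift_sub` is a hypothetical name, and `__builtin_clz` assumes GCC or Clang.

```c
#include <stdint.h>

// Unsigned 32-bit restoring (shift-and-subtract) division.
// Hypothetical helper for illustration; returns UINT32_MAX on division by zero.
static uint32_t udiv_shift_sub(uint32_t num, uint32_t den)
{
    if (den == 0) return UINT32_MAX;   // invalid input
    if (num < den) return 0;           // also keeps clz away from num == 0

    // Align the top bit of the divisor with the top bit of the dividend.
    // num >= den > 0 here, so the shift is never negative.
    int shift = __builtin_clz(den) - __builtin_clz(num);

    uint32_t quot = 0;
    for (int i = shift; i >= 0; i--) {
        uint32_t d = den << i;         // divisor aligned to quotient bit i
        quot <<= 1;
        if (num >= d) {                // trial subtraction succeeds
            num -= d;
            quot |= 1;
        }
    }
    return quot;
}
```

For example, `udiv_shift_sub(100, 7)` walks five quotient bits (shift = 4) and returns 14.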

  • 2021-02-07 08:27

    I could not resist spending an hour on your problem...

    This algorithm is described in section 5.5.2 of "Arithmétique des ordinateurs" by Jean-Michel Muller (in French). It is actually a special case of Newton iterations with 1 as the starting point. The book gives a simple formulation of the algorithm to compute N/D, with D normalized in the range [1/2, 1):

    e = 1 - D
    Q = N
    repeat K times:
      Q = Q * (1+e)
      e = e*e
    

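    The number of correct bits doubles at each iteration: after K iterations the relative error is e^(2^K), where e = 1 - D is at most 1/2. The code below uses 4 iterations for 32-bit values; note that in the worst case (D = 1/2) this guarantees only about 16 correct bits, and a fifth iteration is needed for a full 32. You can also iterate until e becomes too small to modify Q.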
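
To watch the convergence without any fixed-point details, the pseudocode can be transcribed into double-precision C. This is only a sketch: `goldschmidt_div` is a hypothetical name, and the caller is assumed to pass D already normalized into [1/2, 1).

```c
// Floating-point transcription of the N/D pseudocode.
// Assumes 0.5 <= d < 1.0; k is the number of iterations.
static double goldschmidt_div(double n, double d, int k)
{
    double e = 1.0 - d;   // initial error term, 0 < e <= 0.5
    double q = n;
    for (int i = 0; i < k; i++) {
        q = q * (1.0 + e);   // multiply in the next correction factor
        e = e * e;           // error term squares each round
    }
    return q;
}
```

With the worst-case input D = 0.5, `goldschmidt_div(1.0, 0.5, 4)` gives 2 - 2^-15 (about 16 correct bits) and `goldschmidt_div(1.0, 0.5, 5)` gives 2 - 2^-31; inputs closer to 1 converge sooner.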

    Normalization is used because it provides the max number of significant bits in the result. It is also easier to compute the error and number of iterations needed when the inputs are in a known range.

    Once your input value is normalized, you don't need to bother with the value of BASE until you have the inverse. You simply have a 32-bit number X normalized in range 0x80000000 to 0xFFFFFFFF, and compute an approximation of Y=2^64/X (Y is at most 2^33).
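
The normalization step itself can be done with a count-leading-zeros operation rather than a loop. A minimal sketch, assuming GCC or Clang (`__builtin_clz` is their builtin and typically compiles to the CLZ instruction on ARM cores that have one); `normalize_u32` is a hypothetical helper name.

```c
#include <stdint.h>

// Shift x left until bit 31 is set, so the result lies in
// 0x80000000..0xFFFFFFFF. The shift count is stored in *shl.
// x must be nonzero: __builtin_clz(0) is undefined.
static uint32_t normalize_u32(uint32_t x, int *shl)
{
    *shl = __builtin_clz(x);
    return x << *shl;
}
```

The shift count has to be kept, because it is needed again when the result is scaled back to the original fixed-point format.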

    This simplified algorithm may be implemented for your Q22.10 representation as follows:

    // Fixed point inversion
    // EB Apr 2010
    
    #include <math.h>
    #include <stdio.h>
    
    // Number X is represented by integer I: X = I/2^BASE.
    // We have (32-BASE) bits in integral part, and BASE bits in fractional part
    #define BASE 22
    typedef unsigned int uint32;
    typedef unsigned long long int uint64;
    
    // Convert FP to/from double (debug)
    double toDouble(uint32 fp) { return fp/(double)(1<<BASE); }
    uint32 toFP(double x) { return (int)floor(0.5+x*(1<<BASE)); }
    
    // Return inverse of FP
    uint32 inverse(uint32 fp)
    {
      if (fp == 0) return (uint32)-1; // invalid
    
      // Shift FP to have the most significant bit set
      int shl = 0; // normalization shift
      uint32 nfp = fp; // normalized FP
      while ( (nfp & 0x80000000) == 0 ) { nfp <<= 1; shl++; } // use "clz" instead
    
      uint64 q = 0x100000000ULL; // 2^32
      uint64 e = 0x100000000ULL - (uint64)nfp; // 2^32-NFP
      int i;
      for (i=0;i<4;i++) // iterate
        {
          // Both multiplications are actually
          // 32x32 bits truncated to the 32 high bits
          q += (q*e)>>(uint64)32;
          e = (e*e)>>(uint64)32;
          printf("Q=0x%llx E=0x%llx\n",q,e);
        }
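      // Here, (Q/2^32) is the inverse of (NFP/2^32).
      // We have 2^31<=NFP<2^32 and 2^32<Q<=2^33
      int shr = 64-2*BASE-shl; // denormalization shift for Q
      if (shr <= 0) return (uint32)-1; // result not representable
      return (uint32)(q>>shr);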
    }
    
    int main()
    {
      double x = 1.234567;
      uint32 xx = toFP(x);
      uint32 yy = inverse(xx);
      double y = toDouble(yy);
    
      printf("X=%f Y=%f X*Y=%f\n",x,y,x*y);
      printf("XX=0x%08x YY=0x%08x XX*YY=0x%016llx\n",xx,yy,(uint64)xx*(uint64)yy);
    }
    

    As noted in the code, the multiplications do not need the full 32x32->64-bit product: only the high 32 bits of each product are kept. E fits in 32 bits initially and only gets smaller. Q always occupies up to 34 bits.

    The derivation of 64-2*BASE-shl is left as an exercise for the reader :-). If it becomes 0 or negative, the result is not representable (the input value is too small).
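
For readers who would rather skip the exercise, one way to arrive at the shift is sketched below, using the names from the code: X is the real value, fp its integer representation, nfp the normalized integer, and Q the iterated result.

```latex
\begin{aligned}
X &= \frac{\mathit{fp}}{2^{\mathrm{BASE}}}, \qquad
\mathit{nfp} = \mathit{fp}\cdot 2^{\mathit{shl}}, \qquad
Q \approx \frac{2^{64}}{\mathit{nfp}} \\[4pt]
\frac{1}{X}\cdot 2^{\mathrm{BASE}}
  &= \frac{2^{2\,\mathrm{BASE}}}{\mathit{fp}}
   = \frac{2^{2\,\mathrm{BASE}+\mathit{shl}}}{\mathit{nfp}}
   \approx Q \cdot 2^{-(64 - 2\,\mathrm{BASE} - \mathit{shl})}
\end{aligned}
```

The left-hand side of the second line is the integer that represents 1/X in the same fixed-point format, so it is obtained from Q by a right shift of 64 - 2*BASE - shl bits.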

    EDIT. As a follow-up to my comment, here is a second version in which the 2^32 bit of Q is implicit. Both E and Q now fit in 32 bits:

    uint32 inverse2(uint32 fp)
    {
      if (fp == 0) return (uint32)-1; // invalid
    
      // Shift FP to have the most significant bit set
      int shl = 0; // normalization shift for FP
      uint32 nfp = fp; // normalized FP
      while ( (nfp & 0x80000000) == 0 ) { nfp <<= 1; shl++; } // use "clz" instead
      int shr = 64-2*BASE-shl; // normalization shift for Q
      if (shr <= 0) return (uint32)-1; // overflow
    
      uint64 e = 1 + (0xFFFFFFFF ^ nfp); // 2^32-NFP, max value is 2^31
      uint64 q = e; // 2^32 implicit bit, and implicit first iteration
      int i;
      for (i=0;i<3;i++) // iterate
        {
          e = (e*e)>>(uint64)32;
          q += e + ((q*e)>>(uint64)32);
        }
      return (uint32)(q>>shr) + ((uint32)1<<(32-shr)); // insert implicit bit (unsigned shift, so shr == 1 is well defined)
    }
    
  • 2021-02-07 08:37

    Mads, you are not losing any precision at all. When you divide 512.00002f by 2^10, you merely decrease the exponent of your floating-point number by 10; the mantissa remains the same. The exception is when the exponent hits its minimum value, but that shouldn't happen since you're scaling to (0.5, 1].
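
This can be checked by looking at the raw bits of the float before and after the division. The sketch below assumes IEEE 754 single precision and a normal (not denormal) result; `float_bits` and `scaling_keeps_mantissa` are hypothetical helper names.

```c
#include <stdint.h>
#include <string.h>

// Reinterpret a float as its 32-bit pattern (assumes IEEE 754 binary32).
static uint32_t float_bits(float x)
{
    uint32_t u;
    memcpy(&u, &x, sizeof u);
    return u;
}

// Returns 1 if dividing x by 2^10 leaves the 23 mantissa bits untouched
// and lowers the biased exponent field by exactly 10.
static int scaling_keeps_mantissa(float x)
{
    uint32_t a = float_bits(x);
    uint32_t b = float_bits(x / 1024.0f);
    uint32_t mant_a = a & 0x007FFFFFu;
    uint32_t mant_b = b & 0x007FFFFFu;
    int exp_a = (int)((a >> 23) & 0xFFu);
    int exp_b = (int)((b >> 23) & 0xFFu);
    return mant_a == mant_b && exp_a - exp_b == 10;
}
```

Calling `scaling_keeps_mantissa(512.00002f)` returns 1 under those assumptions.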

    EDIT: OK, so you're using fixed point. In that case you should allow a different representation of the denominator in your algorithm. The value of D is in (0.5, 1] not only at the beginning but throughout the whole calculation (it's easy to prove that x * (2-x) < 1 for x < 1). So you should represent the denominator with the binary point at base = 32. This way you will have 32 bits of precision all the time.
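
The bound quoted in parentheses follows from completing the square, and the same identity gives the lower end of the range. A short worked version:

```latex
x(2 - x) = 1 - (1 - x)^2 < 1 \quad \text{for } x \neq 1,
\qquad
x(2 - x) \ge \tfrac{3}{4} \quad \text{for } \tfrac{1}{2} \le x \le 1 .
```

So once D has been scaled into (0.5, 1], every update D <- D * (2 - D) keeps it in that interval, which is why a fixed position of the binary point can be used for D for the whole calculation.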

    EDIT: To implement this you'll have to change the following lines of your code:

      //bitpos = 31 - clz(val) - BASE;
      bitpos = 31 - clz(val) - 31;
    ...
      //F = (2ULL<<BASE) - D;
      //N = F;
      //D = ((unsigned long long)D*F)>>BASE;
      F = -D;
      N = F >> (31 - BASE);
      D = ((unsigned long long)D*F)>>31;
    ...
        //F = (2<<(BASE)) - D;
        //D = ((unsigned long long)D*F)>>BASE;
        F = -D;
        D = ((unsigned long long)D*F)>>31;
    ...
        //N = ((unsigned long long)N*F)>>BASE;
        N = ((unsigned long long)N*F)>>31;
    

    Also in the end you'll have to shift N not by bitpos but some different value which I'm too lazy to figure out right now :).
