Cube root on x87 FPU using Newton-Raphson method

无人共我 2021-01-21 01:18

I am trying to write an assembly program for the 8086 processor that finds the cube root of a number. Obviously I am using floating point.

Algorithm based upon Ne

2 answers
  •  感情败类
    2021-01-21 01:39

    Intel's insn reference manual documents all the instructions, including fdiv and fdivr (x/y instead of y/x). If you really need to learn mostly-obsolete x87 (fdiv) instead of SSE2 (divss), then this x87 tutorial is essential reading, esp. the early chapter that explains the register stack. Also see this x87 FP comparison Q&A. See more links in the x86 tag wiki.


    re: EDIT2 code dump:

    You have 4 fld instructions inside the loop, but no p-suffixed operations. Your loop will overflow the 8-register FP stack on the 3rd iteration, at which point you'll get a NaN (specifically, the indefinite-value NaN, which printf prints as 1#IND).
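To make the "3rd iteration" arithmetic concrete, here is a rough C model of the register-stack depth. It is only a sketch: the function name and parameters are made up for illustration, it assumes the stack is empty on loop entry, and it treats all pushes in an iteration as happening before any pops.

```c
/* Rough model of x87 stack depth across loop iterations.
 * Hypothetical helper, not part of the OP's code.
 * Returns the 1-based iteration on which a push would need more than the
 * 8 physical registers, or 0 if that never happens within max_iters. */
int first_overflow_iteration(int pushes_per_iter, int pops_per_iter, int max_iters) {
    int depth = 0;                      /* assumption: stack empty on loop entry */
    for (int iter = 1; iter <= max_iters; iter++) {
        depth += pushes_per_iter;       /* e.g. each fld pushes one entry */
        if (depth > 8) {
            return iter;                /* stack overflow: result becomes the indefinite NaN */
        }
        depth -= pops_per_iter;         /* e.g. each fstp / faddp / fdivp pops one entry */
    }
    return 0;
}
```

With 4 pushes and 0 pops per iteration, the depth goes 4, 8, 12, so the overflow happens in iteration 3. Balancing every push with a pop keeps the depth bounded.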

    I'd suggest designing your loop so an iteration starts with root in st(0), and ends with the next iteration's root value in st(0). Don't load or store to/from root inside the loop. Use fld1 to load 1.0 as your initial value outside the loop, and fstp [root] after the loop to pop st(0) into memory.


    You picked the most inconvenient way to do tmp / 3.0:

                              ; stack = tmp   (and should otherwise be empty once you fix the rest of your code)
        fld     three         ; stack = 3.0, tmp
        fld     st(1)         ; stack = tmp, 3.0, tmp   ; should have used fxch to just swap instead of making the stack deeper
        fdiv    st(0), st(1)  ; stack = tmp/3.0, 3.0, tmp
    

    fdiv, fsub, etc. have multiple register-register forms: one where st(0) is the destination, and one where it's the source. The form with st(0) as the source is also available with a pop, so you could write:

        fld     three         ; stack = 3.0, tmp
        fdivp                 ; stack = tmp / 3.0  popping the stack back to just one entry
        ; fdivp  st(1), st(0) ; this is what fdivp with no operands means
    

    It's actually even simpler than that if you use a memory operand directly instead of loading it. Since you want st(0) /= 3.0, you can do fdiv [three]. In that case, FP ops are just like integer ops, where you can do div dword ptr [integer_from_memory] to use a memory source operand.

    The non-commutative operations (subtract and divide) also have reverse versions (e.g. fdivr), which can save you an fxch or let you use a memory operand even if you'd needed 3.0/tmp instead of tmp/3.0.


    Dividing by 3 is the same as multiplying by 1/3, and fmul is much faster than fdiv. From a code-simplicity point of view, multiply is commutative, so another way to implement st(0) /= 3 is:

    fld    [one_third]
    fmulp                  ; shorthand for  fmulp st(1), st(0)
    
    ; or
    fmul   [one_third]
    

    Note that 1/3.0 has no exact representation in binary floating point, but all integers up to about 2^24 in magnitude do (single-precision REAL4 has a 24-bit significand, counting the implicit leading bit). You should only care about this if you were expecting to work with exact multiples of three.
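A small C sketch of both halves of that note, in case it helps to see the numbers. The function names are hypothetical, and the volatile qualifiers are only there to force a real 32-bit store so that x87 extended precision doesn't hide the rounding.

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if n survives a round trip through a
 * 32-bit float unchanged, 0 if the conversion had to round it. */
int survives_float(int32_t n) {
    volatile float f = (float)n;        /* volatile forces a true single-precision store */
    return (int32_t)f == n;
}

/* Hypothetical helper: how far the single-precision constant 1/3 is from
 * the (more accurate) double-precision 1/3. */
double one_third_rounding_error(void) {
    volatile float one_third = 1.0f / 3.0f;
    return (double)one_third - 1.0 / 3.0;
}
```

Every integer up to 2^24 = 16777216 round-trips, but 16777217 does not. The rounded 1/3 is off by a small non-zero amount (on the order of 1e-8), which is why fmul by one_third is not bit-identical to fdiv by three.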


    Comments on the original code:

    You can hoist a division out of the loop by doing 2.0 / 3.0 and x/3.0 ahead of time. This is worth it if you expect the loop to run more than one iteration on average.
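As a reference model for that hoisting, here is the same Newton-Raphson update written in C, with both divisions by 3.0 done once before the loop. This is a sketch under assumptions, not the OP's code: the function name, the iteration cap, and the relative tolerance are all invented for illustration, and it assumes x > 0 with 1.0 as the starting guess.

```c
#include <math.h>

/* Hypothetical reference model of the iteration
 *   root' = (2/3)*root + (x/3) / root^2
 * Assumes x > 0; max_iters and the 1e-6 relative tolerance are arbitrary. */
float cbrt_newton(float x, int max_iters) {
    const float two_thirds = 2.0f / 3.0f;   /* loop-invariant, computed once */
    const float x_third = x / 3.0f;         /* loop-invariant, computed once */
    float root = 1.0f;                      /* initial guess, like fld1 */

    for (int i = 0; i < max_iters; i++) {
        /* Only one division remains inside the loop. */
        float next = root * two_thirds + x_third / (root * root);
        int converged = fabsf(next - root) <= 1e-6f * fabsf(next);
        root = next;
        if (converged) {
            break;
        }
    }
    return root;
}
```

For example, cbrt_newton(27.0f, 50) should converge to approximately 3.0 after a handful of iterations. Comparing the assembly output against a model like this is a cheap way to check the x87 version.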


    You can duplicate the top of the stack with fld st(0), so you don't have to keep loading from memory.


    fimul [root] (integer mul) is a bug: Your root is in REAL4 (32bit float) format, not integer. fidiv is similarly a bug, and of course doesn't work with an x87 register as a source operand.

    Since you have root at the top of the stack, I think you can just fmul st(0) to use st(0) as both the explicit and implicit operand, resulting in st(0) = st(0) * st(0), with no change in the depth of the stack.


    You could also use sqrt as a better initial approximation than 1.0, or maybe +/-1 * sqrtf(fabsf(x)). I don't see an x87 instruction for applying the sign of one float to another, just fchs to unconditionally flip, and fabs to unconditionally clear the sign bit. There is an fcmov, but it requires a P6 or later CPU. You mentioned 8086, but then used .586, so IDK what you're targeting.


    Better loop body:

    Not debugged or tested, but your code full of repeated loads from the same data was making me crazy. This optimized version is here because I was curious, not because I think it's going to help the OP directly.

    Also, hopefully this is a good example of how to comment the data flow in code where it's tricky. (e.g. x87, or vectorized code with shuffles).

    ; 2.0/3.0 in st(1)
    ; x/3.0 in st(2)
    
    ; before each iteration: st(0) = root
    ;  after each iteration: st(0) = root * 2.0/3.0 + (x/3.0 / (root*root)), with constants undisturbed
    
    loop_body:
        fld     st(0)         ; stack: root, root, 2/3, x/3
        fmul    st(0), st(0)  ; stack: root^2, root, 2/3, x/3
        fdivr   st(0), st(3)  ; stack: x/3 / root^2, root, 2/3, x/3
        fxch    st(1)         ; stack: root, x/3/root^2, 2/3, x/3
        fmul    st(0), st(2)  ; stack: root*2/3, x/3/root^2, 2/3, x/3
        faddp                 ; stack: root*2/3 + x/3/root^2, 2/3, x/3
    
    ; TODO: compare and loop back to loop_body
    
        fstp    [root]         ; store and pop
        fstp    st(0)          ; pop the two constants off the FP stack to empty it before returning
        fstp    st(0)
        ; finit is very slow, ~80 cycles, don't use it if you don't have to.
    

    32-bit function calling conventions return FP results in st(0), so you could do that, but then the caller would probably have to store it somewhere anyway.
