Question
We are running some code both inside the Visual Studio process context (x86) and outside of it (x64). I noticed that the following code produces a different result in each context (100000000000 in x86 and 99999997952 in x64):
float val = 1000f;
val = val * val;
return (ulong)(val * 100000.0f);
We need to obtain a ulong value from a float value in a reliable way, no matter the context and no matter the ulong value; it is just for hashing purposes. I tested the following code in both the x64 and x86 contexts and obtained the same result in each, so it looks reliable:
float operandFloat = (float)obj;
byte[] bytes = BitConverter.GetBytes(operandFloat);
Debug.Assert(bytes.Length == 4);
uint @uint = BitConverter.ToUInt32(bytes, 0);
return (ulong)@uint;
Is this code reliable?
Answer 1:
As others have speculated in the comments, the difference you're observing is the result of differential precision when doing floating-point arithmetic, arising out of a difference between how the 32-bit and 64-bit builds perform these operations.
Your code is translated by the 32-bit (x86) JIT compiler into the following object code:
fld qword ptr ds:[0E63308h] ; Load constant 1.0e+11 onto top of FPU stack.
sub esp, 8 ; Allocate 8 bytes of stack space.
fstp qword ptr [esp] ; Pop top of FPU stack, putting 1.0e+11 into
; the allocated stack space at [esp].
call 73792C70 ; Call internal helper method that converts the
; double-precision floating-point value stored at [esp]
; into a 64-bit integer, and returns it in edx:eax.
; At this point, edx:eax == 100000000000.
Notice that the optimizer has folded your arithmetic computation ((1000f * 1000f) * 100000f) to the constant 1.0e+11. It has stored this constant in the binary's data segment, and loads it onto the top of the x87 floating-point stack (the fld instruction). The code then allocates 8 bytes of stack space (enough for a 64-bit double-precision floating-point value) by subtracting from the stack pointer (esp). The fstp instruction pops the value off the top of the x87 floating-point stack and stores it in its memory operand; in this case, it stores it into the 8 bytes that we just allocated on the stack. All of this shuffling is rather pointless: it could have just stored the floating-point constant 1.0e+11 directly into memory, bypassing the trip through the x87 FPU, but the JIT optimizer isn't perfect. Finally, the JIT emitted code to call an internal helper function that converts the double-precision floating-point value stored in memory (1.0e+11) into a 64-bit integer. The 64-bit integer result is returned in the register pair edx:eax, as is customary for 32-bit Windows calling conventions. When this code completes, edx:eax contains the 64-bit integer value 100000000000, or 1.0e+11, exactly as you would expect.
(Hopefully the terminology here is not too confusing. Note that there are two different "stacks". The x87 FPU has a series of registers, which are accessed like a stack; I refer to this as the FPU stack. Then there is the stack with which you are probably familiar, the one stored in main memory and accessed via the stack pointer, esp.)
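To make the double-precision behavior concrete, here is a small C# sketch (my addition, not part of the original answer) that mirrors what the 32-bit path effectively computes; since 1.0e+11 is exactly representable as a double, the cast yields the expected integer:
double folded = (1000.0 * 1000.0) * 100000.0; // the folded constant, computed at double precision
Console.WriteLine((ulong)folded); // prints 100000000000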
However, things are done a bit differently by the 64-bit (x86-64) JIT compiler. The big difference here is that 64-bit targets always use SSE2 instructions for floating-point operations, since all chips that support AMD64 also support SSE2, and SSE2 is more efficient and more flexible than the old x87 FPU. Specifically, the 64-bit JIT translates your code into the following:
movsd xmm0, mmword ptr [7FFF7B1A44D8h] ; Load constant into XMM0 register.
call 00007FFFDAC253B0 ; Call internal helper method that converts the
; floating-point value in XMM0 into a 64-bit int
; that is returned in RAX.
Things immediately go wrong here, because the constant value being loaded by the first instruction is 0x42374876E0000000, which is the binary floating-point representation of 99999997952.0. The problem is not the helper function that is doing the conversion to a 64-bit integer. Instead, it is the JIT compiler itself, specifically the optimizer routine that is pre-computing the constant.
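You can verify that constant for yourself; a minimal sketch (my addition) using the standard BitConverter.Int64BitsToDouble to reinterpret the bits:
double d = BitConverter.Int64BitsToDouble(0x42374876E0000000); // reinterpret the raw bits as a double
Console.WriteLine(d); // prints 99999997952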
To gain some insight into how that goes wrong, we'll turn off JIT optimization and see what the code looks like:
movss xmm0, dword ptr [7FFF7B1A4500h]
movss dword ptr [rbp-4], xmm0
movss xmm0, dword ptr [rbp-4]
movss xmm1, dword ptr [rbp-4]
mulss xmm0, xmm1
mulss xmm0, dword ptr [7FFF7B1A4504h]
cvtss2sd xmm0, xmm0
call 00007FFFDAC253B0
The first movss instruction loads a single-precision floating-point constant from memory into the xmm0 register. This time, however, that constant is 0x447A0000, which is the precise binary representation of 1000, the initial float value from your code.
The second movss instruction turns right around and stores this value from the xmm0 register into memory, and the third movss instruction re-loads the just-stored value from memory back into the xmm0 register. (Told you this was unoptimized code!) It also loads a second copy of that same value from memory into the xmm1 register, and then multiplies (mulss) the two single-precision values in xmm0 and xmm1 together. This is the literal translation of your val = val * val code. The result of this operation (which ends up in xmm0) is 0x49742400, or 1.0e+6, precisely as you would expect.
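A quick C# check (my addition) that this first multiplication is still exact in single precision:
float val = 1000f; // bits 0x447A0000
val = val * val; // 1.0e+6, still exactly representable (bits 0x49742400)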
The second mulss instruction performs the val * 100000.0f operation. It implicitly loads the single-precision floating-point constant 1.0e+5 and multiplies it with the value in xmm0 (which, recall, is 1.0e+6). Unfortunately, the result of this operation is not what you would expect. Instead of 1.0e+11, it is actually 9.9999998e+10. Why? Because 1.0e+11 cannot be precisely represented as a single-precision floating-point value. The closest representation is 0x51BA43B7, or 9.9999998e+10.
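That rounding can be reproduced directly in C#; a sketch (my addition) that inspects the bit pattern with BitConverter:
float product = 1.0e6f * 1.0e5f; // both factors are exact; the product rounds to the nearest float
int bits = BitConverter.ToInt32(BitConverter.GetBytes(product), 0);
Console.WriteLine(bits.ToString("X8")); // prints 51BA43B7
Console.WriteLine(product); // prints approximately 9.9999998E+10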
Finally, the cvtss2sd instruction performs an in-place conversion of the (wrong!) scalar single-precision floating-point value in xmm0 to a scalar double-precision floating-point value. In a comment to the question, Neitsa suggested that this might be the source of the problem. In fact, as we have seen, the source of the problem is the previous instruction, the one that does the multiplication. The cvtss2sd instruction just converts an already imprecise single-precision floating-point representation (0x51BA43B7) to an imprecise double-precision floating-point representation: 0x42374876E0000000, or 99999997952.0.
And this is precisely the series of operations performed by the JIT compiler to produce the initial double-precision floating-point constant that is loaded into the xmm0 register in the optimized code.
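In C# terms, that chain looks like the following sketch (my reconstruction, not the JIT's actual code); widening the already-rounded float to a double preserves the error rather than repairing it:
float single = (float)1.0e11; // rounds to 9.9999998e+10 (bits 0x51BA43B7)
double widened = single; // the cvtss2sd equivalent: same wrong value, wider format
Console.WriteLine((ulong)widened); // prints 99999997952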
Although I have been implying throughout this answer that the JIT compiler is to blame, that is not the case at all! If you had compiled the identical code in C or C++ while targeting the SSE2 instruction set, you would have gotten exactly the same imprecise result: 99999997952.0. The JIT compiler is performing just as one would expect it to—if, that is, one's expectations are correctly calibrated to the imprecision of floating-point operations!
So, what is the moral of this story? There are two of them. First, floating-point operations are tricky and there is a lot to know about them. Second, in light of this, always use the most precision that you have available when doing floating-point arithmetic!
The 32-bit code is producing the correct result because it is operating with double-precision floating-point values. With 64 bits to play with, a precise representation of 1.0e+11 is possible.
The 64-bit code is producing the incorrect result because it is using single-precision floating-point values. With only 32 bits to play with, a precise representation of 1.0e+11 is not possible.
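A short way to see why the bit width matters (my note, not from the original answer): 1.0e+11 factors as 48828125 * 2^11, and the odd factor 48828125 needs 26 significand bits, which is more than a float's 24 but well within a double's 53.
Console.WriteLine((ulong)1.0e11); // double: exact, prints 100000000000
Console.WriteLine((ulong)(float)1.0e11); // float: rounds, prints 99999997952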
You would not have had this problem if you had used the double type to begin with:
double val = 1000.0;
val = val * val;
return (ulong)(val * 100000.0);
This ensures the correct result on all architectures, with no need for ugly, non-portable bit-manipulation hacks like those suggested in the question. (Which still cannot ensure the correct result, since it doesn't solve the root of the problem, namely that your desired result cannot be directly represented in a 32-bit single-precision float.)
Even if you have to take input as a single-precision float, convert it immediately into a double, and do all of your subsequent arithmetic manipulations in the double-precision space. That would still have solved this problem, since the initial value of 1000 can be precisely represented as a float.
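A sketch of that approach (my addition, adapted from the code in the question):
float input = 1000f; // the input arrives as a float; 1000 is exactly representable
double val = input; // widen immediately to double
val = val * val;
return (ulong)(val * 100000.0); // all arithmetic done in double precision; yields 100000000000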
Source: https://stackoverflow.com/questions/41225712/float-arithmetic-and-x86-and-x64-context