Question
I'm learning more than I ever wanted to know about floating-point numbers.
Let's say I needed to add:
1 10000000 00000000000000000000000
1 01111000 11111000000000000000000
2’s complement form.
The first bit is the sign, the next 8 bits are the exponent, and the last 23 bits are the mantissa.
Without doing a conversion to scientific notation, how do I add these two numbers? Can you walk through it step by step?
Any good resources for this stuff? Videos and practice examples would be great.
Answer 1:
You have to scale the numbers so that they have the same exponent. Then you add the mantissa fields and, if necessary, normalise the result.
Oh, yes, and if they're different signs, you just call your subtraction function instead :-)
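For the two numbers in the question, those steps might look like the following C sketch. It rests on a few assumptions: the bit patterns are taken to be IEEE 754 single precision (sign and magnitude, exponent biased by 127, implicit leading 1 in front of the mantissa), both operands are normal numbers with the same sign, and bits shifted out while scaling are simply truncated where real hardware would round. The names float_bits, bits_float and add_same_sign are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a float as its raw bits and back (memcpy avoids aliasing problems). */
uint32_t float_bits(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
float bits_float(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/* Hypothetical helper: add two normal, same-sign single-precision values
 * given as raw bit patterns. Simplified: no zero/subnormal/inf/NaN handling,
 * no exponent overflow check, and truncation instead of rounding. */
uint32_t add_same_sign(uint32_t a, uint32_t b)
{
    assert((a >> 31) == (b >> 31));                    /* same sign only */
    uint32_t sign = a & 0x80000000u;
    int32_t ea = (int32_t)((a >> 23) & 0xFFu);
    int32_t eb = (int32_t)((b >> 23) & 0xFFu);
    assert(ea > 0 && ea < 255 && eb > 0 && eb < 255);  /* normal numbers only */

    /* Put the implicit leading 1 back to get 24-bit mantissas. */
    uint32_t ma = (a & 0x7FFFFFu) | 0x800000u;
    uint32_t mb = (b & 0x7FFFFFu) | 0x800000u;

    /* Make a the operand with the larger exponent. */
    if (ea < eb) {
        int32_t te = ea; ea = eb; eb = te;
        uint32_t tm = ma; ma = mb; mb = tm;
    }

    /* Step 1: scale the smaller number to the larger exponent. */
    int32_t shift = ea - eb;
    mb = (shift < 24) ? (mb >> shift) : 0u;

    /* Step 2: add the mantissas. */
    uint32_t m = ma + mb;

    /* Step 3: normalise if the sum carried into a 25th bit. */
    if (m & 0x1000000u) {
        m >>= 1;
        ea++;
    }

    /* Drop the implicit 1 again and reassemble the three fields. */
    return sign | ((uint32_t)ea << 23) | (m & 0x7FFFFFu);
}
```

With the question's inputs (0xC0000000 and 0xBC7C0000 in hex), the second mantissa 1.11111 is shifted right by eight places, no carry occurs, and the sketch returns 0xC000FC00, which is 1 10000000 00000001111110000000000, or -2.015380859375.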
Let's do an example in decimal since it's easier to understand. Let's further assume the mantissa is stored with only eight digits to the right of the decimal point (so it lies between 0 inclusive and 1 exclusive).
Add the two numbers:
sign  exponent  mantissa  value
   1        42  18453284  + 0.18453284 x 10^42
   1        38  17654321  + 0.17654321 x 10^38
Scaling these numbers to the highest exponent gives something where you can add the mantissa fields:
sign  exponent  mantissa  value
   1        42  18453284  + 0.18453284 x 10^42
   1        42  00001765  + 0.00001765 x 10^42
   =        ==  ========
   1        42  18455049  + 0.18455049 x 10^42
And there you have your number. This also illustrates how accuracy can be lost due to the shifting: the digits 4321 of the smaller number fell off the end. For example, IEEE 754 single-precision floats will have:
1e38 + 1e-38 = 1e38
such as with:
#include <stdio.h>

int main (void) {
    float f1 = 1e38;
    float f2 = 1e-38;
    float f3 = f1 + f2;     /* f2 is far too small to change f1 */
    float f4 = f1 - f3;     /* so this difference is exactly zero */
    printf ("%.50f\n", f4);
    return 0;
}
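To see why the small number vanishes, it can help to look at the gap between a float and its nearest larger neighbour. The sketch below assumes float is IEEE 754 single precision and that the argument is positive and finite, in which case adding one to the raw bit pattern gives the next representable value. The names next_up and gap_above are hypothetical helpers, not library functions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Next representable float above f, found by incrementing the raw bits.
 * Only valid for positive, finite f (assumption of this sketch). */
float next_up(float f)
{
    uint32_t u;
    assert(f > 0.0f);
    memcpy(&u, &f, sizeof u);
    u += 1;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Distance from f to the next representable float above it. */
float gap_above(float f)
{
    return next_up(f) - f;
}
```

For 1e38f the gap comes out as 2^103, roughly 1e31, so anything as small as 1e-38 disappears entirely once it is scaled to the larger exponent.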
In terms of what happens with overflow, that's part of the normalisation I mentioned. Let's add 99999.9999 to 99999.9993. Since they already have the same exponent, no need to scale, so we just add:
sign  exponent  mantissa  value
   1         5  99999999  + 0.99999999 x 10^5
   1         5  99999993  + 0.99999993 x 10^5
   =        ==  ========
   1         5 199999992  ???
You can see here that the sum has nine digits but the mantissa can only hold eight, so the carry won't fit. What we do then is shift the number one digit to the right to make room for the carry. Since that shift is effectively a divide-by-ten, we have to increment the exponent to counter it.
So:
sign  exponent  mantissa  value
   1         5 199999992  ???
becomes:
sign  exponent  mantissa  value
   1         6  19999999  + 0.19999999 x 10^6
In reality, it's not just a simple right-shift since you need to round to the nearest number. If the digit you're shifting out is five or more, you need to add one to the digit on its left. That's why I chose 99999.9993 as the second number. If I had added 99999.9999 to itself, I would have ended up with:
sign  exponent  mantissa  value
   1         5 199999998  ???
which, on right shift, would have triggered quite a few carries towards the left:
sign  exponent  mantissa  value
   1         6  20000000  + 0.2 x 10^6
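The decimal walk-through above (scale, add, then shift and round on a carry) can be put into a small C sketch. This is a toy model only, under the same assumptions as the examples: an eight-digit mantissa read as 0.mantissa x 10^exponent, both operands positive so the sign is left out, and digits lost during scaling are simply dropped. The type dec8 and the function dec8_add are names invented for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Toy decimal float: value = 0.mantissa x 10^exponent, eight mantissa digits.
 * The sign is omitted; both operands are assumed positive. */
typedef struct {
    int exponent;
    uint32_t mantissa;
} dec8;

dec8 dec8_add(dec8 a, dec8 b)
{
    assert(a.mantissa < 100000000u && b.mantissa < 100000000u);

    /* Make a the operand with the larger exponent. */
    if (a.exponent < b.exponent) {
        dec8 t = a; a = b; b = t;
    }

    /* Scale b up to a's exponent: each step drops b's last digit. */
    for (int d = a.exponent - b.exponent; d > 0 && b.mantissa != 0; d--)
        b.mantissa /= 10;

    /* Add the mantissa fields. */
    dec8 r = { a.exponent, a.mantissa + b.mantissa };

    /* A ninth digit means a carry: shift right one digit, round on the
     * digit shifted out, and increment the exponent to compensate. */
    if (r.mantissa >= 100000000u) {
        uint32_t dropped = r.mantissa % 10;
        r.mantissa = r.mantissa / 10 + (dropped >= 5 ? 1u : 0u);
        r.exponent++;
    }
    return r;
}
```

Feeding it the three examples gives the same results as the tables: exponent 42 with mantissa 18455049, exponent 6 with mantissa 19999999, and exponent 6 with mantissa 20000000.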
Source: https://stackoverflow.com/questions/7884343/adding-32-bit-floating-point-numbers