ieee-754 | 易学教程

How to convert 32-bit binary to float

Question: I want to perform an IEEE 754 conversion from a 32-bit binary string to a float in Python. I have tried this: import struct f = int('11000001101011000111101011100001', 2) print struct.unpack('f', struct.pack('i', f))[0] but it doesn't work for numbers with the sign bit set. The expected output should look like this: bintofloat(11000001101011000111101011100001) >>> -21.56 Answer 1: You could use struct as follows: import struct f = int('01000001101011000111101011100001', 2) print struct.unpack('f', struct.pack('I',
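A minimal Python 3 sketch of the conversion asked about above. The function name bintofloat comes from the question; the input validation and the explicit big-endian format codes are assumptions added for illustration. The signed code 'i' cannot hold a value with bit 31 set, which is why the unsigned code 'I' is used here.

```python
import struct


def bintofloat(bits: str) -> float:
    """Interpret a string of 32 binary digits as an IEEE 754 binary32 value."""
    if len(bits) != 32 or set(bits) - {"0", "1"}:
        raise ValueError("expected exactly 32 binary digits")
    raw = int(bits, 2)
    # '>I' packs an unsigned 32-bit integer (big-endian);
    # '>f' reinterprets the same four bytes as a single-precision float.
    return struct.unpack(">f", struct.pack(">I", raw))[0]


# Prints the binary32 value closest to -21.56 (sign bit set).
print(bintofloat("11000001101011000111101011100001"))
```

Fixing the byte order on both sides keeps the result independent of the host machine's endianness.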

How do you print out an IEEE754 number (without printf)?

Question: For the purposes of this question, I do not have the ability to use printf facilities (I can't tell you why, unfortunately, but let's just assume for now that I know what I'm doing). For an IEEE 754 single-precision number, you have the following bits: SEEE EEEE EFFF FFFF FFFF FFFF FFFF FFFF where S is the sign, E is the exponent and F is the fraction. Printing the sign is relatively easy for all cases, as is catching all the special cases like NaN ( E == 0xff, F != 0 ), Inf ( E == 0xff, F ==
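One way to sketch the idea in Python: split the value into the S, E and F fields described above, handle NaN and Inf by their bit patterns, and generate the decimal digits with integer arithmetic only. The helper names decode_binary32 and describe are hypothetical, and this is an illustration of the field layout rather than the approach from the original answers.

```python
import struct
from fractions import Fraction


def decode_binary32(x: float):
    """Return the (sign, exponent, fraction) fields of x stored as binary32."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


def describe(x: float) -> str:
    """Exact decimal expansion of the binary32 value, without string formatting of floats."""
    sign, e, f = decode_binary32(x)
    prefix = "-" if sign else ""
    if e == 0xFF:                       # all-ones exponent: NaN or infinity
        return "nan" if f else prefix + "inf"
    if e == 0:                          # zero or subnormal: no implicit leading 1
        value = Fraction(f, 1 << 23) * Fraction(1, 1 << 126)
    else:                               # normal number: implicit leading 1
        value = (1 + Fraction(f, 1 << 23)) * Fraction(2) ** (e - 127)
    whole, rem = divmod(value.numerator, value.denominator)
    digits = []
    while rem:                          # terminates: the denominator is a power of two
        rem *= 10
        digit, rem = divmod(rem, value.denominator)
        digits.append(str(digit))
    return prefix + str(whole) + ("." + "".join(digits) if digits else "")


print(describe(0.1))
```

The digit loop always terminates because every finite binary floating-point value has a finite decimal expansion.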

Rounding Floating Point Numbers after addition (guard, sticky, and round bits)

Question: I haven't been able to find a good explanation of this anywhere on the web yet, so I'm hoping somebody here can explain it for me. I want to add two binary numbers by hand: 1.001_2 * 2^2 and 1.010,0000,0000,0000,0000,0011_2 * 2^1. I can add them with no problem; I get the following result after de-normalizing the first number, adding the two, and re-normalizing: 1.1100,0000,0000,0000,0000,0011_2 * 2^2. The issue is that this number will not fit into single-precision IEEE 754 format without truncating
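The rounding step can be sketched in Python on integer significands. This is a simplified model, assuming round-to-nearest, ties-to-even: the first dropped bit acts as the guard bit, and all remaining dropped bits are OR-ed together (the round and sticky bits are merged into one flag here). The function name is hypothetical.

```python
import struct


def round_nearest_even(significand: int, drop: int) -> int:
    """Drop the low `drop` bits (drop >= 1) of an integer significand, ties to even."""
    kept = significand >> drop
    guard = (significand >> (drop - 1)) & 1                      # first dropped bit
    sticky = (significand & ((1 << (drop - 1)) - 1)) != 0        # any lower bit set
    if guard and (sticky or kept & 1):
        kept += 1          # the caller must renormalize if this carries out
    return kept


# The sum from the question: 1.1100 0000 0000 0000 0000 0011 (24 fraction bits).
total = int("1" + "1100" + "0000" * 4 + "0011", 2)
rounded = round_nearest_even(total, 1)          # keep 23 fraction bits
print(bin(rounded))

# Cross-check against the platform's own double -> float conversion.
exact = 7 + 3 * 2.0 ** -22                      # same value, exact as a double
as_float32 = struct.unpack(">f", struct.pack(">f", exact))[0]
print(as_float32 == rounded * 2.0 ** -21)
```

In this example the guard bit is 1 and nothing lies below it, so the value is an exact tie and is rounded to the neighbour whose last kept bit is 0.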

IEEE Std 754 Floating-Point: let t := a - b, does the standard guarantee that a == b + t?

Question: Assume that t, a, b are all double (IEEE Std 754) variables, and that the values of a and b are NOT NaN (but may be Inf). After t = a - b, do I necessarily have a == b + t? Answer 1: Absolutely not. One obvious case is a=DBL_MAX, b=-DBL_MAX. Then t=INFINITY, so b+t is also INFINITY. What may be more surprising is that there are cases where this happens without any overflow. Basically, they're all of the form where a-b is inexact. For example, if a is DBL_EPSILON/4 and b is -1, a-b is 1
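Both counterexamples from the answer can be reproduced in Python, whose float is an IEEE 754 double on common platforms. The names DBL_MAX and DBL_EPSILON below are local stand-ins for the C macros, taken from sys.float_info; the helper name is hypothetical.

```python
import sys

DBL_MAX = sys.float_info.max          # stand-in for C's DBL_MAX
DBL_EPSILON = sys.float_info.epsilon  # stand-in for C's DBL_EPSILON (2**-52)


def survives_round_trip(a: float, b: float) -> bool:
    """Compute t = a - b and report whether a == b + t."""
    t = a - b
    return a == b + t


# Overflow case: t becomes +inf, so b + t is +inf as well.
print(survives_round_trip(DBL_MAX, -DBL_MAX))
# Inexact case without overflow: a - b rounds to exactly 1.0, so b + t is 0.0.
print(survives_round_trip(DBL_EPSILON / 4, -1.0))
# A case where every intermediate result is exact.
print(survives_round_trip(3.0, 1.0))
```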

`std::sin` is wrong in the last bit

Question: I am porting a program from Matlab to C++ for efficiency. It is important for the output of both programs to be exactly the same (**). I am getting different results for this operation: std::sin(0.497418836818383950) = 0.477158760259608410 (C++); sin(0.497418836818383950) = 0.47715876025960846000 (Matlab); N[Sin[0.497418836818383950], 20] = 0.477158760259608433 (Mathematica). So, as far as I know, both C++ and Matlab use double arithmetic as defined by IEEE 754. I think I have read somewhere
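To see how far apart two such results are, it can help to count the representable doubles between them. The sketch below is a Python illustration with a hypothetical helper, ulp_distance; it parses the two printed values from the question and also prints what the local math library returns, which may differ from either.

```python
import math
import struct


def ordered_bits(x: float) -> int:
    """Map a finite double to an integer so that adjacent doubles map to adjacent integers."""
    bits = struct.unpack(">q", struct.pack(">d", x))[0]
    # Negative doubles have the sign bit set; mirror them below zero.
    return bits if bits >= 0 else -(bits & 0x7FFFFFFFFFFFFFFF)


def ulp_distance(a: float, b: float) -> int:
    """Number of representable doubles separating a and b."""
    return abs(ordered_bits(a) - ordered_bits(b))


cpp_result = float("0.477158760259608410")       # value printed by the C++ program
matlab_result = float("0.47715876025960846000")  # value printed by Matlab
print(ulp_distance(cpp_result, matlab_result))
print(repr(math.sin(0.497418836818383950)))      # depends on the local libm
```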

How to test if numeric conversion will change value?

Question: I'm performing some data type conversions where I need to represent uint, long, ulong and decimal as IEEE 754 double floating point values. I want to be able to detect whether the IEEE 754 data type cannot contain the value before I perform the conversion. A brute force solution would be to wrap a try-catch around a cast to double, looking for OverflowException. Parts of the CLR documentation imply that some conversions just silently change the value without any exceptions.
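The question is about the CLR, but the underlying check (convert, then compare exactly against the original) can be sketched in Python as an analogue. The function name converts_exactly is hypothetical, and Python's int and Decimal stand in for the CLR integer and decimal types.

```python
from decimal import Decimal


def converts_exactly(value) -> bool:
    """True if an int or Decimal is exactly representable as an IEEE 754 double."""
    try:
        as_double = float(value)
    except OverflowError:            # raised for ints beyond the double range
        return False
    # Decimal(float) is exact, and Decimal equality compares exact values.
    return Decimal(as_double) == Decimal(value)


print(converts_exactly(2 ** 53))         # every integer up to 2**53 fits
print(converts_exactly(2 ** 53 + 1))     # needs 54 significant bits
print(converts_exactly(2 ** 64 - 1))     # largest 64-bit unsigned value
print(converts_exactly(Decimal("0.1")))  # not a finite binary fraction
```

Comparing after the fact is simple but performs the conversion first; a range-and-bit-count test would be needed to decide strictly beforehand.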

IEEE 754: How exactly does it work?

Question: Why does the following code behave as it does in C? float x = 2147483647; // 2^31 - 1 printf("%f\n", x); // Outputs 2147483648 Here is my thought process: 2147483647 = 0 1001 1101 1111 1111 1111 1111 1111 111 (0.11111111111111111111111)base2 = (1-(0.5)^23)base10 => (1.11111111111111111111111)base2 = (1 + 1-(0.5)^23)base10 = (1.99999988)base10 Therefore, to convert the IEEE 754 notation back to decimal: 1.99999988 * 2^30 = 2147483520 So technically, the C program must have printed out 2147483520,
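The behaviour can be reproduced in Python by round-tripping the value through a binary32 encoding. This sketch assumes the conversion rounds to nearest, which is the usual default; the helper name to_binary32 is hypothetical. The two binary32 neighbours of 2147483647 are 2147483520 and 2147483648, and the second is far closer.

```python
import struct


def to_binary32(x: float) -> float:
    """Round a double to the nearest binary32 value and return it as a double."""
    return struct.unpack(">f", struct.pack(">f", x))[0]


x = to_binary32(float(2147483647))
print(x)                                  # the value C's float x actually holds
print(2147483647 - 2147483520)            # distance to the lower neighbour
print(2147483648 - 2147483647)            # distance to the upper neighbour
```

Truncating the significand, as in the question's reasoning, would give the lower neighbour; rounding to nearest gives the upper one.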

IEEE double such that sqrt(x*x) ≠ x

Question: Does there exist an IEEE double x>0 such that sqrt(x*x) ≠ x, under the condition that the computation x*x does not overflow or underflow to Inf, 0, or a denormal number? This is given that sqrt returns the nearest representable result, and so does x*x (both as mandated by the IEEE standard, "square root operation be calculated as if in infinite precision, and then rounded to one of the two nearest floating-point numbers of the specified precision that surround the infinitely precise result
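A randomized search is one quick way to probe the question empirically. The Python sketch below proves nothing; it only reports any counterexamples it happens to find, and it assumes the platform's multiplication and math.sqrt are correctly rounded. The function name and the exponent range are illustrative choices.

```python
import math
import random


def find_mismatches(trials: int = 100_000, seed: int = 0) -> list:
    """Return the sampled x > 0 for which sqrt(x * x) != x."""
    rng = random.Random(seed)
    mismatches = []
    for _ in range(trials):
        # Significand in [1, 2) scaled by 2**k, with k chosen so that x * x
        # stays well inside the normal range (no overflow, no subnormals).
        x = math.ldexp(1.0 + rng.random(), rng.randint(-500, 500))
        if math.sqrt(x * x) != x:
            mismatches.append(x)
    return mismatches


print(len(find_mismatches()))
```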

Understanding compilation result for std::isnan

Question: I always assumed that there is practically no difference between testing for NaN via x!=x or std::isnan(x). However, gcc generates different assembly for the two versions (live on godbolt.org): ;x!=x: ucomisd %xmm0, %xmm0 movl $1, %edx setne %al cmovp %edx, %eax ret ;std::isnan(x) ucomisd %xmm0, %xmm0 setp %al ret However, I'm struggling to understand both versions. My naive attempt to compile std::isnan(x) would be: ucomisd %xmm0, %xmm0 setne %al ;return true when not equal ret but I must be
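The difference comes down to how ucomisd reports an unordered comparison: it sets ZF, PF and CF all to 1, so a NaN compared with itself looks "equal" to setne, and only the parity flag reveals it. The Python sketch below is a toy model of the three instruction sequences from the question, not real machine code; all function names are hypothetical.

```python
import math


def ucomisd_flags(a: float, b: float):
    """Model the (ZF, PF, CF) flags that ucomisd sets when comparing a with b."""
    if math.isnan(a) or math.isnan(b):
        return 1, 1, 1          # unordered
    if a > b:
        return 0, 0, 0
    if a < b:
        return 0, 0, 1
    return 1, 0, 0              # equal


def naive_setne(x: float) -> int:
    zf, pf, cf = ucomisd_flags(x, x)
    return int(not zf)          # setne %al

def gcc_not_equal(x: float) -> int:
    zf, pf, cf = ucomisd_flags(x, x)
    al = int(not zf)            # setne %al
    if pf:                      # cmovp %edx, %eax with %edx holding 1
        al = 1
    return al

def gcc_isnan(x: float) -> int:
    zf, pf, cf = ucomisd_flags(x, x)
    return pf                   # setp %al


nan = float("nan")
print(naive_setne(nan), gcc_not_equal(nan), gcc_isnan(nan))
```

In this model the naive setne version returns 0 for NaN, which is why the compiler cannot use it on its own.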

Invertibility of IEEE 754 floating-point division

Question: How invertible is IEEE 754 floating-point division? That is, does the standard guarantee that if double y = 1.0 / x then x == 1.0 / y, i.e. that x can be restored precisely, bit for bit? The cases when y is infinity or NaN are obvious exceptions. Answer 1: Yes, there are IEEE 754 double-precision(*) values x such that x != 1.0 / (1.0 / x). It is easy to build an example of a normal value with this property by hand: the one that's written 0x1.fffffffffffffp0 in C99's hexadecimal
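The value named in the answer can be checked directly in Python, which accepts the same hexadecimal notation through float.fromhex. This is a small sketch assuming the platform's double division is correctly rounded; the helper name is hypothetical.

```python
def reciprocal_round_trip(x: float) -> float:
    """Return 1.0 / (1.0 / x), computed in double precision."""
    y = 1.0 / x
    return 1.0 / y


x = float.fromhex("0x1.fffffffffffffp0")   # the largest double below 2.0
back = reciprocal_round_trip(x)
print(x.hex(), (1.0 / x).hex(), back.hex())
print(back == x)
```

Here 1.0 / x rounds up to the double just above 0.5, and inverting that lands one step below x rather than on it.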