ieee-754

How to convert 32-bit binary to float

回眸只為那壹抹淺笑 submitted on 2019-12-10 12:13:56
Question: I want to perform an IEEE 754 conversion from a 32-bit binary string to a float in Python. I have tried this:
    import struct
    f = int('11000001101011000111101011100001', 2)
    print struct.unpack('f', struct.pack('i', f))[0]
but this doesn't work for numbers whose sign bit is set. The expected output should look like this:
    bintofloat(11000001101011000111101011100001) >>> -21.56
Answer 1: You could use struct as follows:
    import struct
    f = int('01000001101011000111101011100001', 2)
    print struct.unpack('f', struct.pack('I',
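
A minimal sketch of the direction the answer points in, assuming Python 3 and big-endian packing so that the bit string reads left to right. The helper name bin_to_float is hypothetical. The unsigned format code 'I' accepts values with the top bit set, whereas the signed 'i' rejects them as out of range.

```python
import struct

def bin_to_float(bits: str) -> float:
    """Interpret a string of 32 binary digits as an IEEE 754 binary32 value."""
    if len(bits) != 32 or set(bits) - {"0", "1"}:
        raise ValueError("expected exactly 32 binary digits")
    raw = int(bits, 2)  # unsigned integer in the range 0 .. 2**32 - 1
    # Pack as an unsigned 32-bit integer, then reinterpret the same 4 bytes as a float.
    return struct.unpack(">f", struct.pack(">I", raw))[0]

# The pattern from the question; the result is the binary32 value closest to -21.56.
print(bin_to_float("11000001101011000111101011100001"))
```

The printed value is only approximately -21.56, because the binary32 value is widened to a Python float (binary64) before printing.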

How do you print out an IEEE754 number (without printf)?

◇◆丶佛笑我妖孽 submitted on 2019-12-10 05:31:57
Question: For the purposes of this question, I do not have the ability to use printf facilities (I can't tell you why, unfortunately, but let's just assume for now that I know what I'm doing). For an IEEE754 single precision number, you have the following bits:
    SEEE EEEE EFFF FFFF FFFF FFFF FFFF FFFF
where S is the sign, E is the exponent and F is the fraction. Printing the sign is relatively easy for all cases, as is catching all the special cases like NaN ( E == 0xff, F != 0 ), Inf ( E == 0xff, F ==
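
The field layout described above can be sketched in Python as follows. The function names decode_binary32 and exact_value are hypothetical, and the sketch covers only the decoding step, not the digit generation the question asks about. It uses fractions.Fraction so that the decoded value is exact.

```python
from fractions import Fraction

def decode_binary32(bits: int):
    """Split a 32-bit pattern into sign, exponent, fraction and a class label."""
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 0xFF:
        kind = "nan" if fraction else "inf"
    elif exponent == 0:
        kind = "subnormal" if fraction else "zero"
    else:
        kind = "normal"
    return sign, exponent, fraction, kind

def exact_value(bits: int) -> Fraction:
    """Return the exact rational value of a finite binary32 bit pattern."""
    sign, exponent, fraction, kind = decode_binary32(bits)
    if kind in ("nan", "inf"):
        raise ValueError("not a finite value")
    if exponent == 0:
        # Zero and subnormals: no implicit leading 1, fixed exponent of -126.
        significand, power = Fraction(fraction, 2**23), -126
    else:
        significand, power = 1 + Fraction(fraction, 2**23), exponent - 127
    return (-1) ** sign * significand * Fraction(2) ** power

print(decode_binary32(0x7FC00000))
print(exact_value(0xC1AC7AE1))
```

An exact rational is one possible starting point for producing decimal digits by repeated multiplication by 10, although other approaches exist.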

Rounding Floating Point Numbers after addition (guard, sticky, and round bits)

六眼飞鱼酱① submitted on 2019-12-10 03:30:27
Question: I haven't been able to find a good explanation of this anywhere on the web yet, so I'm hoping somebody here can explain it for me. I want to add two binary numbers by hand:
    1.001_2 * 2^2
    1.010,0000,0000,0000,0000,0011_2 * 2^1
I can add them no problem, I get the following result after de-normalizing the first number, adding the two, and re-normalizing them.
    1.1100,0000,0000,0000,0000,0011_2 * 2^2
The issue is, that number will not fit into single-precision IEEE 754 format without truncating
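
A sketch of the rounding step for this exact sum, assuming the default round-to-nearest, ties-to-even mode. The names round_nearest_even, a_sig and b_sig are hypothetical. Significands are held as integers, and for this rounding mode the round and sticky bits are folded into a single "anything nonzero below the guard bit" flag.

```python
import struct

FRACTION_BITS = 23  # binary32 stores 23 fraction bits plus an implicit leading 1

def round_nearest_even(sig: int, drop: int) -> int:
    """Discard the low `drop` bits of `sig`, rounding to nearest with ties to even."""
    if drop <= 0:
        return sig
    kept = sig >> drop
    guard = (sig >> (drop - 1)) & 1                   # first discarded bit
    sticky = (sig & ((1 << (drop - 1)) - 1)) != 0     # any 1 below the guard bit
    if guard and (sticky or (kept & 1)):
        kept += 1                                     # above half, or a tie with odd kept part
    return kept

# 1.001_2 * 2^2 and 1.010,0000,0000,0000,0000,0011_2 * 2^1 as 24-bit integers.
a_sig, a_exp = int("1001" + "0" * 20, 2), 2
b_sig, b_exp = int("1" + "010" + "0000" * 4 + "0011", 2), 1

# Align exactly by widening the operand with the larger exponent, then add.
total = (a_sig << (a_exp - b_exp)) + b_sig            # in units of 2**(b_exp - 23)
drop = total.bit_length() - (FRACTION_BITS + 1)       # bits that do not fit
rounded = round_nearest_even(total, drop)
result_exp = b_exp + drop
if rounded.bit_length() > FRACTION_BITS + 1:          # rounding carried out; renormalize
    rounded >>= 1
    result_exp += 1

value = rounded * 2.0 ** (result_exp - FRACTION_BITS)
print(format(rounded, "024b"), result_exp, value)
```

Here exactly one bit is discarded and it is a 1 with nothing below it, so the sum is a tie; the kept significand is odd, so it rounds up to the even neighbor. The same result comes out of packing the exact sum with struct.pack("f", ...).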

IEEE Std 754 Floating-Point: let t := a - b, does the standard guarantee that a == b + t?

旧时模样 submitted on 2019-12-10 02:56:19
Question: Assume that t , a , b are all double (IEEE Std 754) variables, and both values of a , b are NOT NaN (but may be Inf ). After t = a - b , do I necessarily have a == b + t ?
Answer 1: Absolutely not. One obvious case is a=DBL_MAX , b=-DBL_MAX . Then t=INFINITY , so b+t is also INFINITY . What may be more surprising is that there are cases where this happens without any overflow. Basically, they're all of the form where a-b is inexact. For example, if a is DBL_EPSILON/4 and b is -1 , a-b is 1
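
Both counterexamples from the answer can be checked in Python, whose float is an IEEE 754 double on common platforms (an assumption of this sketch). The helper name survives_round_trip is hypothetical; sys.float_info.max and sys.float_info.epsilon stand in for DBL_MAX and DBL_EPSILON.

```python
import sys

def survives_round_trip(a: float, b: float) -> bool:
    """Return True if a == b + (a - b) holds in double arithmetic."""
    t = a - b
    return a == b + t

dbl_max = sys.float_info.max
eps = sys.float_info.epsilon

# Overflow case: t is +inf, so b + t is +inf as well.
print(survives_round_trip(dbl_max, -dbl_max))

# Inexact case without overflow: a - b rounds to exactly 1.0, so b + t is 0.0.
print(survives_round_trip(eps / 4, -1.0))

# A case where the subtraction is exact and the identity holds.
print(survives_round_trip(1.0, 0.5))
```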

`std::sin` is wrong in the last bit

百般思念 submitted on 2019-12-10 01:58:57
Question: I am porting a program from Matlab to C++ for efficiency. It is important for the output of both programs to be exactly the same (**). I am getting different results for this operation:
    std::sin(0.497418836818383950) = 0.477158760259608410 (C++)
    sin(0.497418836818383950) = 0.47715876025960846000 (Matlab)
    N[Sin[0.497418836818383950], 20] = 0.477158760259608433 (Mathematica)
So, as far as I know both C++ and Matlab are using IEEE754 defined double arithmetic. I think I have read somewhere
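
When two libraries disagree in the last digits, it can help to measure the gap in units in the last place rather than compare decimal printouts. The following is a sketch of that measurement, assuming finite IEEE 754 doubles; the helper names ordered_bits and ulp_distance are hypothetical, and the output is not claimed here because it depends on how the two decimal strings round to doubles.

```python
import struct

def ordered_bits(x: float) -> int:
    """Map a finite double to an integer that increases monotonically with x."""
    (raw,) = struct.unpack(">q", struct.pack(">d", x))
    # Negative doubles are stored as sign plus magnitude; flip them onto the negative axis.
    return raw if raw >= 0 else -(raw & 0x7FFFFFFFFFFFFFFF)

def ulp_distance(a: float, b: float) -> int:
    """Count how many representable doubles lie between a and b (0 if equal)."""
    return abs(ordered_bits(a) - ordered_bits(b))

cpp_result = float("0.477158760259608410")       # value printed by the C++ program
matlab_result = float("0.47715876025960846000")  # value printed by Matlab
print(cpp_result.hex(), matlab_result.hex())
print(ulp_distance(cpp_result, matlab_result))
```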

How to test if numeric conversion will change value?

岁酱吖の submitted on 2019-12-09 19:12:28
Question: I'm performing some data type conversions where I need to represent uint , long , ulong and decimal as IEEE 754 double floating point values. I want to be able to detect if the IEEE 754 data type cannot contain the value before I perform the conversion. A brute force solution would be to wrap a try-catch around a cast to double looking for OverflowException . Some of the CLR documentation implies that certain conversions just silently change the value without any exceptions.
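
The question concerns CLR types, but the underlying round-trip idea can be sketched in Python, which this page's first excerpt also uses. This is an analogue under stated assumptions, not the CLR API: the helper name converts_exactly is hypothetical, Python int stands in for uint, long and ulong, and decimal.Decimal stands in for the CLR decimal type.

```python
from decimal import Decimal

def converts_exactly(value) -> bool:
    """Return True if converting `value` (int or Decimal) to a double loses nothing."""
    try:
        as_double = float(value)
    except OverflowError:
        return False                      # too large for a double (raised for huge ints)
    if isinstance(value, Decimal):
        # Decimal(float) is an exact conversion, so this comparison is exact.
        return Decimal(as_double) == value
    # Python compares int and float by exact mathematical value.
    return as_double == value

print(converts_exactly(2**53))            # 2**53 fits in a 53-bit significand
print(converts_exactly(2**53 + 1))        # needs 54 bits
print(converts_exactly(2**64 - 1))        # the largest 64-bit unsigned value
print(converts_exactly(Decimal("0.5")))
print(converts_exactly(Decimal("0.1")))
```

The point of the sketch is that value changes are detected by comparing exact values after the conversion, since the conversion itself rounds silently in most cases.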

IEEE 754: How exactly does it work?

烂漫一生 submitted on 2019-12-09 18:38:31
Question: Why does the following code behave as it does in C?
    float x = 2147483647; // 2^31 - 1
    printf("%f\n", x); // Outputs 2147483648
Here is my thought process:
    2147483647 = 0 1001 1101 1111 1111 1111 1111 1111 111
    (0.11111111111111111111111)base2 = (1-(0.5)^23)base10
    => (1.11111111111111111111111)base2 = (1 + 1-(0.5)^23)base10 = (1.99999988)base10
Therefore, to convert the IEEE 754 notation back to decimal: 1.99999988 * 2^30 = 2147483520 So technically, the C program must have printed out 2147483520,
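
The behavior can be reproduced in Python by forcing a value through binary32 with struct, assuming round-to-nearest conversion. The helper name to_binary32 is hypothetical. 2147483647 needs 31 significant bits while binary32 holds 24, so the conversion has to pick one of the two neighboring representable values, which are 128 apart in this range.

```python
import struct

def to_binary32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

n = 2**31 - 1                       # 2147483647
lower, upper = 2**31 - 128, 2**31   # neighboring binary32 values around n
stored = to_binary32(float(n))

print(n - lower, upper - n)         # distance to each neighbor
print(int(stored))                  # the neighbor that was chosen
```

The bit pattern in the thought process above encodes the lower neighbor, which is what truncation would give; rounding to nearest selects the upper one because it is much closer.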

IEEE double such that sqrt(x*x) ≠ x

空扰寡人 submitted on 2019-12-09 02:33:00
Question: Does there exist an IEEE double x>0 such that sqrt(x*x) ≠ x , under the condition that the computation x*x does not overflow or underflow to Inf , 0 , or a denormal number? This is given that sqrt returns the nearest representable result, and so does x*x (both as mandated by the IEEE standard, "square root operation be calculated as if in infinite precision, and then rounded to one of the two nearest floating-point numbers of the specified precision that surround the infinitely precise result
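
One way to explore the question empirically is a random search, sketched below. The function name find_counterexamples and its parameters are hypothetical, and the sketch assumes that Python's math.sqrt and float multiplication are correctly rounded IEEE 754 double operations on the platform. A search that finds nothing is not a proof; it only fails to refute the claim.

```python
import math
import random

def find_counterexamples(trials: int, seed: int = 0) -> list:
    """Sample positive doubles and collect those where sqrt(x*x) differs from x."""
    rng = random.Random(seed)
    found = []
    for _ in range(trials):
        # Scaling by a power of two is exact; the exponent range keeps x*x
        # well away from overflow and from the subnormal range.
        x = rng.uniform(1.0, 2.0) * 2.0 ** rng.randint(-400, 400)
        if math.sqrt(x * x) != x:
            found.append(x)
    return found

hits = find_counterexamples(100_000)
print(len(hits), [h.hex() for h in hits[:5]])
```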

Understanding compilation result for std::isnan

痴心易碎 submitted on 2019-12-08 16:14:21
Question: I always assumed that there is practically no difference between testing for NAN via x!=x or std::isnan(x). However, gcc generates different assembly for the two versions (live on godbolt.org):
    ;x!=x:
    ucomisd %xmm0, %xmm0
    movl $1, %edx
    setne %al
    cmovp %edx, %eax
    ret
    ;std::isnan(x)
    ucomisd %xmm0, %xmm0
    setp %al
    ret
However, I'm struggling to understand both versions. My naive try to compile std::isnan(x) would be:
    ucomisd %xmm0, %xmm0
    setne %al ;return true when not equal
    ret
but I must be

Invertibility of IEEE 754 floating-point division

白昼怎懂夜的黑 submitted on 2019-12-08 14:42:11
Question: What is the invertibility of the IEEE 754 floating-point division? I mean, is it guaranteed by the standard that if double y = 1.0 / x then x == 1.0 / y , i.e. x can be restored precisely bit by bit? The cases when y is infinity or NaN are obvious exceptions.
Answer 1: Yes, there are IEEE 754 double-precision(*) values x such that x != 1.0 / (1.0 / x) . It is easy to build an example of a normal value with this property by hand: the one that's written 0x1.fffffffffffffp0 in C99's hexadecimal
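
The value named in the answer can be checked directly in Python, assuming its float is an IEEE 754 double with round-to-nearest division. float.fromhex accepts the same hexadecimal notation as C99; the variable names are illustrative.

```python
x = float.fromhex("0x1.fffffffffffffp0")   # the largest double below 2.0
y = 1.0 / x                                # rounds up to the double just above 0.5
back = 1.0 / y                             # lands two steps below 2.0, not one

print(x.hex())
print(y.hex())
print(back.hex())
print(back == x)
```

The first division sits just above the midpoint between 0.5 and the next double and so rounds up; inverting that rounded value cannot recover the original last bit.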