ieee-754

IEEE 754 total order in standard C++11

跟風遠走 submitted on 2019-12-05 16:47:02
Question: According to the Wikipedia page on IEEE floating point (IEEE 754), there is a total order on double-precision floating-point numbers (i.e. on C++11 implementations having IEEE-754 floats, like gcc 4.8 on Linux / x86-64). Of course, operator < on double often provides a total order, but NaNs are known to be exceptions (it is well-known folklore that x != x is a way of testing whether x, declared as double x;, is a NaN). The reason I am asking is that I want to have e.g. std::set<double> (actually, a set of
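The total order mentioned in the question can be illustrated with a short sketch. It is written in Python rather than C++, and `total_order_key` is a hypothetical helper name, not something from the question. It assumes the platform stores `float` as an IEEE 754 binary64 value and relies on the usual bit trick: reinterpret the 64 bits as a signed integer and flip the magnitude bits of negative values.

```python
import struct


def total_order_key(x: float) -> int:
    """Map a double to an int whose natural order follows IEEE 754 totalOrder.

    Hypothetical helper; assumes float is an IEEE 754 binary64 value.
    """
    # Reinterpret the 64 bits of the double as a signed integer.
    (bits,) = struct.unpack("<q", struct.pack("<d", x))
    # Negative doubles are stored as sign + magnitude, so a larger magnitude
    # must sort lower: flip the 63 magnitude bits for them.
    if bits < 0:
        bits ^= 0x7FFF_FFFF_FFFF_FFFF
    return bits


values = [float("nan"), 1.0, 0.0, -0.0, float("inf"), float("-inf")]
ordered = sorted(values, key=total_order_key)
```

With this key, -0.0 sorts before +0.0 and a NaN with a clear sign bit sorts after +inf, so every value has a definite position. The same key could back an ordered container in place of the raw `<` comparison.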

What numerical algorithm is simplified by defining sqrt(-0.0) as -0.0?

為{幸葍}努か submitted on 2019-12-05 14:55:18
Question: The IEEE 754 standard defines the square root of negative zero as negative zero. This choice is easy enough to rationalize, but other choices, such as defining sqrt(-0.0) as NaN, can be rationalized too and are easier to implement in hardware. If the fear was that programmers would write if (x >= 0.0) then sqrt(x) else 0.0 and be bitten by this expression evaluating to NaN when x is -0.0, then sqrt(-0.0) could have been defined as +0.0 (actually, for this particular expression, the results
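The guarded expression from the question can be tried directly. This is only a sketch in Python, with `naive_sqrt` as a hypothetical name. It assumes `math.sqrt` forwards to an IEEE 754 conforming square root, as it does in CPython on common platforms.

```python
import math


def naive_sqrt(x: float) -> float:
    # The guard from the question: -0.0 >= 0.0 is true, so -0.0 reaches sqrt.
    return math.sqrt(x) if x >= 0.0 else 0.0


r = naive_sqrt(-0.0)
# r compares equal to 0.0; copysign exposes the sign bit that == cannot see.
sign_preserved = math.copysign(1.0, r) == -1.0
```

The result is a zero that still carries the negative sign, so the guard neither raises nor yields NaN for -0.0.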

Increase a double to the next closest value?

旧城冷巷雨未停 submitted on 2019-12-05 14:20:13
This isn't a question for a real-life project; I'm only curious. We can increase an int using the increment operator (i++). You can define this operation as: "this changes the variable to the closest value greater than i", which in this case is simply +1. But I was thinking of counting the number of double values available in a specific range according to the IEEE 754-2008 system. I would be able to set up a graph that shows these counts for some ranges and see how they decrease. I guess there should be a bitwise way of increasing a double to the closest value greater than the original
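One bitwise way to do this can be sketched in Python. `next_up` and `doubles_between` are hypothetical names, and the sketch assumes `float` is an IEEE 754 binary64 value. For positive doubles the bit pattern, read as an unsigned integer, grows with the value, so adding 1 to the pattern gives the next representable double.

```python
import math
import struct


def _bits(x: float) -> int:
    # Raw 64-bit pattern of the double as an unsigned integer.
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def next_up(x: float) -> float:
    """Smallest double strictly greater than x (NaN and +inf are returned as-is)."""
    if math.isnan(x) or x == math.inf:
        return x
    if x == 0.0:
        return 5e-324  # smallest positive subnormal; also covers -0.0
    # Positive values: a larger pattern means a larger value.
    # Negative values: a smaller magnitude means a larger value.
    bits = _bits(x) + (1 if x > 0.0 else -1)
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def doubles_between(lo: float, hi: float) -> int:
    """Count the doubles in [lo, hi), assuming 0 <= lo <= hi and both finite."""
    return _bits(hi) - _bits(lo)
```

Under these assumptions `doubles_between(1.0, 2.0)` and `doubles_between(2.0, 4.0)` both give 2**52: each power-of-two interval holds the same number of doubles, so the density per unit length halves every time the exponent goes up by one. Python 3.9 and later also provide `math.nextafter` for the stepping part.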

Parse HEX float

不羁的心 submitted on 2019-12-05 11:24:24
I have an integer, for example 4060. How can I get the HEX float (\x34\xC8\x7D\x45) from it? JS doesn't have a float type, so I don't know how to do this conversion. Thank you. The above answer is no longer valid. The new Buffer(size) constructor has been deprecated (see https://nodejs.org/api/buffer.html#buffer_new_buffer_size ). New Solution: function numToFloat32Hex(v,le) { if(isNaN(v)) return false; var buf = new ArrayBuffer(4); var dv = new DataView(buf); dv.setFloat32(0, v, true); return ("0000000"+dv.getUint32(0,!(le||false)).toString(16)).slice(-8).toUpperCase(); } For example: numToFloat32Hex(4060,true) // returns
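For comparison, the same conversion can be sketched in Python with the standard `struct` module. `num_to_float32_hex` is a hypothetical name that mirrors the JavaScript function above. The `little_endian` flag chooses the byte order of the returned hex string.

```python
import struct


def num_to_float32_hex(value: float, little_endian: bool = True) -> str:
    """Return the IEEE 754 binary32 encoding of value as 8 uppercase hex digits."""
    # "<f" packs a 4-byte float least significant byte first, ">f" most
    # significant byte first.
    fmt = "<f" if little_endian else ">f"
    return struct.pack(fmt, value).hex().upper()
```

Under this sketch, `num_to_float32_hex(4060)` gives `00C07D45` and `num_to_float32_hex(4060, little_endian=False)` gives `457DC000`.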

How do you print out an IEEE754 number (without printf)?

 ̄綄美尐妖づ submitted on 2019-12-05 09:38:05
For the purposes of this question, I do not have the ability to use printf facilities (I can't tell you why, unfortunately, but let's just assume for now that I know what I'm doing). For an IEEE754 single-precision number, you have the following bits: SEEE EEEE EFFF FFFF FFFF FFFF FFFF FFFF where S is the sign, E is the exponent and F is the fraction. Printing the sign is relatively easy for all cases, as is catching all the special cases like NaN (E == 0xff, F != 0), Inf (E == 0xff, F == 0) and 0 (E == 0, F == 0, considered special just because the exponent bias isn't used in that case)
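The field layout and special cases described above can be turned into a small decimal printer that uses only integer arithmetic. This is a Python sketch, and `decode_float32` and `float32_to_str` are hypothetical names. It assumes the input already fits in single precision (`struct.pack` raises `OverflowError` for finite values that are too large), and it prints the exact decimal expansion rather than the shortest round-tripping one.

```python
import struct


def decode_float32(x: float):
    """Split the binary32 encoding of x into (sign, exponent, fraction) fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


def float32_to_str(x: float) -> str:
    sign, e, f = decode_float32(x)
    prefix = "-" if sign else ""
    if e == 0xFF:  # all-ones exponent: Inf or NaN
        return "NaN" if f else prefix + "Inf"
    if e == 0:  # zero or subnormal: no implicit leading 1
        mant, exp2 = f, -149
    else:  # normal: value = (2**23 + f) * 2**(e - 127 - 23)
        mant, exp2 = f | 0x800000, e - 150
    if exp2 >= 0:
        return prefix + str(mant << exp2)
    k = -exp2
    whole, rem = mant >> k, mant & ((1 << k) - 1)
    digits = []
    while rem:  # long multiplication by 10 peels off one decimal digit per step
        rem *= 10
        digits.append(str(rem >> k))
        rem &= (1 << k) - 1
    return prefix + str(whole) + "." + ("".join(digits) or "0")
```

The loop always terminates because the remainder is a fraction with a power-of-two denominator, so each multiplication by 10 removes at least one factor of 2 from it.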

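Uses of the Java keyword strictfp

怎甘沉沦 submitted on 2019-12-05 08:06:35
strictfp means FP-strict, that is, strict floating point. When the Java virtual machine performs floating-point arithmetic without the strictfp keyword, the Java compiler and runtime are free to evaluate floating-point expressions more or less as they see fit, so the results are often not what you expect. Once strictfp is used to declare a class, interface, or method, the Java compiler and runtime follow the IEEE-754 floating-point specification exactly within the declared scope. So if you want your floating-point arithmetic to be more precise and to give the same results on different hardware platforms, use the strictfp keyword. A class, interface, or method can be declared strictfp, but the keyword is not allowed on methods in an interface or on constructors. Once strictfp is used to declare a class, interface, or method, all floating-point arithmetic within that scope is precise and conforms to the IEEE-754 specification. For example, if a class is declared strictfp, then all methods in that class are strictfp. Source: oschina Link: https://my.oschina.net/u/102108/blog/85173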

Rounding Floating Point Numbers after addition (guard, sticky, and round bits)

≯℡__Kan透↙ submitted on 2019-12-05 05:18:51
I haven't been able to find a good explanation of this anywhere on the web yet, so I'm hoping somebody here can explain it for me. I want to add two binary numbers by hand: $1.001_2 \times 2^{2}$ and $1.010,0000,0000,0000,0000,0011_2 \times 2^{1}$. I can add them no problem, I get the following result after de-normalizing the first number, adding the two, and re-normalizing them: $1.1100,0000,0000,0000,0000,0011_2 \times 2^{2}$. The issue is that this number will not fit into single-precision IEEE 754 format without truncating or rounding one bit. My assignment asks that we put this number into single-precision IEEE 754 format
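The final rounding step can be sketched in Python on plain integers. The sketch assumes the default IEEE 754 mode, round to nearest with ties to even. The assignment text is cut off here, so it may ask for a different mode. `round_to_nearest_even` is a hypothetical helper name, and the sum is the 25-bit significand from the question (1 integer bit plus 24 fraction bits).

```python
def round_to_nearest_even(significand: int, drop_bits: int) -> int:
    """Drop the low drop_bits bits, rounding to nearest with ties to even."""
    kept = significand >> drop_bits
    remainder = significand & ((1 << drop_bits) - 1)  # the discarded bits
    half = 1 << (drop_bits - 1)  # exactly half a unit in the last kept place
    # Round up when above half, or exactly at half with an odd kept value.
    if remainder > half or (remainder == half and kept & 1):
        kept += 1
    # A carry past the top bit would call for renormalizing; not needed here.
    return kept


# 1.1100,0000,0000,0000,0000,0011 from the question, as a 25-bit integer.
sum_bits = int("1" + "1100" + "0000" * 4 + "0011", 2)
rounded = round_to_nearest_even(sum_bits, 1)  # single precision keeps 24 bits
fraction_field = format(rounded & 0x7FFFFF, "023b")  # strip the implicit 1
```

Here the single dropped bit is 1 with nothing after it, so the value sits exactly halfway between two representable numbers. The kept significand is odd, so the tie rounds up and the 23-bit fraction ends in 010. In guard/round/sticky terms, the remainder test plays the role of those extra bits.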

Negative zero literal in golang

﹥>﹥吖頭↗ submitted on 2019-12-05 03:44:10
IEEE754 supports negative zero. But this code a := -0.0 fmt.Println(a, 1/a) outputs 0 +Inf where I would have expected -0 -Inf. Other languages whose float format is based on IEEE754 let you create negative zero literals. Java: float a = -0f; System.out.printf("%f %f", a, 1/a); // outputs "-0,000000 -Infinity" C#: var a = -0d; Console.WriteLine(1/a); // outputs "-Infinity" JavaScript: var a = -0; console.log(a, 1/a); // logs "0 -Infinity" But I couldn't find the equivalent in Go. How do you write a negative zero literal in Go? There is a registered issue. And it happens to give a kind

Convert float to bigint (aka portable way to get binary exponent & mantissa)

 ̄綄美尐妖づ submitted on 2019-12-05 03:22:20
Question: In C++, I have a bigint class that can hold an integer of arbitrary size. I'd like to convert large float or double numbers to bigint. I have a working method, but it's a bit of a hack. I used the IEEE 754 number specification to get the binary sign, mantissa and exponent of the input number. Here is the code (the sign is ignored here, as it's not important): float input = 77e12; bigint result; // extract sign, exponent and mantissa, // according to IEEE 754 single precision number format unsigned int
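A portable alternative to reading the bit fields directly is the C library's frexp/ldexp pair, which Python exposes as `math.frexp` and `math.ldexp`. The sketch below is in Python, where `int` already has arbitrary size and stands in for the bigint class. `float_to_int` is a hypothetical name, and the constant 53 assumes double-precision input (the question's example uses a single-precision float).

```python
import math


def float_to_int(x: float) -> int:
    """Convert a finite double to an integer, truncating toward zero."""
    if math.isnan(x) or math.isinf(x):
        raise ValueError("NaN and infinity have no integer value")
    # abs(x) == m * 2**e with 0.5 <= m < 1 (m == 0.0 and e == 0 for zero).
    m, e = math.frexp(abs(x))
    # Scaling by 2**53 is exact and makes the mantissa an integer,
    # because a double carries at most 53 significand bits.
    mantissa = int(math.ldexp(m, 53))
    shift = e - 53
    magnitude = mantissa << shift if shift >= 0 else mantissa >> -shift
    return -magnitude if x < 0 else magnitude
```

Nothing here depends on how the float is laid out in memory. Only the significand width is assumed, and C++ reports that width as `std::numeric_limits<T>::digits`.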

How to get Python division by -0.0 and 0.0 to result in -Inf and Inf, respectively?

空扰寡人 submitted on 2019-12-05 01:56:42
I have a situation where it is reasonable to have a division by 0.0 or by -0.0 where I would expect to see +Inf and -Inf, respectively, as results. It seems that Python enjoys throwing a ZeroDivisionError: float division by zero in either case. Obviously, I figured that I could simply wrap this with a test for 0.0. However, I can't find a way to distinguish between +0.0 and -0.0. (FYI you can easily get a -0.0 by typing it or via common calculations such as -1.0 * 0.0). IEEE handles this all very nicely, but Python seems to take pains to hide the well thought out IEEE behavior. In fact, the
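One way to tell +0.0 from -0.0 is `math.copysign`, which reads the sign bit even though the two zeros compare equal. The wrapper below is a minimal sketch with a hypothetical name, `ieee_div`. It catches the exception and builds the result IEEE 754 would give.

```python
import math


def ieee_div(a: float, b: float) -> float:
    """Float division that follows IEEE 754 for zero divisors instead of raising."""
    try:
        return a / b
    except ZeroDivisionError:
        # 0/0 and NaN/0 are NaN under IEEE 754.
        if a == 0.0 or math.isnan(a):
            return math.nan
        # copysign sees the sign of a zero, which == and < cannot.
        sign = math.copysign(1.0, a) * math.copysign(1.0, b)
        return math.copysign(math.inf, sign)
```

With this sketch, `ieee_div(1.0, 0.0)` gives `inf` and `ieee_div(1.0, -0.0)` gives `-inf`. Ordinary divisions take the fast path through the `try` block unchanged.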