ieee-754

Converting IEEE 754 floating point in Haskell Word32/64 to and from Haskell Float/Double

半腔热情 submitted on 2019-11-28 20:06:39
Question: In Haskell, the base libraries and Hackage packages provide several means of converting binary IEEE-754 floating-point data to and from the lifted Float and Double types. However, the accuracy, performance, and portability of these methods are unclear. For a GHC-targeted library intended to (de)serialize a binary format across platforms, what is the best approach for handling IEEE-754 floating-point data?

Approaches: These are the methods I've encountered in existing libraries and online resources.

FFI Marshaling: This is the approach used by the data-binary-ieee754 package. Since Float …
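Whatever mechanism the Haskell library uses (FFI marshaling through a Storable buffer, or the cast primitives newer base versions expose in GHC.Float), the operation being requested is a bit-for-bit reinterpretation of the same 32 or 64 bits. A minimal sketch of that underlying operation, written in C++ here only because this page mixes languages; memcpy is the portable way to express the reinterpretation:

    #include <cstdint>
    #include <cstring>
    #include <iostream>

    // Reinterpret a 32-bit word as an IEEE-754 single and back.
    // memcpy avoids the aliasing problems of pointer casts and
    // typically compiles to a plain register move.
    float word32ToFloat(std::uint32_t w) {
        float f;
        std::memcpy(&f, &w, sizeof f);
        return f;
    }

    std::uint32_t floatToWord32(float f) {
        std::uint32_t w;
        std::memcpy(&w, &f, sizeof w);
        return w;
    }

    int main() {
        std::cout << word32ToFloat(0x3f800000u) << '\n';       // 1
        std::cout << std::hex << floatToWord32(1.0f) << '\n';  // 3f800000
    }

The data-binary-ieee754 FFI approach performs the equivalent of this poke/peek pair through a marshaled buffer.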

sign changes when going from int to float and back

大憨熊 submitted on 2019-11-28 19:06:50
Consider the following code, which is an SSCCE of my actual problem:

    #include <iostream>

    int roundtrip(int x) {
        return int(float(x));
    }

    int main() {
        int a = 2147483583;
        int b = 2147483584;
        std::cout << a << " -> " << roundtrip(a) << '\n';
        std::cout << b << " -> " << roundtrip(b) << '\n';
    }

The output on my computer (Xubuntu 12.04.3 LTS) is:

    2147483583 -> 2147483520
    2147483584 -> -2147483648

Note how the positive number b ends up negative after the roundtrip. Is this behavior well-specified? I would have expected int-to-float round-tripping to at least preserve the sign correctly…

Hm, on …
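What happens in the excerpt: float has a 24-bit significand, so near 2^31 only multiples of 128 are representable. 2147483583 rounds down to 2147483520 (it is closer to that than to 2^31), while 2147483584 sits exactly halfway and round-to-nearest-even picks 2^31, a value no 32-bit int can hold; the conversion back from float to int is then undefined behavior in C++, and the observed -2147483648 is merely what this platform produces. A small sketch of the rounding step in isolation (the final int conversion is deliberately left out, since it is the undefined part):

    #include <iostream>

    int main() {
        // Near 2^31, consecutive floats are 128 apart (24-bit significand),
        // so int -> float already rounds before any back-conversion happens.
        float fa = static_cast<float>(2147483583);  // -> 2147483520.0f (nearer)
        float fb = static_cast<float>(2147483584);  // halfway tie -> 2147483648.0f
        std::cout << (long long)fa << '\n';  // 2147483520, still fits in an int
        std::cout << (long long)fb << '\n';  // 2147483648 > INT_MAX, so int(fb)
                                             // would be undefined behavior
    }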

How does this float square root approximation work?

若如初见. submitted on 2019-11-28 18:23:22
Question: I found a rather strange but working square root approximation for floats; I really don't get it. Can someone explain to me why this code works?

    float sqrt(float f) {
        const int result = 0x1fbb4000 + (*(int*)&f >> 1);
        return *(float*)&result;
    }

I've tested it a bit and it outputs values off from std::sqrt() by about 1 to 3%. I know of Quake III's fast inverse square root, and I guess it's something similar here (without the Newton iteration), but I'd really appreciate an explanation of how it …
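In outline: read as an integer, a float's bit pattern is approximately 2^23 * (log2(f) + 127), so shifting the bits right by one halves the logarithm, which is exactly what a square root does, and the added constant restores roughly half of the removed exponent bias ((127 << 23) / 2 = 0x1fc00000; the slightly smaller 0x1fbb4000 folds in a mantissa adjustment that lowers the worst-case error). A sketch of the same trick in well-defined C++, since the question's pointer casts violate strict aliasing:

    #include <cmath>
    #include <cstdint>
    #include <cstring>
    #include <iostream>

    float approxSqrt(float f) {
        std::uint32_t bits;
        std::memcpy(&bits, &f, sizeof bits);  // bits ~ 2^23 * (log2 f + 127)
        std::uint32_t r = 0x1fbb4000u + (bits >> 1);  // halve log2, re-add bias
        float result;
        std::memcpy(&result, &r, sizeof result);
        return result;
    }

    int main() {
        for (float x : {0.5f, 2.0f, 10.0f, 12345.0f})
            std::cout << approxSqrt(x) << " vs " << std::sqrt(x) << '\n';
    }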

Half-precision floating-point in Java

倖福魔咒の submitted on 2019-11-28 16:32:34
Is there a Java library anywhere that can perform computations on IEEE 754 half-precision numbers or convert them to and from double precision? Either of these approaches would be suitable:

- Keep the numbers in half-precision format and compute using integer arithmetic and bit-twiddling (as MicroFloat does for single and double precision), or
- perform all computations in single or double precision, converting to/from half precision for transmission (in which case what I need is well-tested conversion functions).

Edit: the conversion needs to be 100% accurate; there are lots of NaNs, infinities and …
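For the second approach, the widening direction is mechanical enough to write and test directly. The sketch below converts half (1/5/10 bits, bias 15) to single (1/8/23 bits, bias 127) exactly, including zeros, subnormals, infinities and NaNs; it is in C++ for concreteness since this page mixes languages, but it ports line-for-line to Java's int/short arithmetic (and recent JDKs, 20 and later, ship Float.float16ToFloat and Float.floatToFloat16, which make a hand-rolled version unnecessary there):

    #include <cstdint>
    #include <cstring>
    #include <iostream>

    // Widen IEEE-754 half (1/5/10, bias 15) to single (1/8/23, bias 127).
    // Every finite half is exactly representable as a single, so the
    // conversion is exact.
    float halfToFloat(std::uint16_t h) {
        std::uint32_t sign = (std::uint32_t)(h >> 15) << 31;
        std::uint32_t exp  = (h >> 10) & 0x1f;
        std::uint32_t man  = h & 0x3ff;
        std::uint32_t bits;
        if (exp == 0x1f) {                        // Inf or NaN: keep the payload
            bits = sign | 0x7f800000u | (man << 13);
        } else if (exp != 0) {                    // normal number: rebias exponent
            bits = sign | ((exp + 127 - 15) << 23) | (man << 13);
        } else if (man == 0) {                    // signed zero
            bits = sign;
        } else {                                  // subnormal half: renormalize
            int e = -1;
            do { ++e; man <<= 1; } while ((man & 0x400u) == 0);
            bits = sign | ((std::uint32_t)(127 - 15 - e) << 23)
                        | ((man & 0x3ffu) << 13);
        }
        float f;
        std::memcpy(&f, &bits, sizeof f);
        return f;
    }

    int main() {
        std::cout << halfToFloat(0x3c00) << ' '    // 1
                  << halfToFloat(0xc000) << ' '    // -2
                  << halfToFloat(0x7c00) << '\n';  // inf
    }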

Java - Convert hex to IEEE-754 64-bit float - double precision

有些话、适合烂在心里 submitted on 2019-11-28 14:12:40
I'm trying to convert the following hex string: "41630D54FFF68872" to 9988776.0 (a 64-bit float). With a single-precision 32-bit float I would do:

    int intBits = Long.valueOf("hexFloat32", 16).intValue();
    float floatValue = Float.intBitsToFloat(intBits);

but this throws a

    java.lang.NumberFormatException: Infinite or NaN

when using the 64-bit hex above. How do I convert a hex string to a double-precision float encoded with IEEE-754 in 64 bits? Thank you.

Michael Madsen: You want double precision, so Float isn't the right class; that's for single precision. Instead, you want the Double class, specifically Double.longBitsToDouble …
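The recipe is the same two steps everywhere: parse the hex digits as a 64-bit integer, then reinterpret those bits as a double. A sketch in C++ (the language used for code samples on this page), with the Java equivalents noted in comments; parsing as an unsigned value also sidesteps the overflow a sign-bit-set hex string would cause:

    #include <cstdint>
    #include <cstring>
    #include <iostream>
    #include <string>

    // Java equivalent: long bits = Long.parseUnsignedLong(hex, 16);
    //                  double d  = Double.longBitsToDouble(bits);
    double hexToDouble(const std::string& hex) {
        std::uint64_t bits = std::stoull(hex, nullptr, 16);
        double d;
        std::memcpy(&d, &bits, sizeof d);
        return d;
    }

    int main() {
        std::cout.precision(17);
        std::cout << hexToDouble("41630D54FFF68872") << '\n';  // ~9988776.0
    }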

What is the rationale for exponent and mantissa sizes in IEEE floating point standards?

混江龙づ霸主 submitted on 2019-11-28 14:09:56
I have a decent understanding of how floating point works, but I want to know how the specific exponent and mantissa sizes were decided upon. Are they optimal in some way? How can optimality be measured for floating-point representations (I assume there are several ways)? I imagine these issues are addressed in the official standard, but I don't have access to it.

According to this interview with Will Kahan, they were based on the VAX F and G formats of the era. Of course, that doesn't answer the question of how those formats were chosen…

For 32-bit IEEE floats, the reasoning is that the …

how many whole numbers in IEEE 754

久未见 submitted on 2019-11-28 13:59:45
I am trying to figure out how many different whole numbers exist in IEEE 754. The number I got was 1778384895, but I couldn't find any resource to check myself against. Thanks a lot in advance.

I will assume single-precision floats. We have zero which, although it can be represented as negative zero, is still just zero as an integer, so I count it once. Numbers with exponent less than 127 are not integers.

    Exponent    Free bits    # Numbers
    127         0            1
    128         1            2
    129         2            4
    ...
    149         22           2^22

These sum up to 2^23 - 1. If the exponent is greater than 149, all the numbers are integers. So that's an additional 105 * 2^23 …
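The tally checks out: the table's last column sums to 2^0 + … + 2^22 = 2^23 - 1; biased exponents 150 through 254 contribute another 105 * 2^23; doubling for the sign and counting zero once gives 2 * (2^23 - 1 + 105 * 2^23) + 1 = 1778384895, the asker's number. A short sketch that recomputes it:

    #include <cstdint>
    #include <iostream>

    int main() {
        // Count finite single-precision values that are whole numbers.
        std::uint64_t count = 0;
        // Biased exponents 127..149: 1.m * 2^(e-127) is an integer iff the
        // low 150-e mantissa bits are zero, leaving e-127 free bits.
        for (int e = 127; e <= 149; ++e) count += 1ull << (e - 127);
        // Biased exponents 150..254: every mantissa yields an integer.
        for (int e = 150; e <= 254; ++e) count += 1ull << 23;
        count = 2 * count + 1;  // both signs, zero counted once
        std::cout << count << '\n';  // 1778384895
    }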

Computing a correctly rounded / an almost correctly rounded floating-point cubic root

寵の児 submitted on 2019-11-28 13:30:34
Suppose that correctly rounded standard library functions such as those found in CRlibm are available. Then how would one compute the correctly rounded cubic root of a double-precision input?

This question is not an "actual problem that [I] face", to quote the FAQ; it is a little bit like homework this way. But the cubic root is a frequently encountered operation, and one could imagine this question being an actual problem that someone faces. Since "the best Stack Overflow questions have a bit of source code in them", here is a bit of source code:

    y = pow(x, 1. / 3.);

The above does not compute a correctly …
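Two distinct obstacles hide in that one-liner: 1./3. is itself rounded, so even a correctly rounded pow raises x to something slightly different from one third, and pow returns NaN outright for negative x. A small sketch of both effects (the C99 cbrt shown for comparison is typically faithful to within an ulp, but it is not guaranteed correctly rounded either):

    #include <cmath>
    #include <cstdio>

    int main() {
        // 1./3. is the double nearest one third, not one third itself,
        // so pow can land on a neighboring double even when pow is exact.
        printf("%.17g\n", pow(3.0, 1.0 / 3.0));
        printf("%.17g\n", cbrt(3.0));             // faithful, not proven CR
        printf("%.17g\n", pow(-8.0, 1.0 / 3.0));  // nan: non-integer exponent
        printf("%.17g\n", cbrt(-8.0));            // -2: cbrt handles the sign
    }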

Why does frexp() not yield scientific notation?

耗尽温柔 submitted on 2019-11-28 13:27:42
Scientific notation is the common way to express a number with an explicit order of magnitude: first a nonzero digit, then a radix point, then a fractional part, and finally the exponent. In binary, there is only one possible nonzero digit.

Floating-point math involves an implicit first digit equal to one, with the mantissa bits following the radix point. So why does frexp() put the radix point to the left of the implicit bit and return a number in [0.5, 1) instead of the scientific-notation-like [1, 2)? Is there some overflow to beware of?

Effectively it subtracts one more than the bias value specified …
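A quick illustration of the convention, and of how little it takes to shift to the scientific-notation-like form the question expects (double the significand, decrement the exponent):

    #include <cmath>
    #include <cstdio>

    int main() {
        int e;
        double m = frexp(10.0, &e);                // m = 0.625, e = 4
        printf("10 = %g * 2^%d\n", m, e);          // [0.5, 1) convention
        printf("10 = %g * 2^%d\n", 2 * m, e - 1);  // 1.25 * 2^3: [1, 2) form
        printf("%g\n", ldexp(m, e));               // ldexp inverts frexp: 10
    }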

float128 and double-double arithmetic

筅森魡賤 submitted on 2019-11-28 12:59:39
Question: I've seen on Wikipedia that one way to implement quad precision is to use double-double arithmetic, even if it's not exactly the same precision in terms of bits: https://en.wikipedia.org/wiki/Quadruple-precision_floating-point_format

In this case, we use two doubles to store a value, so we perform two operations to compute each result, one for each double of the result. In this case, can we have round-off errors on each double, or is there a mechanism that avoids this?

Answer 1: "In this case, we use two …
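On the round-off question: the standard mechanism is the error-free transformation, which relies on the fact that the rounding error of a single double addition (or, with FMA, a multiplication) is itself exactly representable as a double, so it can be captured and carried as the low word of the pair. A minimal sketch of the classic building block, Knuth's TwoSum, on which double-double add/mul/div are built:

    #include <cstdio>

    // Knuth's TwoSum: on return, s + err == a + b exactly, where
    // s = fl(a + b) and err is that one addition's rounding error.
    void twoSum(double a, double b, double& s, double& err) {
        s = a + b;
        double bv = s - a;          // the part of s contributed by b
        double av = s - bv;         // the part of s contributed by a
        err = (a - av) + (b - bv);  // what the rounded sum lost
    }

    int main() {
        double s, err;
        twoSum(1.0, 1e-20, s, err);
        printf("%.17g %.17g\n", s, err);  // 1 and 1e-20: the pair (s, err)
                                          // still represents the sum exactly
    }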