ieee-754

Does double z=x-y guarantee that z+y==x for IEEE 754 floating point?

限于喜欢 · Submitted on 2019-11-29 14:46:15
I have a problem that can be reduced to this statement: given a series of doubles, each in the range [0, 1e7], modify the last element so that the sum of the numbers equals a target number exactly. The series of doubles already sums to the target within an epsilon (1e-7), but the two are not ==. The following code works, but is it guaranteed to work for all inputs that meet the requirements described in the first sentence?

    public static double[] FixIt(double[] input, double targetDouble)
    {
        var result = new double[input.Length];
        if (input.Length == 0)
            return result;
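Stripped of the C# scaffolding, the idea is just: sum everything but the last element, then overwrite the last element with the target minus that partial sum. A C sketch of that step (the function and variable names are mine, not from the question); whether the recomputed sum then compares == to the target is exactly what the question asks:

    #include <stddef.h>

    /* Sketch, not the poster's code: set the last element so that the
     * recomputed sum is intended to hit the target.  Each += and the final
     * subtraction involve one rounding, which is where the question's
     * doubt about the == guarantee comes from. */
    void fix_last(double *values, size_t n, double target) {
        if (n == 0) return;
        double partial = 0.0;
        for (size_t i = 0; i + 1 < n; i++)
            partial += values[i];
        values[n - 1] = target - partial;   /* one rounded subtraction */
    }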

Accuracy of floating point arithmetic

旧巷老猫 · Submitted on 2019-11-29 13:30:54
I'm having trouble understanding the output of this program:

    int main() {
        double x = 1.8939201459282359e-308;
        double y = 4.9406564584124654e-324;
        printf("%23.16e\n", 1.6*y);
        printf("%23.16e\n", 1.7*y);
        printf("%23.16e\n", 1.8*y);
        printf("%23.16e\n", 1.9*y);
        printf("%23.16e\n", 2.0*y);
        printf("%23.16e\n", x + 1.6*y);
        printf("%23.16e\n", x + 1.7*y);
        printf("%23.16e\n", x + 1.8*y);
        printf("%23.16e\n", x + 1.9*y);
        printf("%23.16e\n", x + 2.0*y);
    }

The output is:

    9.8813129168249309e-324
    9.8813129168249309e-324
    9.8813129168249309e-324
    9.8813129168249309e-324
    9.8813129168249309e-324
    1.8939201459282364e
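What makes the first five lines less surprising is that y here is the smallest positive subnormal double (2^-1074), so every representable double in that neighbourhood is an integer multiple of y, and the products 1.6*y through 2.0*y all round to 2*y. A small sketch of mine that makes the spacing visible (DBL_TRUE_MIN requires C11; link with -lm):

    #include <float.h>
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double y = 4.9406564584124654e-324;   /* == DBL_TRUE_MIN == 2^-1074 */
        /* Doubles near y are exact multiples of y, so 1.6*y .. 1.9*y must
         * each round to either y or 2*y; round-to-nearest picks 2*y. */
        printf("DBL_TRUE_MIN   = %.16e\n", DBL_TRUE_MIN);
        printf("nextafter(0,1) = %.16e\n", nextafter(0.0, 1.0));
        printf("nextafter(y,1) = %.16e\n", nextafter(y, 1.0));
        printf("1.6 * y        = %.16e\n", 1.6 * y);
        return 0;
    }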

How to check if float can be exactly represented as an integer

六眼飞鱼酱① · Submitted on 2019-11-29 13:10:13
I'm looking for a reasonably efficient way of determining whether a floating-point value (double) can be exactly represented by an integer data type (long, 64-bit). My initial thought was to check the exponent to see if it was 0 (or, more precisely, 127). But that won't work, because 2.0 would be e=1, m=1... So basically, I am stuck. I have a feeling that I can do this with bit masks, but I'm just not getting my head around how to do that at this point. So how can I check whether a double is exactly representable as a long? Thanks.

Mysticial: Here's one method that could work in most cases. I'm
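One way to phrase such a test without bit masks, assuming IEEE 754 binary64 and a 64-bit long (the function name and the hex-float bounds are mine):

    #include <math.h>
    #include <stdbool.h>

    /* Sketch: true when d holds an integral value inside the range of a
     * 64-bit signed integer, so a subsequent (long) conversion is exact. */
    bool fits_in_long(double d) {
        if (!isfinite(d))  return false;   /* NaN and infinities never fit */
        if (d != trunc(d)) return false;   /* must have no fractional part */
        /* Range check done in double: 0x1p63 == 2^63 is exactly
         * representable, and [-2^63, 2^63) is exactly the long range. */
        return d >= -0x1p63 && d < 0x1p63;
    }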

double-double implementation resilient to FPU rounding mode

落爺英雄遲暮 · Submitted on 2019-11-29 12:52:20
Context: double-double arithmetic. “Double-double” is a representation of numbers as the sum of two double-precision numbers without overlap in the significands. This representation takes advantage of existing double-precision hardware implementations for “near quadruple-precision” computations. One typical low-level C function in a double-double implementation may take two double-precision numbers a and b with |a| ≥ |b| and compute the double-double number (s, e) that represents their sum:

    s = a + b;
    e = b - (s - a);

(Adapted from this article.) These implementations typically assume round-to
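The two-line snippet above is the Fast2Sum (Dekker) form, which needs |a| ≥ |b|. For comparison, here is a sketch of Knuth's branch-free TwoSum, which drops that precondition but whose usual error analysis still assumes round-to-nearest, the very assumption the question wants to relax:

    /* Knuth's TwoSum: under round-to-nearest, s + e equals a + b exactly,
     * with no |a| >= |b| precondition.  How this degrades under other FPU
     * rounding modes is what the question is about. */
    void two_sum(double a, double b, double *s, double *e) {
        double sum = a + b;
        double bv  = sum - a;
        double av  = sum - bv;
        *e = (a - av) + (b - bv);
        *s = sum;
    }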

How to convert an IEEE 754 single-precision binary floating-point to decimal?

 ̄綄美尐妖づ · Submitted on 2019-11-29 12:43:57
I am working on a program that needs to convert a 32-bit number into a decimal number. The number I get as input is a 32-bit number represented as floating point: the first bit is the sign, the next 8 bits are the exponent, and the other 23 bits are the mantissa. I am writing the program in C. On input I get that number as a char[] array, and after that I build a new int[] array where I store the sign, the exponent and the mantissa. But I have a problem with the mantissa when I try to store it in some datatype, because I need to use the mantissa as a number, not as an array:
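Once the sign, exponent and fraction bits are packed into a single 32-bit integer, the value can be computed with ldexp instead of manipulating digit arrays. A sketch of mine, assuming an IEEE 754 binary32 layout (the function name is made up):

    #include <math.h>
    #include <stdint.h>

    /* Decode a binary32 bit pattern "by hand": handles normals,
     * subnormals, infinities and NaN. */
    double decode_binary32(uint32_t bits) {
        int sign      = (bits >> 31) & 1;
        int exp       = (bits >> 23) & 0xFF;
        uint32_t frac = bits & 0x7FFFFF;
        double value;
        if (exp == 0xFF)
            value = frac ? NAN : INFINITY;                       /* special */
        else if (exp == 0)
            value = ldexp((double)frac, -149);                   /* 0.f * 2^-126 */
        else
            value = ldexp((double)(frac | 0x800000), exp - 150); /* 1.f * 2^(e-127) */
        return sign ? -value : value;
    }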

Number of consecutive zeros in the decimal representation of a double

余生颓废 · Submitted on 2019-11-29 11:54:18
What is the maximum number of consecutive non-leading, non-trailing zeros (resp. nines) in the exact decimal representation of an IEEE 754 double-precision number?

Context: consider the problem of converting a double to decimal, rounding up (resp. down), when the only primitive you are able to use is an existing function that converts to the nearest (correctly rounded to any desired number of digits). You could get a few additional digits and remove them yourself. For instance, to round 1.875 down to one digit after the dot, you could convert it to the nearest decimal representation with two or
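As a hedged illustration of that strategy for non-negative values (the choice of three guard digits is an arbitrary placeholder; how many are actually enough is what the question about runs of zeros and nines settles):

    #include <stdio.h>
    #include <string.h>

    /* Sketch for x >= 0 and digits >= 1: format to nearest with a few
     * guard digits, then truncate, which rounds toward zero.  A run of 0s
     * or 9s longer than the guard digits right after the cut can defeat
     * this, which is why the maximum run length matters. */
    void round_down_decimal(double x, int digits, char *out, size_t outsz) {
        char buf[512];
        snprintf(buf, sizeof buf, "%.*f", digits + 3, x);  /* nearest, guard digits */
        char *dot = strchr(buf, '.');
        if (dot != NULL)
            dot[digits + 1] = '\0';                        /* chop the guard digits */
        snprintf(out, outsz, "%s", buf);
    }

    /* Example: round_down_decimal(1.875, 1, out, sizeof out) yields "1.8". */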

Why don't operations on double-precision values give expected results?

时间秒杀一切 · Submitted on 2019-11-29 11:18:32
    System.out.println(2.14656);     // prints 2.14656
    System.out.println(2.14656 % 2); // prints 0.14656000000000002

WTF?

They do give the expected results; your expectations are incorrect. When you type the double-precision literal 2.14656, what you actually get is the closest double-precision value, which is:

    2.14656000000000002359001882723532617092132568359375

println happens to round this when it prints it out (to 17 significant digits), so you see the nice value that you expect. After the modulus operation (which is exact), the value is:

    0.14656000000000002359001882723532617092132568359375

Again, this is rounded
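The same effect is easy to reproduce from C by printing enough digits to expose the stored value (a small demo of mine, not from the answer; it assumes a C library that prints correctly rounded long decimal output, as glibc does, and needs -lm for fmod):

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double d = 2.14656;
        /* %.55f shows the exact binary64 value nearest to the literal
         * (plus trailing zeros); %.17g is close to what shortest
         * round-trip printing displays. */
        printf("%.55f\n", d);
        printf("%.55f\n", fmod(d, 2.0));   /* the remainder is computed exactly */
        printf("%.17g\n", fmod(d, 2.0));   /* 0.14656000000000002 */
        return 0;
    }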

Why does casting Double.NaN to int not throw an exception in Java?

可紊 · Submitted on 2019-11-29 09:11:26
So I know that IEEE 754 specifies some special floating-point values for values that are not real numbers. In Java, casting those values to a primitive int does not throw an exception like I would have expected. Instead we have the following:

    int n;
    n = (int)Double.NaN;               // n == 0
    n = (int)Double.POSITIVE_INFINITY; // n == Integer.MAX_VALUE
    n = (int)Double.NEGATIVE_INFINITY; // n == Integer.MIN_VALUE

What is the rationale for not throwing exceptions in these cases? Is this an IEEE standard, or was it merely a choice by the designers of Java? Are there bad consequences that I am unaware of if
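For contrast, C leaves this conversion undefined when the value is NaN or out of range, so a C programmer has to spell out the choices Java made (JLS 5.1.3: NaN becomes 0, out-of-range values saturate). A sketch of such a helper, mine and purely for illustration:

    #include <limits.h>
    #include <math.h>

    /* Mimics Java's (int) narrowing of a double: NaN maps to 0, values
     * beyond int range saturate.  In C the unchecked cast would be
     * undefined behavior for exactly those inputs, hence the tests. */
    int java_style_double_to_int(double d) {
        if (isnan(d))                 return 0;
        if (d >= (double)INT_MAX + 1) return INT_MAX;   /* covers +infinity */
        if (d < (double)INT_MIN)      return INT_MIN;   /* covers -infinity */
        return (int)d;                                  /* truncates toward zero */
    }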

How can I convert 4 bytes storing an IEEE 754 floating point number to a float value in C?

你离开我真会死。 · Submitted on 2019-11-29 08:41:46
My program reads an IEEE 754 floating-point number from a file into 4 bytes. I need to portably convert those bytes to my C compiler's float type. In other words, I need a function with the prototype float IEEE_754_to_float(uint8_t raw_value[4]) for my C program.

If your implementation can guarantee correct endianness:

    float raw2ieee(uint8_t *raw)
    {
        // either
        union {
            uint8_t bytes[4];
            float fp;
        } un;
        memcpy(un.bytes, raw, 4);
        return un.fp;
        // or, as seen in the fast inverse square root:
        return *(float *)raw;
    }

If the endianness is the same, then like so:

    float f;
    memcpy(&f, raw_value, sizeof f);
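If the byte order in the file might not match the host's, one option (a sketch of mine, not from the answers, assuming the file stores the value little-endian and the host float is IEEE 754 binary32) is to assemble the bit pattern explicitly and then memcpy once:

    #include <stdint.h>
    #include <string.h>

    float ieee754_bytes_to_float(const uint8_t raw[4]) {
        /* Build the bit pattern independently of host byte order;
         * swap the indices if the file is big-endian. */
        uint32_t bits = (uint32_t)raw[0]
                      | (uint32_t)raw[1] << 8
                      | (uint32_t)raw[2] << 16
                      | (uint32_t)raw[3] << 24;
        float f;
        memcpy(&f, &bits, sizeof f);   /* type-pun without aliasing issues */
        return f;
    }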

How is fma() implemented

痴心易碎 · Submitted on 2019-11-29 07:32:34
According to the documentation, there is an fma() function in math.h. That is very nice, and I know how FMA works and what to use it for. However, I am not so certain how this is implemented in practice. I'm mostly interested in the x86 and x86_64 architectures. Is there a floating-point (non-vector) instruction for FMA, perhaps as defined by IEEE 754-2008? Is the FMA3 or FMA4 instruction set used? Is there an intrinsic to make sure that a real FMA is used when the precision is relied upon?

The actual implementation varies from platform to platform, but speaking very broadly: If you tell your
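One quick way to tell whether you really got a fused operation is to construct a case where the unfused result differs. A small C demo of mine, not from the answer (note that the compiler may itself contract a*b + c into an FMA depending on the FP_CONTRACT setting, and the standard macro FP_FAST_FMA only hints that fma() is cheap, not how it is implemented):

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double a = 1.0 + 0x1p-27, b = 1.0 - 0x1p-27, c = -1.0;
        /* The exact product a*b is 1 - 2^-54.  A fused multiply-add keeps
         * that residual, so fma(a,b,c) is -2^-54; a separate multiply
         * rounds the product to 1.0 first, so a*b + c is 0 when no
         * contraction happens. */
        printf("fma(a, b, c) = %a\n", fma(a, b, c));
        printf("a*b + c      = %a\n", a * b + c);
    #ifdef FP_FAST_FMA
        puts("FP_FAST_FMA defined: fma() should be about as fast as a*b + c");
    #endif
        return 0;
    }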