ieee-754

next higher/lower IEEE double precision number

对着背影说爱祢 posted on 2019-11-26 18:24:09
Question: I am doing high-precision scientific computations. In looking for the best representation of various effects, I keep coming up with reasons to want to get the next higher (or lower) double precision number available. Essentially, what I want to do is add one to the least significant bit in the internal representation of a double. The difficulty is that the IEEE format is not totally uniform. If one were to use low-level code and actually add one to the least significant bit, the resulting
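For readers who only need the neighbouring value, here is a minimal sketch using the standard library's `std::nextafter`, which steps to the adjacent representable double and deals with exponent boundaries, subnormals and zero itself. The helper names `next_up` and `next_down` are made up for this illustration, and this is not necessarily what the answers to the question proposed.

```cpp
#include <cmath>
#include <limits>

// Next representable double above x (stepping toward +infinity).
double next_up(double x) {
    return std::nextafter(x, std::numeric_limits<double>::infinity());
}

// Next representable double below x (stepping toward -infinity).
double next_down(double x) {
    return std::nextafter(x, -std::numeric_limits<double>::infinity());
}
```

For example, `next_up(1.0)` is 1 + 2^-52, while `next_down(1.0)` is 1 - 2^-53, because the spacing halves just below a power of two.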

Why converting from float to double changes the value?

自闭症网瘾萝莉.ら posted on 2019-11-26 17:55:39
Question: I've been trying to find out the reason, but I couldn't. Can anybody help me? Look at the following example.
    float f = 125.32f;
    System.out.println("value of f = " + f);
    double d = (double) 125.32f;
    System.out.println("value of d = " + d);
This is the output:
    value of f = 125.32
    value of d = 125.31999969482422
Answer (Eric Postpischil): The value of a float does not change when converted to a double. There is a difference in the displayed numerals because more digits are required to distinguish a double value from its neighbors, which is required by the Java documentation. That is the documentation for
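The same effect can be reproduced outside Java. Below is a small C++ sketch (the helpers `format_sig` and `widening_is_exact` are hypothetical names for this illustration) that widens a float to double and prints the unchanged value with different numbers of significant digits.

```cpp
#include <cstdio>
#include <string>

// Format a value with the requested number of significant digits.
std::string format_sig(double value, int digits) {
    char buf[64];
    std::snprintf(buf, sizeof buf, "%.*g", digits, value);
    return buf;
}

// Widening float -> double is exact: narrowing back gives the same float.
bool widening_is_exact(float f) {
    double d = f;
    return static_cast<float>(d) == f;
}
```

With 8 significant digits the widened value of `125.32f` prints as `125.32`; with 17 it prints as `125.31999969482422`. The stored number is the same in both cases, only the number of printed digits differs.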

How to check if C++ compiler uses IEEE 754 floating point standard

我与影子孤独终老i posted on 2019-11-26 17:43:12
Question: I would like to ask a question that follows this one, which is pretty well answered by the define that checks whether the compiler uses the standard. However, this works for C only. Is there a way to do the same in C++? I do not wish to convert floating point types to text or use some pretty complex conversion functions. I just need the compiler check. If you know a list of such compatible compilers, please post the link. I could not find it.
Answer: Actually you have an easier way to achieve this in C++. From the C++ standard 18.2.1.1 the class numeric_limits exists within std. In order to access said static member
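The excerpt breaks off before naming the member. One real member of `std::numeric_limits` that reports IEC 559 (IEEE 754) conformance is `is_iec559`; treating it as the member the answer goes on to describe is an assumption. A minimal sketch (the function name `doubles_are_ieee754` is hypothetical):

```cpp
#include <limits>

// Compile-time check: the build fails if float or double is not IEC 559 (IEEE 754).
static_assert(std::numeric_limits<float>::is_iec559, "float is not IEEE 754");
static_assert(std::numeric_limits<double>::is_iec559, "double is not IEEE 754");

// The same flag as a value, e.g. for logging or a run-time branch.
constexpr bool doubles_are_ieee754() {
    return std::numeric_limits<double>::is_iec559;
}
```

Note that the flag only reports what the implementation claims about the type; aggressive optimization flags can still change floating-point behaviour without changing it.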

Fused multiply add and default rounding modes

蹲街弑〆低调 posted on 2019-11-26 17:42:51
Question: With GCC 5.3 the following code compiled with -O3 -mfma
    float mul_add(float a, float b, float c) { return a*b + c; }
produces the following assembly
    vfmadd132ss %xmm1, %xmm2, %xmm0
    ret
I noticed GCC doing this with -O3 already in GCC 4.8. Clang 3.7 with -O3 -mfma produces
    vmulss %xmm1, %xmm0, %xmm0
    vaddss %xmm2, %xmm0, %xmm0
    retq
but Clang 3.7 with -Ofast -mfma produces the same code as GCC with -O3. I am surprised that GCC does this with -O3, because this answer says: The compiler is
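To see why contraction into a fused multiply-add matters, here is a small sketch comparing one rounding against two. It uses double rather than the float of the question, and the function names `fused` and `unfused` are hypothetical. The volatile store is just one way to force the intermediate product to be rounded; it is not taken from the question.

```cpp
#include <cmath>

// One rounding: a*b + c is computed exactly and rounded once.
double fused(double a, double b, double c) {
    return std::fma(a, b, c);
}

// Two roundings: the volatile store forces the product to be rounded to
// double before the addition, so it cannot be contracted into an FMA.
double unfused(double a, double b, double c) {
    volatile double product = a * b;
    return product + c;
}
```

With a = b = 1 + 2^-27 and c = -(1 + 2^-26), the exact product is 1 + 2^-26 + 2^-54. `fused` returns 2^-54, while `unfused` returns 0 because the 2^-54 term is lost when the product is rounded. Which of the two a plain `a*b + c` compiles to depends on the compiler and its floating-point contraction settings.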

Maximum number of decimal digits that can affect a double

给你一囗甜甜゛ posted on 2019-11-26 16:47:42
Question: Consider decimal representations of the form d1.d2d3d4d5...dnExxx where xxx is an arbitrary exponent and both d1 and dn are nonzero. Is the maximum n known such that there exists a decimal representation d1.d2d3d4d5...dnExxx such that the interval (d1.d2d3d4d5...dnExxx, d1.d2d3d4d5...((dn)+1)Exxx) contains an IEEE 754 double? n should be at least 17. The question is how much above 17. This number n has something to do with the number of digits that it is enough to consider in a decimal-to

Uses for negative zero floating point value?

牧云@^-^@ posted on 2019-11-26 16:33:28
Question: Consider the following C++ code:
    double someZero = 0;
    std::cout << 0 - someZero << '\n'; // prints 0
    std::cout << -someZero << std::endl; // prints -0
The question arises: what is negative zero good for, and should it be defensively avoided (i.e. use subtraction instead of smacking a minus onto a variable)?
Answer 1: From Wikipedia: It is claimed that the inclusion of signed zero in IEEE 754 makes it much easier to achieve numerical accuracy in some critical problems[1], in particular when
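As a small illustration of how negative zero can be detected and where its sign is observable, here is a sketch using `std::signbit` and `std::atan2`. Both helper names are hypothetical, and the choice of `atan2` as the example is an assumption, not something the excerpt states.

```cpp
#include <cmath>

// True only for -0.0: it compares equal to zero but has its sign bit set.
bool is_negative_zero(double x) {
    return x == 0.0 && std::signbit(x);
}

// The sign of a zero imaginary part selects the side of atan2's branch cut.
double angle_on_negative_real_axis(double imaginary_zero) {
    return std::atan2(imaginary_zero, -1.0);
}
```

`-someZero` is negative zero and `0 - someZero` is positive zero, yet the two compare equal with `==`. Passing them to `angle_on_negative_real_axis` gives approximately -pi and +pi respectively.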

Floating point comparison revisited

天涯浪子 posted on 2019-11-26 15:18:51
Question: This topic has come up many times on Stack Overflow, but I believe this is a new take. Yes, I have read Bruce Dawson's articles and What Every Computer Scientist Should Know About Floating-Point Arithmetic and this nice answer. As I understand it, on a typical system there are four basic problems when comparing floating-point numbers for equality:
    1. Floating point calculations are not exact
    2. Whether a-b is "small" depends on the scale of a and b
    3. Whether a-b is "small" depends on the type of a and
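One common way to address the first two problems in the list is to combine an absolute tolerance (for values near zero) with a relative one (scaled by the operands). The sketch below is only that generic approach, not the scheme proposed in the question. The name `nearly_equal` and both default tolerances are arbitrary assumptions that a real application would have to choose for itself.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical helper: equal if the difference is within abs_tol, or within
// rel_tol of the larger magnitude. Default tolerances are placeholders.
bool nearly_equal(double a, double b,
                  double rel_tol = 1e-9, double abs_tol = 1e-12) {
    if (a == b) return true;                                   // exact match, incl. equal infinities
    if (!std::isfinite(a) || !std::isfinite(b)) return false;  // NaN, or infinity vs. anything else
    double diff = std::fabs(a - b);
    double scale = std::max(std::fabs(a), std::fabs(b));
    return diff <= std::max(abs_tol, rel_tol * scale);
}
```

For instance, `nearly_equal(0.1 + 0.2, 0.3)` is true although the two doubles are not bit-for-bit equal, while `nearly_equal(1.0, 1.0001)` is false with these tolerances.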

Converting Int to Float or Float to Int using Bitwise operations (software floating point)

≡放荡痞女 posted on 2019-11-26 14:12:12
Question: I was wondering if you could help explain the process of converting an integer to float, or a float to an integer. For my class, we are to do this using only bitwise operators, but I think a firm understanding of the casting from type to type will help me more in this stage. From what I know so far, for int to float, you will have to convert the integer into binary, normalize the value of the integer by finding the significand, exponent, and fraction, and then output the value in float from
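The int-to-float direction described above can be sketched as follows. This is one possible implementation under stated assumptions: a 32-bit two's complement `int32_t`, IEEE 754 binary32 output, and round-to-nearest-even. It uses integer shifts, masks, comparisons and increments rather than strictly bitwise operators only, and the function names are hypothetical.

```cpp
#include <cstdint>
#include <cstring>

// Build the binary32 bit pattern for a 32-bit integer without using any
// floating-point operation (round to nearest, ties to even).
std::uint32_t int_to_float_bits(std::int32_t value) {
    if (value == 0) return 0;
    std::uint32_t sign = value < 0 ? 0x80000000u : 0u;
    // Magnitude in unsigned arithmetic, which also handles INT32_MIN.
    std::uint32_t mag = value < 0 ? 0u - static_cast<std::uint32_t>(value)
                                  : static_cast<std::uint32_t>(value);
    // Index of the highest set bit is the unbiased exponent.
    int exponent = 31;
    while (!(mag & (1u << exponent))) --exponent;

    std::uint32_t fraction;
    if (exponent <= 23) {
        // Fits exactly: move the leading 1 up to bit 23.
        fraction = mag << (23 - exponent);
    } else {
        int shift = exponent - 23;  // number of low bits that must be dropped
        std::uint32_t dropped = mag & ((1u << shift) - 1u);
        std::uint32_t half = 1u << (shift - 1);
        fraction = mag >> shift;
        // Round up above half, or at exactly half when the kept value is odd.
        if (dropped > half || (dropped == half && (fraction & 1u))) ++fraction;
        // Rounding can carry out to 2^24; renormalize if it does.
        if (fraction & 0x01000000u) { fraction >>= 1; ++exponent; }
    }
    std::uint32_t biased = static_cast<std::uint32_t>(exponent + 127);
    return sign | (biased << 23) | (fraction & 0x007FFFFFu);
}

// Reinterpret a bit pattern as a float (memcpy avoids aliasing problems).
float bits_to_float(std::uint32_t bits) {
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}
```

On a platform with IEEE 754 floats, `bits_to_float(int_to_float_bits(v))` should agree with `static_cast<float>(v)`, including for values above 2^24 where rounding kicks in.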

How to subtract IEEE 754 numbers?

痞子三分冷 posted on 2019-11-26 14:05:38
Question: How do I subtract IEEE 754 numbers? For example: 0.546875 - 32.875...
    -> 0.546875 is 0 01111110 00011000000000000000000 in IEEE-754
    -> -32.875 is 1 10000100 00000111000000000000000 in IEEE-754
So how do I do the subtraction? I know I have to make both exponents equal, but what do I do after that? Take the 2's complement of the -32.875 mantissa and add it to the 0.546875 mantissa?
Answer 1: Really not any different than you do it with pencil and paper. Okay, a little different:
    123400 - 5432 = 1.234*10^5 - 5.432*10^3
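The pencil-and-paper idea can be sketched in code: restore the hidden leading 1, align both significands to the smaller exponent, then subtract them as signed integers. This is an illustration only, with hypothetical function names. It assumes normal (not zero, subnormal, infinite or NaN) binary32 inputs whose exponents differ by less than 29, and it returns the exact difference as a double instead of rounding back to binary32.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Raw binary32 bit pattern of a float.
std::uint32_t float_bits(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}

// x - y computed by exponent alignment and integer subtraction.
double subtract_by_alignment(float x, float y) {
    std::uint32_t bx = float_bits(x), by = float_bits(y);
    int ex = static_cast<int>((bx >> 23) & 0xFF);  // biased exponents
    int ey = static_cast<int>((by >> 23) & 0xFF);
    // 24-bit significands with the implicit leading 1 restored.
    std::int64_t sx = (bx & 0x007FFFFFu) | 0x00800000u;
    std::int64_t sy = (by & 0x007FFFFFu) | 0x00800000u;
    if (bx >> 31) sx = -sx;  // apply the sign bits
    if (by >> 31) sy = -sy;
    // Align to the smaller exponent by scaling the other significand up.
    int e = ex < ey ? ex : ey;
    sx *= std::int64_t{1} << (ex - e);
    sy *= std::int64_t{1} << (ey - e);
    std::int64_t diff = sx - sy;
    // The integer diff stands for diff * 2^(e - 127 - 23).
    return std::ldexp(static_cast<double>(diff), e - 150);
}
```

`float_bits(0.546875f)` is `0x3F0C0000` and `float_bits(-32.875f)` is `0xC2038000`. With those inputs, `subtract_by_alignment(0.546875f, 32.875f)` gives -32.328125. A real implementation would then normalize and round the result back to 24 significant bits.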

Are all integer values perfectly represented as doubles? [duplicate]

旧街凉风 posted on 2019-11-26 13:48:59
Question: (Marked as a duplicate of: Representing integers in doubles, 5 answers.) My question is whether all integer values are guaranteed to have a perfect double representation. Consider the following code sample that prints "Same":
    // Example program
    #include <iostream>
    #include <string>
    int main() {
        int a = 3;
        int b = 4;
        double d_a(a);
        double d_b(b);
        double int_sum = a + b;
        double d_sum = d_a + d_b;
        if (double(int_sum) == d_sum) {
            std::cout << "Same" << std::endl;
        }
    }
Is this
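A quick way to probe the limit is a round-trip check. An IEEE 754 double has a 53-bit significand, so every integer of magnitude up to 2^53 is exact, which covers any 32-bit `int`; beyond that, gaps appear. The sketch below assumes an IEEE 754 double, and `survives_double` is a hypothetical helper name.

```cpp
#include <cstdint>

// True if converting n to double and back yields the same integer.
bool survives_double(std::int64_t n) {
    double d = static_cast<double>(n);
    // Values near INT64_MAX round up to 2^63, which does not fit in int64_t,
    // so reject that case before converting back.
    if (d >= 9223372036854775808.0) return false;
    return static_cast<std::int64_t>(d) == n;
}
```

`survives_double(9007199254740992)` (2^53) is true, while `survives_double(9007199254740993)` (2^53 + 1) is false because that value rounds to 2^53.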