ieee-754

Floating-point: “The leading 1 is 'implicit' in the significand.” — …huh?

Submitted by 戏子无情 on 2019-12-18 11:09:52
Question: I'm learning about the representation of IEEE 754 floating-point numbers, and my textbook says: "To pack even more bits into the significand, IEEE 754 makes the leading 1-bit of normalized binary numbers implicit. Hence, the number is actually 24 bits long in single precision (implied 1 and 23-bit fraction), and 53 bits long in double precision (1 + 52)." I don't get what "implicit" means here. What's the difference between an explicit bit and an implicit bit? Don't all numbers have the bit, …

Accuracy of floating point arithmetic

Submitted by 我的未来我决定 on 2019-12-18 07:48:05
Question: I'm having trouble understanding the output of this program:

    int main() {
        double x = 1.8939201459282359e-308;
        double y = 4.9406564584124654e-324;
        printf("%23.16e\n", 1.6*y);
        printf("%23.16e\n", 1.7*y);
        printf("%23.16e\n", 1.8*y);
        printf("%23.16e\n", 1.9*y);
        printf("%23.16e\n", 2.0*y);
        printf("%23.16e\n", x + 1.6*y);
        printf("%23.16e\n", x + 1.7*y);
        printf("%23.16e\n", x + 1.8*y);
        printf("%23.16e\n", x + 1.9*y);
        printf("%23.16e\n", x + 2.0*y);
    }

The output is 9.8813129168249309e-324 9…

double-double implementation resilient to FPU rounding mode

Submitted by 萝らか妹 on 2019-12-18 07:16:12
Question: Context: double-double arithmetic. "Double-double" is a representation of numbers as the sum of two double-precision numbers without overlap in the significands. This representation takes advantage of existing double-precision hardware implementations for "near quadruple-precision" computations. One typical low-level C function in a double-double implementation may take two double-precision numbers a and b with |a| ≥ |b| and compute the double-double number (s, e) that represents their sum: s…

Why don't operations on double-precision values give expected results?

Submitted by 旧巷老猫 on 2019-12-18 06:49:07
Question:

    System.out.println(2.14656);    // 2.14656
    System.out.println(2.14656%2);  // 0.14656000000000002

WTF?

Answer 1: They do give the expected results; your expectations are incorrect. When you type the double-precision literal 2.14656, what you actually get is the closest double-precision value, which is: 2.14656000000000002359001882723532617092132568359375. println happens to round this when it prints it out (to 17 significant digits), so you see the nice value that you expect. After the modulus operation…

Standard for the sine of very large numbers

Submitted by 醉酒当歌 on 2019-12-18 05:56:19
Question: I am writing an (almost) IEEE 854 compliant floating-point implementation in TeX (which only has support for 32-bit integers). This standard only specifies the result of +, -, *, /, comparison, remainder, and sqrt: for those operations, the result should be identical to rounding the exact result to a representable number (according to the rounding mode). I seem to recall that IEEE specifies that transcendental functions (sin, exp, ...) should yield faithful results (in the default round…

Does the C++ standard specify anything on the representation of floating point numbers?

Submitted by 纵饮孤独 on 2019-12-17 23:15:16
Question: For types T for which std::is_floating_point<T>::value is true, does the C++ standard specify anything about the way T should be implemented? For example, does T even have to follow a sign/mantissa/exponent representation, or can it be completely arbitrary?

Answer 1: From N3337, [basic.fundamental]/8: "There are three floating point types: float, double, and long double. The type double provides at least as much precision as float, and the type long double provides at least as much precision as…

Converting IEEE 754 floating point in Haskell Word32/64 to and from Haskell Float/Double

Submitted by 无人久伴 on 2019-12-17 22:44:17
Question: In Haskell, the base libraries and Hackage packages provide several means of converting binary IEEE-754 floating-point data to and from the lifted Float and Double types. However, the accuracy, performance, and portability of these methods are unclear. For a GHC-targeted library intended to (de)serialize a binary format across platforms, what is the best approach for handling IEEE-754 floating-point data?

Approaches: These are the methods I've encountered in existing libraries and…

Half-precision floating-point in Java

Submitted by 帅比萌擦擦* on 2019-12-17 22:30:05
Question: Is there a Java library anywhere that can perform computations on IEEE 754 half-precision numbers or convert them to and from double precision? Either of these approaches would be suitable:

- Keep the numbers in half-precision format and compute using integer arithmetic and bit-twiddling (as MicroFloat does for single and double precision)
- Perform all computations in single or double precision, converting to/from half precision for transmission (in which case what I need is well-tested…

Why does table-based sin approximation literature always use this formula when another formula seems to make more sense?

Submitted by 我与影子孤独终老i on 2019-12-17 20:44:34
Question: The literature on computing the elementary function sin with tables refers to the formula:

    sin(x) = sin(Cn) * cos(h) + cos(Cn) * sin(h)

where x = Cn + h, Cn is a constant for which sin(Cn) and cos(Cn) have been pre-computed and are available in a table, and, if following Gal's method, Cn has been chosen so that both sin(Cn) and cos(Cn) are closely approximated by floating-point numbers. The quantity h is close to 0.0. An example of a reference to this formula is this article (page 7). I don't…

Is 3*x+x always exact?

Submitted by 依然范特西╮ on 2019-12-17 20:09:26
Question: Assuming strict IEEE 754 (no excess precision) and round-to-nearest-even mode, is 3*x+x always == 4*x (and thus exact in the absence of overflow), and why? I was not able to exhibit a counter-example, so I went into a lengthy discussion of every possible trailing bit pattern abc and rounding case, but I feel like I could have missed a case, and also missed a simpler demonstration... I also have an intuition that this could be extended to (2^n-1)*x + x == 2^n*x, and testing every combination of…