ieee-754

IEEE-754 Double (64-bit floating point) vs. Long (64-bit Integer) Revisited

断了今生、忘了曾经 submitted on 2019-12-06 18:51:40
Question: I'm revisiting a question (How to test if numeric conversion will change value?) that, as far as I was concerned, was fully solved. The problem was to detect when a particular numeric value would overflow JavaScript's IEEE-754 Number type. The previous question was using C# and the marked answer worked perfectly. Now I'm doing the exact same task, but this time in Java, and it doesn't work. AFAIK, Java uses IEEE-754 for its double data type, so I should be able to cast it back and forth to force
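A minimal sketch (not from the thread) of the round-trip check the question alludes to, assuming the goal is to detect long values that are not exactly representable in binary64 (53-bit significand):

public class ExactDoubleCheck {

    // A long survives conversion to double iff casting to double and back
    // yields the original value.
    static boolean fitsInDouble(long value) {
        return (long) (double) value == value;
    }

    public static void main(String[] args) {
        System.out.println(fitsInDouble(1L << 53));         // true:  2^53 is representable
        System.out.println(fitsInDouble((1L << 53) + 1));   // false: rounds to 2^53

        // Caveat near the top of the range: Java's narrowing conversion from
        // double to long saturates at Long.MAX_VALUE (JLS 5.1.3), so the naive
        // round trip reports true for Long.MAX_VALUE even though
        // (double) Long.MAX_VALUE is 2^63 and therefore not exact.
        System.out.println(fitsInDouble(Long.MAX_VALUE));   // true, misleadingly
    }
}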

Are there any modern platforms with non-IEEE C/C++ float formats?

不问归期 submitted on 2019-12-06 18:08:09
Question: I am writing a video game, Humm and Strumm, which requires a network component in its game engine. I can deal with differences in endianness easily, but I have hit a wall in attempting to deal with possible float memory formats. I know that modern computers all have a standard integer format, but I have heard that they may not all use the IEEE standard for floating-point numbers. Is this true? While certainly I could just output it as a character string into each packet, I would still have
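One common workaround for a portable wire format is to transmit the IEEE-754 bit pattern explicitly rather than the raw in-memory bytes. Sketched here in Java (the language used for the other examples on this page), where float/double are defined to be binary32/binary64; the idea carries over to C/C++ with memcpy and fixed-width integers:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class FloatWireFormat {

    // Encode a double as its IEEE-754 binary64 bit pattern in big-endian order.
    static byte[] encode(double value) {
        return ByteBuffer.allocate(Long.BYTES)
                         .order(ByteOrder.BIG_ENDIAN)
                         .putLong(Double.doubleToLongBits(value))
                         .array();
    }

    // Decode the same eight-byte wire format back into a double.
    static double decode(byte[] bytes) {
        long bits = ByteBuffer.wrap(bytes).order(ByteOrder.BIG_ENDIAN).getLong();
        return Double.longBitsToDouble(bits);
    }

    public static void main(String[] args) {
        double original = 3.141592653589793;
        System.out.println(decode(encode(original)) == original); // true
    }
}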

Converting from floating-point to decimal with floating-point computations

核能气质少年 submitted on 2019-12-06 10:28:33
I am trying to convert a floating-point double-precision value x to decimal with 12 (correctly rounded) significant digits. I am assuming that x is between 10^110 and 10^111, such that its decimal representation will be of the form x.xxxxxxxxxxxE110. And, just for fun, I am trying to use floating-point arithmetic only. I arrived at the pseudo-code below, where all operations are double-precision operations. The notation 1e98 is for the double nearest to the mathematical 10^98, and 1e98_2 is the double nearest to the result of the mathematical subtraction 10^98 - 1e98. The notation fmadd(X * Y
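The fmadd(X * Y + Z) primitive in the pseudo-code is a fused multiply-add: the product is not rounded before the addition. As a standalone illustration (not the questioner's algorithm), Java exposes the same operation as Math.fma, which among other things can recover the rounding error of a product:

public class FusedMultiplyAddDemo {
    public static void main(String[] args) {
        double a = 1.0 + Math.ulp(1.0);          // 1 + 2^-52
        double p = a * a;                        // product rounded to the nearest double

        System.out.println(a * a - p);           // 0.0: the rounding error is lost
        System.out.println(Math.fma(a, a, -p));  // 2^-104: the exact rounding error of a*a
    }
}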

Rounding quirk in JavaScript or IEEE-754?

◇◆丶佛笑我妖孽 submitted on 2019-12-06 06:13:39
Question: I've come across a curious issue in one of my unit tests where I'm getting unexpected rounding results in JavaScript: (2.005).toFixed(2) // produces "2.00" (2.00501).toFixed(2) // produces "2.01" Initially I suspected this was a WebKit-only issue, but it reproduces in Gecko, which implies to me that it is an expected side effect of either ECMA-262 or IEEE-754. I'm assuming the binary representation of 2.005 is ever so slightly less? Or does ECMA-262 specify a round-to-even methodology for toFixed?
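JavaScript Numbers are IEEE-754 binary64, so the effect can be reproduced in any language with a 64-bit double. A small Java illustration (not from the thread) that prints the exact stored values:

import java.math.BigDecimal;

public class ExactDoubleValues {
    public static void main(String[] args) {
        // new BigDecimal(double) shows the exact value of the stored binary64
        // number, not the shortest decimal that rounds back to it.
        System.out.println(new BigDecimal(2.005));
        // prints 2.00499999999999989... : the stored value is just below 2.005,
        // so rounding to two places gives "2.00"

        System.out.println(new BigDecimal(2.00501));
        // prints a value above 2.005, so rounding to two places gives "2.01"
    }
}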

How to output IEEE-754 format integer as a float

烈酒焚心 submitted on 2019-12-06 04:52:52
Question: I have an unsigned long integer value which represents a float using the IEEE-754 format. What is the quickest way of printing it out as a float in C++? I know one way, but am wondering if there is a convenient utility in C++ that would be better. Example of the way that I know is: union { unsigned long ul; float f; } u; u.ul = 1084227584; // in hex, this is 0x40A00000 cout << "float value is: " << u.f << endl; (This prints out "float value is: 5") Answer 1: The union method you suggested is the usual
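For contrast with the C++ union, the equivalent reinterpretation in Java (used here to keep all the sketches on this page in one language) is a single library call:

public class BitsToFloat {
    public static void main(String[] args) {
        int bits = 0x40A00000;                      // 1084227584 in decimal
        float f = Float.intBitsToFloat(bits);       // reinterprets the bit pattern, no numeric conversion
        System.out.println("float value is: " + f); // prints "float value is: 5.0"
    }
}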

Is 0 divided by infinity guaranteed to be 0?

我与影子孤独终老i submitted on 2019-12-06 03:19:15
According to this question, n/inf is expected to be zero for n != 0. What about when n == 0? According to IEEE-754, is (0 / inf) == 0 always true? Mathematically, 0/0 is indeterminate, and 0 divided by anything else is zero. IEEE-754 works the same way: 0/infinity yields zero, and 0/0 yields NaN. Note: not all C++ implementations support IEEE floating point, and some that do don't completely meet the IEEE specification, so this is not necessarily a C++ question. Source: https://stackoverflow.com/questions/29426734/is-0-divided-by-infinity-guaranteed-to-be-0
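A quick check of those cases in Java (whose double type follows IEEE-754 binary64), offered as an illustration rather than part of the original answer:

public class ZeroOverInfinity {
    public static void main(String[] args) {
        double inf = Double.POSITIVE_INFINITY;
        System.out.println(0.0 / inf);    // 0.0
        System.out.println(-0.0 / inf);   // -0.0 (the quotient's sign follows the usual sign rule)
        System.out.println(1.0 / inf);    // 0.0
        System.out.println(0.0 / 0.0);    // NaN
    }
}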

Why doesn't Python's decimal library return the specified number of significant figures for some inputs?

一个人想着一个人 submitted on 2019-12-06 02:05:00
NB: this question is about significant figures. It is not a question about "digits after the decimal point" or anything like that. EDIT: This question is not a duplicate of Significant figures in the decimal module. The two questions are asking about entirely different problems. I want to know why the function below does not return the desired value for a specific input. None of the answers to Significant figures in the decimal module address this question. The following function is supposed to return a string representation of a float with the specified number of significant figures:
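The questioner's function is cut off here, but the underlying task (round a float to N significant figures) can be sketched in Java with BigDecimal and MathContext; this is an illustration of the task, not the questioner's code, and like Python's Decimal it does not pad trailing zeros:

import java.math.BigDecimal;
import java.math.MathContext;
import java.math.RoundingMode;

public class SignificantFigures {

    // Round a double to n significant figures. Building the BigDecimal from the
    // double's decimal string (rather than from the double directly) avoids
    // exposing the binary64 noise digits, similar to Decimal(repr(x)) in Python.
    static String toSignificantFigures(double x, int n) {
        BigDecimal d = new BigDecimal(Double.toString(x));
        return d.round(new MathContext(n, RoundingMode.HALF_EVEN)).toString();
    }

    public static void main(String[] args) {
        System.out.println(toSignificantFigures(123.456, 4));    // 123.5
        System.out.println(toSignificantFigures(0.0012345, 3));  // 0.00123
    }
}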

Get raw bytes of a float in Swift

馋奶兔 submitted on 2019-12-06 01:51:49
Question: How can I read the raw bytes of a Float or Double in Swift? Example: let x = Float(1.5) let bytes1: UInt32 = getRawBytes(x) let bytes2: UInt32 = 0b00111111110000000000000000000000 I want bytes1 and bytes2 to contain the same value, since this binary number is the Float representation of 1.5. I need it to do bit-wise operations like & and >> (these are not defined on a float). Answer 1: Update for Swift 3: As of Swift 3, all floating-point types have a bitPattern property, which returns an unsigned
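For reference, the same operation in Java (the language used for the other sketches on this page) is Float.floatToIntBits, after which shifting and masking work on the int pattern:

public class FloatBits {
    public static void main(String[] args) {
        float x = 1.5f;
        int bits = Float.floatToIntBits(x);                // raw IEEE-754 binary32 bit pattern

        System.out.println(Integer.toBinaryString(bits));  // 111111110000000000000000000000
        System.out.println(Integer.toHexString(bits));     // 3fc00000

        int exponentField = (bits >>> 23) & 0xFF;          // biased exponent of 1.5 is 127
        System.out.println(exponentField);                 // 127
    }
}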

Any IEEE754(r) compliant implementations for Java?

感情迁移 submitted on 2019-12-05 23:41:15
Question: Are there any fully compliant IEEE754r implementations available for Java that offer support for all the features Java chose to omit (or rather that high-level languages in general like to omit): traps, sticky flags, directed rounding modes, extended/long double, quad precision, DPD (densely packed decimals)? Clarification before anyone gets it wrong: I'm not looking for the JVM to offer any support for the above, just some classes that do implement the types and operations in software, basically
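Not an answer to the full list (the JDK has no traps, status flags, or extended binary formats), but part of it is already covered by java.math: MathContext.DECIMAL128 matches the IEEE 754r decimal128 format (34 digits, HALF_EVEN by default), and RoundingMode gives directed rounding per operation. A small illustration:

import java.math.BigDecimal;
import java.math.MathContext;
import java.math.RoundingMode;

public class Decimal128Demo {
    public static void main(String[] args) {
        BigDecimal one = BigDecimal.ONE;
        BigDecimal three = new BigDecimal(3);

        // decimal128 precision with the IEEE default rounding (round half to even).
        System.out.println(one.divide(three, MathContext.DECIMAL128));
        // 0.3333333333333333333333333333333333 (34 significant digits)

        // Directed rounding, selected per operation.
        System.out.println(one.divide(three, new MathContext(34, RoundingMode.CEILING))); // ...3334
        System.out.println(one.divide(three, new MathContext(34, RoundingMode.FLOOR)));   // ...3333
    }
}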

Is float16 supported in MATLAB?

不羁的心 submitted on 2019-12-05 17:17:27
Does MATLAB support float16 operations? If so, how do I convert a double matrix to float16? I am doing an arithmetic operation on a large matrix where a 16-bit floating-point representation is sufficient, and representing it as double takes four times the memory. Is your matrix full? Otherwise, try sparse -- it saves a lot of memory if there are lots of zero-valued elements. AFAIK, float16 is not supported. The lowest you can go with a floating-point datatype is single, which is a 32-bit datatype: A = single( rand(50) ); You could multiply by a constant and cast to int16, but you'd lose
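MATLAB aside, the storage technique the question is after (keep values in 16 bits, compute in a wider type) can be sketched in Java; Float.floatToFloat16 and Float.float16ToFloat require JDK 20 or later:

public class HalfPrecisionStorage {
    public static void main(String[] args) {
        float value = 0.1f;

        short half = Float.floatToFloat16(value);    // round to IEEE-754 binary16, stored in 16 bits
        float restored = Float.float16ToFloat(half); // widen back to binary32 for arithmetic

        System.out.println(restored);                // close to 0.1, but only ~3 decimal digits of precision
        System.out.println(value == restored);       // false: binary16 cannot hold all binary32 values
    }
}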