ieee-754

How to calculate float type precision and does it make sense?

风流意气都作罢 submitted on 2020-05-22 10:04:27
Question: I have a problem understanding the precision of the float type. MSDN says the precision is from 6 to 9 digits, but I notice that the precision depends on the size of the number: float smallNumber = 1.0000001f; Console.WriteLine(smallNumber); // 1.0000001 float bigNumber = 100000001f; Console.WriteLine(bigNumber); // 100000000 The smallNumber is more precise than the big one. I understand IEEE 754, but I don't understand how MSDN calculates the precision, and does it make sense? Also, you can play with the …
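The effect described in the question is not specific to C#; it falls out of the binary32 format itself. As an illustrative sketch (not from the original post), a Python float can be round-tripped through binary32 with the struct module to show the same loss of digits:

```python
import struct

def to_float32(x):
    """Round-trip a Python float (binary64) through IEEE-754 binary32."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

# Near 1.0 the spacing between adjacent binary32 values is 2**-23 ~= 1.2e-7,
# so an 8th significant digit can survive:
small = to_float32(1.0000001)

# Near 1e8 the spacing is 8.0, so 100000001 rounds to the nearest
# representable value, 100000000:
big = to_float32(100000001.0)

print(small, big)
```

The "6 to 9 digits" figure summarizes exactly this: the absolute spacing between representable values grows with magnitude, while the number of significand bits (24) stays fixed.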

Square roots by Newton’s method

五迷三道 submitted on 2020-02-07 02:36:22
Question: The following Scheme program implements Newton's method for computing the square root of a number: (import (scheme small)) (define (sqrt x) (define (sqrt-iter guess) (if (good-enough? guess) guess (sqrt-iter (improve guess)))) (define (good-enough? guess) (define tolerance 0.001) (< (abs (- (square guess) x)) tolerance)) (define (improve guess) (if (= guess 0) guess (average guess (/ x guess)))) (define (average x y) (/ (+ x y) 2)) (define initial-guess 1.0) (sqrt-iter initial-guess))
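For experimenting outside Scheme, the same iteration can be sketched in Python (a direct translation, with the recursion rewritten as a loop; the fixed absolute tolerance of 0.001 is kept from the original, which is its known weakness for very small or very large x):

```python
def sqrt_newton(x, tolerance=0.001):
    """Newton's method for sqrt, mirroring the Scheme sqrt-iter above."""
    def good_enough(guess):
        # Same test as the Scheme good-enough?: absolute error of guess^2.
        return abs(guess * guess - x) < tolerance

    def improve(guess):
        if guess == 0:
            return guess
        # Average of guess and x/guess: one Newton step for f(g) = g^2 - x.
        return (guess + x / guess) / 2

    guess = 1.0  # the Scheme initial-guess
    while not good_enough(guess):
        guess = improve(guess)
    return guess

print(sqrt_newton(9.0))
```

Because the tolerance is absolute, the result is meaningless for x much smaller than 0.001 (any tiny guess passes) and the loop can fail to terminate for huge x, where an ulp of guess*guess exceeds the tolerance.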

How to convert float to binary without using unsafe code?

丶灬走出姿态 submitted on 2020-02-03 09:53:41
Question: Is there a way to convert a floating-point number (f32 or f64) to a data type that I can access bitwise, like u32/u64? That is, something corresponding to: fn as_bits(i: f64) -> u64 { unsafe { mem::transmute(i) } } but without the unsafe. This code is safe per the rules, even though it may not return the same values on all platforms, specifically for NaNs. The reverse safe interface would also be nice. Answer 1: Rust 1.20 introduced f64::to_bits and f32::to_bits: fn main() { println!("{}", …
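Rust's f64::to_bits/from_bits pair has analogues in most languages; as a sketch for comparison (these helper names are illustrative, not a standard API), Python can expose the same bit view safely via struct:

```python
import struct

def f64_to_bits(x):
    """Reinterpret a binary64 float as its 64-bit integer bit pattern."""
    return struct.unpack('<Q', struct.pack('<d', x))[0]

def f64_from_bits(bits):
    """Inverse: build the float whose binary64 representation is `bits`."""
    return struct.unpack('<d', struct.pack('<Q', bits))[0]

# 1.0 in binary64: sign 0, biased exponent 0x3FF, significand field 0.
print(hex(f64_to_bits(1.0)))
```

The pack/unpack round trip plays the role of the transmute: same bytes, different type, with no unsafe reinterpretation needed.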

Can -ffast-math be safely used on a typical project?

ε祈祈猫儿з submitted on 2020-02-03 03:06:13
Question: While answering a question where I suggested -ffast-math, a comment pointed out that it is dangerous. My personal feeling is that outside scientific calculations it is OK. I also assume that serious financial applications use fixed point instead of floating point. Of course, if you want to use it in your project, the ultimate answer is to test it on your project and see how much it affects it. But I think a general answer can be given by people who have tried it and have experience with such …
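One concrete reason -ffast-math changes results: it licenses the compiler to reassociate floating-point expressions, and IEEE-754 addition is not associative. A minimal sketch of the underlying numerical fact (shown here in Python, which always evaluates in the written order, so both groupings can be compared directly):

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # one grouping
right = a + (b + c)  # the reassociated grouping -ffast-math may pick

# The two groupings round differently, so the sums differ in the last bit.
print(left, right, left == right)
```

The difference is one ulp here, but in long reductions or cancellation-heavy code the divergence can grow, which is why the flag is safe for some workloads and not others.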

IEEE floating point signalling NaN (sNaN) in Haskell

▼魔方 西西 submitted on 2020-01-24 03:02:09
Question: Is there any way to define a signaling NaN in Haskell? I found two approaches to deal with NaNs: 1) use 0/0, which produces a quiet NaN; 2) the package Data.Number.Transfinite, which has no signaling NaNs either. PS: Is there any way to put a Word64 bit by bit into a Double without writing a C library? Answer 1: I have found one non-portable way: {-# LANGUAGE ForeignFunctionInterface #-} import Data.Word (Word64, Word32) import Unsafe.Coerce import Foreign import Foreign.C.Types foreign import ccall "fenv.h …
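For reference, a binary64 signaling NaN is an all-ones exponent field with the most significant significand bit (the quiet bit) clear and a nonzero payload. As a language-neutral sketch of the bit pattern (in Python; note the runtime may silently quiet an sNaN once it passes through arithmetic or a C double, so this only demonstrates the encoding, not trapping behavior):

```python
import math
import struct

# Exponent 0x7FF (all ones), quiet bit 0, payload 1 => a signaling NaN pattern.
SNAN_BITS = 0x7FF0000000000001

# Reinterpret the 64-bit pattern as a double.
snan = struct.unpack('<d', struct.pack('<Q', SNAN_BITS))[0]

print(math.isnan(snan))
```

The PS question (Word64 bit by bit into Double) is exactly this pack/unpack reinterpretation; the Haskell answer reaches for the FFI because the standard library at the time offered no safe equivalent.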

0.0 and -0.0 in Java (IEEE 754)

Deadly submitted on 2020-01-23 06:21:00
Question: Java is fully compliant with IEEE 754, right? But I'm confused about how Java decides the sign of floating-point addition and subtraction. Here is my test result: double a = -1.5; double b = 0.0; double c = -0.0; System.out.println(b * a); //-0.0 System.out.println(c * a); //0.0 System.out.println(b + b); //0.0 System.out.println(c + b); //0.0 System.out.println(b + c); //0.0 System.out.println(b - c); //0.0 System.out.println(c - b); //-0.0 System.out.println(c + c); //-0.0 I think in the …
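These results follow IEEE-754's signed-zero rules, not anything Java-specific: a product takes the XOR of the operand signs, the sum of two zeros with opposite signs is +0.0 under round-to-nearest, and the sum of two like-signed zeros keeps that sign. The same table can be reproduced in any IEEE-754 language; a sketch in Python:

```python
import math

a, b, c = -1.5, 0.0, -0.0

print(b * a)  # -0.0: positive zero times a negative is negative zero
print(c * a)  # 0.0: negative zero times a negative is positive zero
print(c + b)  # 0.0: opposite-signed zeros sum to +0.0 (round-to-nearest)
print(c + c)  # -0.0: two negative zeros sum to -0.0
print(c - b)  # -0.0: (-0.0) - (+0.0) = (-0.0) + (-0.0)

# == cannot tell the zeros apart (0.0 == -0.0 is true); copysign can:
print(math.copysign(1.0, b * a))
```

This is also why checking the sign of a zero requires copysign (or an equivalent bit-level test) rather than comparison.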

Does IEEE-754 float, double and quad guarantee exact representation of -2, -1, -0, 0, 1, 2?

China☆狼群 submitted on 2020-01-16 20:42:51
Question: All is in the title: do IEEE-754 float, double and quad guarantee exact representation of -2, -1, -0, 0, 1, 2? Answer 1: They guarantee exact representation of all integers until the number of significant binary digits exceeds the width of the significand. Answer 2: IEEE 754 floating-point numbers can store integers of certain ranges exactly. For example: binary32, implemented in C/C++ as float, provides 24 bits of precision and can therefore represent with full precision 16 …
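The cutoff for binary32 is 2**24: with a 24-bit significand, every integer of magnitude up to 2**24 is exact, and 2**24 + 1 is the first integer that is not. A sketch of the boundary (Python, round-tripping through binary32 via struct):

```python
import struct

def to_float32(x):
    """Round-trip a value through IEEE-754 binary32."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

# Tiny integers like -2..2 are trivially exact in every IEEE-754 format.
print([to_float32(float(i)) for i in (-2, -1, 0, 1, 2)])

print(to_float32(16777216.0))  # 2**24: still exact
print(to_float32(16777217.0))  # 2**24 + 1: rounds away, no longer exact
```

binary64 (double) pushes the boundary to 2**53, and binary128 (quad) to 2**113; -2, -1, -0, 0, 1, 2 are exact in all three formats.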

Odd behavior when converting C strings to/from doubles

吃可爱长大的小学妹 submitted on 2020-01-14 08:17:14
Question: I'm having trouble understanding C's rules for what precision to assume when printing doubles, or when converting strings to doubles. The following program should illustrate my point: #include <errno.h> #include <float.h> #include <stdio.h> #include <stdlib.h> #include <string.h> int main(int argc, char **argv) { double x, y; const char *s = "1e-310"; /* Should print zero */ x = DBL_MIN/100.; printf("DBL_MIN = %e, x = %e\n", DBL_MIN, x); /* Trying to read in a floating-point number smaller than …
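The surprise in this program comes from subnormal (denormal) numbers: 1e-310 is below DBL_MIN, the smallest normal double, yet it is still representable with reduced precision rather than flushing to zero. A sketch of the same behavior (Python doubles are the same binary64 type, and sys.float_info.min corresponds to DBL_MIN):

```python
import sys

# 1e-310 is smaller than the smallest *normal* double (~2.225e-308),
# but it parses to a nonzero subnormal value, not to 0.0.
x = float("1e-310")

print(x > 0.0)                 # nonzero: stored as a subnormal
print(x < sys.float_info.min)  # yet below DBL_MIN
print(sys.float_info.min)      # the DBL_MIN analogue
```

So "DBL_MIN / 100" and parsing "1e-310" both land in the subnormal range, and strtod is required to return such values (possibly setting errno to ERANGE) rather than zero.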

Why does NaN - NaN == 0.0 with the Intel C++ Compiler?

白昼怎懂夜的黑 submitted on 2020-01-11 14:56:47
Question: It is well known that NaNs propagate through arithmetic, but I couldn't find any demonstrations, so I wrote a small test: #include <limits> #include <cstdio> int main(int argc, char* argv[]) { float qNaN = std::numeric_limits<float>::quiet_NaN(); float neg = -qNaN; float sub1 = 6.0f - qNaN; float sub2 = qNaN - 6.0f; float sub3 = qNaN - qNaN; float add1 = 6.0f + qNaN; float add2 = qNaN + qNaN; float div1 = 6.0f / qNaN; float div2 = qNaN / 6.0f; float div3 = qNaN / qNaN; float mul1 = 6.0f * qNaN; …
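Under strict IEEE-754 semantics every one of those operations yields NaN; the question's title arises because the Intel compiler's default fast-math-style mode constant-folds qNaN - qNaN to 0.0 instead. The strict behavior the test expects can be sketched in a language that does not reassociate or fold (Python here):

```python
import math

qnan = float("nan")

# Every arithmetic operation involving a NaN operand produces NaN.
results = [
    -qnan,
    6.0 - qnan, qnan - 6.0, qnan - qnan,
    6.0 + qnan, qnan + qnan,
    6.0 / qnan, qnan / 6.0, qnan / qnan,
    6.0 * qnan,
]

print(all(math.isnan(r) for r in results))
print(qnan == qnan)  # False: NaN compares unequal even to itself
```

A compiler is only allowed to fold x - x to 0.0 when it may assume x is never NaN (or infinity), which is exactly the assumption fast-math modes make.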
