floating-point-precision

Precision of repr(f), str(f), print(f) when f is float

风格不统一 submitted on 2019-11-28 01:13:37
Question: If I run:

    >>> import math
    >>> print(math.pi)
    3.141592653589793

then pi is printed with 16 digits. However, according to:

    >>> import sys
    >>> sys.float_info.dig
    15

my precision is 15 digits. So, can I rely on the last digit of that value (i.e. is the value of π indeed 3.141592653589793nnnnnn)?

Answer 1: TL;DR: The last digit of str(float) or repr(float) can be "wrong", in the sense that the decimal representation appears not to be correctly rounded:

    >>> 0.100000000000000040123456
    0.10000000000000003
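A small Python sketch of the behaviour described above: repr() emits the shortest decimal string that round-trips to the same double, which can use up to 17 significant digits, while sys.float_info.dig reports only the digits guaranteed to survive a decimal-to-binary-to-decimal round trip.

```python
import sys

# repr() gives the shortest string that round-trips to the same double.
print(repr(0.1))                              # '0.1'

# Digits guaranteed to round-trip through a double:
print(sys.float_info.dig)                     # 15

# Two decimal literals that differ only beyond double precision
# collapse to the same float, whose repr looks "mis-rounded":
print(repr(0.100000000000000040123456))       # '0.10000000000000003'
```

The last line shows the question's point: the printed digits identify the stored double uniquely, but they are not a correctly rounded rendering of the literal you typed.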

Does parseDouble exist in JavaScript?

一个人想着一个人 submitted on 2019-11-27 21:05:02
In JavaScript, I have a number which is 21 digits long, and I want to parse it. Does a parseDouble method exist in JavaScript?

user2864740: It's not possible to natively handle a 21-digit-precision number in JavaScript. JavaScript has only one kind of number: "number", which is an IEEE-754 double-precision ("double") value. As such, parseFloat in JavaScript is the equivalent of a "parse double" in other languages. However, a number/"double" provides only about 16 significant decimal digits of precision, so reading in a 21-digit number will lose the 5 least significant digits. For more
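Python floats are the same IEEE-754 doubles as JavaScript numbers, so a short Python sketch shows the same digit loss a 21-digit value suffers when parsed as a double (the 21-digit constant below is just an illustrative value, not from the question):

```python
# Python ints are exact, so we can compare against the true value.
n = 123456789012345678901      # 21 digits

# Converting to float is the analogue of parseFloat / "parse double":
f = float(n)

# A double carries ~53 bits (~16 decimal digits), so the low digits
# of a 21-digit value cannot survive the conversion:
print(int(f) == n)             # False
print(int(f) - n)              # the rounding error, in whole units
```

In JavaScript the same loss happens silently at parse time, which is why arbitrary-precision values are usually kept as strings or BigInt instead.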

What's the C++ suffix for long double literals?

独自空忆成欢 submitted on 2019-11-27 20:58:56
In C++ (and C), a floating-point literal without a suffix defaults to double, while the suffix f implies float. But what is the suffix for a long double? Without knowing it, I would define, say:

    const long double x = 3.14159265358979323846264338328;

But my worry is that the variable x then contains fewer significant bits of 3.14159265358979323846264338328 than 64, because this is a double literal. Is this worry justified?

Vlad from Moscow: From the C++ Standard: "The type of a floating literal is double unless explicitly specified by a suffix. The suffixes f and F specify float, the suffixes l and
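The worry is real, and it can be demonstrated without C++: Python floats are doubles, so feeding a 30-digit literal through one shows exactly what a suffix-less C++ literal stores before any long double conversion happens.

```python
from decimal import Decimal

# A double keeps ~15-17 significant decimal digits, so this 30-digit
# literal is silently rounded at parse time (as a C++ literal without
# an 'L' suffix would be):
x = 3.14159265358979323846264338328

# Decimal(float) shows the exact binary value actually stored:
print(Decimal(x))   # 3.14159265358979311599796... - digits diverge after ~16
```

In C++ the fix is the L suffix (3.14159...328L) so the literal is parsed with long double precision from the start; rounding happens once, at the literal, and no later conversion can restore the lost bits.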

Turn float into string

我的未来我决定 submitted on 2019-11-27 16:23:50
I have reached a point where I need to turn IEEE-754 single- and double-precision numbers into strings with base 10. There is the FXTRACT instruction available, but it provides only the exponent and mantissa for base 2, since the number follows the formula:

    value = (-1)^sign * 1.(mantissa) * 2^(exponent - bias)

If I had logarithm instructions for specific bases, I would be able to change the base of the 2^(exponent - bias) part of the expression, but currently I don't know what to do. I was also thinking of using the standard rounded conversion into an integer, but it seems to be unusable as it doesn't offer
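The FXTRACT step and the base change can be sketched in Python: math.frexp plays the role of FXTRACT (splitting a double into a base-2 mantissa and exponent), and log10(2) converts the binary exponent into a decimal one. This is only the exponent-finding step of a float-to-string routine, not a full conversion.

```python
import math

# frexp is the FXTRACT analogue: value = m * 2**e with 0.5 <= m < 1.
m, e = math.frexp(6.5)
print(m, e)            # 0.8125 3, since 0.8125 * 2**3 == 6.5

# Change of base: value = m * 10**(e * log10(2) + log10(m)) ... so the
# decimal exponent (digit count before the point, minus one) is:
dec_exp = math.floor(e * math.log10(2) + math.log10(m))
print(dec_exp)         # 0: 6.5 is 6.5e0
```

Real conversion routines (Grisu, Ryū, dtoa) avoid the logarithm's rounding hazards by working with precomputed powers of ten, but the idea above is the starting point.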

next higher/lower IEEE double precision number

耗尽温柔 submitted on 2019-11-27 14:37:38
I am doing high-precision scientific computations. In looking for the best representation of various effects, I keep coming up with reasons to want the next higher (or lower) double-precision number available. Essentially, what I want to do is add one to the least significant bit of the internal representation of a double. The difficulty is that the IEEE format is not totally uniform. If one were to use low-level code and actually add one to the least significant bit, the resulting value might not be the next available double. It might, for instance, be a special-case number such as
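One robust route (standard libraries provide it precisely because of the special cases mentioned above) is a nextafter function; Python 3.9+ exposes it as math.nextafter:

```python
import math

# math.nextafter(x, y) returns the adjacent representable double in the
# direction of y, handling exponent boundaries, zeros, and infinities.
x = 1.0
up = math.nextafter(x, math.inf)
down = math.nextafter(x, -math.inf)

print(up - x)     # 2.220446049250313e-16: one ulp above 1.0 (2**-52)
print(x - down)   # 1.1102230246251565e-16: the ulp halves below 1.0 (2**-53)
```

The asymmetry around 1.0 shows why naive "add one to the bit pattern" code must still get the exponent-boundary cases right; nextafter (C's nextafter/nexttoward, Java's Math.nextUp) encapsulates them.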

Why are my BigDecimal objects initialized with unexpected rounding errors?

时光总嘲笑我的痴心妄想 submitted on 2019-11-27 14:35:27
Question: In Ruby 2.2.0, why does:

    BigDecimal.new(34.13985572755337, 9)

equal 34.0, but:

    BigDecimal.new(34.13985572755338, 9)

equal 34.1398557? Note that I am running this on a 64-bit machine.

Answer 1: Initialize with Strings Instead of Floats. In general, you can't get reliable behavior with Floats. You're making the mistake of initializing your BigDecimals with Float values instead of String values, which introduces some imprecision right at the beginning. For example, on my 64-bit system: float1 = 34
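The same pitfall exists in Python's decimal module, which makes it easy to see what the Ruby answer is describing: constructing an arbitrary-precision decimal from a float imports the float's binary rounding error, while constructing from a string keeps exactly the digits written.

```python
from decimal import Decimal

# From a float: the exact binary value, rounding error included.
print(Decimal(34.13985572755337))
# From a string: exactly the decimal digits written.
print(Decimal('34.13985572755337'))

# The two are not equal - the error was baked in before BigDecimal
# (or Decimal) ever saw the value:
print(Decimal('34.13985572755337') == Decimal(34.13985572755337))  # False
```

This is why the Ruby answer's advice generalizes: always hand decimal classes a string, so the only rounding that ever happens is the rounding you ask for.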

Is the most significant decimal digits precision that can be converted to binary and back to decimal without loss of significance 6 or 7.225?

依然范特西╮ submitted on 2019-11-27 08:37:50
I've come across two different precision formulas for floating-point numbers:

    floor((N - 1) * log10(2)) = 6 decimal digits (single precision)

and

    N * log10(2) ≈ 7.225 decimal digits (single precision)

where N = 24 significand bits (single precision). The first formula is found at the top of page 4 of "IEEE Standard 754 for Binary Floating-Point Arithmetic", written by Professor W. Kahan. The second formula is found in the Wikipedia article "Single-precision floating-point format", under the section "IEEE 754 single-precision binary floating-point format: binary32". For the first formula, Professor W.
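Both formulas are easy to evaluate directly, which also hints at why they differ: the floor((N-1)·log10(2)) form counts decimal digits guaranteed to survive a decimal→binary→decimal round trip, while N·log10(2) measures the raw information content of N bits in decimal digits.

```python
import math

N = 24  # significand bits in IEEE-754 single precision

# Digits guaranteed to round-trip through a binary32:
print(math.floor((N - 1) * math.log10(2)))   # 6

# Decimal digits of information carried by 24 bits:
print(N * math.log10(2))                     # ≈ 7.2247
```

So 6 and 7.225 answer two different questions (guaranteed round-trip digits vs. digits needed to distinguish all binary32 values), which is the tension the question is asking about.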

pow() seems to be out by one here

穿精又带淫゛_ submitted on 2019-11-27 08:33:28
What's going on here:

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        printf("17^12 = %lf\n", pow(17, 12));
        printf("17^13 = %lf\n", pow(17, 13));
        printf("17^14 = %lf\n", pow(17, 14));
    }

I get this output:

    17^12 = 582622237229761.000000
    17^13 = 9904578032905936.000000
    17^14 = 168377826559400928.000000

13 and 14 do not match with Wolfram Alpha. Compare:

    12: 582622237229761.000000      582622237229761
    13: 9904578032905936.000000     9904578032905937
    14: 168377826559400928.000000   168377826559400929

Moreover, it's not wrong by some strange fraction - it's wrong by exactly one! If this is down to me reaching
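The cause can be shown in Python, where integer arithmetic is exact and float arithmetic uses the same doubles as C's pow: 17^13 is a 54-bit odd integer, one bit more than a double's 53-bit significand can hold, so the floating-point result must land on a neighbouring representable value.

```python
exact = 17 ** 13          # Python ints are exact: 9904578032905937
approx = 17.0 ** 13       # double arithmetic, like pow(17, 13) in C

print(exact)
print(exact.bit_length())           # 54: one bit too many for a double
print(int(approx) == exact)         # False: the odd low bit cannot be stored
```

Representable doubles near 2^53 are spaced 2 apart, so an odd result like 9904578032905937 is unrepresentable and pow's answer is off by (at least) exactly one, which is precisely the symptom in the question.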

Converting Int to Float loses precision for large numbers in Swift

こ雲淡風輕ζ submitted on 2019-11-27 06:54:51
Question: Xcode 6.3.1, Swift 1.2:

    let value: Int = 220904525
    let intmax = Int.max
    let float = Float(value) // Here is an error, probably
    let intFromFloat = Int(float)
    let double = Double(value)
    println("intmax=\(intmax) value=\(value) float=\(float) intFromFloat=\(intFromFloat) double=\(double)")
    // intmax=9223372036854775807 value=220904525 float=2.20905e+08 intFromFloat=220904528 double=220904525.0

The initial value is 220904525. But when I convert it to Float it becomes 220904528. Why?

Answer 1: This is due
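Swift's Float is an IEEE-754 binary32, which can be reproduced in Python by round-tripping through struct's 'f' format: 220904525 needs 28 significant bits, but a binary32 significand holds only 24, so the value is rounded to the nearest representable float.

```python
import struct

value = 220904525
print(value.bit_length())     # 28 bits - four more than a Float can hold

# Pack/unpack through a 32-bit float, as Swift's Float(value) does:
as_float32 = struct.unpack('f', struct.pack('f', value))[0]
print(int(as_float32))        # 220904528, matching the Swift output
```

At this magnitude representable binary32 values are 16 apart (2^(27-23)), so the conversion snaps to the nearest multiple of 16; Double, with 53 bits, holds 220904525 exactly, which is why the Double conversion in the question is lossless.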

Why does for loop using a double fail to terminate

两盒软妹~` submitted on 2019-11-27 04:49:32
Question: I'm looking through old exam questions (I'm currently in my first year of university) and I'm wondering if someone could explain a bit more thoroughly why the following for loop does not end when it is supposed to. Why does this happen? I understand that it skips 100.0 because of a rounding error or something, but why?

    for (double i = 0.0; i != 100; i = i + 0.1) {
        System.out.println(i);
    }

Answer 1: The number 0.1 cannot be exactly represented in binary, much like 1/3 cannot be exactly represented in decimal; as such
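The failure is visible after just ten iterations in any language with IEEE doubles; here in Python, accumulating 0.1 ten times does not land exactly on 1.0, so an equality test like the loop condition above never becomes true:

```python
# 0.1 is stored as the nearest double, slightly above 0.1, and the
# accumulated rounding errors keep the sum from ever equalling 1.0:
i = 0.0
for _ in range(10):
    i += 0.1

print(i)          # 0.9999999999999999
print(i == 1.0)   # False - an `i != 1.0` loop guard would run forever
```

Scaled up to 100, the Java loop steps right past the target for the same reason, which is why loop guards on floats should use `<` (or integer counters) rather than `!=`.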