floating-point-precision

Ruby - Multiplication issue

Submitted by 不想你离开。 on 2020-01-01 12:27:10
Question: My output is like this:

    ruby-1.9.2-p290 :011 > 2.32 * 3
     => 6.959999999999999

And I remember that some time back, on another machine, I got it like 2.32 * 3 = 6. What is my mistake? Thanks a ton for reading this. :)

Answer 1: If you really want to round down to an integer then just (3 * 2.32).to_i, but I think that's unlikely. Usually you just want to format the slightly imprecise floating-point number, something like this:

    "%0.2f" % (3 * 2.32)
     => "6.96"

If you really want to work with the exact …
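This is not specific to Ruby: 2.32 has no exact binary representation, so any IEEE-754 double implementation shows the same result. A minimal sketch in Python, which uses the same 64-bit doubles:

```python
# 2.32 is stored as the nearest representable double, so the
# product carries that tiny representation error.
product = 2.32 * 3
print(product)            # 6.959999999999999
print("%0.2f" % product)  # formatting hides the error: 6.96
```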

Confusion with floating point numbers

Submitted by 走远了吗. on 2019-12-29 01:44:08
Question:

    #include <stdio.h>

    int main() {
        float x = 3.4e2;
        printf("%f", x);
        return 0;
    }

Output: 340.000000 — that's fine. But if I write x = 3.1234e2 the output is 312.339996, and if x = 3.12345678e2 the output is 312.345673. Why are the outputs like this? I think that for x = 3.1234e2 the output should be 312.340000, but the actual output is 312.339996 with the GCC compiler.

Answer 1: Not all fractional numbers have an exact binary equivalent, so the value is rounded to the nearest representable one. Simplified example: if you have 3 bits for the fraction, …
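The effect comes from rounding to 32-bit float. A sketch of the same rounding in Python, using struct to force values through IEEE-754 single precision:

```python
import struct

def to_float32(x):
    # Round a Python double to the nearest IEEE-754 single (a C float).
    return struct.unpack('f', struct.pack('f', x))[0]

print("%f" % to_float32(3.4e2))     # 340.000000: exactly representable
print("%f" % to_float32(3.1234e2))  # 312.339996: nearest float to 312.34
```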

C++ handling of excess precision

Submitted by 喜你入骨 on 2019-12-28 05:35:26
Question: I'm currently looking at code that does multi-precision floating-point arithmetic. To work correctly, that code requires values to be reduced to their final precision at well-defined points. So even if an intermediate result is computed in an 80-bit extended-precision floating-point register, at some point it has to be rounded to a 64-bit double for subsequent operations. The code uses a macro INEXACT to describe this requirement, but doesn't have a perfect definition for it. The gcc manual mentions …
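The property the code depends on, namely that each stored value is the exact mathematical result rounded once to 64 bits with no excess precision, can be illustrated in Python, where floats are always 64-bit doubles. A sketch of what "rounded exactly once" means:

```python
from fractions import Fraction

a, b = 0.1, 0.2
# Compute the mathematically exact sum of the two stored doubles...
exact = Fraction(a) + Fraction(b)
# ...then round it to double precision exactly once.
rounded_once = float(exact)
# An IEEE-754 addition with no excess precision produces the same bits.
print(rounded_once == a + b)  # True
```

Extended-precision intermediates break exactly this guarantee: the 80-bit sum rounded later to 64 bits can differ from the 64-bit sum rounded directly.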

OpenCL, C++: Unexpected Results of simple sum float vector program

Submitted by 徘徊边缘 on 2019-12-25 10:18:07
Question: It is a simple program that reads two float4 vectors from files, then calculates the sum of opposite elements. The results were not what I expected! The main file:

    #include <limits.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <iostream>
    #include <iomanip>
    #include <array>
    #include <fstream>
    #include <sstream>
    #include <string>
    #include <algorithm>
    #include <iterator>
    #ifdef __APPLE__
    #include <OpenCL/opencl.h>
    #else
    #include <CL/cl.h>
    #include <time.h>
    #endif

    const int number_of_points = 16; //
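A likely source of "unexpected" results here: OpenCL's float is 32-bit, so device results will generally not match host-side double arithmetic. A hypothetical Python sketch that emulates the precision gap (the function name f32 is illustrative, not from the question):

```python
import struct

def f32(x):
    # Emulate an OpenCL float: round to IEEE-754 single precision.
    return struct.unpack('f', struct.pack('f', x))[0]

a, b = 0.1, 0.2
device_sum = f32(f32(a) + f32(b))  # every step in single precision
host_sum = a + b                   # double precision on the host
print(device_sum == host_sum)      # False: the two sums differ
```

Comparing device output against a double-precision reference therefore needs a tolerance, not equality.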

gcc double printf precision - wrong output

Submitted by 允我心安 on 2019-12-24 10:54:19
Question:

    #include <stdio.h>
    #include <wchar.h>

    int main() {
        double f = 1717.1800000000001;
        wprintf(L"double %.20G\n", f);
        return 0;
    }

outputs (actual, then expected):

    double 1717.1800000000000637
    double 1717.1800000000001

This is on Ubuntu 11.10 x64 (and also when compiling for 32-bit). The problem I'm trying to solve: on Windows the number is printed exactly as written in the code, and I need low-level formatting (swprintf) to work like it does on Windows, for portability reasons.

Answer 1: 1717.1800000000001 …
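Both outputs describe the same double: glibc's %.20G is correctly rounded and exposes all 20 digits of the stored value, while the Windows-style output is the shortest string that parses back to the same double. A Python sketch of the two views:

```python
f = 1717.1800000000001
print("%.20G" % f)  # 1717.1800000000000637 -- the stored double, 20 digits
print(repr(f))      # 1717.18 -- shortest string that round-trips
# The short form loses nothing: it parses back to the identical double.
print(float(repr(f)) == f)  # True
```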

Loss of precision after subtracting double from double [duplicate]

Submitted by 断了今生、忘了曾经 on 2019-12-24 04:31:54
Question: This question already has answers here; closed 7 years ago. Possible duplicate: Retain precision with Doubles in java. Alright, so I've got the following chunk of code:

    int rotation = e.getWheelRotation();
    if (rotation < 0)
        zoom(zoom + rotation * -.05);
    else if (zoom - .05 > 0)
        zoom(zoom - rotation * .05);
    System.out.println(zoom);

Now, the zoom variable is of type double, initially set to 1. So I would expect results like 1 - .05 = .95, .95 - .05 = .9, .9 - .05 = .85, and so on. This …
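The drift is not Java-specific; repeatedly subtracting 0.05 accumulates representation error in any IEEE-754 double. A minimal Python sketch:

```python
zoom = 1.0
for _ in range(2):
    zoom -= 0.05       # 0.05 is not exactly representable in binary
print(zoom)            # 0.8999999999999999, not 0.9
print(round(zoom, 2))  # 0.9 -- round (or format) only for display
```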

Comparing doubles using ULPs (Units in the last place)

Submitted by ╄→尐↘猪︶ㄣ on 2019-12-24 00:44:09
Question: I have succeeded in writing a ULPs-based function that compares two doubles for equality. According to this page, the comparison can be made using a combination of absolute and relative epsilon, or using integers (ULPs). I have written both an epsilon-based and a ULPs-based function. This is the epsilon-based function:

    var IsAlmostEqual_Epsilon = function(a, b) {
        if (a == b) return true;
        var diff = Math.abs(a - b);
        if (diff < 4.94065645841247E-320) return true;
        a = Math.abs(a);
        b = Math.abs(b);
        var …
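The ULPs idea exploits the fact that, for same-signed IEEE-754 doubles, the bit patterns read as integers are ordered like the floats themselves. A sketch of the integer trick in Python (the function names here are illustrative, not from the question):

```python
import struct

def ulps_between(a, b):
    # Reinterpret each double's bits as a signed 64-bit integer, and
    # remap negative floats so integer order matches float order.
    def key(x):
        i = struct.unpack('<q', struct.pack('<d', x))[0]
        return i if i >= 0 else -(i & 0x7FFFFFFFFFFFFFFF)
    return abs(key(a) - key(b))

print(ulps_between(1.0, 1.0 + 2**-52))  # 1: adjacent doubles
print(ulps_between(0.1 + 0.2, 0.3))     # 1: off by one ulp
```

Two values are then "almost equal" when their ULP distance is below a small threshold, which scales naturally with magnitude, unlike a fixed epsilon.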

Do doubles suffer from overflow?

Submitted by 帅比萌擦擦* on 2019-12-23 16:52:16
Question: Is it possible to have an overflow (wrap-around) with a double or a float? What happens when the maximum (or minimum) value is reached on x86 or x64 hardware?

Answer 1: On an IEEE-754-compliant system, overflow results in a special "infinity" (or "minus infinity") value, beyond which any further increment has no effect.

Answer 2: No. Floats go to Inf or -Inf.

Source: https://stackoverflow.com/questions/10239741/do-doubles-suffer-from-overflow
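The saturating behavior is easy to confirm in any IEEE-754 language; a quick Python check:

```python
import math
import sys

biggest = sys.float_info.max        # largest finite double
print(biggest * 2)                  # inf: overflow saturates, no wrap-around
print(math.inf + 1 == math.inf)     # True: further increments change nothing
```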

Is it 52 or 53 bits of floating point precision?

Submitted by 南笙酒味 on 2019-12-23 15:35:26
Question: I keep seeing this nonsense about 53 bits of precision in the 64-bit IEEE floating-point representation. Would someone please explain to me how in the world a bit that is stuck at 1 contributes ANYTHING to the numeric precision? If you had a floating-point unit with bit 0 stuck at 1, you would of course know that it produces one less bit of precision than normal. Where are those sensibilities here? Further, just the exponent, the scaling factor without the mantissa, completely …
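The implicit leading 1 is not stored, but it still carries information because the exponent says where it sits, so a normalized double effectively has 53 significand bits. A quick Python check of the 53-bit boundary:

```python
import sys

print(sys.float_info.mant_dig)           # 53: 52 stored bits + 1 implicit bit
# Integers up to 2**53 are exact; beyond that, odd values can't be stored.
print(float(2**53) == float(2**53 + 1))  # True: 2**53 + 1 rounds away
print(float(2**53 + 2) == 2**53 + 2)     # True: the even neighbor is exact
```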

Math.Tan() near -Pi/2 wrong in .NET, right in Java?

Submitted by 試著忘記壹切 on 2019-12-23 12:26:48
Question: I have a unit test failing on Math.Tan(-PI/2) returning the "wrong" value in .NET. The "expected" value is taken from Wolfram online (using the spelled-out constant for -Pi/2). See for yourselves here. As correctly observed in the comments, the mathematical result of tan(-pi/2) is infinity. However, the constant Math.PI does not represent pi perfectly, so this is a "near the limit" input. Here's the code:

    double MINUS_HALF_PI = -1.570796326794896557998981734272d;
    Console.WriteLine(MINUS …
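Since the double nearest to pi is about 1.2e-16 away from the true value, tan is evaluated a hair away from the pole and returns a huge finite number rather than infinity. The same effect in Python's math library:

```python
import math

# math.pi is the double nearest to pi, slightly below the true value,
# so -math.pi/2 misses the pole of tan by roughly 6e-17.
t = math.tan(-math.pi / 2)
print(t)                 # about -1.6e16: huge, but finite
print(math.isfinite(t))  # True, not -infinity
```

Tiny differences in how each platform's tan implementation rounds near the pole then explain why .NET and Java disagree in the last digits.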