Does double z=x-y guarantee that z+y==x for IEEE 754 floating point?

隐身守侯 提交于 2019-11-30 09:23:30

问题


I have a problem that can be reduced to this problem statement:

Given a series of doubles where each is in the range [0, 1e7], modify the last element such that the sum of the numbers equals exactly a target number. The series of doubles already sums to the target number within an epsilon (1e-7), but they are not ==.


The following code is working, but is it guaranteed to work for all inputs that meet the requirements described in the first sentence?

public static double[] FixIt(double[] input, double targetDouble)
{
    var result = new double[input.Length];
    if (input.Length == 0) return result;

    double sum = 0;
    for (int i = 0; i < input.Length - 1; i++)
    {
        sum += input[i];
        result[i] = input[i];
    }

    double remainder = targetDouble - sum;
    result[result.Length - 1] = remainder;
    return result;
}

var arr1 = Enumerable.Repeat(Math.PI / 13, 13).ToArray();
var arr2 = FixIt(arr1, Math.PI);

Debug.Print(Math.PI.ToString("R")); //3.1415926535897931
Debug.Print(arr1.Sum().ToString("R")); //3.1415926535897922
Debug.Print(arr2.Sum().ToString("R")); //3.1415926535897931

A previous version of this question asked about modifying the first element, but modifying the last element simplifies the problem to a known sum and a known target, leaving us with just the question of whether last = target-sum implies that sum+last == target.

(Without NaN of course, and the restrictions on ranges imply some restrictions on last as well that might help.)

Regarding the real problem: We've had this problem surface a number of times in a variety of ways, but what we are trying to do at the moment is reduce the floating point error that crops up due to numerical instabilities in a linear programming solver (Coin-OR CBC). For example, there are 6 variables which all must be in the range [0,X] and the sum of the variables must also be X. Due to numerical instability, the solver occasionally returns slightly negative values and values that do not sum to exactly X. We've overcome the negative number issues - now just trying to resolve the sum to X issue. (Yes, there may be constraints that are being disobeyed by us changing the results, but making sure these numbers sum to X is of higher priority, where the other constraints are not as important.)


回答1:


z = x-y; does not guarantee z+y == x, and there is not always a solution for the problem of finding a z such z+y == x. A proof follows.

We assume IEEE-754 binary floating-point arithmetic with rounding to nearest, ties to even. The basic 64-bit format is used, but the result holds for other formats. Note that the 64-bit format uses 53-bit significands, meaning that only numbers with 53 or fewer significant binary digits can be represented.

Consider a target x equal to 1+2−52. Let y be 2−53. Then, after z = x-y;, z+y == x evaluates to false. The arithmetic details are shown below, but:

  • z = x-y; sets z to 1, and then z+y produces 1, which is less than x.
  • If we increase z to the next representable number, 1+2−52, then z+y produces 1+2−51, which is greater than x.
  • So there is no value of z that makes z+y == x true.

Details:

The mathematical result of xy is 1+2−53. As this has 54 significant bits (from 20 to 2−53), it is not representable, and the computed result of x-y must be rounded. The two nearest numbers are 1 and 1+2−52. The ties-to-even rule produces the former number, 1, as the low bit of its significand is 0, while the low bit for 1+2−52 is 1.

Thus z = x-y; sets z to 1.

Then the mathematical result of z+y is 1+2−53. As above, this is rounded to 1, so the computed result of z+y is 1. So z+y == x compares 1 to 1+2−52 and produces false.

Furthermore, no value of z could make the comparison true. If we increment z by the smallest available step, from 1 to 1+2−52, the mathematical sum of z+y is then 1+2−52+2−53. This is midway between the two representable numbers 1+2−52 and 1+2−51. The former has a low bit of 1, and the latter has a low bit of 0, so the computed result of this z+y is 1+2−51, which is of course not equal to 1+2−52.

Floating-point addition is weakly monotonic, so there are no values of z that would produce 1+2−52 for z+y.




回答2:


No, it doesn't. Here's a concrete counter-example; coded in Python, but you can easily repeat the same experiment in C#:

>>> x = 0.24999916553497312
>>> y =  1.0000153779983518
>>> z = -0.7500162124633787
>>> z == x - y
True
>>> z + y == x
False

Here's a small counter-example with x, y, z all positive:

>>> x = 0.4500000000000001
>>> y = 0.20000000000000004
>>> z = 0.2500000000000001
>>> z == x - y
True
>>> z + y == x
False



回答3:


Floating point arithmetic is not exact, by definition (unless you're only dealing with integers (edit for correctness: up to 253 i.e. 9007199254740992)); you will always have rounding discrepancies. If you want the rounding to match what humans expect: use decimal instead of double. If you do the same thing with decimal, it will work correctly for any set of numbers that aren't pathological in terms of the decimal digits.



来源:https://stackoverflow.com/questions/55412915/does-double-z-x-y-guarantee-that-zy-x-for-ieee-754-floating-point

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!