Converting from floating-point to decimal with floating-point computations

Submitted by 核能气质少年 on 2019-12-06 10:28:33

Let y/d be the exact quotient, and q=rnd(y/d) be the result rounded to the nearest float.
Then the true error multiplied by d is rt=(rnd(y/d)-y/d)*d=q*d-y, and the operation we performed with fmadd is r=rnd(q*d-y).
Why q*d-y is exact (so that the final rounding in fmadd changes nothing) is harder to explain. Roughly: q*d has a limited number of bits (<nbits(q)+nbits(d)), the exponent of y is that of q*d (+/- 1), and since the error satisfies |rt|<=0.5*ulp(q)*d, the leading nbits(q) bits cancel in the subtraction, so the difference fits in a float. That answers question 1.
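The exactness claim can be checked numerically. Below is a minimal Python sketch (the helper names are mine, not from the question) that uses exact rational arithmetic in place of fmadd: it computes q*d-y exactly for the rounded quotient q and tests whether that value is itself a double, which is the case where the single rounding of fmadd has nothing to round. It assumes no overflow or underflow occurs.

```python
from fractions import Fraction
import random

def exact_residual(y: float, d: float) -> Fraction:
    """Exact value of q*d - y, where q = y / d is the rounded double quotient."""
    q = y / d  # rounded to the nearest double
    return Fraction(q) * Fraction(d) - Fraction(y)

def fits_in_double(v: Fraction) -> bool:
    """True if the rational v is exactly representable as a double."""
    return Fraction(float(v)) == v

random.seed(0)
d = 1e98
samples = [random.uniform(1e110, 1e111) for _ in range(1000)]
# Every residual should be a double, so rnd(q*d - y) == q*d - y.
print(all(fits_in_double(exact_residual(y, d)) for y in samples))
```

A random scan like this is evidence, not a proof; the bit-counting argument above is what actually justifies the claim.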

So q*1e98 - y = r, where |r|*2^1074 <= 0.5e98 < 5*10^97 (the second inequality is lucky: it holds only because the double 1e98 happens to be smaller than 10^98).

q*(10^98) - y = r + (10^98-1e98)*q, where |10^98-1e98|*q*2^1074 <= 0.5e95 (assuming at least 15 digits of precision, since log(2^53)/log(10) > 15).
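The decomposition is plain algebra, which a short exact-arithmetic sketch in Python can confirm. Here `GAP` stands for the exact value of 10^98-1e98 (the quantity that 1e98_2 approximates); the function name and the sample input are illustrative only and are not taken from the question's code.

```python
from fractions import Fraction
import math

D = 1e98                      # the double nearest to 10^98
GAP = 10**98 - Fraction(D)    # exact 10^98 - 1e98, which 1e98_2 approximates

def split_error(q: float, y: float):
    """Return (r, correction) such that q*10^98 - y == r + correction exactly."""
    r = Fraction(q) * Fraction(D) - Fraction(y)   # q*1e98 - y, before any rounding
    correction = GAP * Fraction(q)                # (10^98 - 1e98) * q
    return r, correction

y = 1.0000000001835e110
q = y / D
r, correction = split_error(q, y)
total = Fraction(q) * 10**98 - Fraction(y)

print(r + correction == total)                  # the identity holds exactly
print(abs(GAP) <= Fraction(math.ulp(D)) / 2)    # 1e98 is within half an ulp of 10^98
```

The second check only bounds the size of the gap; its sign is what makes the "lucky" inequality work.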

So you ask whether |q*(10^98)-y|*2^1074 > 5*10^97.

You have an approximation of |q*(10^98)-y|, namely r+1e98_2*q.

Since |r|*2^1074 < 5*10^97, and |r+(10^98-1e98)*q|<|r| if the signs are opposite, I think that answers question 2 positively. But I wouldn't be so sure if 1e98_2 were < 0.

If r and 1e98_2 have the same sign, the error might exceed 5*10^97, hence your further handling, which compares r3 = 1e98_2*q + r against h=0.5e98*2^-1074.

For question 3, at first sight, I'd say that two things might make the algorithm fail:

  • 1e98_2 is not exact (10^98-1e98-1e98_2 = -3.6e63 approx.)

  • and h is not ht=0.5*10^98*2^-1074 but a bit smaller as we saw above.

The true error r3t is approximately (1e98_2-3.6e63)*q + r < r3 (only the positive case interests us, because 1e98_2>0).

So an approximate error r3 falling above the approximate tie h while the true error r3t is below the true tie ht could lead to an incorrect rounding. Whether that is possible, and if so how frequent, is your question 3.

To mitigate this risk, you tried to truncate the magnitude of r3, so that r3 <= 1e98_2*q + r. I felt a bit too tired to perform a true analysis of the error bounds...

So I scanned for an error, and the first failing example I found was 1.0000000001835e110 (I assume it is correctly rounded to the nearest double, whose exact value is in fact 1000000000183.49999984153799821120915424942630528225695526491963291846957919215885146546696544423465444842668032e98).

In this case, r and 1e98_2 have same sign, and

  • (x/1e98) > 1000000000183.50000215

  • q significand is thus rounded to 1000000000184

  • r3>h (r3*2^1074 is approx. 5.000001584620017e97), and we incorrectly incremented to q+s when it should have been q-s: definitely a bug.
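These figures can be reproduced with exact rational arithmetic. The Python sketch below is not the algorithm from the question; it only compares the quotient by the true 10^98 with the quotient by the double 1e98 for this one input, and rounds each to an integer significand. The helper `round_half_up` is a name I made up for the illustration.

```python
from fractions import Fraction
import math

def round_half_up(v: Fraction) -> int:
    """Round a positive rational to the nearest integer, ties upward."""
    return math.floor(v + Fraction(1, 2))

y = Fraction(1.0000000001835e110)   # exact value of the nearest double
true_quot = y / 10**98              # y / 10^98, exact
guess_quot = y / Fraction(1e98)     # y / (double 1e98), exact, before rounding to q

# Fractional parts relative to 1000000000183, shown as floats for readability.
print(float(true_quot - 1000000000183))    # just below 0.5, per the expansion quoted above
print(float(guess_quot - 1000000000183))   # just above 0.5, per the first bullet
print(round_half_up(true_quot), round_half_up(guess_quot))
```

The two quotients straddle the tie at ...183.5, which is why the guess lands on ...184 while the correctly rounded significand is ...183.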

My answers are:

  1. Yes, r=fmadd(q * 1e98 - y) is exactly 1e98*(error made during the division). But we don't care about the division, which just provides a guess; what counts is that the subtraction is exact.

  2. Yes, the sign is correct, because |r|*2^1074 < 5*10^97, and |r+(10^98-1e98)*q|<|r| if the signs are opposite. But I wouldn't be so sure if 1e98_2 were < 0.

  3. Taking the first failing example, (1.0000000001835e110 - 1.0e110)/1.0e110 ulp -> 1.099632e6, a very, very naive conjecture would be that in about 1 case out of a million r3 falls above h... So once q+s is corrected into q-s, the occurrence of r3>h while r3t<ht is much, much smaller than 1/1,000,000 in any case... There are more than 10^15 doubles in the range of interest, so do not consider this a serious answer...

  4. Yes, the discussion above is solely about the guess q, independently of the way it was produced, and the subtraction in 1. will still be exact...
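The ulp count quoted in answer 3 can be recomputed directly. This is a small Python sketch, assuming IEEE 754 doubles; `double_bits` is a helper name of my own. Both values lie in the same binade, so the difference of their bit patterns is the number of doubles separating them.

```python
import math
import struct

def double_bits(x: float) -> int:
    """Bit pattern of a positive double as an unsigned 64-bit integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

lo, hi = 1.0e110, 1.0000000001835e110
steps = double_bits(hi) - double_bits(lo)   # number of doubles from lo up to hi
print(steps)                                # on the order of 1.1e6
print((hi - lo) / math.ulp(lo))             # the same count, via the ulp of 1e110
```

This only measures how far the scan went before the first failure; as said in answer 3, it is not a serious estimate of the failure rate.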
