Easiest way to convert a decimal float to bit representation manually based on IEEE 754, without using any library

Posted by 99封情书 on 2019-12-01 10:32:22

See the files src/lib/floating_point.ml and src/lib/floating_point.mli in Frama-C. They implement the conversion from decimal representation to floating-point for single-precision and double-precision (you cannot obtain the former from the latter because of double rounding issues), without any external library. The files are covered by the LGPL 2.1. This implementation is the subject of a couple of blog posts starting at this one and continuing with this one.
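The double rounding issue mentioned above can be illustrated with a small sketch. It is not taken from Frama-C; the test value 1 + 2^-24 + 2^-60 is an assumption chosen for illustration (it has a finite, if long, decimal expansion), and the `fractions` and `struct` modules are used only to check the result, not as part of a conversion routine.

```python
from fractions import Fraction
import struct

# Hypothetical test value: just above the midpoint between the
# single-precision neighbours 1.0 and 1 + 2**-23.
exact = 1 + Fraction(1, 2**24) + Fraction(1, 2**60)

# Path 1: round to double first, then to single.
# The 2**-60 term is lost in the double, leaving an exact tie,
# which round-to-even then resolves downwards to 1.0.
as_double = float(exact)
via_double = struct.unpack("<I", struct.pack("<f", as_double))[0]

# Path 2: round the exact value directly to 24 significant bits.
scaled = exact * 2**23                       # lies in [2**23, 2**24)
sig = scaled.numerator // scaled.denominator
rem = scaled - sig
if rem > Fraction(1, 2) or (rem == Fraction(1, 2) and sig & 1):
    sig += 1
direct = (127 << 23) + (sig - 2**23)         # biased exponent 127, i.e. 2**0

print(hex(via_double), hex(direct))
```

The two paths disagree in the last bit, which is why the single-precision result has to be computed from the decimal string itself rather than from the double-precision result.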

This is probably close to the simplest conversion function one can write: I had no performance constraints and only wanted to keep the code as simple and as correct as possible, without depending on an existing library such as MPFR.

...
type parsed_float = {
  f_nearest : float ;
  f_lower : float ;
  f_upper : float ;
}

val single_precision_of_string: string -> parsed_float
val double_precision_of_string: string -> parsed_float
...

I don't understand your treatment of the fraction. As shown, you are doing decimal fraction arithmetic, which would give correct results but introduces its own implementation difficulties. Doing binary fraction arithmetic instead would be circular: you would have to convert the fraction to a binary fraction in order to convert it to a binary fraction.

I think it might be simpler to work entirely in binary integers, though you would still need an extended form, such as BigInteger.

To do that, first note the number of digits after the decimal point, D. Convert the decimal digit string to an integer N, ignoring the decimal point. The value is N/10**D, using "**" to represent power. Calculate 10**D as a binary integer.
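A minimal sketch of this first step, assuming a plain decimal string with an optional sign and no exponent part. The function name `parse_decimal` is hypothetical, and Python's built-in integers stand in for the extended integer type.

```python
def parse_decimal(text):
    """Split a string such as '-12.375' into (sign, N, D).

    The value represented is sign * N / 10**D.
    Assumes only an optional sign, digits, and at most one '.'.
    """
    sign = 1
    if text[0] in "+-":
        if text[0] == "-":
            sign = -1
        text = text[1:]
    int_part, _, frac_part = text.partition(".")
    d = len(frac_part)                 # D: digits after the decimal point
    n = 0
    for ch in int_part + frac_part:    # N: all digits, ignoring the point
        n = n * 10 + (ord(ch) - ord("0"))
    return sign, n, d


sign, n, d = parse_decimal("-12.375")
print(sign, n, d, 10 ** d)
```

Here `10 ** d` is the integer divisor used in the next step.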

Calculate N/10**D by binary long division, stopping when you have F+2 significant bits in the result, where F is the number of fraction bits in your floating point format. Note the location of the binary point in this result.
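One way to sketch the long division, assuming N is nonzero. The name `long_divide` and its return convention are assumptions made for illustration: it returns the quotient bits as an integer, a flag saying whether the remainder is nonzero, and the exponent of the leading bit, which records where the binary point is.

```python
def long_divide(n, den, bits):
    """Divide n by den, producing exactly `bits` significant quotient bits.

    Returns (q, sticky, e) where q has its top bit set,
    n/den is approximately q * 2**(e - (bits - 1)),
    and sticky is True if the remainder is nonzero.
    Assumes n > 0 and den > 0.
    """
    # Scale by powers of two so that den <= n < 2*den.
    e = 0
    while n < den:
        n <<= 1
        e -= 1
    while n >= 2 * den:
        den <<= 1
        e += 1

    # Classic shift-and-subtract division, one quotient bit per step.
    q, rem = 0, n
    for _ in range(bits):
        q <<= 1
        if rem >= den:
            q |= 1
            rem -= den
        rem <<= 1
    return q, rem != 0, e


# 0.1 = 1/10 with 5 significant bits: 1.1001b * 2**-4
print(long_divide(1, 10, 5))
```

For a real conversion, `bits` would be F+2, for example 25 for single precision.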

The most significant one bit is implicit and is not stored if the number is in the normal range. To round correctly to F fraction bits, you need both the least significant of the F+2 bits, call it G, and also a bit S that is zero if, and only if, the remainder is zero. If G is 0, use the F fraction bits unchanged. If G and S are both one, you need to round up. If G is one and S is zero, the exact result is halfway between two representable values, and you should round to even.
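The rounding rule can be sketched as a small function. The name `round_nearest_even` is hypothetical; `q` is assumed to hold the F+2 quotient bits (leading one, F fraction bits, then G) and `sticky` plays the role of S.

```python
def round_nearest_even(q, sticky):
    """Drop the guard bit G from q, rounding to nearest with ties to even."""
    g = q & 1          # G: the least significant of the F+2 bits
    sig = q >> 1       # leading one plus the F fraction bits
    if g and (sticky or (sig & 1)):
        # Above the halfway point, or exactly halfway with an odd result.
        sig += 1
    return sig


print(bin(round_nearest_even(0b1010, True)))   # G = 0: unchanged
print(bin(round_nearest_even(0b1001, True)))   # G = 1, S = 1: round up
print(bin(round_nearest_even(0b1001, False)))  # tie, already even
print(bin(round_nearest_even(0b1011, False)))  # tie, odd: round up to even
```

The returned value can be one bit longer than expected when every bit was one; that carry-out has to be handled when computing the exponent.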

Calculate the exponent from the position of the most significant bit relative to the binary point, after processing any carry-out due to rounding up. If the exponent is in range, you are done. If it is too big, return infinity of the appropriate sign. If it is too small, you need to denormalize. To get the rounding right, recompute G and S from the bits you are dropping and the old value of S.
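A sketch of this final step for IEEE 754 single precision (F = 23, bias 127). The function name `encode_single` and its parameters are assumptions: `q` holds the F+2 unrounded quotient bits with the top bit set, `sticky` is S, and `e` is the exponent of the leading bit. The sketch denormalizes before its single rounding step, folding the dropped bits into S, so the value is never rounded twice. Zero is assumed to be handled separately.

```python
F, BIAS, E_MIN, E_MAX = 23, 127, -126, 127

def encode_single(sign, q, sticky, e):
    """Return the 32-bit pattern as an int; sign is 0 or 1."""
    if e < E_MIN:
        # Too small for a normal number: shift right so the exponent
        # becomes E_MIN, and fold every dropped bit into S.
        drop = E_MIN - e
        sticky = sticky or (q & ((1 << drop) - 1)) != 0
        q >>= drop
        e = E_MIN
    g = q & 1                       # G after any denormalizing shift
    sig = q >> 1
    if g and (sticky or (sig & 1)):
        sig += 1                    # round to nearest, ties to even
    if sig == 1 << (F + 1):         # carry-out: 1.11...1 became 10.00...0
        sig >>= 1
        e += 1
    if e > E_MAX:
        return (sign << 31) | (0xFF << F)      # infinity
    if sig < 1 << F:
        return (sign << 31) | sig              # subnormal, exponent field 0
    return (sign << 31) | ((e + BIAS) << F) | (sig - (1 << F))


# 0.1 = 1.6 * 2**-4; the first 25 bits of 1.6 are floor(1.6 * 2**24).
print(hex(encode_single(0, (16 << 24) // 10, True, -4)))
```

For 0.1 this gives 0x3dcccccd, the usual single-precision bit pattern for that value.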
