How to add and subtract 16 bit floating point half precision numbers?

爱⌒轻易说出口 submitted on 2019-12-24 03:15:10

Question


How do I add and subtract 16 bit floating point half precision numbers?

Say I need to add or subtract:

1 10000 0000000000

1 01111 1111100000

2’s complement form.


Answer 1:


Assuming the format follows the same conventions as IEEE single/double precision (normalized values with an implicit leading 1, and denormals when E == 0), compute sign = (-1)^S, the mantissa as 1.M if E != 0 and 0.M if E == 0, and the exponent as E - bias, where bias = 2^(n-1) - 1 = 15 for the n = 5 exponent bits (for E == 0 the exponent is 1 - bias = -14). Operate on these natural representations, then convert back to the 16-bit format.

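sign1 = -1, mantissa1 = 1.0 (binary), exponent1 = 1

sign2 = -1, mantissa2 = 1.11111 (binary), exponent2 = 0

Align the exponents first: 1.11111 x 2^0 = 0.111111 x 2^1, so the mantissas add as 1.0 + 0.111111 = 1.111111.

sum: sign = -1, mantissa = 1.111111, exponent = 1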

Representation: 1 10000 1111110000

Naturally, this assumes excess (biased) encoding of the exponent, as IEEE 754 uses.
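The decode, operate, re-encode steps described in this answer can be sketched in Python. This is only an illustration under stated assumptions: the helper names `decode`, `encode`, `add`, and `sub` are made up for this example, the bias is taken to be 15, only finite values are handled, and results are rounded to nearest with ties to even (the answer itself does not specify a rounding mode).

```python
from fractions import Fraction

BIAS = 15  # assumed bias: 2**(5 - 1) - 1 for the 5-bit exponent field


def decode(bits: str) -> Fraction:
    """Turn 'S EEEEE MMMMMMMMMM' into an exact rational value (finite only)."""
    s, e, m = bits.split()
    sign = -1 if s == "1" else 1
    E, M = int(e, 2), int(m, 2)
    if E == 31:
        raise ValueError("infinity/NaN are not handled in this sketch")
    if E == 0:
        # Denormal: 0.M x 2^(1 - bias)
        return sign * Fraction(M, 1024) * Fraction(2) ** (1 - BIAS)
    # Normal: 1.M x 2^(E - bias)
    return sign * (1 + Fraction(M, 1024)) * Fraction(2) ** (E - BIAS)


def encode(value: Fraction) -> str:
    """Round an exact value to the nearest half-precision bit pattern."""
    sign = "1" if value < 0 else "0"
    mag = abs(value)
    if mag == 0:
        return f"{sign} 00000 0000000000"
    # Find e such that 2^e <= mag < 2^(e + 1).
    e = mag.numerator.bit_length() - mag.denominator.bit_length()
    if Fraction(2) ** e > mag:
        e -= 1
    e = max(e, 1 - BIAS)  # values below 2^-14 share the denormal exponent
    # Significand in units of 2^-10; round() on a Fraction rounds ties to even.
    sig = round(mag / Fraction(2) ** (e - 10))
    if sig >= 2048:  # rounding carried into the next power of two
        sig //= 2
        e += 1
    if e > 15:
        raise OverflowError("result does not fit in half precision")
    if sig < 1024:  # no leading 1: denormal, stored with E = 0
        return f"{sign} 00000 {sig:010b}"
    return f"{sign} {e + BIAS:05b} {sig - 1024:010b}"


def add(a: str, b: str) -> str:
    return encode(decode(a) + decode(b))


def sub(a: str, b: str) -> str:
    return encode(decode(a) - decode(b))


print(add("1 10000 0000000000", "1 01111 1111100000"))  # 1 10000 1111110000
print(sub("1 10000 0000000000", "1 01111 1111100000"))  # 1 01010 0000000000
```

Using exact rationals keeps the arithmetic step trivial and pushes all rounding into `encode`; a hardware-style implementation would instead align and add the integer significands directly.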




Answer 2:


The OpenEXR library defines a half-precision floating point class. It's C++, but the code for casting between native IEEE 754 float and half should be easy to adapt. See Half/half.h as a start.
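The same cast-to-native, operate, cast-back idea can be sketched without OpenEXR. The example below is not OpenEXR's code; it assumes Python 3.6+, whose standard `struct` module supports the `e` format code for IEEE 754 binary16. The helper names `half_bits_to_float` and `float_to_half_bits` are invented for this illustration.

```python
import struct


def half_bits_to_float(bits: int) -> float:
    """Reinterpret a 16-bit pattern as a half and widen it to a Python float."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def float_to_half_bits(x: float) -> int:
    """Round a Python float to half precision and return its 16-bit pattern."""
    return struct.unpack("<H", struct.pack("<e", x))[0]


a = 0b1_10000_0000000000  # sign, exponent, mantissa from the question
b = 0b1_01111_1111100000

total = float_to_half_bits(half_bits_to_float(a) + half_bits_to_float(b))
diff = float_to_half_bits(half_bits_to_float(a) - half_bits_to_float(b))

print(f"{total:016b}")  # 1100001111110000
print(f"{diff:016b}")   # 1010100000000000
```

Note that `struct.pack` with `e` raises `OverflowError` when the value is too large for half precision, so a caller that wants infinities on overflow would need to handle that case separately.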



Source: https://stackoverflow.com/questions/7623776/how-to-add-and-subtract-16-bit-floating-point-half-precision-numbers
