Why is there no 2-byte float and does an implementation already exist?

Submitted by 懵懂的女人 on 2019-11-30 12:33:26

Question


Assume I am really pressed for memory and want a smaller range (similar to short vs int). Shader languages already support half, a floating-point type with half the precision (not just a value converted back and forth to lie between -1 and 1, i.e. a float returned like this: shortComingIn / maxRangeOfShort). Is there an implementation that already exists for a 2-byte float?

I am also interested in any (historical?) reasons why there is no 2-byte float.


Answer 1:


Re: Implementations: Someone has apparently written half for C, which would (of course) work in C++: https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/cellperformance-snippets/half.c

Re: Why is float four bytes: Probably because below that size, the precision is very limited.




Answer 2:


If you are low on memory, did you consider dropping the float concept? Floats use up a lot of bits just for saving where the decimal point is. You can work around this if you know where you need the decimal point. Let's say you want to save a dollar value: you could just save it in cents:

#include <cstdint>   // uint16_t
#include <iostream>  // std::cout

uint16_t cash = 50000;  // $500.00 stored as whole cents
std::cout << "Cash: $" << (cash / 100) << "." << ((cash % 100) < 10 ? "0" : "") << (cash % 100) << std::endl;

That is of course only an option if you can predetermine the position of the decimal point. But if you can, always prefer it, because this also speeds up all calculations! (A quick arithmetic sketch follows below.)

rgds, Kira :-)
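A quick sketch of arithmetic in such a fixed-point representation (the scale factor of 100 and the values are illustrative):

#include <cstdint>

uint16_t a = 1050;                         // $10.50, stored in cents
uint16_t b = 325;                          // $3.25, stored in cents
uint16_t sum = a + b;                      // addition needs no adjustment: 1375 = $13.75
uint32_t product = (uint32_t)a * b / 100;  // multiply two scaled values, then rescale once: 3412 = $34.12 (truncated)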




Answer 3:


There is an IEEE 754 standard for 16-bit floats.

It's a new format, standardized in 2008, based on half-precision support first found in a GPU released in 2002.




Answer 4:


To go a bit further than Kiralein's suggestion of switching to integers, we could define a range and let the integer values of a short represent equal divisions over that range, with some symmetry if straddling zero:

short mappedval = (short)(val / range * 32767.0f);  // encode: val in [-range, range] -> [-32767, 32767]
float restored  = mappedval / 32767.0f * range;     // decode

Differences between these integer versions and using half precision floats:

  1. Integers are equally spaced over the range, whereas floats are more densely packed near zero
  2. Using integers will use integer math in the CPU rather than floating-point. That is often faster because integer operations are simpler. Having said that, mapping the values onto an asymmetric range would require extra additions etc to retrieve the value at the end.
  3. The absolute precision loss is more predictable; you know the error in each value so the total loss can be calculated in advance, given the range. Conversely, the relative error is more predictable using floating point.
  4. There may be a small selection of operations which you can do using pairs of values, particularly bitwise operations, by packing two shorts into an int. This can halve the number of cycles needed (or more, if short operations involve a cast to int) and maintains 32-bit width. This is just a diluted version of bit-slicing where 32 bits are acted on in parallel, which is used in crypto (see the sketch after this list).
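
A minimal sketch of the packing idea from point 4 (the helper names are mine, purely illustrative):

#include <cstdint>

// Pack two 16-bit values into one 32-bit word so that a single bitwise
// operation acts on both halves at once (a diluted form of bit-slicing).
uint32_t pack(uint16_t hi, uint16_t lo) { return ((uint32_t)hi << 16) | lo; }
uint16_t high(uint32_t p) { return (uint16_t)(p >> 16); }
uint16_t low(uint32_t p)  { return (uint16_t)(p & 0xFFFF); }

// Example: clear the low nibble of both packed values with one AND.
uint32_t masked = pack(0x1234, 0xABCD) & 0xFFF0FFF0;  // 0x1230ABC0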



Answer 5:


TL;DR: 16-bit floats do exist, and there are various software as well as hardware implementations.

There are currently 2 common standard 16-bit float formats: IEEE-754 binary16 and Google's bfloat16. Since they're standardized, anyone who knows the spec can obviously write an implementation. Some examples:

  • https://github.com/ramenhut/half
  • https://github.com/minhhn2910/cuda-half2
  • https://github.com/tianshilei1992/half_precision
  • https://github.com/acgessler/half_float

Or, if you don't want to use them, you can also design a different 16-bit float format and implement it yourself.


2-byte floats are generally not used, because even float's precision is not enough for normal operations; double should always be used by default unless you're limited by bandwidth or cache size. Floating-point literals are also double when written without a suffix in C and C-like languages (a quick illustration follows the links below). See

  • Why are double preferred over float?
  • Should I use double or float?
  • When do you use float and when do you use double
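
As a small illustration of the literal rule in C++:

double d = 1.0;    // an unsuffixed floating-point literal has type double
float  f = 1.0f;   // the 'f' suffix makes it float
static_assert(sizeof(1.0) == sizeof(double), "unsuffixed literals are double");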

However, less-than-32-bit floats do exist. They're mainly used for storage purposes, as in graphics, where 96 bits per pixel (32 bits per channel * 3 channels) is far too wasteful; values are converted to a normal 32-bit float for calculations (except on some special hardware). Various 10-, 11-, and 14-bit float types exist in OpenGL. Many HDR formats use a 16-bit float for each channel, and Direct3D 9.0, as well as some GPUs like the Radeon R300 and R420, have a 24-bit float format. A 24-bit float is also supported by compilers for some 8-bit microcontrollers like PIC, where 32-bit float support is too costly. 8-bit or narrower float types are less useful, but due to their simplicity they're often taught in computer science curricula. Besides, a small float is also used in ARM's instruction encoding for small floating-point immediates.

The IEEE 754-2008 revision officially added a 16-bit float format, a.k.a. binary16 or half-precision, with a sign bit, a 5-bit exponent, and a 10-bit stored mantissa (11 bits of significand precision, counting the implicit leading bit).
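
A minimal sketch of decoding that layout in C++ (the function name is mine; it handles zeros, subnormals, infinities, and NaNs):

#include <cstdint>
#include <cmath>

// Decode an IEEE-754 binary16 bit pattern: 1 sign bit, 5 exponent bits, 10 mantissa bits.
float binary16_to_float(uint16_t h)
{
    int      sign = (h >> 15) & 0x1;
    int      exp  = (h >> 10) & 0x1F;
    uint32_t frac = h & 0x3FF;

    float value;
    if (exp == 0)        // zero or subnormal: frac * 2^-10 * 2^-14
        value = std::ldexp((float)frac, -24);
    else if (exp == 31)  // all-ones exponent: infinity or NaN
        value = frac ? std::nanf("") : INFINITY;
    else                 // normal: (1 + frac/1024) * 2^(exp - 15)
        value = std::ldexp(1.0f + frac / 1024.0f, exp - 15);

    return sign ? -value : value;
}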

Some compilers have had support for IEEE-754 binary16, but mainly for conversion or vectorized operations, not for computation (because they're not precise enough). For example, ARM's toolchain has __fp16, which offers a choice between two variants, IEEE and alternative, depending on whether you want NaN/Inf representations or more range. GCC and Clang also support __fp16, along with the standardized name _Float16. See How to enable __fp16 type on gcc for x86_64
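
A minimal sketch, assuming a recent GCC or Clang on a target where _Float16 is available (e.g. AArch64, or x86-64; exact flags and support vary by compiler version):

#include <cstdio>

int main()
{
    _Float16 a = (_Float16)1.5f;    // explicit conversions avoid relying on the f16 literal suffix
    _Float16 b = (_Float16)0.25f;
    _Float16 c = a + b;             // computed in hardware where available, otherwise emulated via float
    std::printf("%f\n", (double)c); // prints 1.750000
    return 0;
}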

Recently, due to the rise of AI, another format called bfloat16 (brain floating-point format) has become common; it is simply the top 16 bits of an IEEE-754 binary32.

The motivation behind the reduced mantissa comes from Google's experiments, which showed that it is fine to reduce the mantissa as long as it's still possible to represent tiny values close to zero as part of the summation of small differences during training. A smaller mantissa also brings other advantages, such as reduced multiplier power and physical silicon area: multiplier size scales roughly with the square of the significand width.

  • float32: 24² = 576 (100%)
  • float16: 11² = 121 (21%)
  • bfloat16: 8² = 64 (11%)
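
A minimal sketch of the truncation idea in C++ (with round-to-nearest-even as used by common ML frameworks; NaN handling is omitted for brevity, and the function names are mine):

#include <cstdint>
#include <cstring>

// Convert an IEEE-754 binary32 to bfloat16 by keeping its top 16 bits,
// with round-to-nearest-even instead of plain truncation.
uint16_t float_to_bfloat16(float f)
{
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);                // safe type-punning
    uint32_t rounding = 0x7FFFu + ((bits >> 16) & 1u);  // nearest-even bias
    return (uint16_t)((bits + rounding) >> 16);
}

// The reverse direction is exact: just append 16 zero bits.
float bfloat16_to_float(uint16_t h)
{
    uint32_t bits = (uint32_t)h << 16;
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}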

Many compilers, like GCC and ICC, have now also gained the ability to support bfloat16.

More information about bfloat16:

  • bfloat16 - Hardware Numerics Definition
  • Using bfloat16 with TensorFlow models
  • What is tf.bfloat16 "truncated 16-bit floating point"?



Answer 6:


There are probably a variety of types in different implementations. A float equivalent of stdint.h seems like a good idea. Call (alias?) the types by their sizes (float16_t?). A float being 4 bytes is only the current state of affairs, though it probably won't get smaller. Terms like half and long mostly become meaningless with time. With 128- or 256-bit computers they could come to mean anything.

I'm working with images (1+1+1 byte/pixel), and I want to express each pixel's value relative to the average. So floating point, or carefully fixed point, but not 4 times as big as the raw data, please. A 16-bit float sounds about right; a sketch of the fixed-point alternative follows.
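
A sketch of the kind of storage described here, using signed fixed point instead of a 16-bit float (the 9.7 split and function names are my own illustration):

#include <cstdint>

// Store each 8-bit pixel's difference from the image average in 2 bytes,
// keeping 7 fractional bits (9.7 fixed point; the delta fits in [-255, 255]).
int16_t encode(uint8_t pixel, float average)
{
    return (int16_t)((pixel - average) * 128.0f);
}

float decode(int16_t delta, float average)
{
    return average + delta / 128.0f;
}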

This GCC 7.3 doesn't know "half"; maybe it exists in a C++ context.




Answer 7:


If your CPU supports F16C, then you can get something up and running fairly quickly with something such as:

// needs to be compiled with -mf16c enabled
#include <immintrin.h>  // _cvtss_sh / _cvtsh_ss conversion intrinsics (F16C)
#include <cstdint>
#include <istream>      // std::istream, used by operator >>

struct float16
{
private:
  uint16_t _value;
public:

  inline float16() : _value(0) {}
  inline float16(const float16&) = default;
  inline float16(float16&&) = default;
  inline float16(const float f) : _value(_cvtss_sh(f, _MM_FROUND_CUR_DIRECTION)) {}

  inline float16& operator = (const float16&) = default;
  inline float16& operator = (float16&&) = default;
  inline float16& operator = (const float f) { _value = _cvtss_sh(f, _MM_FROUND_CUR_DIRECTION); return *this; }

  inline operator float () const 
    { return _cvtsh_ss(_value); }

  inline friend std::istream& operator >> (std::istream& input, float16& h) 
  { 
    float f = 0;
    input >> f;
    h._value = _cvtss_sh(f, _MM_FROUND_CUR_DIRECTION);
    return input;
  }
};

Maths is still performed using 32-bit floats (the F16C extension only provides conversions between 16- and 32-bit floats; no instructions exist to compute arithmetic with 16-bit floats).
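
A quick usage sketch of the type above (the compile command assumes GCC or Clang):

// g++ -mf16c -O2 main.cpp
#include <iostream>

int main()
{
    float16 h = 3.14159f;       // stored in 16 bits via _cvtss_sh
    float back = h;             // widened back via _cvtsh_ss
    std::cout << back << '\n';  // the stored value is 3.140625, not 3.14159: precision was lost
    return 0;
}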



Source: https://stackoverflow.com/questions/5766882/why-is-there-no-2-byte-float-and-does-an-implementation-already-exist
