Why do compilers fix the digits of floating point number to 6?

不想你离开。 Submitted on 2019-12-12 23:33:15

Question


According to The C++ Programming Language (4th edition), section 6.2.5:

There are three floating-point types: float (single-precision), double (double-precision), and long double (extended-precision)

Refer to: http://en.wikipedia.org/wiki/Single-precision_floating-point_format

The true significand includes 23 fraction bits to the right of the binary point and an implicit leading bit (to the left of the binary point) with value 1 unless the exponent is stored with all zeros. Thus only 23 fraction bits of the significand appear in the memory format but the total precision is 24 bits (equivalent to log10(2^24) ≈ 7.225 decimal digits).

→ So a floating-point number in the binary32 interchange format (a number format that occupies 4 bytes, i.e. 32 bits, in memory) should hold at most 7 decimal digits.

When I test on different compilers (such as GCC and the VC compiler)
→ the output is always 6.

Looking into the float.h of each compiler
→ I found that the value 6 is hard-coded.

Question:

  • Do you know why there is a difference here (between the theoretical value, 7, and the actual value, 6)?
    "7" seems more reasonable: when I test with the code below, a 7-digit value is still printed correctly, while an 8-digit value is not.
  • Why don't compilers derive the number of decimal digits a floating-point type can represent from the interchange format, instead of using a fixed value?

Code:

#include <iostream>
#include <limits>

using namespace std;

int main()
{
    // Decimal digits guaranteed to survive a round trip through float.
    cout << numeric_limits<float>::digits10 << endl;

    float f = -9999999;  // 7 significant decimal digits

    cout.precision(10);

    cout << f << endl;
}

Answer 1:


You're not reading the documentation.


std::numeric_limits<float>::digits10 is 6:

The value of std::numeric_limits<T>::digits10 is the number of base-10 digits that can be represented by the type T without change, that is, any number with this many decimal digits can be converted to a value of type T and back to decimal form, without change due to rounding or overflow. For base-radix types, it is the value of digits (digits-1 for floating-point types) multiplied by log10(radix) and rounded down.

The standard 32-bit IEEE 754 floating-point type has a 24 bit fractional part (23 bits written, one implied), which may suggest that it can represent 7 digit decimals (24 * std::log10(2) is 7.22), but relative rounding errors are non-uniform and some floating-point values with 7 decimal digits do not survive conversion to 32-bit float and back: the smallest positive example is 8.589973e9, which becomes 8.589974e9 after the roundtrip. These rounding errors cannot exceed one bit in the representation, and digits10 is calculated as (24-1)*std::log10(2), which is 6.92. Rounding down results in the value 6.


std::numeric_limits<float>::max_digits10 is 9:

The value of std::numeric_limits<T>::max_digits10 is the number of base-10 digits that are necessary to uniquely represent all distinct values of the type T, such as necessary for serialization/deserialization to text. This constant is meaningful for all floating-point types.

Unlike most mathematical operations, the conversion of a floating-point value to text and back is exact as long as at least max_digits10 were used (9 for float, 17 for double): it is guaranteed to produce the same floating-point value, even though the intermediate text representation is not exact. It may take over a hundred decimal digits to represent the precise value of a float in decimal notation.




Answer 2:


std::numeric_limits<float>::digits10 is equivalent to FLT_DIG, which is defined by the C standard:

number of decimal digits, q, such that any floating-point number with q decimal digits can be rounded into a floating-point number with p radix b digits and back again without change to the q decimal digits,

$$
q = \begin{cases}
p \log_{10} b & \text{if } b \text{ is a power of 10} \\
\lfloor (p - 1) \log_{10} b \rfloor & \text{otherwise}
\end{cases}
$$

FLT_DIG 6

DBL_DIG 10

LDBL_DIG 10

The value is 6 (and not 7) because of rounding errors: not every decimal number with 7 significant digits survives conversion to a 32-bit float and back. Rounding errors are limited to 1 bit, though, so FLT_DIG is calculated from 23 bits (instead of the full 24):

23 * log10(2) = 6.92

which is rounded down to 6.



Source: https://stackoverflow.com/questions/29510356/why-do-compilers-fix-the-digits-of-floating-point-number-to-6
