Ranges of floating point datatype in C?

随声附和 提交于 2019-11-27 04:02:39

These numbers come from the IEEE-754 standard, which defines the standard representation of floating point numbers. Wikipedia article at the link explains how to arrive at these ranges knowing the number of bits used for the signs, mantissa, and the exponent.

A 32 bit floating point number has 23 + 1 bits of mantissa and an 8 bit exponent (-126 to 127 is used though) so the largest number you can represent is:

(1 + 1 / 2 + ... 1 / (2 ^ 23)) * (2 ^ 127) = 
(2 ^ 23 + 2 ^ 23 + .... 1) * (2 ^ (127 - 23)) = 
(2 ^ 24 - 1) * (2 ^ 104) ~= 3.4e38

The values for the float data type come from having 32 bits in total to represent the number which are allocated like this:

1 bit: sign bit

8 bits: exponent p

23 bits: mantissa

The exponent is stored as p + BIAS where the BIAS is 127, the mantissa has 23 bits and a 24th hidden bit that is assumed 1. This hidden bit is the most significant bit (MSB) of the mantissa and the exponent must be chosen so that it is 1.

This means that the smallest number you can represent is 01000000000000000000000000000000 which is 1x2^-126 = 1.17549435E-38.

The largest value is 011111111111111111111111111111111, the mantissa is 2 * (1 - 1/65536) and the exponent is 127 which gives (1 - 1 / 65536) * 2 ^ 128 = 3.40277175E38.

The same principles apply to double precision except the bits are:

1 bit: sign bit

11 bits: exponent bits

52 bits: mantissa bits

BIAS: 1023

So technically the limits come from the IEEE-754 standard for representing floating point numbers and the above is how those limits come about

Timothy Jones

As dasblinkenlight already answered, the numbers come from the way that floating point numbers are represented in IEEE-754, and Andreas has a nice breakdown of the maths.

However - be careful that the precision of floating point numbers isn't exactly 6 or 15 significant decimal digits as the table suggests, since the precision of IEEE-754 numbers depends on the number of significant binary digits.

  • float has 24 significant binary digits - which depending on the number represented translates to 6-8 decimal digits of precision.

  • double has 53 significant binary digits, which is approximately 15 decimal digits.

Another answer of mine has further explanation if you're interested.

Infinity, NaN and subnormals

These are important caveats that no other answer has mentioned so far.

First read this introduction to IEEE 754 and subnormal numbers: What is a subnormal floating point number?

Then, for single precision floats (32-bit):

  • IEEE 754 says that if the exponent is all ones (0xFF == 255), then it represents either NaN or Infinity.

    This is why the largest non-infinite number has exponent 0xFE == 254 and not 0xFF.

    Then with the bias, it becomes:

    254 - 127 == 127
    
  • FLT_MIN is the smallest normal number. But there are smaller subnormal ones! Those take up the -127 exponent slot.

All asserts of the following program pass on Ubuntu 18.04 amd64:

#include <assert.h>
#include <float.h>
#include <inttypes.h>
#include <math.h>
#include <stdlib.h>
#include <stdio.h>

float float_from_bytes(
    uint32_t sign,
    uint32_t exponent,
    uint32_t fraction
) {
    uint32_t bytes;
    bytes = 0;
    bytes |= sign;
    bytes <<= 8;
    bytes |= exponent;
    bytes <<= 23;
    bytes |= fraction;
    return *(float*)&bytes;
}

int main(void) {
    /* All 1 exponent and non-0 fraction means NaN.
     * There are of course many possible representations,
     * and some have special semantics such as signalling vs not.
     */
    assert(isnan(float_from_bytes(0, 0xFF, 1)));
    assert(isnan(NAN));
    printf("nan                  = %e\n", NAN);

    /* All 1 exponent and 0 fraction means infinity. */
    assert(INFINITY == float_from_bytes(0, 0xFF, 0));
    assert(isinf(INFINITY));
    printf("infinity             = %e\n", INFINITY);

    /* ANSI C defines FLT_MAX as the largest non-infinite number. */
    assert(FLT_MAX == 0x1.FFFFFEp127f);
    /* Not 0xFF because that is infinite. */
    assert(FLT_MAX == float_from_bytes(0, 0xFE, 0x7FFFFF));
    assert(!isinf(FLT_MAX));
    assert(FLT_MAX < INFINITY);
    printf("largest non infinite = %e\n", FLT_MAX);

    /* ANSI C defines FLT_MIN as the smallest non-subnormal number. */
    assert(FLT_MIN == 0x1.0p-126f);
    assert(FLT_MIN == float_from_bytes(0, 1, 0));
    assert(isnormal(FLT_MIN));
    printf("smallest normal      = %e\n", FLT_MIN);

    /* The smallest non-zero subnormal number. */
    float smallest_subnormal = float_from_bytes(0, 0, 1);
    assert(smallest_subnormal == 0x0.000002p-126f);
    assert(0.0f < smallest_subnormal);
    assert(!isnormal(smallest_subnormal));
    printf("smallest subnormal   = %e\n", smallest_subnormal);

    return EXIT_SUCCESS;
}

GitHub upstream.

Compile and run with:

gcc -ggdb3 -O0 -std=c11 -Wall -Wextra -Wpedantic -Werror -o subnormal.out subnormal.c
./subnormal.out

Output:

nan                  = nan
infinity             = inf
largest non infinite = 3.402823e+38
smallest normal      = 1.175494e-38
smallest subnormal   = 1.401298e-45

It's a consequence of the size of the exponent part of the type, as in IEEE 754 for example. You can examine the sizes with FLT_MAX, FLT_MIN, DBL_MAX, DBL_MIN in float.h.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!