Ranges of floating point datatype in C?

死守一世寂寞 2020-11-30 03:49

I am reading a C book. When discussing the ranges of the floating point types, the author gives this table:

Type     Smallest Positive Value  Largest value      Precision
====         


        
6 Answers
  • 2020-11-30 04:16

    As dasblinkenlight already answered, the numbers come from the way that floating point numbers are represented in IEEE-754, and Andreas has a nice breakdown of the maths.

    However, be careful: the precision of floating point numbers isn't exactly 6 or 15 significant decimal digits as the table suggests, because the precision of IEEE-754 numbers depends on the number of significant binary digits.

    • float has 24 significant binary digits, which translates to 6-8 decimal digits of precision depending on the number represented.

    • double has 53 significant binary digits, which is approximately 15 decimal digits.

    Another answer of mine has further explanation if you're interested.

  • 2020-11-30 04:18

    Infinity, NaN and subnormals

    These are important caveats that no other answer has mentioned so far.

    First read this introduction to IEEE 754 and subnormal numbers: What is a subnormal floating point number?

    Then, for single precision floats (32-bit):

    • IEEE 754 says that if the exponent is all ones (0xFF == 255), then it represents either NaN or Infinity.

      This is why the largest non-infinite number has exponent 0xFE == 254 and not 0xFF.

      Then with the bias, it becomes:

      254 - 127 == 127
      
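    • FLT_MIN is the smallest normal number. But there are smaller subnormal ones! Those use the all-zeros exponent field, which would be the -127 slot after subtracting the bias, but is interpreted as exponent -126 with a leading 0 instead of the hidden 1.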

    All asserts of the following program pass on Ubuntu 18.04 amd64:

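    #include <assert.h>
    #include <float.h>
    #include <inttypes.h>
    #include <math.h>
    #include <stdlib.h>
    #include <stdio.h>
    #include <string.h>
    
    /* Build a float from its three IEEE 754 single precision fields:
     * 1 sign bit, 8 exponent bits, 23 fraction bits. */
    float float_from_bytes(
        uint32_t sign,
        uint32_t exponent,
        uint32_t fraction
    ) {
        uint32_t bytes;
        float result;
        bytes = 0;
        bytes |= sign;
        bytes <<= 8;
        bytes |= exponent;
        bytes <<= 23;
        bytes |= fraction;
        /* memcpy avoids the strict aliasing violation of *(float*)&bytes. */
        memcpy(&result, &bytes, sizeof(result));
        return result;
    }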
    
    int main(void) {
        /* All 1 exponent and non-0 fraction means NaN.
         * There are of course many possible representations,
         * and some have special semantics such as signalling vs not.
         */
        assert(isnan(float_from_bytes(0, 0xFF, 1)));
        assert(isnan(NAN));
        printf("nan                  = %e\n", NAN);
    
        /* All 1 exponent and 0 fraction means infinity. */
        assert(INFINITY == float_from_bytes(0, 0xFF, 0));
        assert(isinf(INFINITY));
        printf("infinity             = %e\n", INFINITY);
    
        /* ANSI C defines FLT_MAX as the largest non-infinite number. */
        assert(FLT_MAX == 0x1.FFFFFEp127f);
        /* Not 0xFF because that is infinite. */
        assert(FLT_MAX == float_from_bytes(0, 0xFE, 0x7FFFFF));
        assert(!isinf(FLT_MAX));
        assert(FLT_MAX < INFINITY);
        printf("largest non infinite = %e\n", FLT_MAX);
    
        /* ANSI C defines FLT_MIN as the smallest non-subnormal number. */
        assert(FLT_MIN == 0x1.0p-126f);
        assert(FLT_MIN == float_from_bytes(0, 1, 0));
        assert(isnormal(FLT_MIN));
        printf("smallest normal      = %e\n", FLT_MIN);
    
        /* The smallest non-zero subnormal number. */
        float smallest_subnormal = float_from_bytes(0, 0, 1);
        assert(smallest_subnormal == 0x0.000002p-126f);
        assert(0.0f < smallest_subnormal);
        assert(!isnormal(smallest_subnormal));
        printf("smallest subnormal   = %e\n", smallest_subnormal);
    
        return EXIT_SUCCESS;
    }
    

    GitHub upstream.

    Compile and run with:

    gcc -ggdb3 -O0 -std=c11 -Wall -Wextra -Wpedantic -Werror -o subnormal.out subnormal.c
    ./subnormal.out
    

    Output:

    nan                  = nan
    infinity             = inf
    largest non infinite = 3.402823e+38
    smallest normal      = 1.175494e-38
    smallest subnormal   = 1.401298e-45
    
  • 2020-11-30 04:28

    A 32-bit floating point number has 23 + 1 bits of mantissa and an 8-bit exponent (only -126 to 127 is used for normal numbers), so the largest number you can represent is:

    (1 + 1 / 2 + ... + 1 / (2 ^ 23)) * (2 ^ 127) = 
    (2 ^ 23 + 2 ^ 22 + ... + 1) * (2 ^ (127 - 23)) = 
    (2 ^ 24 - 1) * (2 ^ 104) ~= 3.4e38
    
  • 2020-11-30 04:35

    These numbers come from the IEEE-754 standard, which defines the standard representation of floating point numbers. The Wikipedia article on IEEE 754 explains how to arrive at these ranges from the number of bits used for the sign, the mantissa, and the exponent.

  • 2020-11-30 04:40

    It's a consequence of the size of the exponent part of the type, as in IEEE 754 for example. You can examine the limits through FLT_MAX, FLT_MIN, DBL_MAX and DBL_MIN in float.h.

  • 2020-11-30 04:41

    The values for the float data type come from having 32 bits in total to represent the number, allocated like this:

    1 bit: sign bit

    8 bits: exponent p

    23 bits: mantissa

    The exponent is stored as p + BIAS, where BIAS is 127. The mantissa has 23 stored bits plus a 24th hidden bit that is assumed to be 1. This hidden bit is the most significant bit (MSB) of the mantissa, and the exponent must be chosen so that it is 1.

    This means that the smallest positive normal number you can represent is 0 00000001 00000000000000000000000 (sign, stored exponent 1, mantissa bits all 0), which is 1x2^-126 = 1.17549435E-38.

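    The largest finite value is 0 11111110 11111111111111111111111 (stored exponent 254, because an all-ones exponent is reserved for infinity and NaN). The mantissa is 2 * (1 - 1/16777216) and the exponent is 127, which gives (1 - 1 / 16777216) * 2 ^ 128 = 3.40282347E38.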

    The same principles apply to double precision except the bits are:

    1 bit: sign bit

    11 bits: exponent bits

    52 bits: mantissa bits

    BIAS: 1023

    So technically the limits come from the IEEE-754 standard for representing floating point numbers, and the above is how those limits come about.
