I am reading a C book that talks about the ranges of floating point types, and the author gives this table:
Type     Smallest Positive Value    Largest Value       Precision
====     =======================    =============       =========
float    1.17549 x 10^-38           3.40282 x 10^38     6 digits
double   2.22507 x 10^-308          1.79769 x 10^308    15 digits
I don't know where the numbers in the Smallest Positive Value and Largest Value columns come from.
A 32-bit floating point number has 23 + 1 bits of mantissa and an 8-bit exponent (though only -126 to 127 is used for normal numbers), so the largest number you can represent is:
(1 + 1/2 + ... + 1/2^23) * 2^127 =
(2^23 + 2^22 + ... + 1) * 2^(127 - 23) =
(2^24 - 1) * 2^104 ~= 3.4e38
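As a quick sanity check, here is a minimal sketch (my own, not part of the original answer) that rebuilds (2^24 - 1) * 2^104 with ldexp and compares it against FLT_MAX from float.h; the assert should hold on an IEEE-754 platform:
#include <assert.h>
#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void) {
    /* (2^24 - 1) * 2^104 is exactly the largest finite single-precision value. */
    double largest = ldexp((double)((1L << 24) - 1), 104);
    printf("computed = %e\n", largest);
    printf("FLT_MAX  = %e\n", (double)FLT_MAX);
    assert(largest == (double)FLT_MAX);
    return 0;
}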
The values for the float data type come from having 32 bits in total to represent the number, which are allocated like this:
1 bit: sign bit
8 bits: exponent p
23 bits: mantissa
The exponent is stored as p + BIAS, where the BIAS is 127. The mantissa has 23 stored bits plus a 24th hidden bit that is assumed to be 1. This hidden bit is the most significant bit (MSB) of the mantissa, and for normal numbers the exponent must be chosen so that it is 1.
This means that the smallest normal number you can represent is 0 00000001 00000000000000000000000 (sign 0, stored exponent 1, mantissa 0), which is 1 x 2^-126 = 1.17549435E-38.
The largest finite value is 0 11111110 11111111111111111111111 (sign 0, stored exponent 254, mantissa all ones): the mantissa is 2 - 2^-23 = 2 x (1 - 1/2^24) and the unbiased exponent is 254 - 127 = 127, which gives (1 - 1/2^24) x 2^128 = 3.40282347E38.
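To see that layout in practice, here is a small illustrative sketch (not from the answer; the variable names are just for this example) that extracts the three fields from the bit pattern of the smallest normal float:
#include <inttypes.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    float f = 1.17549435E-38f; /* smallest normal float */
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);

    uint32_t sign     = bits >> 31;
    uint32_t exponent = (bits >> 23) & 0xFF; /* stored as p + 127 */
    uint32_t mantissa = bits & 0x7FFFFF;     /* hidden leading 1 is not stored */

    /* Expected: sign=0 exponent=1 mantissa=0x000000, i.e. 1.0 * 2^(1 - 127). */
    printf("sign=%" PRIu32 " exponent=%" PRIu32 " mantissa=0x%06" PRIX32 "\n",
           sign, exponent, mantissa);
    return 0;
}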
The same principles apply to double precision except the bits are:
1 bit: sign bit
11 bits: exponent bits
52 bits: mantissa bits
BIAS: 1023
So, technically, the limits come from the IEEE-754 standard for representing floating point numbers, and the above is how those limits come about.
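As a sanity check of the double limits, here is a minimal sketch (mine, not the answer's) that rebuilds DBL_MIN and DBL_MAX from the formulas above with ldexp; the asserts should hold on an IEEE-754 platform:
#include <assert.h>
#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void) {
    /* Smallest normal double: 1.0 * 2^-1022 (stored exponent 1, BIAS 1023). */
    double smallest = ldexp(1.0, -1022);
    /* Largest finite double: (2 - 2^-52) * 2^1023. */
    double largest = ldexp(2.0 - ldexp(1.0, -52), 1023);
    printf("smallest normal = %e (DBL_MIN = %e)\n", smallest, DBL_MIN);
    printf("largest finite  = %e (DBL_MAX = %e)\n", largest, DBL_MAX);
    assert(smallest == DBL_MIN);
    assert(largest == DBL_MAX);
    return 0;
}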
As dasblinkenlight already answered, the numbers come from the way that floating point numbers are represented in IEEE-754, and Andreas has a nice breakdown of the maths.
However, be careful: the precision of floating point numbers isn't exactly 6 or 15 significant decimal digits as the table suggests, since the precision of IEEE-754 numbers depends on the number of significant binary digits.
float has 24 significant binary digits, which depending on the number represented translates to 6-8 significant decimal digits.
double has 53 significant binary digits, which is approximately 15 significant decimal digits.
Another answer of mine has further explanation if you're interested.
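To make that concrete, here is a small sketch (my own addition) that prints the guaranteed decimal precision macros from float.h and stores the same decimal constant in both types to show where the binary precision runs out:
#include <float.h>
#include <stdio.h>

int main(void) {
    /* Decimal digits guaranteed to survive a text -> float -> text round trip. */
    printf("FLT_DIG = %d, DBL_DIG = %d\n", FLT_DIG, DBL_DIG);

    float  f = 0.1f;
    double d = 0.1;
    printf("float  0.1 = %.20f\n", f);
    printf("double 0.1 = %.20f\n", d);
    return 0;
}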
Infinity, NaN and subnormals
These are important caveats that no other answer has mentioned so far.
First read this introduction to IEEE 754 and subnormal numbers: What is a subnormal floating point number?
Then, for single precision floats (32-bit):
IEEE 754 says that if the exponent is all ones (0xFF == 255), then it represents either NaN or Infinity. This is why the largest non-infinite number has stored exponent 0xFE == 254 and not 0xFF. With the bias, that exponent becomes 254 - 127 == 127.
FLT_MIN is the smallest normal number. But there are smaller subnormal ones! Those take up the -127 exponent slot (the all-zeros stored exponent, which has no implicit leading 1).
All asserts of the following program pass on Ubuntu 18.04 amd64:
#include <assert.h>
#include <float.h>
#include <inttypes.h>
#include <math.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

/* Assemble a float from its sign, exponent and fraction fields. */
float float_from_bytes(
    uint32_t sign,
    uint32_t exponent,
    uint32_t fraction
) {
    uint32_t bytes;
    float f;
    bytes = 0;
    bytes |= sign;
    bytes <<= 8;
    bytes |= exponent;
    bytes <<= 23;
    bytes |= fraction;
    /* Copy the bit pattern instead of casting the pointer,
     * to avoid strict-aliasing undefined behaviour. */
    memcpy(&f, &bytes, sizeof f);
    return f;
}
int main(void) {
    /* All 1 exponent and non-0 fraction means NaN.
     * There are of course many possible representations,
     * and some have special semantics such as signalling vs not.
     */
    assert(isnan(float_from_bytes(0, 0xFF, 1)));
    assert(isnan(NAN));
    printf("nan = %e\n", NAN);

    /* All 1 exponent and 0 fraction means infinity. */
    assert(INFINITY == float_from_bytes(0, 0xFF, 0));
    assert(isinf(INFINITY));
    printf("infinity = %e\n", INFINITY);

    /* ANSI C defines FLT_MAX as the largest non-infinite number. */
    assert(FLT_MAX == 0x1.FFFFFEp127f);
    /* Not 0xFF because that is infinite. */
    assert(FLT_MAX == float_from_bytes(0, 0xFE, 0x7FFFFF));
    assert(!isinf(FLT_MAX));
    assert(FLT_MAX < INFINITY);
    printf("largest non infinite = %e\n", FLT_MAX);

    /* ANSI C defines FLT_MIN as the smallest non-subnormal number. */
    assert(FLT_MIN == 0x1.0p-126f);
    assert(FLT_MIN == float_from_bytes(0, 1, 0));
    assert(isnormal(FLT_MIN));
    printf("smallest normal = %e\n", FLT_MIN);

    /* The smallest non-zero subnormal number. */
    float smallest_subnormal = float_from_bytes(0, 0, 1);
    assert(smallest_subnormal == 0x0.000002p-126f);
    assert(0.0f < smallest_subnormal);
    assert(!isnormal(smallest_subnormal));
    printf("smallest subnormal = %e\n", smallest_subnormal);

    return EXIT_SUCCESS;
}
Compile and run with:
gcc -ggdb3 -O0 -std=c11 -Wall -Wextra -Wpedantic -Werror -o subnormal.out subnormal.c
./subnormal.out
Output:
nan = nan
infinity = inf
largest non infinite = 3.402823e+38
smallest normal = 1.175494e-38
smallest subnormal = 1.401298e-45
It's a consequence of the size of the exponent part of the type, as specified in IEEE 754 for example. You can examine these limits with FLT_MAX, FLT_MIN, DBL_MAX and DBL_MIN in float.h.
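For example, a minimal sketch that just prints those macros (assuming a C11 float.h, which also provides FLT_TRUE_MIN and DBL_TRUE_MIN for the smallest subnormals):
#include <float.h>
#include <stdio.h>

int main(void) {
    printf("FLT_MIN      = %e\n", FLT_MIN);
    printf("FLT_MAX      = %e\n", FLT_MAX);
    printf("FLT_TRUE_MIN = %e\n", FLT_TRUE_MIN); /* smallest subnormal, C11 */
    printf("DBL_MIN      = %e\n", DBL_MIN);
    printf("DBL_MAX      = %e\n", DBL_MAX);
    printf("DBL_TRUE_MIN = %e\n", DBL_TRUE_MIN); /* smallest subnormal, C11 */
    return 0;
}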
Source: https://stackoverflow.com/questions/10108053/ranges-of-floating-point-datatype-in-c