I'm trying to understand why the uint64_t type cannot show pow(2,64)-1 properly. The __cplusplus macro reports 199711L.
I checked the pow()
TL;DR: It's not that the uint64_t type cannot show pow(2,64)-1 properly, but the reverse: double can't store 2^64 - 1 exactly because it lacks significand bits. You can only do that with types that have 64 bits of precision or more (like long double on many platforms). Try std::pow(2.0L, 64) - 1.0L (note the L suffix) or powl(2.0L, 64) - 1.0L and see for yourself.
Anyway, you shouldn't use a floating-point type for integer math in the first place. Not only is pow(2, x) far slower to calculate than 1ULL << x, it also causes the issue you saw because of the limited precision of double. Use uint64_t max2 = -1 instead, or ((unsigned __int128)1ULL << 64) - 1 if the compiler supports __int128.
pow(2, 64) - 1 is a double expression, not an int, because pow doesn't have any overload that returns an integral type. The integer 1 is converted to double to match the type of the result of pow.
However, because IEEE-754 double precision is only 64 bits long, with a 53-bit significand, it can never store values that need 64 significant bits or more, like 2^64 - 1. So pow(2, 64) - 1 is rounded to the closest representable value, which is pow(2, 64) itself, and pow(2, 64) - 1 == pow(2, 64) evaluates to 1 (true). The largest representable value below it is 18446744073709549568 = 2^64 - 2048. You can check that with std::nextafter.
On some platforms (notably x86, except on MSVC) long double does have 64 bits of significand, so you'll get the correct value in that case. The following snippet
#include <cmath>
#include <iostream>

int main() {
    double max1 = std::pow(2, 64) - 1;  // rounds up to 2^64
    std::cout << "pow(2, 64) - 1 = " << std::fixed << max1 << '\n';
    std::cout << "Previous representable value: " << std::nextafter(max1, 0) << '\n';
    std::cout << (std::pow(2, 64) - 1 == std::pow(2, 64)) << '\n';
    long double max2 = std::pow(2.0L, 64) - 1.0L;  // exact if long double has a 64-bit significand
    std::cout << std::fixed << max2 << '\n';
}
prints out
pow(2, 64) - 1 = 18446744073709551616.000000
Previous representable value: 18446744073709549568.000000
1
18446744073709551615.000000
You can clearly see that long double can store the correct value, as expected.
On many other platforms long double may be IEEE-754 quadruple-precision or double-double. Both have more than 64 bits of significand, so you can do the same thing there, but of course the overhead will be higher.
Floating-point numbers have finite precision.
On your system (and typically, assuming the binary64 IEEE-754 format) 18446744073709551615 is not a number that has a representation in the double format. The closest number that does have a representation happens to be 18446744073709551616.
Subtracting (or adding) two floating-point numbers of wildly different magnitudes usually produces a rounding error. This error can be significant in relation to the smaller operand. In the case of 18446744073709551616. - 1. -> 18446744073709551616. the error of the subtraction is 1, which is in fact the same value as the smaller operand.
When a floating point value is converted to an integer type, and the value cannot fit in the integer type, the behaviour of the program is undefined - even when the integer type is unsigned.