Why can't uint64_t show pow(2, 64) - 1 properly?

眼角桃花 2021-01-29 06:21

I'm trying to understand why the uint64_t type cannot show pow(2,64)-1 properly. The C++ standard version (__cplusplus) is 199711L.

I checked the pow()

2 Answers
  • 2021-01-29 06:49

    TL;DR: It's not that uint64_t cannot show pow(2,64)-1 properly but the reverse: double cannot store 2^64 - 1 exactly because it does not have enough significand bits. You can only do that with types that have 64 bits of precision or more (like long double on many platforms). Try std::pow(2.0L, 64) - 1.0L (note the L suffix) or powl(2.0L, 64) - 1.0L; and see

    Anyway, you shouldn't use a floating-point type for integer math in the first place. Not only is pow(2, x) far slower to calculate than 1ULL << x, it also causes the issue you saw because of the limited precision of double. Use uint64_t max2 = -1 instead, or ((unsigned __int128)1ULL << 64) - 1 if the compiler supports __int128.
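
    The integer-only approach can be sketched as follows. This is a minimal illustration assuming a C++11 compiler; the helper name low_bits_mask is hypothetical and not part of any library. It special-cases n == 64 because shifting a 64-bit value by 64 is undefined behaviour, which is also why 1ULL << 64 cannot be written directly.

    ```cpp
    #include <cassert>
    #include <cstdint>
    #include <limits>

    // Hypothetical helper: returns 2^n - 1 for 0 <= n <= 64 using integer math only.
    inline std::uint64_t low_bits_mask(unsigned n) {
        assert(n <= 64);
        // A shift by the full width (64) is undefined, so handle it separately.
        if (n == 64) return std::numeric_limits<std::uint64_t>::max();
        return (static_cast<std::uint64_t>(1) << n) - 1;
    }
    ```

    With this, low_bits_mask(64) gives the same value as uint64_t max2 = -1, with no floating-point type involved.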


    pow(2, 64) - 1 is a double expression, not an int, because pow has no overload that returns an integral type. The integer 1 is converted to double to match the type of the result of pow.

    However, because IEEE-754 double precision is only 64 bits long in total (with a 53-bit significand), it can never store values that need 64 significant bits or more, like 2^64 - 1.

    • 64-bit unsigned integers which cannot map onto a double
    • Are all integer values perfectly represented as doubles?
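
    To see which 64-bit integers survive a trip through double, a small round-trip check can help. This is only a sketch, assuming double is IEEE-754 binary64; the helper name survives_double is hypothetical. It refuses to convert back when the double is 2^64 or larger, because that conversion would itself be undefined.

    ```cpp
    #include <cassert>
    #include <cstdint>

    // Hypothetical helper: true if v is unchanged by uint64_t -> double -> uint64_t.
    inline bool survives_double(std::uint64_t v) {
        const double d = static_cast<double>(v);
        // 2^64 as a double; converting a value this large back would be undefined.
        const double two_pow_64 = 18446744073709551616.0;
        if (d >= two_pow_64) return false;
        return static_cast<std::uint64_t>(d) == v;
    }

    // Usage check: 2^53 is exact, 2^53 + 1 and 2^64 - 1 are not.
    inline void survives_double_demo() {
        assert(survives_double(1ULL << 53));
        assert(!survives_double((1ULL << 53) + 1));
        assert(!survives_double(18446744073709551615ULL));
    }
    ```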

    So pow(2, 64) - 1 is rounded to the closest representable value, which is pow(2, 64) itself, and pow(2, 64) - 1 == pow(2, 64) evaluates to 1 (true). The largest representable value below it is 18446744073709549568 = 2^64 - 2048. You can check that with std::nextafter

    On some platforms (notably x86, except on MSVC) long double does have 64 bits of significand, so you'll get the correct value in that case. The following snippet

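    #include <cmath>
    #include <iostream>

    int main() {
        // 2^64 - 1 is not representable as a double, so this rounds to 2^64
        double max1 = std::pow(2, 64) - 1;
        std::cout << "pow(2, 64) - 1 = " << std::fixed << max1 << '\n';
        std::cout << "Previous representable value: " << std::nextafter(max1, 0) << '\n';
        std::cout << (std::pow(2, 64) - 1 == std::pow(2, 64)) << '\n';

        // long double operands keep the whole computation in long double
        long double max2 = std::pow(2.0L, 64) - 1.0L;
        std::cout << std::fixed << max2 << '\n';
    }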
    

    prints out

    pow(2, 64) - 1 = 18446744073709551616.000000
    Previous representable value: 18446744073709549568.000000
    1
    18446744073709551615.000000
    

    You can clearly see long double can store the correct value as expected

    On many other platforms long double may be IEEE-754 quadruple-precision or double-double. Both have more than 64 bits of significand, so you can do the same thing there, although the overhead will be higher.
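
    Whether a given platform's long double is wide enough can be checked at run time through std::numeric_limits. The following is a rough sketch; the helper name holds_every_uint64 is hypothetical, and it only looks at the significand width, on the assumption that the exponent range is not the limiting factor.

    ```cpp
    #include <cassert>
    #include <cstdint>
    #include <limits>

    // Hypothetical helper: true if floating-point type T has a binary significand
    // of at least 64 bits, i.e. enough to hold every uint64_t value exactly.
    template <typename T>
    bool holds_every_uint64() {
        return std::numeric_limits<T>::radix == 2 &&
               std::numeric_limits<T>::digits >=
                   std::numeric_limits<std::uint64_t>::digits;
    }

    // Usage check: a 53-bit double and a 24-bit float are both too narrow.
    inline void holds_every_uint64_demo() {
        assert(!holds_every_uint64<double>());
        assert(!holds_every_uint64<float>());
    }
    ```

    Calling holds_every_uint64<long double>() then tells you whether the long double trick above can work on your compiler. The result depends on the platform.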

  • 2021-01-29 06:55

    Floating point numbers have finite precision.

    On your system (and typically, assuming binary64 IEEE-754 format) 18446744073709551615 is not a number that has a representation in the double format. The closest number that does have a representation happens to be 18446744073709551616.

    Adding or subtracting two floating point numbers of wildly different magnitudes usually produces a rounding error. This error can be significant in relation to the smaller operand. In the case of 18446744073709551616. - 1. -> 18446744073709551616. the error of the subtraction is 1, which is in fact the same value as the smaller operand.
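
    The size of that error follows from the spacing between neighbouring doubles. Here is a minimal sketch, assuming IEEE-754 binary64 and C++11 (for std::nextafter); the helper name gap_below is hypothetical. Just below 2^64 the spacing is 2048, so subtracting 1 lands far closer to 2^64 than to the next smaller double and rounds straight back.

    ```cpp
    #include <cassert>
    #include <cmath>

    // Hypothetical helper: distance from a positive finite x down to the
    // next smaller representable double.
    inline double gap_below(double x) {
        return x - std::nextafter(x, 0.0);
    }

    // Usage check: the gap below 2^64 is 2048, so "- 1.0" is absorbed.
    inline void gap_below_demo() {
        const double two_pow_64 = std::ldexp(1.0, 64);  // exactly 2^64
        assert(gap_below(two_pow_64) == 2048.0);
        assert(two_pow_64 - 1.0 == two_pow_64);
    }
    ```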

    When a floating point value is converted to an integer type, and the value cannot fit in the integer type, the behaviour of the program is undefined - even when the integer type is unsigned.
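
    One way to avoid that undefined behaviour is to test the range before converting. This is only a sketch under the assumption that double is IEEE-754 binary64; the helper name to_uint64_checked is hypothetical.

    ```cpp
    #include <cassert>
    #include <cstdint>

    // Hypothetical helper: converts d to uint64_t only when the truncated value fits.
    // Returns false (leaving out untouched) for negative, too-large, or NaN inputs.
    inline bool to_uint64_checked(double d, std::uint64_t& out) {
        const double two_pow_64 = 18446744073709551616.0;  // 2^64, exactly representable
        // Written so that NaN fails the test: every comparison with NaN is false.
        if (!(d > -1.0 && d < two_pow_64)) return false;
        out = static_cast<std::uint64_t>(d);  // truncates toward zero
        return true;
    }

    // Usage check: 2^64 is rejected, an in-range value is truncated.
    inline void to_uint64_checked_demo() {
        std::uint64_t r = 7;
        assert(!to_uint64_checked(18446744073709551616.0, r) && r == 7);
        assert(to_uint64_checked(42.9, r) && r == 42);
    }
    ```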
