Question
In particular, I'm interested in whether int32_t is always losslessly converted to double.
Does the following code always return true?
int is_lossless(int32_t i)
{
    double d = i;
    int32_t i2 = d;
    return (i2 == i);
}
What about int64_t?
Answer 1:
Question: Does the following code always return true?
"Always" is a strong claim, and strictly speaking the answer is no.
The C++ Standard does not say whether the floating-point types known to C++ (float, double and long double) are IEEE-754 types. The standard explicitly states:
There are three floating-point types: float, double, and long double. The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double. The value representation of floating-point types is implementation-defined. [Note: This document imposes no requirements on the accuracy of floating-point operations; see also [support.limits]. — end note] Integral and floating-point types are collectively called arithmetic types. Specialisations of the standard library template std::numeric_limits shall specify the maximum and minimum values of each arithmetic type for an implementation.
source: C++ standard: basic fundamentals
Most commonly, the type double represents the IEEE 754 double-precision binary floating-point format binary64, and can be depicted as:
[Figure: binary64 bit layout]
and decoded as:
[Figure: binary64 decoding formula]
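As a sketch, a normal (finite, non-subnormal) binary64 value with sign bit s, 11-bit biased exponent e, and stored fraction bits b_51 ... b_0 decodes as shown below. The symbol names are chosen here for illustration and are not taken from the original answer.

```latex
x = (-1)^{s} \times \left(1 + \sum_{i=1}^{52} b_{52-i}\, 2^{-i}\right) \times 2^{e - 1023}
```

The leading 1 is the hidden bit, which is why the significand carries 53 bits of precision although only 52 are stored.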
However, there is a plethora of other floating-point formats that are decoded differently and do not necessarily have the same properties as the well-known IEEE-754. Nonetheless, they are all broadly similar:
- They are n bits long
- One bit represents the sign
- m bits represent the significand, with or without a hidden first bit
- e bits represent some form of an exponent of a given base (2 or 10)
To know whether a double can represent every 32-bit signed integer, you must answer the following questions (assuming the floating-point number is in base 2):
- Does my floating-point representation have a hidden first bit in the significand? If so, assume m=m+1
- A 32-bit signed integer is represented by 1 sign bit and 31 bits representing the number. Is the significand large enough to hold those 31 bits?
- Is the exponent large enough to represent a number of the form 1.xxxxx × 2^31?
If you can answer yes to the last two questions, then yes, an int32_t can always be represented by the double that is implemented on this particular system.
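The questions above can be sketched as a compile-time check using std::numeric_limits. The function name double_holds_all_int32 is hypothetical, and the check is conservative: it only covers the base-2 case discussed here, so a false result means "not shown by this check" rather than "impossible".

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: true if this implementation's double can hold every
// int32_t exactly, using the three questions from the text (base 2 only).
//  - digits already counts a hidden first bit, so no manual m+1 is needed.
//  - 31 significand bits cover every magnitude up to 2^31 - 1.
//  - max_exponent >= 32 means 2^31 is a finite value, which covers INT32_MIN.
constexpr bool double_holds_all_int32()
{
    return std::numeric_limits<double>::radix == 2
        && std::numeric_limits<double>::digits >= 31
        && std::numeric_limits<double>::max_exponent >= 32;
}
```

On a platform with an IEEE 754 binary64 double, digits is 53 and max_exponent is 1024, so the check passes with a wide margin.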
Note: I ignored decimal32 and decimal64 numbers, as I have no direct knowledge about them.
Answer 2:
When is integer to floating point conversion lossless?
When the floating point type has enough precision and range to encode all possible values of the integer type.
Does the following int32_t code always return true? --> Yes.
Does the following int64_t code always return true? --> No.
As DBL_MAX is at least 1E+37, the range is sufficient for at least int122_t, so let us look at precision.
With the common double (base 2, a sign bit, a 53-bit significand, and an exponent), all values of int54_t with its 53 value bits can be represented exactly. INT54_MIN is also representable. This double has DBL_MANT_DIG == 53, which in this case is the number of base-2 digits in the floating-point significand.
The smallest-magnitude non-representable value would be INT54_MAX + 2. Types int55_t and wider have values not exactly representable as a double.
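A minimal sketch of that boundary, assuming the common double with DBL_MANT_DIG == 53 (enforced by the static_assert). The names round_trips and two_pow_53 are introduced here for illustration; two_pow_53 equals INT54_MAX + 1.

```cpp
#include <cfloat>
#include <cstdint>

// Assumption: base-2 double with a 53-bit significand (e.g. IEEE 754 binary64).
static_assert(FLT_RADIX == 2 && DBL_MANT_DIG == 53,
              "sketch assumes a 53-bit base-2 significand");

// True if i survives int64_t -> double -> int64_t unchanged.
// Only call this with |i| well below 2^63, so the conversion back to
// int64_t cannot overflow (that case is undefined behavior).
bool round_trips(std::int64_t i)
{
    double d = static_cast<double>(i);
    return static_cast<std::int64_t>(d) == i;
}

// 2^53, i.e. INT54_MAX + 1. Still exact, because it is a power of two.
constexpr std::int64_t two_pow_53 = std::int64_t{1} << 53;
```

With these definitions, round_trips(two_pow_53) holds, while round_trips(two_pow_53 + 1) does not: 2^53 + 1 would need 54 significand bits, so it is rounded to a neighbouring even value.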
With uintN_t types, there is 1 more value bit. The typical double can then encode all uint53_t and narrower.
With other possible double encodings, as C specifies DBL_DIG >= 10, all values of int34_t can round trip.
Code is always true with int32_t, regardless of double encoding.
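To spot-check this claim on a given platform, the question's function can be run over a range of values. The following is a sketch only: count_lossy is a hypothetical helper, and a scan over the full int32_t range (2^32 values) is possible but takes noticeably longer than checking the extremes.

```cpp
#include <cstdint>
#include <limits>

// The function from the question, written with explicit casts.
bool is_lossless(std::int32_t i)
{
    double d = static_cast<double>(i);
    std::int32_t i2 = static_cast<std::int32_t>(d);
    return i2 == i;
}

// Hypothetical helper: number of values in [first, last] that do not
// round-trip. A 64-bit counter avoids overflow when last == INT32_MAX.
std::int64_t count_lossy(std::int32_t first, std::int32_t last)
{
    std::int64_t lossy = 0;
    for (std::int64_t v = first; v <= last; ++v) {
        if (!is_lossless(static_cast<std::int32_t>(v))) {
            ++lossy;
        }
    }
    return lossy;
}
```

Calling count_lossy with std::numeric_limits<std::int32_t>::min() and max() as bounds scans every int32_t value; a result of 0 is what the answer predicts.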
What about int64_t?
UB potential with int64_t.
The conversion in int64_t i ... double d = i;, when inexact, yields an implementation-defined choice between the 2 nearest candidates. This is often round-to-nearest. Then i values near INT64_MAX can convert to a double one more than INT64_MAX.
With int64_t i2 = d;, the conversion of the double value one more than INT64_MAX to int64_t is undefined behavior (UB).
A simple prior test to detect this:
#define INT64_MAX_P1 ((INT64_MAX/2 + 1) * 2.0)
if (d == INT64_MAX_P1) return false; // not lossless
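Putting the guard together with the round trip gives something like the sketch below. The names is_lossless64 and int64_max_p1 are hypothetical. It assumes a base-2 double, so that INT64_MIN (a power of two) is exactly representable and only the upper end needs a guard.

```cpp
#include <cstdint>

// 2^63 as a double, formed without overflowing int64_t: (2^62) * 2.0.
// Same expression as the INT64_MAX_P1 macro above.
constexpr double int64_max_p1 = (INT64_MAX / 2 + 1) * 2.0;

// Hypothetical int64_t variant of the question's function.
// Assumption: base-2 double, so -2^63 converts exactly and the only
// out-of-range result of int64_t -> double is a round-up to 2^63.
bool is_lossless64(std::int64_t i)
{
    double d = static_cast<double>(i);
    if (d >= int64_max_p1) {
        return false;  // converting d back would be undefined behavior
    }
    return static_cast<std::int64_t>(d) == i;
}
```

The guard uses >= rather than == purely as a defensive choice; under the stated assumption the two are equivalent.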
Answer 3:
Note: my answer assumes that double follows IEEE 754 and that both int32_t and int64_t are 2's complement.
Does the following code always return true?
The mantissa/significand of a double is longer than 32 bits, so int32_t => double is always done without error: there is no possible precision error, and no possible overflow/underflow, because the exponent covers more than the needed range of values.
What about int64_t?
But the 53 bits of mantissa/significand (including 1 implicit bit) of a double are not enough to hold the 64 bits of an int64_t => an int64_t whose highest and lowest set bits are far enough apart cannot be stored in a double without precision error (there is still no possible overflow/underflow; the exponent still covers more than the needed range of values).
Answer 4:
If your platform uses IEEE754 for the double, then yes, any int32_t can be represented perfectly in a double. This is not the case for all possible values that an int64_t can have.
(It is possible on some platforms to tweak the mantissa / exponent sizes of floating point types to make the transformation lossy, but such a type would not be an IEEE754 double.)
To test for IEEE754, use
static_assert(std::numeric_limits<double>::is_iec559, "IEEE 754 floating point");
Source: https://stackoverflow.com/questions/63244349/when-is-integer-to-floating-point-conversion-lossless