sign changes when going from int to float and back

感情败类 2020-12-14 00:04

Consider the following code, which is an SSCCE of my actual problem:

#include 

int roundtrip(int x)
{
    return int(float(x));
}

int main(         


        
2 Answers
  • 2020-12-14 00:51

    Pascal's answer is fine, but it lacks detail, so some readers may not follow it ;-). If you are interested in how this looks at a lower level (assuming the floating-point operations are handled by the coprocessor rather than in software), read on.

    In the 32 bits of a float (IEEE 754) you can store every integer in the range [-2^24...2^24]. Integers outside that range may also have an exact float representation, but not all of them do. The problem is that a float has only 24 significant bits to play with.
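
The 24-bit limit can be checked with a small round-trip test. This is only a sketch: the function name `is_exact_in_float` is made up for illustration, and it assumes a 32-bit IEEE 754 `float`. The value is widened to a 64-bit integer on the way back so that a result of 2^31 does not overflow.

```cpp
#include <cstdint>

// Hypothetical helper: true when x survives int -> float -> integer unchanged.
bool is_exact_in_float(std::int32_t x)
{
    // volatile forces the value to be stored as a real 32-bit float,
    // so extra register precision cannot hide the rounding.
    volatile float f = static_cast<float>(x);

    // Convert back through a 64-bit integer: every float produced from an
    // int32 fits there, including 2^31.
    return static_cast<std::int64_t>(f) == x;
}
```

With this helper, 16777216 (2^24) and 16777218 are reported as exact, while 16777217 is not, because it would need a 25th significant bit.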

    Here is what an int->float conversion typically looks like at the low level:

    fild dword ptr[your int]
    fstp dword ptr[your float]
    

    It is just a sequence of 2 coprocessor instructions. The first loads the 32-bit int onto the coprocessor's stack and converts it into an 80-bit-wide float.

    Intel® 64 and IA-32 Architectures Software Developer’s Manual (PROGRAMMING WITH THE X87 FPU):

    When floating-point, integer, or packed BCD integer values are loaded from memory into any of the x87 FPU data registers, the values are automatically converted into double extended-precision floating-point format (if they are not already in that format).

    Since the FPU registers are 80-bit-wide floats, there is no issue with fild here: a 32-bit int fits perfectly in the 64-bit significand of that format.

    So far so good.

    The second part, fstp, is a bit tricky and may be surprising. It is supposed to store an 80-bit floating-point value in a 32-bit float. Although the question is all about integer values, the coprocessor may actually perform 'rounding'. Eh? How do you round an integer value, even if it is stored in floating-point format? ;-)

    I'll explain in a moment. Let's first see which rounding modes the x87 provides (they are its incarnation of the IEEE 754 rounding modes). The x87 FPU has 4 rounding modes, controlled by bits #10 and #11 of the FPU's control word:

    • 00 - to nearest even - the rounded result is the closest to the infinitely precise result. If two values are equally close, the result is the even value (that is, the one whose least-significant bit is zero). This is the default.
    • 01 - toward -Inf
    • 10 - toward +Inf
    • 11 - toward 0 (i.e. truncate)

    You can play with the rounding modes using this simple code (it can be done differently; I am showing the low-level way here):

    enum ROUNDING_MODE
    {
        RM_TO_NEAREST  = 0x00,
        RM_TOWARD_MINF = 0x01,
        RM_TOWARD_PINF = 0x02,
        RM_TOWARD_ZERO = 0x03 // TRUNCATE
    };
    
    void set_round_mode(enum ROUNDING_MODE rm)
    {
        short csw;
        short tmp = rm;
    
        _asm
        {
            push ax
            fstcw [csw]
            mov ax, [csw]
            and ax, ~(3<<10)
            shl [tmp], 10
            or ax, tmp
            mov [csw], ax
            fldcw [csw]
            pop ax
        }
    }
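
For comparison, the same experiment can be sketched without inline assembly using the standard `<cfenv>` functions `std::fegetround` and `std::fesetround`. The function name `int_to_float_with_mode` is made up for illustration. This assumes the target performs int->float conversion in hardware that honours the current rounding mode, and that the compiler does not evaluate the conversion at compile time; the `volatile` variables are there to discourage that.

```cpp
#include <cfenv>

// Hypothetical helper: converts x to float under the given rounding mode
// (FE_TONEAREST, FE_DOWNWARD, FE_UPWARD or FE_TOWARDZERO), then restores
// the mode that was active before the call.
float int_to_float_with_mode(int x, int mode)
{
    const int old_mode = std::fegetround();
    std::fesetround(mode);

    // The volatile read and write pin the conversion between the two
    // fesetround calls and prevent constant folding.
    volatile int in = x;
    volatile float out = static_cast<float>(in);

    std::fesetround(old_mode);
    return out;
}
```

Calling it with `FE_TOWARDZERO` and `FE_TONEAREST` on the same input is a quick way to see the two behaviours discussed in the rest of this answer.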
    

    OK, nice, but how is that related to integer values? Patience... To understand why rounding modes might be involved in an int to float conversion, consider the most obvious way of converting int to float, truncation (not the default), which may look like this:

    • record sign
    • negate your int if less than zero
    • find position of leftmost 1
    • shift int to the right/left so that 1 found above is positioned on bit #23
    • record number of shifts during the process so that you can calculate exponent

    And the code simulating this behavior may look like this:

    #include <string.h>

    float int2float(int value)
    {
        // exact for all values from [-2^24...2^24]
        // outside this range only some integers may be represented exactly
        // this method will use truncation 'rounding mode' during conversion

        // zero has an all-zero bit pattern
        if (value == 0) return 0.0f;

        unsigned int sign = 0;
        unsigned int abs_value = (unsigned int)value;

        // handle negative values; negating as unsigned also covers -2^31,
        // whose magnitude does not fit in int (it yields 0xCF000000 below)
        if (value < 0)
        {
            sign = 1U << 31;
            abs_value = 0U - abs_value;
        }

        // find leading one (bit 0 included, otherwise value == 1 is mishandled)
        int bit_num = 31;

        for (; bit_num >= 0; bit_num--)
        {
            if (abs_value & (1U << bit_num))
            {
                if (bit_num >= 23)
                {
                    // need to shift right, the low bits are chopped
                    abs_value >>= (bit_num - 23);
                }
                else
                {
                    // need to shift left
                    abs_value <<= (23 - bit_num);
                }
                break;
            }
        }

        // exponent is biased by 127; move it to the right place
        unsigned int exp = (unsigned int)(bit_num + 127) << 23;

        // clear leading 1 (bit #23) (it will implicitly be there but not stored)
        unsigned int coeff = abs_value & ~(1U << 23);

        unsigned int bits = sign | exp | coeff;

        // copy the bit pattern into a float; casting the pointer instead
        // would violate the aliasing rules
        float ret;
        memcpy(&ret, &bits, sizeof ret);
        return ret;
    }
    

    Now an example: truncation mode converts 2147483583 to 2147483520.

    2147483583 = 01111111_11111111_11111111_10111111
    

    During int->float conversion you must shift the leftmost 1 to bit #23. Right now the leading 1 is in bit #30. To place it in bit #23 you must shift right by 7 positions. In doing so you lose the 7 least significant bits (they do not fit in the 32-bit float format), i.e. you truncate/chop them. They were:

    0111111 = 63
    

    And 63 is what the original number lost:

    2147483583 -> 2147483520 + 63
    

    Truncating is easy, but it is not necessarily what you want, nor is it best in all cases. Consider the example below:

    67108871 = 00000100_00000000_00000000_00000111
    

    The above value cannot be represented exactly by a float, but check what truncation does to it. As before, we need to shift the leftmost 1 to bit #23. This requires shifting the value right by exactly 3 positions, losing the 3 least significant bits (from now on I'll write numbers differently, showing where the implicit 24th bit of the float is and bracketing the 23 explicit bits of the significand):

    00000001.[0000000_00000000_00000000] 111 * 2^26 (3 bits shifted out)
    

    Truncation chops the 3 trailing bits, leaving us with 67108864 (67108864 + 7, the value of the 3 chopped bits, = 67108871). Remember that although we shift, we compensate by adjusting the exponent (omitted here).
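
The two truncation examples can be reproduced with plain integer arithmetic. This is a sketch of the idea only, not of what the FPU does internally; the name `truncate_to_24_bits` is made up for illustration and the input is assumed to be non-negative.

```cpp
#include <cstdint>

// Hypothetical helper: keeps only the 24 most significant bits of x,
// which is what a round-toward-zero conversion to float preserves.
std::uint32_t truncate_to_24_bits(std::uint32_t x)
{
    // Find the position of the leading 1.
    int top = 31;
    while (top > 0 && !(x & (1u << top))) --top;

    // Values that need at most 24 bits are kept unchanged.
    if (top < 24) return x;

    // Clear the bits that would be shifted out of the significand.
    const int dropped = top - 23;
    return x & ~((1u << dropped) - 1u);
}
```

For 2147483583 this clears the low 7 bits and for 67108871 the low 3 bits, matching the values worked out above.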

    Is that good enough? Hey, 67108872 is perfectly representable by a 32-bit float and should be a much better result than 67108864, right? CORRECT, and this is why rounding matters when converting an int to a 32-bit float.

    Now let's see how the default 'rounding to nearest even' mode works and what its implications are in OP's case. Consider the same example one more time.

    67108871 = 00000100_00000000_00000000_00000111
    

    As we know we need 3 right shifts to place leftmost 1 in bit #23:

    00000000_1.[0000000_00000000_00000000] 111 * 2^26 (3 bits shifted out)
    

    The 'rounding to nearest even' procedure involves finding the 2 numbers that bracket the input value 67108871 from below and above as closely as possible. Keep in mind that we are still operating within the FPU on 80 bits, so although I show some bits being shifted out, they are still in the FPU register; they are removed only during the rounding operation when the output value is stored.

    00000000_1.[0000000_00000000_00000000] 111 * 2^26 (3 bits shifted out)
    

    2 values that closely bracket 00000000_1.[0000000_00000000_00000000] 111 * 2^26 are:

    from top:

      00000000_1.[0000000_00000000_00000000] 111 * 2^26
                                         +1
    = 00000000_1.[0000000_00000000_00000001] * 2^26 = 67108872
    

    and from below:

      00000000_1.[0000000_00000000_00000000] * 2^26 = 67108864
    

    Obviously 67108872 is much closer to 67108871 than 67108864 is, hence conversion of the 32-bit int value 67108871 gives 67108872 (in rounding to nearest even mode).

    Now OP's numbers (still rounding to nearest even):

     2147483583 = 01111111_11111111_11111111_10111111
    = 00000000_1.[1111111_11111111_11111111] 0111111 * 2^30
    

    bracket values:

    top:

      00000000_1.[1111111_11111111_11111111] 0111111 * 2^30
                                         +1
    = 00000000_10.[0000000_00000000_00000000] * 2^30
    =  00000000_1.[0000000_00000000_00000000] * 2^31 = 2147483648
    

    bottom:

    00000000_1.[1111111_11111111_11111111] * 2^30 = 2147483520
    

    Keep in mind that the word even in 'rounding to nearest even' matters only when the input value is exactly halfway between the bracket values. Only then does it 'decide' which bracket value is selected. In the above case even does not matter and we must simply choose the nearer value, which is 2147483520.

    OP's last case shows the problem where the word even matters:

     2147483584 = 01111111_11111111_11111111_11000000
    = 00000000_1.[1111111_11111111_11111111] 1000000 * 2^30
    

    bracket values are the same as previously:

    top: 00000000_1.[0000000_00000000_00000000] * 2^31 = 2147483648

    bottom: 00000000_1.[1111111_11111111_11111111] * 2^30 = 2147483520

    There is no nearer value now (2147483648 - 2147483584 = 64 = 2147483584 - 2147483520), so we must rely on even and select the top (even) value, 2147483648.
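
The bracketing procedure can also be written down with integer arithmetic only. Again this is a sketch under assumptions: the name `round_to_24_bits_nearest_even` is made up, the input is non-negative, and the result is returned as a 64-bit integer so that 2^31 and larger fit.

```cpp
#include <cstdint>

// Hypothetical helper: rounds x to 24 significant bits using
// round-to-nearest, ties-to-even.
std::uint64_t round_to_24_bits_nearest_even(std::uint32_t x)
{
    // Find the position of the leading 1.
    int top = 31;
    while (top > 0 && !(x & (1u << top))) --top;
    if (top < 24) return x;                      // already fits in 24 bits

    const int dropped = top - 23;                            // bits shifted out
    const std::uint64_t step  = std::uint64_t{1} << dropped; // gap between bracket values
    const std::uint64_t lower = x & ~(step - 1);             // bracket value from below
    const std::uint64_t rest  = x - lower;                   // value of the shifted-out bits
    const std::uint64_t half  = step / 2;

    if (rest < half) return lower;               // nearer to the bottom value
    if (rest > half) return lower + step;        // nearer to the top value

    // Exactly halfway: choose the bracket value whose last kept bit is 0.
    return (lower & step) ? lower + step : lower;
}
```

It reproduces the three cases above: 67108871 gives 67108872, 2147483583 gives 2147483520, and the halfway case 2147483584 gives 2147483648.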

    And here is OP's problem, which Pascal had briefly described. The FPU works only on signed values, and 2147483648 cannot be stored as a signed int because the maximum int value is 2147483647, hence the issue.

    A simple proof (without documentation quotes) that the FPU works only on signed values, i.e. treats every value as signed, is to debug this:

    unsigned int test = (1u << 31);
    
    _asm
    {
        fild [test]
    }
    

    Although it looks like the test value should be treated as unsigned, it is loaded as -2^31, because there are no separate instructions for loading signed and unsigned values into the FPU. Likewise, you will not find instructions that store an unsigned value from the FPU to memory. Everything is just a bit pattern treated as signed, regardless of how you declared it in your program.

    This was long, but I hope someone learns something from it.

  • 2020-12-14 01:07

    Your program is invoking undefined behavior because of an overflow in the conversion from floating-point to integer. What you see is only the usual symptom on x86 processors.

    The float value nearest to 2147483584 is 2^31 exactly (the conversion from integer to floating-point usually rounds to the nearest, which can be up, and is up in this case. To be specific, the behavior when converting from integer to floating-point is implementation-defined; most implementations define rounding as being “according to the FPU rounding mode”, and the FPU's default rounding mode is to round to the nearest).

    Then, while converting from the float representing 2^31 to int, an overflow occurs. This overflow is undefined behavior. Some processors raise an exception, others saturate. The IA-32 instruction cvttsd2si typically generated by compilers happens to always return INT_MIN in case of overflow, regardless of whether the float is positive or negative.

    You should not rely on this behavior even if you know you are targeting an Intel processor: when targeting x86-64, compilers can emit, for the conversion from floating-point to integer, sequences of instructions that take advantage of the undefined behavior to return results other than what you might otherwise expect for the destination integer type.
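
One way to stay clear of the undefined behavior is to check the range before converting back. The following is a minimal sketch, assuming a 32-bit `int`; the name `float_to_int_checked` is made up for illustration. Both bounds, -2^31 and 2^31, are exactly representable as `float`, so the comparisons themselves involve no rounding.

```cpp
// Hypothetical helper: converts f to int only when the result is
// representable. Returns false (and leaves out untouched) otherwise.
bool float_to_int_checked(float f, int& out)
{
    const float lowest  = -2147483648.0f;  // -2^31, equal to INT_MIN for a 32-bit int
    const float too_big =  2147483648.0f;  //  2^31, one past INT_MAX

    // NaN fails both comparisons, so it is rejected as well.
    if (f >= lowest && f < too_big)
    {
        out = static_cast<int>(f);         // in range, so the conversion is well defined
        return true;
    }
    return false;
}
```

With this check, the round trip of 2147483584 is reported as a failure instead of silently producing a negative number.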
