How can I read a signed integer from a buffer of uint8_t without invoking un- or implementation-defined behaviour?

Submitted by 心已入冬 on 2020-03-18 04:50:27

Question


Here's a simple function that tries to read a generic two's-complement integer from a big-endian buffer, where we'll assume std::is_signed_v<INT_T>:

template<typename INT_T>
INT_T read_big_endian(uint8_t const *data) {
    INT_T result = 0;
    for (size_t i = 0; i < sizeof(INT_T); i++) {
        result <<= 8;
        result |= *data;
        data++;
    }
    return result;
}

Unfortunately, this is undefined behaviour, as the last <<= shifts into the sign bit.


So now we try the following:

template<typename INT_T>
INT_T read_big_endian(uint8_t const *data) {
    std::make_unsigned_t<INT_T> result = 0;
    for (size_t i = 0; i < sizeof(INT_T); i++) {
        result <<= 8;
        result |= *data;
        data++;
    }
    return static_cast<INT_T>(result);
}

But we're now invoking implementation-defined behaviour in the static_cast, converting from unsigned to signed.


How can I do this while staying in the "well-defined" realm?


Answer 1:


Start by assembling bytes into an unsigned value. Unless you need to assemble groups of 9 or more octets, a conforming C99 implementation is guaranteed to have such a type that is large enough to hold them all (a C89 implementation would be guaranteed to have an unsigned type large enough to hold at least four).

In most cases, where you want to convert a sequence of octets to a number, you'll know how many octets you're expecting. If data is encoded as 4 bytes, you should use four bytes regardless of the sizes of int and long (a portable function should return type long).

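/* Assemble four little-endian octets into an unsigned 32-bit value. */
unsigned long octets_to_unsigned32_little_endian(const unsigned char *p)
{
  return p[0] |
    ((unsigned)p[1]<<8) |
    ((unsigned long)p[2]<<16) |
    ((unsigned long)p[3]<<24);
}
/* Reinterpret the assembled value as a two's-complement signed 32-bit number. */
long octets_to_signed32_little_endian(const unsigned char *p)
{
  unsigned long as_unsigned = octets_to_unsigned32_little_endian(p);
  if (as_unsigned < 0x80000000)
    return as_unsigned; /* non-negative: representable in long as-is */
  else
    /* clear the sign bit, then subtract 2**31 in two in-range steps */
    return (long)(as_unsigned^0x80000000UL)-0x40000000L-0x40000000L;
}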

Note that the subtraction is done in two parts, each within the range of a signed long, to allow for the possibility of systems where LONG_MIN is -2147483647. Attempting to convert the byte sequence {0,0,0,0x80} on such a system may yield Undefined Behavior (since it would compute the value -2147483648), but the code should process in fully portable fashion all values which are within the range of "long".
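The same idea can be carried over to the generic big-endian template from the question. The following is only a minimal sketch: the name read_big_endian_signed is hypothetical, and it assumes 8-bit bytes and an INT_T without padding bits. Instead of the two-part subtraction it inverts the bits and computes -(~bits) - 1, which also stays within the range of INT_T at every step. As with the C version, the most negative value is only representable on a two's-complement system.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <type_traits>

// Hypothetical generic variant (the name is not from the original answer).
// Assumes 8-bit bytes and that INT_T has no padding bits.
template <typename INT_T>
INT_T read_big_endian_signed(const std::uint8_t* data) {
    static_assert(std::is_integral_v<INT_T> && std::is_signed_v<INT_T>,
                  "INT_T must be a signed integer type");
    assert(data != nullptr);

    using U = std::make_unsigned_t<INT_T>;

    // Step 1: assemble the octets in an unsigned type, where the
    // arithmetic is modular and cannot overflow.
    U bits = 0;
    for (std::size_t i = 0; i < sizeof(INT_T); ++i) {
        bits = static_cast<U>((bits << 8) | data[i]);
    }

    // Step 2: values up to the maximum of INT_T are representable, so
    // this conversion involves no implementation-defined behaviour.
    if (bits <= static_cast<U>(std::numeric_limits<INT_T>::max())) {
        return static_cast<INT_T>(bits);
    }

    // Step 3: the sign bit is set, so the encoded value is bits - 2**N.
    // ~bits equals 2**N - 1 - bits, which fits in INT_T, so the value
    // can be rebuilt as -(~bits) - 1 with in-range signed arithmetic.
    const U inverted = static_cast<U>(~bits);
    return static_cast<INT_T>(-static_cast<INT_T>(inverted) - 1);
}
```

For instance, the bytes {0xFF, 0xFF, 0xFF, 0xFE} read as std::int32_t give -2, and {0x7F, 0xFF} read as std::int16_t give 32767.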




Answer 2:


Unfortunately, this is undefined behaviour, as the last <<= shifts into the sign bit.

Actually, in C++17, left-shifting a signed integer that has a negative value is undefined behavior. Left-shifting a signed integer that has a non-negative value into the sign bit is implementation-defined behavior. See also:

2 The value of E1 << E2 is E1 left-shifted E2 bit positions; vacated bits are zero-filled. If E1 has an unsigned type, the value of the result is E1 × 2**E2, reduced modulo one more than the maximum value representable in the result type. Otherwise, if E1 has a signed type and non-negative value, and E1 × 2**E2 is representable in the corresponding unsigned type of the result type, then that value, converted to the result type, is the resulting value; otherwise, the behavior is undefined.

(C++17 final working draft, Section 8.8 Shift operators [expr.shift], Paragraph 2, page 132 - emphasis mine)


With C++20, shifting into the sign bit changed from implementation-defined to well-defined behavior:

2 The value of E1 << E2 is the unique value congruent to E1 × 2**E2 modulo 2**N, where N is the width of the type of the result. [Note: E1 is left-shifted E2 bit positions; vacated bits are zero-filled. — end note]

(C++20 latest working draft, Section 7.6.7 Shift operators [expr.shift], Paragraph 2, page 129)

Example:

int i = 2147483647;  // here: 2**31-1 == INT_MAX, int is 32 bits wide
int j = i << 1;      // i.e. -2

Assertion: -2 is the unique value that is congruent to 2147483647 * 2 modulo 2**32

Check:

    a ≡ b (mod n)                      | i.e. there exists an integer k such that:
<=> a - b = k * n
 => -2 - 2147483647 * 2 = k * 2**32
<=> -4294967296 = k * 2**32
<=> k = -1                             | i.e. k is an integer

The value -2 is unique because there is no other value in the domain [INT_MIN .. INT_MAX] that satisfies this congruence relation.
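The check above can also be reproduced mechanically. The following is a minimal sketch, not part of the original answer: the helper wrap_to_32_bits is a hypothetical name, and it works purely in 64-bit arithmetic so that no step depends on how 32-bit shifts behave. It reduces a value modulo 2**32 and then picks the representative that lies in [-2**31, 2**31 - 1].

```cpp
#include <cstdint>

// Hypothetical helper: maps x to the unique value in [-2**31, 2**31 - 1]
// that is congruent to x modulo 2**32. Only 64-bit arithmetic is used.
constexpr std::int64_t wrap_to_32_bits(std::int64_t x) {
    constexpr std::int64_t modulus = std::int64_t{1} << 32;  // 2**32
    std::int64_t r = x % modulus;        // remainder in (-2**32, 2**32)
    if (r < 0) r += modulus;             // now in [0, 2**32)
    if (r >= modulus / 2) r -= modulus;  // now in [-2**31, 2**31)
    return r;
}

// 2147483647 * 2 wraps to -2, matching the manual check.
static_assert(wrap_to_32_bits(std::int64_t{2147483647} * 2) == -2, "i << 1");
// -2 * 2 is already in range and stays -4.
static_assert(wrap_to_32_bits(std::int64_t{-2} * 2) == -4, "j << 1");
```

Because the range [-2**31, 2**31 - 1] contains exactly 2**32 consecutive integers, each residue class modulo 2**32 has exactly one member in it, which is why the result is unique.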


This is a consequence of C++20 mandating two's complement representation of signed integer types:

3 [..] For each value x of a signed integer type, the value of the corresponding unsigned integer type congruent to x modulo 2**N has the same value of corresponding bits in its value representation. (Footnote 41: This is also known as two's complement representation.) [..]

(C++20 latest working draft, Section 6.8.1 Fundamental types [basic.fundamental], Paragraph 3, page 66)


This means that with C++20, your original example invokes defined behavior, as-is.


Additional note: not that this proves anything, but the GCC/Clang undefined behavior sanitizer (invoked with -fsanitize=undefined) only triggers when compiling this example for std <= C++17 and then only complains about the shifting of the negative value (both as expected):

#include <stdio.h>
#include <limits.h>

int main(int argc, char **argv)
{
    int i = INT_MAX - 1 + argc;
    int j = i << 1;
    int k = j << 1;

    printf("%d %d %d\n", i, j, k);

    return 0;
}

Example session (on Fedora 31):

$ g++ -std=c++17 -Wall -Og sign.cc -o sign -fsanitize=undefined
$ ./sign                                                       
sign.cc:8:15: runtime error: left shift of negative value -2
2147483647 -2 -4
$ g++ -std=c++2a -Wall -Og sign.cc -o sign -fsanitize=undefined 
$ ./sign
2147483647 -2 -4


Source: https://stackoverflow.com/questions/46700362/how-can-i-read-a-signed-integer-from-a-buffer-of-uint8-t-without-invoking-un-or
