Question
Background: I was wondering how to (manually) deserialize binary data received through a char * buffer.
Assumptions: As a minimal example, we will consider here that:
- I have only one int serialized through a char* buffer.
- I want to get the original int back from the buffer.
- sizeof(int) == 4 on the target system/platform.
- The endianness of the target system/platform is little-endian.
Note: This is out of purely general interest, so I don't want to use anything like std::memcpy, because with it we would not see the strange behaviour I encountered.
Test: I have built the following test case:
#include <iostream>
#include <bitset>
#include <climits> // CHAR_BIT

void display(int num, char * num_bytes); // Defined below

int main()
{
    // Create neg_num and neg_num_bytes then display them
    int neg_num(-5000);
    char * neg_num_bytes = reinterpret_cast<char*>(&neg_num);
    display(neg_num, neg_num_bytes);
    std::cout << '\n';

    // Create pos_num and pos_num_bytes then display them
    int pos_num(5000);
    char * pos_num_bytes = reinterpret_cast<char*>(&pos_num);
    display(pos_num, pos_num_bytes);
    std::cout << '\n';

    // Get neg_num back from neg_num_bytes through bitmask operations
    int neg_num_back = 0;
    for(std::size_t i = 0; i < sizeof neg_num; ++i)
        neg_num_back |= static_cast<int>(neg_num_bytes[i]) << CHAR_BIT*i; // For little-endian

    // Get pos_num back from pos_num_bytes through bitmask operations
    int pos_num_back = 0;
    for(std::size_t i = 0; i < sizeof pos_num; ++i)
        pos_num_back |= static_cast<int>(pos_num_bytes[i]) << CHAR_BIT*i; // For little-endian

    std::cout << "Reconstructed neg_num: " << neg_num_back << ": " << std::bitset<CHAR_BIT*sizeof neg_num_back>(neg_num_back);
    std::cout << "\nReconstructed pos_num: " << pos_num_back << ": " << std::bitset<CHAR_BIT*sizeof pos_num_back>(pos_num_back) << std::endl;

    return 0;
}
Where display() is defined as:
// Warning: num_bytes must have a size of sizeof(int)
void display(int num, char * num_bytes)
{
    std::cout << num << " (from int) : " << std::bitset<CHAR_BIT*sizeof num>(num) << '\n';
    std::cout << num << " (from char*): ";
    for(std::size_t i = 0; i < sizeof num; ++i)
        std::cout << std::bitset<CHAR_BIT>(num_bytes[sizeof num -1 -i]); // For little-endian
    std::cout << std::endl;
}
The output I get is:
-5000 (from int) : 11111111111111111110110001111000
-5000 (from char*): 11111111111111111110110001111000
5000 (from int) : 00000000000000000001001110001000
5000 (from char*): 00000000000000000001001110001000
Reconstructed neg_num: -5000: 11111111111111111110110001111000
Reconstructed pos_num: -120: 11111111111111111111111110001000
I know the test case code is quite hard to read. To explain it briefly:
- I create an int.
- I create a char* pointing to the first byte of the previously created int (to simulate that I have a real int stored in a char* buffer). Its size is consequently 4.
- I display the int and its binary representation.
- I display the int and the concatenation of each byte stored in the char* buffer, to check that they are the same (in reverse order due to endianness).
- I try to get the original int back from the buffer.
- I display the reconstructed int as well as its binary representation.
I performed this procedure for both negative and positive values. This is why the code is less readable than it should be (sorry for that).
As we can see, the negative value could be reconstructed successfully, but it did not work for the positive one (I expected 5000 and I got -120).
I've run the test with several other negative and positive values, and the conclusion is always the same: it works fine with negative numbers but fails with positive ones.
Question: I'm struggling to understand why concatenating 4 chars into an int via bit-wise shifts changes the char values for positive numbers, while they stay unchanged for negative values.
When we look at the binary representation, we can see that the reconstructed number is not composed of the chars that I concatenated.
Is it related to the static_cast<int>? If I remove it, the integral promotion rule will implicitly apply it anyway. But I need this conversion to an int in order not to lose the result of the shifts.
If this is the heart of the issue, how do I solve it?
Additionally: Is there a better way to get the value back than bit-wise shifting? Something that is not dependent on the endianness of the system/platform.
Perhaps this should be a separate question.
Answer 1:
There are two main things that affect the outcome here:
- The type char can be signed or unsigned; it is an implementation detail left to the compiler.
- When integer conversion happens, signed values are sign-extended.
What is probably happening here is that char is signed on your system and with your compiler. That means that when you convert a byte to an int and its high bit is set, the value will be sign-extended (for example, binary 10000001 will be sign-extended to 11111111111111111111111110000001).
This of course affects your bitwise operations.
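A minimal sketch (not part of the original answer) illustrating this effect, assuming char is signed on the platform; 0x88 is the low byte of 5000 (0x1388), which matches the -120 seen in the question's output:

#include <iostream>
#include <bitset>
#include <climits>

int main()
{
    // Assumes char is signed: 0x88 has its high bit set
    char byte = static_cast<char>(0x88);    // bit pattern 10001000, value -120 as a signed char
    int promoted = static_cast<int>(byte);  // sign-extended to 0xFFFFFF88

    std::cout << std::bitset<CHAR_BIT>(byte) << '\n';                       // 10001000
    std::cout << std::bitset<CHAR_BIT * sizeof promoted>(promoted) << '\n'; // 11111111111111111111111110001000
    std::cout << promoted << std::endl;                                     // -120
}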
The solution is to use an explicit unsigned data type, i.e. unsigned char. I also suggest you use unsigned int (or uint32_t) for your type-conversions and temporary storage of the data, and only convert the full result to plain int.
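Applied to the reconstruction loop from the question, a sketch of this suggestion might look like the following (using uint32_t from <cstdint> as the unsigned accumulator is an assumption of what the answer intends):

#include <iostream>
#include <cstdint>
#include <climits>

int main()
{
    int pos_num = 5000;
    // View the int's storage as raw unsigned bytes, so no byte value is ever negative
    unsigned char * pos_num_bytes = reinterpret_cast<unsigned char*>(&pos_num);

    // Accumulate in an unsigned type: no sign extension can occur here
    std::uint32_t acc = 0;
    for(std::size_t i = 0; i < sizeof pos_num; ++i)
        acc |= static_cast<std::uint32_t>(pos_num_bytes[i]) << CHAR_BIT*i; // For little-endian

    // Convert the complete result back to a plain int only at the end
    int pos_num_back = static_cast<int>(acc);
    std::cout << pos_num_back << std::endl; // 5000
}

Note that for negative values the final unsigned-to-signed conversion is implementation-defined before C++20; from C++20 it is defined to wrap modulo 2^32 and gives back -5000 as expected.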
Answer 2:
This is because static_cast<int>(pos_num_bytes[i]) will return a negative int in some of your cases.
You can replace the last loop with this, if you want to see the issue:
for (std::size_t i = 0; i < sizeof pos_num; ++i)
{
    pos_num_back |= static_cast<int>(pos_num_bytes[i]) << CHAR_BIT * i; // For little-endian
    std::cout << "\npos_num_back: " << std::bitset<CHAR_BIT * sizeof pos_num_back>(pos_num_back) << std::endl;
    std::cout << std::bitset<CHAR_BIT * sizeof pos_num_bytes[i]>(pos_num_bytes[i]) << std::endl;
    std::cout << std::bitset<CHAR_BIT * sizeof pos_num_back>(static_cast<int>(pos_num_bytes[i])) << std::endl;
}
Or you can run this to get the intended result (each byte is copied individually, so any sign extension introduced by the integer promotion is discarded when the result is stored back into a single char):
// Get pos_num back from pos_num_bytes through bitmask operations
int pos_num_back = 0;
char* p_pos_num_back = (char*)(&pos_num_back);
for (std::size_t i = 0; i < sizeof pos_num; ++i)
{
    p_pos_num_back[i] |= pos_num_bytes[i];
}
Source: https://stackoverflow.com/questions/58432136/reading-an-int-through-char-buffer-behaves-different-whether-it-is-positive-or