reinterpret_cast between char* and std::uint8_t* - safe?

Submitted by 荒凉一梦 on 2019-11-27 00:25:41

Question


Now we all sometimes have to work with binary data. In C++ we work with sequences of bytes, and since the beginning char has been our building block. Defined to have a sizeof of 1, it is the byte. And all library I/O functions use char by default. All is good, but there was always a little concern, a little oddity that bugged some people: the number of bits in a byte is implementation-defined.

So in C99 it was decided to introduce several typedefs, the fixed-width integer types, to let developers express themselves easily. Optional, of course, since we never want to hurt portability. Among them, uint8_t, which migrated into C++11 as std::uint8_t, a fixed-width 8-bit unsigned integer type, was the perfect choice for people who really wanted to work with 8-bit bytes.

And so, developers embraced the new tools and started building libraries that explicitly state that they accept 8-bit byte sequences, as std::uint8_t*, std::vector<std::uint8_t> or otherwise.

But, perhaps after very deep thought, the standardization committee decided not to require an implementation of std::char_traits<std::uint8_t>, thereby preventing developers from easily and portably instantiating, say, std::basic_fstream<std::uint8_t> and easily reading std::uint8_ts as binary data. Or maybe some of us don't care about the number of bits in a byte and are happy with it.

But unfortunately, the two worlds collide, and sometimes you have to take data as char* and pass it to a library that expects std::uint8_t*. But wait, you say, isn't char of variable bit width while std::uint8_t is fixed at 8? Will it result in a loss of data?

Well, there is some interesting Standardese on this. char is defined to hold exactly one byte, and a byte is the smallest addressable chunk of memory, so there can't be a type with a bit width smaller than that of char. Next, char is defined to be able to hold UTF-8 code units. This gives us the minimum: 8 bits. So now we have a typedef which is required to be 8 bits wide and a type that is at least 8 bits wide. But are there alternatives? Yes, unsigned char. Remember that the signedness of char is implementation-defined. Any other type? Thankfully, no. All other integral types have required ranges which fall outside of 8 bits.

Finally, std::uint8_t is optional, which means that a library which uses this type will not compile if it's not defined. But what if it compiles? I can say with a great degree of confidence that this means that we are on a platform with 8-bit bytes and CHAR_BIT == 8.
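The two preconditions can be turned into compile-time checks. The following is only a minimal sketch: it relies on UINT8_MAX being defined by <cstdint> exactly when uint8_t is provided, and bits_per_byte is a hypothetical helper name used for illustration.

```cpp
#include <climits>  // CHAR_BIT
#include <cstdint>  // std::uint8_t (optional), UINT8_MAX

// UINT8_MAX is only defined when the implementation provides uint8_t.
#ifndef UINT8_MAX
#error "std::uint8_t is not available on this platform"
#endif

// Fail the build early if a byte is not exactly 8 bits wide.
static_assert(CHAR_BIT == 8, "this code assumes 8-bit bytes");
static_assert(sizeof(std::uint8_t) == 1,
              "std::uint8_t must occupy exactly one byte");

// Hypothetical helper: the byte width that the checks above guarantee.
constexpr int bits_per_byte() { return CHAR_BIT; }
```

If this translation unit compiles, both preconditions listed in the question hold; it says nothing yet about whether std::uint8_t is a character type.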

Once we have this knowledge, that we have 8-bit bytes, that std::uint8_t is implemented as either char or unsigned char, can we assume that we can do reinterpret_cast from char* to std::uint8_t* and vice versa? Is it portable?

This is where my Standardese reading skills fail me. I read about safely derived pointers ([basic.stc.dynamic.safety]) and, as far as I understand, the following:

std::uint8_t* buffer = /* ... */ ;
char* buffer2 = reinterpret_cast<char*>(buffer);
std::uint8_t* buffer3 = reinterpret_cast<std::uint8_t*>(buffer2);

is safe if we don't touch buffer2. Correct me if I'm wrong.

So, given the following preconditions:

  • CHAR_BIT == 8
  • std::uint8_t is defined.

Is it portable and safe to cast char* and std::uint8_t* back and forth, assuming that we're working with binary data and the potential lack of sign of char doesn't matter?

I would appreciate references to the Standard with explanations.

EDIT: Thanks, Jerry Coffin. I'm going to add the quote from the Standard ([basic.lval], §3.10/10):

If a program attempts to access the stored value of an object through a glvalue of other than one of the following types the behavior is undefined:

...

— a char or unsigned char type.

EDIT2: Ok, going deeper. std::uint8_t is not guaranteed to be a typedef of unsigned char. It can be implemented as an extended unsigned integer type, and extended unsigned integer types are not included in §3.10/10. What now?


Answer 1:


Ok, let's get truly pedantic. After reading this, this and this, I'm pretty confident that I understand the intention behind both Standards.

So, doing reinterpret_cast from std::uint8_t* to char* and then dereferencing the resulting pointer is safe and portable and is explicitly permitted by [basic.lval].
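As a rough illustration of this permitted direction, the sketch below hands a std::uint8_t buffer to a char-based interface (here the std::string constructor). bytes_to_string is a hypothetical helper name, not something from the question or from any library.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: view a std::uint8_t buffer through const char*
// and copy it into a std::string. Reading any object through a char
// lvalue is explicitly allowed by [basic.lval].
inline std::string bytes_to_string(const std::uint8_t* data, std::size_t size) {
    return std::string(reinterpret_cast<const char*>(data), size);
}
```

The same cast is what you would write when passing such a buffer to std::ostream::write or a similar char-based I/O function.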

However, doing reinterpret_cast from char* to std::uint8_t* and then dereferencing the resulting pointer is a violation of the strict aliasing rule and is undefined behavior if std::uint8_t is implemented as an extended unsigned integer type.

However, there are two possible workarounds, first:

#include <cstdint>      // std::uint8_t
#include <type_traits>  // std::is_same_v (C++17)

static_assert(std::is_same_v<std::uint8_t, char> ||
    std::is_same_v<std::uint8_t, unsigned char>,
    "This library requires std::uint8_t to be implemented as char or unsigned char.");

With this assert in place, your code will not compile on platforms on which it would result in undefined behavior otherwise.
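One way to keep the check next to the cast it protects is to wrap both in a small header. This is a sketch under assumptions: as_uint8 is a hypothetical name, and std::is_same<...>::value is used instead of std::is_same_v so that it also builds as C++11.

```cpp
#include <cstdint>
#include <type_traits>

// Refuse to compile where std::uint8_t is not a character type, since the
// cast below would then lead to undefined behavior when dereferenced.
static_assert(std::is_same<std::uint8_t, char>::value ||
                  std::is_same<std::uint8_t, unsigned char>::value,
              "std::uint8_t must be char or unsigned char");

// Hypothetical helpers: the only places where the cast is spelled out.
inline std::uint8_t* as_uint8(char* p) {
    return reinterpret_cast<std::uint8_t*>(p);
}

inline const std::uint8_t* as_uint8(const char* p) {
    return reinterpret_cast<const std::uint8_t*>(p);
}
```

Callers then use as_uint8(buffer) rather than writing reinterpret_cast directly, so the assumption is documented and enforced in one spot.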

Second:

std::memcpy(uint8buffer, charbuffer, size);

Cppreference says that std::memcpy accesses objects as arrays of unsigned char so it is safe and portable.
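A slightly fuller sketch of the copying approach, assuming the data may be duplicated rather than aliased; copy_to_uint8 is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: copy `size` bytes from a char buffer into a freshly
// allocated std::uint8_t vector. No std::uint8_t lvalue ever refers to the
// char storage, so the aliasing question does not arise.
inline std::vector<std::uint8_t> copy_to_uint8(const char* src, std::size_t size) {
    assert(src != nullptr || size == 0);
    std::vector<std::uint8_t> dst(size);
    if (size != 0) {
        std::memcpy(dst.data(), src, size);  // byte-wise copy
    }
    return dst;
}
```

The price is an extra copy of the buffer, which may or may not matter depending on how much data passes through.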

To reiterate, in order to be able to reinterpret_cast between char* and std::uint8_t* and work with resulting pointers portably and safely in a 100% standard-conforming way, the following conditions must be true:

  • CHAR_BIT == 8.
  • std::uint8_t is defined.
  • std::uint8_t is implemented as char or unsigned char.

On a practical note, the above conditions are true on 99% of platforms and there is likely no platform on which the first 2 conditions are true while the 3rd one is false.




Answer 2:


If uint8_t exists at all, essentially the only choice is that it's a typedef for unsigned char (or char if it happens to be unsigned). Nothing (but a bitfield) can represent less storage than a char, and the only other type that can be as small as 8 bits is a bool. The next smallest normal integer type is a short, which must be at least 16 bits.

As such, if uint8_t exists at all, you really only have two possibilities: you're either casting unsigned char to unsigned char, or casting signed char to unsigned char.

The former is an identity conversion, so obviously safe. The latter falls within the "special dispensation" given for accessing any other type as a sequence of char or unsigned char in §3.10/10, so it also gives defined behavior.

Since that includes both char and unsigned char, a cast to access it as a sequence of char also gives defined behavior.

Edit: As far as Luc's mention of extended integer types goes, I'm not sure how you'd manage to apply it to get a difference in this case. C++ refers to the C99 standard for the definitions of uint8_t and such, so the quotes throughout the remainder of this come from C99.

§6.2.6.1/3 specifies that unsigned char shall use a pure binary representation, with no padding bits. Padding bits are only allowed in 6.2.6.2/1, which specifically excludes unsigned char. That section, however, describes a pure binary representation in detail -- literally to the bit. Therefore, unsigned char and uint8_t (if it exists) must be represented identically at the bit level.
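The representation claims can be spot-checked at compile time. This is only a sketch: it compares sizes and value-bit counts through std::numeric_limits, which is weaker than proving that the two are the same type, and same_width_as_unsigned_char is a hypothetical helper name.

```cpp
#include <climits>
#include <cstdint>
#include <limits>

// unsigned char has no padding bits: every bit is a value bit.
static_assert(std::numeric_limits<unsigned char>::digits == CHAR_BIT,
              "unsigned char must use all of its bits as value bits");

// uint8_t has exactly 8 value bits and occupies a single byte.
static_assert(std::numeric_limits<std::uint8_t>::digits == 8,
              "std::uint8_t must have exactly 8 value bits");
static_assert(sizeof(std::uint8_t) == sizeof(unsigned char),
              "std::uint8_t must be one byte wide");

// Hypothetical helper: true when both types have the same number of value bits.
constexpr bool same_width_as_unsigned_char() {
    return std::numeric_limits<std::uint8_t>::digits ==
           std::numeric_limits<unsigned char>::digits;
}
```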

To see a difference between the two, we have to assert that some particular bits when viewed as one would produce results different from when viewed as the other -- despite the fact that the two must have identical representations at the bit level.

To put it more directly: a difference in result between the two requires that they interpret bits differently -- despite a direct requirement that they interpret bits identically.

Even on a purely theoretical level, this appears difficult to achieve. On anything approaching a practical level, it's obviously ridiculous.



Source: https://stackoverflow.com/questions/16260033/reinterpret-cast-between-char-and-stduint8-t-safe
