If I have a string:
std::string s = u8"你好";
and in C++20,
std::u8string s = u8"你好";
how std::u8
Since the difference between u8string and string is that one is templated on char8_t and the other on char, the real question is: what is the difference between using char8_t-based strings and char-based strings?
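As a small sketch of that difference, the checks below show which character type a u8 literal uses. It assumes a compiler that defines the __cpp_char8_t feature-test macro whenever char8_t is enabled; the names U8Elem and hello_code_units are made up for illustration. The literal is spelled with universal character names so the example does not depend on the source file's encoding.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>

// Element type of a u8 literal: char before C++20, char8_t from C++20 on.
using U8Elem = std::remove_cv_t<std::remove_reference_t<decltype(u8"x"[0])>>;

#if defined(__cpp_char8_t)
static_assert(std::is_same_v<U8Elem, char8_t>);
static_assert(std::is_same_v<std::u8string, std::basic_string<char8_t>>);
// With char8_t enabled, a u8 literal can no longer initialize a std::string.
static_assert(!std::is_constructible_v<std::string, const char8_t*>);
#else
static_assert(std::is_same_v<U8Elem, char>);
#endif

static_assert(std::is_same_v<std::string, std::basic_string<char>>);

// Either way the literal holds the same UTF-8 code units:
// 3 for U+4F60, 3 for U+597D (the terminator is subtracted).
constexpr std::size_t hello_code_units = sizeof(u8"\u4F60\u597D") - 1;
static_assert(hello_code_units == 6);
```

The bytes are identical in both modes; only the type that carries them changes.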
It really comes down to this: type-based encoding.
Any char-based string (char*, char[], string, etc.) may be encoded in UTF-8. But then again, it may not. You could develop your code under the assumption that every char* equivalent will be UTF-8 encoded. And you could write a u8 in front of every string literal (which, before C++20, still yields a char-based literal) and/or otherwise ensure they're properly encoded. But:
Other people's code may not agree. So you can't use any library that might return char* strings that aren't UTF-8 encoded.
You might accidentally violate your own precepts. After all, char not_utf8[] = "你好"; is conditionally supported C++. The encoding of that char[] will be the compiler's narrow encoding... whatever that is. It may be UTF-8 on some compilers and something else on others.
You can't tell other people's code (or even other people on your team) that this is what you're doing. That is, your API cannot declare that a particular char* is UTF-8 encoded. This has to be something the user assumes or has otherwise read in your documentation, rather than something they see in code.
Note that none of these problems exist for users of UTF-16 or UTF-32. If you use a char16_t-based string, all of these problems go away. If other people's code returns a char16_t string, you know what they're doing. If they return something else, then you know that those things probably aren't UTF-16. Your UTF-16-based code can interop with theirs. If you write an API that returns a char16_t-based string, everyone using your code can see from the type of the string what encoding it is. And this is guaranteed to be a compile error: char16_t not_utf16[] = "你好";
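The same compile-time separation can be checked with type traits. This is only a sketch, assuming C++17 or later; greeting_utf16 is a hypothetical API used for illustration.

```cpp
#include <string>
#include <type_traits>

// char16_t not_utf16[] = "\u4F60\u597D";  // ill-formed: a narrow literal
//                                         // cannot initialize a char16_t array

// A char-based string cannot silently become a UTF-16 string...
static_assert(!std::is_constructible_v<std::u16string, const char*>);
// ...but a char16_t-based one can.
static_assert(std::is_constructible_v<std::u16string, const char16_t*>);

// Hypothetical API: the return type alone tells callers to expect UTF-16.
std::u16string greeting_utf16() {
    return u"\u4F60\u597D";
}
```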
Now yes, there is no guarantee of any of these things. Any particular char16_t string could have any values in it, even those that are illegal for UTF-16. But char16_t represents a type for which the default assumption is a specific encoding. Given that, if you present a string of this type that isn't UTF-16 encoded, it would not be unreasonable to consider this a mistake or perfidy on the user's part: a contract violation.
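Because the type only expresses an assumption, code that receives untrusted data may still want to check it. The following is a minimal sketch of such a check for UTF-16, looking only at surrogate pairing; is_well_formed_utf16 is a hypothetical helper, and C++17 is assumed for std::u16string_view.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Returns true if every surrogate code unit in `s` belongs to a valid
// high/low pair. Nothing in the type system enforces this.
bool is_well_formed_utf16(std::u16string_view s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t c = s[i];
        if (c >= 0xD800 && c <= 0xDBFF) {               // high surrogate
            if (i + 1 == s.size()) return false;        // truncated pair
            const char16_t next = s[i + 1];
            if (next < 0xDC00 || next > 0xDFFF) return false;
            ++i;                                        // skip the low surrogate
        } else if (c >= 0xDC00 && c <= 0xDFFF) {
            return false;                               // low surrogate with no high before it
        }
    }
    return true;
}
```

A string such as std::u16string(1, char16_t(0xD800)) compiles without complaint but fails this check, which is the kind of contract violation described above.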
We can see how C++ has been impacted by lacking similar, type-based facilities for UTF-8. Consider filesystem::path. It can take strings in any Unicode encoding. For UTF-16/32, path's constructor takes char16/32_t-based strings. But you cannot pass a UTF-8 string to path's constructor; the char-based constructor assumes that the encoding is the implementation-defined narrow encoding, not UTF-8. So instead, you have to employ filesystem::u8path, which is a separate function that returns a path constructed from a UTF-8-encoded string.
What's worse is that if you try to pass a UTF-8 encoded char-based string to path's constructor... it compiles fine. Despite being at best non-portable, it may just appear to work.
char8_t, and all of its accoutrements like u8string, exist to give UTF-8 users the same power that users of the other UTF encodings get. In C++20, filesystem::path gains overloads for char8_t-based strings, and u8path is deprecated.
And, as an added bonus, char8_t doesn't have special aliasing language around it: unlike char and unsigned char, it is not allowed to alias objects of arbitrary types. So an API that takes char8_t-based strings is certainly an API that takes a character array, rather than an arbitrary byte array.