Does the C++ standard mandate an encoding for wchar_t?

礼貌的吻别 2021-02-10 07:09

Here are some excerpts from my copy of the 2014 draft standard N4140

22.5 Standard code conversion facets [locale.stdcvt]

3 F

7 answers
  •  生来不讨喜
    2021-02-10 08:03

    Since Elem can be wchar_t, char16_t, or char32_t, clause 4.1 says nothing about a required encoding for wchar_t. It only describes the conversion that is performed.

    From the wording, it is clear that the conversion is between UTF-8 and either UCS-2 or UCS-4, depending on the size of Elem. So if wchar_t is 16 bits wide, the conversion targets UCS-2, and if it is 32 bits wide, it targets UCS-4.
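    To make the size dependence concrete, here is a minimal sketch. The helper `target_encoding` is a hypothetical name invented for illustration, not part of the standard library; it only mirrors the "depending on the size of Elem" wording and assumes that anything narrower than 32 bits is treated as UCS-2.

```cpp
#include <climits>
#include <string>

// Hypothetical helper (not a standard facility): names the fixed-width
// encoding that codecvt_utf8<Elem> would target, judged purely by the
// bit width of Elem. Assumption: fewer than 32 bits means UCS-2.
template <class Elem>
constexpr const char* target_encoding() {
    return sizeof(Elem) * CHAR_BIT >= 32 ? "UCS-4" : "UCS-2";
}
```

    Calling `target_encoding<wchar_t>()` would then report whichever case applies on your platform; the standard itself does not fix the width of wchar_t.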

    Why does the standard mention UCS-2 and UCS-4 rather than UTF-16 and UTF-32? Because codecvt_utf8 converts each multi-byte UTF-8 sequence to a single wide character:

    • UCS-2 covers only a subset of Unicode (the code points up to U+FFFF) and, unlike UTF-16, has no surrogate pair encoding
    • UCS-4 is effectively the same as UTF-32 today: the Unicode code space is capped at U+10FFFF, so every code point fits in one 32-bit unit and, however many emojis are added, no surrogate mechanism is needed for codecvt_utf8 to support
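    The difference between UCS-2 and UTF-16 comes down to a little arithmetic, sketched below. Both function names are hypothetical and chosen for illustration only. `fits_in_ucs2` checks whether a code point can be stored in a single 16-bit unit, and `to_surrogate_pair` shows what UTF-16 (but not UCS-2) does with a code point that cannot.

```cpp
#include <cassert>
#include <utility>

// Hypothetical helper: true if the code point fits in one 16-bit unit,
// i.e. it lies in the range UCS-2 can represent.
constexpr bool fits_in_ucs2(char32_t cp) {
    return cp <= 0xFFFF;
}

// Hypothetical helper: the UTF-16 surrogate pair for a code point above
// U+FFFF. UCS-2 has no equivalent of this step.
std::pair<char16_t, char16_t> to_surrogate_pair(char32_t cp) {
    assert(cp >= 0x10000 && cp <= 0x10FFFF);  // only valid outside the 16-bit range
    const char32_t v = cp - 0x10000;          // 20-bit offset
    return { static_cast<char16_t>(0xD800 + (v >> 10)),     // high surrogate: top 10 bits
             static_cast<char16_t>(0xDC00 + (v & 0x3FF)) }; // low surrogate: bottom 10 bits
}
```

    For example, U+20AC (the euro sign) fits in one 16-bit unit, while U+1F600 does not and becomes the pair 0xD83D 0xDE00 in UTF-16.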

    It is not clear to me, however, what happens if the UTF-8 text contains a sequence corresponding to a Unicode character that is not representable in UCS-2 when the receiving type is char16_t.
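    One way to find out on a given implementation is to try it. The sketch below wraps the conversion in a hypothetical helper, `utf8_to_ucs2`, which reports whether std::codecvt_utf8<char16_t> accepted the input. Note that <codecvt> and std::wstring_convert are deprecated since C++17, so expect deprecation warnings on newer compilers.

```cpp
#include <codecvt>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts UTF-8 to char16_t through codecvt_utf8.
// Returns true and fills `out` on success; returns false if the facet
// reports a conversion error (wstring_convert signals that by throwing
// std::range_error when no error string was supplied).
bool utf8_to_ucs2(const std::string& utf8, std::u16string& out) {
    std::wstring_convert<std::codecvt_utf8<char16_t>, char16_t> conv;
    try {
        out = conv.from_bytes(utf8);
        return true;
    } catch (const std::range_error&) {
        return false;
    }
}
```

    With the three-byte sequence "\xE2\x82\xAC" (U+20AC) the result is a single char16_t holding 0x20AC. Passing a four-byte sequence such as "\xF0\x9F\x98\x80" (U+1F600) shows how your library handles the case in question; an implementation that follows the UCS-2 wording would be expected to report an error rather than emit a surrogate pair, but that is something to verify rather than assume.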
