Does the C++ standard mandate an encoding for wchar_t?

礼貌的吻别 2021-02-10 07:09

Here are some excerpts from my copy of the 2014 draft standard, N4140:

22.5 Standard code conversion facets [locale.stdcvt]

3 For the facet codecvt_utf8:
- The facet shall convert between UTF-8 multibyte sequences and UCS2 or UCS4 (depending on the size of code unit Elem) within the program.

4 For the facet codecvt_utf16:
- The facet shall convert between UTF-16 multibyte sequences and UCS2 or UCS4 (depending on the size of code unit Elem) within the program.

One interpretation of these two paragraphs is that wchar_t must be encoded as either UCS2 or UCS4.

7 Answers
  •  野的像风
    2021-02-10 07:58

    Let us differentiate between wchar_t and string literals built using the L prefix.

    wchar_t is just an integer type, which may be larger than char.

    String literals using the L prefix will generate strings using wchar_t characters. Exactly what that means is implementation-dependent. There is no requirement that such literals use any particular encoding. They might use UTF-16, UTF-32, or something else that has nothing to do with Unicode at all.
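    A quick sketch makes this concrete (results vary by implementation: MSVC typically reports sizeof(wchar_t) == 2 and uses UTF-16 code units for L literals, while glibc on Linux typically reports 4 and uses UTF-32):

        // Implementation-defined: the size of wchar_t and the encoding of
        // wide literals differ across platforms. Build and run this on
        // Windows and Linux to see different answers.
        #include <iostream>

        int main() {
            std::wcout << L"sizeof(wchar_t) = " << sizeof(wchar_t) << L'\n';

            // The numeric value stored here depends on the implementation's
            // wide execution character set (0x2603 on UTF-16/UTF-32 systems).
            wchar_t snowman = L'\u2603';
            std::wcout << L"code unit value: "
                       << static_cast<unsigned long>(snowman) << L'\n';
        }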

    So if you want a string literal that is guaranteed to be encoded in a Unicode format across all platforms, use the u8, u, or U prefixes on the string literal.
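    For example (this assumes C++11/14 semantics, where a u8 literal has type const char[]; C++20 changed it to const char8_t[]):

        // These prefixes guarantee the encoding on every conforming
        // implementation, unlike L.
        #include <string>

        int main() {
            const char*    utf8  = u8"\u00e9";     // UTF-8 bytes: 0xC3 0xA9
            std::u16string utf16 = u"\u00e9";      // one UTF-16 code unit: 0x00E9
            std::u32string utf32 = U"\U0001F600";  // one UTF-32 code unit: 0x0001F600
            (void)utf8; (void)utf16; (void)utf32;
        }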

    The question proposes that, on one interpretation of these two paragraphs, wchar_t must be encoded as either UCS2 or UCS4.

    No, that is not a valid interpretation. wchar_t has no encoding; it's just a type. It is data which is encoded. A string literal prefixed by L may or may not be encoded in UCS2 or UCS4.

    If you provide codecvt_utf8 with a string of wchar_ts that is encoded in UCS2 or UCS4 (as appropriate for sizeof(wchar_t)), then it will work. But not because of wchar_t itself; it works only because the data you provide it is correctly encoded.
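    As a sketch of that point (C++11/14; note that <codecvt> was deprecated in C++17), the following works on common implementations only because wide literals there happen to hold UCS2/UCS4 code point values:

        #include <codecvt>
        #include <iostream>
        #include <locale>
        #include <string>

        int main() {
            std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;

            // U+00E9 is in the BMP, so this works whether codecvt_utf8
            // treats wchar_t data as UCS2 (2-byte wchar_t) or UCS4 (4-byte).
            std::wstring wide = L"caf\u00e9";
            std::string utf8 = conv.to_bytes(wide);

            std::cout << utf8.size() << " bytes\n";  // 5: 'c' 'a' 'f' 0xC3 0xA9
        }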

    If 4.1 said "The facet shall convert between UTF-8 multibyte sequences and UCS2 or UCS4 or whatever encoding is imposed on wchar_t by the current global locale" there would be no problem.

    The whole point of those codecvt_* facets is to perform locale-independent conversions. If you want locale-dependent conversions, you shouldn't use them. You should instead use the global codecvt facet.
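    A minimal sketch of such a locale-dependent conversion, using the codecvt facet of the user's preferred locale (error handling is omitted for brevity, and std::locale("") can throw if the environment specifies an invalid locale):

        #include <cwchar>
        #include <iostream>
        #include <locale>
        #include <string>

        int main() {
            std::locale loc("");  // honors whatever wide/narrow encodings this locale imposes
            using cvt = std::codecvt<wchar_t, char, std::mbstate_t>;
            const cvt& facet = std::use_facet<cvt>(loc);

            std::wstring wide = L"hello";
            std::string narrow(wide.size() * facet.max_length(), '\0');

            // Convert the internal (wide) representation to the external
            // (narrow, locale-defined) representation.
            std::mbstate_t state{};
            const wchar_t* from_next;
            char* to_next;
            facet.out(state,
                      wide.data(), wide.data() + wide.size(), from_next,
                      &narrow[0], &narrow[0] + narrow.size(), to_next);
            narrow.resize(to_next - &narrow[0]);

            std::cout << narrow << '\n';
        }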
