Here are some excerpts from my copy of the 2014 draft standard N4140:

22.5 Standard code conversion facets [locale.stdcvt]

3 For each of the three code conversion facets codecvt_utf8, codecvt_utf16, and codecvt_utf8_utf16: ...

4 For the facet codecvt_utf8:
(4.1) The facet shall convert between UTF-8 multibyte sequences and UCS2 or UCS4 (depending on the size of Elem) within the program. ...
Let us differentiate between wchar_t and string literals built using the L prefix.
wchar_t is just an integer type, which may be larger than char.
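As a quick illustration (the sizes mentioned in the comments are what common implementations use, not guarantees):

    #include <iostream>

    int main() {
        // wchar_t is just a distinct integer type; the standard only requires that
        // it can represent the largest supported extended character set. It is
        // commonly 2 bytes on Windows and 4 bytes on Linux and macOS.
        std::cout << "sizeof(char)    = " << sizeof(char) << '\n'
                  << "sizeof(wchar_t) = " << sizeof(wchar_t) << '\n';
    }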
String literals using the L prefix will produce strings of wchar_t characters. Exactly what that means is implementation-defined: there is no requirement that such literals use any particular encoding. They might use UTF-16, UTF-32, or something else that has nothing to do with Unicode at all.
So if you want a string literal which is guaranteed to be encoded in a Unicode format across all platforms, use the u8, u, or U prefix for the string literal.
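A minimal sketch of the difference; only the prefixes matter here, and the comment on the L literal describes typical implementations rather than a requirement:

    int main() {
        const wchar_t*  w = L"caf\u00e9";  // encoding implementation-defined:
                                           // maybe UTF-16, maybe UTF-32, maybe neither
        const char*     a = u8"caf\u00e9"; // guaranteed UTF-8 (array of char in C++14)
        const char16_t* b = u"caf\u00e9";  // guaranteed UTF-16
        const char32_t* c = U"caf\u00e9";  // guaranteed UTF-32
        (void)w; (void)a; (void)b; (void)c;
    }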
One interpretation of these two paragraphs is that wchar_t must be encoded as either UCS2 or UCS4.
No, that is not a valid interpretation. wchar_t has no encoding; it's just a type. It is data which is encoded. A string literal prefixed by L may or may not be encoded in UCS2 or UCS4.
If you provide codecvt_utf8 with a string of wchar_ts that is encoded in UCS2 or UCS4 (as appropriate to sizeof(wchar_t)), then it will work. But not because of wchar_t; it works only because the data you provide is correctly encoded.
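A minimal sketch of that point, assuming a platform where the L literal really stores the UCS2/UCS4 code point values (as the usual Windows and Linux toolchains do):

    #include <codecvt>
    #include <iostream>
    #include <locale>
    #include <string>

    int main() {
        // codecvt_utf8<wchar_t> converts UTF-8 <-> UCS2 or UCS4, depending on
        // sizeof(wchar_t). It is only correct if the wchar_t data really uses
        // that encoding; the facet cannot know or check what you fed it.
        std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;

        std::wstring wide = L"\u00e9";            // assumes this stores code point U+00E9
        std::string  utf8 = conv.to_bytes(wide);  // 0xC3 0xA9

        std::cout << utf8.size() << " UTF-8 bytes\n";
    }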
If 4.1 said "The facet shall convert between UTF-8 multibyte sequences and UCS2 or UCS4, or whatever encoding is imposed on wchar_t by the current global locale", there would be no problem.
But the whole point of those codecvt_* facets is to perform locale-independent conversions. If you want locale-dependent conversions, you shouldn't use them; use the codecvt facet from the global locale instead.
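For contrast, a rough sketch of the locale-dependent route, using the codecvt facet supplied by a locale (here the user's environment locale; error handling kept minimal):

    #include <cwchar>
    #include <iostream>
    #include <locale>
    #include <string>

    int main() {
        // Ask the locale's own codecvt facet to map wchar_t to whatever narrow
        // multibyte encoding that locale imposes (UTF-8, a code page, ...).
        std::locale loc("");  // the user's environment locale; may throw if unknown
        using cvt_t = std::codecvt<wchar_t, char, std::mbstate_t>;
        const cvt_t& cvt = std::use_facet<cvt_t>(loc);

        std::wstring wide = L"text";
        std::string narrow(wide.size() * cvt.max_length(), '\0');

        std::mbstate_t state{};
        const wchar_t* from_next = nullptr;
        char* to_next = nullptr;
        auto res = cvt.out(state,
                           wide.data(), wide.data() + wide.size(), from_next,
                           &narrow[0], &narrow[0] + narrow.size(), to_next);
        if (res != std::codecvt_base::ok)
            return 1;  // conversion failed (or was not needed)

        narrow.resize(to_next - &narrow[0]);
        std::cout << narrow << '\n';
    }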