Here are some excerpts from my copy of the 2014 draft standard N4140
`wchar_t` is just an integral type. It has a min value, a max value, etc. Its size is not fixed by the standard.
If it is large enough, you can store UCS-2 or UCS-4 data in a buffer of `wchar_t`. This is true regardless of the system you are on, as UCS-2, UCS-4, UTF-16, and UTF-32 are just descriptions of integer values arranged in a sequence.
In C++11, there are `std` APIs that read or write data presuming it has those encodings. In C++03, there are APIs that read or write data using the current locale.
22.5 Standard code conversion facets [locale.stdcvt]
3 For each of the three code conversion facets codecvt_utf8, codecvt_utf16, and codecvt_utf8_utf16:
(3.1) — Elem is the wide-character type, such as wchar_t, char16_t, or char32_t.
4 For the facet codecvt_utf8:
(4.1) — The facet shall convert between UTF-8 multibyte sequences and UCS2 or UCS4 (depending on the size of Elem) within the program.
So here `codecvt_utf8` deals with UTF-8 on one side, and UCS2 or UCS4 (depending on how big `Elem` is) on the other. It does conversion.
The `Elem` (the wide character) is presumed to be encoded in UCS2 or UCS4, depending on how big it is. This does not mean that `wchar_t` is encoded as such; it just means this operation interprets the `wchar_t` as being encoded as such.
How the UCS2 or UCS4 got into the `Elem` is not something this part of the standard cares about. Maybe you set it in there with hex constants. Maybe you read it from I/O. Maybe you calculated it on the fly. Maybe you used a high-quality random-number generator. Maybe you added together the bit-values of an ASCII string. Maybe you calculated a fixed-point approximation of the log* of the number of seconds it takes the moon to change the Earth's day by 1 second. Not these paragraphs' problem. These paragraphs simply mandate how bits are modified and interpreted.
Similar claims hold in the other cases. This does not mandate what format `wchar_t` has. It simply states how these facets interpret `wchar_t` or `char16_t` or `char32_t` (reading or writing).
Other ways of interacting with `wchar_t` use different methods to mandate how the value of the `wchar_t` is interpreted.
`iswalpha` uses the (global) locale to interpret the `wchar_t`, for example. In some locales, the `wchar_t` may be UCS2. In others, it might be some insane Cthulhian encoding whose details enable you to see a new color from out of space.
To be explicit: encodings are not the property of data, or bits. Encodings are properties of interpretation of data. Quite often there is only one proper or reasonable interpretation of data that makes any sense, but the data itself is bits.
The C++ standard does not mandate what is stored in a `wchar_t`. It does mandate what certain operations interpret the contents of a `wchar_t` to be. That section describes how some facets interpret the data in a `wchar_t`.