A variable-length encoding is indirectly forbidden by the standard.
So I have several questions:
How is the following part of the standard handled?
Here's how Microsoft's STL implementation handles the variable-length encoding:
basic_string::operator[] can return a low or a high surrogate in isolation.
basic_string::size() returns the number of wchar_t objects. A surrogate pair (one Unicode character) uses two wchar_t objects and therefore adds two to the size.
basic_string::resize() can truncate a string in the middle of a surrogate pair.
basic_string::insert() can insert in the middle of a surrogate pair.
basic_string::erase() can erase either half of a surrogate pair.
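The behaviors listed above can be reproduced with a short sketch. It is an illustration only, and it makes one substitution: it uses std::u16string instead of std::wstring, because wchar_t is 16 bits wide on Windows but not on every platform, while char16_t always holds one UTF-16 code unit. Both are specializations of the same basic_string template, so the member functions behave the same way. The helper names (sample, is_high_surrogate, is_low_surrogate, demo_code_unit_semantics) are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// U+1F600 is outside the BMP; in UTF-16 it is the surrogate pair D83D DE00.
inline std::u16string sample() { return u"a\U0001F600b"; }

inline bool is_high_surrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
inline bool is_low_surrogate(char16_t c)  { return c >= 0xDC00 && c <= 0xDFFF; }

// Exercises each operation and asserts that it works on code units,
// not on whole Unicode characters.
inline void demo_code_unit_semantics() {
    std::u16string s = sample();
    assert(s.size() == 4);               // 3 characters, but 4 code units
    assert(is_high_surrogate(s[1]));     // operator[] yields half a pair
    assert(is_low_surrogate(s[2]));

    std::u16string t = s;
    t.resize(2);                         // cuts between the two halves
    assert(is_high_surrogate(t.back())); // string now ends in a lone surrogate

    std::u16string e = s;
    e.erase(1, 1);                       // removes only the high half
    assert(e.size() == 3 && is_low_surrogate(e[1]));

    std::u16string i = s;
    i.insert(2, 1, u'x');                // lands between the two halves
    assert(i[1] == 0xD83D && i[2] == u'x' && i[3] == 0xDE00);
}
```

Calling demo_code_unit_semantics() passes every assertion: none of these operations rejects or repairs the broken pairs it creates.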
In general, the pattern should be clear: the STL does not assume that a std::wstring is in UTF-16, nor enforce that it remains UTF-16.
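Since the container enforces nothing, any UTF-16 guarantee has to come from the calling code. The following is a minimal sketch of what such a check might look like, again using std::u16string as a portable stand-in for a 16-bit std::wstring. Neither is_well_formed_utf16 nor count_code_points is part of the standard library; both are hypothetical helpers written for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: true if every surrogate in s belongs to a
// high-then-low pair, i.e. s contains no lone surrogates.
inline bool is_well_formed_utf16(const std::u16string& s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t c = s[i];
        if (c >= 0xD800 && c <= 0xDBFF) {            // high surrogate
            if (i + 1 >= s.size()) return false;     // nothing follows it
            const char16_t next = s[i + 1];
            if (next < 0xDC00 || next > 0xDFFF) return false;
            ++i;                                     // skip the low half
        } else if (c >= 0xDC00 && c <= 0xDFFF) {     // low surrogate with no high
            return false;
        }
    }
    return true;
}

// Hypothetical helper: counts Unicode code points, assuming well-formed
// input, by counting every code unit that is not a low surrogate.
inline std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (char16_t c : s) {
        if (c < 0xDC00 || c > 0xDFFF) ++n;
    }
    return n;
}

// Usage check: a complete pair is well formed, a truncated one is not.
inline void check_examples() {
    std::u16string s = u"a\U0001F600b";
    assert(is_well_formed_utf16(s) && count_code_points(s) == 3);
    s.resize(2);                                     // leaves a lone high surrogate
    assert(!is_well_formed_utf16(s));
}
```

A check like this would typically run at the boundaries where strings enter or leave the program, since every basic_string member in between is free to break a pair.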