Is 16-bit wchar_t formally valid for representing full Unicode?

Asked by 余生分开走 · 2021-01-05 09:28 · 3 answers · 1270 views

In the comp.lang.c++ Usenet group I recently asserted, based on what I thought I knew, that Windows' 16-bit wchar_t, with UTF-16 encoding where sometimes two such values encode a single code point, was not formally valid for representing Unicode. Was that assertion correct?

3 Answers
  • 2021-01-05 09:33

    wchar_t is not now and never was a Unicode character/code point. The C++ standard does not declare that a wide-string literal will contain Unicode characters. The C++ standard does not declare that a wide-character literal will contain a Unicode character. Indeed, the standard doesn't say anything about what wchar_t will contain.

    wchar_t can be used with locale-aware APIs, but those are only relative to the implementation-defined encoding, not any particular Unicode encoding. The standard library functions that take these use their knowledge of the implementation's encoding to do their jobs.

    So, is a 16-bit wchar_t legal? Yes; the standard does not require that wchar_t be sufficiently large to hold a Unicode codepoint.

    Is a string of wchar_t permitted to hold UTF-16 values (or variable width in general)? Well, you are permitted to make strings of wchar_t that store whatever you want (so long as it fits). So for the purposes of the standard, the question is whether standard-provided means for generating wchar_t characters and strings are permitted to use UTF-16.

    Well, the standard library can do whatever it wants to; the standard offers no guarantee that a conversion from any particular character encoding to wchar_t will be a 1:1 mapping. Even char->wchar_t conversion via wstring_convert is not required anywhere in the standard to produce a 1:1 character mapping.

    If a compiler wishes to declare that the wide character set consists of the Basic Multilingual Plane of Unicode, then a declaration like L'\U0001F000' (a character outside the BMP) still produces a single wchar_t, but its value is implementation-defined, per [lex.ccon]/2:

    The value of a wide-character literal containing a single c-char has value equal to the numerical value of the encoding of the c-char in the execution wide-character set, unless the c-char has no representation in the execution wide-character set, in which case the value is implementation-defined.

    And of course, C++ doesn't allow surrogate code points in universal-character-names; \uD800 is a compile error.
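    A short sketch of how these rules play out (the commented-out line is the one a conforming compiler must reject; the specific literal values used here are just examples):

    ```cpp
    #include <cassert>

    int main() {
        // A BMP character in a wide literal is a single wchar_t on any platform;
        // its numeric value is whatever the execution wide-character set says.
        wchar_t e = L'\u00E9';   // U+00E9, LATIN SMALL LETTER E WITH ACUTE
        (void)e;

        // wchar_t s = L'\uD800';  // ill-formed: \uD800 names a surrogate code point

        // char32_t, by contrast, is guaranteed to hold any code point directly,
        // with the code point as its value.
        char32_t c = U'\U0001F000';
        assert(c == 0x1F000);
        return 0;
    }
    ```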

    Where things get murky in the standard is the treatment of strings that contain characters outside of the character set. The above text would suggest that implementations can do what they want. And yet, [lex.string]/16 says this:

    The size of a char32_t or wide string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating U'\0' or L'\0'.

    I say this is murky because nothing says what the behavior should be if a c-char in a string literal is outside the range of the destination character set.

    Windows compilers (both VS and GCC-on-Windows) do indeed cause L"\U0001F000" to have an array size of 3 (two surrogate code units and a single NUL terminator). Is that legal C++ standard behavior? What does it mean to provide a c-char to a string literal that is outside of the valid range for a character set?

    I would say that this is a hole in the standard, rather than a deficiency in those compilers. The standard should make clearer what the conversion behavior in this case ought to be.
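    One way to see the contrast is to compare the guaranteed behavior of char16_t (UTF-16 by definition since C++11) with the implementation-defined behavior of wchar_t; a minimal sketch:

    ```cpp
    #include <cassert>
    #include <cstddef>

    int main() {
        // u"" literals are UTF-16 by definition: an astral code point always
        // becomes a surrogate pair, so two code units plus the terminator.
        constexpr std::size_t n16 = sizeof(u"\U0001F000") / sizeof(char16_t);
        static_assert(n16 == 3, "surrogate pair + terminator");

        // For wchar_t the answer depends on the platform: typically 3 on
        // Windows (16-bit wchar_t, UTF-16) and 2 on Linux (32-bit wchar_t).
        std::size_t nw = sizeof(L"\U0001F000") / sizeof(wchar_t);
        assert(nw == 2 || nw == 3);
        return 0;
    }
    ```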


    In any case, wchar_t is not an appropriate tool for processing Unicode-encoded text. It is not "formally valid" for representing any form of Unicode. Yes, many compilers implement wide-string literals as a Unicode encoding. But since the standard doesn't require this, you cannot rely on it.

    Now obviously, you can stick whatever will fit inside of a wchar_t. So even on platforms where wchar_t is 32 bits, you could shove UTF-16 data into them, with each 16-bit code unit taking up 32 bits. But you couldn't pass such text to any API function that expects the wide character encoding unless you knew that this was the expected encoding for that platform.

    Basically, never use wchar_t if you want to work with a Unicode encoding.
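    If you need a Unicode encoding with guaranteed semantics, the C++11 types carry one on every conforming implementation; a sketch:

    ```cpp
    #include <cassert>
    #include <string>

    int main() {
        // char16_t strings are UTF-16 and char32_t strings are UTF-32 by
        // definition, regardless of platform.
        std::u16string utf16 = u"\U0001F000";  // one astral code point
        std::u32string utf32 = U"\U0001F000";

        assert(utf16.size() == 2);  // surrogate pair: two 16-bit code units
        assert(utf32.size() == 1);  // one 32-bit code unit
        return 0;
    }
    ```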

  • 2021-01-05 09:37


    Q: Is the width of 16 bits for wchar_t in Windows conformant to the standard?

    A: Well, let's see. We will start with the definition of wchar_t from the C99 draft:

    ... largest extended character set specified among the supported locales.

    So we should look at what the supported locales are. For that there are three steps:

    1. We check the documentation for setlocale.
    2. We open the documentation for the locale string, which gives its format:

      locale :: "locale_name"
              | "language[_country_region[.code_page]]"
              | ".code_page"
              | "C"
              | ""
              | NULL
      
    3. We see the list of supported code pages, and we see UTF-8, UTF-16, UTF-32 and what not. We're at a dead end.

    If we start with the C99 definition, it ends with

    ... corresponds to a member of the extended character set.

    The term "character set" is used. If we say that UTF-16 code units are our character set, then all is OK; otherwise it's not. It's vague, and one should not care much: the standards were written many years ago, when Unicode was not yet widespread.

    At the end of the day, we now have C++11 and C11 that define use cases for UTF-8, 16 and 32 with the additional types char16_t and char32_t.


    You need to read about Unicode and you will answer the question yourself.

    Unicode is a character set: a set of well over 100,000 characters. More precisely, it is a mapping between numbers (code points) and characters. Unicode by itself does not imply any particular bit width.

    Then there are four encodings: UTF-7, UTF-8, UTF-16 and UTF-32. UTF stands for Unicode Transformation Format. Each encoding represents a code point as one or more code units; a code point is an actual character from Unicode. Only UTF-32 uses exactly one unit per point.

    On the other hand, each unit fits into a fixed-size integer. So UTF-7 units are at most 7 bits, UTF-16 units are 16 bits, and so on.

    Therefore, in a 16-bit wchar_t string we can hold Unicode text encoded in UTF-16. In particular, in UTF-16 each code point takes one or two units.

    So the final answer: in a single wchar_t you cannot store every Unicode character, only the single-unit ones, but in a string of wchar_t you can store any Unicode text.
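    The one-or-two-units rule is easy to exploit in code: in well-formed UTF-16, a low surrogate always marks the second unit of a pair, so counting code points just means skipping low surrogates. A sketch using char16_t (UTF-16 by definition; the same logic applies to a 16-bit wchar_t holding UTF-16):

    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <string>

    // Count code points in a well-formed UTF-16 sequence by skipping the
    // low (trailing) surrogate of each pair.
    std::size_t count_code_points(const std::u16string& s) {
        std::size_t n = 0;
        for (char16_t u : s)
            if (u < 0xDC00 || u > 0xDFFF)  // not a low surrogate
                ++n;
        return n;
    }

    int main() {
        // 'A' is one unit; U+1F000 is a surrogate pair (two units).
        std::u16string s = u"A\U0001F000";
        assert(s.size() == 3);
        assert(count_code_points(s) == 2);
        return 0;
    }
    ```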

  • 2021-01-05 09:53

    Let's start from first principles:

    (§3.7.3) wide character: bit representation that fits in an object of type wchar_t, capable of representing any character in the current locale

    (§3.7) character: 〈abstract〉 member of a set of elements used for the organization, control, or representation of data

    That, right away, discards full Unicode as a character set (a set of elements/characters) representable on 16-bit wchar_t.

    But wait, Nicol Bolas quoted the following:

    The size of a char32_t or wide string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating U'\0' or L'\0'.

    and then wondered about the behavior for characters outside the execution character set. Well, C99 has the following to say about this issue:

    (§5.1.1.2) Each source character set member and escape sequence in character constants and string literals is converted to the corresponding member of the execution character set; if there is no corresponding member, it is converted to an implementation-defined member other than the null (wide) character.8)

    and further clarifies in a footnote that not all source characters need to map to the same execution character.

    Armed with this knowledge, you can declare that your wide execution character set is the Basic Multilingual Plane, and that you consider surrogates as proper characters themselves, not as mere surrogates for other characters. AFAICT, this means you are in the clear as far as Clause 6 (Language) of ISO C99 cares.

    Of course, don't expect Clause 7 (Library) to play along nicely with you. As an example, consider iswalpha(wint_t). You cannot pass astral characters (characters outside the BMP) to that function; you can only pass it the two surrogates. And you'd get some nonsensical result, but that's fine, because you declared the surrogates themselves to be proper members of the execution character set.
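    To illustrate the Clause 7 problem (U+1D400 and its surrogate values here are just an example; what iswalpha actually returns for them is locale- and platform-dependent):

    ```cpp
    #include <cwctype>

    int main() {
        // U+1D400 (MATHEMATICAL BOLD CAPITAL A) is alphabetic, but with a
        // 16-bit wchar_t it cannot be passed to iswalpha() as one wint_t;
        // only its two surrogate code units can.
        std::wint_t high = 0xD835, low = 0xDC00;   // UTF-16 for U+1D400

        // Whatever these calls return says nothing about U+1D400 itself:
        // classifying a lone surrogate is meaningless.
        (void)std::iswalpha(high);
        (void)std::iswalpha(low);
        return 0;
    }
    ```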
