Question
It is said here that UTF-16's largest code point is 10FFFF.

Also it is written on that page that "BMP characters require one 16-bit code unit to process or store."

But the bit representation of 10FFFF is

0001 0000 1111 1111 1111 1111

We see that it occupies more than 15 bits of a 16-bit wchar_t (an implementation is allowed to support wide characters with values >= 0 only, independently of the signedness of wchar_t).

What is the real largest code point for a 16-bit wchar_t?
Answer 1:
It is said here that UTF-16's largest code point is 10FFFF
Yes, but you are misinterpreting the table that you are drawing that from.
U+10FFFF is the largest Unicode code point value. UTF-16 is not Unicode itself; it is an encoding of Unicode code points using 16-bit code units (just as UTF-8 is an encoding using 8-bit code units). As you remarked, 16 bits is not enough to represent the full range of Unicode code point values. The UTF-16 encoding of Unicode code points U+0000 - U+FFFF requires only 1 code unit, but the encoding of code points U+10000 - U+10FFFF requires 2 code units acting together, known as a "surrogate pair". UTF-16 is the successor to UCS-2, the original 16-bit encoding for Unicode, which could only encode code points U+0000 - U+FFFF. UTF-16 is backwards compatible with UCS-2, but adding surrogate pairs allows UTF-16 to support the full range of Unicode code points.
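To make the surrogate-pair mechanics concrete, here is a minimal C sketch (the function utf16_encode and its interface are hypothetical, not part of any standard library) that encodes a single code point as one or two 16-bit code units:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: encode one Unicode code point (<= 0x10FFFF)
   as UTF-16. Returns the number of 16-bit code units written (1 or 2). */
static int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp <= 0xFFFF) {                          /* BMP: one code unit */
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                               /* now a 20-bit value */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high (lead) surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low (trail) surrogate */
    return 2;
}

int main(void)
{
    uint16_t units[2];
    int n = utf16_encode(0x10FFFF, units);
    for (int i = 0; i < n; i++)
        printf("0x%04X ", units[i]);
    printf("\n");
    return 0;
}

Encoding U+10FFFF this way prints 0xDBFF 0xDFFF, the largest possible surrogate pair.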
UTF-16 is designed so that the code unit values from which surrogate pairs can be formed are reserved for that purpose. They cannot be misinterpreted as regular characters, even when they appear unpaired (in what therefore must be an invalid code sequence).
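A complementary sketch (again with hypothetical helper names) shows how a decoder classifies code units and recombines a pair; the reserved surrogate ranges are disjoint from all character code points, so the classification is unambiguous:

#include <stdbool.h>
#include <stdint.h>

/* Code units 0xD800-0xDBFF are high (lead) surrogates and 0xDC00-0xDFFF
   are low (trail) surrogates; these values are reserved and never
   denote characters on their own. */
static bool is_high_surrogate(uint16_t u) { return (u & 0xFC00) == 0xD800; }
static bool is_low_surrogate(uint16_t u)  { return (u & 0xFC00) == 0xDC00; }

/* Recombine a valid surrogate pair into the code point it encodes. */
static uint32_t utf16_decode_pair(uint16_t hi, uint16_t lo)
{
    return 0x10000u + (((uint32_t)(hi & 0x3FF) << 10) | (lo & 0x3FF));
}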
Note also that it's a bit of an abuse, albeit a common one, for a C implementation to call UTF-16 (or UTF-8) a "character set", as their code units do not all correspond 1-1 with Unicode characters. Or at least, the characters to which they correspond have to be interpreted as the code units that they are. It's a pragmatic approach to the problem of efficiently representing characters from a large range.
Also it is written on that page that "BMP characters require one 16-bit code unit to process or store."
That is also true. You apparently have overlooked the fact that BMP (Basic Multilingual Plane, code points U+0000 - U+FFFF) characters are a subset of all Unicode characters. 1/17th of them, in fact, or somewhat less, depending on how you count. The fact that their code point values can all be represented with 16 bits (i.e. in one UTF-16 code unit) could in fact be taken as a definition of that subset.
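The "1/17th" figure is simple arithmetic, sketched here for concreteness:

#include <stdio.h>

int main(void)
{
    /* Unicode spans U+0000..U+10FFFF: 0x110000 (1,114,112) code points,
       organized as 17 planes of 0x10000 (65,536) each; the BMP is plane 0. */
    printf("total code points: %d\n", 0x110000);
    printf("planes:            %d\n", 0x110000 / 0x10000);  /* prints 17 */
    return 0;
}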
We see that it occupies more than 15 bits of a 16-bit wchar_t (an implementation is allowed to support wide characters with values >= 0 only, independently of the signedness of wchar_t).
No, as we covered in my answer to one of your other recent questions. The standard does not restrict C implementations to supporting only non-negative code point values. That's just the de facto state of the code point assignments of all current, widely-used coded character sets. A conforming C implementation on which wchar_t is signed could provide a character set in which some extended characters have negative corresponding wchar_t values.
What is the real largest code point for 16-bit wchar_t?
That has nothing to do with any of the foregoing. In fact, it doesn't make much sense. Code point values are a characteristic of (coded) character sets, not of any C data type. They are the numbers corresponding to the characters supported by that set.
If a C implementation claims to provide UTF-16 as a supported character set, then it follows that its wchar_t must have at least 16 value bits, because that type must be able to represent all UTF-16 code unit values. If that type has only 16 bits altogether, then they must all be value bits, making the type necessarily unsigned and capable of supporting values up to 0xFFFF.
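A quick way to see what a given implementation actually provides is to inspect the limits from <wchar.h>; a minimal sketch:

#include <stdint.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* On an implementation that pairs a 16-bit wchar_t with UTF-16, all
       16 bits must be value bits, so WCHAR_MIN is 0 and WCHAR_MAX is 0xFFFF. */
    printf("WCHAR_MIN       = %jd\n", (intmax_t)WCHAR_MIN);
    printf("WCHAR_MAX       = %ju\n", (uintmax_t)WCHAR_MAX);
    printf("sizeof(wchar_t) = %zu bytes\n", sizeof(wchar_t));
    return 0;
}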
Source: https://stackoverflow.com/questions/40755519/what-is-the-largest-code-point-for-16-bit-wchar-t-type