Difference between composite characters and surrogate pairs

忘了有多久 2021-02-06 12:08

In Unicode what is the difference between composite characters and surrogate pairs?

To me they sound like similar things - two characters to represent one character. Wh

2 answers

野的像风 (original poster)

2021-02-06 12:51

Surrogate pairs are a weird wart in Unicode.

Unicode itself is nothing other than an abstract assignment of meaning to numbers. That's what an encoding is: Capital-letter-A, Greek-alternate-terminal-sigma, Klingon-closing-bracket-2, etc. Currently, numbers up to 0x10FFFF (a bit over 2²⁰, so they fit in 21 bits) are available, though not all are in use. In the context of Unicode, each number is known as a code point.
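As a small illustration of "meaning assigned to numbers", here is a sketch using Python's standard `unicodedata` module. The three sample characters are my own choice, not part of the original answer, and the names printed depend on the Unicode database bundled with your Python version.

```python
import sys
import unicodedata

# Three sample characters, chosen only for illustration.
samples = ["A", "\u03c2", "\U0001F600"]

for ch in samples:
    # ord() gives the code point (a plain integer);
    # unicodedata.name() gives the meaning Unicode assigns to it.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# The largest code point this Python build can represent.
largest = sys.maxunicode
print(hex(largest))
```

Note that `ord` and `chr` work purely on code points; nothing here has been serialized to bytes yet.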

However, the Unicode suite as a whole contains more than just this encoding. It also contains technologies to serialize code points. This is essentially just an exercise in serializing unsigned integers. Three subfamilies of technologies are specified: UTF-32, UTF-8, and UTF-16.

UTF-32 simply expresses every code point as a 32-bit unsigned integer. That's easy. Two variants exist, for big-endian and little-endian byte order, respectively. Each 32-bit serialized integer is called a code unit of this format, and this is a fixed-width format (one code unit per code point).
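To make "just serializing an unsigned integer" concrete, here is a minimal sketch. `utf32_encode` is a hypothetical helper written for this example (in practice you would call `str.encode("utf-32-be")` or `"utf-32-le"`), and it assumes the input is a single code point given as an integer.

```python
import struct

def utf32_encode(code_point: int, big_endian: bool = True) -> bytes:
    """Serialize one code point as a single 32-bit code unit (hypothetical helper)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    # ">I" is a big-endian unsigned 32-bit integer, "<I" is little-endian.
    return struct.pack(">I" if big_endian else "<I", code_point)

be = utf32_encode(0x1F600, big_endian=True)
le = utf32_encode(0x1F600, big_endian=False)
print(be.hex(), le.hex())
```

The two results contain the same four bytes in opposite order, which is all that separates the two variants.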

UTF-8 is a clever multi-byte format, in which code points take up anything from one to four 8-bit bytes. (The original design allowed sequences of up to six bytes, but the format has since been restricted to four, which is enough for every code point up to 0x10FFFF.) This format is very portable, since it has no byte-order issues and since it is pretty compact for English, near-English and computer code. The code unit of UTF-8 is one byte, and this is a variable-width format (1–4 code units per code point).
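The variable width is easy to observe with Python's built-in encoder. This is only a sketch; the sample characters are my own choice, picked so that each one lands in a different length class.

```python
# Sample characters chosen for illustration, one per UTF-8 length class.
samples = {
    "A": "\u0041",             # basic Latin letter
    "e-acute": "\u00e9",       # precomposed Latin-1 letter
    "euro": "\u20ac",          # euro sign
    "emoji": "\U0001F600",     # code point above U+FFFF
}

# Number of UTF-8 code units (bytes) needed for each single code point.
widths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}
print(widths)
```

Every string in `samples` is exactly one code point, so the differing byte counts come from the serialization format alone.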

Finally, there's UTF-16. Initially, people thought Unicode could do with only 2¹⁶ numbers, so this was originally designed as a fixed-width format with 16-bit code units. However, it eventually became clear that we needed larger numbers. So UTF-16 is now also a variable-width format (1–2 code units per code point), and the way this is achieved is that certain 16-bit code units act as indicators that they are part of a two-unit pair, the surrogate pair. However, to simplify the way you detect those pairs, rather than keeping the framing entirely inside the serialization format as UTF-8 does, the actual 16-bit values that are used by the surrogates are deliberately leaked back into the Unicode encoding and reserved there. That is, the surrogate values, 0xD800 to 0xDFFF, are never assigned to characters and are not valid Unicode scalar values.
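The pairing arithmetic can be sketched as follows. `to_surrogate_pair` and `from_surrogate_pair` are hypothetical helper names introduced for this example; the constants are the start of the high-surrogate range (0xD800), the start of the low-surrogate range (0xDC00), and the 0x10000 offset used by UTF-16.

```python
import struct
from typing import Tuple

def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into two 16-bit code units (hypothetical helper)."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000        # a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go into the low surrogate
    return high, low

def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a high/low surrogate pair into the original code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))               # 0xd83d 0xde00

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
encoded = struct.pack(">HH", high, low)
print(encoded == "\U0001F600".encode("utf-16-be"))
```

Because both halves fall in 0xD800 to 0xDFFF, a decoder can tell from any single code unit whether it is standalone, the first half of a pair, or the second half.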

So, in summary, surrogates are the result of forcing a serialization format for Unicode back into the encoding, and distorting the design of the encoding to accommodate the serialization format. This is perhaps an unfortunate historical accident, which is somewhat pointless and unsightly in retrospect, but it's what we have and what we need to live with.

Composite characters, on the other hand, are something much higher-level: They are visual units ("graphemes") that are composed of multiple Unicode code points. Sometimes people refer to code points themselves as "characters", but that's a little bit misleading, since characters should really be graphemes, and they can consist of several components (e.g. a base letter plus diacritics and modifiers).
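A short sketch of the difference, using Python's standard `unicodedata` module. The example characters (a base letter plus a combining acute accent, and an emoji above U+FFFF) are my own choice for illustration.

```python
import unicodedata

# One visual unit built from TWO code points: base letter + combining accent.
decomposed = "e\u0301"
# The same grapheme as ONE precomposed code point, obtained by NFC normalization.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))    # 2 1  (len counts code points)

# One code point that needs TWO UTF-16 code units (a surrogate pair).
emoji = "\U0001F600"
utf16_units = len(emoji.encode("utf-16-le")) // 2
print(len(emoji), utf16_units)           # 1 2

# The composed and decomposed forms differ at the code point level in every
# serialization; the surrogate pair exists only in the UTF-16 serialization.
print(unicodedata.normalize("NFD", composed) == decomposed)
```

So composition is about how many code points make up one grapheme, while surrogate pairs are about how many UTF-16 code units make up one code point.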
