Difference between composite characters and surrogate pairs

Posted by 末鹿安然 on 2019-12-03 08:14:27
Kerrek SB

Surrogate pairs are a weird wart in Unicode.

Unicode itself is nothing other than an abstract assignment of meaning to numbers. That's what an encoding is. Capital-letter-A, Greek-alternate-terminal-sigma, Klingon-closing-bracket-2, etc. Currently, numbers up to about 2^21 are available (the highest is 0x10FFFF), though not all are in use. In the context of Unicode, each number is known as a code point.

However, the Unicode suite as a whole contains more than just this encoding. It also contains technologies to serialize code points. This is essentially just an exercise in serializing unsigned integers. Three subfamilies of technologies are specified: UTF-32, UTF-8, and UTF-16.

UTF-32 simply expresses every code-point as a 32-bit unsigned integer. That's easy. Two variants exist, for big and little endian, respectively. Each 32-bit serialized integer is called the code unit of this format, and this is a fixed-width format (one code point per code unit).

UTF-8 is a clever multi-byte format, in which code points take up anything from one to four 8-bit bytes. (The original design allowed sequences of up to six bytes, but since code points stop at U+10FFFF, the current definition caps it at four.) This format is very portable, since it has no byte-ordering issues and since it is pretty compact for English, near-English and computer code. The code unit of UTF-8 is one byte, and this is a variable-width format (1–4 code units per code point).

Finally, there's UTF-16: Initially, people thought Unicode could do with only 2^16 numbers, so this was initially deemed to be fixed-width, with 16-bit code units. However, it eventually became clear that we needed larger numbers. So UTF-16 is now also a variable-width format, but the way this is achieved is that certain 16-bit code units act as indicators that they are part of a two-unit pair, the surrogate pair. However, to simplify the way you detect those pairs, rather than having some external envelope format as UTF-8 does, the actual 16-bit values that are used by the surrogates are deliberately leaked back into the Unicode encoding and carved out of it: the surrogate values, 0xD800 to 0xDFFF, are reserved and will never be assigned to a character.
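To make the fixed-width versus variable-width distinction concrete, here is a small Python sketch that counts how many code units a few sample code points need in each of the three formats. The helper name code_units is made up for this illustration, and the explicit little-endian codecs are chosen only so that Python does not prepend a byte order mark.

```python
def code_units(text: str) -> dict:
    """Count the code units `text` needs in each UTF format.

    Hypothetical helper for illustration only. The "-le" codecs are used
    so that no byte order mark is added to the encoded output.
    """
    return {
        "utf-8": len(text.encode("utf-8")),            # 8-bit code units
        "utf-16": len(text.encode("utf-16-le")) // 2,  # 16-bit code units
        "utf-32": len(text.encode("utf-32-le")) // 4,  # 32-bit code units
    }

# U+0041, U+00E9, U+20AC and U+1D11E (the last one lies above U+FFFF)
for ch in ["A", "\u00e9", "\u20ac", "\U0001D11E"]:
    print(f"U+{ord(ch):04X}", code_units(ch))

# A surrogate value on its own is not a character, so Python's strict
# UTF-8 encoder refuses to serialize it.
try:
    "\ud800".encode("utf-8")
    lone_surrogate_encodable = True
except UnicodeEncodeError:
    lone_surrogate_encodable = False
```

Only the code point above U+FFFF needs two UTF-16 code units, while every code point is exactly one UTF-32 code unit.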

So, in summary, surrogates are the result of forcing a serialization format for Unicode back into the encoding, and distorting the design of the encoding to accommodate the serialization format. This is perhaps an unfortunate historical accident, which is somewhat pointless and unsightly in retrospect, but it's what we have and what we need to live with.


Composite characters, on the other hand, are something much higher-level: They are visual units ("graphemes") that are composed of multiple Unicode code points. Sometimes people refer to code points themselves as "characters", but that's a little bit misleading, since characters should really be graphemes, and they can consist of several components (e.g. a base letter plus diacritics and modifiers).

An example of a composite character is Unicode U+00C9, É. It should display identically to the decomposed pair U+0045 E and U+0301 (the combining acute accent character). This is independent of any byte encoding used to actually store the character; it's just two different ways of representing the same graphical character using Unicode.
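As a rough illustration, Python's standard unicodedata module can convert between the two representations. This sketch assumes the precomposed É is U+00C9, and the variable names are just for the example.

```python
import unicodedata

composed = "\u00c9"      # precomposed: one code point
decomposed = "E\u0301"   # E followed by COMBINING ACUTE ACCENT: two code points

# The two strings render the same but are not equal code point by code point.
print(len(composed), len(decomposed))
print(composed == decomposed)

# Normalization maps one form onto the other.
nfc = unicodedata.normalize("NFC", decomposed)  # compose where possible
nfd = unicodedata.normalize("NFD", composed)    # decompose fully
print(nfc == composed, nfd == decomposed)
```

Comparing strings only after normalizing both to the same form (NFC or NFD) is the usual way to treat these two spellings as equal.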

A surrogate pair is specific to UTF-16, which uses two 16-bit values to represent a single Unicode code point greater than U+FFFF (which obviously cannot fit in a single 16-bit value). For example (from the Wikipedia article), code point U+1D11E is serialized as the two 16-bit values 0xD834 and 0xDD1E. (The actual byte sequence used to represent them will depend on whether you use the big endian or little endian version of UTF-16.)
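The arithmetic behind that example can be sketched in a few lines of Python. The function names to_surrogate_pair and from_surrogate_pair are invented for this illustration; in practice a UTF-16 codec does this for you.

```python
from typing import Tuple

def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need a pair")
    offset = code_point - 0x10000      # a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low

def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = to_surrogate_pair(0x1D11E)
print(hex(high), hex(low))

# The byte order only matters once the 16-bit units are written out as bytes.
big_endian = "\U0001D11E".encode("utf-16-be")
little_endian = "\U0001D11E".encode("utf-16-le")
print(big_endian.hex(), little_endian.hex())
```

The pair of 16-bit values is the same in both variants; only the order of the two bytes inside each code unit changes.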
