Difference between composite characters and surrogate pairs

前端 未结 2 1009
忘了有多久
忘了有多久 2021-02-06 12:08

In Unicode what is the difference between composite characters and surrogate pairs?

To me they sound like similar things - two characters to represent one character. Wh

2条回答
  •  野的像风
    2021-02-06 12:51

    Surrogate pairs are a weird wart in Unicode.

    Unicode itself is nothing other than an abstract assignment of meaning to numbers. That's what an encoding is. Capital-letter-A, Greek-alternate-terminal-sigma, Klingon-closing-bracket-2, etc. currently, numbers up to about 221 are available, though not all are in use. In the context of Unicode, each number is know as a code point.

    However, the Unicode suite as a whole contains more than just this encoding. It also contains technologies to serialize code points. This is essentially just an exercise in serializing unsigned integers. Three subfamilies of technologies are specified: UTF-32, UTF-8, and UTF-16.

    UTF-32 simply expresses every code-point as a 32-bit unsigned integer. That's easy. Two variants exist, for big and little endian, respectively. Each 32-bit serialized integer is called the code unit of this format, and this is a fixed-width format (one code point per code unit).

    UTF-8 is a clever multi-byte format, in which code points take up anything from one to six 8-bit bytes. This format is very portable, since it has no ordering issues and since it is pretty compact for English, near-English and computer code. The code unit of UTF-8 is one byte, and this is a variable-width format (1–6 code units per code point).

    Finally, there's UTF-16: Initially, people thought Unicode could do with only 216 numbers, so this was initially deemed to be fixed-width, with 16-bit code units. However, it eventually became clear that we needed larger numbers. So UTF-16 is now also a variable-width format, but the way this is achieved is that certain 16-bit code units act as indicators that they are part of a two-unit pair, the surrogate pair. However, to simplify the way you detect those pairs, rather than having some external envelope format as UTF-8 does, the actual 16-bit values that are used by the surrogates are deliberately leaked back into the Unicode encoding and left out of the encoding - that is, the surrogate values, 0xD800 to 0xDFFF, are not valid Unicode code points.

    So, in summary, surrogates are the result of forcing a serialization format for Unicode back into the encoding, and distorting the design of the encoding to accommodate the serialization format. This is perhaps an unfortunate historical accident, which is somewhat pointless and unsightly in retrospect, but it's what we have and what we need to live with.


    Composite characters, on the other hand, are something much higher-level: They are visual units ("graphemes") that are composed of multiple Unicode code points. Sometimes people refer to code points themselves as "characters", but that's a little bit misleading, since characters should really be graphemes, and they can consist of several components (e.g. a base letter plus diacritics and modifiers).

提交回复
热议问题