Is there encoding in Unicode where every “character” is just one code point?

问题

Trying to rephrase: Can you map every combining character combination into one code point?

I'm new to Unicode, but it seems to me that there is no encoding, normalization or representation where one character would be one code point in every case in Unicode. Is this correct?

Is this true for Basic Multilingual Plane also?

回答1:

If you mean one char == one number (ie: where every char is represented by the same number of bytes/words/what-have-you): in UCS-4, each character is represented by a 4-byte number. That's way more than big enough for every character to be represented by a single value, but it's quite wasteful if you don't need any of the higher chars.

If you mean the compatibility sequences (ie: where e + ´ => é): there are single-character representations for most of the combinations in use in existing modern languages. If you're making up your own language, you could run into problems...but if you're sticking to the ones that people actually use, you'll be fine.

回答2:

Can you map every combining character combination into one code point?

Every combining character combination? How would your proposed encoding represent the string "à̴̵̶̷̸̡̢̧̨̛̖̗̘̙̜̝̞̟̠̣̤̥̦̩̪̫̬̭̮̯̰̱̲̳̹̺̻̼͇͈͉͍͎́̂̃̄̅̆̇̈̉̊̋̌̍̎̏̐̑̒̓̔̽̾̿̀́͂̓̈́͆͊͋͌̕̚ͅ͏͓͔͕͖͙͚͐͑͒͗͛ͣͤͥͦͧͨͩͪͫͬͭͮͯ͘͜͟͢͝͞͠͡"? (an 'a' with more than a hundred combining marks attached to it?) It's just not practical.

There are, however, a lot of "precomposed" characters in Unicode, like áçñü. Normalization form C will use these instead of the decomposed version whenever possible.

回答3:

it seems to me that there is no encoding, normalization or representation where one character would be one code point in every case in Unicode. Is this correct?

Depends on the meaning of the meaning of the word “character.” Unicode has the concepts of abstract character (definition 7 in chapter 3 of the standard: “A unit of information used for the organization, control, or representation of textual data”) and encoded character (definition 11: “An association (or mapping) between an abstract character and a code point”). So a character never is a code point, but for many code points, there exists an abstract character that maps to the code point, this mapping being called “encoded character.” But (definition 11, paragraph 4): “A single abstract character may also be represented by a sequence of code points”

Is this true for Basic Multilingual Plane also?

There is no conceptual difference related to abstract or encoded characters between the BMP and the other planes. The statement above holds for all subsets of the codespace.

Depending on your application, you have to distinguish between the terms glyph, grapheme cluster, grapheme, abstract character, encoded character, code point, scalar value, code unit and byte. All of these concepts are different, and there is no simple mapping between them. In particular, there is almost never a one-to-one mapping between these entities.

来源：https://stackoverflow.com/questions/4608424/is-there-encoding-in-unicode-where-every-character-is-just-one-code-point

标签

unicode

normalization

utf-16

unicode-normalization

surrogate-pairs