How many characters can UTF-8 encode?

一个人的身影 2020-11-28 01:55

If UTF-8 is 8 bits, does that not mean there can be a maximum of only 256 different characters?

The first 128 code points are the same as in ASCII, yet UTF-8 is said to support over a million characters. How does that work?

10 Answers
  • 2020-11-28 02:13

    Unicode vs UTF-8

    Unicode resolves code points to characters. UTF-8 is a storage mechanism for Unicode. Each has its own spec, and each has a different upper bound.

    Unicode

    Unicode is organized into "planes." Each plane carries 2^16 code points, and there are 17 planes, for a total of 17 × 2^16 = 1,114,112 code points. The first plane, plane 0 or the BMP, is special in the weight of what it carries.

    Rather than explain all the nuances, let me just quote the above article on planes.

    The 17 planes can accommodate 1,114,112 code points. Of these, 2,048 are surrogates, 66 are non-characters, and 137,468 are reserved for private use, leaving 974,530 for public assignment.

    UTF-8

    Now let's go back to the article linked above,

    The encoding scheme used by UTF-8 was designed with a much larger limit of 2^31 code points (32,768 planes), and can encode 2^21 code points (32 planes) even if limited to 4 bytes. Since Unicode limits the code points to the 17 planes that can be encoded by UTF-16, code points above 0x10FFFF are invalid in UTF-8 and UTF-32.

    So you can see that you can put stuff into UTF-8 that isn't valid Unicode. Why? Because UTF-8 accommodates code points that Unicode doesn't even support.

    UTF-8, even with a four-byte limitation, supports 2^21 code points, which is far more than 17 × 2^16.
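
    A minimal Python sketch of the arithmetic above (the totals come straight from the quoted figures; the last line simply uses Python's own UTF-8 codec to encode the highest valid code point):

    ```python
    # Unicode's code-point space vs. UTF-8's capacity.
    unicode_code_points = 17 * 2**16       # 17 planes x 65,536 code points = 1,114,112
    utf8_4byte_capacity = 2**21            # 21 payload bits in a 4-byte sequence = 2,097,152
    utf8_original_limit = 2**31            # the original design limit quoted above

    print(unicode_code_points, utf8_4byte_capacity, utf8_original_limit)

    # U+10FFFF is the highest code point a conforming UTF-8 codec will accept:
    print("\U0010FFFF".encode("utf-8").hex())   # f48fbfbf
    ```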

  • 2020-11-28 02:17

    While I agree with mpen about the current maximum number of UTF-8 codes (2,164,864) (listed below; I couldn't comment on his answer), he is off by 2 levels if you remove the 2 major restrictions of UTF-8: the 4-byte limit and the rule that bytes 254 and 255 cannot be used (he only removed the 4-byte limit).

    Lead byte 254 (binary 11111110) still follows the basic lead-byte arrangement (the multi-byte flag set to 1, a run of six more 1s, then a terminating 0, with no spare payload bits), giving you 6 additional bytes to work with (six 10xxxxxx groups, an additional 2^36 codes).

    Lead byte 255 (binary 11111111) doesn't exactly follow that setup: there is no terminating 0 because every bit is used, giving you 7 additional bytes (seven 10xxxxxx groups, an additional 2^42 codes).

    Adding these in gives a maximum representable set of 4,468,982,745,216 codes. This is more than all characters in current use, in old or dead languages, and in any believed-lost languages. Angelic or Celestial script, anyone?

    There are also single-byte values besides 254 and 255 that can never begin a valid UTF-8 sequence: 128-191 (which may only appear as continuation bytes), and a few others. Some are used locally, for example code 128 is sometimes a deleting backspace. The other starting codes (and their associated ranges) are invalid for one or more reasons (https://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences).
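
    A short Python sketch of this (purely hypothetical) arithmetic; the 5-byte and longer forms, and the 0xFE/0xFF lead bytes, are not legal in UTF-8 as standardized:

    ```python
    standard_4_byte   = 2**7 + 2**11 + 2**16 + 2**21   # 2,164,864 (the current maximum)
    obsolete_5_6_byte = 2**26 + 2**31                   # the withdrawn 5- and 6-byte forms
    hypothetical_fe   = 2**36                           # 0xFE lead byte + 6 continuation bytes
    hypothetical_ff   = 2**42                           # 0xFF lead byte + 7 continuation bytes

    total = standard_4_byte + obsolete_5_6_byte + hypothetical_fe + hypothetical_ff
    print(f"{total:,}")                                 # 4,468,982,745,216
    ```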

  • 2020-11-28 02:18

    Quote from Wikipedia: "UTF-8 encodes each of the 1,112,064 code points in the Unicode character set using one to four 8-bit bytes (termed "octets" in the Unicode Standard)."

    Some links:

    • http://www.utf-8.com/
    • http://www.joelonsoftware.com/articles/Unicode.html
    • http://www.icu-project.org/docs/papers/forms_of_unicode/
    • http://en.wikipedia.org/wiki/UTF-8
  • 2020-11-28 02:19

    2,164,864 “characters” can potentially be coded by UTF-8.

    This number is 2^7 + 2^11 + 2^16 + 2^21, which comes from the way the encoding works:

    • 1-byte chars have 7 bits for encoding 0xxxxxxx (0x00-0x7F)

    • 2-byte chars have 11 bits for encoding 110xxxxx 10xxxxxx (0xC0-0xDF for the first byte; 0x80-0xBF for the second)

    • 3-byte chars have 16 bits for encoding 1110xxxx 10xxxxxx 10xxxxxx (0xE0-0xEF for the first byte; 0x80-0xBF for continuation bytes)

    • 4-byte chars have 21 bits for encoding 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (0xF0-0xF7 for the first byte; 0x80-0xBF for continuation bytes)

    As you can see this is significantly larger than current Unicode (1,112,064 characters).
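
    Here is a small Python check of both figures, assuming the usual exclusions (overlong forms and the 2,048 surrogate code points):

    ```python
    # Raw bit-pattern count per the breakdown above (includes overlong forms):
    raw_patterns = 2**7 + 2**11 + 2**16 + 2**21
    print(f"{raw_patterns:,}")          # 2,164,864

    # Valid UTF-8 encodes exactly the Unicode scalar values:
    # U+0000..U+10FFFF minus the surrogates U+D800..U+DFFF.
    scalar_values = (0x10FFFF + 1) - 2048
    print(f"{scalar_values:,}")         # 1,112,064
    ```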

    UPDATE

    My initial calculation is wrong because it doesn't consider additional rules. See comments to this answer for more details.

  • 2020-11-28 02:22

    UTF-8 does not always use one byte; it uses 1 to 4 bytes per character.

    The first 128 characters (US-ASCII) need one byte.

    The next 1,920 characters need two bytes to encode. This covers the remainder of almost all Latin alphabets, and also Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac and Tāna alphabets, as well as Combining Diacritical Marks.

    Three bytes are needed for characters in the rest of the Basic Multilingual Plane, which contains virtually all characters in common use, including most Chinese, Japanese and Korean (CJK) characters.

    Four bytes are needed for characters in the other planes of Unicode, which include less common CJK characters, various historic scripts, mathematical symbols, and emoji (pictographic symbols).

    source: Wikipedia
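
    A quick way to see the four length classes is to encode one character from each range (a small Python sketch; the sample characters are arbitrary choices):

    ```python
    samples = ["A", "é", "中", "😀"]     # 1, 2, 3 and 4 bytes respectively
    for ch in samples:
        encoded = ch.encode("utf-8")
        print(f"U+{ord(ch):04X}: {len(encoded)} byte(s) -> {encoded.hex(' ')}")
    ```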

  • 2020-11-28 02:23

    UTF-8 uses 1 to 4 bytes per character: one byte for ASCII characters (the first 128 Unicode values are the same as ASCII), and those need only 7 bits. If the highest bit is set, it marks the start of a multi-byte sequence: the number of consecutive high bits set gives the length of the sequence in bytes, followed by a 0, and the remaining bits of that byte contribute to the value. In every continuation byte, the highest two bits are 1 and 0 and the remaining 6 bits hold part of the value.

    So a four-byte sequence begins with 11110xxx (where the three x bits are part of the value), followed by three bytes contributing 6 bits each, yielding a 21-bit value. 2^21 exceeds the number of Unicode code points, so all of Unicode can be expressed in UTF-8.
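
    A hand-rolled encoder makes that bit layout concrete (an illustrative Python sketch only; surrogate checking and error handling are omitted, and real code should just use str.encode("utf-8")):

    ```python
    def utf8_encode(code_point: int) -> bytes:
        if code_point < 0x80:            # 0xxxxxxx
            return bytes([code_point])
        if code_point < 0x800:           # 110xxxxx 10xxxxxx
            return bytes([0xC0 | (code_point >> 6),
                          0x80 | (code_point & 0x3F)])
        if code_point < 0x10000:         # 1110xxxx 10xxxxxx 10xxxxxx
            return bytes([0xE0 | (code_point >> 12),
                          0x80 | ((code_point >> 6) & 0x3F),
                          0x80 | (code_point & 0x3F)])
        if code_point <= 0x10FFFF:       # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return bytes([0xF0 | (code_point >> 18),
                          0x80 | ((code_point >> 12) & 0x3F),
                          0x80 | ((code_point >> 6) & 0x3F),
                          0x80 | (code_point & 0x3F)])
        raise ValueError("beyond the Unicode range")

    # The four-byte form carries 3 + 6 + 6 + 6 = 21 payload bits:
    assert utf8_encode(0x1F600) == "😀".encode("utf-8")
    print(utf8_encode(0x1F600).hex(" "))  # f0 9f 98 80
    ```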
