How many characters can UTF-8 encode?

一个人的身影 2020-11-28 01:55

If UTF-8 is 8 bits, doesn't that mean there can be a maximum of only 256 different characters?

The first 128 code points are the same as in ASCII. But it says UTF-

10 answers
  • 2020-11-28 02:24

    Unicode and UTF-8 cover exactly the same code space. The Unicode Standard defines code points in the range U+0000..U+10FFFF, which is 1,114,112 code points and needs 21 bits to address (2^21 = 2,097,152 is the size of a 21-bit space, not the number of valid code points). UTF-8 is restricted to that same range, and both systems reserve the same 'dead' space and restricted zones for code points etc. ...as of June 2018 the most recent version, Unicode 11.0, contains a repertoire of 137,439 characters

    From the Unicode Standard (Unicode FAQ):

    The Unicode Standard encodes characters in the range U+0000..U+10FFFF, which amounts to a 21-bit code space.

    From the UTF-8 Wikipedia page (Description section):

    Since the restriction of the Unicode code-space to 21-bit values in 2003, UTF-8 is defined to encode code points in one to four bytes, ...
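
    As a rough check of the arithmetic behind the quoted range, here is a minimal sketch. The variable names are illustrative, not taken from any standard, and the surrogate exclusion is an addition: U+D800..U+DFFF are code points reserved for UTF-16, so they are not encodable characters.

```python
# Size of the Unicode code space, U+0000..U+10FFFF inclusive.
MAX_CODE_POINT = 0x10FFFF
code_points = MAX_CODE_POINT + 1

# Surrogates U+D800..U+DFFF are reserved for UTF-16 and are not
# valid in well-formed UTF-8.
surrogates = 0xDFFF - 0xD800 + 1

# What remains is the set of values UTF-8 can actually encode.
scalar_values = code_points - surrogates

print(code_points)                  # 1114112
print(surrogates)                   # 2048
print(scalar_values)                # 1112064
print(MAX_CODE_POINT.bit_length())  # 21
```

    So "21-bit code space" describes how many bits the largest code point needs, not how many code points exist.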

  • 2020-11-28 02:26

    UTF-8 is a variable length encoding with a minimum of 8 bits per character.
    Characters with higher code points will take up to 32 bits.
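
    To see the variable length in practice, a small sketch like the following encodes one character from each length class. The sample characters and the helper name `utf8_len` are arbitrary choices for illustration.

```python
def utf8_len(ch: str) -> int:
    """Return the number of bytes UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))

samples = {
    "A": "\u0041",           # U+0041, ASCII range
    "e-acute": "\u00e9",     # U+00E9
    "euro": "\u20ac",        # U+20AC
    "emoji": "\U0001F600",   # U+1F600, outside the Basic Multilingual Plane
}

for name, ch in samples.items():
    # Each byte is 8 bits, so 1 to 4 bytes means 8 to 32 bits.
    print(name, f"U+{ord(ch):04X}", utf8_len(ch), "byte(s)")
```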

  • 2020-11-28 02:32

    According to this table* UTF-8 should support:

    2^31 = 2,147,483,648 characters

    However, RFC 3629 restricted the possible values, so now we're capped at 4 bytes, which leaves 21 payload bits:

    2^21 = 2,097,152 values

    Of those, only code points up to U+10FFFF (1,114,112 in total) are actually valid.

    Note that a good chunk of those characters are "reserved" for custom use, which is actually pretty handy for icon-fonts.

    * Wikipedia used to show a table with 6 bytes -- they've since updated the article.

    2017-07-11: Corrected for double-counting the same code point encoded with multiple bytes
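
    The two figures above follow from how many payload bits a sequence of a given length carries. A minimal sketch, assuming the usual layout (a 1-byte sequence has 7 payload bits; an n-byte sequence has 7 - n bits in the lead byte plus 6 in each continuation byte); `payload_bits` is a made-up helper name:

```python
def payload_bits(n_bytes: int) -> int:
    """Payload bits available in a UTF-8 style sequence of n_bytes bytes."""
    if n_bytes == 1:
        return 7  # 0xxxxxxx
    # Lead byte: n one-bits, a zero, then (7 - n) payload bits.
    # Each continuation byte (10xxxxxx) carries 6 payload bits.
    return (7 - n_bytes) + 6 * (n_bytes - 1)

for n in range(1, 7):
    bits = payload_bits(n)
    print(n, "byte(s):", bits, "bits ->", 2 ** bits, "values")

# The original 6-byte design reached 31 bits; the 4-byte cap gives 21.
print(2 ** payload_bits(6))  # 2147483648
print(2 ** payload_bits(4))  # 2097152
```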

  • 2020-11-28 02:40

    Check out the Unicode Standard and related information, such as their FAQ entry, UTF-8, UTF-16, UTF-32 & BOM. It’s not light reading, but it’s authoritative information, and much of what you might read about UTF-8 elsewhere is questionable.

    The “8” in “UTF-8” relates to the length of code units in bits. Code units are entities used to encode characters, not necessarily as a simple one-to-one mapping. UTF-8 uses a variable number of code units to encode a character.

    The collection of characters that can be encoded in UTF-8 is exactly the same as for UTF-16 or UTF-32, namely all Unicode characters. They all encode the entire Unicode coding space, which even includes noncharacters and unassigned code points.
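
    The code unit idea can be sketched by encoding one character in all three forms and counting units. The helper `code_units` is hypothetical, and the little-endian codec names are used only to keep Python from prepending a BOM.

```python
def code_units(text: str, encoding: str, unit_bytes: int) -> int:
    """Count code units: encoded byte length divided by the unit size."""
    return len(text.encode(encoding)) // unit_bytes

ch = "\U0001F600"  # U+1F600, a character outside the Basic Multilingual Plane

utf8_units = code_units(ch, "utf-8", 1)       # 8-bit code units
utf16_units = code_units(ch, "utf-16-le", 2)  # 16-bit code units
utf32_units = code_units(ch, "utf-32-le", 4)  # 32-bit code units

print(utf8_units, utf16_units, utf32_units)  # 4 2 1

# All three forms decode back to the same character.
for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    assert ch.encode(enc).decode(enc) == ch
```

    The same character costs a different number of code units in each form, but the set of characters each form can represent is the same.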
