How many characters can UTF-8 encode?


一个人的身影

If UTF-8 is 8 bits, doesn't that mean there can be a maximum of only 256 different characters?

The first 128 code points are the same as in ASCII. But it says UTF-


10 Answers

孤街浪徒

2020-11-28 02:24

Unicode and UTF-8 are closely tied. The Unicode code space fits in 21 bits, but it does not use all 2^21 = 2,097,152 values: it runs from U+0000 to U+10FFFF, which is 1,114,112 code points, and UTF-8 is restricted to exactly that range. Both treat the same areas as off limits, such as the surrogate range U+D800..U+DFFF, which UTF-8 may not encode. Far fewer code points are actually assigned: "...as of June 2018 the most recent version, Unicode 11.0, contains a repertoire of 137,439 characters".

From the Unicode Standard (Unicode FAQ):

The Unicode Standard encodes characters in the range U+0000..U+10FFFF, which amounts to a 21-bit code space.

From the UTF-8 Wikipedia page (Description section):

Since the restriction of the Unicode code-space to 21-bit values in 2003, UTF-8 is defined to encode code points in one to four bytes, ...
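
The limits quoted above can be probed from Python, whose `str` type works in Unicode code points. This is only a minimal sketch, not part of the original answer; the helper name `is_encodable` and the constant `MAX_CODE_POINT` are made up for the illustration.

```python
# Highest code point in the Unicode code space (U+10FFFF).
MAX_CODE_POINT = 0x10FFFF


def is_encodable(code_point: int) -> bool:
    """Return True if Python can encode this code point as UTF-8."""
    try:
        chr(code_point).encode("utf-8")
        return True
    except (ValueError, UnicodeEncodeError):
        # ValueError: chr() rejects values outside U+0000..U+10FFFF.
        # UnicodeEncodeError: surrogates (U+D800..U+DFFF) are not allowed.
        return False


print(is_encodable(MAX_CODE_POINT))              # last code point
print(is_encodable(MAX_CODE_POINT + 1))          # one past the end
print(is_encodable(0xD800))                      # first surrogate
print(len(chr(MAX_CODE_POINT).encode("utf-8")))  # bytes needed for U+10FFFF
```

The last code point still fits in four bytes, which is why one to four bytes are enough for the whole code space.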

难免孤独

2020-11-28 02:26

UTF-8 is a variable-length encoding with a minimum of 8 bits (one byte) per character.
Characters with higher code points take up to 32 bits (four bytes).
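
To make the variable length concrete, here is a small Python sketch (an illustration only; the sample characters and the helper name `utf8_bits` are chosen for this example) that prints how many bits a few characters occupy in UTF-8.

```python
# Sample characters, written as escapes so the snippet is encoding-agnostic:
# U+0041 (A), U+00E9 (e with acute), U+20AC (euro sign), U+1F600 (emoji).
samples = ["\u0041", "\u00e9", "\u20ac", "\U0001F600"]


def utf8_bits(ch: str) -> int:
    """Number of bits the character occupies when encoded as UTF-8."""
    return 8 * len(ch.encode("utf-8"))


for ch in samples:
    print(f"U+{ord(ch):04X} -> {utf8_bits(ch)} bits")
```

The four samples come out at 8, 16, 24, and 32 bits respectively.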

感情败类

2020-11-28 02:32

According to this table* UTF-8 should support:

2³¹ = 2,147,483,648 characters

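However, RFC 3629 restricted the possible values, so now we're capped at 4 bytes, which carry 21 payload bits:

2²¹ = 2,097,152 values

RFC 3629 also limits the range to U+0000..U+10FFFF, so the usable total is 1,114,112 code points, or 1,112,064 once the surrogates U+D800..U+DFFF (which UTF-8 may not encode) are excluded.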
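
The arithmetic behind these figures can be sketched in Python. This is an illustration rather than part of the original answer; the names `PAYLOAD_BITS`, `four_byte_values`, `code_points`, `surrogates`, and `scalar_values` are made up here, and the sketch assumes the U+10FFFF ceiling from RFC 3629 and the surrogate range U+D800..U+DFFF.

```python
# Payload bits carried by a UTF-8 sequence of each length.
# Lead bytes: 0xxxxxxx, 110xxxxx, 1110xxxx, 11110xxx;
# every continuation byte (10xxxxxx) contributes 6 more bits.
PAYLOAD_BITS = {1: 7, 2: 5 + 6, 3: 4 + 2 * 6, 4: 3 + 3 * 6}

# Raw capacity of the 4-byte form: 2^21 distinct values.
four_byte_values = 2 ** PAYLOAD_BITS[4]

# RFC 3629 stops at U+10FFFF, so fewer values are valid code points.
code_points = 0x10FFFF + 1

# Surrogates are code points but must not be encoded in UTF-8.
surrogates = 0xDFFF - 0xD800 + 1
scalar_values = code_points - surrogates

print(f"{four_byte_values:,} raw 21-bit values")
print(f"{code_points:,} code points up to U+10FFFF")
print(f"{scalar_values:,} encodable once surrogates are excluded")
```

So the 21-bit figure is an upper bound on what four bytes could hold, not the number of code points UTF-8 is allowed to encode.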

Note that a good chunk of those characters are "reserved" for custom use, which is actually pretty handy for icon-fonts.

* Wikipedia used to show a table with 6 bytes -- they've since updated the article.

2017-07-11: Corrected for double-counting the same code point encoded with multiple bytes

故里飘歌

2020-11-28 02:40

Check out the Unicode Standard and related information, such as their FAQ entry, UTF-8, UTF-16, UTF-32 & BOM. It’s not an easy read, but it’s authoritative information, and much of what you might read about UTF-8 elsewhere is questionable.

The “8” in “UTF-8” relates to the length of code units in bits. Code units are the entities used to encode characters, not necessarily as a simple one-to-one mapping. UTF-8 uses a variable number of code units (one to four) to encode a character.

The collection of characters that can be encoded in UTF-8 is exactly the same as for UTF-16 or UTF-32, namely all Unicode characters. They all encode the entire Unicode code space, which even includes noncharacters and unassigned code points.
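
As a rough illustration of code units versus characters, the Python sketch below counts how many code units one character needs in each encoding form and checks that it round-trips through all three. The helper names `code_units` and `round_trips` are made up for this example, and the little-endian codec variants are used only to keep a byte order mark out of the counts.

```python
# Codec name -> size of one code unit in bytes.
UNIT_SIZES = {"utf-8": 1, "utf-16-le": 2, "utf-32-le": 4}


def code_units(ch: str) -> dict:
    """Count the code units needed for ch in each encoding form."""
    return {
        codec: len(ch.encode(codec)) // size
        for codec, size in UNIT_SIZES.items()
    }


def round_trips(ch: str) -> bool:
    """Check that every encoding form can represent ch losslessly."""
    return all(ch.encode(codec).decode(codec) == ch for codec in UNIT_SIZES)


for ch in ["A", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", code_units(ch), round_trips(ch))
```

The unit counts differ between the three forms, but each one can represent the same characters.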
