How does a file with Chinese characters know how many bytes to use per character?

前端未结

I have read Joel's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" but still don'

9 Answers

情话喂你

2020-12-13 05:33

An excellent reference for this is Markus Kuhn's UTF-8 and Unicode FAQ.

余生分开走

2020-12-13 05:33

3 bytes per character in UTF-8, for most Chinese characters.
http://en.wikipedia.org/wiki/UTF-8#Description

醉话见心

2020-12-13 05:37
Code points up to 0x7f are stored as 1 byte; up to 0x7ff as 2 bytes; up to 0xffff as 3 bytes; everything else as 4 bytes. (Technically, the 4-byte form could reach 0x1fffff, but the highest code point allowed in Unicode is 0x10ffff.)
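
To make those ranges concrete, here is a small Python sketch. The function name `utf8_length` is made up for illustration; it simply applies the thresholds above and is compared against what Python's built-in `str.encode` produces.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 uses for one Unicode code point.

    Hypothetical helper: it only applies the range thresholds and does
    not reject surrogates (0xD800-0xDFFF), which cannot be encoded.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4


# Compare the computed length with the length of the real encoding.
for ch in ("A", "é", "中", "😀"):
    print(ch, hex(ord(ch)), utf8_length(ord(ch)), len(ch.encode("utf-8")))
```

The character 中 is U+4E2D, which falls in the "up to 0xffff" range, so it takes 3 bytes.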

When decoding, the first byte of the multi-byte sequence is used to determine the number of bytes used to make the sequence:
1. 110x xxxx => 2-byte sequence
2. 1110 xxxx => 3-byte sequence
3. 1111 0xxx => 4-byte sequence
All subsequent bytes in the sequence must fit the 10xx xxxx pattern.
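
The decoding rule can be sketched the same way. This is an illustration only, not a validating decoder: the names `sequence_length` and `split_characters` are made up, and the code does not reject overlong encodings or lead bytes above 0xF4. A leading byte of the form 0xxx xxxx (plain ASCII) is treated as a 1-byte sequence.

```python
def sequence_length(lead_byte: int) -> int:
    """Return the sequence length announced by the first byte."""
    if lead_byte & 0b1000_0000 == 0b0000_0000:  # 0xxx xxxx
        return 1
    if lead_byte & 0b1110_0000 == 0b1100_0000:  # 110x xxxx
        return 2
    if lead_byte & 0b1111_0000 == 0b1110_0000:  # 1110 xxxx
        return 3
    if lead_byte & 0b1111_1000 == 0b1111_0000:  # 1111 0xxx
        return 4
    raise ValueError("continuation byte or invalid lead byte")


def split_characters(data: bytes) -> list:
    """Split UTF-8 bytes into one bytes object per character."""
    chunks = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        chunk = data[i:i + n]
        # Every byte after the first must match 10xx xxxx.
        if len(chunk) != n or any(b & 0b1100_0000 != 0b1000_0000 for b in chunk[1:]):
            raise ValueError("truncated or malformed sequence")
        chunks.append(chunk)
        i += n
    return chunks


print(split_characters("a中😀".encode("utf-8")))
```

So the file does not store a per-character length anywhere; the reader works it out from the top bits of each first byte.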