How does a file with Chinese characters know how many bytes to use per character?

误落风尘 2020-12-13 05:05

I have read Joel's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" but still don'

9 answers
  • 2020-12-13 05:33

    An excellent reference for this is Markus Kuhn's UTF-8 and Unicode FAQ.

  • 2020-12-13 05:33

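    In UTF-8, most Chinese characters take 3 bytes each: they lie between U+0800 and U+FFFF, which is the 3-byte range. Rarer characters above U+FFFF take 4 bytes.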
    http://en.wikipedia.org/wiki/UTF-8#Description

  • 2020-12-13 05:37

    In UTF-8, code points up to 0x7F are stored as 1 byte; up to 0x7FF as 2 bytes; up to 0xFFFF as 3 bytes; everything else as 4 bytes. (Technically, a 4-byte sequence could hold values up to 0x1FFFFF, but the highest code point allowed in Unicode is 0x10FFFF.)
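The ranges above can be sketched in a few lines of Python. This is only an illustration: `utf8_length` is a made-up helper name, not a standard library function, and the loop simply cross-checks it against Python's built-in `str.encode`.

```python
def utf8_length(code_point):
    """Return how many bytes UTF-8 uses for one Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode range")
    if code_point <= 0x7F:
        return 1  # ASCII
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3  # most Chinese characters land here
    return 4

# Cross-check against Python's own encoder for a few sample characters:
# U+0041 (A), U+00E9 (e with acute), U+4E2D (Chinese "middle"), U+1F600 (emoji).
for ch in ["A", "\u00e9", "\u4e2d", "\U0001F600"]:
    assert utf8_length(ord(ch)) == len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X} -> {utf8_length(ord(ch))} byte(s)")
```

So the file does not store a per-character byte count anywhere; the count follows from the code point's value.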

    When decoding, the first byte of the multi-byte sequence is used to determine the number of bytes used to make the sequence:

    1. 110x xxxx => 2-byte sequence
    2. 1110 xxxx => 3-byte sequence
    3. 1111 0xxx => 4-byte sequence

    All subsequent bytes in the sequence must fit the 10xx xxxx pattern.
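As a rough sketch of that decoding rule, the Python below reads the lead byte to find the sequence length and then checks that the following bytes match 10xx xxxx. The names `sequence_length` and `split_characters` are hypothetical helpers for illustration, and the validation is deliberately minimal (it does not reject overlong encodings or surrogates, which a real decoder must).

```python
def sequence_length(lead_byte):
    """Number of bytes in the UTF-8 sequence that starts with lead_byte."""
    if (lead_byte & 0b10000000) == 0b00000000:  # 0xxx xxxx
        return 1
    if (lead_byte & 0b11100000) == 0b11000000:  # 110x xxxx
        return 2
    if (lead_byte & 0b11110000) == 0b11100000:  # 1110 xxxx
        return 3
    if (lead_byte & 0b11111000) == 0b11110000:  # 1111 0xxx
        return 4
    raise ValueError("not a valid lead byte")

def split_characters(data):
    """Split UTF-8 bytes into one bytes object per character."""
    chunks = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        chunk = data[i:i + n]
        # Every byte after the lead byte must look like 10xx xxxx.
        if len(chunk) != n or any((b & 0b11000000) != 0b10000000 for b in chunk[1:]):
            raise ValueError("truncated or malformed sequence")
        chunks.append(chunk)
        i += n
    return chunks

# "a" followed by U+4E2D, a common Chinese character.
print(split_characters("a\u4e2d".encode("utf-8")))
```

A continuation byte (10xx xxxx) can never be mistaken for a lead byte, which is why a reader can find character boundaries without any external length information.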
