Are 6 octet UTF-8 sequences valid?

不思量自难忘° 2021-01-05 00:46

Can UTF-8 encode 5- or 6-byte sequences, allowing all Unicode characters to be encoded? I'm getting conflicting information from different standards. I need to be able to support every Unico

3 answers
  • 2021-01-05 00:59

    There are no Unicode characters beyond U+10FFFF; the BMP covers U+0000 through U+FFFF.

    UTF-8 is well-defined for U+0000 through U+10FFFF.

  • 2021-01-05 01:06

    Both UTF-8 and UTF-16 allow all Unicode characters to be encoded. What UTF-8 is not allowed to do is encode the high and low surrogate halves (U+D800 through U+DFFF, which UTF-16 uses) or values above U+10FFFF, neither of which are legal Unicode scalar values.

    Note that the BMP ends at U+FFFF.
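
    The rule above can be sketched in a few lines of Python. This is only an illustration: `is_unicode_scalar_value` and `python_can_encode` are hypothetical helper names, and the second one simply asks Python's built-in strict UTF-8 codec for its opinion.

```python
def is_unicode_scalar_value(code_point: int) -> bool:
    """Hypothetical helper: True if UTF-8 is allowed to encode code_point."""
    in_range = 0 <= code_point <= 0x10FFFF
    is_surrogate = 0xD800 <= code_point <= 0xDFFF
    return in_range and not is_surrogate


def python_can_encode(code_point: int) -> bool:
    """Ask Python's strict UTF-8 codec whether it accepts the code point."""
    try:
        chr(code_point).encode("utf-8")
    except ValueError:
        # chr() raises ValueError above 0x10FFFF; encode() raises
        # UnicodeEncodeError (a ValueError subclass) for surrogates.
        return False
    return True


samples = [0x41, 0xFFFF, 0x10000, 0x10FFFF, 0xD800, 0xDFFF, 0x110000]
for cp in samples:
    print(f"U+{cp:04X}: rule={is_unicode_scalar_value(cp)}, "
          f"codec={python_can_encode(cp)}")
```

    For every sample the two checks should agree: the surrogates and the value just past U+10FFFF are refused, everything else is accepted.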

  • 2021-01-05 01:16

    I would have to say no: Unicode code points are valid in the range [0, 0x10FFFF], and those map to 1-4 octets. So, if you did come across a 5- or 6-octet UTF-8 sequence, it would not encode a valid code point; there's certainly nothing assigned there. I am a little baffled as to why the longer forms are in the ISO standard; I couldn't find an explanation.
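
    To make the 1-4 octet mapping concrete, here is a minimal Python sketch. `utf8_octet_count` and `decodes_as_utf8` are hypothetical names; the 5- and 6-octet byte strings are hand-built using the old extended bit pattern, purely to show that a strict decoder (here, Python's built-in one) refuses them.

```python
def utf8_octet_count(code_point: int) -> int:
    """Hypothetical helper: octets needed to encode code_point in UTF-8."""
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates cannot be encoded in UTF-8")
    if 0x00 <= code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    if code_point <= 0x10FFFF:
        return 4
    raise ValueError("outside the Unicode range [0, 0x10FFFF]")


def decodes_as_utf8(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The range boundaries agree with what Python's encoder actually emits.
for cp in (0x00, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
    assert utf8_octet_count(cp) == len(chr(cp).encode("utf-8"))

# Hand-built sequences with 5- and 6-octet lead bytes (0xF8, 0xFC).
five_octets = bytes([0xF8, 0x88, 0x80, 0x80, 0x80])
six_octets = bytes([0xFC, 0x84, 0x80, 0x80, 0x80, 0x80])
print(decodes_as_utf8(five_octets), decodes_as_utf8(six_octets))
```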

    It does make you wonder, however, whether someday they might expand past U+10FFFF. 0x10FFFF allows for over a million characters, but there are a lot of characters out there, and it would depend on how much eventually gets encoded. (For sanity's sake, let's hope not; a million characters is a lot!) UTF-32 could handle more code points, and as you've discovered, UTF-8 could too. It'd really be UTF-16 that's out of luck: more surrogate pairs would be needed somewhere in the spectrum of code points.
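
    The arithmetic behind that comparison can be checked quickly. The sketch below assumes the usual UTF-8 bit layout (a lead byte followed by continuation bytes carrying 6 bits each) extended to 6 octets as in the older definition of UTF-8; the function names are made up for illustration.

```python
HIGH_SURROGATES = range(0xD800, 0xDC00)  # 1024 high (lead) surrogates
LOW_SURROGATES = range(0xDC00, 0xE000)   # 1024 low (trail) surrogates


def utf16_max_code_point() -> int:
    """Largest code point reachable with one UTF-16 surrogate pair."""
    pairs = len(HIGH_SURROGATES) * len(LOW_SURROGATES)
    # Pairs address the code points starting at 0x10000.
    return 0x10000 + pairs - 1


def utf8_payload_bits(octets: int) -> int:
    """Payload bits carried by a UTF-8-style sequence of the given length."""
    if not 1 <= octets <= 6:
        raise ValueError("the bit pattern only defines 1 to 6 octets")
    if octets == 1:
        return 7
    # The lead byte keeps (7 - octets) bits; each continuation byte keeps 6.
    return (7 - octets) + 6 * (octets - 1)


for n in range(1, 7):
    bits = utf8_payload_bits(n)
    print(f"{n} octet(s): {bits} bits, up to 0x{(1 << bits) - 1:X}")
print(f"UTF-16 ceiling: 0x{utf16_max_code_point():X}")
```

    So the 4-octet form already has room for more than U+10FFFF, and the 6-octet pattern would reach 31 bits, whereas a single surrogate pair tops out exactly at U+10FFFF.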
