If UTF-8 is an 8-bit encoding, why does it need 1-4 bytes?

Front-end · Unresolved · 3 answers · 767 views

北海茫月 2020-12-17 19:24

On the Unicode site it's written that UTF-8 can be represented by 1-4 bytes. As I understand from this question https://softwareengineering.stackexchange.com/questions/7775

3 Answers
  •  醉梦人生
    2020-12-17 19:42

    The "8-bit" in the name means that the individual bytes (code units) of the encoding are 8 bits wide. By contrast, pure ASCII is a 7-bit encoding, since it only has code points 0-127. Software used to have problems with 8-bit encodings; one of the reasons for the Base64 and uuencode encodings was to get binary data through email systems that could not handle 8-bit data. However, it has been a decade or more since that was an acceptable limitation: software is now expected to be 8-bit clean, that is, capable of handling 8-bit encodings.
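    As an illustration of that point (a Python sketch; any language with a Base64 library behaves the same way), encoding arbitrary 8-bit bytes yields output that is pure 7-bit ASCII, which survives a channel that mangles bytes above 0x7F:

    ```python
    import base64

    # Arbitrary 8-bit data; bytes >= 0x80 would be unsafe on a 7-bit channel.
    raw = bytes([0x00, 0x7F, 0x80, 0xFF])

    encoded = base64.b64encode(raw)
    print(encoded)  # every byte of the output is printable 7-bit ASCII

    # All output bytes fit in 7 bits, and decoding round-trips losslessly.
    assert all(b < 0x80 for b in encoded)
    assert base64.b64decode(encoded) == raw
    ```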

    Unicode itself is a 21-bit character set. There are a number of encodings for it:

    • UTF-32 where each Unicode code point is stored in a 32-bit integer
    • UTF-16 where many Unicode code points are stored in a single 16-bit integer, but some need two 16-bit integers (so it needs 2 or 4 bytes per Unicode code point).
    • UTF-8 where Unicode code points can require 1, 2, 3 or 4 bytes to store a single Unicode code point.
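    You can observe the 1-4 byte range directly. A quick check in Python (chosen just for illustration) with one character from each length class:

    ```python
    # Each character below needs a different number of bytes in UTF-8:
    # ASCII (1), Latin-1 supplement (2), BMP beyond U+07FF (3), outside the BMP (4).
    for ch in ("A", "é", "€", "🙂"):
        encoded = ch.encode("utf-8")
        print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")
    ```

    This prints 1 byte for U+0041, 2 for U+00E9, 3 for U+20AC, and 4 for U+1F642.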

    So, "UTF-8 can be represented by 1-4 bytes" is probably not the most appropriate way of phrasing it. "Unicode code points can be represented by 1-4 bytes in UTF-8" would be more appropriate.
