If UTF-8 is an 8-bit encoding, why does it need 1-4 bytes?

Front-end · Unresolved · 3 answers · 767 views

北海茫月 2020-12-17 19:24

On the Unicode site it's written that UTF-8 can be represented by 1-4 bytes. As I understand from this question https://softwareengineering.stackexchange.com/questions/7775

3 Answers
  •  醉梦人生
    2020-12-17 19:42

    The "8-bit" in the name means that the individual bytes (code units) of the encoding are 8 bits wide. By contrast, pure ASCII is a 7-bit encoding, since it only has code points 0-127. Software used to have problems with 8-bit encodings; one of the reasons for the Base64 and uuencode encodings was to get binary data through email systems that could not handle 8-bit data. However, it has been a decade or more since that was an acceptable limitation: software is now expected to be 8-bit clean, that is, capable of handling 8-bit encodings.
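    As an illustration of that point (a Python sketch; any language with a Base64 library behaves the same way), encoding arbitrary 8-bit bytes yields output that is pure 7-bit ASCII, which survives a channel that mangles bytes above 0x7F:

    ```python
    import base64

    # Arbitrary 8-bit data; bytes >= 0x80 would be unsafe on a 7-bit channel.
    raw = bytes([0x00, 0x7F, 0x80, 0xFF])

    encoded = base64.b64encode(raw)
    print(encoded)  # every byte of the output is printable 7-bit ASCII

    # All output bytes fit in 7 bits, and decoding round-trips losslessly.
    assert all(b < 0x80 for b in encoded)
    assert base64.b64decode(encoded) == raw
    ```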

    Unicode itself is a 21-bit character set. There are a number of encodings for it:

    • UTF-32 where each Unicode code point is stored in a 32-bit integer
    • UTF-16 where many Unicode code points are stored in a single 16-bit integer, but some need two 16-bit integers (so it needs 2 or 4 bytes per Unicode code point).
    • UTF-8 where Unicode code points can require 1, 2, 3 or 4 bytes to store a single Unicode code point.
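    You can observe the 1-4 byte range directly. A quick check in Python (chosen just for illustration) with one character from each length class:

    ```python
    # Each character below needs a different number of bytes in UTF-8:
    # ASCII (1), Latin-1 supplement (2), BMP beyond U+07FF (3), outside the BMP (4).
    for ch in ("A", "é", "€", "🙂"):
        encoded = ch.encode("utf-8")
        print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")
    ```

    This prints 1 byte for U+0041, 2 for U+00E9, 3 for U+20AC, and 4 for U+1F642.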

    So, "UTF-8 can be represented by 1-4 bytes" is probably not the most appropriate way of phrasing it. "Unicode code points can be represented by 1-4 bytes in UTF-8" would be more appropriate.
