Text encoded in UTF-8 will never be more than 50% larger than the same text encoded in UTF-16. True or false?


攒了一身酷 2021-02-06 15:41

Somewhere I read (rephrased):

If we compare a UTF-8 encoded file with a UTF-16 encoded file, at times the UTF-8 file may give a 50% to 100% larger fil

4 Answers

深忆病人 (original poster)

2021-02-06 16:34

In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, or 4 bytes. (The original design allowed sequences of up to 6 bytes, but RFC 3629 restricts UTF-8 to 4.)

Although a UTF-8 character may use up to 4 bytes (and more was possible in the original design), 4-byte sequences are not needed for the Basic Multilingual Plane, which includes "almost all modern languages".

Three bytes are needed for the rest of the Basic Multilingual Plane (which contains virtually all characters in common use). Four bytes are needed for characters in the other planes of Unicode, which include less common CJK characters and various historic scripts.

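So I guess a 100% overhead is not reachable with real text. The worst case is a character in U+0800 through U+FFFF, which takes 3 bytes in UTF-8 against 2 in UTF-16, i.e. 50% more. Even something exotic from the Supplementary Multilingual Plane, which uses 4 bytes in UTF-8, does not change this, because UTF-16 also needs 4 bytes (a surrogate pair) for it.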
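The per-character costs are easy to check yourself. Here is a minimal sketch in Python using only the built-in `str.encode`; the helper name `byte_costs` and the four sample characters are my own picks for illustration, and `utf-16-le` is used so that no byte order mark is counted.

```python
def byte_costs(ch: str) -> tuple:
    """Return (UTF-8 bytes, UTF-16 bytes) for one character, without a BOM."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

# One sample character per UTF-8 length class (arbitrary choices).
samples = [
    ("\u0041", "U+0041 (ASCII)"),
    ("\u00e9", "U+00E9 (2-byte range)"),
    ("\u4e2d", "U+4E2D (BMP, 3-byte range)"),
    ("\U0001F600", "U+1F600 (supplementary plane)"),
]

for ch, label in samples:
    u8, u16 = byte_costs(ch)
    print(f"{label}: UTF-8 = {u8}, UTF-16 = {u16}, ratio = {u8 / u16:.2f}")

# U+0041 (ASCII): UTF-8 = 1, UTF-16 = 2, ratio = 0.50
# U+00E9 (2-byte range): UTF-8 = 2, UTF-16 = 2, ratio = 1.00
# U+4E2D (BMP, 3-byte range): UTF-8 = 3, UTF-16 = 2, ratio = 1.50
# U+1F600 (supplementary plane): UTF-8 = 4, UTF-16 = 4, ratio = 1.00
```

The largest ratio in the table is 1.50, for the 3-byte BMP range; the 4-byte case ties because UTF-16 needs a surrogate pair there.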

For HTML documents or mixed text it may not be necessary to switch to UTF-16 to save space:

Characters U+0800 through U+FFFF use three bytes in UTF-8, but only two in UTF-16. As a result, text in (for example) Chinese, Japanese or Hindi could take more space in UTF-8 if there are more of these characters than there are ASCII characters. This happens for pure text, but rarely for HTML documents. For example, both the Japanese UTF-8 and the Hindi Unicode articles on Wikipedia take more space if saved as UTF-16 than the original UTF-8 version.
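To see why markup tips the balance, here is a rough sketch comparing pure Japanese text with a small HTML fragment. Both strings are made up for illustration (they are not taken from the Wikipedia articles mentioned above), and `sizes` is just a hypothetical helper name.

```python
def sizes(text: str) -> tuple:
    """Return (UTF-8 bytes, UTF-16 bytes) for a string, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

# Pure text: every character lies in U+0800..U+FFFF (3 bytes vs 2).
plain = "日本語のテキスト" * 10

# Made-up HTML fragment: a little Japanese wrapped in ASCII markup.
html = ('<div class="content"><p lang="ja">'
        '<a href="/wiki/page">日本語</a></p></div>')

for name, text in [("plain", plain), ("html", html)]:
    u8, u16 = sizes(text)
    winner = "UTF-8" if u8 < u16 else "UTF-16"
    print(f"{name}: UTF-8 = {u8} bytes, UTF-16 = {u16} bytes, smaller: {winner}")
```

For the pure text UTF-16 comes out smaller (160 bytes against 240), while for the fragment the ASCII tags and attributes outnumber the Japanese characters, so UTF-8 wins.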

See the UTF-8 to UTF-16 comparison on Wikipedia.

Joel Spolsky wrote a great article about Unicode that I can really recommend:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
