Are UTF-16 characters (as used by, for example, the wide WinAPI functions) always 2 bytes long?


半阙折子戏 2021-02-09 06:23

Please clarify for me: how does UTF-16 work? I am a little confused, considering these points:

There is a static type in C++, WCHAR, ~~which is 2 bytes long. (alway~~

8 answers

既然无缘 (OP)

2021-02-09 06:46

Short story: UTF-16 is a variable-length encoding. A single character (code point) takes one or two wide chars (16-bit code units), i.e. 2 or 4 bytes.
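To make the "one or two" rule concrete, here is a minimal sketch of how a code point maps to UTF-16. The helper name `encode_utf16` is made up for illustration (it is not a WinAPI or standard library function), and it uses `char16_t` rather than `WCHAR` so that it does not depend on Windows headers.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encode one Unicode code point as UTF-16 code units.
// encode_utf16 is a hypothetical helper, not part of any real API.
std::u16string encode_utf16(char32_t cp) {
    // Valid input: at most U+10FFFF and not itself a surrogate value.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::u16string out;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: a single 16-bit unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Above U+FFFF: split the remaining 20 bits into a surrogate pair.
        const char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
    return out;
}
```

With this sketch, `encode_utf16(0x41)` yields one unit, while `encode_utf16(0x1F600)` yields the pair `0xD83D 0xDE00`.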

HOWEVER, you may very well get away with treating it as a fixed-length encoding where every character is one wide char (2 bytes). This fixed-width form is formally called UCS-2, and it was Win32's assumption up to and including Windows NT 4; Windows 2000 and later treat wide strings as UTF-16. UCS-2 covers the Basic Multilingual Plane (U+0000 to U+FFFF), which includes practically all scripts in everyday use today. And truth be told, working with variable-length encoded strings just sucks. Finding the n-th character becomes an O(n) operation, string length (in characters) is not the same as string size (in code units), etc. Any sensible parsing becomes a pain.
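The length-versus-size mismatch and the O(n) lookup can be sketched as follows. The helper names `unit_len`, `count_code_points` and `code_point_at` are invented for this example, and the code assumes the input is mostly well-formed (an unpaired surrogate is simply counted as one character).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

inline bool is_high(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool is_low(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Number of code units used by the character that starts at s[i].
std::size_t unit_len(const std::u16string& s, std::size_t i) {
    return (is_high(s[i]) && i + 1 < s.size() && is_low(s[i + 1])) ? 2 : 1;
}

// "String length": number of code points, not code units.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); i += unit_len(s, i)) ++n;
    return n;
}

// Fetch the index-th character. This is O(n): we must walk from the start
// because earlier characters may occupy one or two units.
char32_t code_point_at(const std::u16string& s, std::size_t index) {
    assert(index < count_code_points(s));
    std::size_t i = 0;
    while (index-- > 0) i += unit_len(s, i);
    if (unit_len(s, i) == 2) {
        return 0x10000 + ((char32_t(s[i]) - 0xD800) << 10) + (char32_t(s[i + 1]) - 0xDC00);
    }
    return s[i];
}
```

For the string `u"a\U0001F600b"`, `size()` reports 4 code units, while `count_code_points` reports 3 characters.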

As for the UTF-16 characters that are not in UCS-2 (those above U+FFFF, encoded as surrogate pairs)... I only know two subsets that come up in real life. The first is emoji: the graphical smileys that originated in Japanese cell phone culture. Most of them, for example U+1F600, live outside the Basic Multilingual Plane, and software that assumes UCS-2 tends to mangle them. The other class is VERY obscure Chinese characters, the ones even most Chinese don't know (the CJK extension blocks from U+20000 upward). All the popular Chinese characters are well inside UCS-2.
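If you do want to rely on the fixed-width shortcut, one option is to check first that a string contains no surrogates at all. This is only a sketch under that assumption; `fits_in_ucs2` and `ucs2_at` are hypothetical names, not existing functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// True if every character fits in one 16-bit unit, i.e. the string is plain UCS-2.
bool fits_in_ucs2(const std::u16string& s) {
    return std::none_of(s.begin(), s.end(),
                        [](char16_t u) { return u >= 0xD800 && u <= 0xDFFF; });
}

// O(1) character access, valid only for strings that passed the check above.
char16_t ucs2_at(const std::u16string& s, std::size_t n) {
    assert(fits_in_ucs2(s) && n < s.size());
    return s[n];
}
```

Common Chinese text such as `u"\u4E2D\u6587"` passes the check; a string containing U+20000 or U+1F600 does not.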
