wchar_t for UTF-16 on Linux?


While it's possible to store UTF-16 in wchar_t, such wchar_t values (or arrays of them used as strings) are not suitable for use with any of the standard functions that take wchar_t or pointers to wchar_t strings. As such, to answer your initial question of "Does it make sense...?", I would reply with a definitive no. You could, of course, use uint16_t for this purpose, or the C11 char16_t if it's available, though I fail to see any reason why the latter would be preferable unless you're also going to use the C11 functions for processing it (and they don't seem to be implemented yet).
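As a minimal sketch of that distinction (assuming Linux/glibc, where wchar_t is 4 bytes), the snippet below keeps UTF-16 code units in char16_t and uint16_t arrays; the element types alone already rule out the standard wcs* functions:

    // Sketch only: char16_t / uint16_t hold UTF-16 code units; wchar_t on
    // Linux/glibc is 4 bytes and the wcs* functions expect wide characters,
    // not UTF-16 code units.
    #include <cstdint>
    #include <cstdio>

    int main() {
        // On Linux/glibc this prints 4: each wchar_t holds a whole code point.
        std::printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t));

        // U+1F600 needs a surrogate pair in UTF-16: two 16-bit code units.
        char16_t      a[] = { 0xD83D, 0xDE00, 0 };  // C11/C++11 char16_t
        std::uint16_t b[] = { 0xD83D, 0xDE00, 0 };  // plain uint16_t works too

        // Neither array can be passed to wcslen()/wcscpy() - the element type
        // differs - which is exactly why wchar_t strings are not UTF-16 strings.
        (void)a; (void)b;
        return 0;
    }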

http://userguide.icu-project.org/strings says

The Unicode standard defines a default encoding based on 16-bit code units. This is supported in ICU by the definition of the UChar to be an unsigned 16-bit integer type. This is the base type for character arrays for strings in ICU.

So if you use ICU, then you can use UChar*. If not, uint16_t will make the transition easier should you ever want to interoperate with UChar.
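A minimal sketch of the ICU side, assuming ICU4C is installed and the program is linked against libicuuc: UnicodeString keeps its contents as UTF-16 and hands back UChar code units.

    #include <unicode/unistr.h>
    #include <cstdio>

    int main() {
        // Build a string from UTF-8; internally it is stored as UTF-16.
        icu::UnicodeString s =
            icu::UnicodeString::fromUTF8("caf\xC3\xA9 \xF0\x9F\x98\x80");

        const UChar* units = s.getBuffer();   // read-only UTF-16 code units
        std::printf("code units:  %d\n", static_cast<int>(s.length()));
        std::printf("code points: %d\n", static_cast<int>(s.countChar32()));
        (void)units;
        return 0;
    }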

Well, the best solution is probably to use char16_t for UTF-16, since that is the standard 16-bit character type. It has been supported since GCC 4.4, so it should be present on most Linux systems you'll see.
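A small sketch, assuming a compiler in C++11 mode (char16_t itself dates to GCC 4.4; the u"" literal and std::u16string may need a slightly newer GCC and -std=c++0x or -std=c++11):

    #include <string>
    #include <cstdio>

    int main() {
        // The compiler encodes u"" literals as UTF-16.
        std::u16string s = u"caf\u00E9 \U0001F600";

        // size() counts 16-bit code units, not code points: the U+1F600
        // emoji contributes a surrogate pair, so it adds 2 here.
        std::printf("UTF-16 code units: %zu\n", s.size());
        return 0;
    }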

No; it makes more sense to decode the UTF-16 and store the result in an array of wchar_t. Not every Unicode code point fits in a single 16-bit UTF-16 code unit (some require surrogate pairs), but every code point fits in a wchar_t, which is 32 bits wide on Linux.
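A sketch of that approach, assuming a 32-bit wchar_t as on Linux/glibc; utf16_to_wchar is a made-up helper name, and lone surrogates are passed through unchanged rather than reported as errors:

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    std::vector<wchar_t> utf16_to_wchar(const char16_t* in, std::size_t n) {
        std::vector<wchar_t> out;
        for (std::size_t i = 0; i < n; ++i) {
            char16_t u = in[i];
            if (u >= 0xD800 && u <= 0xDBFF && i + 1 < n &&
                in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                // High surrogate + low surrogate -> one code point >= U+10000.
                out.push_back(0x10000 + ((u - 0xD800) << 10) + (in[i + 1] - 0xDC00));
                ++i;                                      // consumed two code units
            } else {
                out.push_back(static_cast<wchar_t>(u));   // BMP code point
            }
        }
        return out;
    }

    int main() {
        const char16_t s[] = u"A\U0001F600";              // 3 code units + NUL
        std::vector<wchar_t> w = utf16_to_wchar(s, 3);
        std::printf("3 UTF-16 code units -> %zu wchar_t values\n", w.size());
        return 0;
    }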

In any case, UTF-16 is a worse compromise than anything else, and should never be used. Either use UTF-8 (which is more efficient in most cases, and more commonly used), or use wchar_t[].
