Using iconv with WCHAR_T on Linux

Submitted by 扶醉桌前 on 2020-06-01 06:20:06

Question


I have the following code on Linux:

rc = iconv_open("WCHAR_T", SourceCode);

prior to using iconv to convert the data into a wide character string (wchar_t).

I am trying to understand what it achieves, in order to port it to a platform where "WCHAR_T" is not accepted as the first parameter of iconv_open.

This leads to sub-questions such as:

  • Is there a single representation of wchar_t on Linux?
  • What codepage does this use? I imagine maybe UTF-32
  • Does it rely on any locale settings to achieve this?

I'm hoping for an answer that says something like: "The code you show is shorthand for doing the following 2 things instead...." and then I might be able to do those two steps instead of the shorthand on the platform where "WCHAR_T" option on iconv_open doesn't exist.


Answer 1:


The reason the (non-standard) WCHAR_T encoding exists is to make it easy to cast a pointer to wchar_t into a pointer to char and use it with iconv. The format understood by that encoding is whatever the system's native wchar_t is.

If you're asking about glibc and not other libc implementations, then on Linux wchar_t is a 32-bit type in the system's native endianness, and each value is a Unicode code point. This is not the same as the encoding named UTF-32, since UTF-32 normally starts with a byte-order mark (BOM) and, when it does not, is treated as big-endian. WCHAR_T is always native-endian and has no BOM.
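To check this on a given system, a minimal sketch along the following lines could dump the size and raw bytes of a single wchar_t. The helper names wchar_bytes and print_wchar_layout are made up for this illustration, and U+20AC is just an arbitrary test character.

```c
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: copies the raw in-memory bytes of one wchar_t into
   out (which must hold at least sizeof(wchar_t) bytes); returns the count. */
static size_t wchar_bytes(wchar_t wc, unsigned char *out)
{
    memcpy(out, &wc, sizeof wc);
    return sizeof wc;
}

/* Hypothetical helper: prints the size of wchar_t and the byte layout of
   U+20AC as stored in memory. */
static void print_wchar_layout(void)
{
    unsigned char raw[sizeof(wchar_t)];
    size_t n = wchar_bytes((wchar_t)0x20AC, raw);

#ifdef __STDC_ISO_10646__
    /* Defined when wchar_t values are ISO 10646 (Unicode) code points. */
    printf("__STDC_ISO_10646__ = %ld\n", (long)__STDC_ISO_10646__);
#endif
    printf("sizeof(wchar_t) = %zu\n", n);
    for (size_t i = 0; i < n; i++)
        printf("%02X ", raw[i]);
    putchar('\n');
}
```

Calling print_wchar_layout() from main on a little-endian glibc system should show a size of 4 and the bytes AC 20 00 00, i.e. the code point stored in native byte order with no BOM in front of it.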

Note that some systems use different semantics for wchar_t. On Windows, wchar_t is always a 16-bit type holding little-endian UTF-16 code units. If you used GNU libiconv on that platform, the WCHAR_T encoding would be different than if you ran it on Linux.

On glibc, locale settings do not affect wchar_t: its size must be known at compile time, so it cannot practically vary based on locale, and glibc stores Unicode code points in it regardless of the locale in effect.

If this piece of code is indeed casting a pointer to wchar_t to a pointer to char and using that in its call to iconv, then you need to adjust the code to use one of the encodings UTF-16LE, UTF-16BE, UTF-32LE, or UTF-32BE, depending on sizeof(wchar_t) and the platform's endianness. Those encodings neither require nor allow a BOM, and assuming you're not using a PDP-11, one of them will be correct for your platform.
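As a rough sketch of that substitution, the encoding name could be chosen at run time from sizeof(wchar_t) and a byte-order probe, and then passed to iconv_open in place of "WCHAR_T". The function names host_is_little_endian, wchar_encoding_name and open_to_wchar are invented for this example, and it assumes the target iconv accepts these four encoding names and that wchar_t holds Unicode code points (or UTF-16 code units when it is 16 bits wide).

```c
#include <iconv.h>
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Returns 1 on a little-endian host, 0 on a big-endian one. */
static int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    return *(const unsigned char *)&probe == 1;
}

/* Hypothetical helper: names the explicit encoding that matches the native
   wchar_t layout, or returns NULL if wchar_t is neither 2 nor 4 bytes. */
static const char *wchar_encoding_name(void)
{
    int le = host_is_little_endian();

    switch (sizeof(wchar_t)) {
    case 4:
        return le ? "UTF-32LE" : "UTF-32BE";
    case 2:
        return le ? "UTF-16LE" : "UTF-16BE";
    default:
        return NULL;
    }
}

/* Hypothetical stand-in for iconv_open("WCHAR_T", source_code).
   Returns (iconv_t)-1 on failure, like iconv_open itself. */
static iconv_t open_to_wchar(const char *source_code)
{
    const char *to = wchar_encoding_name();

    if (to == NULL)
        return (iconv_t)-1;
    return iconv_open(to, source_code);
}
```

The line from the question would then become rc = open_to_wchar(SourceCode);, with the rest of the iconv calls left unchanged. On platforms where iconv lives in a separate library, linking with -liconv may be needed.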

If you're getting the data from some other source, then you need to figure out what that is, and use the appropriate encoding from the list above for it. You should also probably send a patch upstream and ask the maintainer to use a different, more correct encoding for handling their data format.



Source: https://stackoverflow.com/questions/62032729/using-iconv-with-wchar-t-on-linux
