I'm writing some unit tests which are going to verify our handling of various resources that use character sets other than the normal Latin alphabet: Cyrillic, Hebrew, and so on.
A tedious but portable way is to build your strings using numeric escape codes. For example:
wchar_t *string = L"דונדארןמע";
becomes:
wchar_t *string = L"\x05d3\x05d5\x05e0\x05d3\x05d0\x05e8\x05df\x05de\x05e2";
You have to convert all your Unicode characters to numeric escapes. That way your source code becomes encoding-independent.
You can use online tools for the conversion. Many of them output the JavaScript escape format \uXXXX, so just search & replace \u with \x to get the C format.
You have to tell GCC which encoding your source file uses to encode those characters. Use the option -finput-charset=charset, for example -finput-charset=UTF-8. Then you need to tell it the encoding to use for wide string literals at runtime; that determines the values of the wchar_t items in the strings. You set that encoding with -fwide-exec-charset=charset, for example -fwide-exec-charset=UTF-32. Beware that the size of a code unit in that encoding (UTF-32 needs 32 bits, UTF-16 needs 16 bits) must not exceed the size of wchar_t that GCC uses.
You can adjust the width of wchar_t with the option -fshort-wchar, which makes it 16 bits instead of the 32 bits that is its usual width for GCC on Linux. That option is mainly useful for compiling programs for Wine, designed to be compatible with Windows.
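For instance, a quick sanity check could look like the sketch below; the file name hebrew.cpp and the exact command lines are only illustrative, and depending on your iconv you may need an endianness-qualified name such as UTF-32LE:

// hebrew.cpp -- save this file as UTF-8.
//
// Default 32-bit wchar_t on Linux:
//   g++ -finput-charset=UTF-8 -fwide-exec-charset=UTF-32 hebrew.cpp
//
// 16-bit wchar_t, e.g. when building for Wine/Windows compatibility:
//   g++ -finput-charset=UTF-8 -fshort-wchar -fwide-exec-charset=UTF-16 hebrew.cpp
#include <cstdio>

int main() {
    const wchar_t *s = L"דונדארןמע";
    std::printf( "sizeof(wchar_t) = %zu\n", sizeof(wchar_t) );
    std::printf( "first code point = 0x%04lx\n", (unsigned long)s[0] );  // expect 0x05d3
    return 0;
}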
Those options are described in more detail in man gcc, the GCC manpage.
#define UNICODE_CONSTANT( CONSTANT ) towstring( CONSTANT )

wstring towstring( LPCSTR lpszValue ) {
    // Widens each narrow character individually via the stream's locale;
    // this is not a real encoding conversion.
    wostringstream os;
    os << lpszValue;
    return os.str();
}
This does not actually convert between Unicode encodings at all; that requires a dedicated routine. You need to keep your source code and data encodings unified (most people use UTF-8) and then convert to the OS-specific encoding if necessary, such as UTF-16 on Windows.
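As an illustration of such a dedicated routine, here is a rough sketch that decodes UTF-8 into a wstring via the C locale machinery. It assumes the process runs under a UTF-8 locale and keeps error handling minimal; on Windows you would typically call MultiByteToWideChar with CP_UTF8 instead:

#include <clocale>
#include <cstdlib>
#include <stdexcept>
#include <string>

using namespace std;

wstring utf8_to_wstring( const string& utf8 ) {
    // Ask mbstowcs for the required length first (terminating null excluded).
    size_t len = mbstowcs( NULL, utf8.c_str(), 0 );
    if ( len == (size_t)-1 )
        throw runtime_error( "invalid multibyte sequence for the current locale" );

    wstring result( len, L'\0' );
    mbstowcs( &result[0], utf8.c_str(), len );  // fill exactly len wide characters
    return result;
}

int main() {
    setlocale( LC_ALL, "" );  // use the environment's locale, assumed here to be UTF-8
    wstring w = utf8_to_wstring( "\xd7\x93\xd7\x95\xd7\xa0" );  // "דונ" encoded as UTF-8
    return w.size() == 3 ? 0 : 1;
}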