How can I embed unicode string constants in a source file?

一向 2020-12-09 05:33

I'm writing some unit tests which are going to verify our handling of various resources that use character sets other than the normal Latin alphabet: Cyrillic, Hebrew

3 Answers
  • 2020-12-09 06:14

    A tedious but portable way is to build your strings using numeric escape codes. For example:

    wchar_t *string = L"דונדארןמע";
    

    becomes:

    wchar_t *string = L"\x05d3\x05d5\x05e0\x05d3\x05d0\x05e8\x05df\x05de\x05e2";
    

    You have to convert all your Unicode characters to numeric escapes. That way your source code becomes encoding-independent.

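    You can use an online tool for the conversion. If it outputs the JavaScript escape format \uXXXX, search and replace \u with \x to get the C hex-escape format. Note that a \x escape consumes every hex digit that follows it, so this is safe only when each escape is followed by another escape or by a character that is not a hex digit.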
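    The hex-escape pitfall is easier to see in code. The sketch below writes the same two Hebrew letters three ways; it assumes a compiler whose wchar_t is at least 16 bits wide, and every name in it is made up for illustration. C99 and C++ also accept universal character names directly in literals, which sidesteps the problem because they have a fixed number of digits.

    ```cpp
    #include <cassert>
    #include <cwchar>

    // Hebrew letters dalet (U+05D3) and vav (U+05D5). Names are illustrative.

    // 1. Hex escapes. A hex escape swallows every hex digit after it, so this
    //    works only because each escape is followed by a backslash or the end.
    const wchar_t *hex_form = L"\x05d3\x05d5";

    // 2. Hex escape followed by text that starts with a hex digit ('a').
    //    Splitting the literal ends the escape; adjacent literals are joined.
    const wchar_t *split_form = L"\x05d3" L"abc";

    // 3. Universal character names take exactly four hex digits, so the
    //    following "abc" is ordinary text.
    const wchar_t *ucn_form = L"\u05d3\u05d5";
    const wchar_t *ucn_text = L"\u05d3abc";

    // Checks that the three spellings produce the same wide strings.
    void check_forms() {
        assert(std::wcscmp(hex_form, ucn_form) == 0);
        assert(std::wcscmp(split_form, ucn_text) == 0);
    }
    ```

    Calling check_forms() from a test should pass on compilers that use UTF-16 or UTF-32 for wide literals. The universal character name form is usually the least error-prone of the three.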

  • 2020-12-09 06:22

    You have to tell GCC which encoding your source file uses for those characters.

    Use the option -finput-charset=charset, for example -finput-charset=UTF-8. Then you need to tell it which encoding to use for those string literals at runtime, which determines the values of the wchar_t elements in the strings. You set that encoding with -fwide-exec-charset=charset, for example -fwide-exec-charset=UTF-32. Beware that the code unit size of the encoding (UTF-32 needs 32 bits, UTF-16 needs 16 bits) must not exceed the size of the wchar_t that GCC uses.

    You can adjust the size of wchar_t with the option -fshort-wchar. With it, wchar_t will most likely be 16 bits instead of 32 bits, which is its usual width for GCC on Linux. That option is mainly useful for compiling programs for Wine, which are designed to be compatible with Windows.

    Those options are described in more detail in man gcc, the gcc manpage.
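    To see what the compiler actually produced for a wide literal, a small check like the one below can help. It is only a sketch: the file name unicode_test.cpp and the helper names are assumptions made for this example, and the literal uses a universal character name so the file itself stays plain ASCII.

    ```cpp
    // Hypothetical build command (the file name is an assumption):
    //   g++ -finput-charset=UTF-8 unicode_test.cpp
    // Adding -fshort-wchar should change what wchar_bits() reports.
    #include <cassert>
    #include <cstddef>

    // Dalet (U+05D3) and vav (U+05D5) as a wide string literal.
    const wchar_t sample[] = L"\u05d3\u05d5";

    // Number of wchar_t code units in the literal, excluding the terminator.
    std::size_t unit_count() { return sizeof(sample) / sizeof(sample[0]) - 1; }

    // Width of wchar_t in bits for this build.
    std::size_t wchar_bits() { return sizeof(wchar_t) * 8; }

    void check_wide_literal() {
        // Both letters are in the Basic Multilingual Plane, so each takes
        // one code unit in UTF-16 as well as in UTF-32.
        assert(unit_count() == 2);
        assert(sample[0] == 0x05d3);
        assert(wchar_bits() >= 16);
    }
    ```

    On a typical Linux GCC build wchar_bits() is expected to report 32, and 16 with -fshort-wchar, but that depends on the target, so the check only asserts a lower bound.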

  • 2020-12-09 06:32
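    #include <sstream>
    #include <string>
    #include <windows.h>  // for LPCSTR
    using namespace std;

    #define UNICODE_CONSTANT( CONSTANT ) towstring( CONSTANT )

    // Widens each narrow char individually; no encoding conversion happens.
    wstring towstring( LPCSTR lpszValue ) {
        wostringstream os;
        os << lpszValue;
        return os.str();
    }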
    

    This does not actually convert between Unicode encodings at all; that requires a dedicated routine. You need to keep your source code and data encodings unified (most people use UTF-8) and then convert to the OS-specific encoding if necessary (such as UTF-16 on Windows).
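    As a rough idea of what such a dedicated routine involves, here is a minimal sketch that decodes UTF-8 bytes into UTF-32 code points. The function name utf8_to_utf32 is made up for this example, and the code assumes well-formed input: it does not reject overlong forms, surrogates, or truncated sequences, so production code would more likely rely on a tested library or an OS API.

    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <string>

    // Decodes UTF-8 bytes into UTF-32 code points.
    // Sketch only: assumes well-formed input and performs no validation.
    std::u32string utf8_to_utf32(const std::string &in) {
        std::u32string out;
        std::size_t i = 0;
        while (i < in.size()) {
            unsigned char lead = static_cast<unsigned char>(in[i]);
            char32_t cp;
            std::size_t extra;  // number of continuation bytes that follow
            if (lead < 0x80)      { cp = lead;        extra = 0; }  // 0xxxxxxx
            else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }  // 110xxxxx
            else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }  // 1110xxxx
            else                  { cp = lead & 0x07; extra = 3; }  // 11110xxx
            ++i;
            // Each continuation byte contributes its low six bits.
            for (std::size_t k = 0; k < extra && i < in.size(); ++k, ++i) {
                cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
            }
            out.push_back(cp);
        }
        return out;
    }

    void check_utf8_decoding() {
        // The bytes D7 93 are the UTF-8 encoding of dalet (U+05D3).
        assert(utf8_to_utf32("\xd7\x93") == std::u32string(U"\u05d3"));
    }
    ```

    Converting on to UTF-16 for Windows would additionally need surrogate pairs for code points above U+FFFF, which this sketch leaves out.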
