Is it possible to use a Unicode “argv”?

南旧 2021-02-07 09:48

I'm writing a little wrapper for an application that uses files as arguments.

The wrapper needs to be in Unicode, so I'm using wchar_t for the characters and strings I

6 answers
  • 2021-02-07 10:07

    In general, no. It depends on the OS, but the C standard says that main() must be declared as 'int main(int argc, char **argv)' or equivalent (or in some other implementation-defined way), so unless char and wchar_t are the same basic type, you can't do it portably.

    Having said that, you could get UTF-8 argument strings into the program, convert them to UTF-16 or UTF-32, and then get on with life.

    On a Mac (10.5.8, Leopard), I got:

    Osiris JL: echo "ï€" | odx
    0x0000: C3 AF E2 82 AC 0A                                 ......
    0x0006:
    Osiris JL: 
    

    That's all UTF-8 encoded. (odx is a hex dump program).
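
    To check the bytes of an argument from inside a program, something like the following sketch could be used. The function name hex_bytes is made up for this example and has nothing to do with the odx tool above; it only prints each byte as two hex digits.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper (not part of odx): format every byte of s as two
// uppercase hex digits, separated by single spaces.
std::string hex_bytes(const std::string& s) {
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < s.size(); ++i) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        unsigned value = static_cast<unsigned char>(s[i]);
        std::snprintf(buf, sizeof buf, "%02X", value);
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}
```

    Calling hex_bytes(argv[1]) on a UTF-8 system should show the same kind of multi-byte sequences as the dump above for any non-ASCII argument.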

    See also: Why is it that UTF-8 encoding is used when interacting with a UNIX/Linux environment

  • 2021-02-07 10:09

    On Windows anyway, you can have a wmain() for UNICODE builds. Not portable though. I dunno if GCC or Unix/Linux platforms provide anything similar.

  • 2021-02-07 10:12

    Portable code doesn't support it. Windows (for example) supports using wmain instead of main, in which case argv is passed as wide characters.
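
    A minimal sketch of what that could look like, assuming the Microsoft compiler (other Windows toolchains may need extra flags for wmain). The helper collect_args is a made-up name for this example; only wmain itself is the Microsoft-specific part.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: copy the wide arguments, skipping argv[0]
// (the program name), so the rest of the wrapper can work on wstrings.
std::vector<std::wstring> collect_args(int argc, const wchar_t* const* argv) {
    std::vector<std::wstring> files;
    for (int i = 1; i < argc; ++i) files.push_back(argv[i]);
    return files;
}

#ifdef _MSC_VER
// Microsoft-specific entry point: argv arrives as wide strings.
int wmain(int argc, wchar_t* argv[]) {
    std::vector<std::wstring> files = collect_args(argc, argv);
    return files.empty() ? 1 : 0;  // placeholder: fail if no file was given
}
#endif
```

    Keeping the real work in an ordinary function means only the entry point differs per platform.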

  • 2021-02-07 10:13

    On Windows, you can use GetCommandLineW() and CommandLineToArgvW() to produce an argv-style array of wchar_t* strings, even if the app is not compiled for Unicode.
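
    Roughly, that could be wrapped like this. The function name wide_args is invented for the example, and the non-Windows branch simply returns an empty list so the file still compiles elsewhere. On Windows the program has to link against Shell32.lib.

```cpp
#include <string>
#include <vector>

#ifdef _WIN32
#include <windows.h>
#include <shellapi.h>  // CommandLineToArgvW
#endif

// Hypothetical helper: return the process command line as wide strings.
// On platforms other than Windows this sketch returns an empty vector.
std::vector<std::wstring> wide_args() {
    std::vector<std::wstring> args;
#ifdef _WIN32
    int argc = 0;
    LPWSTR* argv = CommandLineToArgvW(GetCommandLineW(), &argc);
    if (argv != NULL) {
        for (int i = 0; i < argc; ++i) args.push_back(argv[i]);
        LocalFree(argv);  // the whole array is one block, freed with LocalFree
    }
#endif
    return args;
}
```

    This works from a plain main(), which is handy when the entry point cannot be changed to wmain.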

  • 2021-02-07 10:14

    On Windows, you can use tchar.h and _tmain, which will be turned into wmain if the _UNICODE symbol is defined at compile time, or main otherwise. TCHAR *argv[] will similarly be expanded to WCHAR *argv[] if _UNICODE is defined, and char *argv[] if not.

    If you want your main function to work cross-platform, you can define your own macros to the same effect.

    tchar.h also contains a number of convenience macros, such as _T() for literals and _tcslen for string length, that map to either the wchar_t or the char variant.
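
    As a rough sketch of the "define your own macros" idea: the names xchar, XMAIN, XTEXT and xlen below are invented for this example and are not part of tchar.h or any other header.

```cpp
#include <cstddef>

// Hypothetical names, chosen for this sketch only.
#if defined(_WIN32) && defined(_UNICODE)
typedef wchar_t xchar;
#define XMAIN wmain
#define XTEXT(s) L##s
#else
typedef char xchar;
#define XMAIN main
#define XTEXT(s) s
#endif

// Length of a zero-terminated string in xchar units.
std::size_t xlen(const xchar* s) {
    std::size_t n = 0;
    while (s[n] != 0) ++n;
    return n;
}

// Intended use of the entry-point macro:
//   int XMAIN(int argc, xchar* argv[]) { /* ... */ }
```

    On Windows Unicode builds this expands to the wide variants, everywhere else to plain char, so the same source compiles on both.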

  • 2021-02-07 10:31

    Assuming that your Linux environment uses UTF-8 encoding then the following code will prepare your program for easy Unicode treatment in C++:

        #include <clocale>

        int main(int argc, char * argv[]) {
          std::setlocale(LC_CTYPE, ""); // adopt the locale from the environment
          // ...
        }
    

    Next, the wchar_t type is 32-bit on Linux, which means it can hold individual Unicode code points, so you can safely use the wstring type for classical string processing in C++ (character by character). With the setlocale call above, inserting into wcout will automatically translate your output into UTF-8, and extracting from wcin will automatically translate UTF-8 input into UTF-32 (1 character = 1 code point). The only problem that remains is that the argv[i] strings are still UTF-8 encoded.

    You can use the following function to decode UTF-8 into UTF-32. If the input string is corrupted, it returns the characters converted up to the place where the UTF-8 rules were broken. You could improve it if you need more error reporting, but for argv data in a UTF-8 environment it is usually reasonable to assume that the input is correct UTF-8:

        #include <cstddef>
        #include <string>

        // Decode a UTF-8 string into code points. Assumes a 32-bit wchar_t (as on Linux).
        // On malformed input, returns the characters decoded before the error.
        std::wstring Convert(const char * s) {
            typedef unsigned char byte;
            struct Level {
                byte Head, Data, Null;
                Level(byte h, byte d) {
                    Head = h; // the head shifted to the right
                    Data = d; // number of data bits
                    Null = static_cast<byte>(h << d); // encoded byte with zero data bits
                }
                bool encoded(byte b) const { return b >> Data == Head; }
            }; // struct Level
            const Level lev[] = {
                Level(2, 6),   // 10xxxxxx: trailing byte
                Level(6, 5),   // 110xxxxx: starts a 2-byte sequence
                Level(14, 4),  // 1110xxxx: starts a 3-byte sequence
                Level(30, 3),  // 11110xxx: starts a 4-byte sequence
                Level(62, 2),  // 111110xx: 5 bytes (obsolete form)
                Level(126, 1)  // 1111110x: 6 bytes (obsolete form)
            };
            const std::size_t levels = sizeof(lev) / sizeof(lev[0]);

            std::wstring result;
            const char * p = s;
            while (*p != 0) {
                byte b = static_cast<byte>(*p++);
                if (b >> 7 == 0) { // deal with ASCII
                    result.push_back(b);
                    continue;
                } // ASCII
                bool found = false;
                for (std::size_t i = 1; i < levels; ++i) {
                    if (lev[i].encoded(b)) {
                        wchar_t wc = b ^ lev[i].Null; // remove the head
                        wc <<= lev[0].Data * i;
                        for (std::size_t j = i; j > 0; --j) { // trailing bytes
                            if (*p == 0) return result; // unexpected end of input
                            b = static_cast<byte>(*p++);
                            if (!lev[0].encoded(b)) // encoding corrupted
                                return result;
                            wchar_t tmp = b ^ lev[0].Null;
                            wc |= tmp << lev[0].Data * (j - 1);
                        } // trailing bytes
                        result.push_back(wc);
                        found = true;
                        break;
                    } // lev[i]
                } // for lev
                if (!found) return result; // encoding incorrect
            } // while
            return result;
        } // Convert
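
    For comparison, the standard library can do a similar conversion through the current locale. This is only a sketch and not a replacement for the function above: the name widen is invented here, and it only decodes UTF-8 if setlocale has already selected a UTF-8 locale (in the default "C" locale, non-ASCII bytes may be rejected).

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: widen a multibyte string using the current C locale.
// Returns an empty string when the bytes are invalid in that locale, so an
// error cannot be told apart from empty input in this sketch.
std::wstring widen(const char* s) {
    // A string never has more wide characters than bytes.
    std::vector<wchar_t> buf(std::strlen(s) + 1);
    std::size_t n = std::mbstowcs(buf.data(), s, buf.size());
    if (n == static_cast<std::size_t>(-1)) return std::wstring();
    return std::wstring(buf.data(), n);
}
```

    The trade-off is that the result depends on the environment's locale, whereas the hand-written decoder always treats its input as UTF-8.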
    