WChars, Encodings, Standards and Portability

遇见更好的自我 2020-11-22 09:11

The following may not qualify as a SO question; if it is out of bounds, please feel free to tell me to go away. The question here is basically, "Do I understand the C stand

4 Answers
  • 2020-11-22 09:42

    The problem with wchar_t is that encoding-agnostic text processing is too difficult and should be avoided. If you stick with "pure C" as you say, you can use all of the w* functions like wcscat and friends, but if you want to do anything more sophisticated then you have to dive into the abyss.

    Here are some things that are much harder with wchar_t than they are if you just pick one of the UTF encodings:

    • Parsing JavaScript: Identifiers can contain certain characters outside the BMP (and let's assume that you care about this kind of correctness).

    • HTML: How do you turn the numeric character reference &#65536; (U+10000, 𐀀) into a string of wchar_t?

    • Text editor: How do you find grapheme cluster boundaries in a wchar_t string?

    If I know the encoding of a string, I can examine the characters directly. If I don't know the encoding, I have to hope that whatever I want to do with a string is implemented by a library function somewhere. So the portability of wchar_t is somewhat irrelevant as I don't consider it an especially useful data type.
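
    To make the HTML case concrete, here is a minimal sketch of what "knowing the encoding" buys you, assuming the target is UTF-16. The name encode_utf16 is hypothetical (it is not a standard function), and the input is assumed to be a valid code point with no error handling:

```cpp
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-16 code units.
// Assumes cp is a valid scalar value (at most 0x10FFFF, not a surrogate).
inline std::u16string encode_utf16(char32_t cp)
{
    std::u16string out;
    if (cp < 0x10000) {
        // Code points inside the BMP fit in a single code unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Everything above the BMP becomes a surrogate pair.
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));   // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF))); // low surrogate
    }
    return out;
}
```

    With a fixed encoding, code point 65536 is simply the two code units 0xD800 0xDC00. With wchar_t the right answer is one element on some platforms and two on others.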

    Your program requirements may differ and wchar_t may work fine for you.

  • 2020-11-22 09:47

    Is this the right way to write an idiomatic, portable, universal, encoding-agnostic program core using only pure standard C/C++

    No, and there is no way at all to fulfill all these properties, at least if you want your program to run on Windows. On Windows, you have to ignore the C and C++ standards almost everywhere and work exclusively with wchar_t (not necessarily internally, but at all interfaces to the system). For example, if you start with

    int main(int argc, char** argv)
    

    you have already lost Unicode support for command line arguments. You have to write

    int wmain(int argc, wchar_t** argv)
    

    instead, or use the GetCommandLineW function, neither of which is specified in the C standard.

    More specifically,

    • Any Unicode-capable program on Windows must actively ignore the C and C++ standards for things like command line arguments, file and console I/O, or file and directory manipulation. This is certainly not idiomatic. Use the Microsoft extensions or wrappers like Boost.Filesystem or Qt instead.
    • Portability is extremely hard to achieve, especially for Unicode support. You really have to be prepared for everything you think you know to be wrong. For example, you have to consider that the filenames you use to open files can be different from the filenames that are actually used, and that two seemingly different filenames may represent the same file. After you create two files a and b, you might end up with a single file c, or two files d and e, whose filenames are different from the filenames you passed to the OS. Either you need an external wrapper library or lots of #ifdefs.
    • Encoding agnosticism usually just doesn't work in practice, especially if you want to be portable. You have to know that wchar_t is a UTF-16 code unit on Windows and that char is often (but not always) a UTF-8 code unit on Linux. Encoding-awareness is often the more desirable goal: make sure that you always know which encoding you are working with, or use a wrapper library that abstracts it away.
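
    As a small illustration of encoding-awareness, the sketch below counts code points in strings whose encoding is known. The function names are hypothetical, and the input is assumed to be well-formed (no validation is attempted):

```cpp
#include <cstddef>
#include <string>

// Count code points in a string that is known to be UTF-8.
// Assumes well-formed input: continuation bytes have the form 10xxxxxx
// and are skipped, so only the lead byte of each sequence is counted.
inline std::size_t count_code_points_utf8(const std::string& s)
{
    std::size_t n = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80)
            ++n;
    return n;
}

// Count code points in a string that is known to be UTF-16.
// Assumes well-formed input: low surrogates (0xDC00..0xDFFF) are skipped,
// so a surrogate pair counts once.
inline std::size_t count_code_points_utf16(const std::u16string& s)
{
    std::size_t n = 0;
    for (char16_t c : s)
        if (c < 0xDC00 || c > 0xDFFF)
            ++n;
    return n;
}
```

    Neither function can be written against a string of unknown encoding, which is the practical argument for always knowing what you hold.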

    I think I have to conclude that it's completely impossible to build a portable Unicode-capable application in C or C++ unless you are willing to use additional libraries and system-specific extensions, and to put lots of effort into it. Unfortunately, most applications already fail at comparatively simple tasks such as "writing Greek characters to the console" or "supporting any filename allowed by the system in a correct manner", and such tasks are only the first tiny steps towards true Unicode support.

  • 2020-11-22 09:58

    I would avoid the wchar_t type because it's platform-dependent (not "serializable" by your definition): UTF-16 on Windows and UTF-32 on most Unix-like systems. Instead, use the char16_t and/or char32_t types from C++11/C11 (still called C++0x/C1x when this was written). (If you don't have a new compiler, typedef them as uint16_t and uint32_t for now.)

    DO define functions to convert between UTF-8, UTF-16, and UTF-32.

    DON'T write overloaded narrow/wide versions of every string function like the Windows API did with -A and -W. Pick one preferred encoding to use internally, and stick to it. For things that need a different encoding, convert as necessary.
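
    One such conversion function might look like the sketch below, assuming UTF-32 (std::u32string) is the internal encoding and UTF-8 is needed at a boundary. The name utf32_to_utf8 is hypothetical, and the input is assumed to contain only valid Unicode scalar values, so there is no error handling:

```cpp
#include <string>

// Hypothetical conversion helper: UTF-32 code points -> UTF-8 bytes.
// Assumes every element is a valid Unicode scalar value.
inline std::string utf32_to_utf8(const std::u32string& in)
{
    std::string out;
    for (char32_t cp : in) {
        if (cp < 0x80) {
            // 1 byte: 0xxxxxxx
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            // 2 bytes: 110xxxxx 10xxxxxx
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

    The rest of the program then works on one string type only, and a call such as utf32_to_utf8(text) appears just where a UTF-8 consumer is reached.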

  • 2020-11-22 10:05

    Given that iconv is not "pure standard C/C++", I don't think you are satisfying your own specifications.

    New codecvt facets are arriving along with char32_t and char16_t, so I don't see how you can go wrong as long as you are consistent and pick one char type plus encoding, provided the facets are available.

    The facets are described in 22.5 [locale.stdcvt] (from n3242).


    I don't understand how this doesn't satisfy at least some of your requirements:

    // Note: <codecvt> and std::wstring_convert are deprecated since C++17.
    #include <codecvt>
    #include <locale>
    #include <string>

    namespace ns {

    typedef char32_t char_t;
    typedef std::u32string string;

    // char32_t literals take the U prefix; or use a user-defined literal
    #define LIT(x) U##x

    // Communicate with interface0, which wants UTF-8

    // This type doesn't need to be public at all; I just refactored it.
    typedef std::wstring_convert<std::codecvt_utf8<char_t>, char_t> converter0;

    inline std::string
    to_interface0(string const& s)
    {
        return converter0().to_bytes(s);
    }

    inline string
    from_interface0(std::string const& s)
    {
        return converter0().from_bytes(s);
    }

    // Communicate with interface1, which wants UTF-16

    // Doesn't have to be public either. wstring_convert always produces a
    // byte string, so the UTF-16 data travels as std::string
    // (big-endian unless std::little_endian is requested).
    typedef std::wstring_convert<std::codecvt_utf16<char_t>, char_t> converter1;

    inline std::string
    to_interface1(string const& s)
    {
        return converter1().to_bytes(s);
    }

    inline string
    from_interface1(std::string const& s)
    {
        return converter1().from_bytes(s);
    }

    } // ns
    

    Then your code can use ns::string, ns::char_t, LIT('A') and LIT("Hello, World!") with reckless abandon, without knowing the underlying representation. Call from_interfaceX(some_string) or to_interfaceX(some_string) whenever a conversion is needed. None of this affects the global locale or streams. The helpers can be as clever as needed, e.g. codecvt_utf8 can deal with 'headers', which I assume is Standardese for tricky stuff like the BOM (ditto codecvt_utf16).

    In fact I wrote the above to be as short as possible but you'd really want helpers like this:

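    #include <utility>   // std::forward

    namespace ns {

    // Replaces the non-template from_interface0 above
    template<typename... T>
    inline string
    from_interface0(T&&... t)
    {
        return converter0().from_bytes(std::forward<T>(t)...);
    }

    } // ns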
    

    which gives you access to all the overloads of the [from|to]_bytes members, accepting things like a const char* or a pointer range.
