Why does wide file-stream in C++ narrow written data by default?

别跟我提以往 2020-11-30 05:44

Honestly, I just don't get the following design decision in the C++ Standard Library. When writing wide characters to a file, the wofstream narrows the wchar_t characters down to char.

5 Answers
  • 2020-11-30 06:30

    Check this out: Class basic_filebuf

    You can alter the default behavior by setting a wide-char buffer, using pubsetbuf. Once you do that, the output will be wchar_t and not char.

    In other words, for your example you will have:

    wofstream file(L"Test.txt", ios_base::binary); // binary mode is important to set!
    wchar_t buffer[128];
    file.rdbuf()->pubsetbuf(buffer, 128);
    file.put(0xFEFF); // BOM; UTF-16 files need it, but Microsoft's UNICODE convention doesn't, so you may skip this line
    file << someString; // the output file now consists of wide characters; without the pubsetbuf call it would be ANSI (current regional settings)
    
  • 2020-11-30 06:40

    A very partial answer for the first question: A file is a sequence of bytes so, when dealing with wchar_t's, at least some conversion between wchar_t and char must occur. Making this conversion "intelligently" requires knowledge of the character encodings, so this is why this conversion is allowed to be locale-dependent, by virtue of using a facet in the stream's locale.

    Then, the question is how that conversion should be made in the only locale required by the standard: the "classic" one. There is no "right" answer for that, and the standard is thus very vague about it. I understand from your question that you assume that blindly casting (or memcpy()-ing) between wchar_t[] and char[] would have been a good way. This is not unreasonable, and is in fact what is (or at least was) done in some implementations.

    Another POV would be that, since a codecvt is a locale facet, it is reasonable to expect the conversion to be made using the "locale's encoding" (I'm being hand-wavy here, as the concept is pretty fuzzy). For example, one would expect a Turkish locale to use ISO-8859-9, or a Japanese one to use Shift JIS. By analogy, the "classic" locale would convert to this "locale's encoding". Apparently, Microsoft chose to simply trim (which amounts to ISO-8859-1 if we assume that wchar_t represents UTF-16 and that we stay in the Basic Multilingual Plane), while the Linux implementation I know about decided to stick to ASCII.

    For your second question:

    Also, are we gonna get real unicode streams with C++0x or am I missing something here?

    In the [locale.codecvt] section of n2857 (the latest C++0x draft I have at hand), one can read:

    The specialization codecvt<char16_t, char, mbstate_t> converts between the UTF-16 and UTF-8 encoding schemes, and the specialization codecvt<char32_t, char, mbstate_t> converts between the UTF-32 and UTF-8 encoding schemes. codecvt<wchar_t,char,mbstate_t> converts between the native character sets for narrow and wide characters.

    In the [locale.stdcvt] section, we find:

    For the facet codecvt_utf8: — The facet shall convert between UTF-8 multibyte sequences and UCS2 or UCS4 (depending on the size of Elem) within the program. [...]

    For the facet codecvt_utf16: — The facet shall convert between UTF-16 multibyte sequences and UCS2 or UCS4 (depending on the size of Elem) within the program. [...]

    For the facet codecvt_utf8_utf16: — The facet shall convert between UTF-8 multibyte sequences and UTF-16 (one or two 16-bit codes) within the program.

    So I guess that this means "yes", but you'd have to be more precise about what you mean by "real unicode streams" to be sure.
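    As a sketch of what those facets make possible, here is one way (in C++11, using std::codecvt_utf8 from <codecvt>, which was later deprecated in C++17) to get a UTF-8 encoded file out of a wide stream; the filename is just an example:

    ```cpp
    #include <codecvt>
    #include <fstream>
    #include <locale>

    int main()
    {
        std::wofstream out("utf8_test.txt");
        // Replace the stream's codecvt facet with one that converts the
        // internal wchar_t representation to UTF-8 on output.
        out.imbue(std::locale(out.getloc(), new std::codecvt_utf8<wchar_t>));
        out << wchar_t(0x00FF) << L'\n'; // written as the UTF-8 sequence C3 BF, then 0A
        out.close(); // flush before anyone inspects the file
    }
    ```

    Dumping the resulting file with od -t x1 shows the bytes c3 bf 0a, matching the UTF-8 result in the answer below.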

  • 2020-11-30 06:45

    I don't know about wofstream. But C++0x will include new distinct character types (char16_t, char32_t) of guaranteed width and signedness (unsigned) which can be used portably for UTF-8, UTF-16 and UTF-32. In addition, there will be new string literals (u"Hello!" for a UTF-16 encoded string literal, for example).

    Check out the most recent C++0x draft (N2960).
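    A minimal illustration of those types and literals, using the names as they ended up in C++11:

    ```cpp
    #include <cassert>
    #include <string>

    int main()
    {
        std::u16string s = u"Hello!";     // UTF-16 encoded string literal
        std::u32string t = U"\U0001F600"; // UTF-32: one code point, one element
        assert(s.size() == 6);
        assert(t.size() == 1); // a single char32_t holds any code point
    }
    ```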

  • 2020-11-30 06:46

    The model used by C++ for charsets is inherited from C, and so dates back to at least 1989.

    Two main points:

    • IO is done in terms of char.
    • It is the job of the locale to determine how wide chars are serialized:
      • the default locale (named "C") is very minimal (I don't remember the exact constraints from the standard; here it is able to handle only 7-bit ASCII as both the narrow and the wide character set);
      • there is an environment-determined locale named "".

    So to get anything, you have to set the locale.

    If I use the simple program

    #include <locale>
    #include <fstream>
    #include <ostream>
    #include <iostream>
    
    int main()
    {
        wchar_t c = 0x00FF;
        std::locale::global(std::locale(""));
        std::wofstream os("test.dat");
        os << c << std::endl;
        if (!os) {
            std::cout << "Output failed\n";
        }
    }
    

    which uses the environment locale and outputs the wide character with code 0x00FF to a file. If I ask it to use the "C" locale, I get

    $ env LC_ALL=C ./a.out
    Output failed
    

    The locale was unable to handle the wide character, and we are notified of the problem because the IO failed. If I ask for a UTF-8 locale instead, I get

    $ env LC_ALL=en_US.utf8 ./a.out
    $ od -t x1 test.dat
    0000000 c3 bf 0a
    0000003
    

    (od -t x1 just dumps the file in hex), which is exactly what I expect for a UTF-8 encoded file.

  • 2020-11-30 06:46

    For your first question, this is my guess.

    The IOStreams library was constructed under a couple of premises regarding encodings. For converting between Unicode and other, less common encodings, for example, it's assumed that:

    • Inside your program, you should use a (fixed-width) wide-character encoding.
    • Only external storage should use (variable-width) multibyte encodings.

    I believe that is the reason for the existence of the two template specializations of std::codecvt: one that maps between char types (maybe you're simply working with ASCII) and another that maps between wchar_t (internal to your program) and char (external devices). So whenever you need to perform a conversion to a multibyte encoding, you should do it byte by byte. Notice that you can write a facet that handles encoding state as you read/write each byte from/to the multibyte encoding.
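    Both specializations can in fact be observed in any locale; this small check (not part of the original answer) confirms they are always installed:

    ```cpp
    #include <cassert>
    #include <locale>

    int main()
    {
        std::locale loc("C");
        // Every locale must carry both standard codecvt specializations:
        // char <-> char (identity) and wchar_t <-> char (narrowing/widening).
        assert((std::has_facet<std::codecvt<char, char, std::mbstate_t>>(loc)));
        assert((std::has_facet<std::codecvt<wchar_t, char, std::mbstate_t>>(loc)));
    }
    ```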

    Thinking this way, the behavior of the C++ standard is understandable. After all, you're using ASCII-encoded wide-character strings (assuming this is the default on your platform and you did not switch locales). The "natural" conversion would be to convert each wide-character ASCII character to an ordinary (in this case, one char) ASCII character. (The conversion exists and is straightforward.)

    By the way, I'm not sure if you know, but you can avoid this by creating a facet that returns noconv for the conversions. Then your file would contain the wide characters as-is.
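    A rough sketch of such a facet (the name raw_codecvt is hypothetical, and whether the resulting byte-for-byte copy is actually useful depends on your implementation):

    ```cpp
    #include <cassert>
    #include <locale>

    // Hypothetical facet: reporting "no conversion" tells the stream that the
    // internal wchar_t sequence needs no translation when going to bytes.
    struct raw_codecvt : std::codecvt<wchar_t, char, std::mbstate_t>
    {
        bool do_always_noconv() const noexcept override { return true; }
    };

    int main()
    {
        raw_codecvt cvt;
        assert(cvt.always_noconv()); // the facet now opts out of conversion
    }
    ```

    It would be installed the same way as any other facet, e.g. via std::locale(loc, new raw_codecvt) and imbue on the wide stream.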
