Honestly, I just don't get the following design decision in the C++ Standard library. When writing wide characters to a file, the wofstream converts wchar_t into char characters:
Check this out: Class basic_filebuf
You can alter the default behavior by setting a wide char buffer, using pubsetbuf. Once you have done that, the output will be wchar_t and not char.
In other words, for your example you will have:
wofstream file(L"Test.txt", ios_base::binary); //binary is important to set!
wchar_t buffer[128];
file.rdbuf()->pubsetbuf(buffer, 128);
file.put(0xFEFF); // this is the BOM; UTF-16 needs it, but Microsoft's UNICODE doesn't, so you can skip this line if you like
file << someString; // the output file will consist of Unicode characters! Without the call to pubsetbuf, the output file will be ANSI (converted using the current regional settings)
A very partial answer for the first question: a file is a sequence of bytes, so when dealing with wchar_t's, at least some conversion between wchar_t and char must occur. Making this conversion "intelligently" requires knowledge of the character encodings, so this is why the conversion is allowed to be locale-dependent, by virtue of using a facet in the stream's locale.
Then, the question is how that conversion should be made in the only locale required by the standard: the "classic" one. There is no "right" answer for that, and the standard is thus very vague about it. I understand from your question that you assume that blindly casting (or memcpy()-ing) between wchar_t[] and char[] would have been a good way. This is not unreasonable, and is in fact what is (or at least was) done in some implementations.
Another POV would be that, since a codecvt is a locale facet, it is reasonable to expect the conversion to be made using the "locale's encoding" (I'm being handwavy here, as the concept is pretty fuzzy). For example, one would expect a Turkish locale to use ISO-8859-9, or a Japanese one to use Shift JIS. By similarity, the "classic" locale would convert to this "locale's encoding". Apparently, Microsoft chose to simply trim (which leads to ISO-8859-1 if we assume that wchar_t represents UTF-16 and that we stay in the basic multilingual plane), while the Linux implementation I know about decided to stick to ASCII.
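To make the locale dependence concrete, here is a minimal sketch (the file and locale names are my own, and "en_US.utf8" assumes a glibc-style system with that locale installed) showing that the bytes reaching the file are whatever the imbued locale's codecvt facet produces:

#include <fstream>
#include <locale>

int main()
{
    std::wofstream out("demo.txt");
    out.imbue(std::locale("en_US.utf8")); // imbue before any output is written
    out << L"caf\u00e9\n";                // U+00E9 comes out as the two UTF-8 bytes C3 A9
}

Swap the locale name for "C" (or leave the stream with the classic locale) and the same wchar_t may be narrowed differently, or the write may simply fail, depending on the implementation.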
For your second question:
Also, are we gonna get real unicode streams with C++0x or am I missing something here?
In the [locale.codecvt] section of n2857 (the latest C++0x draft I have at hand), one can read:
The specialization codecvt<char16_t, char, mbstate_t> converts between the UTF-16 and UTF-8 encoding schemes, and the specialization codecvt<char32_t, char, mbstate_t> converts between the UTF-32 and UTF-8 encoding schemes. codecvt<wchar_t, char, mbstate_t> converts between the native character sets for narrow and wide characters.
In the [locale.stdcvt] section, we find:
For the facet codecvt_utf8: the facet shall convert between UTF-8 multibyte sequences and UCS2 or UCS4 (depending on the size of Elem) within the program. [...]
For the facet codecvt_utf16: the facet shall convert between UTF-16 multibyte sequences and UCS2 or UCS4 (depending on the size of Elem) within the program. [...]
For the facet codecvt_utf8_utf16: the facet shall convert between UTF-8 multibyte sequences and UTF-16 (one or two 16-bit codes) within the program.
So I guess that this means "yes", but you'd have to be more precise about what you mean by "real unicode streams" to be sure.
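As a concrete illustration, here is a minimal sketch of how those [locale.stdcvt] facets ended up being usable, via the std::wstring_convert helper (names as they shipped in C++11; note the whole <codecvt> header was later deprecated in C++17):

#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

int main()
{
    // codecvt_utf8_utf16<char16_t> converts between UTF-8 and UTF-16,
    // exactly as the [locale.stdcvt] wording quoted above describes.
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;

    std::u16string utf16 = u"\u00ff";        // U+00FF: one UTF-16 code unit
    std::string utf8 = conv.to_bytes(utf16); // narrows to two UTF-8 bytes

    std::cout << utf8.size() << " bytes\n";  // prints "2 bytes" (C3 BF)
}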
I don't know about wofstream. But C++0x will include new distinct character types (char16_t, char32_t) of guaranteed width and signedness (unsigned) which can be used portably for UTF-16 and UTF-32, with plain char continuing to serve for UTF-8. In addition, there will be new string literals (u"Hello!" for a UTF-16 encoded string literal, for example).
Check out the most recent C++0x draft (N2960).
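A short sketch of what that looks like, as I read the draft (this compiles as C++11):

#include <string>

int main()
{
    const char16_t* s16 = u"Hello!";  // UTF-16 string literal
    const char32_t* s32 = U"Hello!";  // UTF-32 string literal
    const char*     s8  = u8"Hello!"; // UTF-8 string literal, plain char in C++11

    std::u16string t16 = u"Hello!";   // std::basic_string<char16_t>
    std::u32string t32 = U"Hello!";   // std::basic_string<char32_t>

    (void)s16; (void)s32; (void)s8; (void)t16; (void)t32; // silence unused warnings
}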
The model used by C++ for charsets is inherited from C, and so dates back to at least 1989.
Two main points: all I/O is done in terms of char (bytes) at the lowest level, and which wchar_t-to-char conversion is applied is determined by the codecvt facet of the stream's locale.
So to get anything, you have to set the locale.
If I use the simple program
#include <locale>
#include <fstream>
#include <ostream>
#include <iostream>

int main()
{
    wchar_t c = 0x00FF;
    std::locale::global(std::locale("")); // pick up the environment's locale
    std::wofstream os("test.dat");        // opened after the global locale is set
    os << c << std::endl;
    if (!os) {
        std::cout << "Output failed\n";
    }
}
which uses the environment's locale and outputs the wide character with code 0x00FF to a file. If I ask it to use the "C" locale, I get
$ env LC_ALL=C ./a.out
Output failed
the locale has been unable to handle the wide character, and we are notified of the problem because the I/O failed. If I ask for a UTF-8 locale, I get
$ env LC_ALL=en_US.utf8 ./a.out
$ od -t x1 test.dat
0000000 c3 bf 0a
0000003
(od -t x1 just dumps the file in hex), which is exactly what I expect for a UTF-8 encoded file: U+00FF encodes as the two bytes C3 BF, followed by the newline 0A from std::endl.
For your first question, this is my guess.
The IOStreams library was constructed under a couple of premises regarding encodings. For converting between Unicode and other not-so-usual encodings, for example, it's assumed that inside your program you work with a fixed-width wide-character encoding, while only external storage and devices use (possibly multibyte, variable-width) narrow encodings.
I believe that is the reason for the existence of the two template specializations of std::codecvt. One that maps between char types (maybe you're simply working with ASCII) and another that maps between wchar_t (internal to your program) and char (external devices). So whenever you need to perform a conversion to a multibyte encoding you should do it byte-by-byte. Notice that you can write a facet that handles encoding state when you read/write each byte from/to the multibyte encoding.
Thinking this way, the behavior of the C++ standard is understandable. After all, you're using wide-character strings holding ASCII-encoded text (assuming this is the default on your platform and you did not switch locales). The "natural" conversion would be to convert each wide ASCII character to an ordinary (in this case, one-char) ASCII character. (The conversion exists and is straightforward.)
By the way, I'm not sure if you know, but you can avoid this by creating a facet that returns noconv for the conversions. Then your file would keep its wide characters.
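Here is a minimal sketch of that trick (class name is mine; whether a given implementation really writes the wide characters verbatim for an always-noconv facet is worth verifying on your platform):

#include <fstream>
#include <locale>

// A codecvt facet that reports "no conversion needed". The filebuf may then
// copy the wchar_t units to the file as raw bytes instead of narrowing them.
class NoConversion : public std::codecvt<wchar_t, char, std::mbstate_t> {
protected:
    bool do_always_noconv() const throw() { return true; }
};

int main()
{
    std::wofstream out("wide.dat", std::ios_base::binary);
    out.imbue(std::locale(out.getloc(), new NoConversion)); // the locale owns the facet
    out << L"Hello"; // the in-memory wchar_t representation lands in the file
}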