How can I decode UTF-16 data in Perl when I don't know the byte order?

后端未结

关注

 3  1712

If I open a file ( and specify an encoding directly ) :

open(my $file,\"<:encoding(UTF-16)\",\"some.file\") || die \"error $!\\n\";
while(<$file>) {


                      
              相关标签:


      
      
        
          3条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  夕颜        
                
              
                            
                2020-12-30 13:54
              
            
            
                                                                       
What you're trying to do impossible.

You're reading lines of text without specifying an encoding, so every byte that contains a newline character (default \x0a) ends a line.  But this newline character may very well be in the middle of an UTF-16 character, in which case your next line can't be decoded.
If your data is UTF-16LE, this will happen all the time – line feeds are \x0a \x00.  If you have UTF16-BE, you might get lucky (newlines are \x00 \x0a), until you get a character with \x0a in the high byte.

So, don't do that, open the file in the right encoding.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  迷失自我        
                
              
                            
                2020-12-30 14:05
              
            
            
                                                                       
If you simply specify "UTF-16", Perl is going to look for the byte-order mark (BOM) to figure out how to parse it. If there is no BOM, it's going to blow up. In that case, you have to tell Encode which byte-order you have by specifying either "UTF-16LE" for little-endian or "UTF-16BE" for big-endian.

There's something else going on with your situation though, but it's hard to tell without seeing the data you have in the file. I get the same error with both snippets. If I don't have a BOM and I don't specify a byte order, my Perl complains either way. Which Perl are you using and which platform do you have? Does your platform have the native endianness of your file? I think the behaviour I see is correct according to the docs.

Also, you can't simply read a line in some unknown encoding (whatever Perl's default is) then ship that off to decode. You might end up in the middle of a multi-byte sequence. You have to use Encode::FB_QUIET to save the part of the buffer that you couldn't decode and add that to the next chunk of data:

open my($lefh), '<:raw', 'text-utf16.txt';

my $string;
while( $string .= <$lefh> ) {
    print decode("UTF-16LE", $string, Encode::FB_QUIET) 
    }

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  被撕碎了的回忆        
                
              
                            
                2020-12-30 14:13
              
            
            
                                                                       
You need to specify either UTF-16BE or UTF-16LE. See http://perldoc.perl.org/Encode/Unicode.html#Size%2c-Endianness%2c-and-BOM
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复