Python - read text file with weird utf-16 format


I'm trying to read a text file into Python, but it seems to use some very strange encoding. I try the usual:

file = open('data.txt','r')

lines = file.


                      


      
      
        
          4 Answers

        
                         				            
            
           
            
                              
                
              
              
                
                  无人共我        
                
              
                            
                2021-01-17 17:22
              
            
            
                                                                       
Looks like UTF-16 to me.

>>> test_utf16 = '0\x00.\x000\x002\x000\x000\x001\x009\x007\x00'
>>> test_utf16.decode('utf-16')
u'0.0200197'


You can work directly off the Unicode strings:

>>> float(test_utf16)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: null byte in argument for float()
>>> float(test_utf16.decode('utf-16'))
0.020019700000000001


Or encode them to something different, if you prefer:

>>> float(test_utf16.decode('utf-16').encode('ascii'))
0.020019700000000001


Note that you need to do this as early as possible in your processing. As your comment noted, split will behave incorrectly on the UTF-16 encoded form. The UTF-16 (little-endian) representation of the space character ' ' is ' \x00', so split removes the whitespace but leaves the null byte.
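
To make the split problem concrete, here is a minimal Python 3 sketch. It assumes the data is UTF-16-LE, and the sample value "0.02 0.03" is made up for illustration; it is not taken from the asker's file.

```python
# Raw bytes as they would come from a UTF-16-LE file read in binary mode.
raw = "0.02 0.03".encode("utf-16-le")

# Splitting the undecoded bytes removes the space byte but leaves its
# trailing null byte attached to the front of the next field.
broken = raw.split()

# Decoding first gives clean text fields that float() accepts.
fields = raw.decode("utf-16-le").split()
values = [float(field) for field in fields]
```

The second element of `broken` starts with a stray null byte, which is why decoding has to happen before any splitting or parsing.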

The io library (Python 2.6 and later) can handle this for you, as can the older codecs library. io handles linefeeds better, so it's preferable if available.
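
A rough sketch of the io approach in Python 3, assuming the file is UTF-16-LE with whitespace-separated numbers. The sample contents and the temporary file are hypothetical stand-ins for the asker's data.txt, created here only so the example runs on its own.

```python
import io
import os
import tempfile

# Hypothetical sample data standing in for the real file.
sample = "0.0200197 1.5\n0.0300000 2.5\n"

# Write a UTF-16-LE file so the example is self-contained.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(path, "w", encoding="utf-16-le") as f:
    f.write(sample)

# Open in text mode with an explicit encoding; each line arrives decoded,
# so split() and float() behave as they would on ordinary text.
with io.open(path, "r", encoding="utf-16-le") as f:
    rows = [[float(x) for x in line.split()] for line in f]

os.remove(path)
```

In Python 3 the built-in open is the same function as io.open, so `open(path, encoding="utf-16-le")` should behave identically.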
                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  刺人心        
                
              
                            
                2021-01-17 17:33
              
            
            
                                                                       
I'm willing to bet this is a UTF-16-LE file, and you're reading it as whatever your default encoding is.

In UTF-16, each character takes two bytes.* If your characters are all ASCII, this means the UTF-16 encoding looks like the ASCII encoding with an extra '\x00' after each character.
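
A small Python 3 sketch of that byte layout, using a made-up ASCII sample string:

```python
text = "0.02"  # sample ASCII text, chosen for illustration

ascii_bytes = text.encode("ascii")
utf16_bytes = text.encode("utf-16-le")

# In little-endian UTF-16, the even-indexed bytes are the ASCII bytes
# and the odd-indexed bytes are all zero.
low_bytes = utf16_bytes[0::2]
high_bytes = utf16_bytes[1::2]
```

Here `utf16_bytes` is `b'0\x00.\x000\x002\x00'`, which matches the pattern of interleaved null bytes described above.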

To fix this, just decode the data:

print(line.decode('utf-16-le').split())


Or do the same thing at the file level with the io or codecs module:

import io
file = io.open('data.txt','r', encoding='utf-16-le')




* This is a bit of an oversimplification: Each BMP character takes two bytes; each non-BMP character is turned into a surrogate pair, with each of the two surrogates taking two bytes. But you probably didn't care about these details.
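
For anyone who does care about those details, a short Python 3 sketch of the size difference. The two sample characters are arbitrary picks, one inside the BMP and one outside it.

```python
bmp_char = "A"               # U+0041, inside the Basic Multilingual Plane
non_bmp_char = "\U0001F600"  # U+1F600, outside the BMP

# A BMP character is one 16-bit code unit; a non-BMP character becomes
# a surrogate pair, i.e. two 16-bit code units.
bmp_len = len(bmp_char.encode("utf-16-le"))
non_bmp_len = len(non_bmp_char.encode("utf-16-le"))
surrogate_bytes = non_bmp_char.encode("utf-16-le")
```

`bmp_len` is 2 and `non_bmp_len` is 4; `surrogate_bytes` holds the surrogates U+D83D and U+DE00 in little-endian byte order.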
                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  长发绾君心        
                
              
                            
                2021-01-17 17:41
              
            
            
                                                                       
This piece of code does what is needed:

# Read the first line as raw bytes.
file_handle = open(file_name, 'rb')
file_first_line = file_handle.readline()
file_handle.close()
print(file_first_line)
# Strip the null bytes that UTF-16 puts after each ASCII character.
if b'\x00' in file_first_line:
    file_first_line = file_first_line.replace(b'\x00', b'')
    print(file_first_line)


If you call file_first_line.split() before the replacement, the output still contains '\x00'. I simply replaced '\x00' with an empty string, and it worked.
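
One caveat worth sketching: stripping null bytes is only safe when every character in the file is ASCII. The Python 3 snippet below illustrates this; `strip_nulls` is a hypothetical helper name and the sample strings are invented for the demonstration.

```python
def strip_nulls(raw_line):
    """Drop null bytes from a bytes line (an ASCII-only shortcut)."""
    return raw_line.replace(b"\x00", b"")

# With pure ASCII content the shortcut gives usable fields.
ascii_line = "0.02 0.03".encode("utf-16-le")
cleaned = strip_nulls(ascii_line).split()

# U+0100 encodes as b"\x00\x01" in UTF-16-LE, so stripping nulls
# corrupts it, while decoding keeps the character intact.
non_ascii = "\u0100".encode("utf-16-le")
stripped = strip_nulls(non_ascii)
decoded = non_ascii.decode("utf-16-le")
```

If the file might ever contain anything beyond ASCII, decoding with the right codec is probably the safer choice.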
                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  予麋鹿        
                
              
                            
                2021-01-17 17:45
              
            
            
                                                                       
This is really just @abarnert's suggestion, but I wanted to post it as an answer since this is the simplest solution and the one that I ended up using: 

    import io
    import numpy as np
    file = io.open(filename,'r',encoding='utf-16-le')
    data = np.loadtxt(file,skiprows=8)


This demonstrates how you can create a file object with io.open, using whatever crazy encoding your file happens to have, and then pass that file object to np.loadtxt (or np.genfromtxt) for quick-and-easy loading.
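
A self-contained Python 3 sketch of the same idea, assuming NumPy is installed. The file contents and the single header line are invented for the example, so skiprows is 1 here rather than the 8 used above.

```python
import os
import tempfile

import numpy as np

# Hypothetical file: one header line followed by two numeric rows.
content = "header line\n0.0200197 1.5\n0.0300000 2.5\n"

fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with open(path, "w", encoding="utf-16-le") as f:
    f.write(content)

# Pass the decoded text-mode file object straight to np.loadtxt.
with open(path, "r", encoding="utf-16-le") as f:
    data = np.loadtxt(f, skiprows=1)

os.remove(path)
```

`data` ends up as a 2x2 float array, with no manual decoding or null-byte handling needed.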
                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         