I need the fastest way to convert files from latin1 to utf-8 in Python. The files are large, ~2 GB (I am moving DB data). So far I have:
import codecs
infile = codecs.open(tmpfile, 'r', 'latin1')
outfile = codecs.open(tmpfile1, 'w', 'utf-8')
# ... then copy line by line: for line in infile: outfile.write(line)
You could use blocks larger than one line, and do binary I/O -- each might speed things up a bit (though on Linux binary I/O won't, as it's identical to text I/O):
BLOCKSIZE = 1024*1024  # 1 MiB; tune to taste
with open(tmpfile, 'rb') as inf:
    with open(tmpfile1, 'wb') as ouf:
        while True:
            data = inf.read(BLOCKSIZE)
            if not data: break
            converted = data.decode('latin1').encode('utf-8')
            ouf.write(converted)
The byte-by-byte parsing implied by line-by-line reading, the line-end conversion (not on Linux ;-)), and the codecs.open-style encoding/decoding should all be part of what's slowing you down. This approach is also portable (like yours is), since control characters such as \n need no translation between these codecs anyway (on any OS).
This only works for input codecs that have no multibyte characters, but `latin1` is one of those (it does not matter whether the output codec has such characters or not).
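For a multibyte input codec (not your case, but for completeness), a block boundary can split a character in half; here is a minimal sketch using an incremental decoder, assuming a hypothetical UTF-16 source and the same placeholder file names:

import codecs

BLOCKSIZE = 1024*1024
# The incremental decoder buffers any partial character left at a block boundary.
decoder = codecs.getincrementaldecoder('utf-16')()
with open(tmpfile, 'rb') as inf:
    with open(tmpfile1, 'wb') as ouf:
        while True:
            data = inf.read(BLOCKSIZE)
            text = decoder.decode(data, final=not data)
            if text:
                ouf.write(text.encode('utf-8'))
            if not data:
                break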
Try different block sizes to find the sweet spot performance-wise, depending on your disk, filesystem and available RAM.
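If it helps, a rough way to compare block sizes (a sketch, not part of the solution above; the helper name and the sizes tried are just illustrative):

import time

def time_convert(blocksize, src=tmpfile, dst=tmpfile1):
    # Copy src -> dst in blocks, returning elapsed seconds.
    start = time.time()
    with open(src, 'rb') as inf:
        with open(dst, 'wb') as ouf:
            while True:
                data = inf.read(blocksize)
                if not data:
                    break
                ouf.write(data.decode('latin1').encode('utf-8'))
    return time.time() - start

for size in (64*1024, 1024*1024, 8*1024*1024):
    print("%d bytes: %.2f s" % (size, time_convert(size)))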
Edit: changed code per @John's comment, and clarified a condition as per @gnibbler's.
I would go with iconv and a system call.
If you are desperate to do it in Python (or any other language), at least do the I/O in bigger chunks than lines, and avoid the codecs overhead.
infile = open(tmpfile, 'rb')
outfile = open(tmpfile1, 'wb')
BLOCKSIZE = 65536  # experiment with size
while True:
    block = infile.read(BLOCKSIZE)
    if not block: break
    outfile.write(block.decode('latin1').encode('utf8'))
infile.close()
outfile.close()
Otherwise, go with iconv ... I haven't looked under the hood, but if it doesn't special-case latin1 input I'd be surprised :-)
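For reference, a minimal sketch of the iconv route driven from Python via subprocess (assumes iconv is installed and on PATH; the file names are the same placeholders as above):

import subprocess

# Let iconv do the conversion and write its stdout to the target file.
with open(tmpfile1, 'wb') as out:
    subprocess.check_call(['iconv', '-f', 'LATIN1', '-t', 'UTF-8', tmpfile],
                          stdout=out)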