Fastest way to convert a file from latin1 to utf-8 in Python

一生所求 2021-02-10 04:17

I need the fastest way to convert files from latin1 to utf-8 in Python. The files are large (about 2 GB each); I am moving DB data. So far I have

import codecs
infile = codec         


        
3 answers
  • 2021-02-10 04:37

    You could use blocks larger than one line, and do binary I/O; each might speed things up a bit (though on Linux binary I/O won't, as it's identical to text I/O):

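     BLOCKSIZE = 1024*1024  # bytes per read; tune for your system
     # Write to a separate output file (tmpfile1): opening tmpfile itself
     # with 'wb' would truncate the input before it is read.
     with open(tmpfile, 'rb') as inf:
       with open(tmpfile1, 'wb') as ouf:
         while True:
           data = inf.read(BLOCKSIZE)
           if not data: break
           converted = data.decode('latin1').encode('utf-8')
           ouf.write(converted)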
    

    The byte-by-byte parsing implied by line-by-line reading, line-end conversion (not on Linux ;-), and codecs.open-style encoding and decoding are probably part of what's slowing you down. This approach is also portable (as yours is), since control characters such as \n need no translation between these codecs on any OS.

    This only works for input codecs that have no multibyte characters, because a fixed-size block boundary could otherwise cut a character in two; 'latin1' is one of those codecs (it does not matter whether the output codec has multibyte characters or not).
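
To illustrate the condition above: if the input codec did have multibyte characters, one way to keep block-wise conversion safe is an incremental decoder, which holds back an incomplete character until the next block arrives. This is a minimal sketch, assuming Python 3; the helper name `convert_blocks` and its parameters are made up for illustration and are not part of the original answer.

```python
import codecs

def convert_blocks(src_path, dst_path, src_encoding, blocksize=1024 * 1024):
    """Re-encode src_path as UTF-8 into dst_path, block by block (hypothetical helper)."""
    # The incremental decoder buffers a trailing partial character and
    # completes it with the next block, so block boundaries are safe.
    decoder = codecs.getincrementaldecoder(src_encoding)()
    with open(src_path, "rb") as inf, open(dst_path, "wb") as ouf:
        while True:
            data = inf.read(blocksize)
            if not data:
                break
            ouf.write(decoder.decode(data).encode("utf-8"))
        # final=True raises an error if the file ends mid-character
        ouf.write(decoder.decode(b"", final=True).encode("utf-8"))
```

For latin1 input this extra machinery is unnecessary, since every byte is a complete character; the plain `decode`/`encode` loop is simpler and likely faster.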

    Try different block sizes to find the sweet spot performance-wise, depending on your disk, filesystem and available RAM.
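
One rough way to look for that sweet spot is to time the same conversion at a few block sizes. The sketch below assumes Python 3; the function names and the candidate sizes are arbitrary choices for illustration, not recommendations.

```python
import time

def time_conversion(src_path, dst_path, blocksize):
    """Return the seconds taken to convert src_path (latin1) to dst_path (utf-8)."""
    start = time.perf_counter()
    with open(src_path, "rb") as inf, open(dst_path, "wb") as ouf:
        while True:
            data = inf.read(blocksize)
            if not data:
                break
            ouf.write(data.decode("latin1").encode("utf-8"))
    return time.perf_counter() - start

def compare_blocksizes(src_path, dst_path,
                       sizes=(64 * 1024, 1024 * 1024, 16 * 1024 * 1024)):
    """Time each candidate block size once and return {blocksize: seconds}."""
    return {size: time_conversion(src_path, dst_path, size) for size in sizes}
```

A single run per size is noisy, and the OS file cache will favour whichever size runs after the file has already been read once, so treat the numbers as a hint and repeat the measurement on a representative sample of the real data.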

    Edit: changed code per @John's comment, and clarified a condition as per @gnibbler's.

  • 2021-02-10 04:47

    I would go with iconv and a system call.
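
If the call has to be made from Python, something along these lines might do. It is only a sketch, assuming Python 3 and an `iconv` executable on the PATH; the helper names are invented for illustration. Output is captured from iconv's standard output rather than passed via an output-file flag, since not every iconv build accepts one.

```python
import subprocess

def iconv_command(src_path):
    """Build the iconv argument list; iconv writes the result to stdout."""
    return ["iconv", "-f", "ISO-8859-1", "-t", "UTF-8", src_path]

def convert_with_iconv(src_path, dst_path):
    """Run iconv as a child process, redirecting its stdout into dst_path."""
    with open(dst_path, "wb") as out:
        # check=True raises CalledProcessError if iconv exits non-zero
        subprocess.run(iconv_command(src_path), stdout=out, check=True)
```

The data never passes through the Python process here; Python only starts iconv and waits for it to finish.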

  • 2021-02-10 04:48

    If you are desperate to do it in Python (or any other language), at least do the I/O in bigger chunks than lines, and avoid the codecs overhead.

    BLOCKSIZE = 65536 # experiment with size
    # 'with' closes both files even if an error occurs mid-copy
    with open(tmpfile, 'rb') as infile, open(tmpfile1, 'wb') as outfile:
        while True:
            block = infile.read(BLOCKSIZE)
            if not block: break
            outfile.write(block.decode('latin1').encode('utf8'))
    

    Otherwise, go with iconv ... I haven't looked under the hood, but I'd be surprised if it doesn't special-case latin1 input :-)
