I need the fastest way to convert files from latin1 to utf-8 in Python. The files are large, ~2 GB (I am moving DB data). So far I have:

import codecs
infile = codecs.open(source, "r", "latin1")
outfile = codecs.open(target, "w", "utf-8")
You could use blocks larger than one line, and do binary I/O -- each might speed things up a bit (though on Linux binary I/O won't, as it's identical to text I/O):
BLOCKSIZE = 1024*1024  # tune this; see the note below

with open(srcfile, 'rb') as inf:
    with open(dstfile, 'wb') as ouf:
        while True:
            data = inf.read(BLOCKSIZE)
            if not data: break
            # latin1 is single-byte, so a block never ends mid-character
            converted = data.decode('latin1').encode('utf-8')
            ouf.write(converted)
The byte-by-byte scanning implied by line-by-line reading, the line-end conversion (not on Linux;-), and the codecs.open-style encoding and decoding should be part of what's slowing you down. This approach is also portable (as yours is), since control characters such as \n need no translation between these codecs anyway (in any OS).
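For contrast, here is a minimal sketch of the line-by-line, codecs.open-style variant being described (the filenames are hypothetical):

import codecs

# Slower: codecs.open decodes incrementally, and iterating by line
# forces a byte-by-byte scan for line endings.
with codecs.open('dump.latin1', 'r', 'latin1') as inf:
    with codecs.open('dump.utf8', 'w', 'utf-8') as ouf:
        for line in inf:
            ouf.write(line)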
This only works for input codecs that have no multibyte characters, but 'latin1' is one of those: every character is exactly one byte, so a block boundary can never fall in the middle of a character. (It does not matter whether the output codec has multibyte characters or not.)
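(If you ever do face a multibyte input codec, say UTF-16, an incremental decoder buffers any character split across a block boundary; a sketch, again with hypothetical filenames:)

import codecs

BLOCKSIZE = 1024*1024
decoder = codecs.getincrementaldecoder('utf-16')()  # keeps partial characters between calls
with open('dump.utf16', 'rb') as inf:
    with open('dump.utf8', 'wb') as ouf:
        while True:
            data = inf.read(BLOCKSIZE)
            final = not data  # an empty read means end of file
            ouf.write(decoder.decode(data, final).encode('utf-8'))
            if final: break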
Try different block sizes to find the performance sweet spot; it depends on your disk, filesystem, and available RAM.
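To pick one, you could simply time the conversion at a few candidate sizes; a rough sketch (the input filename is a placeholder, and writing to /dev/null, on Linux, measures everything except the real output cost):

import time

def convert(src, dst, blocksize):
    with open(src, 'rb') as inf:
        with open(dst, 'wb') as ouf:
            while True:
                data = inf.read(blocksize)
                if not data: break
                ouf.write(data.decode('latin1').encode('utf-8'))

for size in (64*1024, 1024*1024, 16*1024*1024):
    start = time.time()
    convert('dump.latin1', '/dev/null', size)
    print('%8d bytes: %.2f s' % (size, time.time() - start))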
Edit: changed code per @John's comment, and clarified a condition as per @gnibbler's.