I need the fastest way to convert files from latin1 to utf-8 in Python. The files are large, about 2 GB each (I am moving DB data). So far I have:
import codecs
infile = codec
If you are desperate to do it in Python (or any other language), at least do the I/O in bigger chunks than lines, and avoid the codecs overhead.
infile = open(tmpfile, 'rb')
outfile = open(tmpfile1, 'wb')
BLOCKSIZE = 65536  # experiment with size
while True:
    block = infile.read(BLOCKSIZE)
    if not block:
        break
    # latin1 is single-byte, so a block can never end mid-character
    outfile.write(block.decode('latin1').encode('utf8'))
infile.close()
outfile.close()
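For what it's worth, the same loop can be wrapped in a small function so the block size is easy to vary when timing. This is only a sketch: the name `latin1_to_utf8` and its parameters are made up for illustration, and `with` is used so both files get closed even if something fails halfway.

```python
def latin1_to_utf8(src_path, dst_path, block_size=65536):
    """Re-encode src_path (latin1) into dst_path (UTF-8), one block at a time.

    Hypothetical helper for illustration; block_size is a starting point
    to experiment with, not a tuned value.
    """
    with open(src_path, 'rb') as infile, open(dst_path, 'wb') as outfile:
        while True:
            block = infile.read(block_size)
            if not block:
                break
            # latin1 uses exactly one byte per character, so a block
            # boundary can never split a character in two.
            outfile.write(block.decode('latin1').encode('utf8'))
```

Calling it as `latin1_to_utf8(tmpfile, tmpfile1, block_size=1 << 20)` would try 1 MiB blocks instead of 64 KiB.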
Otherwise, go with iconv ... I haven't looked under the hood, but I'd be surprised if it doesn't special-case latin1 input :-)
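If the conversion still has to be kicked off from a Python script, one option is to shell out to iconv. A rough sketch, assuming an `iconv` binary is on the PATH and that it accepts the encoding names `ISO-8859-1` and `UTF-8` (check `iconv -l` on your system); the function name is hypothetical.

```python
import subprocess

def iconv_latin1_to_utf8(src_path, dst_path):
    """Convert src_path from latin1 to UTF-8 by running the external iconv tool.

    Hypothetical wrapper; assumes iconv is installed and on the PATH.
    """
    with open(dst_path, 'wb') as outfile:
        # iconv writes the converted data to stdout, which is redirected
        # into the destination file; check=True raises if iconv fails.
        subprocess.run(
            ['iconv', '-f', 'ISO-8859-1', '-t', 'UTF-8', src_path],
            stdout=outfile,
            check=True,
        )
```

Whether this beats the pure-Python loop is something to measure on your own files rather than assume.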