I was working on a script which reading a folder of files(each of size ranging from 20 MB to 100 MB), modifies some data in each line, and writes back to a copy of the file.
file.writelines()
expects an iterable of strings. It then proceeds to loop and call file.write()
for each string in the iterable. In Python, the method does this:
def writelines(self, lines)
for line in lines:
self.write(line)
You are passing in a single large string, and a string is an iterable of strings too. When iterating you get individual characters, strings of length 1. So in effect you are making len(data)
separate calls to file.write()
. And that is slow, because you are building up a write buffer a single character at a time.
Don't pass in a single string to file.writelines()
. Pass in a list or tuple or other iterable instead.
You could send in individual lines with added newline in a generator expression, for example:
myWrite.writelines(line + '\n' for line in new_my_list)
Now, if you could make clean_data()
a generator, yielding cleaned lines, you could stream data from the input file, through your data cleaning generator, and out to the output file without using any more memory than is required for the read and write buffers and however much state is needed to clean your lines:
with open(inputPath, 'r+') as myRead, open(outPath, 'w+') as myWrite:
myWrite.writelines(line + '\n' for line in clean_data(myRead))
In addition, I'd consider updating clean_data()
to emit lines with newlines included.
'write(arg)' method expects string as its argument. So once it calls, it will directly writes. this is the reason it is much faster.
where as if you are using writelines()
method, it expects list of string as iterator. so even if you are sending data to writelines
, it assumes that it got iterator and it tries to iterate over it. so since it is an iterator it will take some time to iterate over and write it.
Is that clear ?
as a complement to Martijn answer, the best way would be to avoid to build the list using join
in the first place
Just pass a generator comprehension to writelines
, adding the newline in the end: no unnecessary memory allocation and no loop (besides the comprehension)
myWrite.writelines("{}\n".format(x) for x in my_list)