I have a text file of about 50 GB. I check the first few characters of each line and write the line to another file chosen by that leading text.
For example.
You should definitely open and close the file as few times as possible,
because even compared with file reads and writes, opening and closing a file is far more expensive.
Consider two code blocks:
    f = open('test1.txt', 'w')
    for i in range(1000):
        f.write('\n')
    f.close()
and
    for i in range(1000):
        f = open('test2.txt', 'a')
        f.write('\n')
        f.close()
The first one takes 0.025 s, while the second one takes 0.309 s.
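If you want to reproduce this comparison yourself, here is a minimal sketch. The helper names (`write_open_once`, `write_reopen_each_time`, `timed`) are made up for illustration, and the absolute numbers will depend on your machine, OS, and file system, so treat the output as a rough indication only.

```python
import os
import tempfile
import time


def write_open_once(path, n):
    """Open the file once, write n newlines, then close it."""
    with open(path, 'w') as f:
        for _ in range(n):
            f.write('\n')


def write_reopen_each_time(path, n):
    """Reopen the file in append mode for every single write."""
    for _ in range(n):
        with open(path, 'a') as f:
            f.write('\n')


def timed(func, *args):
    """Return the wall-clock seconds that func(*args) takes."""
    start = time.perf_counter()
    func(*args)
    return time.perf_counter() - start


with tempfile.TemporaryDirectory() as tmp:
    p1 = os.path.join(tmp, 'test1.txt')
    p2 = os.path.join(tmp, 'test2.txt')
    t_once = timed(write_open_once, p1, 1000)
    t_reopen = timed(write_reopen_each_time, p2, 1000)
    # Both approaches must produce the same file contents.
    size1 = os.path.getsize(p1)
    size2 = os.path.getsize(p2)

print(f'open once:   {t_once:.3f}s')
print(f'reopen each: {t_reopen:.3f}s')
```

Both files end up identical; only the number of open/close calls differs.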
I/O operations are slow, and so is opening and closing a file.
It's much faster if you open both files (the input and the output), use a memory buffer of, let's say, 10 MB for your text processing, and then write the buffer to the output file. For example:
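    files = {}     # open output file handles, keyed by prefix
    filename = {}  # output file names, keyed by prefix
    with open(inputfile) as f:
        files['dog'] = None
        buffer = ''
        ...
        # maybe there is a loop here
        if writeflag:
            if files['dog'] is None:
                # open the output file only once, the first time it is needed
                files['dog'] = open(filename['dog'], 'a')
            buffer += remline + '\n'
            if len(buffer) > 1024*1000*10:  # about 10 MB of text
                files['dog'].write(buffer)
                buffer = ''
        ...
        # after the loop: write whatever is still in the buffer
        if buffer:
            files['dog'].write(buffer)
    for v in files.values():
        if v is not None:
            v.close()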
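The buffering idea above can be sketched as a complete, runnable function. This is only one possible shape, under some assumptions of mine: the name `split_by_prefix`, the `prefix_len` and `limit` parameters, and the `<prefix>.txt` naming scheme are all hypothetical, and the prefixes are assumed to be safe to use in file names.

```python
import os
import tempfile

BUFFER_LIMIT = 10 * 1024 * 1024  # flush once about 10 MB of text is pending


def split_by_prefix(input_path, out_dir, prefix_len=3, limit=BUFFER_LIMIT):
    """Append each line to out_dir/<prefix>.txt, opening each output once."""
    files = {}      # prefix -> open output file
    pending = {}    # prefix -> lines waiting to be written
    pending_chars = 0

    def flush_all():
        for prefix, lines in pending.items():
            if lines:
                files[prefix].writelines(lines)
                lines.clear()

    try:
        with open(input_path) as src:
            for line in src:
                prefix = line[:prefix_len].strip()
                if not prefix:
                    continue  # this sketch simply skips blank lines
                if prefix not in files:
                    # Open each output file the first time it is needed.
                    path = os.path.join(out_dir, prefix + '.txt')
                    files[prefix] = open(path, 'a')
                    pending[prefix] = []
                pending[prefix].append(line)
                pending_chars += len(line)
                if pending_chars > limit:
                    flush_all()
                    pending_chars = 0
        flush_all()  # write whatever is still buffered
    finally:
        for f in files.values():
            f.close()
    return sorted(files)


# Small demo with a tiny limit so that the buffer is flushed mid-run.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, 'input.txt')
    with open(src_path, 'w') as f:
        f.write('dog one\ncat two\ndog three\n')
    prefixes = split_by_prefix(src_path, tmp, limit=10)
    with open(os.path.join(tmp, 'dog.txt')) as f:
        dog_lines = f.read().splitlines()
    with open(os.path.join(tmp, 'cat.txt')) as f:
        cat_lines = f.read().splitlines()
```

Python's file objects already buffer writes internally, so the explicit buffer mostly matters when you want control over how much text is held in memory before it is written.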
Use the with statement. It closes the files for you automatically: do all the operations inside the with block, so the files stay open while you work and are closed once you leave the block.
    with open(inputfile) as f1, open('dog.txt', 'a') as f2, open('cat.txt', 'a') as f3:
        # do something here
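To make the "do something here" part concrete, here is a small sketch that routes lines by their first few characters. The sample data and the temporary directory are made up for the demo; only the shape of the with statement comes from the answer.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    inputfile = os.path.join(tmp, 'input.txt')
    with open(inputfile, 'w') as f:
        f.write('dog rex\ncat tom\nbird tweety\ndog fido\n')

    dog_path = os.path.join(tmp, 'dog.txt')
    cat_path = os.path.join(tmp, 'cat.txt')

    # All three files stay open for the whole loop and are closed together
    # when the with block ends, even if an exception is raised inside it.
    with open(inputfile) as f1, open(dog_path, 'a') as f2, open(cat_path, 'a') as f3:
        for line in f1:
            if line.startswith('dog'):
                f2.write(line)
            elif line.startswith('cat'):
                f3.write(line)

    closed_after = f1.closed and f2.closed and f3.closed
    with open(dog_path) as f:
        dog_text = f.read()
    with open(cat_path) as f:
        cat_text = f.read()
```

Lines that match neither prefix (the `bird` line here) are simply dropped in this sketch.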
EDIT:
If you know all the possible filenames in advance, then with is the better option. If you don't, use your approach, but instead of closing the file each time you can flush the data to it with writefile1.flush().
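For the case where the filenames are not known in advance, one alternative (not what this answer proposes, just another option from the standard library) is `contextlib.ExitStack`, which lets you open files on demand and still have them all closed when the block ends. The function name `route_lines` and the `<prefix>.txt` naming are assumptions for this sketch.

```python
import os
import tempfile
from contextlib import ExitStack


def route_lines(lines, out_dir, prefix_len=3):
    """Write each line to a file named after its prefix, opening files on demand."""
    outputs = {}
    with ExitStack() as stack:
        for line in lines:
            prefix = line[:prefix_len]
            if prefix not in outputs:
                path = os.path.join(out_dir, prefix + '.txt')
                # enter_context registers the file so the stack closes it on exit.
                outputs[prefix] = stack.enter_context(open(path, 'a'))
            outputs[prefix].write(line)
    # Every file opened above has been closed by the time we get here.
    return outputs


with tempfile.TemporaryDirectory() as tmp:
    handles = route_lines(['dog rex\n', 'cat tom\n', 'dog fido\n'], tmp)
    all_closed = all(f.closed for f in handles.values())
    with open(os.path.join(tmp, 'dog.txt')) as f:
        dog_text = f.read()
```

Keep in mind that the OS limits how many files a process can have open at once, so this only works if the number of distinct prefixes is reasonably small.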
Keep it open the whole time! Otherwise you keep telling the system that you are done writing, and it might decide to flush the data to disk each time instead of buffering it. And for obvious reasons, n disk writes are much more expensive than 1 disk write.
If you want to append to the file and not overwrite it, then yes, 'a' is the correct mode.
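A quick way to see the difference between the two modes, using a throwaway file in a temporary directory (the file name `log.txt` is just for the demo):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'log.txt')

    with open(path, 'w') as f:   # 'w' creates the file or truncates it
        f.write('first\n')
    with open(path, 'a') as f:   # 'a' keeps existing content and adds to the end
        f.write('second\n')
    with open(path) as f:
        after_append = f.read()

    with open(path, 'w') as f:   # reopening with 'w' discards what was there
        f.write('third\n')
    with open(path) as f:
        after_overwrite = f.read()
```

After the append, the file holds both lines; after reopening with 'w', only the last write remains.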