I have a number of very large text files which I need to process, the largest being about 60GB.
Each line has 54 characters in seven fields, and I want to remove the last three characters from each of the first three fields to save some space.
Since you only mention saving space as a benefit, is there some reason you can't just store the files gzipped? That should save 70% or more on this data. Or consider getting NTFS to compress the files if random access is still important. Either of those will also give you much more dramatic savings on I/O time.
More importantly, where is your data stored that you're only getting 3.4 GB/hr? That's down around USB 1.1 speeds.
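If you do go the gzip route, the standard library's gzip module lets the line-by-line approach carry over almost unchanged. A minimal sketch, assuming the same field trimming as in the question (the .gz file names here are placeholders):

import gzip

# Sketch: the same trimming loop, reading and writing gzip-compressed files.
# "input.txt.gz" and "output.txt.gz" are placeholder names; Python 3's
# gzip.open supports text mode ("rt"/"wt").
with gzip.open("input.txt.gz", "rt") as r, gzip.open("output.txt.gz", "wt") as w:
    for line in r:
        x, y, z = line.split(' ')[:3]
        w.write(line.replace(x, x[:-3]).replace(y, y[:-3]).replace(z, z[:-3]))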
Measure! You have gotten quite a few useful hints on how to improve your Python code, and I agree with them. But you should first figure out what your real problem is. My first step toward finding the bottleneck would be to time the stages separately.
Once you have identified the exact problem, ask again for optimizations of that specific problem.
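For example, one rough way to start (a sketch of my own, not a full profiling setup) is to time a read-only pass over the file and compare it with the full read-process-write run; if the two numbers are close, the per-line processing is not the bottleneck and the disk is:

import time

def time_read_only(path):
    # How long does it take just to stream the file, doing no other work?
    start = time.perf_counter()
    with open(path, "r") as f:
        for _ in f:
            pass
    return time.perf_counter() - start

print("read-only pass: %.1f s" % time_read_only("filepath"))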
def ProcessLargeTextFile():
    r = open("filepath", "r")
    w = open("filepath", "w")
    l = r.readline()
    while l:
As has been suggested already, you may want to use a for loop here to make this more efficient.
        x = l.split(' ')[0]
        y = l.split(' ')[1]
        z = l.split(' ')[2]
You are performing the split operation three times here; depending on the size of each line, this will have a detrimental impact on performance. You should split once and assign x, y, z from the entries of the resulting list.
        w.write(l.replace(x, x[:-3]).replace(y, y[:-3]).replace(z, z[:-3]))
You are writing every line to the file immediately after reading it, which is very I/O intensive. You should consider buffering your output in memory and pushing it to disk periodically. Something like this:
BUFFER_SIZE_LINES = 1024    # Maximum number of lines to buffer in memory

def ProcessLargeTextFile():
    r = open("filepath", "r")
    w = open("outfilepath", "w")    # write to a different file than the one being read
    buf = ""
    bufLines = 0
    for lineIn in r:
        x, y, z = lineIn.split(' ')[:3]
        lineOut = lineIn.replace(x, x[:-3]).replace(y, y[:-3]).replace(z, z[:-3])
        buf += lineOut              # lineOut keeps the original newline
        bufLines += 1
        if bufLines >= BUFFER_SIZE_LINES:
            # Flush buffer to disk
            w.write(buf)
            buf = ""
            bufLines = 0
    # Flush remaining buffer to disk
    w.write(buf)
    r.close()
    w.close()
You can tweak BUFFER_SIZE_LINES to find an optimal balance between memory usage and speed.
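For instance, a rough way to compare candidate buffer sizes is to time the function for each one (illustrative only; every run re-reads the same input file and rewrites the output file):

import time

for size in (256, 1024, 4096, 16384):
    BUFFER_SIZE_LINES = size        # the module-level constant used above
    start = time.perf_counter()
    ProcessLargeTextFile()
    print("%5d lines buffered: %.1f s" % (size, time.perf_counter() - start))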
Read the file using for l in r: to benefit from buffering.
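A minimal sketch of that change (file name as in the question):

# Iterating the file object directly reads it in buffered chunks behind the scenes.
with open("filepath", "r") as r:
    for l in r:
        pass    # process l here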
You can also try saving the result of the split the first time you do it, instead of splitting again every time you need a field. Maybe this will speed things up.
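For example, a small sketch of that idea, using the variable names from the question:

fields = l.split(' ')                       # split the line once...
x, y, z = fields[0], fields[1], fields[2]   # ...then reuse the saved result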
You can also try not running it from a GUI; run it from the command line instead.
It's more idiomatic to write your code like this:
def ProcessLargeTextFile():
    with open("filepath", "r") as r, open("outfilepath", "w") as w:
        for line in r:
            x, y, z = line.split(' ')[:3]
            w.write(line.replace(x, x[:-3]).replace(y, y[:-3]).replace(z, z[:-3]))
The main saving here is to do the split just once, but if the CPU is not being taxed, this is likely to make very little difference.
It may help to save up a few thousand lines at a time and write them in one hit to reduce thrashing of your hard drive. A million lines is only 54 MB of RAM!
def ProcessLargeTextFile():
    bunchsize = 1000000     # Experiment with different sizes
    bunch = []
    with open("filepath", "r") as r, open("outfilepath", "w") as w:
        for line in r:
            x, y, z = line.split(' ')[:3]
            bunch.append(line.replace(x, x[:-3]).replace(y, y[:-3]).replace(z, z[:-3]))
            if len(bunch) == bunchsize:
                w.writelines(bunch)
                bunch = []
        w.writelines(bunch)
As suggested by @Janne, here is an alternative way to generate the lines:
def ProcessLargeTextFile():
    bunchsize = 1000000     # Experiment with different sizes
    bunch = []
    with open("filepath", "r") as r, open("outfilepath", "w") as w:
        for line in r:
            x, y, z, rest = line.split(' ', 3)
            bunch.append(' '.join((x[:-3], y[:-3], z[:-3], rest)))
            if len(bunch) == bunchsize:
                w.writelines(bunch)
                bunch = []
        w.writelines(bunch)
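The maxsplit argument of 3 stops splitting after the first three spaces, so the remaining four fields (including the trailing newline) stay together in rest and are rejoined untouched. For instance, with a made-up sample line:

>>> line = "aaa111 bbb222 ccc333 d e f g\n"    # hypothetical sample line
>>> line.split(' ', 3)
['aaa111', 'bbb222', 'ccc333', 'd e f g\n']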