Process very large (>20GB) text file line by line

慢半拍i 2020-11-29 17:54

I have a number of very large text files which I need to process, the largest being about 60GB.

Each line has 54 characters in seven fields, and I want to remove the last three characters from each of the first three fields.

11 answers
  • 2020-11-29 18:08

    Since you only mention saving space as a benefit, is there some reason you can't just store the files gzipped? That should save 70% and up on this data. Or consider getting NTFS to compress the files if random access is still important. You'll get much more dramatic savings on I/O time after either of those.

    More importantly, where is your data that you're getting only 3.4GB/hr? That's down around USBv1 speeds.
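
    For the gzip route, here is a minimal sketch of processing a compressed copy line by line with the standard gzip module (the file names are placeholders, and the trimming logic is assumed to match the other answers):

    import gzip

    def ProcessGzippedTextFile():
        # "input.txt.gz" / "output.txt.gz" are placeholder names
        with gzip.open("input.txt.gz", "rt") as r, gzip.open("output.txt.gz", "wt") as w:
            for line in r:
                x, y, z = line.split(' ')[:3]
                w.write(line.replace(x, x[:-3]).replace(y, y[:-3]).replace(z, z[:-3]))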

  • 2020-11-29 18:10

    Measure! You've gotten quite a few useful hints on how to improve your Python code, and I agree with them. But you should first figure out what your real problem is. My first steps to find your bottleneck would be:

    • Remove any processing from your code. Just read and write the data and measure the speed (see the sketch after this list). If plain reading and writing the files is already too slow, the problem is not your code.
    • If plain reading and writing is already slow, try using multiple disks. You are reading and writing at the same time. On the same disk? If so, try different disks and measure again.
    • Some async I/O library (Twisted?) might help, too.

    Once you have figured out the exact problem, ask again about optimizing that specific step.
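
    A rough sketch of that first step, with placeholder paths (it only copies the data in chunks and reports the throughput, no processing):

    import time

    def MeasureRawCopySpeed():
        # Placeholder paths; ideally the input and output live on different disks
        start = time.time()
        bytesCopied = 0
        with open("filepath", "rb") as r, open("outfilepath", "wb") as w:
            # Read fixed-size chunks until read() returns an empty bytes object
            for chunk in iter(lambda: r.read(1024 * 1024), b""):
                w.write(chunk)
                bytesCopied += len(chunk)
        elapsed = time.time() - start
        print("%.1f MB in %.1f s (%.1f MB/s)" % (bytesCopied / 1e6, elapsed, bytesCopied / 1e6 / elapsed))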

  • 2020-11-29 18:10
    def ProcessLargeTextFile():
        r = open("filepath", "r")
        w = open("filepath", "w")
        l = r.readline()
        while l:
    

    As has been suggested already, you may want to use a for loop to make this more efficient.

        x = l.split(' ')[0]
        y = l.split(' ')[1]
        z = l.split(' ')[2]
    

    You are performing the split operation three times here; depending on the size of each line, this will have a detrimental impact on performance. You should split once and assign x, y, and z from the list that comes back.

        w.write(l.replace(x,x[:-3]).replace(y,y[:-3]).replace(z,z[:-3]))
    

    You are writing each line to the file immediately after reading it, which is very I/O intensive. You should consider buffering your output in memory and pushing it to disk periodically. Something like this:

    BUFFER_SIZE_LINES = 1024 # Maximum number of lines to buffer in memory

    def ProcessLargeTextFile():
        r = open("filepath", "r")
        w = open("outfilepath", "w")  # write to a separate file, not the one being read
        buf = ""
        bufLines = 0
        for lineIn in r:

            x, y, z = lineIn.split(' ')[:3]
            lineOut = lineIn.replace(x, x[:-3]).replace(y, y[:-3]).replace(z, z[:-3])
            buf += lineOut  # lineOut already ends with a newline
            bufLines += 1

            if bufLines >= BUFFER_SIZE_LINES:
                # Flush buffer to disk
                w.write(buf)
                buf = ""
                bufLines = 0

        # Flush remaining buffer to disk
        w.write(buf)
        r.close()
        w.close()


    You can tweak BUFFER_SIZE_LINES to find a good balance between memory usage and speed.

  • 2020-11-29 18:10

    Read the file using for l in r: to benefit from buffering.
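
    A tiny sketch of that idiom applied to a plain copy (the paths are placeholders): the for loop replaces explicit readline() calls and lets Python handle the read buffering.

    def CopyLineByLine():
        # "filepath" / "outfilepath" are placeholder paths
        with open("filepath", "r") as r, open("outfilepath", "w") as w:
            for l in r:   # buffered, line-by-line iteration
                w.write(l)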

  • 2020-11-29 18:11

    You can try saving the result of the split the first time you do it, rather than splitting again every time you need a field. That may speed things up.

    You can also try not running it from a GUI; run it from the command line instead.

  • 2020-11-29 18:13

    It's more idiomatic to write your code like this:

    def ProcessLargeTextFile():
        with open("filepath", "r") as r, open("outfilepath", "w") as w:
            for line in r:
                x, y, z = line.split(' ')[:3]
                w.write(line.replace(x,x[:-3]).replace(y,y[:-3]).replace(z,z[:-3]))
    

    The main saving here is doing the split just once, but if the CPU is not being taxed, this is likely to make very little difference.

    It may help to save up a few thousand lines at a time and write them in one hit to reduce thrashing of your hard drive. A million lines is only 54MB of RAM!

    def ProcessLargeTextFile():
        bunchsize = 1000000     # Experiment with different sizes
        bunch = []
        with open("filepath", "r") as r, open("outfilepath", "w") as w:
            for line in r:
                x, y, z = line.split(' ')[:3]
                bunch.append(line.replace(x,x[:-3]).replace(y,y[:-3]).replace(z,z[:-3]))
                if len(bunch) == bunchsize:
                    w.writelines(bunch)
                    bunch = []
            w.writelines(bunch)
    

    As suggested by @Janne, here is an alternative way to generate the lines:

    def ProcessLargeTextFile():
        bunchsize = 1000000     # Experiment with different sizes
        bunch = []
        with open("filepath", "r") as r, open("outfilepath", "w") as w:
            for line in r:
                x, y, z, rest = line.split(' ', 3)
                bunch.append(' '.join((x[:-3], y[:-3], z[:-3], rest)))
                if len(bunch) == bunchsize:
                    w.writelines(bunch)
                    bunch = []
            w.writelines(bunch)
    