Process very large (>20GB) text file line by line

慢半拍i 2020-11-29 17:54

I have a number of very large text files which I need to process, the largest being about 60GB.

Each line has 54 characters in seven fields, and I want to remove the last three characters from each of the first three fields.

11 answers
  • 2020-11-29 18:08

    Since you only mention saving space as a benefit, is there some reason you can't just store the files gzipped? That should save 70% and up on this data. Or consider getting NTFS to compress the files if random access is still important. You'll get much more dramatic savings on I/O time after either of those.

    More importantly, where is your data that you're getting only 3.4GB/hr? That's down around USBv1 speeds.
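
    For the gzip route, here is a minimal sketch of processing a compressed copy line by line with the standard gzip module (the file names are placeholders, and the trimming logic is assumed to match the other answers):

    import gzip

    def ProcessGzippedTextFile():
        # "input.txt.gz" / "output.txt.gz" are placeholder names
        with gzip.open("input.txt.gz", "rt") as r, gzip.open("output.txt.gz", "wt") as w:
            for line in r:
                x, y, z = line.split(' ')[:3]
                w.write(line.replace(x, x[:-3]).replace(y, y[:-3]).replace(z, z[:-3]))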

  • 2020-11-29 18:10

    Measure! You've gotten quite a few useful hints on how to improve your Python code, and I agree with them. But you should first figure out what your real problem is. My first steps to find your bottleneck would be:

    • Remove any processing from your code. Just read and write the data and measure the speed (see the sketch after this list). If plain reading and writing the files is already too slow, the problem is not your code.
    • If plain reading and writing is already slow, try using multiple disks. You are reading and writing at the same time. On the same disk? If so, try different disks and measure again.
    • Some async I/O library (Twisted?) might help, too.

    Once you have figured out the exact problem, ask again about optimizing that specific step.
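
    A rough sketch of that first step, with placeholder paths (it only copies the data in chunks and reports the throughput, no processing):

    import time

    def MeasureRawCopySpeed():
        # Placeholder paths; ideally the input and output live on different disks
        start = time.time()
        bytesCopied = 0
        with open("filepath", "rb") as r, open("outfilepath", "wb") as w:
            # Read fixed-size chunks until read() returns an empty bytes object
            for chunk in iter(lambda: r.read(1024 * 1024), b""):
                w.write(chunk)
                bytesCopied += len(chunk)
        elapsed = time.time() - start
        print("%.1f MB in %.1f s (%.1f MB/s)" % (bytesCopied / 1e6, elapsed, bytesCopied / 1e6 / elapsed))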

  • 2020-11-29 18:10
    def ProcessLargeTextFile():
        r = open("filepath", "r")
        w = open("filepath", "w")
        l = r.readline()
        while l:
    

    As has been suggested already, you may want to use a for loop to make this more efficient.

        x = l.split(' ')[0]
        y = l.split(' ')[1]
        z = l.split(' ')[2]
    

    You are performing the split operation three times here; depending on the size of each line, this will have a detrimental impact on performance. You should split once and assign x, y, and z from the list that comes back.

        w.write(l.replace(x,x[:-3]).replace(y,y[:-3]).replace(z,z[:-3]))
    

    You are writing each line to the file immediately after reading it, which is very I/O intensive. You should consider buffering your output in memory and pushing it to disk periodically. Something like this:

    BUFFER_SIZE_LINES = 1024 # Maximum number of lines to buffer in memory

    def ProcessLargeTextFile():
        r = open("filepath", "r")
        w = open("outfilepath", "w")  # write to a separate file, not the one being read
        buf = ""
        bufLines = 0
        for lineIn in r:

            x, y, z = lineIn.split(' ')[:3]
            lineOut = lineIn.replace(x, x[:-3]).replace(y, y[:-3]).replace(z, z[:-3])
            buf += lineOut  # lineOut already ends with a newline
            bufLines += 1

            if bufLines >= BUFFER_SIZE_LINES:
                # Flush buffer to disk
                w.write(buf)
                buf = ""
                bufLines = 0

        # Flush remaining buffer to disk
        w.write(buf)
        r.close()
        w.close()


    You can tweak BUFFER_SIZE_LINES to find a good balance between memory usage and speed.

  • 2020-11-29 18:10

    Read the file using for l in r: to benefit from buffering.
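
    A tiny sketch of that idiom applied to a plain copy (the paths are placeholders): the for loop replaces explicit readline() calls and lets Python handle the read buffering.

    def CopyLineByLine():
        # "filepath" / "outfilepath" are placeholder paths
        with open("filepath", "r") as r, open("outfilepath", "w") as w:
            for l in r:   # buffered, line-by-line iteration
                w.write(l)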

  • 2020-11-29 18:11

    You can try saving the result of the split the first time you do it, rather than splitting again every time you need a field. That may speed things up.

    You can also try not running it from a GUI; run it from the command line instead.

  • 2020-11-29 18:13

    It's more idiomatic to write your code like this:

    def ProcessLargeTextFile():
        with open("filepath", "r") as r, open("outfilepath", "w") as w:
            for line in r:
                x, y, z = line.split(' ')[:3]
                w.write(line.replace(x,x[:-3]).replace(y,y[:-3]).replace(z,z[:-3]))
    

    The main saving here is doing the split just once, but if the CPU is not being taxed, this is likely to make very little difference.

    It may help to save up a few thousand lines at a time and write them in one hit to reduce thrashing of your hard drive. A million lines is only 54MB of RAM!

    def ProcessLargeTextFile():
        bunchsize = 1000000     # Experiment with different sizes
        bunch = []
        with open("filepath", "r") as r, open("outfilepath", "w") as w:
            for line in r:
                x, y, z = line.split(' ')[:3]
                bunch.append(line.replace(x,x[:-3]).replace(y,y[:-3]).replace(z,z[:-3]))
                if len(bunch) == bunchsize:
                    w.writelines(bunch)
                    bunch = []
            w.writelines(bunch)
    

    As suggested by @Janne, here is an alternative way to generate the lines:

    def ProcessLargeTextFile():
        bunchsize = 1000000     # Experiment with different sizes
        bunch = []
        with open("filepath", "r") as r, open("outfilepath", "w") as w:
            for line in r:
                x, y, z, rest = line.split(' ', 3)
                bunch.append(' '.join((x[:-3], y[:-3], z[:-3], rest)))
                if len(bunch) == bunchsize:
                    w.writelines(bunch)
                    bunch = []
            w.writelines(bunch)
    