Question
I am quite new to Python and programming in general, but I am trying to run a "sliding window" calculation over a tab-delimited .txt file that contains about 7 million lines. What I mean by sliding window is that it will run a calculation over, say, 50,000 lines, report the number, then move up, say, 10,000 lines and perform the same calculation over another 50,000 lines. I have the calculation and the "sliding window" working correctly, and it runs well if I test it on a small subset of my data. However, if I try to run the program over my entire data set it is incredibly slow (it has now been running for about 40 hours). The math is quite simple, so I don't think it should be taking this long.
The way I am reading my .txt file right now is with the csv module's DictReader. My code is as follows:
import csv

file1 = '/Users/Shared/SmallSetbee.txt'
newfile = open(file1, 'rb')
# strip NUL bytes from each line before handing it to the csv reader
reader = csv.DictReader((line.replace('\0', '') for line in newfile), delimiter="\t")
I believe that this is making a dictionary out of all 7 million lines at once, which I'm thinking could be the reason it slows down so much for the larger file.
Since I am only interested in running my calculation over "chunks" or "windows" of the data at a time, is there a more efficient way to read in only the specified lines, perform the calculation, and then repeat with a new "chunk" or "window" of lines?
Answer 1:
A collections.deque is an ordered collection of items which can take a maximum size. When you add an item to one end, one falls off the other end. This means that to iterate over a "window" on your csv, you just need to keep adding rows to the deque and it will take care of throwing away the old ones automatically.
import collections
import csv

dq = collections.deque(maxlen=50000)
with open(...) as csv_file:
    reader = csv.DictReader((line.replace("\0", "") for line in csv_file), delimiter="\t")
    # initial fill: read the first window of 50,000 rows
    for _ in range(50000):
        dq.append(next(reader))
    # repeated compute: slide forward 10,000 rows at a time; appending to a
    # full deque drops the oldest rows automatically
    try:
        while 1:
            compute(dq)
            for _ in range(10000):
                dq.append(next(reader))
    except StopIteration:
        compute(dq)
Answer 2:
Don't use csv.DictReader, instead use csv.reader. It takes longer to create a dictionary for each row than it takes to create a list for each row. Additionally, it is marginally faster to access a list by an index than it is to access a dictionary by a key.
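As a rough illustration of the difference (the file name and column name below are hypothetical; the access pattern is what matters):

import csv

# csv.reader yields each row as a plain list, accessed by position
with open("data.txt") as f:              # hypothetical file name
    for row in csv.reader(f, delimiter="\t"):
        value = row[2]                   # third column, by index

# csv.DictReader builds a dict per row (keyed by the header line), which is
# slower to construct and to look up
with open("data.txt") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        value = row["position"]          # hypothetical column name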
I timed iteration over a 300,000 line, 4 column csv file using the two csv readers. csv.DictReader took seven times longer than csv.reader.
Combine this with katrielalex's suggestion to use collections.deque and you should see a nice speedup.
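A rough sketch of the combined approach (compute() stands in for whatever calculation you run over a window; the file path is the one from the question):

import collections
import csv

WINDOW = 50000   # rows per window
STEP = 10000     # rows to slide forward between calculations

def compute(window):
    pass         # placeholder for the actual calculation

dq = collections.deque(maxlen=WINDOW)
with open('/Users/Shared/SmallSetbee.txt') as f:
    reader = csv.reader((line.replace('\0', '') for line in f), delimiter="\t")
    try:
        for _ in range(WINDOW):          # fill the first window
            dq.append(next(reader))
        while True:
            compute(dq)
            for _ in range(STEP):        # slide; old rows fall off the deque
                dq.append(next(reader))
    except StopIteration:
        compute(dq)                      # final (possibly partial) window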
Additionally, profile your code to pinpoint where you are spending most of your time.
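For example, a minimal profiling sketch using the standard-library cProfile (main() here is a hypothetical entry point for your script):

import cProfile
import pstats

# run the hypothetical main() under the profiler and dump stats to a file
cProfile.run("main()", "stats.out")
# show the 10 functions with the most cumulative time
pstats.Stats("stats.out").sort_stats("cumulative").print_stats(10)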
Source: https://stackoverflow.com/questions/13401601/processing-a-large-txt-file-in-python-efficiently