Question
I am quite new to Python and programming in general, but I am trying to run a "sliding window" calculation over a tab-delimited .txt file that contains about 7 million lines. What I mean by sliding window is that it will run a calculation over, say, 50,000 lines, report the number, then move up, say, 10,000 lines and perform the same calculation over another 50,000 lines. I have the calculation and the "sliding window" working correctly, and it runs well if I test it on a small subset of my data. However, if I try to run the program over my entire data set it is incredibly slow (it has now been running for about 40 hours). The math is quite simple, so I don't think it should be taking this long.
The way I am reading my .txt file right now is with the csv module's DictReader. My code is as follows:
import csv

file1 = '/Users/Shared/SmallSetbee.txt'
newfile = open(file1, 'rb')
# strip NUL bytes from each line before handing it to the csv reader
reader = csv.DictReader((line.replace('\0', '') for line in newfile), delimiter="\t")
I believe that this is making a dictionary out of all 7 million lines at once, which I'm thinking could be the reason it slows down so much for the larger file.
Since I am only interested in running my calculation over "chunks" or "windows" of the data at a time, is there a more efficient way to read in only the specified lines, perform the calculation, and then repeat with a new "chunk" or "window" of lines?
Answer 1:
A collections.deque is an ordered collection of items which can take a maximum size. When you add an item to one end, one falls off the other end. This means that to iterate over a "window" on your csv, you just need to keep adding rows to the deque and it will take care of throwing away the old ones automatically.
import collections
import csv

dq = collections.deque(maxlen=50000)
with open(...) as csv_file:
    reader = csv.DictReader((line.replace("\0", "") for line in csv_file), delimiter="\t")
    # initial fill: read the first window of 50,000 rows
    for _ in range(50000):
        dq.append(next(reader))
    # repeated compute: slide forward 10,000 rows at a time; appending to a
    # full deque drops the oldest rows automatically
    try:
        while 1:
            compute(dq)
            for _ in range(10000):
                dq.append(next(reader))
    except StopIteration:
        compute(dq)
Answer 2:
Don't use csv.DictReader, instead use csv.reader. It takes longer to create a dictionary for each row than it takes to create a list for each row. Additionally, it is marginally faster to access a list by an index than it is to access a dictionary by a key.
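As a rough illustration of the difference (the file name and column name below are hypothetical; the access pattern is what matters):

import csv

# csv.reader yields each row as a plain list, accessed by position
with open("data.txt") as f:              # hypothetical file name
    for row in csv.reader(f, delimiter="\t"):
        value = row[2]                   # third column, by index

# csv.DictReader builds a dict per row (keyed by the header line), which is
# slower to construct and to look up
with open("data.txt") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        value = row["position"]          # hypothetical column name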
I timed iteration over a 300,000 line, 4 column csv file using the two csv readers. csv.DictReader took seven times longer than csv.reader.
Combine this with katrielalex's suggestion to use collections.deque and you should see a nice speedup.
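A rough sketch of the combined approach (compute() stands in for whatever calculation you run over a window; the file path is the one from the question):

import collections
import csv

WINDOW = 50000   # rows per window
STEP = 10000     # rows to slide forward between calculations

def compute(window):
    pass         # placeholder for the actual calculation

dq = collections.deque(maxlen=WINDOW)
with open('/Users/Shared/SmallSetbee.txt') as f:
    reader = csv.reader((line.replace('\0', '') for line in f), delimiter="\t")
    try:
        for _ in range(WINDOW):          # fill the first window
            dq.append(next(reader))
        while True:
            compute(dq)
            for _ in range(STEP):        # slide; old rows fall off the deque
                dq.append(next(reader))
    except StopIteration:
        compute(dq)                      # final (possibly partial) window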
Additionally, profile your code to pinpoint where you are spending most of your time.
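For example, a minimal profiling sketch using the standard-library cProfile (main() here is a hypothetical entry point for your script):

import cProfile
import pstats

# run the hypothetical main() under the profiler and dump stats to a file
cProfile.run("main()", "stats.out")
# show the 10 functions with the most cumulative time
pstats.Stats("stats.out").sort_stats("cumulative").print_stats(10)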
Source: https://stackoverflow.com/questions/13401601/processing-a-large-txt-file-in-python-efficiently