I'm working on a project to parse out unique words from a large number of text files. I've got the file handling down, but I'm trying to refine the parsing procedure.
Try replacing report_list with a dictionary or set. `word_check not in report_list` is slow if `report_list` is a list.
One problem is that an `in` test for a list is slow. You should probably keep a `set` to keep track of what words you have seen, because the `in` test for a `set` is very fast.
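To see the difference concretely, here's a quick timing sketch (the 100,000 fake "words" are made up purely for illustration):

```python
import timeit

# Hypothetical data: 100,000 fake "words"
words_list = [str(n) for n in range(100_000)]
words_set = set(words_list)

# Membership test for an item near the end: O(n) scan for a list,
# O(1) hash lookup for a set.
t_list = timeit.timeit(lambda: "99999" in words_list, number=100)
t_set = timeit.timeit(lambda: "99999" in words_set, number=100)
print(t_list > t_set)  # the set lookup is dramatically faster
```

The gap grows with the size of the collection, which is exactly why a `set` is the right structure for "have I seen this word before?"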
Example:

```python
report_set = set()
for line in report:
    for word in line.split():
        if we_want_to_keep_word(word):
            report_set.add(word)
```

Then when you are done: `report_list = list(report_set)`
Anytime you need to force a `set` into a `list`, you can. But if you just need to loop over it or do `in` tests, you can leave it as a `set`; it's legal to do `for x in report_set:`
Another problem that might or might not matter is that you are slurping all the lines from the file in one go, using the `.readlines()` method. For really large files it is better to just use the open file-handle object as an iterator, like so:

```python
with open("filename", "r") as f:
    for line in f:
        ...  # process each line here
```
A big problem is that I don't even see how this code can work:

```python
while 1:
    lines = report.readlines()
    if not lines:
        break
```

At first glance `while 1` looks like an infinite loop, but it isn't: the first call to `.readlines()` slurps all the input lines, then we loop again, and by the second call `report` is already exhausted, so `.readlines()` returns an empty list, which breaks out of the loop. But that second call has now overwritten `lines`, so the rest of the code must make do with an empty `lines` variable. How does this even work?
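You can see this exhaustion behaviour with a small self-contained demo, using `io.StringIO` in place of a real file:

```python
import io

report = io.StringIO("alpha\nbeta\ngamma\n")

first = report.readlines()   # slurps everything in one go
second = report.readlines()  # file position is at EOF: empty list

print(first)   # ['alpha\n', 'beta\n', 'gamma\n']
print(second)  # []
```

After the first `.readlines()`, every subsequent call returns `[]` until you rewind with `seek(0)`, which is why the `while 1` loop above ends its second pass with `lines` empty.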
So, get rid of that whole `while 1` loop, and change the next loop to `for line in report:`.
Also, you don't really need to keep a `count` variable. You can use `len(report_set)` at any time to find out how many words are in the `set`.
Also, with a `set` you don't actually need to check whether a word is `in` the set; you can just always call `report_set.add(word)` and if it's already in the `set` it won't be added again!
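A tiny demonstration of that idempotence (the sample words are made up):

```python
report_set = set()
for word in ["apple", "banana", "apple"]:
    report_set.add(word)  # the duplicate "apple" is silently ignored

print(len(report_set))  # 2
```

So the membership check and the add collapse into one operation, which is both simpler and faster than testing `in` first.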
Also, you don't have to do it my way, but I like to make a generator that does all the processing: strip the line, translate the line, split on whitespace, and yield up words ready to use. I would also force the words to lower-case, except I don't know whether it's important that `FOOTNOTES` be detected only in upper-case.
So, put all the above together and you get:

```python
import string

def words(file_object):
    for line in file_object:
        # Remove punctuation (Python 3 spelling of str.translate)
        line = line.strip().translate(str.maketrans('', '', string.punctuation))
        for word in line.split():
            yield word

report_set = set()
# fullpath and dict_file are assumed to be defined earlier in your code
with open(fullpath, 'r') as report:
    for word in words(report):
        if word == "FOOTNOTES":
            break
        word = word.lower()
        if len(word) > 2 and word not in dict_file:
            report_set.add(word)

print("Words in report_set: %d" % len(report_set))
```
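As a quick sanity check of the `words()` generator on its own, you can feed it an in-memory file (the sample text here is made up):

```python
import io
import string

def words(file_object):
    for line in file_object:
        line = line.strip().translate(str.maketrans('', '', string.punctuation))
        for word in line.split():
            yield word

sample = io.StringIO("Hello, world!\nHello again.\n")
print(list(words(sample)))  # ['Hello', 'world', 'Hello', 'again']
```

Punctuation is stripped and each whitespace-separated token comes out one at a time, so the consuming loop never has to care about line structure.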