Parsing unique words from a text file

时光取名叫无心 2021-01-14 12:38

I'm working on a project to parse out unique words from a large number of text files. I've got the file handling down, but I'm trying to refine the parsing procedure.

2 Answers
  • 2021-01-14 13:32

    Try replacing report_list with a set (or a dictionary). word_check not in report_list is slow when report_list is a list, because each test scans the whole list; a set does a hash lookup instead.
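A quick sketch of the difference (hypothetical data; both tests return the same answer, but a list membership test scans every element while a set test is an average O(1) hash lookup):

```python
words_list = ["alpha", "beta", "gamma"]
words_set = set(words_list)

# Same result either way, but very different cost on large collections.
print("beta" in words_list)  # True — O(n) scan of the list
print("beta" in words_set)   # True — O(1) average hash lookup
```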

  • 2021-01-14 13:37

    One problem is that an in test for a list is slow. You should probably keep a set to keep track of what words you have seen, because the in test for a set is very fast.

    Example:

    report_set = set()
    for line in report:
        for word in line.split():
            if we_want_to_keep_word(word):
                report_set.add(word)
    

    Then when you are done: report_list = list(report_set)

    Anytime you need to force a set into a list, you can. But if you just need to loop over it or do in tests, you can leave it as a set; it's legal to do for x in report_set:
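A small sketch of both points, using hypothetical words:

```python
report_set = {"apple", "banana", "cherry"}

# Force the set into a list only when you actually need list behaviour
# (indexing, a stable order, etc.); sorted() accepts a set directly.
report_list = sorted(report_set)
print(report_list)  # ['apple', 'banana', 'cherry']

# For iteration and membership tests, the set works as-is.
for word in report_set:
    assert word in report_set
```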

    Another problem that might or might not matter is that you are slurping all the lines from the file in one go, using the .readlines() method. For really large files it is better to just use the open file-handle object as an iterator, like so:

    with open("filename", "r") as f:
        for line in f:
            ... # process each line here
    

    A big problem is that I don't even see how this code can work:

    while 1:
        lines = report.readlines()
        if not lines:
            break
    

    This loops exactly twice, and not usefully. The first call to .readlines() slurps all the input lines, then we loop again; by then report is already exhausted, so the second call to .readlines() returns an empty list, which breaks out of the loop. But that has thrown away all the lines we just read, and the rest of the code is left with an empty lines variable. How does this even work?
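The exhaustion behaviour is easy to see with an in-memory file (io.StringIO stands in for the real report file here):

```python
import io

report = io.StringIO("line one\nline two\n")

first = report.readlines()   # slurps everything in one go
second = report.readlines()  # the file object is now exhausted

print(first)   # ['line one\n', 'line two\n']
print(second)  # []
```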

    So, get rid of that whole while 1 loop, and change the next loop to for line in report:.

    Also, you don't really need to keep a count variable. You can use len(report_set) at any time to find out how many words are in the set.

    Also, with a set you don't actually need to check whether a word is in the set; you can just always call report_set.add(word) and if it's already in the set it won't be added again!
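Both of those points together, as a quick sketch with made-up input:

```python
report_set = set()
for word in ["apple", "apple", "banana"]:
    # No membership check needed: adding a duplicate is a no-op.
    report_set.add(word)

print(len(report_set))  # 2 — the set itself is the running count
```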

    Also, you don't have to do it my way, but I like to make a generator that does all the processing. Strip the line, translate the line, split on whitespace, and yield up words ready to use. I would also force the words to lower-case except I don't know whether it's important that FOOTNOTES be detected only in upper-case.

    So, put all the above together and you get:

    import string

    def words(file_object):
        for line in file_object:
            # Python 3: str.translate takes a mapping; build it with str.maketrans
            line = line.strip().translate(str.maketrans('', '', string.punctuation))
            for word in line.split():
                yield word
    
    report_set = set()
    with open(fullpath, 'r') as report:
        for word in words(report):
            if word == "FOOTNOTES":
                break
            word = word.lower()
            if len(word) > 2 and word not in dict_file:
                report_set.add(word)
    
    print("Words in report_set: %d" % len(report_set))
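To sketch how the words generator behaves on its own, with io.StringIO standing in for an open file (the generator is repeated here so the snippet is self-contained, using the Python 3 str.maketrans form of the punctuation strip):

```python
import io
import string

def words(file_object):
    for line in file_object:
        line = line.strip().translate(str.maketrans('', '', string.punctuation))
        for word in line.split():
            yield word

sample = io.StringIO("Hello, world!\nHello again.\n")
print(list(words(sample)))  # ['Hello', 'world', 'Hello', 'again']
```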
    