Python: how to check if an item is in a list efficiently?

盖世英雄少女心 2020-12-30 15:28

I have a list of strings (word-like items), and while I am parsing a text I need to check whether each word belongs to the group of words in my current list.

However, my input

4 Answers
  • 2020-12-30 15:40

    This uses a list comprehension:

    words_in_line = [word for word in line.split() if word in my_list]
    

    which would be more efficient than the code you posted, though how much more efficient for a data set as large as yours is hard to say.
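
    A minimal runnable sketch (the word list here is hypothetical; converting `my_list` to a set first makes each membership test O(1) on average):

    my_list = ['apple', 'banana', 'cherry']  # hypothetical word list
    word_set = set(my_list)                  # O(1) average-case membership tests
    line = 'apple pie with banana and cream'
    words_in_line = [word for word in line.split() if word in word_set]
    print(words_in_line)                     # ['apple', 'banana']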

  • 2020-12-30 15:54

    I'm not clear on why you chose a list in the first place, but here are some alternatives:

    Using a set() is likely a good idea. It is very fast, though unordered; sometimes that's exactly what's needed.

    If you need things ordered and to have arbitrary lookups as well, you could use a tree of some sort: http://stromberg.dnsalias.org/~strombrg/python-tree-and-heap-comparison/

    If set membership testing with a small number of false positives here or there is acceptable, you might check into a bloom filter: http://stromberg.dnsalias.org/~strombrg/drs-bloom-filter/
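
    A toy sketch of the idea (not the linked library; the bit-array size and hash count below are arbitrary choices):

    import hashlib

    class BloomFilter:
        def __init__(self, num_bits=1024, num_hashes=3):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(num_bits // 8 + 1)

        def _positions(self, item):
            # Derive several bit positions from independent hashes of the item
            for i in range(self.num_hashes):
                h = hashlib.sha256(f'{i}:{item}'.encode()).digest()
                yield int.from_bytes(h[:8], 'big') % self.num_bits

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, item):
            # All bits set -> "probably present"; any bit clear -> definitely absent
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))

    bf = BloomFilter()
    bf.add('apple')
    print('apple' in bf)  # True
    print('pear' in bf)   # False (usually; false positives are possible)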

    Depending on what you're doing, a trie might also be very good.
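
    For illustration, a minimal trie sketch for exact-word membership testing (a hypothetical implementation, not one of the linked libraries):

    class TrieNode:
        def __init__(self):
            self.children = {}   # char -> TrieNode
            self.is_word = False

    class Trie:
        def __init__(self, words=()):
            self.root = TrieNode()
            for w in words:
                self.add(w)

        def add(self, word):
            node = self.root
            for ch in word:
                node = node.children.setdefault(ch, TrieNode())
            node.is_word = True

        def __contains__(self, word):
            node = self.root
            for ch in word:
                node = node.children.get(ch)
                if node is None:
                    return False
            return node.is_word

    t = Trie(['cat', 'car', 'dog'])
    print('car' in t, 'ca' in t)  # True False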

  • 2020-12-30 15:54

    There are two improvements you can make here.

    • Back your word list with a hash table. This affords O(1) average-case lookups when you check whether a word is present in your word list. There are a number of ways to do this; the most fitting in this scenario is to convert your list to a set.
    • Use a more appropriate structure for your matching-word collection.
      • If you need to store all of the matches in memory at the same time, use a `collections.deque`, whose appends avoid the occasional resizing cost a list incurs (see the sketch after this list).
      • If you don't need all the matches in memory at once, consider using a generator. A generator iterates over matched values according to the logic you specify, but only holds part of the result in memory at a time. It may offer improved performance if you are experiencing I/O bottlenecks.
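
    A quick sketch of the deque option (hypothetical file path and dictionary, assuming all matches fit in memory):

    from collections import deque

    d = set(['a', 'b', 'c'])  # dictionary as a set for O(1) lookups
    matches = deque()         # O(1) appends without list resizing
    with open('c:\\input.txt', 'r') as f:
        for line in f:
            for word in line.split():
                if word in d:
                    matches.append(word)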

    Below is an example implementation based on my suggestions (opting for a generator, since I can't imagine you need all those words in memory at once).

    from itertools import chain

    d = set(['a', 'b', 'c'])  # Load our dictionary
    with open('c:\\input.txt', 'r') as f:
        # Build a generator to get the words in the file
        all_words_generator = chain.from_iterable(line.split() for line in f)
        # Build a generator to filter out the non-dictionary words
        matching_words_generator = (word for word in all_words_generator if word in d)
        # The file is read lazily during this loop, so it must stay open until we finish
        for matched_word in matching_words_generator:
            # Do something with matched_word
            print(matched_word)
    

    input.txt

    a b dog cat
    c dog poop
    maybe b cat
    dog
    

    Output

    a
    b
    c
    b
    
  • 2020-12-30 15:57

    You might consider a trie or a DAWG or a database. There are several Python implementations of each.

    Here are some relative timings of a set vs. a list for you to consider:

    import timeit
    import random

    with open('/usr/share/dict/words', 'r') as di:  # UNIX word list, ~250k unique words
        all_words_set = {line.strip() for line in di}

    all_words_list = list(all_words_set)  # membership testing in a list is a linear scan

    test_list = [random.choice(all_words_list) for i in range(10000)]
    test_set = set(test_list)

    def set_f():
        count = 0
        for word in test_set:
            if word in all_words_set:
                count += 1
        return count

    def list_f():
        count = 0
        for word in test_list:
            if word in all_words_list:
                count += 1
        return count

    def mix_f():
        # use a list for the source, a set for membership testing
        count = 0
        for word in test_list:
            if word in all_words_set:
                count += 1
        return count

    print("list:", timeit.Timer(list_f).timeit(1), "secs")
    print("set:", timeit.Timer(set_f).timeit(1), "secs")
    print("mixed:", timeit.Timer(mix_f).timeit(1), "secs")
    

    Prints:

    list: 47.4126560688 secs
    set: 0.00277495384216 secs
    mixed: 0.00166988372803 secs
    

    That is, membership-testing 10,000 words against a set of 250,000 words is 17,085× faster than testing the same 10,000 words against a list of the same 250,000 words. Using a list for the source and a set for membership testing is 28,392× faster than an unsorted list alone.

    For membership testing, a list is O(n), while sets and dicts are O(1) average case.
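
    If you must keep a list, keeping it sorted and searching with `bisect` is a middle ground at O(log n) per lookup (a hypothetical sketch, not part of the timings above):

    import bisect

    sorted_words = sorted(['cherry', 'apple', 'banana'])

    def in_sorted(words, word):
        # binary search: O(log n) per membership test
        i = bisect.bisect_left(words, word)
        return i < len(words) and words[i] == word

    print(in_sorted(sorted_words, 'banana'))  # True
    print(in_sorted(sorted_words, 'pear'))    # False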

    Conclusion: Use better data structures for 600 million lines of text!
