High performance mass short string search in Python

[愿得一人] 2021-02-05 13:55

The Problem: A large static list of strings is provided as A, and a long string is provided as B. The strings in A are all very short (a keywords

5 Answers
  • 2021-02-05 14:32

    Depending on how long your long string is, it may be worth it to do something like this:

    ls = 'my long string of stuff'
    # Generate all non-empty substrings of ls; the set keeps only unique ones
    x = {ls[p:y] for p in range(len(ls)) for y in range(p + 1, len(ls) + 1)}

    # Keep the keywords from A that occur somewhere in ls
    result = [word for word in A if word in x]
    

    Obviously, if your long string is very long this also becomes too slow, since a string of length n has n(n+1)/2 substrings, but it should be faster for any string under a few hundred characters.
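
    Because the strings in A are all very short, the substring set can be kept much smaller by only generating substrings up to the length of the longest keyword. The following is a minimal sketch of that variation, not part of the original answer; the function name `find_keywords` and the sample data are made up for illustration.

```python
def find_keywords(A, B):
    """Return the keywords from A that occur as substrings of B."""
    if not A:
        return []
    max_len = max(len(word) for word in A)
    # Collect only the substrings of B that are no longer than the
    # longest keyword, so the set grows with len(B) * max_len
    # instead of len(B) squared.
    substrings = {
        B[start:start + length]
        for start in range(len(B))
        for length in range(1, max_len + 1)
    }
    return [word for word in A if word in substrings]


# Hypothetical sample data
print(find_keywords(["long", "stuff", "cat"], "my long string of stuff"))
```

    Note that this sketch never reports an empty keyword, since only non-empty substrings are generated.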

  • 2021-02-05 14:33

    Assume all keywords have the same length (later you could extend this algorithm to handle different lengths).

    I would suggest the following:

    1. precalculate some hash for each keyword (for example xor hash):

      from functools import reduce  # reduce is not a builtin in Python 3

      hash256 = reduce(int.__xor__, map(ord, keyword))
      
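    2. create a dictionary where each key is a hash and each value is the list of keywords with that hash

    3. save the keyword length, then slide a window of that length over B and look up the hash of each window: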

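      from functools import reduce

      matches = []
      # Slide a window of keyword_length characters over B
      for i in range(len(B) - keyword_length + 1):
          curr_keyword = B[i:i + keyword_length]
          hash256 = reduce(int.__xor__, map(ord, curr_keyword))
          # Different strings can share a hash, so confirm against the list
          if curr_keyword in dictionary_of_hashed.get(hash256, ()):
              matches.append(curr_keyword)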
      

    Something like this
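
    One property of an XOR hash is that it can be updated in constant time as the window moves: XOR out the character that leaves and XOR in the character that enters. The sketch below puts the steps above together under that assumption. The names `xor_hash` and `find_fixed_length` are hypothetical, and it still assumes every keyword has the same length.

```python
from collections import defaultdict
from functools import reduce


def xor_hash(text):
    """XOR of the code points of text (0 for an empty string)."""
    return reduce(int.__xor__, map(ord, text), 0)


def find_fixed_length(keywords, B):
    """Return the set of keywords (all of one length) that occur in B."""
    keywords = list(keywords)
    if not keywords:
        return set()
    k = len(keywords[0])
    if any(len(word) != k for word in keywords):
        raise ValueError("all keywords must have the same length")
    if k == 0 or len(B) < k:
        return set()

    # Steps 1 and 2: hash every keyword and group keywords by hash.
    buckets = defaultdict(set)
    for word in keywords:
        buckets[xor_hash(word)].add(word)

    found = set()
    h = xor_hash(B[:k])
    for i in range(len(B) - k + 1):
        if i > 0:
            # Roll the window: drop B[i - 1], take in B[i + k - 1].
            h ^= ord(B[i - 1]) ^ ord(B[i + k - 1])
        if h in buckets:
            window = B[i:i + k]
            # XOR ignores character order, so collisions are common;
            # always confirm with a real string comparison.
            if window in buckets[h]:
                found.add(window)
    return found


# "abc" and "bca" share a hash, but only "abc" is actually present.
print(find_fixed_length(["abc", "bca", "xyz"], "zzabczz"))
```

    XOR is a weak hash (any permutation of the same characters collides), so the final membership check does the real work; a stronger rolling hash would reduce false candidates.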

  • 2021-02-05 14:39

    Your problem is large enough that you probably need to hit it with the algorithm bat.

    Take a look at the Aho-Corasick algorithm. Your problem statement is a paraphrase of the problem that this algorithm tackles.

    Also, look into the work by Nicholas Lehuen with his PyTST package.

    There are also references in a related Stack Overflow question, "Algorithm for linear pattern matching?", that mention other algorithms such as Rabin-Karp.
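
    To make the idea concrete, here is a small pure-Python sketch of an Aho-Corasick style automaton: a trie of the keywords plus failure links, so B is scanned once regardless of how many keywords there are. It is an illustrative sketch, not the PyTST API; the names `build_automaton` and `find_keywords` are made up here, and a tuned C-backed library would be much faster on 500,000 keywords.

```python
from collections import deque


def build_automaton(keywords):
    """Build a keyword trie with failure links and output lists."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state to fall back to on a mismatch
    out = [[]]    # out[state] -> keywords that end at this state
    for word in keywords:
        if not word:
            continue  # skip empty keywords
        state = 0
        for ch in word:
            nxt = goto[state].get(ch)
            if nxt is None:
                goto.append({})
                fail.append(0)
                out.append([])
                nxt = len(goto) - 1
                goto[state][ch] = nxt
            state = nxt
        out[state].append(word)

    # Breadth-first pass: depth-1 states keep fail = 0, deeper states
    # derive their failure link from their parent's failure link.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit keywords that end at the fallback state.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_keywords(keywords, text):
    """Return the set of keywords that occur anywhere in text."""
    goto, fail, out = build_automaton(keywords)
    found = set()
    state = 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        found.update(out[state])
    return found


print(find_keywords(["he", "she", "his", "hers"], "ushers"))
```

    Since A is static, the automaton only needs to be built once and can then be reused for every new B.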

  • 2021-02-05 14:42

    Pack up all the individual words of B into a new list by splitting the original string on ' '. Then, for each word in that list, test for membership in A. If you find one (or more), delete it/them from A, and quit as soon as A is empty.

    It seems like your approach will blaze through 500,000 candidates without an opt-out set in place.
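
    A rough sketch of that idea follows. It swaps the list for a set so each membership test is a constant-time lookup, and it only finds keywords that appear as whole whitespace-separated words in B, which is an assumption the original question may not satisfy. The function name `find_whole_word_keywords` is made up for illustration.

```python
def find_whole_word_keywords(A, B):
    """Return keywords from A that appear as whole words in B, in order found."""
    remaining = set(A)  # copy, so the caller's A is left untouched
    found = []
    for word in B.split():
        if word in remaining:
            found.append(word)
            remaining.discard(word)
            if not remaining:
                break  # every keyword has been seen; stop early
    return found


# Hypothetical sample data
print(find_whole_word_keywords(["string", "my", "cat"], "my long string of stuff my"))
```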

  • 2021-02-05 14:44

    I don't know if this would be any quicker, but it's a lot more pythonic:

    result = [x for x in A if x in B]
    