The Problem: A large static list of strings is provided as A
, A long string is provided as B
, strings in A
are all very short (a keywords
Depending on how long your long string is, it may be worth it to do something like this:
ls = 'my long string of stuff'
#Generate all possible substrings of ls, keeping only uniques
x = set([ls[p:y] for p in range(0, len(ls)+1) for y in range(p+1, len(ls)+1)])
result = []
for word in A:
if word in x:
result.append(word)
Obviously if your long string is very, very long then this also becomes too slow, but it should be faster for any string under a few hundred characters.
Assume you has all keywords of the same length (later you could extend this algo for different lengths)
I could suggest next:
precalculate some hash for each keyword (for example xor hash):
hash256 = reduce(int.__xor__, map(ord, keyword))
create a dictionary where key is a hash, and value list of corresponding keywords
save keyword length
curr_keyword = []
for x in B:
if len(curr_keyword) == keyword_length:
hash256 = reduce(int.__xor__, map(ord, curr_keyword))
if hash256 in dictionary_of_hashed:
#search in list
curr_keyword.append(x)
curr_keyword = curr_keyword[1:]
Something like this
Your problem is large enough that you probably need to hit it with the algorithm bat.
Take a look into the Aho-Corasick Algorithm. Your problem statement is a paraphrase of the problem that this algorithm tackles.
Also, look into the work by Nicholas Lehuen with his PyTST package.
There are also references in a related Stack Overflow message that mention other algorithms such as Rabin-Karp: Algorithm for linear pattern matching?
Pack up all the individual words of B
into a new list, consisting of the original string split by ' '
. Then, for each element in B
, test for membership against each element of A
. If you find one (or more), delete it/them from A
, and quit as soon as A
is empty.
It seems like your approach will blaze through 500,000 candidates without an opt-out set in place.
I don't know if this would be any quicker, but it's a lot more pythonic:
result = [x for x in A if x in B]