High performance mass short string search in Python

前端未结

关注

 5  1585

The Problem: A large static list of strings is provided as A, A long string is provided as B, strings in A are all very short (a keywords

相关标签:

5条回答

长情又很酷

2021-02-05 14:32
Depending on how long your long string is, it may be worth it to do something like this:
```
ls = 'my long string of stuff'
#Generate all possible substrings of ls, keeping only uniques
x = set([ls[p:y] for p in range(0, len(ls)+1) for y in range(p+1, len(ls)+1)])

result = []
for word in A:
    if word in x:
        result.append(word)
```
Obviously if your long string is very, very long then this also becomes too slow, but it should be faster for any string under a few hundred characters.
0 讨论(0)
发布评论:

提交评论
- 加载中...
春和景丽

2021-02-05 14:33
Assume you has all keywords of the same length (later you could extend this algo for different lengths)

I could suggest next:
1. precalculate some hash for each keyword (for example xor hash):
```
hash256 = reduce(int.__xor__, map(ord, keyword))
```
2. create a dictionary where key is a hash, and value list of corresponding keywords
3. save keyword length
```
curr_keyword = []
for x in B:
  if len(curr_keyword) == keyword_length:
     hash256 = reduce(int.__xor__, map(ord, curr_keyword))
     if hash256 in dictionary_of_hashed:
        #search in list

  curr_keyword.append(x)
  curr_keyword = curr_keyword[1:]
```
Something like this
0 讨论(0)
发布评论:

提交评论
- 加载中...
刺人心

2021-02-05 14:39

Your problem is large enough that you probably need to hit it with the algorithm bat.

Take a look into the Aho-Corasick Algorithm. Your problem statement is a paraphrase of the problem that this algorithm tackles.

Also, look into the work by Nicholas Lehuen with his PyTST package.

There are also references in a related Stack Overflow message that mention other algorithms such as Rabin-Karp: Algorithm for linear pattern matching?

0 讨论(0)
发布评论:

提交评论
- 加载中...
遥遥无期

2021-02-05 14:42

Pack up all the individual words of B into a new list, consisting of the original string split by ' '. Then, for each element in B, test for membership against each element of A. If you find one (or more), delete it/them from A, and quit as soon as A is empty.

It seems like your approach will blaze through 500,000 candidates without an opt-out set in place.

0 讨论(0)
发布评论:

提交评论
- 加载中...
没有蜡笔的小新

2021-02-05 14:44
I don't know if this would be any quicker, but it's a lot more pythonic:
```
result = [x for x in A if x in B]
```
0 讨论(0)
发布评论:

提交评论
- 加载中...