Find occurrences of a huge list of phrases in text

傲寒 2021-02-08 05:02

I'm building a backend and trying to crunch the following problem.

  • The clients submit text to the backend (around 2000 characters on average)
  • Each text has to be checked for occurrences of phrases from a huge list (around 80,000 patterns)

8 Answers
  •  不知归路
    2021-02-08 05:36

    To get reasonable speed while matching 80k patterns, you definitely need some preprocessing on the patterns; single-shot algorithms like Boyer-Moore won't help much.

    You'll probably also need to do the work in compiled code (think C extension) to get reasonable throughput. As for how to preprocess the patterns: one option is a state machine like Aho-Corasick or some generic finite state transducer; the next is something like a suffix-array-based index; and the last one that comes to mind is an inverted index.
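
    For illustration, a minimal Aho-Corasick sketch using the pyahocorasick C extension (the `ahocorasick` module on PyPI); it assumes exact, case-sensitive substring matching and that the pattern list fits in memory:

    ```python
    import ahocorasick

    def build_automaton(phrases):
        """One-time preprocessing: compile all phrases into an Aho-Corasick automaton."""
        automaton = ahocorasick.Automaton()
        for idx, phrase in enumerate(phrases):
            automaton.add_word(phrase, (idx, phrase))
        automaton.make_automaton()  # turn the trie into a DFA with failure links
        return automaton

    def find_phrases(automaton, text):
        """Scan the text once; yields (start, end, phrase) for every occurrence."""
        for end, (idx, phrase) in automaton.iter(text):
            yield end - len(phrase) + 1, end + 1, phrase

    automaton = build_automaton(["brown fox", "lazy dog", "fox"])
    for start, end, phrase in find_phrases(automaton, "the quick brown fox jumps over the lazy dog"):
        print(start, end, phrase)
    ```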

    If your matches are exact and the patterns respect word boundaries, chances are that a well-implemented word- or word-ngram-keyed inverted index will be fast enough even in pure Python. The index is not a complete solution by itself; rather, it gives you a few candidate phrases, which you then need to verify with normal string matching for a complete match.
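
    A minimal sketch of that candidate-then-verify idea in pure Python, assuming whitespace tokenization and case-insensitive matching (a real tokenizer would be more careful):

    ```python
    import re
    from collections import defaultdict

    def build_index(phrases):
        """One-time preprocessing: map each word to the ids of phrases containing it."""
        index = defaultdict(set)
        for idx, phrase in enumerate(phrases):
            for word in phrase.lower().split():
                index[word].add(idx)
        return index

    def find_phrases(index, phrases, text):
        """Gather candidate phrases sharing at least one word with the text,
        then verify each candidate with an exact word-boundary match."""
        candidates = set()
        for word in set(text.lower().split()):
            candidates |= index.get(word, set())
        hits = []
        for idx in candidates:
            if re.search(r"\b" + re.escape(phrases[idx]) + r"\b", text, re.IGNORECASE):
                hits.append(phrases[idx])
        return hits

    phrases = ["big data", "machine learning", "inverted index"]
    index = build_index(phrases)
    print(find_phrases(index, phrases, "We built an inverted index for big data."))
    ```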

    If you need approximate matching, a character-ngram inverted index is the way to go.
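
    The same machinery keyed on character trigrams, as a sketch; the 50% overlap threshold is an arbitrary assumption, and a real fuzzy matcher would verify candidates with an edit-distance check:

    ```python
    from collections import defaultdict

    def trigrams(s):
        """Set of character 3-grams of a lowercased string."""
        s = s.lower()
        return {s[i:i + 3] for i in range(len(s) - 2)}

    def build_trigram_index(phrases):
        """Map each trigram to the ids of phrases containing it."""
        index = defaultdict(set)
        for idx, phrase in enumerate(phrases):
            for gram in trigrams(phrase):
                index[gram].add(idx)
        return index

    def candidates(index, phrases, text, min_overlap=0.5):
        """Phrases sharing at least min_overlap of their trigrams with the text."""
        counts = defaultdict(int)
        for gram in trigrams(text):
            for idx in index.get(gram, ()):
                counts[idx] += 1
        return [phrases[i] for i, c in counts.items()
                if c >= min_overlap * len(trigrams(phrases[i]))]
    ```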

    Regarding real implementations: flashtext, mentioned in another answer here, seems to be a reasonable pure-Python solution if you're OK with its full-phrase-only limitation.
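
    A minimal flashtext sketch (the phrase list and sample text are made up here); each hit comes back with its character span:

    ```python
    from flashtext import KeywordProcessor

    processor = KeywordProcessor()  # case-insensitive by default
    for phrase in ["big apple", "machine learning"]:
        processor.add_keyword(phrase)

    # With span_info=True each hit is (phrase, start, end); flashtext only matches
    # whole phrases on word boundaries -- the full-phrase-only limitation above.
    print(processor.extract_keywords("I love the Big Apple.", span_info=True))
    ```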

    Otherwise, you can get reasonable results with generic multi-pattern-capable regexp libraries: one of the fastest should be Intel's Hyperscan, and there are even some rudimentary Python bindings available.
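
    A rough sketch against the python-hyperscan bindings (`pip install hyperscan`); the signatures here match the versions I've seen but may differ in yours, so treat them as an assumption:

    ```python
    import hyperscan

    patterns = [b"brown fox", b"lazy dog"]  # Hyperscan expects bytes patterns

    db = hyperscan.Database()
    # Compile every pattern into one database; ids let the callback tell hits apart.
    db.compile(expressions=patterns,
               ids=list(range(len(patterns))),
               elements=len(patterns))

    def on_match(pat_id, start, end, flags, context):
        # Without the HS_FLAG_SOM_LEFTMOST flag only the end offset is meaningful.
        print(pat_id, start, end)

    db.scan(b"the quick brown fox jumps over the lazy dog",
            match_event_handler=on_match)
    ```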

    Another option is Google's RE2 with the Python bindings from Facebook. You want to use RE2::Set in this case.
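
    A sketch of the RE2::Set usage; the method names below mirror the C++ API (Add/Compile/Match) and the `re2.Set` wrapper in Facebook's pyre2, but the exact Python spelling is an assumption and varies by binding version:

    ```python
    import re2  # Facebook's pyre2 bindings; Set support assumed here

    # RE2::Set compiles all patterns into a single automaton up front;
    # match() then reports which patterns occur anywhere in the text.
    s = re2.Set.SearchSet()  # unanchored search mode
    ids = [s.add("brown fox"), s.add("lazy dog")]
    s.compile()
    print(s.match("the quick brown fox jumps over the lazy dog"))
    # -> indices of the patterns that matched, e.g. [0, 1]
    ```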
