Find occurrences of a huge list of phrases in text

傲寒 2021-02-08 05:02

I'm building a backend and trying to crunch the following problem.

  • The clients submit text to the backend (around 2000 characters on average)
  • Each text has to be checked for occurrences of phrases from a huge list (around 80,000 patterns)

8 Answers
  •  不知归路
    2021-02-08 05:36

    To get reasonable speed while matching 80k patterns, you definitely need some preprocessing on the patterns; single-shot algorithms like Boyer-Moore won't help much.

    You'll probably also need to do the work in compiled code (think C extension) to get reasonable throughput. As for how to preprocess the patterns: one option is a state machine like Aho-Corasick or some generic finite state transducer; the next is something like a suffix-array-based index; and the last one that comes to mind is an inverted index.
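
    For illustration, a minimal Aho-Corasick sketch using the pyahocorasick C extension (the `ahocorasick` module on PyPI); it assumes exact, case-sensitive substring matching and that the pattern list fits in memory:

    ```python
    import ahocorasick

    def build_automaton(phrases):
        """One-time preprocessing: compile all phrases into an Aho-Corasick automaton."""
        automaton = ahocorasick.Automaton()
        for idx, phrase in enumerate(phrases):
            automaton.add_word(phrase, (idx, phrase))
        automaton.make_automaton()  # turn the trie into a DFA with failure links
        return automaton

    def find_phrases(automaton, text):
        """Scan the text once; yields (start, end, phrase) for every occurrence."""
        for end, (idx, phrase) in automaton.iter(text):
            yield end - len(phrase) + 1, end + 1, phrase

    automaton = build_automaton(["brown fox", "lazy dog", "fox"])
    for start, end, phrase in find_phrases(automaton, "the quick brown fox jumps over the lazy dog"):
        print(start, end, phrase)
    ```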

    If your matches are exact and the patterns respect word boundaries, chances are that a well-implemented word- or word-ngram-keyed inverted index will be fast enough even in pure Python. The index is not a complete solution by itself; rather, it gives you a few candidate phrases, which you then need to verify with normal string matching for a complete match.
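
    A minimal sketch of that candidate-then-verify idea in pure Python, assuming whitespace tokenization and case-insensitive matching (a real tokenizer would be more careful):

    ```python
    import re
    from collections import defaultdict

    def build_index(phrases):
        """One-time preprocessing: map each word to the ids of phrases containing it."""
        index = defaultdict(set)
        for idx, phrase in enumerate(phrases):
            for word in phrase.lower().split():
                index[word].add(idx)
        return index

    def find_phrases(index, phrases, text):
        """Gather candidate phrases sharing at least one word with the text,
        then verify each candidate with an exact word-boundary match."""
        candidates = set()
        for word in set(text.lower().split()):
            candidates |= index.get(word, set())
        hits = []
        for idx in candidates:
            if re.search(r"\b" + re.escape(phrases[idx]) + r"\b", text, re.IGNORECASE):
                hits.append(phrases[idx])
        return hits

    phrases = ["big data", "machine learning", "inverted index"]
    index = build_index(phrases)
    print(find_phrases(index, phrases, "We built an inverted index for big data."))
    ```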

    If you need approximate matching, a character-ngram inverted index is the way to go.
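
    The same machinery keyed on character trigrams, as a sketch; the 50% overlap threshold is an arbitrary assumption, and a real fuzzy matcher would verify candidates with an edit-distance check:

    ```python
    from collections import defaultdict

    def trigrams(s):
        """Set of character 3-grams of a lowercased string."""
        s = s.lower()
        return {s[i:i + 3] for i in range(len(s) - 2)}

    def build_trigram_index(phrases):
        """Map each trigram to the ids of phrases containing it."""
        index = defaultdict(set)
        for idx, phrase in enumerate(phrases):
            for gram in trigrams(phrase):
                index[gram].add(idx)
        return index

    def candidates(index, phrases, text, min_overlap=0.5):
        """Phrases sharing at least min_overlap of their trigrams with the text."""
        counts = defaultdict(int)
        for gram in trigrams(text):
            for idx in index.get(gram, ()):
                counts[idx] += 1
        return [phrases[i] for i, c in counts.items()
                if c >= min_overlap * len(trigrams(phrases[i]))]
    ```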

    Regarding real implementations: flashtext, mentioned in another answer here, seems to be a reasonable pure-Python solution if you're OK with its full-phrase-only limitation.
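
    A minimal flashtext sketch (the phrase list and sample text are made up here); each hit comes back with its character span:

    ```python
    from flashtext import KeywordProcessor

    processor = KeywordProcessor()  # case-insensitive by default
    for phrase in ["big apple", "machine learning"]:
        processor.add_keyword(phrase)

    # With span_info=True each hit is (phrase, start, end); flashtext only matches
    # whole phrases on word boundaries -- the full-phrase-only limitation above.
    print(processor.extract_keywords("I love the Big Apple.", span_info=True))
    ```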

    Otherwise, you can get reasonable results with generic multi-pattern-capable regexp libraries: one of the fastest should be Intel's Hyperscan, and there are even some rudimentary Python bindings available.
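
    A rough sketch against the python-hyperscan bindings (`pip install hyperscan`); the signatures here match the versions I've seen but may differ in yours, so treat them as an assumption:

    ```python
    import hyperscan

    patterns = [b"brown fox", b"lazy dog"]  # Hyperscan expects bytes patterns

    db = hyperscan.Database()
    # Compile every pattern into one database; ids let the callback tell hits apart.
    db.compile(expressions=patterns,
               ids=list(range(len(patterns))),
               elements=len(patterns))

    def on_match(pat_id, start, end, flags, context):
        # Without the HS_FLAG_SOM_LEFTMOST flag only the end offset is meaningful.
        print(pat_id, start, end)

    db.scan(b"the quick brown fox jumps over the lazy dog",
            match_event_handler=on_match)
    ```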

    Another option is Google's RE2 with the Python bindings from Facebook. You want to use RE2::Set in this case.
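
    A sketch of the RE2::Set usage; the method names below mirror the C++ API (Add/Compile/Match) and the `re2.Set` wrapper in Facebook's pyre2, but the exact Python spelling is an assumption and varies by binding version:

    ```python
    import re2  # Facebook's pyre2 bindings; Set support assumed here

    # RE2::Set compiles all patterns into a single automaton up front;
    # match() then reports which patterns occur anywhere in the text.
    s = re2.Set.SearchSet()  # unanchored search mode
    ids = [s.add("brown fox"), s.add("lazy dog")]
    s.compile()
    print(s.match("the quick brown fox jumps over the lazy dog"))
    # -> indices of the patterns that matched, e.g. [0, 1]
    ```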
