> I'm building a backend and trying to crunch the following problem. […] 2000 characters on average)
To get reasonable speed while matching 80k patterns, you definitely need some preprocessing on the patterns; single-shot algorithms like Boyer-Moore won't help much.
You'll probably also need to do the work in compiled code (think C extension) to get reasonable throughput. Regarding how to preprocess the patterns: one option is a state machine like Aho-Corasick or a generic finite state transducer. The next option is something like a suffix-array-based index, and the last one that comes to my mind is an inverted index.
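To illustrate the automaton route, here is a minimal sketch using the pyahocorasick C extension; the pattern list and document are just placeholders standing in for your 80k phrases:

```python
import ahocorasick  # pip install pyahocorasick (C extension)

patterns = ["machine learning", "deep learning", "neural network"]  # placeholder phrases

# Build the automaton once, up front; lookups afterwards are linear in the document length.
automaton = ahocorasick.Automaton()
for idx, phrase in enumerate(patterns):
    automaton.add_word(phrase, (idx, phrase))
automaton.make_automaton()

def find_matches(text):
    """Yield (start, end, phrase) for every pattern occurrence in text."""
    for end, (idx, phrase) in automaton.iter(text):
        start = end - len(phrase) + 1
        yield start, end, phrase

for match in find_matches("research on deep learning and neural network pruning"):
    print(match)
```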
If your matches are exact and the patterns respect word boundaries, chances are that a well-implemented word- or word-ngram-keyed inverted index will be fast enough even in pure Python. The index is not a complete solution on its own: it gives you a few candidate phrases, which you then need to verify with normal string matching to confirm a complete match.
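A minimal pure-Python sketch of that idea, with a word-keyed index; the names and the naive tokenizer are illustrative, and a real version would intersect posting lists rather than union them:

```python
import re
from collections import defaultdict

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def build_index(patterns):
    """Map each word to the set of pattern ids containing it."""
    index = defaultdict(set)
    for pid, phrase in enumerate(patterns):
        for word in tokenize(phrase):
            index[word].add(pid)
    return index

def find_matches(document, patterns, index):
    # Collect candidate patterns sharing at least one word with the document,
    # then confirm each candidate with a plain substring check.
    candidates = set()
    for word in tokenize(document):
        candidates |= index.get(word, set())
    doc_lower = document.lower()
    return [patterns[pid] for pid in candidates if patterns[pid].lower() in doc_lower]

patterns = ["deep learning", "graph database", "message queue"]
index = build_index(patterns)
print(find_matches("Benchmarking a graph database under load", patterns, index))
```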
If you need approximate matching, a character-ngram inverted index is the way to go.
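The character-ngram variant follows the same shape; here is a rough sketch, where the trigram size and the overlap threshold are arbitrary choices and candidates would still need a proper verification pass (e.g. edit distance):

```python
from collections import defaultdict

def char_ngrams(s, n=3):
    s = s.lower()
    return {s[i:i + n] for i in range(max(len(s) - n + 1, 1))}

def build_ngram_index(patterns, n=3):
    index = defaultdict(set)
    for pid, phrase in enumerate(patterns):
        for gram in char_ngrams(phrase, n):
            index[gram].add(pid)
    return index

def candidate_patterns(document, patterns, index, n=3, min_overlap=0.6):
    # Count how many of each pattern's trigrams occur in the document and
    # keep the patterns whose overlap ratio clears the (arbitrary) threshold.
    doc_grams = char_ngrams(document, n)
    counts = defaultdict(int)
    for gram in doc_grams:
        for pid in index.get(gram, ()):
            counts[pid] += 1
    return [patterns[pid] for pid, c in counts.items()
            if c / len(char_ngrams(patterns[pid], n)) >= min_overlap]
```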
Regarding real implementations: flashtext, mentioned in another answer here, looks like a reasonable pure-Python solution, provided you're OK with its full-phrase-only limitation.
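For reference, flashtext usage looks roughly like this (the keywords and sample sentence are placeholders):

```python
from flashtext import KeywordProcessor  # pip install flashtext

keyword_processor = KeywordProcessor(case_sensitive=False)
keyword_processor.add_keywords_from_list(["big data", "data science", "machine learning"])

# span_info=True returns (keyword, start, end) tuples instead of bare keywords.
matches = keyword_processor.extract_keywords(
    "A big data pipeline feeding a machine learning model", span_info=True)
print(matches)
```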
If that limitation is a problem, you can get good results with generic multi-pattern-capable regexp libraries. One of the fastest should be Intel's hyperscan; there are even some rudimentary Python bindings available.
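A sketch of what the python-hyperscan bindings look like; the exact API may vary between versions of the binding, and the patterns and flags here are illustrative:

```python
import hyperscan  # pip install hyperscan (wraps Intel's Hyperscan)

expressions = [b"deep learning", b"neural network", b"graph database"]
ids = list(range(len(expressions)))
flags = [hyperscan.HS_FLAG_CASELESS] * len(expressions)

db = hyperscan.Database()
db.compile(expressions=expressions, ids=ids, elements=len(expressions), flags=flags)

def on_match(pattern_id, start, end, match_flags, context):
    # Hyperscan reports the end offset; the start offset is only meaningful
    # if the pattern was compiled with HS_FLAG_SOM_LEFTMOST.
    print("matched pattern", pattern_id, "ending at", end)

db.scan(b"training a neural network on a graph database", match_event_handler=on_match)
```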
Another option is Google's RE2 with the Python bindings from Facebook; you want to use RE2::Set in this case.