I'm building a backend and trying to crunch the following problem.
The pyparsing module, a Python tool for extracting information from text, will help you write the phrase matching. Given a phrase described as a BNF (Backus-Naur Form) grammar, it returns every match along with the index range of each match. In my experience, it is easy to use, expressive in terms of the kinds of patterns you can define, and impressively fast.
from pyparsing import Word, alphas

greet = Word(alphas) + "," + Word(alphas) + "!"  # <-- grammar defined here
hello = "Hello, World!"
print(hello, "->", greet.parseString(hello))
Use scanString to get the index range of each match:
for item in greet.scanString(hello):
    print(item)

>>> ((['Hello', ',', 'World', '!'], {}), 0, 13)
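Each item yielded by scanString is a (tokens, start, end) tuple, so the indices can also be unpacked directly (a small sketch reusing the greet and hello example above):

for tokens, start, end in greet.scanString(hello):
    print(tokens.asList(), "spans characters", start, "to", end)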
If you assemble a list of phrases defined with pyparsing as a dictionary of the form
phrase_list = {phrase_defined_with_pyparsing: phrase_name}
then your grammar can be a giant OR statement with labeled phrases.
import pyparsing as pp

# one big OR over all phrases, each labeled with its name
your_grammar = pp.Or([phrase.setResultsName(phrase_name) for phrase, phrase_name in phrase_list.items()])

# generator of (tokens, start, end) tuples
all_matches = your_grammar.scanString(big_document)
Each match is a (tokens, start, end) tuple: the tokens carry the label assigned via setResultsName, and start/end give the index range of the match in the document.
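Here is a minimal end-to-end sketch of that approach. The phrase definitions, labels, and document below are hypothetical stand-ins; substitute your own grammars:

import pyparsing as pp

# hypothetical phrases -- replace with your own grammars and names
greeting_phrase = pp.Word(pp.alphas) + "," + pp.Word(pp.alphas) + "!"
date_phrase = pp.Word(pp.nums, exact=4) + "-" + pp.Word(pp.nums, exact=2) + "-" + pp.Word(pp.nums, exact=2)

phrase_list = {
    greeting_phrase: "greeting",
    date_phrase: "date",
}

# one big OR over all labeled phrases
your_grammar = pp.Or([phrase.setResultsName(name) for phrase, name in phrase_list.items()])

big_document = "Hello, World! The report is due 2024-01-31."

for tokens, start, end in your_grammar.scanString(big_document):
    # asDict() shows which named phrase matched; start and end give the index range
    print(start, end, tokens.asDict())

Each printed line tells you where a phrase matched and, through its results name, which phrase it was.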