Find occurrences of huge list of phrases in text

傲寒 2021-02-08 05:02

I'm building a backend and trying to solve the following problem.

  • The clients submit text to the backend (around 2000 characters on average)
8 Answers
  •  陌清茗 (original poster)
    2021-02-08 05:57

    The pyparsing module, a Python tool for extracting information from text, can help you with phrase matching. You describe a phrase as a grammar, in a style similar to BNF (Backus-Naur Form), and pyparsing returns every match of that phrase together with the index range of each match. In my experience, it is easy to use, expressive in terms of the kinds of patterns you can define, and impressively fast.

    from pyparsing import Word, alphas

    # Grammar: a word, a comma, another word, then an exclamation mark
    greet = Word(alphas) + "," + Word(alphas) + "!"
    hello = "Hello, World!"
    print(hello, "->", greet.parseString(hello))
    

    Use scanString to get the start and end index of each match:

    for item in greet.scanString(hello):
        print(item)
    
    >>> ((['Hello', ',', 'World', '!'], {}), 0, 13)
    

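Each item that scanString yields is a (tokens, start, end) tuple, so it can be unpacked directly in the loop. A minimal sketch, reusing the greeting grammar; the sample text with a second greeting is made up for illustration:

```python
from pyparsing import Word, alphas

greet = Word(alphas) + "," + Word(alphas) + "!"
text = "Hello, World! Then later: Goodbye, Moon!"  # made-up sample text

spans = []
for tokens, start, end in greet.scanString(text):
    # start and end are character offsets into text; end is exclusive,
    # so text[start:end] is exactly the matched substring
    spans.append((start, end, text[start:end]))
    print(start, end, text[start:end])
```

Text that does not fit the grammar ("Then later:") is simply skipped, which is what makes scanString suitable for searching a larger document.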
    If you define each phrase as a pyparsing expression and collect the phrases in a dictionary that maps each expression to a name, of the form

    phrase_list = {phrase_defined_with_pyparsing: phrase_name}
    

    then your grammar can be a giant OR statement with labeled phrases.

    import pyparsing as pp
    your_grammar = pp.Or([phrase.setResultsName(phrase_name) for phrase, phrase_name in phrase_list.items()])
    all_matches = your_grammar.scanString(big_document)
    

    Each match is a (tokens, start, end) tuple: the tokens carry the label of the phrase that matched (set via setResultsName), and start and end give its index range in the document.
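The pieces above could be wired together roughly as follows. This is only a sketch under assumptions: the two phrases, the sample document, and the helper names build_grammar and find_phrases are hypothetical, and the table here maps label to phrase text (the reverse of the dictionary above) so that the pyparsing expressions can be built in one place. Matching each phrase word by word with CaselessKeyword is also an assumption about what counts as an occurrence (whole words, any case, any amount of whitespace).

```python
import pyparsing as pp

# Hypothetical phrase table: label -> phrase text (stand-ins for a real list)
PHRASES = {
    "machine_learning": "machine learning",
    "neural_network": "neural network",
}

def build_grammar(phrases):
    """Combine all phrases into one OR expression with labeled alternatives."""
    alternatives = []
    for name, text in phrases.items():
        # Match the phrase word by word, as whole words, ignoring case
        words = pp.And([pp.CaselessKeyword(word) for word in text.split()])
        alternatives.append(words.setResultsName(name))
    return pp.Or(alternatives)

def find_phrases(grammar, document):
    """Return a (label, start, end) triple for every phrase occurrence."""
    found = []
    for tokens, start, end in grammar.scanString(document):
        # The label of the alternative that matched is a key on the tokens
        for name in tokens.keys():
            found.append((name, start, end))
    return found

document = ("Machine learning often uses a neural network. "
            "A neural network is a machine learning model.")
matches = find_phrases(build_grammar(PHRASES), document)
for name, start, end in matches:
    print(name, start, end, repr(document[start:end]))
```

Note that pp.Or tries every alternative at each position and keeps the longest match, so with a very large phrase list it is worth measuring the running time on realistic input before committing to this design.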
