Find occurrences of huge list of phrases in text

傲寒 2021-02-08 05:02

I'm building a backend and trying to crunch the following problem.

  • The clients submit text to the backend (around 2000 characters on average)
  • The backend holds a huge list of phrases (around 80,000), each associated with a link
  • Every occurrence of one of these phrases in the submitted text needs to be found
8 Answers
  • 逝去的感伤 2021-02-08 05:42

    You have much more pattern data than text data. Invert the problem: match the patterns against the text.

    For the purposes of this, I would assume that the text can be reasonably tokenized into words (or something word-like). I'd also assume that the phrases, even if they can't be tokenized per se (for example, because they are regexes), nevertheless usually contain words and (most of the time) have to match at least one of the words they contain.

    Here is a sketch of a solution which contains three parts:

    1. Tokenize and index the patterns (once) - this produces a map from each token to the patterns that contain it

    2. Tokenize text and filter patterns to find candidates that could match the text

    3. Test the candidate patterns and perform substitutions

    Here is the code:

    import re
    import random
    # from nltk.corpus import words
    import time
    
    """ Prepare text and phrases, same as in Martin Evans's answer """
    
    # english = words.words()
    with open('/usr/share/dict/american-english') as fh:
        english = [ x.strip() for x in fh.readlines() ]
    
    
    def random_phrase(l=2, h=6):
        return ' '.join(random.sample(english, random.randint(l, h)))
    
    
    texts = ['this is a phrase to match', 'another phrase this is']
    # Make texts ~2000 characters
    texts = ['{} {}'.format(t, random_phrase(200, 200)) for t in texts]
    
    phrases = [{'phrase': 'phrase to match', 'link': 'link_url'}, {'phrase': 'this is', 'link': 'link_url2'}]
    #Simulate 80k phrases
    for x in range(80000):
        phrases.append({'phrase': random_phrase(), 'link': 'link{}'.format(x)})
    
    """ Index the patterns """
    
    construct_time = time.time()    
    
    # Map each phrase back to its link, and compile each phrase into a
    # pattern that tolerates arbitrary whitespace between its words.
    reverse = {d['phrase']: d['link'] for d in phrases}
    re_phrases = [re.compile(d['phrase'].replace(' ', r'\s+')) for d in phrases]
    re_whitespace = re.compile(r'\s+')
    
    def tokenize(s):
        # Plain whitespace split; swap in a smarter tokenizer if needed.
        return s.split()
    
    # Map each token to the indices of the phrases that contain it.
    index = {}
    
    for n, d in enumerate(phrases):
        for token in tokenize(d['phrase']):
            index.setdefault(token, []).append(n)
    
    print('Time to construct:', time.time() - construct_time)
    print()
    
    for text in texts:
        start_time = time.time()
        print('{} characters - "{}..."'.format(len(text), text[:60]))
    
        """ Filter patterns to find candidates that *could* match the text """
        tokens = tokenize(text)
        phrase_ns = []
    
        # Collect every phrase that shares at least one token with the text.
        for token in tokens:
            if token in index:
                phrase_ns.extend(index[token])
    
        # De-duplicate the candidate indices.
        phrase_ns = list(set(phrase_ns))
    
        """ Test the candidate patterns and perform substitutions """
        for n in phrase_ns:
            match = re_phrases[n].search(text)
            if match:
                # Normalize whitespace in the matched text so it can be looked
                # up in `reverse`, which is keyed by the original phrase.
                print(match.span(), reverse[re_whitespace.sub(' ', match.group())])
        print('Time taken:', time.time() - start_time)        
        print()
    

    In my environment, this version builds the index in 16.2 seconds and matches the two sample texts in 0.0042 and 0.0037 seconds (vs. 4.7 seconds for the simple regex version, a ~1000x speedup). The exact performance depends on the statistical properties of the text and the phrases, of course, but this will almost always be a huge win.
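    The loop above only reports the matches; the substitution promised in step 3 still has to be applied. Below is a minimal sketch of that last step, reusing re_phrases and phrases from the listing above; the Markdown-style [phrase](link) replacement format is an assumption, not something specified in the question.

    def substitute(text, candidate_ns, re_phrases, phrases):
        """Replace every occurrence of each candidate phrase with a link.

        candidate_ns -- the de-duplicated indices from the filtering step
        re_phrases   -- the compiled patterns from the listing above
        phrases      -- the list of {'phrase': ..., 'link': ...} dicts
        """
        for n in candidate_ns:
            link = phrases[n]['link']
            # '[phrase](link)' is a placeholder output format.
            text = re_phrases[n].sub(
                lambda m, link=link: '[{}]({})'.format(m.group(), link), text)
        return text

    Overlapping phrases are replaced in whatever order the candidates happen to be processed, so nested or overlapping matches may need extra handling.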

    Bonus: if a phrase must match several words (tokens), you can add it only to the index entry for the least common of those tokens, for another huge speedup.
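    For instance, here is one way that refinement could look, reusing phrases and tokenize from above and assuming that a simple Counter over the phrase tokens themselves is a good enough estimate of how common each token is:

    from collections import Counter

    # Estimate token frequency from the phrase list itself.
    token_freq = Counter(t for d in phrases for t in tokenize(d['phrase']))

    # Index each phrase only under its least common token; very common
    # words then no longer pull in thousands of candidate phrases.
    index = {}
    for n, d in enumerate(phrases):
        rarest = min(tokenize(d['phrase']), key=lambda t: token_freq[t])
        index.setdefault(rarest, []).append(n)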
