Find occurrences of a huge list of phrases in text

傲寒 2021-02-08 05:02

I'm building a backend and trying to solve the following problem.

  • The clients submit text to the backend (around 2000 characters on average)
8 answers
  •  难免孤独
    2021-02-08 05:52

    I faced an almost identical problem with my own chat page system. I wanted to add a link to each of a number of keywords (with slight variations) that were present in the text. I only had around 200 phrases to check, though.

    I decided to try using a standard regular expression for the problem to see how fast it would be. The main bottleneck was in constructing the regular expression. I decided to pre-compile this and found the match time was very fast for shorter texts.

    The following approach takes a list of phrases, where each entry is a dictionary with phrase and link keys. It first constructs a reverse lookup dictionary:

    {'phrase to match' : 'link_url', 'another phrase' : 'link_url2'}
    

    Next, it compiles a regular expression of the following form, which allows matches with any amount of whitespace between words:

    (phrase\s+to\s+match|another\s+phrase)
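
    A minimal sketch of building such a pattern more defensively, assuming the phrases may contain regex metacharacters such as + or . and that some phrases may be prefixes of others. The helper name build_phrase_pattern and the sample phrases are hypothetical, not part of the code in this answer. Each word is passed through re.escape(), and the longest phrases are placed first because Python tries alternatives in order:

```python
import re

def build_phrase_pattern(phrases):
    """Compile one alternation matching any phrase, with flexible whitespace between words."""
    # Longest first, so a phrase is tried before any shorter phrase it starts with
    ordered = sorted(set(phrases), key=len, reverse=True)
    alternatives = [
        r'\s+'.join(re.escape(word) for word in phrase.split())
        for phrase in ordered
        if phrase.split()  # skip empty phrases, which would match everywhere
    ]
    return re.compile('({})'.format('|'.join(alternatives)))

pattern = build_phrase_pattern(['this is', 'this is a phrase', 'C++ code'])
print([m.group(1) for m in pattern.finditer('this  is a   phrase about C++   code')])
# ['this  is a   phrase', 'C++   code']
```

    Without the escaping, 'C++ code' would be compiled with + acting as a quantifier rather than as a literal character.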
    

    Then, for each piece of text (e.g. 2000 characters each), it uses finditer() to get each match. The match object provides .span(), giving the start and end positions of the matching text, and group(1), giving the matched text itself. As the matched text can contain extra whitespace, re_whitespace is first applied to collapse each run to a single space, bringing it back to the form stored in the reverse dictionary. With this, the required link can be looked up automatically:

    import re
    
    texts = ['this is a phrase   to    match', 'another phrase this  is']
    phrases = [{'phrase': 'phrase to match', 'link': 'link_url'}, {'phrase': 'this is', 'link': 'link_url2'}]
    
    # Reverse lookup: normalised phrase -> link
    reverse = {d['phrase']: d['link'] for d in phrases}
    # Collapses any run of whitespace in a match back to a single space
    re_whitespace = re.compile(r'\s+')
    # One alternation of all phrases, allowing any whitespace between words
    re_phrases = re.compile('({})'.format('|'.join(d['phrase'].replace(' ', r'\s+') for d in phrases)))
    
    for text in texts:
        # Pair each match's (start, end) span with the link for its phrase
        matches = [(match.span(), reverse[re_whitespace.sub(' ', match.group(1))]) for match in re_phrases.finditer(text)]
        print(matches)
    

    This displays the matches for the two texts as:

    [((0, 7), 'link_url2'), ((10, 30), 'link_url')]
    [((15, 23), 'link_url2')]
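
    Since the goal was to add links to the keywords, the same lookup can also drive a substitution directly. The following is only a sketch: the add_links helper and the HTML anchor markup are assumptions, as the answer does not show how the links were rendered. It passes a callback to re.sub() so that each match is wrapped in its own link while the original spacing is preserved:

```python
import html
import re

phrases = [{'phrase': 'phrase to match', 'link': 'link_url'},
           {'phrase': 'this is', 'link': 'link_url2'}]

reverse = {d['phrase']: d['link'] for d in phrases}
re_whitespace = re.compile(r'\s+')
re_phrases = re.compile('({})'.format('|'.join(
    d['phrase'].replace(' ', r'\s+') for d in phrases)))

def add_links(text):
    """Wrap every matched phrase in an HTML anchor (the markup is an assumption)."""
    def replace(match):
        original = match.group(1)
        # Normalise whitespace so the match equals its key in the reverse dictionary
        key = re_whitespace.sub(' ', original)
        return '<a href="{}">{}</a>'.format(
            html.escape(reverse[key], quote=True), html.escape(original))
    return re_phrases.sub(replace, text)

print(add_links('this is a phrase   to    match'))
# <a href="link_url2">this is</a> a <a href="link_url">phrase   to    match</a>
```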
    

    To test how this scales, I imported a list of English words from nltk and automatically generated 80,000 phrases of two to six words each, along with unique links. I then timed it on two suitably long texts:

    import random
    import re
    import time
    
    from nltk.corpus import words  # needs the corpus installed: nltk.download('words')
    
    english = words.words()
    
    def random_phrase(l=2, h=6):
        return ' '.join(random.sample(english, random.randint(l, h)))
    
    
    texts = ['this is a phrase   to    match', 'another phrase this  is']
    # Make texts ~2000 characters
    texts = ['{} {}'.format(t, random_phrase(200, 200)) for t in texts]
    
    phrases = [{'phrase': 'phrase to match', 'link': 'link_url'}, {'phrase': 'this is', 'link': 'link_url2'}]
    # Simulate 80k phrases
    for x in range(80000):
        phrases.append({'phrase': random_phrase(), 'link': 'link{}'.format(x)})
    
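    # Time the one-off setup: reverse lookup plus compiled pattern
    construct_time = time.time()
    
    reverse = {d['phrase']:d['link'] for d in phrases}
    re_whitespace = re.compile(r'\s+')
    # Alternatives are tried in order; here they are sorted shortest phrase first
    re_phrases = re.compile('({})'.format('|'.join(d['phrase'].replace(' ', r'\s+') for d in sorted(phrases, key=lambda x: len(x['phrase'])))))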
    
    print('Time to construct:', time.time() - construct_time)
    print()
    
    for text in texts:
        start_time = time.time()
        print('{} characters - "{}..."'.format(len(text), text[:60]))
        matches = [(match.span(), reverse[re_whitespace.sub(' ', match.group(1))]) for match in re_phrases.finditer(text)]
        print(matches)
        print('Time taken:', time.time() - start_time)        
        print()
    

    Constructing the regular expression and reverse lookup takes ~17 seconds, which is only needed once. Matching then takes about 6 seconds per ~2000-character text. For very short texts it takes ~0.06 seconds per text.

    Time to construct: 16.812477111816406
    
    2092 characters - "this is a phrase   to    match totaquine externize intoxatio..."
    [((0, 7), 'link_url2'), ((10, 30), 'link_url')]
    Time taken: 6.000027656555176
    
    2189 characters - "another phrase this  is political procoracoidal playstead as..."
    [((15, 23), 'link_url2')]
    Time taken: 6.190425715255737
    

    This should at least give you a baseline to compare against.
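
    For comparison, here is a sketch of a different technique from the regular expression used above: a plain dictionary lookup over word n-grams. The function find_phrases and the links dictionary are hypothetical names, and no timings are claimed for it. It assumes phrases only need to match whole whitespace-separated words, so unlike the regex it will not match inside a longer word or next to attached punctuation. The number of lookups per text depends on the number of words and the longest phrase, not on how many phrases there are:

```python
import re

def find_phrases(text, links, max_words):
    """Return ((start, end), link) for each phrase found, preferring the longest match."""
    # Each token keeps its position so spans refer to the original text
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r'\S+', text)]
    results = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate starting at word i first
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            candidate = ' '.join(tok for tok, _, _ in tokens[i:i + n])
            if candidate in links:
                results.append(((tokens[i][1], tokens[i + n - 1][2]), links[candidate]))
                i += n - 1  # skip the words consumed by this match
                break
        i += 1
    return results

links = {'phrase to match': 'link_url', 'this is': 'link_url2'}
max_words = max(len(p.split()) for p in links)
print(find_phrases('this is a phrase   to    match', links, max_words))
# [((0, 7), 'link_url2'), ((10, 30), 'link_url')]
print(find_phrases('another phrase this  is', links, max_words))
# [((15, 23), 'link_url2')]
```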
