I'm building a backend and trying to crunch the following problem: find which of ~80,000 known phrases occur in incoming texts (about 2000 characters on average).

The "Patricia tree" is a good solution for this kind of problem. It's sort of a radix tree with the radix being the character choices involved. So to find whether "the dog" is in the tree, you start at the root, take the "t" branch, then the "h" branch, and so on. Except Patricia trees do this really fast.
So you run your text through it, and you get all tree locations (= phrases) that it hits. This will even get you overlapping matches if you want.
The main article about them is Donald R. Morrison, "PATRICIA - Practical Algorithm to Retrieve Information Coded in Alphanumeric", Journal of the ACM, 15(4):514-534, October 1968. There's some discussion at https://xlinux.nist.gov/dads/HTML/patriciatree.html and there are several implementations on GitHub, though I don't know which are good.
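For illustration, here's a minimal sketch of the idea in pure Python. It uses a plain character trie rather than a true Patricia tree (which would also compress single-child chains), but the scanning logic is the same:

def build_trie(phrases):
    root = {}
    for phrase in phrases:
        node = root
        for ch in phrase:
            node = node.setdefault(ch, {})
        node['$'] = phrase  # sentinel marking a complete phrase (assumes phrases contain no '$')
    return root

def find_all(trie, text):
    """Yield (start, phrase) for every phrase occurrence, overlaps included."""
    for start in range(len(text)):
        node = trie
        for ch in text[start:]:
            if ch not in node:
                break
            node = node[ch]
            if '$' in node:
                yield start, node['$']

trie = build_trie(['the dog', 'the'])
print(list(find_all(trie, 'walk the dog')))  # [(5, 'the'), (5, 'the dog')]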
To get reasonable speed while matching 80k patterns, you definitely need some preprocessing on the patterns; single-shot algorithms like Boyer-Moore won't help much.
You'll probably also need to do the work in compiled code (think C extension) to get reasonable throughput. Regarding how to preprocess the patterns: one option is a state machine like Aho-Corasick or some generic finite state transducer. The next option is something like a suffix-array-based index, and the last one that comes to my mind is an inverted index.
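To make the suffix-array option concrete, here's a compact sketch (my own illustration, not a library): since each text is short, you can sort its suffixes once and binary-search every pattern against them. A real suffix array would store offsets rather than the suffix strings themselves.

import bisect

text = 'this is a phrase to match'
patterns = ['phrase to match', 'this is', 'no such phrase']

# For a ~2000-character text, materializing all suffixes is affordable
suffixes = sorted(text[i:] for i in range(len(text)))

def occurs(pattern):
    # The first suffix >= pattern starts with it iff the pattern occurs
    j = bisect.bisect_left(suffixes, pattern)
    return j < len(suffixes) and suffixes[j].startswith(pattern)

for p in patterns:
    print(p, occurs(p))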
If your matches are exact and the patterns respect word boundaries, chances are that a well-implemented word- or word-ngram-keyed inverted index will be fast enough even in pure Python. The index is not a complete solution by itself; rather, it gives you a few candidate phrases which you then need to check with normal string matching for a complete match.
If you need approximate matching, a character-ngram inverted index is your choice.
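A tiny sketch of the character-ngram variant (again my own illustration): a phrase becomes a candidate when it shares at least one trigram with the text, and candidates must still be verified with a proper fuzzy matcher afterwards.

from collections import defaultdict

def trigrams(s):
    return {s[i:i + 3] for i in range(len(s) - 2)}

phrases = ['phrase to match', 'this is']
index = defaultdict(set)
for n, p in enumerate(phrases):
    for g in trigrams(p):
        index[g].add(n)

text = 'a phrqse to match'  # note the typo; exact matching would miss it
candidates = set().union(*(index[g] for g in trigrams(text) if g in index))
print([phrases[n] for n in candidates])  # ['phrase to match']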
Regarding real implementations: flashtext, mentioned in another answer here, seems to be a reasonable pure-Python solution if you're OK with the full-phrase-only limitation.
Otherwise you can get reasonable results with generic multi-pattern-capable regexp libraries: one of the fastest should be Intel's hyperscan, and there are even some rudimentary Python bindings available. Another option is Google's RE2 with Python bindings from Facebook. You want to use RE2::Set in this case.
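A rough sketch of the hyperscan route, assuming the python-hyperscan bindings (check their docs, as the exact API may differ between versions):

import hyperscan

phrases = [b'phrase to match', b'this is']

db = hyperscan.Database()
db.compile(
    expressions=phrases,
    ids=list(range(len(phrases))),
    elements=len(phrases),
    # SOM_LEFTMOST makes hyperscan report start offsets, not just ends
    flags=[hyperscan.HS_FLAG_SOM_LEFTMOST] * len(phrases),
)

def on_match(id, start, end, flags, context):
    print(phrases[id], start, end)

db.scan(b'this is a phrase to match', match_event_handler=on_match)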
You have much more pattern data than text data. Invert the problem: match the patterns against the text.
For the purposes of this answer, I'll assume that the text can be reasonably tokenized into words (or something word-like). I'll also assume that the phrases, even if they can't be tokenized per se (for example, because they are regexes), nevertheless usually contain words, and (most of the time) have to match at least one of the words they contain.
Here is a sketch of a solution which contains three parts:
Tokenize and index the patterns (once) - this produces a map from each token to the patterns that contain it
Tokenize text and filter patterns to find candidates that could match the text
Test the candidate patterns and perform substitutions
Here is the code:
import re
import random
# from nltk.corpus import words
import time

""" Prepare text and phrases, same as in Martin Evans's answer """
# english = words.words()
with open('/usr/share/dict/american-english') as fh:
    english = [x.strip() for x in fh.readlines()]

def random_phrase(l=2, h=6):
    return ' '.join(random.sample(english, random.randint(l, h)))

texts = ['this is a phrase to match', 'another phrase this is']
# Make texts ~2000 characters
texts = ['{} {}'.format(t, random_phrase(200, 200)) for t in texts]

phrases = [{'phrase': 'phrase to match', 'link': 'link_url'},
           {'phrase': 'this is', 'link': 'link_url2'}]
# Simulate 80k phrases
for x in range(80000):
    phrases.append({'phrase': random_phrase(), 'link': 'link{}'.format(x)})

""" Index the patterns """
construct_time = time.time()

# Map each phrase string back to its link
reverse = {d['phrase']: d['link'] for d in phrases}
# Compile each phrase, allowing arbitrary whitespace between its words
re_phrases = [re.compile(d['phrase'].replace(' ', r'\s+')) for d in phrases]
re_whitespace = re.compile(r'\s+')

def tokenize(s):
    return s.split()

# Map each token to the indices of all phrases containing it
index = {}
for n in range(len(phrases)):
    tokens = tokenize(phrases[n]['phrase'])
    for token in tokens:
        if token not in index:
            index[token] = []
        index[token].append(n)

print('Time to construct:', time.time() - construct_time)
print()

for text in texts:
    start_time = time.time()
    print('{} characters - "{}..."'.format(len(text), text[:60]))

    """ Filter patterns to find candidates that *could* match the text """
    tokens = tokenize(text)
    phrase_ns = []
    for token in tokens:
        if token not in index:
            continue
        for n in index[token]:
            phrase_ns.append(n)
    phrase_ns = list(set(phrase_ns))

    """ Test the candidate patterns and perform substitutions """
    for n in phrase_ns:
        match = re.search(re_phrases[n], text)
        if match:
            # The test data uses single spaces, so the matched text equals the phrase
            print(match.span(), reverse[match.group()])

    print('Time taken:', time.time() - start_time)
    print()
In my environment, this version creates an index in 16.2 seconds, and does the matching in 0.0042 and 0.0037 seconds (vs 4.7 seconds for the simple regex version, a ~1000x speedup). The exact performance depends on the statistical properties of the text and phrases, of course, but this will almost always be a huge win.
Bonus: if a phrase must match several words (tokens), you can add it to the index entry for only the least common token it must match, for another huge speedup; see the sketch below.
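Here is a sketch of that variation (my illustration), reusing phrases and tokenize from the code above; the candidate-filtering and verification steps stay the same:

from collections import Counter

# How often each token appears across all phrases
token_freq = Counter(t for d in phrases for t in tokenize(d['phrase']))

# Index each phrase under only its rarest token, so very common words
# like "the" don't flood the candidate list
index = {}
for n, d in enumerate(phrases):
    rarest = min(tokenize(d['phrase']), key=lambda t: token_freq[t])
    index.setdefault(rarest, []).append(n)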
Assuming that the list of phrases changes over time and gets bigger, I'd recommend using software that already does what you need, e.g. Elasticsearch: it's open source and has a Python client. Running a service like that in the background solves everything you want and probably more than you could ever imagine. It's also really not that hard to set up.
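For this inverted problem ("which stored queries match this document?"), Elasticsearch's percolate query is the relevant feature. A rough sketch with the official Python client follows; the index and field names are my own illustrative choices, and API details vary by client version:

from elasticsearch import Elasticsearch

es = Elasticsearch()

# One field holds the stored queries, the other is the text field they run against
es.indices.create(index='phrases', body={
    'mappings': {'properties': {
        'query': {'type': 'percolator'},
        'body': {'type': 'text'},
    }}
})

# Store each phrase as a match_phrase query, with its link alongside
es.index(index='phrases', body={
    'query': {'match_phrase': {'body': 'phrase to match'}},
    'link': 'link_url',
})
es.indices.refresh(index='phrases')

# Ask which stored phrase queries match this text
result = es.search(index='phrases', body={
    'query': {'percolate': {'field': 'query',
                            'document': {'body': 'this is a phrase to match'}}}
})
for hit in result['hits']['hits']:
    print(hit['_source']['link'])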
Maybe you should try flashtext.
According to the author, it is much faster than regex, and the author even published a paper about the library.
I've personally tried it in one of my projects; in my opinion, its API is quite friendly and usable.
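A small example of what usage looks like (the phrases and links here are just placeholders):

from flashtext import KeywordProcessor

kp = KeywordProcessor()
# Map each phrase to the link that should be reported for it
kp.add_keyword('phrase to match', 'link_url')
kp.add_keyword('this is', 'link_url2')

# span_info=True also returns the start/end offsets of each hit
print(kp.extract_keywords('this is a phrase to match', span_info=True))
# [('link_url2', 0, 7), ('link_url', 10, 25)]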
Hope it helps.
You should try a string-search / pattern-matching algorithm. The most famous algorithm for your task is Aho-Corasick; there is a Python library for it (off the top of a Google search).
Most of the pattern-matching / string-search algorithms will require you to convert your "bag of words/phrases" into a trie.
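For instance, with the pyahocorasick library (a C extension; the phrases here are just placeholders), you build the trie once, convert it to an automaton, and scan each text in a single pass:

import ahocorasick

A = ahocorasick.Automaton()
for idx, phrase in enumerate(['phrase to match', 'this is']):
    # Store whatever payload you want at each phrase's trie node
    A.add_word(phrase, (idx, phrase))
A.make_automaton()  # converts the trie into an Aho-Corasick automaton

# One pass over the text reports every occurrence, overlaps included
for end_index, (idx, phrase) in A.iter('this is a phrase to match'):
    start_index = end_index - len(phrase) + 1
    print(start_index, end_index, phrase)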