I'm building a backend and trying to crunch the following problem: find which of ~80,000 known phrases occur in incoming texts (about 2000 characters on average).

The "Patricia tree" is a good solution for this kind of problem. It's sort of a radix tree with the radix being the character choices involved. So to find whether "the dog" is in the tree, you start at the root, take the "t" branch, then the "h" branch, and so on. Except Patricia trees do this really fast.
So you run your text through it, and you get all tree locations (= phrases) that it hits. This will even get you overlapping matches if you want.
The main article about them is Donald R. Morrison, "PATRICIA - Practical Algorithm to Retrieve Information Coded in Alphanumeric", Journal of the ACM, 15(4):514-534, October 1968. There's some discussion at https://xlinux.nist.gov/dads/HTML/patriciatree.html and there are several implementations on GitHub, though I don't know which are good.
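For illustration, here's a minimal sketch of the idea in pure Python. It uses a plain character trie rather than a true Patricia tree (which would also compress single-child chains), but the scanning logic is the same:

def build_trie(phrases):
    root = {}
    for phrase in phrases:
        node = root
        for ch in phrase:
            node = node.setdefault(ch, {})
        node['$'] = phrase  # sentinel marking a complete phrase (assumes phrases contain no '$')
    return root

def find_all(trie, text):
    """Yield (start, phrase) for every phrase occurrence, overlaps included."""
    for start in range(len(text)):
        node = trie
        for ch in text[start:]:
            if ch not in node:
                break
            node = node[ch]
            if '$' in node:
                yield start, node['$']

trie = build_trie(['the dog', 'the'])
print(list(find_all(trie, 'walk the dog')))  # [(5, 'the'), (5, 'the dog')]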
To get reasonable speed while matching 80k patterns, you definitely need some preprocessing on the patterns; single-shot algorithms like Boyer-Moore won't help much.
You'll probably also need to do the work in compiled code (think C extension) to get reasonable throughput. Regarding how to preprocess the patterns: one option is a state machine like Aho-Corasick or some generic finite state transducer. The next option is something like a suffix-array-based index, and the last one that comes to my mind is an inverted index.
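To make the suffix-array option concrete, here's a compact sketch (my own illustration, not a library): since each text is short, you can sort its suffixes once and binary-search every pattern against them. A real suffix array would store offsets rather than the suffix strings themselves.

import bisect

text = 'this is a phrase to match'
patterns = ['phrase to match', 'this is', 'no such phrase']

# For a ~2000-character text, materializing all suffixes is affordable
suffixes = sorted(text[i:] for i in range(len(text)))

def occurs(pattern):
    # The first suffix >= pattern starts with it iff the pattern occurs
    j = bisect.bisect_left(suffixes, pattern)
    return j < len(suffixes) and suffixes[j].startswith(pattern)

for p in patterns:
    print(p, occurs(p))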
If your matches are exact and the patterns respect word boundaries, chances are that a well-implemented word- or word-ngram-keyed inverted index will be fast enough even in pure Python. The index is not a complete solution by itself; rather, it gives you a few candidate phrases which you then need to check with normal string matching for a complete match.
If you need approximate matching, a character-ngram inverted index is your choice.
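A tiny sketch of the character-ngram variant (again my own illustration): a phrase becomes a candidate when it shares at least one trigram with the text, and candidates must still be verified with a proper fuzzy matcher afterwards.

from collections import defaultdict

def trigrams(s):
    return {s[i:i + 3] for i in range(len(s) - 2)}

phrases = ['phrase to match', 'this is']
index = defaultdict(set)
for n, p in enumerate(phrases):
    for g in trigrams(p):
        index[g].add(n)

text = 'a phrqse to match'  # note the typo; exact matching would miss it
candidates = set().union(*(index[g] for g in trigrams(text) if g in index))
print([phrases[n] for n in candidates])  # ['phrase to match']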
Regarding real implementations: flashtext, mentioned in another answer here, seems to be a reasonable pure-Python solution if you're OK with the full-phrase-only limitation.
Otherwise you can get reasonable results with generic multi-pattern-capable regexp libraries: one of the fastest should be Intel's hyperscan, and there are even some rudimentary Python bindings available. Another option is Google's RE2 with Python bindings from Facebook. You want to use RE2::Set in this case.
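A rough sketch of the hyperscan route, assuming the python-hyperscan bindings (check their docs, as the exact API may differ between versions):

import hyperscan

phrases = [b'phrase to match', b'this is']

db = hyperscan.Database()
db.compile(
    expressions=phrases,
    ids=list(range(len(phrases))),
    elements=len(phrases),
    # SOM_LEFTMOST makes hyperscan report start offsets, not just ends
    flags=[hyperscan.HS_FLAG_SOM_LEFTMOST] * len(phrases),
)

def on_match(id, start, end, flags, context):
    print(phrases[id], start, end)

db.scan(b'this is a phrase to match', match_event_handler=on_match)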
You have much more pattern data than text data. Invert the problem: match the patterns against the text.
For the purposes of this answer, I'll assume that the text can be reasonably tokenized into words (or something word-like). I'll also assume that the phrases, even if they can't be tokenized per se (for example, because they are regexes), nevertheless usually contain words, and (most of the time) have to match at least one of the words they contain.
Here is a sketch of a solution which contains three parts:
Tokenize and index the patterns (once) - this produces a map from each token to the patterns that contain it
Tokenize text and filter patterns to find candidates that could match the text
Test the candidate patterns and perform substitutions
Here is the code:
import re
import random
# from nltk.corpus import words
import time

""" Prepare text and phrases, same as in Martin Evans's answer """
# english = words.words()
with open('/usr/share/dict/american-english') as fh:
    english = [x.strip() for x in fh.readlines()]

def random_phrase(l=2, h=6):
    return ' '.join(random.sample(english, random.randint(l, h)))

texts = ['this is a phrase to match', 'another phrase this is']
# Make texts ~2000 characters
texts = ['{} {}'.format(t, random_phrase(200, 200)) for t in texts]

phrases = [{'phrase': 'phrase to match', 'link': 'link_url'},
           {'phrase': 'this is', 'link': 'link_url2'}]
# Simulate 80k phrases
for x in range(80000):
    phrases.append({'phrase': random_phrase(), 'link': 'link{}'.format(x)})

""" Index the patterns """
construct_time = time.time()

# Map each phrase string back to its link
reverse = {d['phrase']: d['link'] for d in phrases}
# Compile each phrase, allowing arbitrary whitespace between its words
re_phrases = [re.compile(d['phrase'].replace(' ', r'\s+')) for d in phrases]
re_whitespace = re.compile(r'\s+')

def tokenize(s):
    return s.split()

# Map each token to the indices of all phrases containing it
index = {}
for n in range(len(phrases)):
    tokens = tokenize(phrases[n]['phrase'])
    for token in tokens:
        if token not in index:
            index[token] = []
        index[token].append(n)

print('Time to construct:', time.time() - construct_time)
print()

for text in texts:
    start_time = time.time()
    print('{} characters - "{}..."'.format(len(text), text[:60]))

    """ Filter patterns to find candidates that *could* match the text """
    tokens = tokenize(text)
    phrase_ns = []
    for token in tokens:
        if token not in index:
            continue
        for n in index[token]:
            phrase_ns.append(n)
    phrase_ns = list(set(phrase_ns))

    """ Test the candidate patterns and perform substitutions """
    for n in phrase_ns:
        match = re.search(re_phrases[n], text)
        if match:
            # The test data uses single spaces, so the matched text equals the phrase
            print(match.span(), reverse[match.group()])

    print('Time taken:', time.time() - start_time)
    print()
In my environment, this version creates an index in 16.2 seconds, and does the matching in 0.0042 and 0.0037 seconds (vs 4.7 seconds for the simple regex version, a ~1000x speedup). The exact performance depends on the statistical properties of the text and phrases, of course, but this will almost always be a huge win.
Bonus: if a phrase must match several words (tokens), you can add it to the index entry for only the least common token it must match, for another huge speedup; see the sketch below.
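Here is a sketch of that variation (my illustration), reusing phrases and tokenize from the code above; the candidate-filtering and verification steps stay the same:

from collections import Counter

# How often each token appears across all phrases
token_freq = Counter(t for d in phrases for t in tokenize(d['phrase']))

# Index each phrase under only its rarest token, so very common words
# like "the" don't flood the candidate list
index = {}
for n, d in enumerate(phrases):
    rarest = min(tokenize(d['phrase']), key=lambda t: token_freq[t])
    index.setdefault(rarest, []).append(n)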
Assuming that the list of phrases changes over time and gets bigger, I'd recommend using software that already does what you need, e.g. Elasticsearch: it's open source and has a Python client. Running a service like that in the background solves everything you want and probably more than you could ever imagine. It's also really not that hard to set up.
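For this inverted problem ("which stored queries match this document?"), Elasticsearch's percolate query is the relevant feature. A rough sketch with the official Python client follows; the index and field names are my own illustrative choices, and API details vary by client version:

from elasticsearch import Elasticsearch

es = Elasticsearch()

# One field holds the stored queries, the other is the text field they run against
es.indices.create(index='phrases', body={
    'mappings': {'properties': {
        'query': {'type': 'percolator'},
        'body': {'type': 'text'},
    }}
})

# Store each phrase as a match_phrase query, with its link alongside
es.index(index='phrases', body={
    'query': {'match_phrase': {'body': 'phrase to match'}},
    'link': 'link_url',
})
es.indices.refresh(index='phrases')

# Ask which stored phrase queries match this text
result = es.search(index='phrases', body={
    'query': {'percolate': {'field': 'query',
                            'document': {'body': 'this is a phrase to match'}}}
})
for hit in result['hits']['hits']:
    print(hit['_source']['link'])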
Maybe you should try flashtext.
According to the author, it is much faster than regex, and the author even published a paper about the library.
I've personally tried it in one of my projects; in my opinion, its API is quite friendly and usable.
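A small example of what usage looks like (the phrases and links here are just placeholders):

from flashtext import KeywordProcessor

kp = KeywordProcessor()
# Map each phrase to the link that should be reported for it
kp.add_keyword('phrase to match', 'link_url')
kp.add_keyword('this is', 'link_url2')

# span_info=True also returns the start/end offsets of each hit
print(kp.extract_keywords('this is a phrase to match', span_info=True))
# [('link_url2', 0, 7), ('link_url', 10, 25)]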
Hope it helps.
You should try a string-search / pattern-matching algorithm. The most famous algorithm for your task is Aho-Corasick; there is a Python library for it (off the top of a Google search).
Most of the pattern-matching / string-search algorithms will require you to convert your "bag of words/phrases" into a trie.
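For instance, with the pyahocorasick library (a C extension; the phrases here are just placeholders), you build the trie once, convert it to an automaton, and scan each text in a single pass:

import ahocorasick

A = ahocorasick.Automaton()
for idx, phrase in enumerate(['phrase to match', 'this is']):
    # Store whatever payload you want at each phrase's trie node
    A.add_word(phrase, (idx, phrase))
A.make_automaton()  # converts the trie into an Aho-Corasick automaton

# One pass over the text reports every occurrence, overlaps included
for end_index, (idx, phrase) in A.iter('this is a phrase to match'):
    start_index = end_index - len(phrase) + 1
    print(start_index, end_index, phrase)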