> I'm building a backend and trying to crunch the following problem. [...] (texts are 2000 characters on average)
You have much more pattern data than text data, so invert the problem: instead of running every pattern over the text, index the patterns once and let the text's tokens select the few patterns that are worth trying.
For the purposes of this answer, I assume that the text can be reasonably tokenized into words (or something word-like). I also assume that the phrases, even if they can't be tokenized per se (for example, because they are regexes), nevertheless usually contain words, and (most of the time) have to match at least one of the words they contain.
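For plain phrases the tokenizer below is just whitespace splitting; if some phrases were regexes, one rough way to pull out the literal words they contain might be a hypothetical helper like this. It is a heuristic sketch only, and is not used by the code below; the name literal_tokens is mine.

import re

def literal_tokens(pattern):
    # Heuristic sketch: treat runs of 3+ letters in the regex source as the
    # literal "words" to index the pattern under. Alternations and optional
    # groups can break the "must match at least one word" assumption, so
    # this is only a rough filter, not a correctness guarantee.
    # e.g. literal_tokens(r'phrase\s+to\s+match') -> ['phrase', 'match']
    return re.findall(r'[A-Za-z]{3,}', pattern)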
Here is a sketch of a solution, in three parts:

1. Tokenize and index the patterns (once). This produces a map from each token to the patterns that contain it.
2. Tokenize the text and use the index to filter the patterns down to candidates that could match the text.
3. Test the candidate patterns against the text and perform the substitutions.
Here is the code:
import re
import random
# from nltk.corpus import words
import time

""" Prepare text and phrases, same as in Martin Evans's answer """
# english = words.words()
with open('/usr/share/dict/american-english') as fh:
    english = [x.strip() for x in fh.readlines()]

def random_phrase(l=2, h=6):
    return ' '.join(random.sample(english, random.randint(l, h)))

texts = ['this is a phrase to match', 'another phrase this is']
# Make texts ~2000 characters
texts = ['{} {}'.format(t, random_phrase(200, 200)) for t in texts]

phrases = [{'phrase': 'phrase to match', 'link': 'link_url'},
           {'phrase': 'this is', 'link': 'link_url2'}]
# Simulate 80k phrases
for x in range(80000):
    phrases.append({'phrase': random_phrase(), 'link': 'link{}'.format(x)})

""" Index the patterns """
construct_time = time.time()

# Look up the link for a phrase once it has matched
reverse = {d['phrase']: d['link'] for d in phrases}
# One compiled regex per phrase, tolerant of any whitespace between words
re_phrases = [re.compile(d['phrase'].replace(' ', r'\s+')) for d in phrases]
re_whitespace = re.compile(r'\s+')

def tokenize(s):
    return s.split()

# Map each token to the indices of all phrases that contain it
index = {}
for n in range(len(phrases)):
    tokens = tokenize(phrases[n]['phrase'])
    for token in tokens:
        if token not in index:
            index[token] = []
        index[token].append(n)

print('Time to construct:', time.time() - construct_time)
print()

for text in texts:
    start_time = time.time()
    print('{} characters - "{}..."'.format(len(text), text[:60]))

    """ Filter patterns to find candidates that *could* match the text """
    tokens = tokenize(text)
    phrase_ns = []
    for token in tokens:
        if token not in index:
            continue
        for n in index[token]:
            phrase_ns.append(n)
    phrase_ns = list(set(phrase_ns))  # deduplicate the candidates

    """ Test the candidate patterns and perform substitutions """
    for n in phrase_ns:
        match = re.search(re_phrases[n], text)
        if match:
            # Normalize whitespace so the matched text maps back to its phrase
            print(match.span(), reverse[re_whitespace.sub(' ', match.group())])

    print('Time taken:', time.time() - start_time)
    print()
In my environment, this version creates an index in 16.2 seconds, and does the matching in 0.0042 and 0.0037 seconds (vs 4.7 seconds for the simple regex version, a ~1000x speedup). The exact performance depends on the statistical properties of the text and phrases, of course, but this will almost always be a huge win.
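The loop above only prints where each phrase matched; if you need the actual substitution (e.g. wrapping the match in a link), a minimal sketch could look like the following. The Markdown-style [text](url) output and the name link_candidates are my assumptions, not anything from the question.

def link_candidates(text, candidate_ns):
    # Hypothetical sketch: rewrite the text, wrapping each candidate phrase
    # that actually matches in a Markdown-style link. The output format is
    # an assumption; the application may want HTML or something else.
    for n in candidate_ns:
        text = re_phrases[n].sub(
            lambda m: '[{}]({})'.format(m.group(), phrases[n]['link']),
            text)
    return text

# e.g. inside the per-text loop above:
# text = link_candidates(text, phrase_ns)

Overlapping phrases would need extra handling; this sketch simply applies the candidates in order.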
Bonus: if a phrase has to match several words (tokens), you can add it to the index entry for only the least common of those tokens, for another huge speedup; see the sketch below.
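Here is a rough sketch of that bonus, reusing the phrases list and tokenize function from the code above. Estimating token rarity from the phrase set itself is my assumption; a frequency table built from real texts would work just as well.

from collections import Counter

# How often each token appears across all phrases (proxy for rarity)
token_counts = Counter(
    token for d in phrases for token in tokenize(d['phrase'])
)

index = {}
for n, d in enumerate(phrases):
    tokens = tokenize(d['phrase'])
    if not tokens:
        continue
    # Index the phrase only under its rarest token, so the candidate lists
    # for very common words stay short.
    rarest = min(tokens, key=lambda t: token_counts[t])
    index.setdefault(rarest, []).append(n)

The candidate-filtering step stays exactly the same; a common text token like "the" now pulls in only the phrases for which "the" happened to be the rarest word, instead of every phrase that contains it.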