Question
for s_index, s in enumerate(sentences):
    s_tokens = s.split()
    if (local_q_set.intersection(set(s_tokens)) == local_q_set):
        q_results.append(s_index)
The code snippet above is the core of the algorithm I use to find, in a massive amount of text data, the sentences that contain all of the tokens in a query. For example, for the query "happy apple", it finds every sentence that contains at least one occurrence of each of the given tokens (i.e. "happy" and "apple"). My method is very simple: build the intersection of the two sets and check whether it equals the query set. However, the performance is not good enough. If anyone has seen an optimization for this kind of problem, I would highly appreciate any direction or link to the idea. Thank you for your time in advance.
Answer 1:
There are a few things you can do to increase performance of the sequential search but the real boost would come from indexing tokens.
set.difference
Using not local_q_set.difference(s_tokens)
instead of comparing the intersection to the original set may be somewhat faster.
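For instance, the original loop could be rewritten roughly like this (a minimal sketch, assuming sentences, local_q_set and q_results are defined as in the question):

# Minimal sketch, assuming sentences, local_q_set and q_results exist as in the question.
for s_index, s in enumerate(sentences):
    # difference() returns an empty (falsy) set when every query token is present
    if not local_q_set.difference(s.split()):
        q_results.append(s_index)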
Regular Expression filter
If your sentences are long, using a regular expression may provide some speed improvements by isolating the potential tokens out of the sentence before checking them against the token set:
import re

tokens = re.compile("|".join(local_q_set))
tokenCount = len(local_q_set)
for s_index, s in enumerate(sentences):
    s_tokens = tokens.findall(s)
    if len(s_tokens) < tokenCount or local_q_set.difference(s.split()):
        continue
    q_results.append(s_index)
Filter using the in operator
You can also use a simple in operator to check for the presence of tokens instead of a regular expression (this should be faster when you have few tokens in the query):
result = []
tokenSet = set(queryTokens)
for index, sentence in enumerate(sentences):
    if any(token not in sentence for token in queryTokens) \
       or tokenSet.difference(sentence.split()):
        continue
    result.append(index)
Caching sentence word sets
To improve on the sequential search when multiple queries are made to the same list of sentences, you can build a cache of word sets corresponding to the sentences. This will eliminate the work of parsing the sentences while going through them to find a match.
cachedWords = []

queryTokens = ["happy", "apple"]
queryTokenSet = set(queryTokens)

if not cachedWords:
    cachedWords = [set(sentence.split()) for sentence in sentences]

result = [index for index, words in enumerate(cachedWords) if not queryTokenSet.difference(words)]
Token Indexing
If you are going to perform many queries against the same list of sentences, it will be more efficient to create a mapping between tokens and sentence indexes. You can do that using a dictionary and then obtain query results directly by intersecting the sentence indexes of the queried tokens:
tokenIndexes = dict()
for index, sentence in enumerate(sentences):
    for token in sentence.lower().split():
        tokenIndexes.setdefault(token, []).append(index)

def tokenSet(token): return set(tokenIndexes.get(token, []))

queryTokens = ["happy", "apple"]

from functools import reduce
result = reduce(set.intersection, (tokenSet(token) for token in queryTokens))
This will allow you to economically implement complex queries using set operators. For example:
import re

queryString = " happy & ( apple | orange | banana ) "
result = eval(re.sub(r"(\w+)", r"tokenSet('\1')", queryString))

# re.sub(...) transforms the query string into
# " tokenSet('happy') & ( tokenSet('apple') | tokenSet('orange') | tokenSet('banana') ) "
Performance Tests:
I made a few performance tests (finding two tokens in one sentence out of 80,000):
original algorithm:   105 ms          1x
set.difference:        88 ms        1.2x
regular expression:    60 ms        1.8x
"in" operator:         43 ms        2.4x
caching word sets:     23 ms        4.6x   (excluding the 187 ms needed to build the cache)
token indexing:         0.0075 ms  14000x  (excluding the 238 ms needed to build tokenIndexes)
So, if you're going to be performing several queries on the same sentences, with token indexing, you'll get 14 thousand times faster responses once the tokenIndexes dictionary is built.
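As a rough illustration of that multi-query scenario (the names build_token_index and query are hypothetical helpers, not part of the answer above), the index can be built once and then reused for every query:

from functools import reduce

def build_token_index(sentences):
    # Build the token -> set-of-sentence-indexes mapping once, up front.
    token_indexes = {}
    for index, sentence in enumerate(sentences):
        for token in sentence.lower().split():
            token_indexes.setdefault(token, set()).add(index)
    return token_indexes

def query(token_indexes, query_tokens):
    # Intersect the index sets of all queried tokens; unknown tokens yield an empty result.
    sets = [token_indexes.get(token.lower(), set()) for token in query_tokens]
    return reduce(set.intersection, sets) if sets else set()

# Example usage: the cost of building the index is paid once, each query is then very cheap.
sentences = ["I ate a happy apple", "the sad orange", "happy banana and apple pie"]
index = build_token_index(sentences)
print(sorted(query(index, ["happy", "apple"])))   # -> [0, 2]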
Source: https://stackoverflow.com/questions/55252069/the-word-toptimizer