The Word Toptimizer

Submitted by 偶尔善良 on 2021-02-08 06:57:20

Question


for s_index, s in enumerate(sentences):
    s_tokens = s.split()
    if local_q_set.intersection(set(s_tokens)) == local_q_set:
        q_results.append(s_index)

The code snippet above is the core algorithm I use to find, in a large amount of text data, the sentences that contain all of the tokens in a query. For example, for the query "happy apple", it finds every sentence that contains both "happy" and "apple" (each at least once). My method is very simple: build the intersection of the two sets and check whether it matches the query set. However, the performance is not good enough. If anyone has seen an optimization for this kind of problem, I would greatly appreciate any pointer or link to the idea. Thank you in advance for your time.
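
For reference, here is a minimal self-contained version of what I am doing (the sentences and query below are just illustrative):

sentences = [
    "the happy dog ate an apple",
    "the sad dog ate a pear",
    "a happy apple a day",
]
query = "happy apple"

local_q_set = set(query.split())
q_results = []
for s_index, s in enumerate(sentences):
    s_tokens = s.split()
    if local_q_set.intersection(set(s_tokens)) == local_q_set:
        q_results.append(s_index)

print(q_results)  # [0, 2] -- sentences 0 and 2 contain both "happy" and "apple"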


Answer 1:


There are a few things you can do to increase the performance of the sequential search, but the real boost comes from indexing tokens.

set.difference

Using not local_q_set.difference(s_tokens) instead of comparing the intersection to the original set may be somewhat faster.
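
For example, the loop from the question could then be written as:

for s_index, s in enumerate(sentences):
    # an empty difference means every query token is present in the sentence
    if not local_q_set.difference(s.split()):
        q_results.append(s_index)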

Regular Expression filter

If your sentences are long, using a regular expression may provide some speed improvements by isolating the potential tokens out of the sentence before checking them against the token set:

import re
tokens     = re.compile("|".join(local_q_set))
tokenCount = len(local_q_set)
for s_index, s in enumerate(sentences):
    # findall() cheaply pre-filters: it only returns substrings that look like query tokens
    s_tokens = tokens.findall(s)
    if len(s_tokens) < tokenCount or local_q_set.difference(s.split()):
        continue
    q_results.append(s_index)

Filter using the in operator

You can also use a simple in operator to check for the presence of tokens instead of a regular expression (this should be faster when you have few tokens in the query):

result = []
tokenSet = set(queryTokens)
for index, sentence in enumerate(sentences):
    # "in" performs a fast substring pre-check; the set difference confirms whole-word matches
    if any(token not in sentence for token in queryTokens) \
       or tokenSet.difference(sentence.split()):
        continue
    result.append(index)

Caching sentence word sets

To improve on the sequential search when multiple queries are made to the same list of sentences, you can build a cache of word sets corresponding to the sentences. This will eliminate the work of parsing the sentences while going through them to find a match.

cachedWords = []

queryTokens = ["happy","apple"]

queryTokenSet = set(queryTokens)
if not cachedWords:
    cachedWords = [ set(sentence.split()) for sentence in sentences ]
result = [ index for index,words in enumerate(cachedWords) if not queryTokenSet.difference(words) ]

Token Indexing

If you are going to perform many queries against the same list of sentences, it will be more efficient to create a mapping between tokens and sentence indexes. You can do that using a dictionary and then obtain query results directly by intersecting the sentence indexes of the queried tokens:

from functools import reduce

tokenIndexes = dict()
for index, sentence in enumerate(sentences):
    for token in sentence.lower().split():
        tokenIndexes.setdefault(token, []).append(index)

def tokenSet(token): return set(tokenIndexes.get(token, []))

queryTokens = ["happy", "apple"]

result = reduce(set.intersection, (tokenSet(token) for token in queryTokens))

This will allow you to economically implement complex queries using set operators. For example:

import re

queryString = " happy & ( apple | orange | banana ) "
result = eval(re.sub(r"(\w+)", r"tokenSet('\1')", queryString))

# re.sub(...) transforms the query string into
# " tokenSet('happy') & ( tokenSet('apple') | tokenSet('orange') | tokenSet('banana') ) "

Performance Tests:

I made a few performance tests (finding two tokens in one sentence out of 80,000):

original algorithm: 105 ms           1x
set.difference:      88 ms         1.2x
regular expression:  60 ms         1.8x
"in" operator:       43 ms         2.4x
caching word sets:   23 ms         4.6x (excluding 187ms to build cache)
token indexing:       0.0075 ms  14000x (excluding 238ms to build tokenIndexes)
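
If you want to reproduce this kind of measurement, a rough harness along the following lines can be used (the corpus here is made up purely for illustration, and timings will of course vary with your data and machine):

from timeit import timeit

# Illustrative corpus: 80,000 short sentences, one of which contains both query tokens.
sentences = ["the quick brown fox %d" % i for i in range(80000)]
sentences[40000] = "the happy dog ate an apple"
local_q_set = {"happy", "apple"}

def original():
    q_results = []
    for s_index, s in enumerate(sentences):
        if local_q_set.intersection(set(s.split())) == local_q_set:
            q_results.append(s_index)
    return q_results

def with_difference():
    return [i for i, s in enumerate(sentences)
            if not local_q_set.difference(s.split())]

print(timeit(original, number=10) / 10)         # average seconds per pass
print(timeit(with_difference, number=10) / 10)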

So, if you are going to perform several queries on the same sentences, token indexing will give you responses roughly 14 thousand times faster once the tokenIndexes dictionary has been built.



Source: https://stackoverflow.com/questions/55252069/the-word-toptimizer
