Question
I know that similar questions have been asked several times, but my problem is a bit different and I am looking for a time-efficient solution in Python.
I have a set of words, some of which end with an asterisk ("*") and some of which don't:
words = set(["apple", "cat*", "dog"])
I have to count their total occurrences in a text, considering that anything can follow an asterisk ("cat*" means all the words that start with "cat"). The search has to be case-insensitive. Consider this example:
text = "My cat loves apples, but I never ate an apple. My dog loves them less than my CATS".
I would like to get a final score of 4 (= cat* x 2 + dog + apple). Please note that "cat*" has been counted twice, also covering the plural, whereas "apple" has been counted just once, as its plural is not considered (having no asterisk at the end).
I have to repeat this operation on a large set of documents, so I need a fast solution. I don't know whether regex or flashtext would be fast enough here. Could you help me?
EDIT
I forgot to mention that some of my words contain punctuation, for example:
words = set(["apple", "cat*", "dog", ":)", "I've"])
This seems to create additional problems when compiling the regex. Is there an adjustment to the code you already provided that would make it work for these two additional words?
Answer 1:
You can do this with regex, building one pattern out of the set of words: put word boundaries around each word, but leave the trailing boundary off words that end with *. Compiling the regex should help performance:
import re
words = set(["apple", "cat*", "dog"])
text = "My cat loves apples, but I never ate an apple. My dog loves them less than my CATS"
# words ending in '*' keep only a leading \b so anything may follow; all other words get \b on both sides
regex = re.compile('|'.join([r'\b' + w[:-1] if w.endswith('*') else r'\b' + w + r'\b' for w in words]), re.I)
matches = regex.findall(text)
print(len(matches))
Output:
4
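Regarding the EDIT about words containing punctuation (":)", "I've"): their literal parts should be escaped with re.escape, and \b only makes sense next to a word character (there is no word boundary between a space and ":"). A minimal sketch along those lines; the to_pattern helper and the extended example text are only illustrative:
import re

words = set(["apple", "cat*", "dog", ":)", "I've"])
text = "My cat loves apples :) but I've never ate an apple. My dog loves them less than my CATS"

def to_pattern(w):
    prefix_only = w.endswith('*')
    core = re.escape(w[:-1] if prefix_only else w)  # escape regex metacharacters such as ')'
    # add \b only where the word actually starts/ends with a word character,
    # otherwise the boundary could never match (e.g. before the ':' in ':)')
    left = r'\b' if re.match(r'\w', w) else ''
    right = '' if prefix_only else (r'\b' if re.search(r'\w$', w) else '')
    return left + core + right

regex = re.compile('|'.join(to_pattern(w) for w in words), re.I)
print(len(regex.findall(text)))  # 6: cat, :), I've, apple, dog, CATS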
Answer 2:
DISCLAIMER: I'm the author of trrex
For this problem, if you really want a scalable solution, use a trie regex instead of a union regex. See this answer for an explanation. One approach is to use trrex, for example:
import trrex as tx
import re
words = {"apple", "cat*", "dog"}
text = "My cat loves apples, but I never ate an apple. My dog loves them less than my CATS"
prefix_set = {w.replace('*', '') for w in words if w.endswith('*')}
full_set = {w for w in words if not w.endswith('*')}
prefix_pattern = re.compile(tx.make(prefix_set, right=''), re.IGNORECASE) # '' as we only care about prefixes
full_pattern = re.compile(tx.make(full_set), re.IGNORECASE)
res = prefix_pattern.findall(text) + full_pattern.findall(text)
print(res)
Output
['cat', 'CAT', 'apple', 'dog']
For a practical use of trrex, see this answer; the experiments described there yield about a 10x improvement over the naive union regex. A trie regex takes advantage of common prefixes and creates an optimized regular expression. For the words:
['baby', 'bat', 'bad']
it creates the following:
ba(?:by|[td])
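As a quick illustration (hand-written here, with the \b boundaries that tx.make adds by default, hence the right='' override above), this trie-style pattern matches all three words in a single pass:
import re

# trie-style alternation for 'baby', 'bat', 'bad', wrapped in word boundaries
pattern = re.compile(r'\bba(?:by|[td])\b')
print(pattern.findall("a bad baby bat and a ban"))  # ['bad', 'baby', 'bat'] -- 'ban' is not matched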
Answer 3:
Create a trie for the words you want to search.
Then iterate over the characters of the string you want to check.
- Each time you reach a leaf in the trie, increase the counter and skip to the next word.
- Each time there is no path, skip to the next word.
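A rough sketch of this idea in Python, assuming lower-casing handles the case-insensitivity, a simple \w+ tokenization is enough (punctuation words would need extra care), and a word ending in '*' marks a prefix node:
import re

def build_trie(words):
    # character trie; the '$' key marks where a word ends ('prefix' for words ending in '*')
    root = {}
    for w in words:
        w = w.lower()
        is_prefix = w.endswith('*')
        node = root
        for ch in (w[:-1] if is_prefix else w):
            node = node.setdefault(ch, {})
        node['$'] = 'prefix' if is_prefix else 'exact'
    return root

def count_matches(text, root):
    count = 0
    for token in re.findall(r'\w+', text.lower()):
        node = root
        for i, ch in enumerate(token):
            if ch not in node:          # no path in the trie: skip to the next token
                break
            node = node[ch]
            if node.get('$') == 'prefix':     # anything may follow a prefix word
                count += 1
                break
            if node.get('$') == 'exact' and i == len(token) - 1:  # exact word must cover the whole token
                count += 1
    return count

words = set(["apple", "cat*", "dog"])
text = "My cat loves apples, but I never ate an apple. My dog loves them less than my CATS"
print(count_matches(text, build_trie(words)))  # 4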
Source: https://stackoverflow.com/questions/65140090/python-fast-count-words-in-text-from-list-of-strings-and-that-start-with