Question
I know that similar questions have been asked several times, but my problem is a bit different and I am looking for a time-efficient solution in Python.
I have a set of words, some of which end with an asterisk ("*") and some of which don't:
words = set(["apple", "cat*", "dog"])
I have to count their total occurrences in a text, considering that anything can follow an asterisk ("cat*" means all the words that start with "cat"). The search has to be case-insensitive. Consider this example:
text = "My cat loves apples, but I never ate an apple. My dog loves them less than my CATS".
I would like to get a final score of 4 (= cat* x 2 + dog + apple). Please note that "cat*" has been counted twice, also covering the plural, whereas "apple" has been counted just once, as its plural is not considered (having no asterisk at the end).
I have to repeat this operation on a large set of documents, so I need a fast solution. I don't know whether regex or flashtext would be fast enough here. Could you help me?
EDIT
I forgot to mention that some of my words contain punctuation, for example:
words = set(["apple", "cat*", "dog", ":)", "I've"])
This seems to create additional problems when compiling the regex. Is there an adjustment to the code you already provided that would make it work for these two additional words?
Answer 1:
You can do this with regex, building one pattern out of the set of words: put word boundaries around each word, but leave the trailing boundary off words that end with *. Compiling the regex should help performance:
import re
words = set(["apple", "cat*", "dog"])
text = "My cat loves apples, but I never ate an apple. My dog loves them less than my CATS"
# words ending in '*' keep only a leading \b so anything may follow; all other words get \b on both sides
regex = re.compile('|'.join([r'\b' + w[:-1] if w.endswith('*') else r'\b' + w + r'\b' for w in words]), re.I)
matches = regex.findall(text)
print(len(matches))
Output:
4
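Regarding the EDIT about words containing punctuation (":)", "I've"): their literal parts should be escaped with re.escape, and \b only makes sense next to a word character (there is no word boundary between a space and ":"). A minimal sketch along those lines; the to_pattern helper and the extended example text are only illustrative:
import re

words = set(["apple", "cat*", "dog", ":)", "I've"])
text = "My cat loves apples :) but I've never ate an apple. My dog loves them less than my CATS"

def to_pattern(w):
    prefix_only = w.endswith('*')
    core = re.escape(w[:-1] if prefix_only else w)  # escape regex metacharacters such as ')'
    # add \b only where the word actually starts/ends with a word character,
    # otherwise the boundary could never match (e.g. before the ':' in ':)')
    left = r'\b' if re.match(r'\w', w) else ''
    right = '' if prefix_only else (r'\b' if re.search(r'\w$', w) else '')
    return left + core + right

regex = re.compile('|'.join(to_pattern(w) for w in words), re.I)
print(len(regex.findall(text)))  # 6: cat, :), I've, apple, dog, CATS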
Answer 2:
DISCLAIMER: I'm the author of trrex
For this problem, if you really want a scalable solution, use a trie regex instead of a union regex. See this answer for an explanation. One approach is to use trrex, for example:
import trrex as tx
import re
words = {"apple", "cat*", "dog"}
text = "My cat loves apples, but I never ate an apple. My dog loves them less than my CATS"
prefix_set = {w.replace('*', '') for w in words if w.endswith('*')}
full_set = {w for w in words if not w.endswith('*')}
prefix_pattern = re.compile(tx.make(prefix_set, right=''), re.IGNORECASE) # '' as we only care about prefixes
full_pattern = re.compile(tx.make(full_set), re.IGNORECASE)
res = prefix_pattern.findall(text) + full_pattern.findall(text)
print(res)
Output
['cat', 'CAT', 'apple', 'dog']
For a practical use of trrex, see this answer; the experiments described there yield about a 10x improvement over the naive union regex. A trie regex takes advantage of common prefixes and creates an optimized regular expression. For the words:
['baby', 'bat', 'bad']
it creates the following:
ba(?:by|[td])
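As a quick illustration (hand-written here, with the \b boundaries that tx.make adds by default, hence the right='' override above), this trie-style pattern matches all three words in a single pass:
import re

# trie-style alternation for 'baby', 'bat', 'bad', wrapped in word boundaries
pattern = re.compile(r'\bba(?:by|[td])\b')
print(pattern.findall("a bad baby bat and a ban"))  # ['bad', 'baby', 'bat'] -- 'ban' is not matched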
Answer 3:
Create a trie for the words you want to search.
Then iterate over the characters of the string you want to check.
- Each time you reach a leaf in the trie, increase the counter and skip to the next word.
- Each time there is no path, skip to the next word.
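A rough sketch of this idea in Python, assuming lower-casing handles the case-insensitivity, a simple \w+ tokenization is enough (punctuation words would need extra care), and a word ending in '*' marks a prefix node:
import re

def build_trie(words):
    # character trie; the '$' key marks where a word ends ('prefix' for words ending in '*')
    root = {}
    for w in words:
        w = w.lower()
        is_prefix = w.endswith('*')
        node = root
        for ch in (w[:-1] if is_prefix else w):
            node = node.setdefault(ch, {})
        node['$'] = 'prefix' if is_prefix else 'exact'
    return root

def count_matches(text, root):
    count = 0
    for token in re.findall(r'\w+', text.lower()):
        node = root
        for i, ch in enumerate(token):
            if ch not in node:          # no path in the trie: skip to the next token
                break
            node = node[ch]
            if node.get('$') == 'prefix':     # anything may follow a prefix word
                count += 1
                break
            if node.get('$') == 'exact' and i == len(token) - 1:  # exact word must cover the whole token
                count += 1
    return count

words = set(["apple", "cat*", "dog"])
text = "My cat loves apples, but I never ate an apple. My dog loves them less than my CATS"
print(count_matches(text, build_trie(words)))  # 4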
Source: https://stackoverflow.com/questions/65140090/python-fast-count-words-in-text-from-list-of-strings-and-that-start-with