How to match a string against a set of wildcard strings efficiently?

慢半拍i 2021-01-14 16:43

I am looking for a way to match a single string against a set of wildcard strings efficiently. For example:

>>> match("ab", ["a*", "b*", "*", "c",


        
2 Answers
  • 2021-01-14 17:25

    It seems like the Aho-Corasick algorithm would work. esmre seems to do what I'm looking for. I got this information from this question.
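
To illustrate why Aho-Corasick is a good fit, here is a minimal pure-Python sketch of the algorithm. It reports which of many literal keywords occur in a text in a single pass. This is not esmre's API; `build_automaton` and `find_keywords` are hypothetical names chosen for illustration, and keywords are assumed to be non-empty. For wildcard patterns, the literal fragments between the `*` characters could serve as the keywords.

```python
from collections import deque


def build_automaton(keywords):
    """Build Aho-Corasick tables: goto transitions, failure links, outputs."""
    goto = [{}]    # goto[state][char] -> next state (a trie of the keywords)
    fail = [0]     # fail[state] -> state to fall back to on a mismatch
    out = [set()]  # out[state] -> keywords that end when this state is reached
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(word)

    # Breadth-first pass: failure links of shallow states are final
    # before deeper states need them.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit keywords that end at the fallback state (proper suffixes).
            out[nxt] |= out[fail[nxt]]
    return goto, fail, out


def find_keywords(text, automaton):
    """Return the set of keywords that occur somewhere in text."""
    goto, fail, out = automaton
    found = set()
    state = 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        found |= out[state]
    return found


automaton = build_automaton(["he", "she", "his", "hers"])
print(find_keywords("ushers", automaton))
```

The scan reads each character of the text once, so the cost of a query depends on the text length and the number of hits rather than on how many keywords were indexed.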

  • 2021-01-14 17:35

    You could use the FilteredRE2 class from the re2 library with help from an Aho-Corasick implementation (or similar). From the re2 docs:

    Required substrings. Suppose you have an efficient way to check which of a list of strings appear as substrings in a large text (for example, maybe you implemented the Aho-Corasick algorithm), but now your users want to be able to do regular expression searches efficiently too. Regular expressions often have large literal strings in them; if those could be identified, they could be fed into the string searcher, and then the results of the string searcher could be used to filter the set of regular expression searches that are necessary. The FilteredRE2 class implements this analysis. Given a list of regular expressions, it walks the regular expressions to compute a boolean expression involving literal strings and then returns the list of strings. For example, FilteredRE2 converts (hello|hi)world[a-z]+foo into the boolean expression “(helloworld OR hiworld) AND foo” and returns those three strings. Given multiple regular expressions, FilteredRE2 converts each into a boolean expression and returns all the strings involved. Then, after being told which of the strings are present, FilteredRE2 can evaluate each expression to identify the set of regular expressions that could possibly be present. This filtering can reduce the number of actual regular expression searches significantly.

    The feasibility of these analyses depends crucially on the simplicity of their input. The first uses the DFA form, while the second uses the parsed regular expression (Regexp*). These kinds of analyses would be more complicated (maybe even impossible) if RE2 allowed non-regular features in its regular expressions.
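
The filtering idea described in the quote can be sketched for plain wildcard patterns using only the Python standard library. This is a rough stand-in under stated assumptions, not FilteredRE2's actual API: the names `required_literals` and `match` are hypothetical, patterns are assumed to use only `*` and `?` as wildcards (no `[...]` classes), and the substring check uses Python's `in` operator where a real implementation would plug in an Aho-Corasick search. The input below is only an illustration built from the patterns visible in the question.

```python
import re
from fnmatch import fnmatchcase


def required_literals(pattern):
    """Literal runs of a wildcard pattern.

    Every string matching the pattern must contain all of them, so they
    form an AND condition. Assumes only * and ? are wildcards.
    """
    return set(re.split(r"[*?]+", pattern)) - {""}


def match(text, patterns):
    """Return the patterns that match text, pre-filtering by literals."""
    index = {p: required_literals(p) for p in patterns}
    all_literals = set().union(*index.values())

    # Stand-in for the string searcher: which literals occur in the text?
    present = {lit for lit in all_literals if lit in text}

    # Keep only patterns whose required literals are all present...
    candidates = [p for p in patterns if index[p] <= present]
    # ...and confirm each survivor with a full wildcard match.
    return [p for p in candidates if fnmatchcase(text, p)]


print(match("ab", ["a*", "b*", "*", "c"]))  # prints ['a*', '*']
```

The filter can only rule patterns out. A pattern such as `b*` passes it here because `b` occurs in `ab`, which is why the final `fnmatchcase` confirmation is still needed.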
