Efficiently querying one string against multiple regexes

感情败类 2020-12-12 17:16

Let's say that I have 10,000 regexes and one string, and I want to find out whether the string matches any of them and get all the matches. The trivial way to do it would be to jus

18 Answers
  • 2020-12-12 17:35

    I've come across a similar problem in the past. I used a solution similar to the one suggested by akdom.

    I was lucky in that my regular expressions usually had some substring that must appear in every string they match. I was able to extract these substrings using a simple parser and index them in an FSA using the Aho-Corasick algorithm. The index was then used to quickly eliminate all the regular expressions that trivially cannot match a given string, leaving only a few regular expressions to check.

    I released the code under the LGPL as a Python/C module. See esmre on Google code hosting.
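
The prefilter idea described above can be sketched roughly as follows. This is not the esmre code; it is a minimal illustration that assumes each regex comes with a hand-supplied literal that must appear in any string it matches (the answer extracted these with a parser). `LiteralIndex`, `build_rules`, `match_all`, and `RULES` are hypothetical names, and regexes with no required literal are simply always checked.

```python
import re
from collections import deque


class LiteralIndex:
    """Small Aho-Corasick automaton mapping literal substrings to regex ids."""

    def __init__(self):
        self.goto = [{}]    # per-state transitions: char -> next state
        self.fail = [0]     # failure link of each state
        self.out = [set()]  # regex ids reported when a state is reached

    def add(self, literal, regex_id):
        state = 0
        for ch in literal:
            nxt = self.goto[state].get(ch)
            if nxt is None:
                nxt = len(self.goto)
                self.goto.append({})
                self.fail.append(0)
                self.out.append(set())
                self.goto[state][ch] = nxt
            state = nxt
        self.out[state].add(regex_id)

    def build(self):
        # Breadth-first pass that fills in failure links and merges outputs.
        queue = deque(self.goto[0].values())
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                self.out[nxt] |= self.out[self.fail[nxt]]

    def candidates(self, text):
        """Return ids of regexes whose required literal occurs in text."""
        found, state = set(), 0
        for ch in text:
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            found |= self.out[state]
        return found


def build_rules(rules):
    """rules: list of (required_literal_or_None, pattern) pairs (assumed input)."""
    index, compiled, always = LiteralIndex(), [], set()
    for i, (literal, pattern) in enumerate(rules):
        compiled.append(re.compile(pattern))
        if literal:
            index.add(literal, i)
        else:
            always.add(i)  # no usable literal: cannot be filtered out
    index.build()
    return index, compiled, always


def match_all(text, index, compiled, always):
    """Run only the regexes that survive the literal prefilter."""
    ids = index.candidates(text) | always
    return sorted(i for i in ids if compiled[i].search(text))


# Illustrative rules only; the literals were chosen by hand.
RULES = [
    ("ello", r"[Hh]ello"),
    ("world", r"\bworld\b"),
    ("2020", r"\b2020-\d\d-\d\d\b"),
    (None, r"\d+"),
]
INDEX, COMPILED, ALWAYS = build_rules(RULES)
```

With these rules, `match_all("hello world", INDEX, COMPILED, ALWAYS)` only runs three of the four regexes: the two whose literals occur in the text, plus the one that has no literal.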

  • 2020-12-12 17:35

    You'd need to have some way of determining whether a given regex was "additive" compared to another one. Creating a regex "hierarchy" of sorts would allow you to determine that all regexes of a certain branch did not match.

  • 2020-12-12 17:36

    I think that the short answer is that yes, there is a way to do this, and that it is well known to computer science, and that I can't remember what it is.

    The short answer is that you might find that your regex interpreter already deals with all of these efficiently when |'d together, or you might find one that does. If not, it's time for you to google string-matching and searching algorithms.

  • 2020-12-12 17:37

    I use Ragel with a leaving action:

    action hello {...}
    action ello {...}
    action ello2 {...}
    main := /[Hh]ello/  % hello |
            /.+ello/ % ello |
            any{0,20} "ello"  % ello2 ;
    

    The string "hello" would call the code in the action hello block, then in the action ello block and lastly in the action ello2 block.

    Ragel's regular expressions are quite limited, and its state machine language is preferred instead; the braces from your example only work with the more general language.

  • 2020-12-12 17:38

    Try combining them into one big regex?
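
As a rough sketch of what combining could look like in Python's `re` module, each pattern can be wrapped in a named group and joined with `|`. The helper names (`combine`, `matches_any`, `winning_alternatives`) and the sample patterns are made up for illustration, and the sketch assumes the individual patterns contain no capture groups of their own. Note the limitation: an alternation reports only the one alternative that wins at each match, so this answers "does anything match?" but does not by itself list every matching regex.

```python
import re

# Illustrative patterns only.
PATTERNS = [r"[Hh]ello", r"\d{4}-\d{2}-\d{2}", r"wor+ld"]


def combine(patterns):
    """Join patterns into one alternation, one named group (r0, r1, ...) each."""
    return re.compile("|".join(f"(?P<r{i}>{p})" for i, p in enumerate(patterns)))


def matches_any(combined, text):
    """True if at least one of the combined patterns matches somewhere."""
    return combined.search(text) is not None


def winning_alternatives(combined, text):
    """Indices of the alternatives that won at each non-overlapping match."""
    return sorted({int(m.lastgroup[1:]) for m in combined.finditer(text)})


COMBINED = combine(PATTERNS)
```

Two patterns that match the same span show the limitation: with `combine([r"hello", r"hel+o"])`, only the first alternative is ever reported for the text `"hello"`.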

  • 2020-12-12 17:41

    You could combine them in groups of maybe 20.

    (?=(regex1)?)(?=(regex2)?)(?=(regex3)?)...(?=(regex20)?)
    

    As long as each regex has zero (or at least the same number of) capture groups, you can look at what was captured to see which pattern(s) matched.

    If regex1 matched, capture group 1 would have its matched text. If not, it would be undefined/None/null/...
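
A minimal Python sketch of the optional-lookahead trick, assuming every pattern has zero capture groups of its own. `build_group` and `which_match` are hypothetical helper names. Because every lookahead is optional, the combined pattern always succeeds, and each sub-pattern is tried at the position where the match starts (here the start of the string by default), not anywhere in the text.

```python
import re


def build_group(patterns):
    """Build (?=(p1)?)(?=(p2)?)... so group N is None when pattern N fails."""
    return re.compile("".join(f"(?=({p})?)" for p in patterns))


def which_match(compiled, text, pos=0):
    """Indices of the patterns that match starting at position pos."""
    m = compiled.match(text, pos)  # always succeeds: every lookahead is optional
    return [i for i, g in enumerate(m.groups()) if g is not None]


# Illustrative patterns only.
GROUP = build_group([r"[Hh]ello", r".+ello", r"\d+"])
```

For example, `which_match(GROUP, "hello world")` reports the first two patterns, since both match starting at position 0. To test other starting positions, the same compiled pattern can be called again with a different `pos`.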
