Efficiently querying one string against multiple regexes

感情败类 2020-12-12 17:16

Let's say that I have 10,000 regexes and one string, and I want to find out whether the string matches any of them and get all the matches. The trivial way to do it would be to just test the string against each regex one by one. Is there a faster, more efficient way to do it?

18 Answers
  • 2020-12-12 17:42

    We had to do this on a product I worked on once. The answer was to compile all your regexes together into a Deterministic Finite State Machine (also known as a deterministic finite automaton or DFA). The DFA could then be walked character by character over your string and would fire a "match" event whenever one of the expressions matched.

    Advantages are that it runs fast (each character is compared only once) and that it does not get any slower as you add more expressions.

    Disadvantages are that it requires a huge data table for the automaton, and there are many types of regular expressions that are not supported (for instance, back-references).

    The one we used was hand-coded by a C++ template nut in our company at the time, so unfortunately I don't have any FOSS solutions to point you toward. But if you google regex or regular expression with "DFA" you'll find stuff that will point you in the right direction.
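
    As a rough sketch of the single-pass idea (not the hand-coded automaton we used), in C# you can join the patterns into one alternation with a named group per pattern, so one scan tells you which rules fired. The group names and the RegexOptions.NonBacktracking flag (an automaton-based engine in .NET 7+ which, like a DFA, rejects backreferences and lookarounds) are illustrative assumptions:

    using System.Collections.Generic;
    using System.Linq;
    using System.Text.RegularExpressions;

    public static class CombinedScan
    {
        // Wrap each pattern in its own named group and join them into one
        // alternation, so the engine scans the text once and the successful
        // group tells you which pattern hit.
        public static HashSet<int> MatchingPatterns(string text, IReadOnlyList<string> patterns)
        {
            string combined = string.Join("|",
                patterns.Select((p, i) => $"(?<r{i}>{p})"));

            // NonBacktracking (.NET 7+) selects an automaton-based engine; drop
            // the flag if your patterns need backreferences or lookarounds.
            var regex = new Regex(combined, RegexOptions.NonBacktracking);

            var hits = new HashSet<int>();
            foreach (Match m in regex.Matches(text))
                for (int i = 0; i < patterns.Count; i++)
                    if (m.Groups[$"r{i}"].Success)
                        hits.Add(i);
            return hits;
        }
    }

    Note that this is not exactly the same as testing each regex independently: a pattern whose only match lies inside text already consumed by an earlier alternative can be missed. Dedicated multi-pattern engines such as RE2::Set or Hyperscan handle that case and are much closer to the DFA approach described above.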

  • 2020-12-12 17:42

    This is the way lexers work.

    The regular expressions are converted into a single nondeterministic finite automaton (NFA), which can then be transformed into a deterministic one (DFA).

    The resulting automaton tries to match all the regular expressions at once and reports which of them succeeded.

    There are many tools that can help you here; they are called lexer generators, and implementations exist for most languages.

    You don't say which language you are using. For C programmers I would suggest having a look at the re2c tool. Of course, the traditional (f)lex is always an option.
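
    re2c and (f)lex are C tools, so just to illustrate the behaviour a generated lexer gives you (all rules compete, the longest match wins, an earlier rule breaks ties), here is a rough C# sketch that computes the same answer by trying each rule at the current position. The rule set and names are made up for the example; a real lexer generator gets the same result in a single pass over one automaton instead of trying the rules separately.

    using System;
    using System.Text.RegularExpressions;

    static class LexerStyleDemo
    {
        // Rules in priority order; a lexer generator folds these into one automaton.
        static readonly (string Name, Regex Pattern)[] Rules =
        {
            ("NUMBER", new Regex(@"\d+")),
            ("IDENT",  new Regex(@"[A-Za-z_]\w*")),
            ("SPACE",  new Regex(@"\s+")),
        };

        // Returns the winning rule and lexeme at position pos, or null if none match.
        static (string Name, string Lexeme)? NextToken(string input, int pos)
        {
            (string Name, string Lexeme)? best = null;
            foreach (var (name, pattern) in Rules)
            {
                Match m = pattern.Match(input, pos);
                // Only accept matches anchored at pos; the longest match wins,
                // and an earlier rule wins ties because of the strict > below.
                if (m.Success && m.Index == pos &&
                    (best == null || m.Length > best.Value.Lexeme.Length))
                    best = (name, m.Value);
            }
            return best;
        }

        static void Main()
        {
            string input = "count42 7";
            for (int pos = 0; pos < input.Length; )
            {
                var tok = NextToken(input, pos);
                if (tok == null) break;                 // no rule matches here
                Console.WriteLine($"{tok.Value.Name}: '{tok.Value.Lexeme}'");
                pos += tok.Value.Lexeme.Length;
            }
        }
    }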

  • 2020-12-12 17:44

    You could compile the regexes into a hybrid DFA/Büchi automaton where each time the automaton enters an accept state you flag which regex rule "hit".

    A Büchi automaton is a bit of overkill for this, but modifying the way your DFA works could do the trick.

  • 2020-12-12 17:44

    The fastest way to do it seems to be something like this (code is C#):

    public static List<Regex> FindAllMatches(string s, List<Regex> regexes)
    {
        // Naive approach: test the string against every regex independently.
        List<Regex> matches = new List<Regex>();
        foreach (Regex r in regexes)
        {
            if (r.IsMatch(s))
            {
                matches.Add(r);
            }
        }
        return matches;
    }
    

    Oh, you meant the fastest code? I don't know then...

  • 2020-12-12 17:45

    I'd say that it's a job for a real parser. A midpoint might be a Parsing Expression Grammar (PEG). It's a higher-level abstraction of pattern matching; one feature is that you can define a whole grammar instead of a single pattern. There are some high-performance implementations that work by compiling your grammar into bytecode and running it in a specialized VM.

    Disclaimer: the only one I know is LPEG, a library for Lua, and it wasn't easy (for me) to grasp the basic concepts.

  • 2020-12-12 17:46

    If you're thinking in terms of "10,000 regexes", you need to shift your thought processes. If nothing else, think in terms of "10,000 target strings to match". Then look for non-regex methods built to deal with "boatloads of target strings" situations, like Aho-Corasick machines. Frankly, though, it seems like something went off the rails much earlier in the process than the choice of matching machine, since 10,000 target strings sounds a lot more like a database lookup than a string match.
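
    For the literal-strings framing, here is a minimal Aho-Corasick sketch in C# (plain target strings only, not regexes; the class and member names are just for illustration): build a trie of the targets, add failure links, then scan the text once and collect every target that occurs.

    using System.Collections.Generic;

    public sealed class AhoCorasick
    {
        private sealed class Node
        {
            public readonly Dictionary<char, Node> Next = new Dictionary<char, Node>();
            public Node Fail;
            public readonly List<int> Output = new List<int>();  // indices of patterns ending here
        }

        private readonly Node _root = new Node();

        public AhoCorasick(IReadOnlyList<string> patterns)
        {
            // 1. Build a trie of the target strings.
            for (int i = 0; i < patterns.Count; i++)
            {
                Node node = _root;
                foreach (char c in patterns[i])
                {
                    if (!node.Next.TryGetValue(c, out Node child))
                        node.Next[c] = child = new Node();
                    node = child;
                }
                node.Output.Add(i);
            }

            // 2. Add failure links breadth-first (BFS guarantees a node's
            //    failure target is finished before the node itself).
            var queue = new Queue<Node>();
            foreach (Node child in _root.Next.Values)
            {
                child.Fail = _root;
                queue.Enqueue(child);
            }
            while (queue.Count > 0)
            {
                Node node = queue.Dequeue();
                foreach (KeyValuePair<char, Node> edge in node.Next)
                {
                    Node f = node.Fail;
                    while (f != null && !f.Next.ContainsKey(edge.Key))
                        f = f.Fail;
                    edge.Value.Fail = f != null ? f.Next[edge.Key] : _root;
                    edge.Value.Output.AddRange(edge.Value.Fail.Output);  // inherit matches
                    queue.Enqueue(edge.Value);
                }
            }
        }

        // Returns the indices of every target string that occurs in the text.
        public HashSet<int> Search(string text)
        {
            var hits = new HashSet<int>();
            Node node = _root;
            foreach (char c in text)
            {
                while (node != _root && !node.Next.ContainsKey(c))
                    node = node.Fail;
                node = node.Next.TryGetValue(c, out Node next) ? next : _root;
                foreach (int i in node.Output)
                    hits.Add(i);
            }
            return hits;
        }
    }

    Usage would be something like: var hits = new AhoCorasick(targets).Search(text); which returns the indices of the target strings present in the text.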
