Is there any Regex optimizer written in Java?

佛祖请我去吃肉 2021-01-23 09:46

I wrote a Java program that generates a sequence of symbols, such as "abcdbcdefbcdbcdefg". What I need is a regex optimizer that can produce "a((bcd){2}ef)

3 Answers
  • 2021-01-23 10:08

    I've got a nasty feeling that the problem of creating the shortest regex that matches a given input string or set of strings is going to be computationally "difficult". (There are parallels with the problem of computing Kolmogorov Complexity ...)

    It is also worth noting that the optimal regex for abcdbcdefbcdbcdefg in terms of matching speed is likely to be abcdbcdefbcdbcdefg. Adding repeating groups may make the regex string shorter, but it won't make the regex faster. In fact, it is likely to be slower unless the regex engine unrolls the repeating groups.
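
    To make the comparison concrete, here is a minimal sketch using java.util.regex. The grouped pattern a((bcd){2}ef){2}g is an assumption: it is one hand-written compact form of the string, not the output of any optimizer, and the class name LiteralVsGrouped is made up for illustration.

```java
import java.util.regex.Pattern;

final class LiteralVsGrouped {
    // The literal sequence from the question.
    static final String INPUT = "abcdbcdefbcdbcdefg";

    // The plain literal pattern; Pattern.quote protects any metacharacters in other inputs.
    static final Pattern LITERAL = Pattern.compile(Pattern.quote(INPUT));

    // Assumption: a hand-written grouped form that accepts exactly the same string.
    static final Pattern GROUPED = Pattern.compile("a((bcd){2}ef){2}g");

    // True only if both patterns accept the whole of s.
    static boolean bothMatch(String s) {
        return LITERAL.matcher(s).matches() && GROUPED.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println("both match: " + bothMatch(INPUT));
        System.out.println("literal length: " + INPUT.length());
        System.out.println("grouped length: " + GROUPED.pattern().length());
    }
}
```

    For this input the grouped pattern is 17 characters long against 18 for the literal, so the saving is a single character.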

    You wrote: "The reason that I need this is due to the space/memory limits."

    Do you have clear evidence that you need to do this?

    I suspect that you won't save a worthwhile amount of space by doing this ... unless the input strings are really long. (And if they are long, then you'll get better results using a regular text compression algorithm to compress the strings.)
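
    As a rough illustration of the compression route, the sketch below uses java.util.zip.Deflater and Inflater from the standard library. The class and method names (SequenceCompression, compress, decompress) are hypothetical, and the 200-fold repetition is only an assumed stand-in for a "really long" input.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

final class SequenceCompression {
    // Compress the UTF-8 bytes of text with DEFLATE (zlib format).
    static byte[] compress(String text) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(text.getBytes(StandardCharsets.UTF_8));
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Reverse of compress; malformed input is reported as an unchecked exception.
    static String decompress(byte[] data) {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported input");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid DEFLATE data", e);
        } finally {
            inflater.end();
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String shortSeq = "abcdbcdefbcdbcdefg";
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 200; i++) {
            sb.append(shortSeq); // assumed example of a long, repetitive sequence
        }
        String longSeq = sb.toString();
        System.out.println("short: " + shortSeq.length() + " chars, " + compress(shortSeq).length + " bytes compressed");
        System.out.println("long: " + longSeq.length() + " chars, " + compress(longSeq).length + " bytes compressed");
    }
}
```

    Running it on both inputs shows whether the format's fixed overhead outweighs the saving for a string as short as the one in the question.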

  • 2021-01-23 10:21

    I assume you are trying to find a small regex to encode a finite set of input strings. If so, you haven't chosen the best possible subject line.

    I can't give you an existing program, but I can tell you how to approach writing one.

    There is no canonical minimum regex form, and determining the true minimum-size regex is NP-hard. Your sets are finite, though, so this may be a simpler problem. I'll have to think about it.

    But a good heuristic algorithm would be:

    1. Construct a trivial non-deterministic finite automaton (NFA) that accepts all your strings.
    2. Convert the NFA to a deterministic finite automaton (DFA) with the subset construction.
    3. Minimize the DFA with the standard algorithm.
    4. Use the construction from the proof of Kleene's theorem to get to a regex.

    Note that step 3 does give you a unique minimum DFA. That would probably be the best way to encode your string sets.
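
    The steps above can be sketched for a finite set of strings as follows. This is a simplified illustration under stated assumptions, not a full implementation: a trie stands in for steps 1 and 2 (it is already deterministic), step 3 merges equivalent states bottom-up, which is valid only because the automaton is acyclic, and step 4 is a plain recursive walk rather than the general construction from Kleene's theorem, so shared suffix states are written out once per path. All names (FiniteSetRegex, buildTrie, minimize, toRegex, regexFor) are hypothetical, and the input is assumed to be a non-empty list of strings without surrogate pairs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class FiniteSetRegex {
    // One automaton state; transitions are sorted so state signatures are stable.
    static final class State {
        final TreeMap<Character, State> next = new TreeMap<>();
        boolean accepting;
    }

    // Steps 1-2: a trie is a deterministic automaton accepting exactly the given words.
    static State buildTrie(List<String> words) {
        State root = new State();
        for (String w : words) {
            State s = root;
            for (char c : w.toCharArray()) {
                s = s.next.computeIfAbsent(c, k -> new State());
            }
            s.accepting = true;
        }
        return root;
    }

    // Step 3: post-order merge of states with the same acceptance flag and the same
    // (character, canonical target) transitions. Returns the canonical state for s.
    static State minimize(State s, Map<String, State> registry, Map<State, Integer> ids) {
        StringBuilder sig = new StringBuilder(s.accepting ? "1" : "0");
        for (Map.Entry<Character, State> e : s.next.entrySet()) {
            State child = minimize(e.getValue(), registry, ids);
            e.setValue(child);
            sig.append(e.getKey()).append(ids.get(child)).append(';');
        }
        State canonical = registry.putIfAbsent(sig.toString(), s);
        if (canonical == null) {
            ids.put(s, ids.size()); // first state with this signature becomes canonical
            return s;
        }
        return canonical;
    }

    // Escape anything that is not a letter or digit so it is matched literally.
    static String escape(char c) {
        return Character.isLetterOrDigit(c) ? String.valueOf(c) : "\\" + c;
    }

    // Step 4 (simplified): emit a regex for the language accepted from state s.
    static String toRegex(State s) {
        if (s.next.isEmpty()) {
            return "";
        }
        List<String> alts = new ArrayList<>();
        for (Map.Entry<Character, State> e : s.next.entrySet()) {
            alts.add(escape(e.getKey()) + toRegex(e.getValue()));
        }
        if (alts.size() == 1 && !s.accepting) {
            return alts.get(0); // plain concatenation, no group needed
        }
        String group = "(?:" + String.join("|", alts) + ")";
        return s.accepting ? group + "?" : group; // an accepting state may stop here
    }

    static String regexFor(List<String> words) {
        State root = minimize(buildTrie(words), new HashMap<>(), new HashMap<>());
        return toRegex(root);
    }

    public static void main(String[] args) {
        System.out.println(regexFor(Arrays.asList("abcd", "abef", "abcdef")));
    }
}
```

    After minimize returns, the size of the registry map is the number of states in the minimized automaton, which is the quantity to look at when the automaton itself is used as the encoding.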

  • 2021-01-23 10:22

    Regular expressions are not a substitute for compression

    Don't use a regular expression to represent a string constant. Regular expressions are designed to be used to match one of many strings. That's not what you're doing.
