I wrote a Java program which can generate a sequence of symbols, like \"abcdbcdefbcdbcdefg\"
. What I need is Regex optimizer, which can result \"a((bcd){2}ef)
I've got a nasty feeling that the problem of creating the shortest regex that matches a given input string or set of strings is going to be computationally "difficult". (There are parallels with the problem of computing Kolmogorov Complexity ...)
It is also worth noting that the optimal regex for abcdbcdefbcdbcdefg
in terms of matching speed is likely to be abcdbcdefbcdbcdefg
. Adding repeating groups may make the regex string shorter, but it won't make the regex faster. In fact, it is likely to be slower unless the regex engine unrolls the repeating groups.
The reason that I need this is due to the space/memory limits.
Do you have clear evidence that you need to do this?
I suspect that you won't save a worthwhile amount of space by doing this ... unless the input strings are really long. (And if they are long, then you'll get better results using a regular text compression algorithm to compress the strings.)
I assume you are trying to find a small regex to encode a finite set of input strings. If so, you haven't chosen the best possible subject line.
I can't give you an existing program, but I can tell you how to approach writing one.
There is no canonical minimum regex form and determining the true minimum size regex is NP hard. Certainly your sets are finite, so this may be a simpler problem. I'll have to think about it.
But a good heuristic algorithm would be:
Note that step 3 does give you a unique minimum DFA. That would probably be the best way to encode your string sets.
Regular expressions are not a substitute for compression
Don't use a regular expression to represent a string constant. Regular expressions are designed to be used to match one of many strings. That's not what you're doing.