I wrote a Java program which can generate a sequence of symbols, like "abcdbcdefbcdbcdefg". What I need is Regex optimizer, which can result "a((bcd){2}ef)
I've got a nasty feeling that the problem of creating the shortest regex that matches a given input string or set of strings is going to be computationally "difficult". (There are parallels with the problem of computing Kolmogorov Complexity ...)
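To make the distinction concrete: a greedy heuristic that collapses adjacent repeats is easy to write, but it is not an optimizer and makes no claim to find the shortest regex. The sketch below is only an illustration. The class and method names (`RepeatCollapser`, `collapse`) are hypothetical, and it assumes the input contains only letters and digits, so nothing needs escaping.

```java
import java.util.regex.Pattern;

public class RepeatCollapser {
    // Greedy heuristic: at each position, pick the adjacent repeat (unit repeated
    // count >= 2 times) that covers the most characters, and recurse into the unit.
    // Assumption: the input has no regex metacharacters, so no escaping is done.
    static String collapse(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            int bestLen = 0;
            int bestCount = 0;
            for (int len = 1; i + 2 * len <= s.length(); len++) {
                int count = 1;
                // Count how many consecutive copies of s[i, i+len) follow.
                while (s.regionMatches(i, s, i + count * len, len)) {
                    count++;
                }
                if (count >= 2 && len * count > bestLen * bestCount) {
                    bestLen = len;
                    bestCount = count;
                }
            }
            if (bestCount >= 2) {
                String unit = collapse(s.substring(i, i + bestLen));
                out.append('(').append(unit).append("){").append(bestCount).append('}');
                i += bestLen * bestCount;
            } else {
                out.append(s.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "abcdbcdefbcdbcdefg";
        String regex = collapse(input);
        System.out.println(regex);
        // Sanity check: the generated regex must still match the original input.
        System.out.println(Pattern.matches(regex, input));
    }
}
```

For the example string this yields a((bcd){2}ef){2}g, but the same heuristic turns "aa" into "(a){2}", which is longer than the input. Guaranteeing the shortest result is the hard part.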
It is also worth noting that the optimal regex for "abcdbcdefbcdbcdefg" in terms of matching speed is likely to be "abcdbcdefbcdbcdefg" itself. Adding repeating groups may make the regex string shorter, but it won't make the regex faster. In fact, it is likely to be slower unless the regex engine unrolls the repeating groups.
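As a rough illustration, the snippet below compiles both forms and checks that they accept the same input. The regex in the question is cut off, so the grouped form used here, a((bcd){2}ef){2}g, is my guess at what was intended. The class name `RegexForms` is made up for the example.

```java
import java.util.regex.Pattern;

public class RegexForms {
    static final String INPUT = "abcdbcdefbcdbcdefg";

    // Literal form: the string itself, quoted so nothing is read as a metacharacter.
    static final Pattern LITERAL = Pattern.compile(Pattern.quote(INPUT));

    // Assumed grouped form (a guess, since the regex in the question is truncated).
    static final Pattern GROUPED = Pattern.compile("a((bcd){2}ef){2}g");

    // True only if both patterns match the whole string.
    static boolean bothMatch(String s) {
        return LITERAL.matcher(s).matches() && GROUPED.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println("both match: " + bothMatch(INPUT));
        System.out.println("grouped regex length: " + GROUPED.pattern().length());
        System.out.println("input length: " + INPUT.length());
    }
}
```

Comparing the two lengths shows how little the grouped form saves for a string this short. This says nothing about speed; if that matters, benchmark both patterns on your own data.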
The reason that I need this is due to the space/memory limits.
Do you have clear evidence that you need to do this?
I suspect that you won't save a worthwhile amount of space by doing this ... unless the input strings are really long. (And if they are long, then you'll get better results using a regular text compression algorithm to compress the strings.)
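If you want to measure this before committing to anything, here is a minimal sketch using the JDK's own java.util.zip.Deflater and Inflater. The class and method names (`SequenceCompression`, `compress`, `decompress`) are just for illustration, and the 1000-fold repetition in `main` is an arbitrary stand-in for "a really long input".

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class SequenceCompression {
    // Compress a string with DEFLATE (zlib format) at the highest compression level.
    static byte[] compress(String text) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(text.getBytes(StandardCharsets.UTF_8));
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Reverse of compress(); wraps the checked exception to keep call sites simple.
    static String decompress(byte[] data) {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated or unsupported data; stop instead of spinning
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("not valid compressed data", e);
        } finally {
            inflater.end();
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String shortSeq = "abcdbcdefbcdbcdefg";
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append(shortSeq);
        }
        String longSeq = sb.toString();
        System.out.println("short: " + shortSeq.length() + " -> " + compress(shortSeq).length);
        System.out.println("long:  " + longSeq.length() + " -> " + compress(longSeq).length);
    }
}
```

Run it on sequences of the size your program really generates. The compressed format carries a fixed overhead, so very short strings may not shrink at all, which is another reason to check whether you have a space problem in the first place.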