I was asked today if there was a library to take a list of strings and to compute the most efficient regex to match only those strings. I think it's an NP-complete problem by i
Regexp::Assemble::Compressed / Regexp::Assemble know far more tricks than PreSuf. R::A comes with the command-line tool assemble (not installed by default), which makes building regexes even easier.
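A minimal sketch of using the module from code rather than the command-line tool, assuming Regexp::Assemble is installed from CPAN. The host names are only illustrative, and the exact pattern it prints may differ between module versions, so none is shown here.

```perl
use strict;
use warnings;
use Regexp::Assemble;    # CPAN module, not part of core Perl

# Illustrative input list (an assumption for this example)
my @hosts = qw(
    appserver1.domain.tld
    appserver2.domain.tld
    appserver3.domain.tld
);

my $ra = Regexp::Assemble->new;

# add() takes regex fragments, so escape the literal dots first
$ra->add(quotemeta $_) for @hosts;

# re() assembles everything added so far into one compiled qr// object
my $re = $ra->re;
print "$re\n";

# The assembled pattern is unanchored, so anchor it when matching whole strings
for my $candidate (@hosts, 'appserver4.domain.tld') {
    my $verdict = $candidate =~ /\A$re\z/ ? 'match' : 'no match';
    printf "%-24s %s\n", $candidate, $verdict;
}
```

Note that the strings go through quotemeta because add() treats its arguments as patterns, not as literal text.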
The Perl regex compiler builds a branching trie data structure out of patterns with parts in common across alternatives:
$ perl -Mre=debug -ce '"whatever" =~ /appserver1\.domain\.tld|appserver2\.domain\.tld|appserver3\.domain\.tld/'
Compiling REx "appserver1\.domain\.tld|appserver2\.domain\.tld|appserver3\."...
Final program:
1: EXACT <appserver> (5)
5: TRIEC-EXACT[123] (25)
<1.domain.tld>
<2.domain.tld>
<3.domain.tld>
25: END (0)
anchored "appserver" at 0 (checking anchored) minlen 21
-e syntax OK
Freeing REx: "appserver1\.domain\.tld|appserver2\.domain\.tld|appserver3\."...
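The same prefix-sharing idea can be sketched by hand in core Perl. This is not what the regex compiler or the CPAN modules do internally, just a rough illustration: build_trie and trie_to_pattern are hypothetical helper names, and the sketch only merges common prefixes, not suffixes.

```perl
use strict;
use warnings;

# Build a nested-hash trie from literal strings.
# The empty-string key marks "a word ends at this node".
sub build_trie {
    my @words = @_;
    my %trie;
    for my $word (@words) {
        my $node = \%trie;
        $node = ($node->{$_} //= {}) for split //, $word;
        $node->{''} = 1;
    }
    return \%trie;
}

# Turn the trie into a pattern string, sharing common prefixes.
sub trie_to_pattern {
    my ($node) = @_;
    my $word_ends_here = exists $node->{''};

    my @alts;
    for my $char (sort grep { $_ ne '' } keys %$node) {
        push @alts, quotemeta($char) . trie_to_pattern($node->{$char});
    }

    return '' unless @alts;                               # leaf node
    return $alts[0] if @alts == 1 && !$word_ends_here;    # plain sequence

    # Several branches, or a word that is also a prefix of longer words
    my $group = '(?:' . join('|', @alts) . ')';
    return $word_ends_here ? "$group?" : $group;
}

my @hosts = qw(
    appserver1.domain.tld
    appserver2.domain.tld
    appserver3.domain.tld
);
my $pattern = trie_to_pattern(build_trie(@hosts));
print "$pattern\n";
# prints: appserver(?:1\.domain\.tld|2\.domain\.tld|3\.domain\.tld)

my $re = qr/\A$pattern\z/;
print "appserver2.domain.tld matches\n" if 'appserver2.domain.tld' =~ $re;
```

Given that the compiler already factors alternations like this, a hand-built pattern of this kind is mostly useful for readability or for engines that lack the optimization.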
The Regex::PreSuf module is designed to do exactly this: build a compact regex from a list of strings.
To quote the Synopsis:
use Regex::PreSuf;
my $re = presuf(qw(foobar fooxar foozap));
# $re should be now 'foo(?:zap|[bx]ar)'