I'm trying to write a regex or Ruby method which will find the longest repeated pattern in a string. For example:
"abcabc" => "abc"
"cccc" => "c"
I can’t believe all these stuckinthebookheads have been telling you it can’t be done with a pattern. They do not know what they are talking about, as I am about to demonstrate. Believe me, if I can solve Diophantine equations of order one using patterns — AND I CAN! :) — then I can certainly do this simple little bit. In fact, this is very very easy to do with a pattern, provided that you will be content with the leftmost longest such match. For example, just use:
/(.+)\1+/
If that matches, then the string contains a repeated substring.
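Since the question asked for Ruby, here is a minimal sketch of the same pattern wrapped in a method. The method name repeated_unit is hypothetical; the pattern itself is unchanged. Group 1 holds the repeated unit, not the whole run.

```ruby
# Hypothetical helper: returns the repeated unit captured by group 1,
# or nil when no substring is immediately followed by a copy of itself.
def repeated_unit(str)
  m = /(.+)\1+/.match(str)
  m && m[1]
end

p repeated_unit("abcabc")  # "abc"
p repeated_unit("cccc")    # "cc"
p repeated_unit("abc")     # nil
```

Because (.+) is greedy and the engine reports the leftmost match, "cccc" yields "cc" here; a lazy (.+?) would return the shortest unit, "c", instead.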
That’s the same strategy you use when pattern matching to factor composite integers.
First, create a string that is the unary representation of the integer. That would be 1 for 1, 11 for 2, 111 for 3, 1111 for 4, etc. Given such a representation, the pattern to find the largest factor is:
/^(11+)\1+$/
where the first subgroup is the largest factor in unary, wherefore the length of the first group is that largest factor as a regular number. (However, that largest factor may not be prime.)
Similarly,
/^(11+?)\1+$/
is the same except that it now finds the smallest factor, which is of course guaranteed to be prime. I don’t know how to emulate Perl’s x repetition operator in Ruby, so here is a quick demo of this idea using Perl:
$ perl -le 'for $n (@ARGV) { printf "%d is composite and its largest factor is %d.\n", $n, length($1) if ("1" x $n) =~ /^(11+)\1+$/ } ' 5 9 15 24 60 243 891
9 is composite and its largest factor is 3.
15 is composite and its largest factor is 5.
24 is composite and its largest factor is 12.
60 is composite and its largest factor is 30.
243 is composite and its largest factor is 81.
891 is composite and its largest factor is 297.
$ perl -le 'for $n (@ARGV) { printf "%d is composite and its smallest factor is %d.\n", $n, length($1) if ("1" x $n) =~ /^(11+?)\1+$/ } ' 5 9 15 24 60 243 891
9 is composite and its smallest factor is 3.
15 is composite and its smallest factor is 3.
24 is composite and its smallest factor is 2.
60 is composite and its smallest factor is 2.
243 is composite and its smallest factor is 3.
891 is composite and its smallest factor is 3.
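For Ruby readers, the same demo can be sketched with String#*, which repeats a string the way Perl's x does. The method names largest_factor and smallest_factor are hypothetical, and the patterns are copied from above. In Ruby ^ and $ are line anchors, which behaves the same here because the unary string contains no newlines.

```ruby
# Hypothetical helpers: factor n by matching its unary representation.
# Each returns the factor as an integer, or nil when n is not composite.
def largest_factor(n)
  m = /^(11+)\1+$/.match("1" * n)   # greedy group: longest unit that tiles the string
  m && m[1].length
end

def smallest_factor(n)
  m = /^(11+?)\1+$/.match("1" * n)  # lazy group: shortest unit that tiles the string
  m && m[1].length
end

[5, 9, 15, 24, 60, 243, 891].each do |n|
  big = largest_factor(n)
  next unless big                    # primes such as 5 do not match at all
  puts "#{n} is composite: largest factor #{big}, smallest factor #{smallest_factor(n)}."
end
```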
A good pattern to find such things in the dictionary is
/(\w+)\1+/i
so that you do the backreference case insensitively.
This is a quick way to find such things in the dictionary list:
$ perl -MEnglish -nle 'print "$PREMATCH<$MATCH>$POSTMATCH" while /(\w+)(\1+)/gi' /usr/share/dict/words
That finds things like:
b<oo>kkeeper
boo<kk>eeper
bookk<ee>per
when fed bookkeeper. Sorted by substring length, the longest dictionary matches are:
12 ambilly
12 tomy
12
12 c
12
10
10 hydria
10
10 ck
10 macetic
10 farad
10 n
10 sophos
10
10
10 abundance
10 abundant
10 abundantly
10 b
10 ior
10
8 ic
8 ali
8 a
8 body
However, that pattern finds only the leftmost longest such substring. You have to be a lot sneakier to figure out all such substrings, even in the face of overlaps. For example:
2 a- ititious
4 add
ious
4 addious
6 a
6 a
2 aele
4 al
12 ambilly
12 ambilly
2 ambilateralateray
6 inatress
2 aassinatress
2 assainatress
2 assassinatre
2 Caaran
4 Carn
The trick for those is to load up your groups inside a lookahead, turning it into a sneakahead. For example:
/(?=(\w+)(\1+))/i
would be enough to load up the first two groups with the entire match. However, you probably need to keep the prematch and postmatch parts around too, perhaps like this:
/(?=(.*?)(\w+)(\2+)(.*))/i
Now you can do a progressive match to sneakahead and find all such matches, even overlaps! The list I gave above was generated using this:
$ perl -nle 'print length($2 . $3), " $`.$1<$2$3>$4" while /(?=(.*?)(\w+)(\2+)(.*))/gi' /usr/share/dict/words | perl -pe 's/\.//g' | uniq
I am pretty sure that the same approach should translate into Ruby without any trouble, since it’s really a property of the matching engine not of Perl per se.
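A rough Ruby sketch of the sneakahead idea follows. It is not a line-for-line port of the Perl one-liner: it uses the simpler two-group lookahead and rebuilds the prematch and postmatch from the match offset instead of capturing them. The method name repeated_runs is hypothetical.

```ruby
# Hypothetical helper: find every immediately repeated run in a word,
# including runs that overlap. Returns [length, "pre<run>post"] pairs.
def repeated_runs(word)
  results = []
  # The lookahead consumes nothing, so scan retries at every position
  # while groups 1 and 2 still capture the unit and its repeats.
  word.scan(/(?=(\w+)(\1+))/i) do
    m = Regexp.last_match
    run = m[1] + m[2]
    start = m.begin(0)
    pre = word[0, start]
    post = word[(start + run.length)..-1]
    results << [run.length, "#{pre}<#{run}>#{post}"]
  end
  results
end

repeated_runs("bookkeeper").each { |len, shown| puts "#{len} #{shown}" }
# 2 b<oo>kkeeper
# 2 boo<kk>eeper
# 2 bookk<ee>per
```

Sorting the pairs by their first element would give a length-ordered list like the ones shown earlier.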
Still wondering about those Diophantine equations, eh? :) Run this in Perl:
# solve for 12x + 15y + 16z = 281, maximizing x
if (($X, $Y, $Z) =
    (('o' x 281) =~ /^(o*)\1{11}(o*)\2{14}(o*)\3{15}$/))
{
    ($x, $y, $z) = (length($X), length($Y), length($Z));
    print "One solution is: x=$x; y=$y; z=$z.\n";
} else {
    print "No solution.\n";
}
and wonder of wonders, it prints out
One solution is: x=17; y=3; z=2.
As with factoring composite numbers, you can change how you weight these using minimal matching quantifiers. Because the first o* was greedy, x was allowed to grow as large as it could. Changing one or more * quantifiers to *?, +, or +? can produce different solutions:
('o' x 281) =~ /^(o+)\1{11}(o+)\2{14}(o+)\3{15}$/
# One solution is: x=17; y=3; z=2.
('o' x 281) =~ /^(o*?)\1{11}(o*)\2{14}(o*)\3{15}$/
# One solution is: x=0; y=7; z=11.
('o' x 281) =~ /^(o+?)\1{11}(o*)\2{14}(o*)\3{15}$/
# One solution is: x=1; y=3; z=14.
Isn’t that simply incredible? But it’s true. Run the code yourself and you’ll see.
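If you would rather run it in Ruby, here is a minimal sketch under the assumption that Ruby's regex engine backtracks the same way for these patterns. The method name solve is hypothetical; the patterns are the ones shown above, and the lengths of the three captures are x, y and z.

```ruby
# Hypothetical helper: solve 12x + 15y + 16z = total by matching a unary string.
# Returns [x, y, z], or nil when the pattern finds no solution.
def solve(total, pattern = /^(o*)\1{11}(o*)\2{14}(o*)\3{15}$/)
  m = pattern.match("o" * total)
  m && m.captures.map(&:length)
end

p solve(281)                                           # [17, 3, 2]
p solve(281, /^(o*?)\1{11}(o*)\2{14}(o*)\3{15}$/)      # [0, 7, 11]
p solve(281, /^(o+?)\1{11}(o*)\2{14}(o*)\3{15}$/)      # [1, 3, 14]
```

Each result can be checked by hand: for instance 12*17 + 15*3 + 16*2 = 204 + 45 + 32 = 281.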
And yes, if anyone is having déjà lu, you’re right, you indeed have read all this before — because I already wrote it up in The Perl Cookbook, several long aeons ago. It all still holds true.
NB: Credit for this technique must go to (M. Douglas) Doug McIlroy of Bell Labs for first demonstrating this wonder.