Check if string is repetition of an unknown substring

星月不相逢 2021-01-27 04:46

I'm trying to write a regex or Ruby method which will find the longest repeated pattern in a string. For example:

"abcabc" => "abc"
"cccc" => "c"


        
2 Answers
  •  慢半拍i (original poster)
    2021-01-27 05:04

    I can’t believe all these stuckinthebookheads have been telling you it can’t be done with a pattern. They do not know what they are talking about, as I am about to demonstrate. Believe me, if I can solve Diophantine equations of order one using patterns — AND I CAN! :) — then I can certainly do this simple little bit. In fact, this is very very easy to do with a pattern, provided that you will be content with the leftmost longest such match. For example, just use:

    /(.+)\1+/
    

    If that matches, then the string contains a repeated substring.
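
Since the question asks for Ruby, here is a minimal sketch of that check wrapped in methods. The helper names `repeated_substring` and `repeating_unit` are made up for illustration. The second, anchored and non-greedy variant is an added assumption about what the asker wants for inputs like "cccc" (the shortest unit that builds the whole string); it is not part of the pattern above.

```ruby
# Hypothetical helper: returns the leftmost repeated unit found anywhere
# in the string (greedy, so as long as possible at that position), or nil.
def repeated_substring(str)
  m = /(.+)\1+/.match(str)
  m && m[1]
end

# Hypothetical helper (assumption): returns the shortest unit whose
# repetition makes up the WHOLE string, or nil if there is none.
def repeating_unit(str)
  m = /\A(.+?)\1+\z/.match(str)
  m && m[1]
end

p repeated_substring("abcabc")
p repeated_substring("cccc")
p repeating_unit("cccc")
```

Note that the greedy pattern reports "cc" for "cccc", because the group grabs as much as it can while still leaving room for one repeat; the anchored lazy variant reports "c".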

    Integer Factorization with Patterns

    That’s the same strategy you use to factor composite integers with pattern matching.

    First, create a string that is the unary representation of the integer. That would be 1 for 1, 11 for 2, 111 for 3, 1111 for 4, etc. Given such a representation, the pattern to find the largest factor is:

    /^(11+)\1+$/   
    

    where the first subgroup is the largest factor in unary, wherefore the length of the first group is that largest factor as a regular number. (However, that largest factor may not be prime.)

    Similarly,

    /^(11+?)\1+$/
    

    is the same except that it now finds the smallest factor, which is of course guaranteed to be prime. I don’t know how to emulate Perl’s x repetition operator in Ruby, so here is a quick demo of this idea using Perl:

    $ perl -le 'for $n (@ARGV) { printf "%d is composite and its largest factor is %d.\n", $n, length($1) if ("1" x $n) =~ /^(11+)\1+$/ } ' 5 9 15 24 60 243 891
    9 is composite and its largest factor is 3.
    15 is composite and its largest factor is 5.
    24 is composite and its largest factor is 12.
    60 is composite and its largest factor is 30.
    243 is composite and its largest factor is 81.
    891 is composite and its largest factor is 297.
    
    $ perl -le 'for $n (@ARGV) { printf "%d is composite and its smallest factor is %d.\n", $n, length($1) if ("1" x $n) =~ /^(11+?)\1+$/ } ' 5 9 15 24 60 243 891
    9 is composite and its smallest factor is 3.
    15 is composite and its smallest factor is 3.
    24 is composite and its smallest factor is 2.
    60 is composite and its smallest factor is 2.
    243 is composite and its smallest factor is 3.
    891 is composite and its smallest factor is 3.
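
For Ruby readers: `String#*` repeats a string, so `"1" * n` should stand in for Perl's `"1" x $n`. The following is a rough sketch of the same two demos under that assumption; the method names `largest_factor` and `smallest_factor` are hypothetical, and `\A`/`\z` are used in place of `^`/`$` because Ruby treats the latter as line anchors.

```ruby
# Hypothetical helper: largest proper factor of n via the unary pattern,
# or nil when n has no factor other than 1 and itself.
def largest_factor(n)
  m = /\A(11+)\1+\z/.match("1" * n)
  m && m[1].length
end

# Hypothetical helper: smallest factor (>= 2) of a composite n, or nil.
def smallest_factor(n)
  m = /\A(11+?)\1+\z/.match("1" * n)
  m && m[1].length
end

[5, 9, 15, 24, 60, 243, 891].each do |n|
  big = largest_factor(n)
  next unless big # primes such as 5 do not match at all
  puts "#{n} is composite: largest factor #{big}, smallest factor #{smallest_factor(n)}."
end
```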
    

    A good pattern to find such things in the dictionary is

    /(\w+)\1+/i
    

    so that you do the backreference case insensitively.

    Trolling the Dictionary

    This is a quick way to find such things in the dictionary list:

    $ perl -MEnglish -nle 'print "$PREMATCH<$MATCH>$POSTMATCH" while /(\w+)(\1+)/gi' /usr/share/dict/words 
    

    That finds things like:

    b<oo>kkeeper
    boo<kk>eeper
    bookk<ee>per
    

    when fed bookkeeper. Sorted by substring length, the longest dictionary matches are:

    12 ambilly
    12 tomy
    12 
    12 c
    12 
    10 
    10 hydria
    10 
    10 ck
    10 macetic
    10 farad
    10 n
    10 sophos
    10 
    10 
    10 abundance
    10 abundant
    10 abundantly
    10 b
    10 ior
    10 
    8 ic
    8 ali
    8 a
    8 body
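
The dictionary one-liner above could be approximated in Ruby roughly as follows. This is only a sketch: `mark_repeats` is a made-up name, and it relies on `Regexp.last_match` being set inside the `String#scan` block to get at the prematch and postmatch.

```ruby
# Hypothetical helper: marks each non-overlapping repeat as pre<match>post,
# mirroring the "$PREMATCH<$MATCH>$POSTMATCH" output of the Perl one-liner.
def mark_repeats(word)
  marked = []
  word.scan(/(\w+)(\1+)/i) do
    m = Regexp.last_match # MatchData for the current scan hit
    marked << "#{m.pre_match}<#{m[0]}>#{m.post_match}"
  end
  marked
end

puts mark_repeats("bookkeeper")
```

To run it over a word list, one could feed each line of a file such as /usr/share/dict/words through `mark_repeats`.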
    

    Sneaky Lookaheads => Sneakaheads

    However, that finds only the leftmost longest such substring. You have to be a lot sneakier to find all such substrings, even in the face of overlaps. For example:

    2 aititious
    4 addious
    4 addious
    6 a
    6 a
    2 aele
    4 al
    12 ambilly
    12 ambilly
    2 ambilateralateray
    6 inatress
    2 aassinatress
    2 assainatress
    2 assassinatre
    2 Caaran
    4 Carn

    The trick for those is to load up your groups inside a lookahead, turning it into a sneakahead. For example:

    /(?=(\w+)(\1+))/i
    

    would be enough to load up the first two groups with the entire match. However, you probably need to keep the prematch and postmatch parts around too, perhaps like this:

    /(?=(.*?)(\w+)(\2+)(.*))/i
    

    Now you can do a progressive match to sneakahead and find all such matches, even overlaps! The list I gave above was generated using this:

    $ perl -nle 'print length($2 . $3), " $`.$1<$2$3>$4" while /(?=(.*?)(\w+)(\2+)(.*))/gi' /usr/share/dict/words | perl -pe 's/\.//g' | uniq
    

    I am pretty sure that the same approach should translate into Ruby without any trouble, since it’s really a property of the matching engine, not of Perl per se.
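
As a tentative illustration of that translation, here is a Ruby sketch of the lookahead trick. The name `overlapping_repeats` is hypothetical. It assumes that `String#scan` steps forward one character after each zero-width match, so every starting position gets a chance to match, overlaps included.

```ruby
# Hypothetical helper: finds repeats starting at every position, including
# overlapping ones, by capturing inside a zero-width lookahead.
# Returns [length, "pre<hit>post"] pairs.
def overlapping_repeats(word)
  results = []
  word.scan(/(?=(\w+)(\1+))/i) do |unit, rest|
    m = Regexp.last_match
    hit = unit + rest
    # The match itself is empty, so post_match still starts with the hit.
    tail = m.post_match[hit.length..-1]
    results << [hit.length, "#{m.pre_match}<#{hit}>#{tail}"]
  end
  results
end

overlapping_repeats("aaaa").each { |len, marked| puts "#{len} #{marked}" }
```

Unlike the plain pattern, this reports a hit at each of the first three positions of "aaaa", not just the leftmost one.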


    Bonus Solution

    Still wondering about those Diophantine equations, eh? :) Run this in Perl:

      # solve for 12x + 15y + 16z = 281, maximizing x
      if (($X, $Y, $Z)  =
         (('o' x 281)  =~ /^(o*)\1{11}(o*)\2{14}(o*)\3{15}$/))
      {
          ($x, $y, $z) = (length($X), length($Y), length($Z));
          print "One solution is: x=$x; y=$y; z=$z.\n";
      } else {
          print "No solution.\n";
      }
    

    and wonder of wonders, it prints out

    One solution is: x=17; y=3; z=2.
    

    As with factoring composite numbers, you can change how you weight these using minimal matching quantifiers. Because the first o* was greedy, x was allowed to grow as large as it could. Changing one or more * quantifiers to *?, +, or +? can produce different solutions:

      ('o' x 281)  =~ /^(o+)\1{11}(o+)\2{14}(o+)\3{15}$/
      # One solution is: x=17; y=3; z=2
    
      ('o' x 281)  =~ /^(o*?)\1{11}(o*)\2{14}(o*)\3{15}$/
      # One solution is: x=0; y=7; z=11.
    
      ('o' x 281)  =~ /^(o+?)\1{11}(o*)\2{14}(o*)\3{15}$/
      # One solution is: x=1; y=3; z=14.
    

    Isn’t that simply incredible? But it’s true. Run the code yourself and you’ll see.
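
For anyone who would rather try this in Ruby, here is a sketch of the same equation, assuming `"o" * 281` as the counterpart of Perl's `'o' x 281`. The method name `solve_281` and its `lazy_x` flag are invented for this example.

```ruby
# Hypothetical helper: solves 12x + 15y + 16z = 281 with a backtracking regex.
# Each group's length is one unknown; \1{11} adds 11 more copies, giving 12x.
def solve_281(lazy_x: false)
  pattern = if lazy_x
              /\A(o*?)\1{11}(o*)\2{14}(o*)\3{15}\z/ # minimize x first
            else
              /\A(o*)\1{11}(o*)\2{14}(o*)\3{15}\z/  # maximize x first
            end
  m = pattern.match("o" * 281)
  m && m.captures.map(&:length)
end

x, y, z = solve_281
puts "One solution is: x=#{x}; y=#{y}; z=#{z}."
```

Calling `solve_281(lazy_x: true)` swaps in the `*?` quantifier for the first group, which makes the engine try small values of x first.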

    And yes, if anyone is having déjà lu, you’re right, you indeed have read all this before — because I already wrote it up in The Perl Cookbook, several long aeons ago. It all still holds true.

    NB: Credit for this technique must go to (M. Douglas) Doug McIlroy of Bell Labs for first demonstrating this wonder.
