Check if string is repetition of an unknown substring

前端未结

关注

 2  1089

I\'m trying to write a regex or Ruby method which will find the longest repeated pattern in a string. For example:

\"abcabc\"  => \"abc\"  
\"cccc\" => \"c


                      
              相关标签:


      
      
        
          2条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  慢半拍i        
                
              
                            
                2021-01-27 05:04
              
            
            
                                                                       
I can’t believe all these stuckinthebookheads have been telling you it can’t be done with a pattern.  They do not know what they are talking about, as I am about to demonstrate. Believe me, if I can solve Diophantine equations of order one using patterns — AND I CAN! :) — then I can certainly do this simple little bit.  In fact, this is very very easy to do with a pattern, provided that you will be content with the leftmost longest such match.  For example, just use: 

/(.+)\1+/


If that matches, then the string contains a repeated substring. 

Integer Factorization with Patterns

That’s the same strategy that you use pattern matching to factor composite integers.  

First, create a string that is the unary representation of the integer.  That would be 1 for 1, 11 for 2, 111 for 3, 1111 for 4, etc.  Given such a representation, the pattern to find the largest factor is:

/^(11+)\1+$/   


where the first subgroup is the largest factor in unary, wherefore the length of the first group is that largest factor as a regular number. (However, that largest factor may not be prime.)

Similarly, 

/^(11+?)\1+$/


is the same except that it now finds the smallest factor, which is of course guaranteed to be prime.  I don’t know how to emulate Perl’s x repetition operator in Ruby, so here is a quick demo of this idea using Perl:

$ perl -le 'for $n (@ARGV) { printf "%d is composite and its largest factor is %d.\n", $n, length($1) if ("1" x $n) =~ /^(11+)\1+$/ } ' 5 9 15 24 60 243 891
9 is composite and its largest factor is 3.
15 is composite and its largest factor is 5.
24 is composite and its largest factor is 12.
60 is composite and its largest factor is 30.
243 is composite and its largest factor is 81.
891 is composite and its largest factor is 297.

$ perl -le 'for $n (@ARGV) { printf "%d is composite and its smallest factor is %d.\n", $n, length($1) if ("1" x $n) =~ /^(11+?)\1+$/ } ' 5 9 15 24 60 243 891
9 is composite and its smallest factor is 3.
15 is composite and its smallest factor is 3.
24 is composite and its smallest factor is 2.
60 is composite and its smallest factor is 2.
243 is composite and its smallest factor is 3.
891 is composite and its smallest factor is 3.


A good pattern to find such things in the dictionary is 

/(\w+)\1+/i


so that you do the backreference case insensitively.

Trolling the Dictionary

This is a quick way to find such things in the dictionary list:

$ perl -MEnglish -nle 'print "$PREMATCH<$MATCH>$POSTMATCH" while /(\w+)(\1+)/gi' /usr/share/dict/words 


That finds things like:

b<oo>kkeeper
boo<kk>eeper
bookk<ee>per


when fed bookkeeper.  Sorted by substring length, the longest dictionary matches are:

12 ambi<lateralatera>lly
12 <gastrogastro>tomy
12 <killeekillee>
12 <quadriquadri>c
12 <tangantangan>
10 <Bellabella>
10 hydr<ophobophob>ia
10 <kerrikerri>
10 <kinnikinni>ck
10 m<ethylethyl>acetic
10 <micromicro>farad
10 <micromicro>n
10 <philophilo>sophos
10 <quinaquina>
10 <strumstrum>
10 <supersuper>abundance
10 <supersuper>abundant
10 <supersuper>abundantly
10 <supersuper>b
10 <supersuper>ior
10 <Wallawalla>
8 <acetacet>ic
8 ali<bangbang>
8 <anapanap>a
8 <antianti>body


Sneaky Lookaheads => Sneakaheads

However, it is the leftmost longest such occurring substring. You have to be a lot sneakier to figure out all such substrings even in the fact of overlaps.  For example:

2 a<dd>ititious
4 add<itit>ious
4 addi<titi>ous
6 <alfalf>a
6 a<lfalfa>
2 a<ll>ele
4 al<lele>
12 ambi<lateralatera>lly
12 ambil<ateralateral>ly
2 ambilateralatera<ll>y
6 <assass>inatress
2 a<ss>assinatress
2 assa<ss>inatress
2 assassinatre<ss>
2 Ca<rr>aran
4 Car<rara>n


The trick for those is to load up your groups inside a lookahead, turning it into a sneakahead.  For example:

/(?=(\w+)(\1+))/i


would be enough to load up the first two groups with the entire match. However, you probably need to keep the prematch and postmatch parts around too, perhaps like this:

/(?=(.*?)(\w+)(\2+)(.*))/i


now you can do a progressive match to sneakahead and find all such matches, even overlaps! The list I gave above was generated using this:

$ perl -nle 'print length($2 . $3), " $`.$1<$2$3>$4" while /(?=(.*?)(\w+)(\2+)(.*))/gi' /usr/share/dict/words | perl -pe 's/\.//g' | uniq


I am pretty sure that the same approach should translate into Ruby without any trouble, since it’s really a property of the matching engine not of Perl per se.



Bonus Solution

Still wondering about those Diophantine equations, eh? :)  Run this in Perl:

  # solve for 12x + 15y + 16z = 281, maximizing x
  if (($X, $Y, $Z)  =
     (('o' x 281)  =~ /^(o*)\1{11}(o*)\2{14}(o*)\3{15}$/))
  {
      ($x, $y, $z) = (length($X), length($Y), length($Z));
      print "One solution is: x=$x; y=$y; z=$z.\n";
  } else {
      print "No solution.\n";
  }


and wonder of wonders, it prints out 

One solution is: x=17; y=3; z=2.


As with factoring composite numbers, you can change how you weight these using minimal matching quantifiers.  Because the first o* was greedy, x was allowed to grow as
large as it could.  Changing one or more * quantifiers to *?,
+, or +? can produce different solutions:

  ('o' x 281)  =~ /^(o+)\1{11}(o+)\2{14}(o+)\3{15}$/
  # One solution is: x=17; y=3; z=2

  ('o' x 281)  =~ /^(o*?)\1{11}(o*)\2{14}(o*)\3{15}$/
  # One solution is: x=0; y=7; z=11.

  ('o' x 281)  =~ /^(o+?)\1{11}(o*)\2{14}(o*)\3{15}$/
  # One solution is: x=1; y=3; z=14.


Isn’t that simply incredible?  But it’s true.  Run the code yourself and you’ll see.

And yes, if anyone is having déjà lu, you’re right, you indeed have read all this before — because I already wrote it up in The Perl Cookbook, several long aeons ago.  It all still holds true.  


  NB: Credit for this technique must go to (M. Douglas) Doug McIlroy of Bell Labs for first demonstrating this wonder.

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  -上瘾入骨i        
                
              
                            
                2021-01-27 05:20
              
            
            
                                                                       
I knew it couldn't be that complicated, so I thought it over and found a solution:

def unrepeat(str)
  n = str.size

  newstr = str
  n.times do |i|
     newstr = newstr[-1] + newstr[0..-2]
     if newstr == str
        return i + 1
     end
  end
end


This will return the length of the repeated pattern. It finds this by generating rotations of the string.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复