Shortest Repeating Sub-String

I am looking for an efficient way to extract the shortest repeating substring. For example:

input1 = 'dabcdbcdbcdd'
output1 = 'bcd'

input2 = 'cbabababac


                      
2 Answers

孤城傲影 (2021-01-18 13:05)

^ matches only at the start of the string, and in your examples the repeating substrings do not start there. The same applies to $ at the end. Without ^ and $, the pattern .*? always matches the empty string, so the group needs .+? instead. Demo:

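import re

def srp(s):
    # (.+?) lazily captures a candidate unit; \1+ requires at least one
    # immediate repetition of that unit.
    return re.search(r'(.+?)\1+', s).group(1)

print(srp('dabcdbcdbcdd'))  # -> bcd
print(srp('cbabababac'))    # -> ba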


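Note, though, that this does not necessarily find the shortest repeating substring: for 'dabcdbcdbcdd' it returns 'bcd', although 'd' also repeats (the final 'dd').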
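
One practical detail: re.search returns None when nothing in the string repeats, so srp as written raises AttributeError on an input such as 'abc'. A minimal sketch of a more defensive variant follows; the name srp_or_none is made up for this example and is not part of the original answer.

```python
import re

def srp_or_none(s):
    """Return the unit of the first adjacent repeat, or None if nothing repeats."""
    m = re.search(r'(.+?)\1+', s)
    # re.search gives None on failure, so guard before calling .group()
    return m.group(1) if m else None

print(srp_or_none('dabcdbcdbcdd'))  # bcd
print(srp_or_none('abc'))           # None
```

Returning None keeps "no repeat" distinguishable from any real result; an empty string would work as well if the caller prefers a string in every case.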
                                                                        
                                                        
            
            
              
                
自闭症患者 (2021-01-18 13:09)

A quick fix for this pattern could be

(.+?)\1+


Your regex failed because it anchored the repeating string to the start and end of the line, only allowing strings like abcabcabc but not xabcabcabcx. Also, the minimum length of the repeated string should be 1, not 0 (or any string would match), therefore .+? instead of .*?.
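
Both failure modes can be reproduced in a few lines. The question's original regex is cut off above, so the anchored pattern below is only an assumed shape of it (a repeating group between ^ and $), used here for illustration.

```python
import re

# Assumed shape of the anchored attempt: the whole string must consist of
# repeats, so one extra character at either end makes the match fail.
anchored = re.compile(r'^(.+?)\1+$')
print(anchored.match('abcabcabc').group(1))  # abc
print(anchored.match('xabcabcabcx'))         # None

# With .*? the group is allowed to be empty, and an empty string
# trivially "repeats" at the very first position.
empty_ok = re.compile(r'(.*?)\1+')
print(repr(empty_ok.search('dabcdbcdbcdd').group(1)))  # ''
```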

In Python:

>>> import re
>>> r = re.compile(r"(.+?)\1+")
>>> r.findall("cbabababac")
['ba']
>>> r.findall("dabcdbcdbcdd")
['bcd']


But be aware that this regex only finds non-overlapping matches, so in the last example the solution d is not found, although it is the shortest repeating string. Another example: here abcd is not found, because the abc part of the first abcd has already been used up by the first match:

>>> r.findall("abcabcdabcd")
['abc']
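
To see which characters each match uses up, it can help to print the match spans with finditer. This is only an illustrative sketch of the consumption described above:

```python
import re

r = re.compile(r"(.+?)\1+")
for m in r.finditer("abcabcdabcd"):
    # m.span() is the whole consumed region, m.group(1) the repeated unit
    print(m.span(), m.group(1))
# (0, 6) abc
```

The single match covers indices 0 to 5 ("abcabc"), which leaves only "dabcd" for the rest of the scan, and that remainder contains no adjacent repeat.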


Also, it may return several matches, so you'd need to find the shortest one in a second step:

>>> r.findall("abcdabcdabcabc")
['abcd', 'abc']


Better solution:

To allow the engine to also find overlapping matches, use

(.+?)(?=\1)


This reports some strings twice or more if they are repeated often enough, and it also picks up repeats that the previous pattern missed because part of them had already been consumed by an earlier match:

>>> r = re.compile(r"(.+?)(?=\1)")
>>> r.findall("dabcdbcdbcdd")
['bcd', 'bcd', 'd']


Therefore, select the shortest of the results:

>>> min(r.findall("dabcdbcdbcdd") or [""], key=len)
'd'


The or [""] (thanks to J. F. Sebastian!) ensures that no ValueError is triggered if there's no match at all.
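
One caveat with (.+?)(?=\1): the captured group itself is still consumed, so a repeat that starts inside a captured group can be skipped. For "abaaba" the pattern returns only ['aba'] and misses the 'a' of the middle 'aa'. A possible workaround, sketched here as an assumption rather than as part of the original answer, is to put the whole pattern inside a lookahead so that nothing is consumed and every start position is tried. The function names are made up for this example, and the brute-force version is only there as a cross-check.

```python
import re

def shortest_repeat(s):
    """Shortest substring that is immediately followed by a copy of itself."""
    # The entire pattern sits inside a lookahead, so the match is
    # zero-width and findall retries at every position in the string.
    # DOTALL lets the unit contain newlines as well.
    units = re.findall(r"(?=(.+?)\1)", s, flags=re.DOTALL)
    return min(units or [""], key=len)

def shortest_repeat_bruteforce(s):
    """Reference check: try every unit length, shortest first."""
    n = len(s)
    for k in range(1, n // 2 + 1):
        for i in range(n - 2 * k + 1):
            if s[i:i + k] == s[i + k:i + 2 * k]:
                return s[i:i + k]
    return ""

print(shortest_repeat("dabcdbcdbcdd"))  # d
print(shortest_repeat("abaaba"))        # a
```

Because the lazy group yields the shortest repeat at each start position, the minimum over all positions is the shortest adjacent repeat overall. Heavy backtracking can make this slow on long inputs, so it is best treated as a convenience for short strings.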
                                                                        
                                                        
            
            
              
                