I am looking for an efficient way to extract the shortest repeating substring. For example:
input1 = \'dabcdbcdbcdd\'
ouput1 = \'bcd\'
input2 = \'cbabababac
^
matches at the start of a string. In your example the repeating substrings don't start at the beginning. Similar for $
. Without ^
and $
the pattern .*?
always matches empty string. Demo:
import re
def srp(s):
return re.search(r'(.+?)\1+', s).group(1)
print srp('dabcdbcdbcdd') # -> bcd
print srp('cbabababac') # -> ba
Though It doesn't find the shortest substring.
A quick fix for this pattern could be
(.+?)\1+
Your regex failed because it anchored the repeating string to the start and end of the line, only allowing strings like abcabcabc
but not xabcabcabcx
. Also, the minimum length of the repeated string should be 1, not 0 (or any string would match), therefore .+?
instead of .*?
.
In Python:
>>> import re
>>> r = re.compile(r"(.+?)\1+")
>>> r.findall("cbabababac")
['ba']
>>> r.findall("dabcdbcdbcdd")
['bcd']
But be aware that this regex will only find non-overlapping repeating matches, so in the last example, the solution d
will not be found although that is the shortest repeating string. Or see this example: here it can't find abcd
because the abc
part of the first abcd
has been used up in the first match):
>>> r.findall("abcabcdabcd")
['abc']
Also, it may return several matches, so you'd need to find the shortest one in a second step:
>>> r.findall("abcdabcdabcabc")
['abcd', 'abc']
Better solution:
To allow the engine to also find overlapping matches, use
(.+?)(?=\1)
This will find some strings twice or more, if they are repeated enough times, but it will certainly find all possible repeating substrings:
>>> r = re.compile(r"(.+?)(?=\1)")
>>> r.findall("dabcdbcdbcdd")
['bcd', 'bcd', 'd']
Therefore, you should sort the results by length and return the shortest one:
>>> min(r.findall("dabcdbcdbcdd") or [""], key=len)
'd'
The or [""]
(thanks to J. F. Sebastian!) ensures that no ValueError
is triggered if there's no match at all.