I'm looking for a way to test whether a given string repeats itself for the entire string.
Examples:
[
'0045662100456621004566210
This version tries only those candidate sequence lengths that are factors of the string length, and uses the * operator to build a full-length string from the candidate sequence:
def get_shortest_repeat(string):
    length = len(string)
    for i in range(1, length // 2 + 1):
        if length % i:  # skip non-factors early
            continue
        candidate = string[:i]
        if string == candidate * (length // i):
            return candidate
    return None
Thanks to TigerhawkT3 for noticing that length // 2 without + 1 would fail to match the abab case.
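The off-by-one can be seen with a small sketch. The helper name shortest_repeat_upto and its limit parameter are made up for this illustration; it is the same loop as above with the upper bound of range() passed in:

```python
def shortest_repeat_upto(string, limit):
    """Hypothetical helper: try candidate lengths 1 .. limit - 1."""
    length = len(string)
    for i in range(1, limit):
        if length % i == 0 and string == string[:i] * (length // i):
            return string[:i]
    return None

s = "abab"
half = len(s) // 2
print(shortest_repeat_upto(s, half))      # None: range(1, 2) only tries length 1
print(shortest_repeat_upto(s, half + 1))  # ab: range(1, 3) also tries length 2
```

Since range() excludes its upper bound, the + 1 is what lets a candidate of exactly half the string length be tried.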
The problem may also be solved in O(n) in the worst case with the prefix function.
Note that it may be slower in the general case (UPD: and it is much slower) than the other solutions, whose running time depends on the number of divisors of n but which usually hit a mismatch early. I think one of the bad cases for them will be aaa....aab, where there are n - 1 = 2 * 3 * 5 * 7 ... * p_n - 1 a's.
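To make that bad case concrete, here is a rough sketch. It assumes the factor-based approach from the other answers, and the helper name candidate_lengths is invented for this example. With n = 2 * 3 * 5 * 7 the string length has many divisors, and each candidate differs from the string only in the last character:

```python
def candidate_lengths(n):
    """Lengths a factor-based search would try for a string of length n."""
    return [i for i in range(1, n // 2 + 1) if n % i == 0]

n = 2 * 3 * 5 * 7          # 210, a length with many divisors
s = "a" * (n - 1) + "b"    # aaa...aab

lengths = candidate_lengths(n)
print(len(lengths))  # 15

# Every candidate rebuilds a full-length string of a's, which differs from s
# only in the final character, so no candidate matches.
print(any(s == s[:i] * (n // i) for i in lengths))  # False
```

Multiplying in more primes makes the number of candidates grow, while the prefix-function approach still does a single linear pass.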
First of all, you need to calculate the prefix function:
def prefix_function(s):
    n = len(s)
    pi = [0] * n
    for i in range(1, n):
        j = pi[i - 1]
        # fall back to shorter borders until s[i] can extend one
        while j > 0 and s[i] != s[j]:
            j = pi[j - 1]
        if s[i] == s[j]:
            j += 1
        pi[i] = j
    return pi
Then either there's no answer or the shortest period is
k = len(s) - prefix_function(s)[-1]
and you just have to check whether k != n and n % k == 0, where n = len(s). If so, the answer is s[:k]; otherwise there's no answer.
You may check the proof here (in Russian, but an online translator will probably do the trick).
def riad(s):
    n = len(s)
    pi = [0] * n
    for i in range(1, n):
        j = pi[i - 1]
        while j > 0 and s[i] != s[j]:
            j = pi[j - 1]
        if s[i] == s[j]:
            j += 1
        pi[i] = j
    k = n - pi[-1]
    return s[:k] if (n != k and n % k == 0) else None
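As a worked example of the check on k, the sketch below prints the prefix function, k, and the resulting answer for three inputs chosen for illustration: one periodic string, one whose k does not divide n, and one with no border at all.

```python
def prefix_function(s):
    """pi[i] = length of the longest proper prefix of s[:i+1] that is also its suffix."""
    pi = [0] * len(s)
    for i in range(1, len(s)):
        j = pi[i - 1]
        while j > 0 and s[i] != s[j]:
            j = pi[j - 1]
        if s[i] == s[j]:
            j += 1
        pi[i] = j
    return pi

for s in ("abcabcabc", "abcab", "abcd"):
    pi = prefix_function(s)
    n = len(s)
    k = n - pi[-1]
    answer = s[:k] if (k != n and n % k == 0) else None
    print(s, pi, k, answer)
# abcabcabc [0, 0, 0, 1, 2, 3, 4, 5, 6] 3 abc
# abcab [0, 0, 0, 1, 2] 3 None
# abcd [0, 0, 0, 0] 4 None
```

The second input shows why n % k == 0 matters: abcab has k = 3, but 3 does not divide 5, so the string is not a whole number of repeats.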
First, halve the string as long as it's a "2 part" duplicate. This reduces the search space if there are an even number of repeats. Then, working forwards to find the smallest repeating string, check if splitting the full string by an increasingly larger sub-string results in only empty values. Only sub-strings up to length // 2 need to be tested since anything over that would have no repeats.
def shortest_repeat(orig_value):
    if not orig_value:
        return None
    value = orig_value
    while True:
        len_half = len(value) // 2
        first_half = value[:len_half]
        if first_half != value[len_half:]:
            break
        value = first_half
    len_value = len(value)
    split = value.split
    for i in (i for i in range(1, len_value // 2) if len_value % i == 0):
        if not any(split(value[:i])):
            return value[:i]
    return value if value != orig_value else None
This returns the shortest match or None if there is no match.
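The "only empty values" test relies on how str.split behaves when the separator tiles the whole string. A minimal sketch, with a helper name (tiles) made up for this illustration:

```python
def tiles(value, piece):
    """True if splitting value on piece leaves nothing but empty strings."""
    return not any(value.split(piece))

value = "abcabcabc"
print(value.split("abc"))  # ['', '', '', '']
print(value.split("ab"))   # ['', 'c', 'c', 'c']
print(tiles(value, "abc"))  # True
print(tiles(value, "ab"))   # False
```

Empty strings are falsy, so any() returns False only when every piece between separators is empty, which is exactly the case where the sub-string repeats with nothing left over.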
In David Zhang's answer, if we have some sort of circular buffer this will not work: principal_period('6210045662100456621004566210045662100456621') fails due to the starting 621, where I would have liked it to spit out 00456621.
Extending his solution we can use the following:
def principal_period(s):
    for j in range(int(len(s)/2)):
        idx = (s[j:]+s[j:]).find(s[j:], 1, -1)
        if idx != -1:
            # Make sure that the first substring is part of pattern
            if s[:j] == s[j:][:idx][-j:]:
                break
    return None if idx == -1 else s[j:][:idx]
principal_period('6210045662100456621004566210045662100456621')
>>> '00456621'
Here's a concise solution which avoids regular expressions and slow in-Python loops:
def principal_period(s):
    i = (s+s).find(s, 1, -1)
    return None if i == -1 else s[:i]
See the Community Wiki answer started by @davidism for benchmark results. In summary,
David Zhang's solution is the clear winner, outperforming all others by at least 5x for the large example set.
(That answer's words, not mine.)
This is based on the observation that a string is periodic if and only if it is equal to a nontrivial rotation of itself. Kudos to @AleksiTorhamo for realizing that we can then recover the principal period from the index of the first occurrence of s in (s+s)[1:-1], and for informing me of the optional start and end arguments of Python's string.find.
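The rotation observation can be checked by hand on small inputs. The sketch below wraps the search in a helper whose name (first_rotation_index) is invented for this illustration:

```python
def first_rotation_index(s):
    """Index of the first nontrivial occurrence of s inside s + s, or -1."""
    # start=1 skips the trivial match at index 0; end=-1 rules out the
    # trivial match at index len(s), which would need the final character.
    return (s + s).find(s, 1, -1)

print(first_rotation_index("abcabc"))  # 3: 'abcabcabcabc' contains 'abcabc' at index 3
print(first_rotation_index("abcd"))    # -1: 'abcd' matches no nontrivial rotation of itself
```

For a periodic string the returned index is the length of the principal period, which is why s[:i] is the answer.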
Non-regex solution:
def repeat(string):
    for i in range(1, len(string)//2+1):
        if not len(string)%len(string[0:i]) and string[0:i]*(len(string)//len(string[0:i])) == string:
            return string[0:i]
Faster non-regex solution, thanks to @ThatWeirdo (see comments):
def repeat(string):
    l = len(string)
    for i in range(1, len(string)//2+1):
        if l%i:
            continue
        s = string[0:i]
        if s*(l//i) == string:
            return s
The above solution is very rarely slower than the original by a few percent, but it's usually a good bit faster - sometimes a whole lot faster. It's still not faster than davidism's for longer strings, and zero's regex solution is superior for short strings. It comes out to the fastest (according to davidism's test on github - see his answer) with strings of about 1000-1500 characters. Regardless, it's reliably second-fastest (or better) in all cases I tested. Thanks, ThatWeirdo.
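A rough way to compare the two versions yourself is sketched below using the standard timeit module. The function names repeat_original and repeat_faster and the sample inputs are chosen for this sketch and are not taken from davidism's benchmark; timings will vary by machine, so no numbers are shown.

```python
import timeit

def repeat_original(string):
    # First version: re-slices the candidate several times per iteration.
    for i in range(1, len(string) // 2 + 1):
        if not len(string) % len(string[0:i]) and string[0:i] * (len(string) // len(string[0:i])) == string:
            return string[0:i]

def repeat_faster(string):
    # Second version: skips non-factor lengths before slicing.
    l = len(string)
    for i in range(1, l // 2 + 1):
        if l % i:
            continue
        s = string[0:i]
        if s * (l // i) == string:
            return s

samples = ["abcde" * 200, "a" * 999 + "b"]  # arbitrary inputs for this sketch
for sample in samples:
    # Both versions should agree before their timings are compared.
    assert repeat_original(sample) == repeat_faster(sample)
    for func in (repeat_original, repeat_faster):
        seconds = timeit.timeit(lambda: func(sample), number=20)
        print(func.__name__, len(sample), round(seconds, 4))
```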
Test:
print(repeat('009009009'))
print(repeat('254725472547'))
print(repeat('abcdeabcdeabcdeabcde'))
print(repeat('abcdefg'))
print(repeat('09099099909999'))
print(repeat('02589675192'))
Results:
009
2547
abcde
None
None
None