How can I tell if a string repeats itself in Python?

栀梦 2020-11-27 09:07

I'm looking for a way to test whether a given string repeats itself for the entire string.

Examples:

[
    '0045662100456621004566210


        
13 answers
  • 2020-11-27 09:12

    Here's a solution using regular expressions.

    import re
    
    REPEATER = re.compile(r"(.+?)\1+$")
    
    def repeated(s):
        match = REPEATER.match(s)
        return match.group(1) if match else None
    

    Iterating over the examples in the question:

    examples = [
        '0045662100456621004566210045662100456621',
        '0072992700729927007299270072992700729927',
        '001443001443001443001443001443001443001443',
        '037037037037037037037037037037037037037037037',
        '047619047619047619047619047619047619047619',
        '002457002457002457002457002457002457002457',
        '001221001221001221001221001221001221001221',
        '001230012300123001230012300123001230012300123',
        '0013947001394700139470013947001394700139470013947',
        '001001001001001001001001001001001001001001001001001',
        '001406469760900140646976090014064697609',
        '004608294930875576036866359447',
        '00469483568075117370892018779342723',
        '004739336492890995260663507109',
        '001508295625942684766214177978883861236802413273',
        '007518796992481203',
        '0071942446043165467625899280575539568345323741',
        '0434782608695652173913',
        '0344827586206896551724137931',
        '002481389578163771712158808933',
        '002932551319648093841642228739',
        '0035587188612099644128113879',
        '003484320557491289198606271777',
        '00115074798619102416570771',
    ]
    
    for e in examples:
        sub = repeated(e)
        if sub:
            print("%r: %r" % (e, sub))
        else:
            print("%r does not repeat." % e)
    

    ... produces this output:

    '0045662100456621004566210045662100456621': '00456621'
    '0072992700729927007299270072992700729927': '00729927'
    '001443001443001443001443001443001443001443': '001443'
    '037037037037037037037037037037037037037037037': '037'
    '047619047619047619047619047619047619047619': '047619'
    '002457002457002457002457002457002457002457': '002457'
    '001221001221001221001221001221001221001221': '001221'
    '001230012300123001230012300123001230012300123': '00123'
    '0013947001394700139470013947001394700139470013947': '0013947'
    '001001001001001001001001001001001001001001001001001': '001'
    '001406469760900140646976090014064697609': '0014064697609'
    '004608294930875576036866359447' does not repeat.
    '00469483568075117370892018779342723' does not repeat.
    '004739336492890995260663507109' does not repeat.
    '001508295625942684766214177978883861236802413273' does not repeat.
    '007518796992481203' does not repeat.
    '0071942446043165467625899280575539568345323741' does not repeat.
    '0434782608695652173913' does not repeat.
    '0344827586206896551724137931' does not repeat.
    '002481389578163771712158808933' does not repeat.
    '002932551319648093841642228739' does not repeat.
    '0035587188612099644128113879' does not repeat.
    '003484320557491289198606271777' does not repeat.
    '00115074798619102416570771' does not repeat.
    

    The regular expression (.+?)\1+$ is divided into three parts:

    1. (.+?) is a matching group containing at least one (but as few as possible) of any character (because +? is non-greedy).

    2. \1+ checks for at least one repetition of the matching group in the first part.

    3. $ checks for the end of the string, to ensure that there's no extra, non-repeating content after the repeated substrings (and using re.match() ensures that there's no non-repeating text before the repeated substrings).

    In Python 3.4 and later, you could drop the $ and use re.fullmatch() instead, or (in any Python at least as far back as 2.3) go the other way and use re.search() with the regex ^(.+?)\1+$, all of which are more down to personal taste than anything else.
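
    The two alternatives mentioned above can be sketched roughly as follows. The function names repeated_fullmatch and repeated_search are made up for this illustration, and re.fullmatch() assumes Python 3.4 or later:

```python
import re

# Variant 1: no trailing $; fullmatch() anchors the pattern at both ends.
FULL = re.compile(r"(.+?)\1+")
# Variant 2: explicit ^ and $ anchors, used with search().
ANCHORED = re.compile(r"^(.+?)\1+$")

def repeated_fullmatch(s):
    # Hypothetical name; returns the shortest repeating unit or None.
    match = FULL.fullmatch(s)
    return match.group(1) if match else None

def repeated_search(s):
    # Hypothetical name; same result via search() and explicit anchors.
    match = ANCHORED.search(s)
    return match.group(1) if match else None
```

    For the strings in this question, both variants should return the same result as repeated() above.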

  • 2020-11-27 09:12

    Here's a straightforward solution, without regexes.

    For each prefix of s, with lengths from 1 through len(s) - 1, check whether that prefix, substr, is the repeated pattern. The check concatenates substr with itself ratio times, so that the resulting string is as long as s. Hence ratio = len(s) // len(substr).

    Return when the first such substring is found. This gives the smallest possible substring, if one exists.

    def check_repeat(s):
        # Try every proper prefix, shortest first.
        for i in range(1, len(s)):
            substr = s[:i]
            # Floor division keeps the repeat count an int on Python 3.
            ratio = len(s) // len(substr)
            if substr * ratio == s:
                print('Repeating on "%s"' % substr)
                return
        print('Non repeating')
    
    >>> check_repeat('254725472547')
    Repeating on "2547"
    >>> check_repeat('abcdeabcdeabcdeabcde')
    Repeating on "abcde"
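
    If the result is needed as a value rather than printed, the same idea might look like the sketch below. The name smallest_repeat is hypothetical, and skipping prefix lengths that do not divide len(s) is an added shortcut, not part of the function above:

```python
def smallest_repeat(s):
    """Return the shortest prefix that tiles s, or None if there is none."""
    n = len(s)
    # A repeating unit can be at most half the length of the string.
    for i in range(1, n // 2 + 1):
        ratio, remainder = divmod(n, i)
        if remainder:
            continue  # a unit of length i cannot tile a string of length n
        if s[:i] * ratio == s:
            return s[:i]
    return None
```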
    
  • 2020-11-27 09:13

    Here is Python code that checks whether a substring repeats throughout the main string given by the user.

    import sys

    print("Enter a string....")
    # mainstring = string given by the user
    mainstring = input(">")
    if mainstring == '':
        print("Invalid string")
        sys.exit()
    # charlist = character list of mainstring
    charlist = list(mainstring)
    strarr = ''
    print("Length of your string :", len(mainstring))
    for i in range(0, len(mainstring)):
        # Grow the candidate prefix by one character per iteration.
        strarr = strarr + charlist[i]
        splitlist = mainstring.split(strarr)
        # Count the empty pieces; if every piece is empty, the prefix tiles the string.
        count = 0
        for j in splitlist:
            if j == '':
                count += 1
        if count == len(splitlist):
            break
    if count == len(splitlist):
        # Two empty pieces means the prefix is the whole string, so nothing repeats.
        if count == 2:
            print("No repeating Sub-String found in string %r" % (mainstring))

        else:
            print("Sub-String %r repeats in string %r" % (strarr, mainstring))
    else:
        print("No repeating Sub-String found in string %r" % (mainstring))
    

    Input:

    0045662100456621004566210045662100456621

    Output :

    Length of your string : 40

    Sub-String '00456621' repeats in string '0045662100456621004566210045662100456621'

    Input :

    004608294930875576036866359447

    Output:

    Length of your string : 30

    No repeating Sub-String found in string '004608294930875576036866359447'
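
    The split-based check can also be wrapped in a function. This is only a sketch of the same idea; the name split_repeat is invented here, and stopping at half the string length is an assumption that replaces the count == 2 test above:

```python
def split_repeat(s):
    """Return the shortest prefix whose repetition forms s, or None."""
    for end in range(1, len(s) // 2 + 1):
        prefix = s[:end]
        # Splitting on the prefix leaves only empty pieces
        # exactly when s is that prefix repeated.
        if not any(s.split(prefix)):
            return prefix
    return None
```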

  • 2020-11-27 09:15

    This function runs very quickly: in my tests it is over 3 times faster than the fastest other solution here on strings of over 100k characters, and the gap widens the longer the repeating pattern is. It tries to minimise the number of comparisons needed to get the answer:

    def repeats(string):
        n = len(string)
        tried = set()
        best = None
        # Divisors of n from 2 up to sqrt(n).
        nums = [i for i in range(2, int(n**0.5) + 1) if n % i == 0]
        # Candidate lengths, largest first: cofactors, small divisors, then 1.
        nums = [n // i for i in nums if n // i != i] + list(reversed(nums)) + [1]
        for s in nums:
            # Skip s if it divides a length that has already failed.
            if all(t % s for t in tried):
                print('Trying repeating string of length:', s)
                if string[:s] * (n // s) == string:
                    best = s
                else:
                    tried.add(s)
        if best:
            return string[:best]
    

    Note that, for example, for a string of length 8 it checks only a fragment of size 4. It does not have to test further, because a repeating pattern of length 1 or 2 would also produce a repeating pattern of length 4:

    >>> repeats('12345678')
    Trying repeating string of length: 4
    None
    
    # for this one we need only 2 checks 
    >>> repeats('1234567812345678')
    Trying repeating string of length: 8
    Trying repeating string of length: 4
    '12345678'
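
    The pruning relies on one fact: a string built from blocks of length d is also built from blocks of any multiple of d that divides its length. So once a length fails, all of its divisors can be skipped. A small check of that claim, using a hypothetical helper named tiles:

```python
def tiles(string, size):
    """True if string is its first `size` characters repeated."""
    n = len(string)
    return n % size == 0 and string[:size] * (n // size) == string

s = 'abababab'            # length 8, built from blocks of length 2
assert tiles(s, 2)
assert tiles(s, 4)        # 4 is a multiple of 2 that divides 8

t = '12345678'
assert not tiles(t, 4)
# Contrapositive: since length 4 fails, lengths 2 and 1 must fail too.
assert not tiles(t, 2) and not tiles(t, 1)
```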
    
  • 2020-11-27 09:17

    You can make the observation that for a string to be considered repeating, its length must be divisible by the length of its repeated sequence. Given that, here is a solution that generates the divisors of the length from 1 to n / 2 inclusive, takes the prefix of the original string with each divisor's length, and tests whether repeating that prefix reproduces the string:

    from math import sqrt, floor
    
    def divquot(n):
        if n > 1:
            yield 1, n
        swapped = []
        for d in range(2, int(floor(sqrt(n))) + 1):
            q, r = divmod(n, d)
            if r == 0:
                yield d, q
                swapped.append((q, d))
        while swapped:
            yield swapped.pop()
    
    def repeats(s):
        n = len(s)
        for d, q in divquot(n):
            sl = s[0:d]
            if sl * q == s:
                return sl
        return None
    

    EDIT: In Python 3, the / operator has changed to do float division by default. To get the int division from Python 2, you can use the // operator instead. Thank you to @TigerhawkT3 for bringing this to my attention.

    The // operator performs integer division in both Python 2 and Python 3, so I've updated the answer to support both versions. The part where we test to see if all the substrings are equal is now a short-circuiting operation using all and a generator expression.
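
    A quick illustration of why this matters for string repetition on Python 3 (the values of n and d are arbitrary examples):

```python
n, d = 40, 8

print(n / d)     # 5.0 on Python 3: true division returns a float
print(n // d)    # 5: floor division returns an int

# A float cannot be used as a repeat count for a string.
try:
    'ab' * (n / d)
except TypeError as exc:
    print('float repeat count rejected:', exc)

print('ab' * (n // d))   # ababababab
```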

    UPDATE: In response to a change in the original question, the code has now been updated to return the smallest repeating substring if it exists and None if it does not. @godlygeek has suggested using divmod to reduce the number of iterations on the divisors generator, and the code has been updated to match that as well. It now returns all positive divisors of n in ascending order, exclusive of n itself.
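
    To make the candidate lengths concrete, here is a simplified sketch that lists them for a given n. It is not the divquot generator above; proper_divisors is a hypothetical helper that returns only the divisors, without the quotients:

```python
def proper_divisors(n):
    """Divisors of n in ascending order, excluding n itself."""
    small, large = [], []
    d = 1
    while d * d <= n:
        q, r = divmod(n, d)
        if r == 0:
            small.append(d)
            if q != d:
                large.append(q)   # the matching divisor above sqrt(n)
        d += 1
    # large was collected in descending order, so reverse it before joining.
    return [x for x in small + large[::-1] if x != n]

print(proper_divisors(40))   # [1, 2, 4, 5, 8, 10, 20]
```

    For a 40-character string such as the first example in the question, these are the only prefix lengths that need testing.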

    Further update for high performance: After multiple tests, I've come to the conclusion that simply testing for string equality has the best performance out of any slicing or iterator solution in Python. Thus, I've taken a leaf out of @TigerhawkT3's book and updated my solution. It's now over 6x as fast as before, noticeably faster than Tigerhawk's solution but slower than David's.

  • 2020-11-27 09:23

    Here are some benchmarks for the various answers to this question. There were some surprising results, including wildly different performance depending on the string being tested.

    Some functions were modified to work with Python 3 (mainly by replacing / with // to ensure integer division). If you see something wrong, want to add your function, or want to add another test string, ping @ZeroPiraeus in the Python chatroom.

    In summary: there's about a 50x difference between the best- and worst-performing solutions for the large set of example data supplied by OP here (via this comment). David Zhang's solution is the clear winner, outperforming all others by around 5x for the large example set.

    A couple of the answers are very slow in extremely large "no match" cases. Otherwise, the functions seem to be equally matched or clear winners depending on the test.

    Here are the results, including plots made using matplotlib and seaborn to show the different distributions:


    Corpus 1 (supplied examples - small set)

    mean performance:
     0.0003  david_zhang
     0.0009  zero
     0.0013  antti
     0.0013  tigerhawk_2
     0.0015  carpetpython
     0.0029  tigerhawk_1
     0.0031  davidism
     0.0035  saksham
     0.0046  shashank
     0.0052  riad
     0.0056  piotr
    
    median performance:
     0.0003  david_zhang
     0.0008  zero
     0.0013  antti
     0.0013  tigerhawk_2
     0.0014  carpetpython
     0.0027  tigerhawk_1
     0.0031  davidism
     0.0038  saksham
     0.0044  shashank
     0.0054  riad
     0.0058  piotr
    


    Corpus 2 (supplied examples - large set)

    mean performance:
     0.0006  david_zhang
     0.0036  tigerhawk_2
     0.0036  antti
     0.0037  zero
     0.0039  carpetpython
     0.0052  shashank
     0.0056  piotr
     0.0066  davidism
     0.0120  tigerhawk_1
     0.0177  riad
     0.0283  saksham
    
    median performance:
     0.0004  david_zhang
     0.0018  zero
     0.0022  tigerhawk_2
     0.0022  antti
     0.0024  carpetpython
     0.0043  davidism
     0.0049  shashank
     0.0055  piotr
     0.0061  tigerhawk_1
     0.0077  riad
     0.0109  saksham
    


    Corpus 3 (edge cases)

    mean performance:
     0.0123  shashank
     0.0375  david_zhang
     0.0376  piotr
     0.0394  carpetpython
     0.0479  antti
     0.0488  tigerhawk_2
     0.2269  tigerhawk_1
     0.2336  davidism
     0.7239  saksham
     3.6265  zero
     6.0111  riad
    
    median performance:
     0.0107  tigerhawk_2
     0.0108  antti
     0.0109  carpetpython
     0.0135  david_zhang
     0.0137  tigerhawk_1
     0.0150  shashank
     0.0229  saksham
     0.0255  piotr
     0.0721  davidism
     0.1080  zero
     1.8539  riad
    


    The tests and raw results are available here.
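
    The original harness is not reproduced here. As a rough idea of how such a comparison could be set up, the sketch below times two candidate functions with timeit. The names regex_repeat, multiply_repeat and bench, the four-string corpus, and the repetition count are all assumptions made for this illustration, so the numbers it prints are not comparable to the tables above:

```python
import re
import timeit
from statistics import mean, median

REPEATER = re.compile(r"(.+?)\1+$")

def regex_repeat(s):
    # Regex approach from the first answer.
    match = REPEATER.match(s)
    return match.group(1) if match else None

def multiply_repeat(s):
    # Prefix-multiplication approach, restricted to divisors of len(s).
    n = len(s)
    for i in range(1, n // 2 + 1):
        if n % i == 0 and s[:i] * (n // i) == s:
            return s[:i]
    return None

# A few strings taken from the examples in the question.
corpus = [
    '0045662100456621004566210045662100456621',
    '037037037037037037037037037037037037037037037',
    '004608294930875576036866359447',
    '007518796992481203',
]

def bench(func, strings, number=200):
    """Return one timing (in seconds for `number` calls) per string."""
    return [timeit.timeit(lambda s=s: func(s), number=number) for s in strings]

for func in (regex_repeat, multiply_repeat):
    timings = bench(func, corpus)
    print('%-16s mean %.6f  median %.6f'
          % (func.__name__, mean(timings), median(timings)))
```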
