Detect whether sequence is a multiple of a subsequence in Python

轮回少年 2021-02-07 14:44

I have a tuple of zeros and ones, for instance:

(1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1)

It turns out:

(1, 0, 1, 1, 1, 0, 1, 1, 1,          


        
7 Answers
  • 2021-02-07 14:47

    I believe I have an O(n) time solution (actually 2n + r, where n is the length of the tuple and r is the length of the sub-tuple) which does not use suffix trees, but uses a string matching algorithm (like KMP, which you should be able to find off the shelf).

    We use the following little known theorem:

    If x,y are strings over some alphabet,
    
    then xy = yx if and only if x = z^k and y = z^l for some string z and integers k,l.
    

    I now claim that, for the purposes of our problem, this means that all we need to do is determine if the given tuple/list (or string) is a cyclic shift of itself!

    To determine if a string is a cyclic shift of itself, we concatenate it with itself (it does not even have to be a real concat, just a virtual one will do) and check for a substring match (with itself).

    For a proof of that, suppose the string is a cyclic shift of itself.

    Then we have that the given string y = uv = vu for some non-empty u and v. Since uv = vu, we must have that u = z^k and v = z^l and hence y = z^{k+l} from the above theorem. The other direction is easy to prove.
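As a minimal sketch of the cyclic-shift test described above (the name `smallest_rotation` is made up for illustration, and this naive version is quadratic, unlike the KMP-based code that follows), one can look for the first non-zero offset at which the sequence reappears inside its own doubling:

```python
def smallest_rotation(seq):
    """Return the smallest p > 0 such that rotating seq left by p gives seq back."""
    seq = list(seq)
    n = len(seq)
    doubled = seq + seq
    # Offset n always matches, so the loop returns for any non-empty input.
    for p in range(1, n + 1):
        if doubled[p:p + n] == seq:
            return p
    return 0  # only reached for an empty input


s = (1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1)
p = smallest_rotation(s)
print(p, s[:p])  # 4 (1, 0, 1, 1)
```

When the result equals `len(seq)`, the sequence is not a repetition of any shorter block.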

    Here is the python code. The method is called powercheck.

    def powercheck(lst):
        count = 0
        position = 0
        for pos in KnuthMorrisPratt(double(lst), lst):
            count += 1
            position = pos
            if count == 2:
                break
    
        return lst[:position]
    
    
    def double(lst):
        for i in range(1,3):
            for elem in lst:
                yield elem
    
    def main():
        print(powercheck([1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1]))
    
    if __name__ == "__main__":
        main()
    

    And here is the KMP code which I used (due to David Eppstein).

    # Knuth-Morris-Pratt string matching
    # David Eppstein, UC Irvine, 1 Mar 2002
    
    def KnuthMorrisPratt(text, pattern):
    
        '''Yields all starting positions of copies of the pattern in the text.
    Calling conventions are similar to string.find, but its arguments can be
    lists or iterators, not just strings, it returns all matches, not just
    the first one, and it does not need the whole text in memory at once.
    Whenever it yields, it will have read the text exactly up to and including
    the match that caused the yield.'''
    
        # allow indexing into pattern and protect against change during yield
        pattern = list(pattern)
    
        # build table of shift amounts
        shifts = [1] * (len(pattern) + 1)
        shift = 1
        for pos in range(len(pattern)):
            while shift <= pos and pattern[pos] != pattern[pos-shift]:
                shift += shifts[pos-shift]
            shifts[pos+1] = shift
    
        # do the actual search
        startPos = 0
        matchLen = 0
        for c in text:
            while matchLen == len(pattern) or \
                  matchLen >= 0 and pattern[matchLen] != c:
                startPos += shifts[matchLen]
                matchLen -= shifts[matchLen]
            matchLen += 1
            if matchLen == len(pattern):
                yield startPos
    

    For your sample this outputs

    [1,0,1,1]
    

    as expected.
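A related route, not taken from the answer above, is to skip the search and read the period off the KMP prefix (failure) function directly. This is a sketch under that assumption, and the name `unit_from_prefix_function` is hypothetical:

```python
def unit_from_prefix_function(seq):
    """Return the shortest block whose repetition gives seq (seq itself if none)."""
    seq = list(seq)
    n = len(seq)
    if n == 0:
        return seq
    # fail[i] = length of the longest proper prefix of seq[:i+1]
    # that is also a suffix of seq[:i+1]
    fail = [0] * n
    k = 0
    for i in range(1, n):
        while k > 0 and seq[i] != seq[k]:
            k = fail[k - 1]
        if seq[i] == seq[k]:
            k += 1
        fail[i] = k
    period = n - fail[-1]
    # The candidate period only counts if it divides the length evenly.
    if n % period != 0:
        period = n
    return seq[:period]


print(unit_from_prefix_function([1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1]))  # [1, 0, 1, 1]
```

This also runs in linear time and needs only one pass over the input.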

    I compared this against shx2's code (not the numpy one) by generating a random 50-bit string and replicating it to a total length of 1 million. This was the output (the decimal numbers are the output of time.time()):

    1362988461.75
    
    (50, [1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1])
    
    1362988465.96
    
    50 [1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1]
    
    1362988487.14
    

    The above method took ~4 seconds, while shx2's method took ~21 seconds!

    Here was the timing code. (shx2's method was called powercheck2).

    import random
    import time

    def rand_bitstring(n):
        """Return a list of n random bits."""
        rand = random.SystemRandom()
        return [rand.randint(0, 1) for _ in range(n)]

    def main():
        lst = rand_bitstring(50) * 200000
        print(time.time())
        print(powercheck(lst))
        print(time.time())
        powercheck2(lst)
        print(time.time())
    
  • 2021-02-07 14:57

    The following solution is O(N^2), but has the advantage of not creating any copies (or slices) of your data, as it is based on iterators.

    Depending on the size of your input, the fact you avoid making copies of the data can result in a significant speed-up, but of course, it would not scale as well for huge inputs as algorithms with lower complexity (e.g. O(N*logN)).

    [This is the second revision of my solution, the first one is given below. This one is simpler to understand, and is more along the lines of OP's tuple-multiplication, only using iterators.]

    # Python 3 version: zip is lazy, and // keeps num_slices an integer
    from itertools import chain, tee

    def iter_eq(seq1, seq2):
        """ assumes the sequences have the same len """
        return all( v1 == v2 for v1, v2 in zip(seq1, seq2) )

    def dup_seq(seq, n):
        """ returns an iterator which is seq chained to itself n times """
        return chain(*tee(seq, n))

    def is_reps(arr, slice_size):
        if len(arr) % slice_size != 0:
            return False
        num_slices = len(arr) // slice_size
        return iter_eq(arr, dup_seq(arr[:slice_size], num_slices))

    s = (1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1)
    for i in range(1,len(s)):
        if is_reps(s, i):
            print(i, s[:i])
            break
    

    [My original solution]

    from itertools import islice

    def is_reps(arr, num_slices):
        # NOTE: despite its name, num_slices is used here as the length of the
        # candidate repeating unit (it is also the islice step below).
        if len(arr) % num_slices != 0:
            return False
        # every offset inside the unit must be checked, hence range(num_slices)
        for i in range(num_slices):
            if len(set( islice(arr, i, None, num_slices) )) > 1:
                return False
        return True

    s = (1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1)
    for i in range(1,len(s)):
        if is_reps(s, i):
            print(i, s[:i])
            break
    

    You can avoid the call to set() by using something like:

    def is_iter_unique(seq):
        """ a faster version of testing len(set(seq)) <= 1 """
        seen = set()
        for x in seq:
            seen.add(x)
            if len(seen) > 1:
                return False
        return True
    

    and replacing this line:

    if len(set( islice(arr, i, None, num_slices) )) > 1:
    

    with:

    if not is_iter_unique(islice(arr, i, None, num_slices)):
    
  • 2021-02-07 14:58

    This one is just a dumb recursive comparison in Haskell. It takes about one second for Knoothe's million long string (f a). Cool problem! I'll think about it some more.

    a = concat $ replicate 20000 
        [1,1,1,0,0,1,0,1,0,0,1,0,0,1,1,1,0,0,
         0,0,0,0,1,1,1,1,0,0,0,1,1,0,1,1,1,1,
         1,1,1,0,0,1,1,1,0,0,0,0,0,1]
    
    f s = 
      f' s [] where
        f' [] result = []
        f' (x:xs) result =
          let y = result ++ [x]   
          in if concat (replicate (div (length s) (length y)) y) == s
                then y
                else f' xs y
    
  • 2021-02-07 15:01

    You can achieve this by XOR'ing a rotated copy of the input array's binary form with the original. (Each rotation and comparison is itself O(n), so the loop is O(n^2) in the worst case rather than sublinear.)

    1. Get the binary representation of the array, input_binary.
    2. Loop from i = 1 to len(input_array)/2, and on each iteration rotate input_binary to the right by i bits, save it as rotated_bin, then compute the XOR of rotated_bin and input_binary.
    3. The first i that yields 0 is the length of the desired substring.

    Complete code:

    def get_substring(arr):
        binary = ''.join(map(str, arr)) # join the elements to get the binary form

        # include len(arr) // 2 itself, so that e.g. [1, 0, 1, 0] is handled
        for i in range(1, len(arr) // 2 + 1):
            # rotate right by i positions
            rotated_bin = binary[-i:] + binary[:-i]
            # parse as base 2, matching the "binary form" described above
            if int(rotated_bin, 2) ^ int(binary, 2) == 0:
                return arr[0:i]

        return None


    if __name__ == "__main__":
        test = [1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1]
        print(get_substring(test)) # [1, 0, 1, 1]
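The code above rotates a string of digits. As a hedged sketch of the same idea on an actual integer, assuming every element is 0 or 1 (the name `get_substring_bits` is made up for illustration), the rotation can be done with shifts and a mask:

```python
def get_substring_bits(arr):
    """Shortest repeating prefix of a 0/1 list, or None if there is none."""
    n = len(arr)
    if n == 0:
        return None
    value = int(''.join(map(str, arr)), 2)  # pack the bits into one integer
    mask = (1 << n) - 1                     # keeps results within n bits
    for i in range(1, n // 2 + 1):
        # rotate right by i bits inside an n-bit word
        rotated = ((value >> i) | (value << (n - i))) & mask
        if rotated ^ value == 0:
            return arr[:i]
    return None


print(get_substring_bits([1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1]))  # [1, 0, 1, 1]
```

Each shift on an n-bit integer still costs time proportional to n, so this does not change the overall complexity; it only avoids building a new string per rotation.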
    
  • 2021-02-07 15:07

    Just a different approach to the problem

    I first determine all the factors of the length, then split the list into equal parts and check whether all the parts are the same.

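    >>> from functools import reduce
    >>> def f(s):
        def factors(n):
            #http://stackoverflow.com/a/6800214/977038
            # the trailing [] keeps reduce from failing when n has no such factors
            return set(reduce(list.__add__,
                    ([i, n//i] for i in range(2, int(n**0.5) + 1) if n % i == 0), []))
        _len = len(s)
        # try the largest chunk size first, in a deterministic order
        for fact in sorted(factors(_len), reverse=True):
            compare_set = set(zip(*[iter(s)]*fact))
            if len(compare_set) == 1:
                return compare_set


    >>> t = (1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1)
    >>> f(t)
    {(1, 0, 1, 1)}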
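If the shortest repeating block is wanted instead of the largest, one possible variation (a sketch only, with the hypothetical name `shortest_repeating_unit`) tries the divisors of the length in increasing order and compares with plain tuple multiplication:

```python
def shortest_repeating_unit(s):
    """Return the shortest tuple u with u * k == s, or s itself if there is none."""
    s = tuple(s)
    n = len(s)
    for size in range(1, n // 2 + 1):
        # only block sizes that divide the length can tile the whole tuple
        if n % size == 0 and s[:size] * (n // size) == s:
            return s[:size]
    return s


print(shortest_repeating_unit((1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1)))  # (1, 0, 1, 1)
```

Unlike the factor-based code above, this also considers a block size of 1 and falls back to the whole input when nothing repeats.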
    
  • 2021-02-07 15:08

    Simplifying Knoothe's solution. His algorithm is right, but his implementation is too complex. This implementation is also O(n).

    Since your array is composed only of ones and zeros, I use the existing str.find implementation (Boyer-Moore based) to implement Knoothe's idea. It's surprisingly simple and much faster at runtime.

    def f(s):
        s2 = ''.join(map(str, s))
        return s[:(s2+s2).index(s2, 1)]
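Joining with `str` is only unambiguous when every element prints as a single character. A possible variation, assuming all elements are integers in `range(256)`, packs them into a `bytes` object instead; the name `f_bytes` is made up for this sketch:

```python
def f_bytes(s):
    """Shortest repeating prefix of s; assumes elements are ints in range(256)."""
    b = bytes(s)  # one byte per element, so element boundaries cannot blur
    # The first match at offset >= 1 in the doubled sequence is the period.
    return s[:(b + b).find(b, 1)]


print(f_bytes((1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1)))  # (1, 0, 1, 1)
print(f_bytes((1, 11, 11, 1)))                          # (1, 11, 11, 1)
```

The second call shows the difference: with string joining, `(1, 11, 11, 1)` becomes `'111111'` and would be reported as a repetition of `(1,)`.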
    