How to find the overlap between 2 sequences, and return it

前端 未结 4 1972
逝去的感伤
逝去的感伤 2020-12-02 23:45

I am new in Python, and have already spend to many hours with this problem, hope somebody can help me. I need to find the overlap between 2 sequences. The overlap is in the

相关标签:
4条回答
  • 2020-12-03 00:31

    Longest Common Substring

    def LongestCommonSubstring(S1, S2):
      M = [[0]*(1+len(S2)) for i in xrange(1+len(S1))]
      longest, x_longest = 0, 0
      for x in xrange(1,1+len(S1)):
        for y in xrange(1,1+len(S2)):
            if S1[x-1] == S2[y-1]:
                M[x][y] = M[x-1][y-1] + 1
                if M[x][y]>longest:
                    longest = M[x][y]
                    x_longest  = x
            else:
                M[x][y] = 0
      return S1[x_longest-longest: x_longest]
    
    0 讨论(0)
  • 2020-12-03 00:33

    You could use difflib.SequenceMatcher:

    d = difflib.SequenceMatcher(None,s1,s2)
    >>> match = max(d.get_matching_blocks(),key=lambda x:x[2])
    >>> match
    Match(a=8, b=0, size=39)
    >>> i,j,k = match
    >>> d.a[i:i+k]
    'GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC'
    >>> d.a[i:i+k] == d.b[j:j+k]
    True
    >>> d.a == s1
    True
    >>> d.b == s2
    True
    
    0 讨论(0)
  • 2020-12-03 00:44

    The Knuth-Morris-Pratt algorithm is a nice method for finding one string inside another (since I saw DNA, I'm guessing you'd like this to scale to... billions?).

    # Knuth-Morris-Pratt string matching
    # David Eppstein, UC Irvine, 1 Mar 2002
    
    from __future__ import generators
    
    def KnuthMorrisPratt(text, pattern):
    
        '''Yields all starting positions of copies of the pattern in the text.
    Calling conventions are similar to string.find, but its arguments can be
    lists or iterators, not just strings, it returns all matches, not just
    the first one, and it does not need the whole text in memory at once.
    Whenever it yields, it will have read the text exactly up to and including
    the match that caused the yield.'''
    
        # allow indexing into pattern and protect against change during yield
        pattern = list(pattern)
    
        # build table of shift amounts
        shifts = [1] * (len(pattern) + 1)
        shift = 1
        for pos in range(len(pattern)):
            while shift <= pos and pattern[pos] != pattern[pos-shift]:
                shift += shifts[pos-shift]
            shifts[pos+1] = shift
    
        # do the actual search
        startPos = 0
        matchLen = 0
        for c in text:
            while matchLen == len(pattern) or \
                  matchLen >= 0 and pattern[matchLen] != c:
                startPos += shifts[matchLen]
                matchLen -= shifts[matchLen]
            matchLen += 1
            if matchLen == len(pattern):
                yield startPos
    

    The link where I got the KMP python code (and a builtin, which will be faster for small problems because of the runtime constant).

    For bleeding-edge performance, use a prefix table and hash windows of your string as base 4 integers (in biology you'd call them k-mers or oligos). ; )

    Good luck!

    EDIT: There's also a nice trick where you sort a list containing every prefix (n total) in the first string and every prefix (n total) in the second string. If they share the largest common subsequence, then they must be adjacent in the sorted list, so find the element from the other string that is closest in the sorted list, and then take the longest prefix that matches completely. : )

    0 讨论(0)
  • 2020-12-03 00:48

    Have a look at the difflib library and more precisely at find_longest_match():

    import difflib
    
    def get_overlap(s1, s2):
        s = difflib.SequenceMatcher(None, s1, s2)
        pos_a, pos_b, size = s.find_longest_match(0, len(s1), 0, len(s2)) 
        return s1[pos_a:pos_a+size]
    
    s1 = "CGATTCCAGGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC"
    s2 = "GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTCGTCCAGACCCCTAGC"
    
    print(get_overlap(s1, s2)) # GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC
    
    0 讨论(0)
提交回复
热议问题