Overlapping regex matches

后端 未结 3 2038
抹茶落季
抹茶落季 2020-12-21 19:20

I\'m trying to create the following regular expression: return a string between AUG and (UAG or UGA or UAA) from a follow

相关标签:
3条回答
  • 2020-12-21 19:51

    Unfortunately the re module does not offer support for overlapping matches at the moment, but you can easily break the solution down like so:

    'AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG'
    matches = []
    
    for m in re.finditer('AUG', str):
        for n in re.finditer('(UAG)|(UGA)|(UAA)', str[m.start():]):
            matches.append(str[m.start()+3:m.start()+n.end()-3]
    
    print matches
    
    0 讨论(0)
  • 2020-12-21 19:56

    If you don't think in terms of 'matches', but rather in terms of 'intervals' I think you will find it easier. This is what @ionut-hulub has done. You can do it in a single pass as I demonstrate below, however you should probably use the simpler finditer() approach unless you have enough RNA strings (or they are long enough) you need to avoid the redundant passes over the string.

    s = 'AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG'
    
    def intervals(s):
        state = []
        i = 0
        max = len(s) - 2
        while i < max:
            if s[i] == 'A' and s[i+1] == 'U' and s[i+2] == 'G':
                state.append(i)
            if s[i] == 'U' and (s[i+1] == 'A' and s[i+2] == 'G') or (s[i+1] == 'G' and s[i+2] == 'A') or (s[i+1] == 'A' and s[i+2] == 'A'):
                for b in state:
                    yield (b, i)
            i += 1
    
    for interval in intervals(s):
        print interval
    
    0 讨论(0)
  • 2020-12-21 19:59

    Doing this with one regular expression is actually pretty difficult, as most uses specifically don't want overlapping matches. You could, however, do this with some simple iteration:

    regex = re.compile('(?=AUG)(\w+)(?=UAG|UGA|UAA)');
    RNA = 'AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG'
    matches = []
    tmp = RNA
    while (match = regex.search(tmp)):
        matches.append(match)
        tmp = tmp[match.start()-2:]  #Back up two to get the UG portion.  Shouldn't matter, but safer.
    
    for m in matches:
        print m.group(0)
    

    Though, this has some problems. What would you expect the return to be in the case of AUGUAGUGAUAA? Are there two strings to be returned? Or just one? Right now, your regular expression isn't even capable of capturing UAG, as it continues on through to match UAGUGA and get cut off at UAA. To combat this, then, you might wish to use the ? operator to make your operator lazy - an approach that then will be unable to capture the longer substring.

    Maybe iteration over the string twice is the answer, but then what if your RNA Sequence contains AUGAUGUAGUGAUAA? What's the correct behaviour there?

    I might favor a regular expression free approach, by iterating over the string and its substrings:

    RNA = 'AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG'
    candidates = []
    start = 0
    
    while (RNA.find('AUG', start) > -1):
        start = RNA.find('AUG', start) #Confound python and its lack of assignment returns
        candidates.append(RNA[start+3:])
        start += 1
    
    matches = []
    
    for candidate in candidates:
        for terminator in ['UAG', 'UGA', 'UAA']:
            end = 1;
            while(candidate.find(terminator, end) > -1):
                end = candidate.find(terminator, end)
                matches.append(candidate[:end])
                end += 1
    
    for match in matches:
        print match
    

    This way, you're sure to get all matches, no matter what.

    If you need to keep track of the position of each match, you can modify your candidates data structure to use tuples which maintain the starting position:

    RNA = 'AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG'
    candidates = []
    start = 0
    
    while (RNA.find('AUG', start) > -1):
        start = RNA.find('AUG', start) #Confound python and its lack of assignment returns
        candidates.append((RNA[start+3:], start+3))
        start += 1
    
    matches = []
    
    for candidate in candidates:
        for terminator in ['UAG', 'UGA', 'UAA']:
            end = 1;
            while(candidate[0].find(terminator, end) > -1):
                end = candidate[0].find(terminator, end)
                matches.append((candidate[1], candidate[1] + end, candidate[0][:end]))
                end += 1
    
    for match in matches:
        print "%d - %d: %s" % match
    

    which prints:

    7 - 49: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAU
    7 - 85: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
    7 - 31: UAGCUAACUCAGGUUACAUGGGGA
    7 - 72: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
    7 - 76: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
    7 - 11: UAGC
    7 - 66: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
    27 - 49: GGGAUGACCCCGCGACUUGGAU
    27 - 85: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
    27 - 31: GGGA
    27 - 72: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
    27 - 76: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
    27 - 66: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
    33 - 49: ACCCCGCGACUUGGAU
    33 - 85: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
    33 - 72: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
    33 - 76: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
    33 - 66: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
    78 - 85: AUCCGAG
    

    Hell, with literally three more lines, you can even sort the matches according to where they fall in the RNA sequence:

    from operator import itemgetter
    matches.sort(key=itemgetter(1))
    matches.sort(key=itemgetter(0)) 
    

    That placed before the final print nets you:

    007 - 011: UAGC
    007 - 031: UAGCUAACUCAGGUUACAUGGGGA
    007 - 049: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAU
    007 - 066: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
    007 - 072: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
    007 - 076: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
    007 - 085: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
    027 - 031: GGGA
    027 - 049: GGGAUGACCCCGCGACUUGGAU
    027 - 066: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
    027 - 072: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
    027 - 076: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
    027 - 085: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
    033 - 049: ACCCCGCGACUUGGAU
    033 - 066: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
    033 - 072: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
    033 - 076: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
    033 - 085: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
    078 - 085: AUCCGAG
    
    0 讨论(0)
提交回复
热议问题