Overlapping regex matches

后端未结

关注

 3  2038

I\'m trying to create the following regular expression: return a string between AUG and (UAG or UGA or UAA) from a follow

相关标签:

3条回答

眼角桃花

2020-12-21 19:51

Unfortunately the re module does not offer support for overlapping matches at the moment, but you can easily break the solution down like so:

'AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG'
matches = []

for m in re.finditer('AUG', str):
    for n in re.finditer('(UAG)|(UGA)|(UAA)', str[m.start():]):
        matches.append(str[m.start()+3:m.start()+n.end()-3]

print matches

0 讨论(0)

南方客

2020-12-21 19:56

If you don't think in terms of 'matches', but rather in terms of 'intervals' I think you will find it easier. This is what @ionut-hulub has done. You can do it in a single pass as I demonstrate below, however you should probably use the simpler finditer() approach unless you have enough RNA strings (or they are long enough) you need to avoid the redundant passes over the string.

s = 'AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG'

def intervals(s):
    state = []
    i = 0
    max = len(s) - 2
    while i < max:
        if s[i] == 'A' and s[i+1] == 'U' and s[i+2] == 'G':
            state.append(i)
        if s[i] == 'U' and (s[i+1] == 'A' and s[i+2] == 'G') or (s[i+1] == 'G' and s[i+2] == 'A') or (s[i+1] == 'A' and s[i+2] == 'A'):
            for b in state:
                yield (b, i)
        i += 1

for interval in intervals(s):
    print interval

0 讨论(0)

庸人自扰

2020-12-21 19:59

Doing this with one regular expression is actually pretty difficult, as most uses specifically don't want overlapping matches. You could, however, do this with some simple iteration:

regex = re.compile('(?=AUG)(\w+)(?=UAG|UGA|UAA)');
RNA = 'AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG'
matches = []
tmp = RNA
while (match = regex.search(tmp)):
    matches.append(match)
    tmp = tmp[match.start()-2:]  #Back up two to get the UG portion.  Shouldn't matter, but safer.

for m in matches:
    print m.group(0)

Though, this has some problems. What would you expect the return to be in the case of AUGUAGUGAUAA? Are there two strings to be returned? Or just one? Right now, your regular expression isn't even capable of capturing UAG, as it continues on through to match UAGUGA and get cut off at UAA. To combat this, then, you might wish to use the ? operator to make your operator lazy - an approach that then will be unable to capture the longer substring.

Maybe iteration over the string twice is the answer, but then what if your RNA Sequence contains AUGAUGUAGUGAUAA? What's the correct behaviour there?

I might favor a regular expression free approach, by iterating over the string and its substrings:

RNA = 'AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG'
candidates = []
start = 0

while (RNA.find('AUG', start) > -1):
    start = RNA.find('AUG', start) #Confound python and its lack of assignment returns
    candidates.append(RNA[start+3:])
    start += 1

matches = []

for candidate in candidates:
    for terminator in ['UAG', 'UGA', 'UAA']:
        end = 1;
        while(candidate.find(terminator, end) > -1):
            end = candidate.find(terminator, end)
            matches.append(candidate[:end])
            end += 1

for match in matches:
    print match

This way, you're sure to get all matches, no matter what.

If you need to keep track of the position of each match, you can modify your candidates data structure to use tuples which maintain the starting position:

RNA = 'AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG'
candidates = []
start = 0

while (RNA.find('AUG', start) > -1):
    start = RNA.find('AUG', start) #Confound python and its lack of assignment returns
    candidates.append((RNA[start+3:], start+3))
    start += 1

matches = []

for candidate in candidates:
    for terminator in ['UAG', 'UGA', 'UAA']:
        end = 1;
        while(candidate[0].find(terminator, end) > -1):
            end = candidate[0].find(terminator, end)
            matches.append((candidate[1], candidate[1] + end, candidate[0][:end]))
            end += 1

for match in matches:
    print "%d - %d: %s" % match

which prints:

7 - 49: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAU
7 - 85: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
7 - 31: UAGCUAACUCAGGUUACAUGGGGA
7 - 72: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
7 - 76: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
7 - 11: UAGC
7 - 66: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
27 - 49: GGGAUGACCCCGCGACUUGGAU
27 - 85: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
27 - 31: GGGA
27 - 72: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
27 - 76: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
27 - 66: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
33 - 49: ACCCCGCGACUUGGAU
33 - 85: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
33 - 72: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
33 - 76: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
33 - 66: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
78 - 85: AUCCGAG

Hell, with literally three more lines, you can even sort the matches according to where they fall in the RNA sequence:

from operator import itemgetter
matches.sort(key=itemgetter(1))
matches.sort(key=itemgetter(0))

That placed before the final print nets you:

007 - 011: UAGC
007 - 031: UAGCUAACUCAGGUUACAUGGGGA
007 - 049: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAU
007 - 066: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
007 - 072: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
007 - 076: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
007 - 085: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
027 - 031: GGGA
027 - 049: GGGAUGACCCCGCGACUUGGAU
027 - 066: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
027 - 072: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
027 - 076: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
027 - 085: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
033 - 049: ACCCCGCGACUUGGAU
033 - 066: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
033 - 072: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
033 - 076: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
033 - 085: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
078 - 085: AUCCGAG

0 讨论(0)