Why is re.findall not being specific in finding triplet items in string. Python

[亡魂溺海] submitted on 2019-12-01 14:29:20

If you want your regex to stop matching at the first TAA|TAG|TGA, but still succeed only if there are at least nine three-letter chunks before it, the following may help:

>>> import re
>>> regexp = r'ATG(?:(?!TAA|TAG|TGA)...){9,}?(?:TAA|TAG|TGA)'
>>> re.findall(regexp, 'ATGAAAAAAAAAAAAAAAAAAAAAAAAAAATAG')
['ATGAAAAAAAAAAAAAAAAAAAAAAAAAAATAG']
>>> re.findall(regexp, 'ATGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATAG')
['ATGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATAG']
>>> re.findall(regexp, 'ATGAAATAGAAAAAAAAAAAAAAAAAAAAATAG')
[]

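This uses a negative lookahead, (?!TAA|TAG|TGA), to check that each three-character chunk is not TAA, TAG or TGA before the chunk is consumed. In the third example the in-frame TAG appears after only one chunk, so the minimum of nine cannot be reached and nothing matches.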
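
To make the effect of the lookahead visible, here is a minimal sketch that runs the same pattern with and without it. The test sequence and the names plain, guarded and STOP are made up for illustration; they are not from the question.

```python
import re

STOP = r'(?:TAA|TAG|TGA)'

# Without the lookahead, a three-character chunk may itself be a stop codon.
plain = re.compile(r'ATG(?:...){9,}?' + STOP)
# With the lookahead, each chunk is checked against the stop codons first.
guarded = re.compile(r'ATG(?:(?!TAA|TAG|TGA)...){9,}?' + STOP)

# Illustrative sequence: an in-frame TAG after one codon, then a later TAG.
seq = 'ATG' + 'AAA' + 'TAG' + 'AAA' * 8 + 'TAG'

plain_hits = plain.findall(seq)      # reads past the early TAG
guarded_hits = guarded.findall(seq)  # blocked by the early TAG, no match
print(plain_hits)
print(guarded_hits)
```

The unguarded pattern returns the whole sequence because it treats the early TAG as just another chunk; the guarded one returns an empty list.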

Note, though, that a TAA|TAG|TGA that does not fall on a three-character boundary does not end the match. The regex reads straight past it, so the sequence below still matches in full:

>>> re.findall(regexp, 'ATGAAAATAGAAAAAAAAAAAAAAAAAAAATAG')
['ATGAAAATAGAAAAAAAAAAAAAAAAAAAATAG']
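
One way to convince yourself that such a TAG is out of frame is to split the match into triplets and look at where the TAG sits. This is only a sketch; the sequence is built by hand for the purpose and the variable names are arbitrary.

```python
import re

regexp = r'ATG(?:(?!TAA|TAG|TGA)...){9,}?(?:TAA|TAG|TGA)'

# Illustrative sequence: the first TAG straddles the codons ATA and GAA.
seq = 'ATG' + 'AAA' + 'ATA' + 'GAA' + 'AAA' * 7 + 'TAG'

match = re.search(regexp, seq)
text = match.group()

# Split the matched text into consecutive three-character codons.
codons = [text[i:i + 3] for i in range(0, len(text), 3)]
first_tag = seq.find('TAG')

print(codons)
print(first_tag, first_tag % 3)  # a non-zero remainder means out of frame
```

None of the codons between the leading ATG and the final stop codon is itself a stop codon, which is all the lookahead guarantees.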

If the length is not a requirement then it's pretty easy:

>>> import re
>>> seq = 'ATGGAAGTTGGATGAAAGTGGAGGTAAAGAGAAGACGTTTGA'
>>> regex = re.compile(r'ATG(?:...)*?(?:TAA|TAG|TGA)')
>>> regex.findall(seq)
['ATGGAAGTTGGATGA']

That said, going by your explanation, I believe your previous regex is actually doing what you asked for: it searches for matches of at least 30 characters that start with ATG and end with a stop codon (here TGA).

In your question you first state that you need matches of at least 30 characters, which is why you put in the {9,}?, but afterwards you expect it to return any match regardless of length. You cannot have both; choose one. If the length is important, then keep the regex you already have: the result you are getting is correct.
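
The trade-off can be seen by making the minimum number of triplets a parameter. This is a rough sketch that assumes the regex in the question had the form ATG(?:...){9,}?(?:TAA|TAG|TGA); find_orfs and min_codons are hypothetical names introduced here.

```python
import re

def find_orfs(seq, min_codons=0):
    """Hypothetical helper: ATG, at least min_codons triplets, then a stop codon."""
    # Assumed form of the pattern from the question, with the count parameterised.
    pattern = r'ATG(?:...){%d,}?(?:TAA|TAG|TGA)' % min_codons
    return re.findall(pattern, seq)

seq = 'ATGGAAGTTGGATGAAAGTGGAGGTAAAGAGAAGACGTTTGA'

shortest = find_orfs(seq, min_codons=0)      # stops at the first in-frame stop codon
with_minimum = find_orfs(seq, min_codons=9)  # must span at least nine triplets first
print(shortest)
print(with_minimum)
```

With no minimum the match ends at the first in-frame TGA; with a minimum of nine the earlier stop codons are swallowed as ordinary triplets and the match runs to the end of the sequence.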

You don't need regular expressions.

def chunks(l, n):
    """ Yield successive n-sized chunks from l.
    from: http://stackoverflow.com/a/312464/1561176
    """
    for i in range(0, len(l), n):
        yield l[i:i+n]

def method(sequence, start=('ATG',), stop=('TAA', 'TAG', 'TGA'), min_len=30):
    response = ''
    started = False
    for x in chunks(sequence, 3):
        if x in start:
            started = True
            response += x
        elif x in stop and started:
            if len(response) >= min_len:
                yield response + x
                response = ''
                started = False
            else:
                response += x
        elif started:
            response += x
    if response:  # emit a trailing, unterminated sequence only if there is one
        yield response

for result in method('ATGGAAGTTGGATGAAAGTGGAGGTAAAGAGAAGACGTTTGA'):
    print(result)

With a min_len of 30, the output is:

ATGGAAGTTGGATGAAAGTGGAGGTAAAGAGAAGACGTTTGA

With a min_len of 0, the output is:

ATGGAAGTTGGATGA

Try this:

import re

seq = 'ATGGAAGTTGGATGAAAGTGGAGGTAAAGAGAAGACGTTTGA'
OR_0 = re.findall(r'ATG(?:.{3})*?(?:TAA|TAG|TGA)', seq)
print(OR_0)  # ['ATGGAAGTTGGATGA']