问题
The task is to find the longest sequence of a group
for instance, given DNA sequence: "AGATCAGATCTTTTTTCTAATGTCTAGGATATATCAGATCAGATCAGATCAGATCAGATC"
and it has 7 occurrences of AGATC. (AGATC)
matches all occurrences.
Is it possible to write a regular expression that catches only the longest sequence, i.e. AGATCAGATCAGATCAGATCAGATC
in the given text?
If this is not possible only with regex, how can I iterate through each sequence (i.e. 1st sequence is AGATCAGATC
, 2nd - AGATCAGATCAGATCAGATCAGATC
et cetera) in python?
回答1:
Use:
import re
sequence = "AGATCAGATCTTTTTTCTAATGTCTAGGATATATCAGATCAGATCAGATCAGATCAGATC"
matches = re.findall(r'(?:AGATC)+', sequence)
# To find the longest subsequence
longest = max(matches, key=len)
Explanation:
Non-capturing group (?:AGATC)+
+
Quantifier — Matches between one and unlimited times, as many times as possible.AGATC
matches the characters AGATC literally (case sensitive)
Result:
# print(matches)
['AGATCAGATC', 'AGATCAGATCAGATCAGATCAGATC']
# print(longest)
'AGATCAGATCAGATCAGATCAGATC'
You can test the regex here.
回答2:
The central question is, "Is it possible to write a regular expression that catches only the longest sequence?" The answer is "yes":
import re
s = 'AGATC_AGATCAGATC_AGATCAGATCAGATC_AGATC_AGATCAGATC'
m = re.search(r'((?:AGATC)+)(?!.*\1)', s)
print m.group() if m else ''
#=> "AGATCAGATCAGATC"
Regex demo<¯\(ツ)/¯>Python demo
Python's regex engine performs the following operations.
( begin capture group 1
(?:AGATC) match 'AGATC' in a non-capture group
+ execute the non-capture group 1+ times
) end capture group 1
(?! begin a negative lookahead
.* match 0+ characters
\1 match the content of capture group 1
) end the negative lookahead
For the string s
above, AGATC
would first be matched but the negative lookahead would find AGATC
as the first part of AGATCAGATC
, so the tentative match would be rejected. Then AGATCAGATC
would be matched, but the negative lookahead would find AGATCAGATC
as the first part of AGATCAGATCAGATC
so that tentative match would also be rejected. Next, AGATCAGATCAGATC
would be matched and accepted, as the negative lookahead would not find that match later in the string. (re.findall
, unlike re.search
, would also match AGATCAGATC
at the end of the string.)
If re.findall
were used there may be multiple matches after the longest one (see the last test string at the link to the regex demo), but the lengths of the matches are non-decreasing from the first to the last. Therefore, the first match, obtained using re.search
is a longest match.
回答3:
Use re.finditer()
to iterate over all matches. Then use max()
with a key function to find the longest. Make it a function so you can use different groups.
import re
def find_longest(sequence, group):
# build pattern
pattern = fr"(?:{group})+"
# iterate over all matches
matches = (match[0] for match in re.finditer(pattern, sequence))
# find the longest
return max(matches, key=len)
seq = "AGATCAGATCTTTTTTCTAATGTCTAGGATATATCAGATCAGATCAGATCAGATCAGATC"
find_longest(seq, "AGATC")
来源:https://stackoverflow.com/questions/62078827/how-to-catch-the-longest-sequence-of-a-group