I have a string that is randomly generated:
polymer_str = "diol diNCO diamine diNCO diamine diNCO diamine diNCO diol diNCO diamine"
I'd
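import re
# Replace each "diNCO diamine" unit with "|" and drop the spaces,
# so a run of n consecutive units becomes n adjacent "|" characters.
p = "diol diNCO diamine diNCO diamine diNCO diamine diNCO diol diNCO diamine".replace("diNCO diamine", "|").replace(" ", "")
# Splitting on runs of non-"|" characters leaves only the "|" runs.
pat = re.compile("[^|]+")
print(max(map(len, pat.split(p))))  # 3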
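Here is the same idea wrapped in a function, as a rough sketch. The name longest_run and the unit parameter are mine, not part of the snippet above, and it assumes "|" never appears in the input string.

```python
import re

def longest_run(polymer_str, unit="diNCO diamine"):
    # Hypothetical helper; assumes "|" never occurs in polymer_str.
    marked = polymer_str.replace(unit, "|").replace(" ", "")
    # Each run of adjacent "|" stands for one run of consecutive units.
    runs = re.findall(r"\|+", marked)
    # default=0 covers the case where the unit never occurs.
    return max(map(len, runs), default=0)

polymer_str = "diol diNCO diamine diNCO diamine diNCO diamine diNCO diol diNCO diamine"
print(longest_run(polymer_str))  # 3
```

Matching the "|" runs directly with findall avoids the empty strings that split leaves at the ends.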
One way is to use findall:
polymer_str = "diol diNCO diamine diNCO diamine diNCO diamine diNCO diol diNCO diamine"
len(re.findall("diNCO diamine", polymer_str)) # returns 4.
Expanding on Ealdwulf's answer:
Documentation on re.findall can be found in the docs for Python's re module.
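def getLongestSequenceSize(search_str, polymer_str):
    # Each match is one run of consecutive search_str units.
    matches = re.findall(r'(?:\b%s\b\s?)+' % re.escape(search_str), polymer_str)
    # key=len picks the longest run; a plain max() would compare alphabetically.
    longest_match = max(matches, key=len, default='')
    return longest_match.count(search_str)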
This could be written as one line, but it becomes less readable in that form.
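For comparison, a one-line version might look something like the sketch below. The name longest_sequence_size_oneliner is hypothetical, and I've used key=len, default="" and re.escape so that it picks the longest run and doesn't blow up when nothing matches.

```python
import re

def longest_sequence_size_oneliner(search_str, polymer_str):
    # Hypothetical name; default="" avoids a ValueError when there are no matches.
    return max(re.findall(r'(?:\b%s\b\s?)+' % re.escape(search_str), polymer_str), key=len, default="").count(search_str)

polymer_str = "diol diNCO diamine diNCO diamine diNCO diamine diNCO diol diNCO diamine"
print(longest_sequence_size_oneliner("diNCO diamine", polymer_str))  # 3
```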
Alternative:
If polymer_str is huge, it will be more memory efficient to use re.finditer. Here's how you might go about it:
def getLongestSequenceSize(search_str, polymer_str):
    longest_match = ''
    # finditer yields one Match per run, so only the longest run so far is kept in memory.
    for match in re.finditer(r'(?:\b%s\b\s?)+' % re.escape(search_str), polymer_str):
        if len(match.group(0)) > len(longest_match):
            longest_match = match.group(0)
    return longest_match.count(search_str)
The biggest difference between findall and finditer is that the first returns a list of strings built up front, while the second returns an iterator that yields Match objects one at a time. Also, the finditer approach may be somewhat slower.
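To make the difference concrete, here's a small sketch using the pattern from above with the search string hard-coded. The variable names are just for illustration.

```python
import re

polymer_str = "diol diNCO diamine diNCO diamine diNCO diamine diNCO diol diNCO diamine"
pattern = r"(?:\bdiNCO diamine\b\s?)+"

all_runs = re.findall(pattern, polymer_str)    # list of strings, built all at once
lazy_runs = re.finditer(pattern, polymer_str)  # iterator, yields Match objects on demand

print(type(all_runs).__name__)  # list
first = next(lazy_runs)         # only the first run has been scanned so far
print(first.span())             # (5, 47)
```

With finditer you also get positions (span, start, end) for free, which findall doesn't give you.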
Using re:
m = re.search(r"(?:\bdiNCO diamine\b\s?)+", polymer_str)  # first run only, not necessarily the longest
m.group(0).count("diNCO diamine") if m else 0  # counting units avoids dividing lengths
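I think the OP wants the longest contiguous sequence. You can get all contiguous sequences like: seqs = re.findall("(?:diNCO diamine ?)+", polymer_str)
and then find the longest. Note the optional space inside the group; without it, repeats separated by a space are matched one at a time.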
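A minimal sketch of the "find the longest" step, assuming units are separated by a single space. The variable names unit and longest are mine.

```python
import re

polymer_str = "diol diNCO diamine diNCO diamine diNCO diamine diNCO diol diNCO diamine"
unit = "diNCO diamine"

# " ?" lets repeats separated by one space count as a single run.
seqs = re.findall(r"(?:%s ?)+" % re.escape(unit), polymer_str)
# key=len compares runs by length; default="" handles the no-match case.
longest = max(seqs, key=len, default="")
print(longest.count(unit))  # 3
```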