I just started learning python and here I have a sorted list of protein sequences (total 59,000 sequences) and some of them overlap. I have made a toy list here for example:
You could use groupby() and max()
to help here:
from itertools import groupby
with open('toy.txt') as f_input:
for key, group in groupby(f_input, lambda x: x[:2]):
print(max(group, key=lambda x: len(x)).strip())
This would display:
ABCDEFGHIJKLMNO
CEST
DBTSFDEO
EOEUDNBNUW
EAEUDNBNUW
FGH
groupby()
works by returning a list of matching items based on a function, in this case consecutive lines with the same first 2 characters. The max()
function then takes this list and returns the list item with the longest length.