I just started learning Python, and I have a sorted list of protein sequences (59,000 in total), some of which overlap. I have made a toy list here as an example:
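with open('demo.txt') as f:
    # Strip newlines up front so every comparison uses the bare sequence
    lines = [line.strip() for line in f]
l_lines = len(lines)
n_lst = []
for i, line in enumerate(lines):
    if i == l_lines - 1:
        # The last sequence has no successor to be contained in, so keep it
        n_lst.append(line)
        break
    # Keep a sequence only if it is not contained in the next one
    if line not in lines[i + 1]:
        n_lst.append(line)
print(n_lst)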
Output:
['ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FGH']
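The same neighbor comparison can also be sketched as a small function that works on an in-memory list, which makes it easier to try out without `demo.txt`. The function name `drop_contained` and the `sample` list are assumptions made up for illustration; they are not the actual data or the contents of the file.

```python
def drop_contained(seqs):
    """Keep each sequence unless it is a substring of the one right after it."""
    kept = []
    # Pair every sequence with its successor in the (already sorted) list
    for current, following in zip(seqs, seqs[1:]):
        if current not in following:
            kept.append(current)
    # The last sequence has no successor, so it is always kept
    if seqs:
        kept.append(seqs[-1])
    return kept


# Hypothetical toy input, chosen only to illustrate the behavior
sample = [
    'ABCDE', 'ABCDEFG', 'ABCDEFGH', 'ABCDEFGHIJKLMNO', 'CEST',
    'DBTSFDE', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FG', 'FGH',
]
print(drop_contained(sample))
```

Note that this only compares each sequence with its immediate neighbor, so a sequence contained in a non-adjacent entry (for example `FGH` inside `ABCDEFGHIJKLMNO`) is still kept.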