I just started learning Python. I have a sorted list of protein sequences (59,000 in total), and some of them overlap. I have made a toy list here as an example:
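A simple way is to process the input file one line at a time, compare each line with the previous one, and keep the previous line only if it is not contained in the current one. This relies on the list being sorted: assuming plain lexicographic order, a sequence that is a prefix of a longer one is immediately followed by a line that contains it.
The code can be as simple as: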
with open('toy.txt', 'r') as f:
    old = next(f).strip()  # keep the first line, stripped of its end of line
    for pattern in f:
        pattern = pattern.strip()  # strip the end of line
        if old not in pattern:
            print(old)  # keep old if it is not contained in the current line
        old = pattern  # store the current line for the next iteration
    print(old)  # do not forget the last line
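
If you want to try the idea without a file, here is a minimal sketch of the same comparison wrapped in a generator. The function name `drop_contained` and the toy sequences are my own placeholders, not the list from the question, and the sketch assumes the input is already sorted.

```python
def drop_contained(sequences):
    """Yield each sequence that is not contained in the one right after it.

    Assumes `sequences` is already sorted, as in the question.
    """
    previous = None
    for line in sequences:
        current = line.strip()
        if not current:
            continue  # skip blank lines
        if previous is not None and previous not in current:
            yield previous  # previous is not part of the current sequence
        previous = current
    if previous is not None:
        yield previous  # the last sequence is always kept


# Hypothetical toy data, not the list from the question.
toy = ["ABCD", "ABCDE", "ABCDEFG", "BCDEF", "XYZ", "XYZW"]
result = list(drop_contained(toy))
print(result)  # ['ABCDEFG', 'BCDEF', 'XYZW']
```

Note that `BCDEF` is kept even though it occurs inside `ABCDEFG`: each line is only compared with its immediate successor, so this catches sequences that extend the previous one, not every possible overlap. The function also accepts an open file object, since it only iterates over lines.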