I just started learning Python. I have a sorted list of protein sequences (59,000 in total), and some of them overlap. I have made a toy list here as an example:
Code
import collections as ct

def read_file(filepath):
    """Yield each line of a file, stripped of surrounding whitespace."""
    with open(filepath, "r") as f:
        for line in f:
            yield line.strip()

def find_longest_sequences(seqs):
    """Return a set holding, for each sequence, the longest sequence that contains it."""
    seqs = tuple(seqs)  # materialize the generator so it can be iterated twice
    dd = ct.defaultdict(list)
    # Group every sequence under each sequence k that it contains as a substring.
    for seq in seqs:
        for k in seqs:
            if k in seq:
                dd[k].append(seq)
    return {max(v, key=len) for v in dd.values()}

data = read_file("test.txt")
find_longest_sequences(data)
Output
{'ABCDEFGHIJKLMNO',
'CEST',
'DBTSFDEO',
'EAEUDNBNUW',
'EOEUDNBNUWD',
'FEOEUDNBNUW'}
Details
We use read_file to yield each line of the file.
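find_longest_sequences builds a defaultdict that groups similar sequences together. It iterates the data with two loops: for every pair of sequences (seq, k), seq is appended to the list stored under key k whenever k is a substring of seq.
The longest sequence in each group of the resulting dict is then collected into a set, which is returned.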
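To make the grouping step concrete, here is a minimal sketch on a tiny input. The three-item list is a hypothetical stand-in built from strings mentioned in this answer, not the original test.txt.

```python
import collections as ct

# Hypothetical toy input (assumption): strings borrowed from this answer.
seqs = ("ABCDEFGHIJKLMNO", "FGH", "CEST")

groups = ct.defaultdict(list)
for seq in seqs:
    for k in seqs:
        if k in seq:  # k is a substring of seq (or equal to it)
            groups[k].append(seq)

print(dict(groups))
# {'ABCDEFGHIJKLMNO': ['ABCDEFGHIJKLMNO'],
#  'FGH': ['ABCDEFGHIJKLMNO', 'FGH'],
#  'CEST': ['CEST']}

# Keep only the longest member of each group.
longest = {max(v, key=len) for v in groups.values()}
print(longest)  # a set containing 'ABCDEFGHIJKLMNO' and 'CEST'
```

FGH disappears from the result because the longest member of its group is the sequence that contains it.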
Note some discrepancies with your expected output:
FGH overlaps with ABCDEFGHIJKLMNO and is thus not a valid output.
FEOEUDNBNUWD is not an original sequence. Post-processing is needed for overlapping sequences.
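One possible shape for that post-processing is sketched below. It assumes that "overlapping" means a suffix of one sequence equals a prefix of another. The helper names merge_overlap and merge_all, and the min_overlap threshold of 3, are assumptions of mine for illustration, so adjust them to whatever overlap rule you actually need.

```python
def merge_overlap(a, b, min_overlap=3):
    """Join a and b on the longest suffix of a that is a prefix of b.

    Returns None when the shared part is shorter than min_overlap
    (min_overlap=3 is an arbitrary assumption, not from the question).
    """
    for size in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:size]):
            return a + b[size:]
    return None


def merge_all(seqs, min_overlap=3):
    """Merge overlapping pairs repeatedly until no pair overlaps."""
    seqs = list(seqs)
    while True:
        # Find the first ordered pair that can be joined, if any.
        pair = next(
            ((a, b) for a in seqs for b in seqs
             if a != b and merge_overlap(a, b, min_overlap) is not None),
            None,
        )
        if pair is None:
            return set(seqs)
        a, b = pair
        seqs.remove(a)
        seqs.remove(b)
        seqs.append(merge_overlap(a, b, min_overlap))


# Input: the set returned by find_longest_sequences above.
longest = ["ABCDEFGHIJKLMNO", "CEST", "DBTSFDEO",
           "EAEUDNBNUW", "EOEUDNBNUWD", "FEOEUDNBNUW"]
print(merge_overlap("FEOEUDNBNUW", "EOEUDNBNUWD"))  # FEOEUDNBNUWD
print(merge_all(longest))
```

With these settings, FEOEUDNBNUW and EOEUDNBNUWD are joined into FEOEUDNBNUWD and the other four sequences are left unchanged. The pairwise search is quadratic per merge, so for 59,000 sequences this is only a starting point.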