Remove item from list based on the next item in same list


悲&欢浪女 2021-02-18 17:08

I just started learning Python, and I have a sorted list of protein sequences (59,000 in total), some of which overlap. I have made a toy list here as an example:

11 answers

逝去的感伤 (original poster)

2021-02-18 17:52
Code
```
import collections as ct


def read_file(filepath):
    """Yield each line of the file with surrounding whitespace stripped."""
    with open(filepath, "r") as f:
        for line in f:
            yield line.strip()


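def find_longest_sequences(seqs):
    """Return a set of the longest sequences containing each input sequence."""
    seqs = tuple(seqs)  # materialize the generator so it can be iterated twice
    dd = ct.defaultdict(list)
    # Map each sequence k to every sequence that contains it as a substring.
    for seq in seqs:
        for k in seqs:
            if k in seq:
                dd[k].append(seq)
    # Keep the longest container per key; the set drops duplicates.
    return {max(v, key=len) for v in dd.values()}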


data = read_file("test.txt")
find_longest_sequences(data)
```
Output
```
{'ABCDEFGHIJKLMNO',
 'CEST',
 'DBTSFDEO',
 'EAEUDNBNUW',
 'EOEUDNBNUWD',
 'FEOEUDNBNUW'}
```
Details

We use read_file to yield each line of the file.

find_longest_sequences builds a defaultdict that maps each sequence to every sequence containing it as a substring. It compares every pair of sequences with two nested loops:
1. The outer loop takes each sequence seq as a candidate container.
2. The inner loop checks each sequence k; if k occurs in seq, seq is appended to the list stored under key k.
For each key, the longest sequence in its list is kept, and these are returned as a set, which removes duplicates.
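
As a small illustration of the grouping step, here is a sketch that exposes the intermediate dict. The three-item input is a hypothetical stand-in (the original test.txt is not shown here), and the names `groups` and `longest` are introduced only for this example.

```python
import collections as ct

# Hypothetical input, not the contents of the original test.txt.
seqs = ("ABCDEFGHIJKLMNO", "FGH", "CEST")

# Map each sequence k to every sequence that contains k as a substring.
groups = ct.defaultdict(list)
for seq in seqs:
    for k in seqs:
        if k in seq:  # substring test
            groups[k].append(seq)

# Keep only the longest container per key; the set removes duplicates.
longest = {max(v, key=len) for v in groups.values()}
```

Here FGH ends up with two containers (ABCDEFGHIJKLMNO and itself), so only the longer one survives. Because every pair is compared, the work grows quadratically with the number of sequences, which may be slow for 59,000 of them.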

Note some discrepancies with your expected output:
1. FGH is contained in ABCDEFGHIJKLMNO and is thus not a valid output.
2. FEOEUDNBNUWD is not an original sequence; it is what you get by merging the overlapping sequences FEOEUDNBNUW and EOEUDNBNUWD. Post-processing is needed to merge overlapping sequences like these.
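
One possible shape for that post-processing is sketched below. It is not part of the code above: `merge_pair` and its `min_overlap` threshold are hypothetical names chosen for illustration, and the threshold of 3 is an arbitrary assumption meant to avoid merging sequences that share only a letter or two.

```python
def merge_pair(a, b, min_overlap=3):
    """Merge b onto a if a suffix of a equals a prefix of b, else return None.

    min_overlap is a hypothetical threshold, not taken from the answer above.
    """
    # Try the longest possible overlap first, down to min_overlap characters.
    for n in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:n]):
            return a + b[n:]
    return None


merged = merge_pair("FEOEUDNBNUW", "EOEUDNBNUWD")
unrelated = merge_pair("CEST", "DBTSFDEO")
```

With these inputs, `merged` is the combined sequence FEOEUDNBNUWD and `unrelated` is None. Applying this across the whole result set (for example, repeatedly merging pairs until nothing changes) is left out, since the right strategy depends on how the sequences are expected to overlap.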