Remove item from list based on the next item in same list

悲&欢浪女 2021-02-18 17:08

I just started learning Python, and I have a sorted list of protein sequences (59,000 in total), some of which overlap. I have made a toy list here as an example:

11 answers
  •  逝去的感伤
    2021-02-18 17:52

    Code

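    import collections as ct


    def read_file(filepath):
        """Yield each line of a file with surrounding whitespace stripped."""
        with open(filepath, "r") as f:
            for line in f:
                yield line.strip()


    def find_longest_sequences(seqs):
        """Return a set holding, for each sequence, the longest sequence that contains it."""
        seqs = tuple(seqs)  # materialize the generator so it can be iterated twice
        dd = ct.defaultdict(list)
        # Group under each key k every sequence that contains k as a substring.
        for seq in seqs:
            for k in seqs:
                if k in seq:
                    dd[k].append(seq)
        # Keep the longest member of each group; the set drops duplicates.
        return {max(v, key=len) for v in dd.values()}


    data = read_file("test.txt")
    find_longest_sequences(data)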
    

    Output

    {'ABCDEFGHIJKLMNO',
     'CEST',
     'DBTSFDEO',
     'EAEUDNBNUW',
     'EOEUDNBNUWD',
     'FEOEUDNBNUW'}
    

    Details

    We use read_file to yield each line of the file.

    find_longest_sequences builds a defaultdict that groups each sequence with the sequences that contain it. It iterates over the data with two loops:

    1. One loop makes every unique sequence a key of the dict; the defaultdict creates an empty list for a key on first access.
    2. The other loop appends to that list every string that contains the key as a substring.

    The longest string in each list is then collected into a set, which drops duplicates, and that set is returned.
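
    To make the grouping step concrete, here is a minimal sketch that runs the same logic on an in-memory list and prints the intermediate dict. The four strings are a hypothetical toy input chosen for illustration, not the data from the question, and the name groups is used here in place of dd.

```python
import collections as ct

# Hypothetical toy input (an assumption, not the asker's actual data).
seqs = ("ABC", "ABCDE", "CDE", "XYZ")

groups = ct.defaultdict(list)
for seq in seqs:
    for k in seqs:
        if k in seq:  # k occurs as a contiguous substring of seq
            groups[k].append(seq)

# Each key now maps to every sequence that contains it.
print(dict(groups))

# Keep the longest member of each group; the set drops duplicates.
longest = {max(v, key=len) for v in groups.values()}
print(longest)
```

    Here ABC and CDE are both absorbed by ABCDE, while XYZ stays because nothing longer contains it. Note that the two loops compare every sequence with every other one, so with the 59,000 sequences mentioned in the question this is roughly 3.5 billion substring checks and will likely be slow on the full data.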

    Note some discrepancies with your expected output:

    1. FGH overlaps with ABCDEFGHIJKLMNO and is thus not a valid output.
    2. FEOEUDNBNUWD is not an original sequence. Post-processing is needed for overlapping sequences.
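
    One possible form of that post-processing, sketched under assumptions: join two sequences when a suffix of the first equals a prefix of the second. The helper merge_overlap and its min_overlap threshold are hypothetical names introduced here; the answer above does not specify how overlaps should be merged, so treat this as an illustration rather than the intended method.

```python
def merge_overlap(a, b, min_overlap=3):
    """Join a and b on the longest suffix of a that is also a prefix of b.

    Returns None when the shared part is shorter than min_overlap
    (min_overlap is an assumed threshold, not taken from the question).
    """
    # Try the largest possible overlap first and shrink until one matches.
    for size in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:size]):
            return a + b[size:]
    return None


print(merge_overlap("FEOEUDNBNUW", "EOEUDNBNUWD"))
print(merge_overlap("CEST", "DBTSFDEO"))
```

    The first call joins the two sequences from the output above into FEOEUDNBNUWD, the merged string mentioned in the second note; the second call returns None because the two sequences share no suffix/prefix overlap of at least three characters.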
