Remove item from list based on the next item in same list

悲&欢浪女 2021-02-18 17:08

I just started learning Python. I have a sorted list of protein sequences (59,000 in total), and some of them overlap. I have made a toy list here as an example:

11 Answers
  • 2021-02-18 17:44

    A simple way is to process the input file one line at a time, compare each line with the previous one, and keep the previous one only if it is not contained in the current one.

    Code can be as simple as:

    with open('toy.txt', 'r') as f:
        old = next(f).strip()               # keep first line after stripping EOL
    
        for pattern in f:
            pattern = pattern.strip()       # strip end of line...
            if old not in pattern:
                print(old)                  # keep old if it is not contained in current line
            old = pattern                   # and store current line for next iteration
        print(old)                          # do not forget last line
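
The same idea can be sketched for a list already held in memory. The function name `drop_contained_in_next` and the variable `toy` are made up for this illustration; the toy values are the ones used in the other answers on this page.

```python
def drop_contained_in_next(seqs):
    """Yield each item unless it is a substring of the item that follows it."""
    iterator = iter(seqs)
    try:
        previous = next(iterator)
    except StopIteration:
        return  # empty input: nothing to yield
    for current in iterator:
        if previous not in current:
            yield previous
        previous = current
    yield previous  # the last item has no successor, so it is always kept


toy = ["ABCDE", "ABCDEFG", "ABCDEFGH", "ABCDEFGHIJKLMNO", "CEST",
       "DBTSFDE", "DBTSFDEO", "EOEUDNBNUW", "EAEUDNBNUW", "FG", "FGH"]
result = list(drop_contained_in_next(toy))
print(result)
# ['ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FGH']
```

Because it is a generator, it only ever holds two items at a time, which should help with a file of 59,000 sequences.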
    
  • 2021-02-18 17:45
    with open('demo.txt') as f:
        lines = [line.strip() for line in f]    # strip newlines once, up front
    
    n_lst = []
    
    for i, line in enumerate(lines):
        # The last line has no successor, so it is always kept.
        if i == len(lines) - 1 or line not in lines[i + 1]:
            n_lst.append(line)
    
    print(n_lst)
    

    Output

    ['ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FGH']
    
  • 2021-02-18 17:45

    This will get you where you want to be:

    with open('toy.txt', 'r') as f:
        lines = [line.strip() for line in f]    # strip newlines so startswith() can match
        data = set(lines)
        print(sorted([i for i in lines if len([j for j in data if j.startswith(i)]) == 1]))
    
    #['ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDEO', 'EAEUDNBNUW', 'EOEUDNBNUW', 'FGH']
    

    I've added a set in case the same text occurs more than once.

  • 2021-02-18 17:48
    # assuming list is sorted:
    pattern = ["ABCDE",
    "ABCDEFG",
    "ABCDEFGH",
    "ABCDEFGHIJKLMNO",
    "CEST",
    "DBTSFDE",
    "DBTSFDEO",
    "EOEUDNBNUW",
    "EAEUDNBNUW",
    "FG",
    "FGH"]
    
    pattern = list(reversed(pattern))
    
    def iterate_patterns():
        while pattern:
            i = pattern.pop()
            throw_it_away = False
            for p in pattern:
                if p.startswith(i):
                    throw_it_away = True
                    break
            if not throw_it_away:
                yield i
    
    print(list(iterate_patterns()))
    

    Output:

    ['ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FGH']

  • 2021-02-18 17:52

    Code

    import collections as ct
    
    
    def read_file(filepath):
        """Yield stripped lines from a file."""
        with open(filepath, "r") as f:
            for line in f:
                yield line.strip()
    
    
    def find_longest_sequences(seqs):
        """Return a set of the longest sequences that contain each sequence."""
        seqs = tuple(seqs)
        dd = ct.defaultdict(list)
        for seq in seqs:
            for k in seqs:
                if k in seq:
                    dd[k].append(seq)    # group seq under every key it contains
        return {max(v, key=len) for v in dd.values()}
    
    
    data = read_file("test.txt")
    find_longest_sequences(data)
    

    Output

    {'ABCDEFGHIJKLMNO',
     'CEST',
     'DBTSFDEO',
     'EAEUDNBNUW',
     'EOEUDNBNUWD',
     'FEOEUDNBNUW'}
    

    Details

    We use read_file to yield each line of the file.

    find_longest_sequences builds a defaultdict that groups each sequence with the sequences that contain it. It iterates over the data with two nested loops:

    1. The loop over keys gives every unique sequence its own list in the dict.
    2. The loop over sequences appends to that list every string that contains the key.

    The longest string in each list is then collected into a set, which is returned.

    Note some discrepancies with your expected output:

    1. FGH overlaps with ABCDEFGHIJKLMNO and is thus not a valid output.
    2. FEOEUDNBNUWD is not an original sequence. Post-processing is needed for overlapping sequences.
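
The first discrepancy comes down to substring matching versus prefix matching. A minimal check, using only built-in string operations (the variable names are made up for this illustration):

```python
short, long_seq = "FGH", "ABCDEFGHIJKLMNO"

is_substring = short in long_seq          # True: "FGH" occurs at index 5
is_prefix = long_seq.startswith(short)    # False: the long sequence starts with "ABC"
print(is_substring, is_prefix)            # True False
```

Answers that compare with `startswith`, or only against the next item, therefore keep FGH, while this answer drops it.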
  • 2021-02-18 17:54

    There are other working answers, but none of them explains your actual problem. You were actually very close to a valid solution, and to what is, in my opinion, the most readable answer.

    The error came from mutating the list while indexing into it with range().

    As i increased, you kept removing items from the list, so at some point the index inevitably ran past the end of the list and raised an IndexError.

    Therefore, here is a working version of your initial code with some changes:
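
The failure can be reproduced with a small sketch. The original code is not shown on this page, so `remove_in_place` below is only a hypothetical reconstruction of that pattern, not the OP's exact code.

```python
def remove_in_place(items):
    """Hypothetical reconstruction: delete items while indexing over the original length."""
    for i in range(len(items) - 1):      # the range is computed once, from the original length
        if items[i] in items[i + 1]:
            del items[i]                 # the list shrinks, but the range does not
    return items


toy = ["ABCDE", "ABCDEFG", "ABCDEFGH", "ABCDEFGHIJKLMNO", "CEST"]
try:
    remove_in_place(toy)
    error = None
except IndexError as exc:
    error = exc

print(type(error).__name__)  # IndexError
```

After two deletions the list has three items, yet the loop still asks for `items[3]`. Each deletion also shifts the remaining items left, so the item that moves into position i is never checked.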

    pattern = ["ABCDE","ABCDEFG","ABCDEFGH","ABCDEFGHIJKLMNO","CEST","DBTSFDE","DBTSFDEO","EOEUDNBNUW","EAEUDNBNUW","FG","FGH"]
    output_pattern = []
    
    
    for i in range(len(pattern) - 1):
        if pattern[i] not in pattern[i + 1]:
            output_pattern.append(pattern[i])
    
    # Adding the last item
    output_pattern.append(pattern[-1])
    print(output_pattern)
    
    # Output:
    # ['ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FGH']
    

    Note that this code works only if your list is already sorted, as you mentioned in the comment section.

    What is this code doing?

    Basically, it uses the same logic as your initial code: it iterates over the list and checks whether the next item contains the current item. Building a separate list and iterating only up to the second-to-last item fixes your index problem. But that raises a question:

    What should I do with the last item?

    Since the list is sorted, you can consider the last item as always being unique. This is why I'm using

    output_pattern.append(pattern[-1])
    

    which adds the last item of the initial list.
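
As a more compact variant of the same logic, the neighbouring pairs can be produced with zip. This is only a sketch: the unsorted input and the names `ordered` and `keep` are made up for the illustration, and the list is sorted first because the comparison relies on that.

```python
pattern = ["FG", "ABCDE", "CEST", "ABCDEFG", "FGH"]   # deliberately unsorted toy input

# Sorting puts a string directly before the longer strings that start with it.
ordered = sorted(pattern)

# Pair each item with its successor, keep it if the successor does not contain it,
# then add the last item, which has no successor. The slice is safe on an empty list.
keep = [a for a, b in zip(ordered, ordered[1:]) if a not in b] + ordered[-1:]
print(keep)  # ['ABCDEFG', 'CEST', 'FGH']
```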

    Important note

    This answer was written in response to the OP's initial question, which asked to keep the longer overlap and, I quote, "based on the next item in same list". As stated by @Chris_Rands, if your concern is a biological task and you need to find any overlap, this solution is not suited to your needs.

    An example where this code would fail to recognize a potential overlap:

    pattern = ["ACD", "AD", "BACD"]
    

    Here it would output the list unchanged, without removing "ACD", which overlaps with "BACD". To be clear, handling this would require a much more complex algorithm, which I initially thought was outside the scope of the question. If this is your case, I may be completely wrong here, but I think a C++ implementation would be more appropriate. Have a look at the CD-Hit algorithm suggested by @Chris_Rands in the comment section.
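
For small inputs, a brute-force check against every other sequence does catch this case. The sketch below is not CD-Hit and makes no attempt at its performance; `drop_contained_anywhere` is a hypothetical helper name.

```python
def drop_contained_anywhere(seqs):
    """Keep only sequences that are not a substring of any other distinct sequence."""
    unique = list(dict.fromkeys(seqs))  # de-duplicate while preserving input order
    return [s for s in unique
            if not any(s != other and s in other for other in unique)]


pattern = ["ACD", "AD", "BACD"]
result = drop_contained_anywhere(pattern)
print(result)  # ['AD', 'BACD']
```

This compares every pair, so 59,000 sequences would need roughly 3.5 billion substring checks. That is likely too slow in pure Python, which is why a dedicated tool may be the better choice.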
