Remove item from list based on the next item in same list

悲&欢浪女 2021-02-18 17:08

I just started learning Python. I have a sorted list of protein sequences (59,000 in total), and some of them overlap. Here is a toy list as an example:

11 answers
  • 2021-02-18 17:54

    You could use groupby() and max() to help here:

    from itertools import groupby
    
    with open('toy.txt') as f_input:
        # Group consecutive lines that share the same first two characters
        for key, group in groupby(f_input, lambda x: x[:2]):
            # Keep only the longest line of each group
            print(max(group, key=len).strip())
    

    This would display:

    ABCDEFGHIJKLMNO
    CEST
    DBTSFDEO
    EOEUDNBNUW
    EAEUDNBNUW
    FGH
    

    groupby() walks the input once and yields (key, group) pairs, where each group is an iterator over consecutive items that produce the same key, in this case consecutive lines with the same first 2 characters. The max() function then consumes each group and returns its longest line.
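
    To make the grouping behaviour concrete, here is a minimal sketch on an in-memory list. The sample data and the helper name longest_per_prefix are made up for illustration and are not from the question. Note that any two sequences sharing their first two characters are merged, even when neither is a prefix of the other, so this is a heuristic rather than a true prefix test.

```python
from itertools import groupby

# Hypothetical stand-in for the lines of toy.txt
lines = ["ABCDE", "ABCDEFG", "CEST", "DBTSFDE", "DBTSFDEO", "FG", "FGH"]


def longest_per_prefix(items, prefix_len=2):
    """Keep the longest item from each run of consecutive items sharing a prefix."""
    return [
        max(group, key=len)
        for _, group in groupby(items, key=lambda s: s[:prefix_len])
    ]


longest = longest_per_prefix(lines)
print(longest)

# groupby only merges adjacent items with equal keys:
# the key "AB" shows up in two separate runs here.
runs = [
    (key, list(group))
    for key, group in groupby(["ABC", "CD", "ABCD"], key=lambda s: s[:2])
]
print(runs)
```

    Because only adjacent items are grouped, the input has to be sorted (or at least clustered by prefix) for this to work.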

  • 2021-02-18 17:57

    As stated in other answers, your error comes from calculating the length of your input at the start and then not updating it as you shorten the list.
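
    The question's code is not reproduced in this thread, so the loop below is only a hypothetical reconstruction of that kind of failure: the length is captured once, the list shrinks, and the index runs past the end.

```python
def remove_with_stale_length(items):
    """Hypothetical reconstruction of the failing pattern, not the asker's code."""
    items = list(items)  # work on a copy
    n = len(items)       # length captured once, before any removal
    for i in range(n):
        # items gets shorter than n as entries are removed, and i + 1 also
        # overruns on the last index, so this lookup eventually fails
        if items[i] in items[i + 1]:
            items.remove(items[i])
    return items


try:
    remove_with_stale_length(["ABCDE", "ABCDEFG", "CEST"])
    raised = False
except IndexError:
    raised = True

print(raised)
```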

    Here's another take at a working solution:

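    with open('toy.txt', 'r') as infile:
        # Strip each line, then walk the list from last to first.
        # (reversed() cannot take a map object in Python 3, so build a list.)
        input_lines = [line.strip() for line in reversed(infile.readlines())]
    
    output = []
    for pattern in input_lines:
        # Keep a line unless the previously kept (longer) line starts with it
        if not output or not output[-1].startswith(pattern):
            output.append(pattern)
    
    print('\n'.join(reversed(output)))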
    
  • 2021-02-18 18:01

    This does not match your expected output exactly. You say the list is sorted, but it is not (see EOEUDNBNUWD followed by EAEUDNBNUW), and I don't see why EOEUDNBNUWD is missing from your result. So either your expectations are misstated or I have misread the question.

    (Ah, yes: I see that the notion of overlap throws a wrench into the sort-and-startswith approach.)

    It would help if the OP restated that particular aspect. I first read @DSM's comment without really understanding the concern. Now I do.

    li = sorted([i.strip() for i in """
    ABCDE
    ABCDEFG
    ABCDEFGH
    ABCDEFGHIJKLMNO
    CEST
    DBTSFDE
    DBTSFDEO
    EOEUDNBNUW
    EOEUDNBNUWD
    EAEUDNBNUW
    FEOEUDNBNUW
    FG
    FGH""".splitlines() if i.strip()])
    
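    def get_iter(li):
        # Assumes li is sorted, so every prefix sits directly before its extensions
        prev = ""
        for i in li:
            # prev is final once the next item no longer extends it
            if not i.startswith(prev):
                yield prev
            prev = i
        # The last item is never extended by anything
        yield prev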
    
    for v in get_iter(li):
        print(v)
    

    output:

    ABCDEFGHIJKLMNO
    CEST
    DBTSFDEO
    EAEUDNBNUW
    EOEUDNBNUWD
    FEOEUDNBNUW
    FGH
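
    The thread does not settle whether "overlap" means prefix or substring. If it means that one sequence occurs anywhere inside another, a prefix check misses cases such as EOEUDNBNUW inside FEOEUDNBNUW. A minimal sketch of a substring-based filter follows. The function name drop_contained and the sample list are made up for illustration, and the pairwise comparison is quadratic in the worst case, so it may be slow on 59,000 sequences.

```python
def drop_contained(sequences):
    """Return the sequences that do not occur inside any other sequence."""
    # Longest first: a sequence can only be contained in one at least as long
    unique = sorted(set(sequences), key=len, reverse=True)
    kept = []
    for seq in unique:
        # Checking against kept is enough, because anything dropped earlier
        # is itself contained in a kept sequence
        if not any(seq in longer for longer in kept):
            kept.append(seq)
    return sorted(kept)


sample = ["CEST", "EOEUDNBNUW", "FEOEUDNBNUW"]

# The prefix test misses the overlap, the substring test catches it
print("FEOEUDNBNUW".startswith("EOEUDNBNUW"))
print("EOEUDNBNUW" in "FEOEUDNBNUW")
print(drop_contained(sample))
```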
    
  • 2021-02-18 18:06

    Kenny, you almost got it, but there are two problems, which @scharette pointed out:

    1. A for loop and removing list items should not go together. The fix is to use a while loop and advance the index explicitly. The while loop is less efficient because it calls len() on every iteration instead of once, but that is what it takes to get the correct result.
    2. The IndexError. This only happens at the very last line. My way of dealing with it is to ignore the error.

    With that, I modified your code to:

    with open('toy.txt', 'r') as f:
        pattern = f.read().splitlines()
    print(pattern)
    
    try:
        i = 0
        while i < len(pattern):
            if pattern[i] in pattern[i + 1]:
                # The next item slides into index i, so do not advance
                del pattern[i]
            else:
                i += 1
            print(pattern)
    except IndexError:
        # pattern[i + 1] does not exist for the last item
        pass
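
    As an alternative to mutating the list while walking it, the same neighbour comparison can build a new list instead, which needs neither index bookkeeping nor the IndexError handler. This is only a sketch: the function name drop_if_in_next and the sample data are invented for illustration.

```python
def drop_if_in_next(items):
    """Keep an item only if it is not contained in the item that follows it."""
    # Pair every item with its successor; the last item has no pair
    kept = [cur for cur, nxt in zip(items, items[1:]) if cur not in nxt]
    if items:
        kept.append(items[-1])  # nothing follows the last item, so keep it
    return kept


sample = ["ABCDE", "ABCDEFG", "ABCDEFGH", "CEST", "FG", "FGH"]
print(drop_if_in_next(sample))
```

    Like the loop above, this only compares direct neighbours, so it relies on the input being sorted.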
    
  • 2021-02-18 18:09

    You can use a binary tree whose insertion step replaces a node's value whenever that value is a prefix of the new one, and otherwise descends left or right by string comparison:

    class Tree:
      def __init__(self, val=None):
        self.left, self.value, self.right = None, val, None
      def insert_val(self, _val):
        if self.value is None or _val.startswith(self.value):
           self.value = _val
        else:
           if _val < self.value:
              getattr(self.left, 'insert_val', lambda x:setattr(self, 'left', Tree(x)))(_val)
           else:
              getattr(self.right, 'insert_val', lambda x:setattr(self, 'right', Tree(x)))(_val)
      def flatten(self):
         return [*getattr(self.left, 'flatten', lambda :[])(), self.value, *getattr(self.right, 'flatten', lambda :[])()]
    
    t = Tree()
    for i in open('filename.txt'):
      t.insert_val(i.strip('\n'))
    print(t.flatten())
    

    Output:

    ['ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDEO', 'EAEUDNBNUW', 'EOEUDNBNUW', 'FGH']
    