Remove item from list based on the next item in same list

I just started learning Python, and I have a sorted list of protein sequences (59,000 in total), some of which overlap. I have made a toy list as an example:

11 Answers

青春惊慌失措

2021-02-18 17:54
You could use groupby() and max() to help here:
```
from itertools import groupby

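with open('toy.txt') as f_input:
    # Strip newlines first so a missing final newline cannot skew the lengths.
    lines = (line.strip() for line in f_input)
    # Group consecutive lines by their first two characters.
    for key, group in groupby(lines, lambda x: x[:2]):
        # Print the longest line of each group.
        print(max(group, key=len))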
```
This would display:
```
ABCDEFGHIJKLMNO
CEST
DBTSFDEO
EOEUDNBNUW
EAEUDNBNUW
FGH
```
groupby() yields each run of consecutive lines for which the key function returns the same value, in this case lines that share the same first 2 characters. max() then returns the longest line in each run. This treats a shared two-character start as a stand-in for overlap, so it relies on the input being sorted.
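To make the grouping visible, here is a minimal sketch that runs the same idea on an in-memory list. The helper name `longest_per_prefix`, its `width` parameter, and the sample data are assumptions made for illustration; they are not part of the question.

```python
from itertools import groupby


def longest_per_prefix(lines, width=2):
    """Keep the longest line from each run of consecutive lines that
    share the same first `width` characters."""
    result = []
    for _, group in groupby(lines, key=lambda s: s[:width]):
        # `group` is a one-shot iterator over the run; max() consumes it.
        result.append(max(group, key=len))
    return result


# Hypothetical sample data, assumed for illustration.
lines = ["ABCDE", "ABCDEFG", "CEST", "DBTSFDE", "DBTSFDEO", "FG", "FGH"]
print(longest_per_prefix(lines))
# ['ABCDEFG', 'CEST', 'DBTSFDEO', 'FGH']
```

Because only the first `width` characters are compared, two unrelated sequences such as `ABXX` and `ABYYY` would land in the same group and the shorter one would be dropped, so this is a heuristic rather than a true prefix check.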

遇见更好的自我

2021-02-18 17:57

As stated in other answers, your error comes from calculating the length of your input at the start and then not updating it as you shorten the list.
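The failure mode can be reproduced with a short sketch. The question's code is not shown here, so `drop_if_in_next_stale` is only an assumed reconstruction of a loop whose range is fixed before the list shrinks; `drop_if_in_next` is a hypothetical variant that re-checks the bound on every pass.

```python
def drop_if_in_next_stale(items):
    """Assumed reconstruction of the failing loop: the range is fixed up front."""
    items = list(items)
    for i in range(len(items) - 1):   # length is evaluated once, here
        if items[i] in items[i + 1]:  # fails once the list has shrunk
            del items[i]
    return items


def drop_if_in_next(items):
    """Same idea, but the bound is re-checked on every pass."""
    items = list(items)
    i = 0
    while i < len(items) - 1:
        if items[i] in items[i + 1]:
            del items[i]              # stay at i: a new item slid into place
        else:
            i += 1
    return items


sample = ["ABCDE", "ABCDEFG", "ABCDEFGH", "CEST"]
try:
    drop_if_in_next_stale(sample)
except IndexError as exc:
    print("stale length:", exc)
print(drop_if_in_next(sample))  # ['ABCDEFGH', 'CEST']
```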

Here's another take at a working solution:

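```
with open('toy.txt', 'r') as infile:
    # Strip newlines, then walk the sorted lines from last to first so that
    # each longer sequence is seen before its prefixes.
    # (reversed(map(...)) fails on Python 3, where map() returns an iterator.)
    input_lines = [line.strip() for line in infile][::-1]

output = []
for pattern in input_lines:
    # Skip a line if the most recently kept line starts with it.
    if not output or not output[-1].startswith(pattern):
        output.append(pattern)

print('\n'.join(reversed(output)))
```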


一个人的身影

2021-02-18 18:01
This does not exactly match your expected output. You say the list is sorted, but it is not (see EOEUDNBNUWD followed by EAEUDNBNUW), and I don't see why EOEUDNBNUWD is missing from your expected result, so either your expectations are misstated or I have misread the question.

(Ah, yes: I see that the notion of overlap throws a wrench into the sort-and-startswith approach.)

It would help if the OP restated that particular aspect. I first read @DSM's comment without really understanding the concern. Now I do.
```
li = sorted([i.strip() for i in """
ABCDE
ABCDEFG
ABCDEFGH
ABCDEFGHIJKLMNO
CEST
DBTSFDE
DBTSFDEO
EOEUDNBNUW
EOEUDNBNUWD
EAEUDNBNUW
FEOEUDNBNUW
FG
FGH""".splitlines() if i.strip()])

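def get_iter(li):
    # Assumes li is sorted, so a sequence directly precedes its extensions.
    prev = ""
    for i in li:
        # i does not extend prev, so prev is a longest sequence: emit it.
        if not i.startswith(prev):
            yield prev
        prev = i
    yield prev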

for v in get_iter(li):
    print(v)
```
output:
```
ABCDEFGHIJKLMNO
CEST
DBTSFDEO
EAEUDNBNUW
EOEUDNBNUWD
FEOEUDNBNUW
FGH
```
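The difference between "is a prefix of" and "overlaps anywhere" can be sketched as follows. Both function names and the sample list are assumptions made for illustration, and `drop_substrings` is only one possible reading of "overlap" (full containment), not necessarily what the OP meant.

```python
def drop_prefixes(seqs):
    """Drop every sequence that is a prefix of another (sort + startswith)."""
    kept = []
    for s in sorted(set(seqs)):
        if kept and s.startswith(kept[-1]):
            kept[-1] = s  # s extends the previous entry, so replace it
        else:
            kept.append(s)
    return kept


def drop_substrings(seqs):
    """Drop every sequence contained anywhere inside another one.
    Quadratic in the number of sequences, so only a sketch for small inputs."""
    unique = set(seqs)
    return sorted(s for s in unique
                  if not any(s != t and s in t for t in unique))


# Hypothetical sample data, assumed for illustration.
seqs = ["ABCDEFGH", "FGH", "EOEUDNBNUW", "FEOEUDNBNUW"]
print(drop_prefixes(seqs))    # ['ABCDEFGH', 'EOEUDNBNUW', 'FEOEUDNBNUW', 'FGH']
print(drop_substrings(seqs))  # ['ABCDEFGH', 'FEOEUDNBNUW']
```

Sorting places a sequence directly before its extensions, which is why the prefix check only needs to look at the previous entry; substring containment has no such ordering, hence the pairwise comparison.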
猫巷女王i

2021-02-18 18:06
Kenny, you almost got it, but there are two problems, which @scharette pointed out:
1. A for loop and removing list items should not go together. The fix is to use a while loop and increment the index explicitly. The while loop is less efficient because it calls len() on every iteration instead of once, but that is what it takes to get the correct result.
2. The IndexError. This only happens at the very last line. My way of dealing with this problem is to ignore the error.
With that, I modified your code to:
```
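with open('toy.txt', 'r') as f:
    pattern = f.read().splitlines()
    print(pattern)

    try:
        i = 0
        while i < len(pattern):
            if pattern[i] in pattern[i + 1]:
                # Delete by index and stay at i, so the item that slid into
                # position i is compared with its own successor next.
                del pattern[i]
            else:
                i += 1
            print(pattern)
    except IndexError:
        # Raised at the last item, which has no pattern[i + 1].
        pass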
```
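An alternative that sidesteps both problems is to build a new list instead of removing items from the one being traversed. This is a minimal sketch, not the code from the question; the function name `keep_unless_in_next` and the sample list are assumptions made for illustration.

```python
def keep_unless_in_next(items):
    """Return a new list without the items contained in their successor."""
    # Pair each item with the one after it; the last item is paired with None.
    successors = list(items[1:]) + [None]
    return [cur for cur, nxt in zip(items, successors)
            if nxt is None or cur not in nxt]


# Hypothetical sample data, assumed for illustration.
pattern = ["ABCDE", "ABCDEFG", "ABCDEFGH", "CEST", "FG", "FGH"]
print(keep_unless_in_next(pattern))  # ['ABCDEFGH', 'CEST', 'FGH']
```

Since nothing is deleted during the traversal, no index can go stale and there is no IndexError to catch.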

长发绾君心

2021-02-18 18:09

You can use a binary search tree whose insert step overwrites a node's value whenever the new value starts with (that is, extends) the value already stored there:

```
class Tree:
    def __init__(self, val=None):
        self.left, self.value, self.right = None, val, None

    def insert_val(self, val):
        # A value that extends this node's value replaces it in place.
        if self.value is None or val.startswith(self.value):
            self.value = val
        elif val < self.value:
            if self.left is None:
                self.left = Tree(val)
            else:
                self.left.insert_val(val)
        else:
            if self.right is None:
                self.right = Tree(val)
            else:
                self.right.insert_val(val)

    def flatten(self):
        # In-order traversal: left subtree, this node, right subtree.
        left = self.left.flatten() if self.left is not None else []
        right = self.right.flatten() if self.right is not None else []
        return [*left, self.value, *right]

t = Tree()
with open('filename.txt') as f:
    for line in f:
        t.insert_val(line.strip('\n'))
print(t.flatten())
```

Output:
```
['ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDEO', 'EAEUDNBNUW', 'EOEUDNBNUW', 'FGH']
```
