I just started learning Python, and I have a sorted list of protein sequences (59,000 sequences in total), some of which overlap. I have made a toy list here as an example:
You could use groupby() and max() to help here:
from itertools import groupby

with open('toy.txt') as f_input:
    for key, group in groupby(f_input, lambda x: x[:2]):
        print(max(group, key=lambda x: len(x)).strip())
This would display:
ABCDEFGHIJKLMNO
CEST
DBTSFDEO
EOEUDNBNUW
EAEUDNBNUW
FGH
groupby() works by yielding groups of consecutive items that share the same key, where the key is computed by a function; in this case, consecutive lines with the same first 2 characters. The max() function then takes each group and returns the item with the greatest length.
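To make the grouping step visible, here is a minimal sketch that runs the same idea on an in-memory list instead of a file. The sample data and the names `lines`, `groups` and `longest` are made up for illustration. Note that grouping on the first two characters is only a stand-in for "overlap": it assumes overlapping sequences always share their first two characters.

```python
from itertools import groupby

# Hypothetical, already-sorted sample data (not the OP's real file).
lines = ["ABCDE", "ABCDEFG", "CEST", "DBTSFDE", "DBTSFDEO", "FG", "FGH"]

# groupby() yields (key, group) pairs; each group is an iterator over
# CONSECUTIVE items with the same key, so it is turned into a list here
# to make it printable and reusable.
groups = [(key, list(group)) for key, group in groupby(lines, key=lambda s: s[:2])]

# Keep the longest member of every group.
longest = [max(members, key=len) for _, members in groups]

print(groups[0])  # ('AB', ['ABCDE', 'ABCDEFG'])
print(longest)    # ['ABCDEFG', 'CEST', 'DBTSFDEO', 'FGH']
```

Because groupby() only merges consecutive items, the input has to be sorted (or at least clustered by key) for this to work.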
As stated in other answers, your error comes from calculating the length of your input at the start and then not updating it as you shorten the list.
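The original code is not reproduced here, so the following is only a guess at the pattern being described: the length is read once, the list shrinks inside the loop, and the index eventually runs past the end. The function names and the sample list are hypothetical.

```python
def drop_prefixes_buggy(items):
    """Assumed shape of the bug: length computed once, list shortened in the loop."""
    items = list(items)
    n = len(items)              # computed once, never updated
    for i in range(n - 1):
        if items[i + 1].startswith(items[i]):
            items.pop(i)        # the list is now shorter than n
    return items


def drop_prefixes(items):
    """Build a new list instead of mutating the one being indexed."""
    kept = []
    for i, current in enumerate(items):
        is_last = i == len(items) - 1
        # Keep an item unless the next item extends it.
        if is_last or not items[i + 1].startswith(current):
            kept.append(current)
    return kept


sample = ["AB", "ABC", "ABCD", "X"]

try:
    drop_prefixes_buggy(sample)
    outcome = "no error"
except IndexError:
    outcome = "IndexError"

print(outcome)                # IndexError
print(drop_prefixes(sample))  # ['ABCD', 'X']
```

Collecting the survivors in a separate list sidesteps the problem entirely, because the list being indexed never changes size.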
Here's another take at a working solution:
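with open('toy.txt', 'r') as infile:
    # Build a list first: in Python 3, reversed() cannot be applied to a map object.
    input_lines = reversed([s.strip() for s in infile])

output = []
for pattern in input_lines:
    # Walking backwards, keep a line only if the last kept line does not start with it.
    if len(output) == 0 or not output[-1].startswith(pattern):
        output.append(pattern)
print('\n'.join(reversed(output)))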
This is not an exact match with your expectations. However, you state that the list is sorted (and it's not, near EOEUDNBNUWD EAEUDNBNUW), and I don't know why you're missing EOEUDNBNUWD, so I am not sure whether your expectations are correctly stated or whether I've misread your question.
(Ah, yes, I see that the notion of overlap throws a wrench into the sort and startswith approach.)
It might be nice for the OP to restate that particular aspect. I read @DSM's comment without really understanding his concern. Now I do.
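To illustrate why "overlap" is a problem for a sort-and-startswith approach, here is a small sketch. It assumes that "overlap" means one sequence being contained anywhere inside another, which the question does not confirm. The helper `drop_contained` and the sample list are hypothetical.

```python
a, b = "EOEUDNBNUW", "FEOEUDNBNUW"

print(b.startswith(a))  # False: a is not a prefix of b
print(a in b)           # True: a still occurs inside b


def drop_contained(seqs):
    """Drop every sequence that occurs inside a longer kept one (quadratic time)."""
    kept = []
    # Longest first, so any possible container is already in `kept`.
    for s in sorted(set(seqs), key=len, reverse=True):
        if not any(s in longer for longer in kept):
            kept.append(s)
    return sorted(kept)


sample = ["EOEUDNBNUW", "EOEUDNBNUWD", "FEOEUDNBNUW", "CEST"]
print(drop_contained(sample))  # ['CEST', 'EOEUDNBNUWD', 'FEOEUDNBNUW']
```

This compares every sequence against every kept sequence, so it may be slow on 59,000 inputs. It is meant to show the difference in semantics, not to be an efficient solution.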
li = sorted([i.strip() for i in """
ABCDE
ABCDEFG
ABCDEFGH
ABCDEFGHIJKLMNO
CEST
DBTSFDE
DBTSFDEO
EOEUDNBNUW
EOEUDNBNUWD
EAEUDNBNUW
FEOEUDNBNUW
FG
FGH""".splitlines() if i.strip()])
def get_iter(li):
    prev = ""
    for i in li:
        # If i does not extend prev, prev is the longest item of its run.
        if not i.startswith(prev):
            yield prev
        prev = i
    yield prev

for v in get_iter(li):
    print(v)
Output:
ABCDEFGHIJKLMNO
CEST
DBTSFDEO
EAEUDNBNUW
EOEUDNBNUWD
FEOEUDNBNUW
FGH
Kenny, you almost got it, but there are two problems, which @scharette pointed out:
1. A for loop and removing list items should not go together. The fix is to use a while loop and explicitly increase the index. The while loop is less efficient because it calls len() on every iteration instead of once, but that's what it takes to get the correct result.
2. IndexError. This only happens at the very last line. My way to deal with this problem is to ignore the error.
With that, I modified your code to:
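with open('toy.txt', 'r') as f:
    pattern = f.read().splitlines()
print(pattern)

try:
    i = 0
    while i < len(pattern):
        if pattern[i] in pattern[i+1]:
            pattern.remove(pattern[i])
            print(pattern)
        else:
            # Advance only when nothing was removed, so the item that moved into slot i is checked too.
            i += 1
except IndexError:
    # pattern[i+1] does not exist once i reaches the last item; nothing is left to compare.
    pass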
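As a variation on the same while-loop idea, the sketch below stops one item before the end, so the IndexError never occurs and no try/except is needed. The function name `remove_contained_in_next` and the sample list are made up for illustration. Like the code above, it uses `in`, so it removes an item whenever it occurs anywhere inside its successor.

```python
def remove_contained_in_next(items):
    """Remove each item that occurs inside the item right after it."""
    items = list(items)          # work on a copy; the caller's list is untouched
    i = 0
    # len(items) is re-evaluated on every pass, so shrinking the list is safe.
    while i < len(items) - 1:
        if items[i] in items[i + 1]:
            del items[i]         # do not advance: a new item now sits at index i
        else:
            i += 1
    return items


sample = ["ABCDE", "ABCDEFG", "ABCDEFGH", "CEST", "FG", "FGH"]
print(remove_contained_in_next(sample))  # ['ABCDEFGH', 'CEST', 'FGH']
```

The index only moves forward when nothing was deleted, which is what keeps consecutive overlapping items from being skipped.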
You can use a binary tree whose insertion process replaces a node's value whenever the new value starts with it:
class Tree:
    def __init__(self, val=None):
        self.left, self.value, self.right = None, val, None

    def insert_val(self, _val):
        # Replace the node's value when the node is empty or the new value extends it.
        if self.value is None or _val.startswith(self.value):
            self.value = _val
        else:
            if _val < self.value:
                getattr(self.left, 'insert_val', lambda x:setattr(self, 'left', Tree(x)))(_val)
            else:
                getattr(self.right, 'insert_val', lambda x:setattr(self, 'right', Tree(x)))(_val)

    def flatten(self):
        # In-order traversal: left subtree, this node, right subtree.
        return [*getattr(self.left, 'flatten', lambda :[])(), self.value, *getattr(self.right, 'flatten', lambda :[])()]

t = Tree()
with open('filename.txt') as f:
    for i in f:
        t.insert_val(i.strip('\n'))
print(t.flatten())
Output:
['ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDEO', 'EAEUDNBNUW', 'EOEUDNBNUW', 'FGH']