Question
If you had a list of names . . .
query = ['link','zelda','saria','ganon','volvagia']
and a list of lines from a file
data = ['>link is the first','OIGFHFH','AGIUUIIUFG','>peach is the second',
'AGFDA','AFGDSGGGH','>luigi is the third','SAGSGFFG','AFGDFGDFG',
'DSGSFGAAA','>ganon is the fourth','ADGGHHHHHH','>volvagia is the last',
'AFGDAAFGDA','ADFGAFD','ADFDFFDDFG','AHUUERR','>ness is another','ADFGGGGH',
'HHHDFDA']
How could you look at all the lines that start with '>' and, if a line contains one of the names in query, keep that '>' line together with the sequences that follow it (the sequences are always uppercase), in two separate lists?
#example output file
name_list = ['>link is the first','>ganon is the fourth','>volvagia is the last']
seq_list = ['OIGFHFHAGIUUIIUFG','ADGGHHHHHH','AFGDAAFGDAADFGAFDADFDFFDDFGAHUUERR']
I would rather not use a dictionary for this, as I've been prompted to do in similar situations.
So what I have so far is:
import re
for line, name in zip(data, query):
    if line[0] == '>' and re.search(name, line):
        # but then I'm stuck because len(query) and len(data) are not equal
        pass
Any help would be greatly appreciated.
Answer 1:
result = []
names = ['link', 'zelda', 'saria', 'ganon', 'volvagia']
lines = iter(data)
for line in lines:
    while line.startswith(">") and any(name in line for name in names):
        name = line
        upper_seq = []
        for line in lines:
            if not line.isupper():
                break
            upper_seq.append(line)
        else:
            line = ""  # guard against infinite loop at EOF
        result.append((name, ''.join(upper_seq)))
If there are many names, then a set() might be faster for finding names in a line than any(...):
names = set(names)
# ...
if line.startswith(">") and names.intersection(line[1:].split()):
    # ...
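Note that the two checks are not strictly equivalent: `name in line` is a substring test, while the set intersection only matches whole whitespace-separated words. A small sketch of the difference follows. The helper names `matches_substring` and `matches_word` and the header `>linked is a decoy` are made up for illustration and are not from the question.

```python
names = {'link', 'zelda', 'saria', 'ganon', 'volvagia'}

def matches_substring(line):
    # True if any name occurs anywhere in the header, even inside another word
    return line.startswith(">") and any(name in line for name in names)

def matches_word(line):
    # True only if a name appears as a whole whitespace-separated word
    return line.startswith(">") and bool(names.intersection(line[1:].split()))

real = '>link is the first'
decoy = '>linked is a decoy'  # hypothetical line, for illustration only

print(matches_substring(real), matches_word(real))    # True True
print(matches_substring(decoy), matches_word(decoy))  # True False
```

Which behaviour is wanted depends on the data. For headers like the ones in the question, both checks give the same answer.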
Result
[('>link is the first', 'OIGFHFHAGIUUIIUFG'),
('>ganon is the fourth', 'ADGGHHHHHH'),
('>volvagia is the last', 'AFGDAAFGDAADFGAFDADFDFFDDFGAHUUERR')]
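The question asked for two separate lists rather than a list of pairs. Assuming `result` has the shape shown above, one way to split it is sketched below. The names `name_list` and `seq_list` are taken from the question's example output.

```python
result = [('>link is the first', 'OIGFHFHAGIUUIIUFG'),
          ('>ganon is the fourth', 'ADGGHHHHHH'),
          ('>volvagia is the last', 'AFGDAAFGDAADFGAFDADFDFFDDFGAHUUERR')]

# Plain comprehensions also work when result is empty,
# where unpacking zip(*result) into two names would raise ValueError.
name_list = [name for name, _ in result]
seq_list = [seq for _, seq in result]

print(name_list)
print(seq_list)
```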
Answer 2:
Use a list comprehension (here lines and my_words stand for your data and query):
print([line for line in lines if line.startswith(">") and set(my_words).intersection(line[1:].split())])
This decomposes to a for loop as follows:
matched_lines = []
for line in lines:
    if line.startswith(">") and set(my_words).intersection(line[1:].split()):
        matched_lines.append(line)
Using a set intersection should be significantly faster than looping over each word in the list and seeing if it is in the string.
>>> print([line for line in data if line.startswith(">") and set(query).intersection(line[1:].split())])
['>link is the first', '>ganon is the fourth', '>volvagia is the last']
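The comprehension above rebuilds `set(query)` for every line. A possible refinement is to build the set once and test with `isdisjoint`, which avoids constructing the intersection. This is only a sketch: `wanted` is a name introduced here, and `data` is a shortened sample of the question's list.

```python
query = ['link', 'zelda', 'saria', 'ganon', 'volvagia']
# Shortened sample of the question's data, for illustration only
data = ['>link is the first', 'OIGFHFH', '>peach is the second', 'AGFDA',
        '>ganon is the fourth', 'ADGGHHHHHH']

wanted = set(query)  # built once, outside the comprehension

matched_lines = [line for line in data
                 if line.startswith(">")
                 and not wanted.isdisjoint(line[1:].split())]

print(matched_lines)
```

Like the original, this returns only the header lines. The sequences still have to be collected separately.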
Answer 3:
There are more elegant ways to do this, but I think this method might be the easiest for you to understand:
>>> found_lines = []
>>> sequences = []
>>> for line in data:
...     if line.startswith(">"):
...         for name in query:
...             if name in line:
...                 found_lines.append(line)
...     else:
...         sequences.append(line)
...
>>> print(found_lines)
['>link is the first', '>ganon is the fourth', '>volvagia is the last']
Always start simple, and think your way through the problem. What's the first thing you need to do? You want to loop over every line in data (for line in data).
For each of those lines, you want to check if it starts with > (if line.startswith(">")). If it doesn't start with that character, then we can assume it's a "sequence" and add it to the sequences list (sequences.append(line)).
If it does, then you want to check if any of the names in query appear in that line. What's the easiest way to do that? Loop over every one of the names (for name in query) and check each one by itself (if name in line).
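The loop above collects every sequence line, including those under headers that did not match. To get the two lists from the question's example output, one option is to remember whether the most recent header matched. Here is a sketch in the same simple style, where `keep` is a name introduced for illustration:

```python
query = ['link', 'zelda', 'saria', 'ganon', 'volvagia']
data = ['>link is the first', 'OIGFHFH', 'AGIUUIIUFG', '>peach is the second',
        'AGFDA', 'AFGDSGGGH', '>luigi is the third', 'SAGSGFFG', 'AFGDFGDFG',
        'DSGSFGAAA', '>ganon is the fourth', 'ADGGHHHHHH', '>volvagia is the last',
        'AFGDAAFGDA', 'ADFGAFD', 'ADFDFFDDFG', 'AHUUERR', '>ness is another',
        'ADFGGGGH', 'HHHDFDA']

name_list = []
seq_list = []
keep = False  # does the most recent header contain a queried name?

for line in data:
    if line.startswith(">"):
        keep = any(name in line for name in query)
        if keep:
            name_list.append(line)
            seq_list.append("")  # start an empty sequence for this header
    elif keep:
        seq_list[-1] += line     # extend the sequence of the current header

print(name_list)
print(seq_list)
```

Because a header and its sequence are appended together, the two lists stay aligned by index without needing a dictionary.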
Source: https://stackoverflow.com/questions/15349121/search-for-word-from-list-of-words-in-line-from-list-of-lines-and-append-val