Question
I'm trying to organize a file that contains multiple sequences. To do so, I want to add the names to one list and the sequences to a separate list that is parallel to the name list. I figured out how to add the names to a list, but I can't figure out how to add the sequences that follow each name as separate entries. I tried appending the lines of sequence to an empty string, but that put all the lines of all the sequences into a single string.
All the names start with a '>'.
def Name_Organizer(FASTA,output):
    import os
    import re
    in_file=open(FASTA,'r')
    dir,file=os.path.split(FASTA)
    temp = os.path.join(dir,output)
    out_file=open(temp,'w')
    data=''
    name_list=[]
    for line in in_file:
        line=line.strip()
        for i in line:
            if i=='>':
                name_list.append(line)
                break
            else:
                line=line.upper()
        if all([k==k.upper() for k in line]):
            data=data+line
    print(data)
How do I add the sequences to a list as a set of strings?
The input file looks like this
Answer 1:
You need to reset the string when you hit marker lines, like this:
def Name_Organizer(FASTA,output):
    import os
    import re
    in_file=open(FASTA,'r')
    dir,file=os.path.split(FASTA)
    temp = os.path.join(dir,output)
    out_file=open(temp,'w')
    data=''
    name_list=[]
    seq_list=[]
    for line in in_file:
        line=line.strip()
        for i in line:
            if i=='>':
                name_list.append(line)
                # a new name line: store the sequence collected so far, then reset
                if data:
                    seq_list.append(data)
                    data=''
                break
            else:
                line=line.upper()
        if all([k==k.upper() for k in line]):
            data=data+line
    # store the last sequence, which has no name line after it
    if data:
        seq_list.append(data)
    print(seq_list)
Of course, it might also be faster (depending on how large your files are) to use string joining rather than continually appending:
data = []
# ...
data.append(line) # repeatedly
# ...
seq_list.append(''.join(data)) # each time you get to a new marker line
data = []
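Putting the join-based idea together, a minimal Python 3 sketch might look like the following. The function name `parse_fasta` and the choice to accept any iterable of lines are assumptions made for illustration, not part of the code above. It also checks for name lines with `startswith('>')` rather than scanning character by character.

```python
def parse_fasta(lines):
    """Return (name_list, seq_list) as parallel lists.

    `parse_fasta` is a hypothetical helper; `lines` is any iterable of
    strings, such as an open file object.
    """
    name_list = []
    seq_list = []
    chunks = []  # pieces of the sequence currently being read

    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            # A new name line: store the previous sequence, if there was one
            if name_list:
                seq_list.append(''.join(chunks))
            chunks = []
            name_list.append(line)
        elif name_list:
            # Sequence data only counts once a name line has been seen
            chunks.append(line.upper())

    # Flush the final sequence, which has no name line after it
    if name_list:
        seq_list.append(''.join(chunks))

    return name_list, seq_list


names, seqs = parse_fasta([">seq1", "acgt", "ttga", ">seq2", "GGCC"])
print(names)
print(seqs)
```

To read from a file instead of a list, the open file object can be passed directly, for example inside a `with open(path) as handle:` block. A name line with no sequence after it yields an empty string, so the two lists stay the same length.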
Answer 2:
If you're working with Python and FASTA files, you might want to look into installing Biopython. It already contains this parsing functionality, and a whole lot more.
Parsing a fasta file would be as simple as this:
from Bio import SeqIO
for record in SeqIO.parse('filename.fasta', 'fasta'):
    print(record.id, record.seq)
Answer 3:
I organized it in a dictionary first:
import sys

# remove white spaces from the lines
lines = [x.strip() for x in open(sys.argv[1]).readlines()]
fasta = {}
for line in lines:
    if not line:
        continue
    # create the sequence name in the dict and a variable
    if line.startswith('>'):
        sname = line
        if line not in fasta:
            fasta[line] = ''
        continue
    # add the sequence to the last sequence name variable
    fasta[sname] += line
# just to facilitate the input for my function
lst = list(fasta.values())
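If parallel lists are still wanted, the dictionary can supply both, because dictionaries keep insertion order in Python 3.7 and later. Here is a small sketch of that idea; the function name `fasta_dict` and the sample lines are made up for illustration.

```python
def fasta_dict(lines):
    """Map each name line to its concatenated sequence (hypothetical helper)."""
    fasta = {}
    sname = None
    for line in (raw.strip() for raw in lines):
        if not line:
            continue
        if line.startswith('>'):
            sname = line
            fasta.setdefault(sname, '')
        elif sname is not None:
            fasta[sname] += line
    return fasta


records = fasta_dict([">seq1", "ACGT", "TTGA", "", ">seq2", "GGCC"])
name_list = list(records.keys())    # names, in file order
seq_list = list(records.values())   # sequences, same order as name_list
print(name_list)
print(seq_list)
```

One consequence of using a dictionary is that two records with an identical name line are merged into a single entry, so this approach suits files where every name is unique.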
Source: https://stackoverflow.com/questions/9557713/add-multiple-sequences-from-a-fasta-file-to-a-list-in-python