Parse fasta sequence to the dictionary

问题

I need most trivial solution to convert fasta.txt containing multiple nucleotide sequences like

>seq1
TAGATTCTGAGTTATCTCTTGCATTAGCAGGTCATCCTGGTCAAACCGCTACTGTTCCGG
CTTTCTGATAATTGATAGCATACGCTGCGAACCCACGGAAGGGGGTCGAGGACAGTGGTG
>seq2
TCCCTCTAGAGGCTCTTTACCGTGATGCTACATCTTACAGGTATTTCTGAGGCTCTTTCA
AACAGGTGCGCGTGAACAACAACCCACGGCAAACGAGTACAGTGTGTACGCCTGAGAGTA
>seq3
GGTTCCGCTCTAAGCCTCTAACTCCCGCACAGGGAAGAGATGTCGATTAACTTGCGCCCA
TAGAGCTCTGCGCGTGCGTCGAAGGCTCTTTTCGCGATATCTGTGTGGTCTCACTTTGGT

to the dictionary(name,value) object where name will be the >header and value will be assigned to corresponded sequence.

Below you can find my failed attempt do it via 2 lists (does not work for long sequence containing >1 line )

f = open('input2.txt', 'r')
list={}
names=[]
seq=[]
for line in f:
 if line.startswith('>'):
  names.append(line[1:-1])
 elif line.startswith('A') or line.startswith('C') or line.startswith('G') or line.startswith('T'):
  seq.append(line)

list = dict(zip(names, seq))

I'll be thankful if you provide me with the solution of how fix it and example how to do it via separate function.

Thanks for help,

Gleb

回答1:

It is better to use biopython library

from Bio import SeqIO
input_file = open("input.fasta")
my_dict = SeqIO.to_dict(SeqIO.parse(input_file, "fasta"))

回答2:

a simple correction to your code:

from collections import defaultdict #this will make your life simpler
f = open('input2.txt','r')
list=defaultdict(str)
name = ''
for line in f:
    #if your line starts with a > then it is the name of the following sequence
    if line.startswith('>'):
        name = line[1:-1]
        continue #this means skips to the next line
    #This code is only executed if it is a sequence of bases and not a name.
    list[name]+=line.strip()

UPDATE:

Since I've got a notification that this old answer was upvoted, I've decided to present what I now think is the proper solution using Python 3.7. Translation to Python 2.7 only requires removing the typing import line and the function annotations:

from collections import OrderedDict
from typing import Dict

NAME_SYMBOL = '>'


def parse_sequences(filename: str,
                    ordered: bool=False) -> Dict[str, str]:
    """
    Parses a text file of genome sequences into a dictionary.
    Arguments:
      filename: str - The name of the file containing the genome info.
      ordered: bool - Set this to True if you want the result to be ordered.
    """
    result = OrderedDict() if ordered else {}

    last_name = None
    with open(filename) as sequences:
        for line in sequences:
            if line.startswith(NAME_SYMBOL):
                last_name = line[1:-1]
                result[last_name] = []
            else:
                result[last_name].append(line[:-1])

    for name in result:
        result[name] = ''.join(result[name])

    return result

Now, I realize that the OP asked for the "most trivial solution", however since they are working with genome data, it seems fair to assume that each sequence could potentially be very large. In that case it makes sense to optimize a little bit by collecting the sequence lines into a list, and then to use the str.join method on those lists at the end to produce the final result.

来源：https://stackoverflow.com/questions/22698807/parse-fasta-sequence-to-the-dictionary

标签

python

dictionary

fasta