How to translate a FASTA sequence from dict/ how to make function output a string?

Question

First, I can't use BioPython :( I need to read a bunch of DNA sequences from a FASTA file and translate them to protein sequences. The FASTA file looks like this:

>some info
ACCGGGCTAAA

>other info
ACCGCCAATTT

So I can create a function that outputs only the DNA sequence, but when I try to translate it I get the following error: "TypeError: object of type '_io.TextIOWrapper' has no len()". I have no idea how to resolve this. Any help is immensely appreciated! Also, I am taking my first Python course, so please explain any answers as if to a moron :)

#Open the file for reading
fasta=open('mRNA_database.fasta', 'r')


def readSeq(fasta):
    for line in fasta:
        if line.startswith('>'):
            continue
        line = line.strip()
        #print(line)
readSeq(fasta)

g_code=dict()
g_code = {'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
    'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
    'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
    'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
    'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
    'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
    'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
    'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
    'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
    'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
    'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
    'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
    'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
    'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
    'TAC':'Y', 'TAT':'Y', 'TAA':'stop', 'TAG':'stop',
    'TGC':'C', 'TGT':'C', 'TGA':'stop', 'TGG':'W'}

def aa_to_prt(fasta, g_code):
    prt = ''
    for i in range(0, len(fasta), 3):
        codon = fasta[i:i+3]
        prt+= g_code[codon]
    print(prt)

aa_to_prt(fasta, g_code)

Answer 1:

What is your desired output?

For input like:

some info ACCGGGCTAAA
other info ACCGCCAATTT

and this code:

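def readSeq():
    # Assumes each record sits on one line, e.g. "some info ACCGGGCTAAA",
    # so the sequence is the third space-separated field.
    for line in open('mRNA_database.fasta', 'r'):
        if line.startswith('>'):
            continue
        line = line.strip()
        yield line.split(' ')[2]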

g_code = {'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
    'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
    'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
    'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
    'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
    'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
    'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
    'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
    'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
    'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
    'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
    'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
    'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
    'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
    'TAC':'Y', 'TAT':'Y', 'TAA':'stop', 'TAG':'stop',
    'TGC':'C', 'TGT':'C', 'TGA':'stop', 'TGG':'W'}

def aa_to_prt(g_code):
    prt = ''
    for name in readSeq():
        # Only the first codon of each sequence is translated here.
        codon = name[:3]
        prt += g_code[codon]
    print(prt)

aa_to_prt(g_code)

I got this output:

TT

Is that what you want?

Answer 2:

You seem to be trying to use len(filehandle) to figure out how far you can read into the file. But the handle doesn't have a length -- the file might, but that's not what you are looking at. And anyway, the API is a more general one, where in many cases the stream you are opening doesn't (yet) have a length -- there is no way for the system to know when opening a handle how many bytes the user will type, or how many packets are going to arrive over the network.

Instead, the convention is to simply iterate over the handle until it no longer produces a value. (Behind the scenes, modern Python uses an iterator which raises a StopIteration exception when there is nothing left to read.)
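
If it helps to see this in isolation, here is a minimal sketch. It uses io.StringIO as a stand-in for the opened file (my assumption, just so it runs without mRNA_database.fasta on disk), filled with the two sample records from the question. It shows that len() fails on a handle, that a for loop reads it to the end, and that asking for another line afterwards raises StopIteration.

```python
import io

# Stand-in for open('mRNA_database.fasta', 'r'); the contents are the
# sample records from the question.
handle = io.StringIO(">some info\nACCGGGCTAAA\n\n>other info\nACCGCCAATTT\n")

# A handle has no length, so len() fails just like in the question.
try:
    len(handle)
    len_failed = False
except TypeError:
    len_failed = True

# Iterating is the supported way to read: the for loop stops by itself
# once the handle has no more lines to give.
lines = [line.strip() for line in handle]

# The handle is now exhausted; asking for one more line raises StopIteration.
try:
    next(handle)
    exhausted = False
except StopIteration:
    exhausted = True

print(len_failed, len(lines), exhausted)
```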

Your readSeq function does this correctly, but you are not returning any values from it, so it simply consumes the file, and leaves you with the file handle open at the end of the file, with nothing left to read.
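
One way to make the function hand its results back, sketched under the assumption that each record's sequence sits on a single line as in your sample. read_sequences is a hypothetical name, and io.StringIO stands in for the real file so the example runs on its own. The second call shows what an already consumed handle gives you.

```python
import io

def read_sequences(handle):
    """Return the sequence lines of a FASTA handle as a list of strings."""
    sequences = []
    for line in handle:
        line = line.strip()
        # Skip header lines and blank separator lines.
        if not line or line.startswith('>'):
            continue
        sequences.append(line)
    return sequences

# Stand-in for the real file (assumption for the demo).
handle = io.StringIO(">some info\nACCGGGCTAAA\n\n>other info\nACCGCCAATTT\n")

first_pass = read_sequences(handle)
second_pass = read_sequences(handle)  # same handle, already at end of file

print(first_pass)   # ['ACCGGGCTAAA', 'ACCGCCAATTT']
print(second_pass)  # []
```

Because the function returns a list, the caller can keep the sequences around and loop over them as often as needed, instead of depending on the position of the file handle.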

Maybe try something like this instead.

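def prtSeq(fastahandle):
    # g_code is the module-level codon dictionary defined in your code.
    for line in fastahandle:
        line = line.strip()
        # Skip header lines and blank lines between records.
        if not line or line.startswith('>'):
            continue
        proteins = []
        # Step through the line three bases at a time; stopping at
        # len(line) - 2 means a trailing partial codon such as 'AA'
        # is never looked up.
        for seq in range(0, len(line) - 2, 3):
            proteins.append(g_code[line[seq:seq+3]])
        print(''.join(proteins))

with open('mRNA_database.fasta', 'r') as fastahandle:
    prtSeq(fastahandle)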

As an aside, the g_code=dict() line in your code is useless: the very next statement overwrites that empty dictionary with the real one.

A better design would only return values to the caller for printing, but I take it you are primarily interested in getting the job done with the simplest possible code.
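
A rough sketch of that design, not a definitive implementation: translate is a hypothetical helper that returns the protein as a string and leaves the printing to the caller. The codon table here is cut down to the handful of codons in the sample data; in practice you would pass the full g_code dictionary from the question.

```python
# Cut-down codon table for the demo (assumption); use the full g_code
# dictionary from the question for real data.
g_code = {'ACC': 'T', 'GGG': 'G', 'CTA': 'L', 'GCC': 'A', 'AAT': 'N'}

def translate(dna, table):
    """Return the protein string for one DNA string.

    A trailing partial codon (one or two leftover bases) is ignored.
    """
    protein = []
    for i in range(0, len(dna) - 2, 3):
        protein.append(table[dna[i:i + 3]])
    return ''.join(protein)

# The caller decides what to do with the returned string.
result = translate('ACCGGGCTAAA', g_code)
print(result)                               # TGL
print(translate('ACCGCCAATTT', g_code))     # TAN
```

Since the function returns a value instead of printing it, the same result can be written to a file, compared in a test, or collected in a list.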

Source: https://stackoverflow.com/questions/36305314/how-to-translate-a-fasta-sequence-from-dict-how-to-make-function-output-a-strin

Tags: string, dictionary, typeerror, bioinformatics, fasta