parsing a fasta file using a generator ( python )

前端 未结 4 800
臣服心动
臣服心动 2020-11-28 14:14

I am trying to parse a large fasta file and I am encountering out of memory errors. Some suggestions to improve the data handling would be appreciated. Currently the program

相关标签:
4条回答
  • 2020-11-28 14:20

    A pyparsing parser for this format is only a few lines long. See the annotations in the following code:

    data = """>1 (PB2) 
    AATATATTCAATATGGAGAGAATAAAAGAACTAAGAGATCTAATGTCACAGTCTCGCACTCGCGAGATAC 
    TCACCAAAACCACTGTGGACCACATGGCCATAATCAAAAAGTACACATCAGGAAGGCAAGAGAAGAACCC 
    TGCACTCAGGATGAAGTGGATGATG 
    >2 (PB1) 
    AACCATTTGAATGGATGTCAATCCGACTTTACTTTTCTTGAAAGTTCCAGCGCAAAATGCCATAAGCACC 
    ACATTTCCCTATACTGGAGACCCTCC"""
    
    from pyparsing import Word, nums, QuotedString, Combine, OneOrMore
    
    # define some basic forms
    integer = Word(nums)
    key = QuotedString("(", endQuoteChar=")")
    
    # sequences are "words" made up of the characters A, G, C, and T
    # we want to match one or more of them, and have the parser combine
    # them into a single string (Combine by default requires all of its
    # elements to be adjacent within the input string, but we want to allow
    # for the intervening end of lines, so we add adjacent=False)
    sequence = Combine(OneOrMore(Word("AGCT")), adjacent=False)
    
    # define the overall pattern to scan for - attach results names
    # to each matched element
    seqEntry = ">" + integer("index") + key("key") + sequence("sequence")
    
    for seq,s,e in seqEntry.scanString(data):
        # just dump out the matched data
        print seq.dump()
        # could also access fields as seq.index, seq.key and seq.sequence
    

    Prints:

    ['>', '1', 'PB2', 'AATATATTCAATATGGAGAGAATAAAAGAACTAAGAGATCTAATGTCACAGTCTCGCACTCGCGAGATACTCACCAAAACCACTGTGGACCACATGGCCATAATCAAAAAGTACACATCAGGAAGGCAAGAGAAGAACCCTGCACTCAGGATGAAGTGGATGATG']
    - index: 1
    - key: PB2
    - sequence: AATATATTCAATATGGAGAGAATAAAAGAACTAAGAGATCTAATGTCACAGTCTCGCACTCGCGAGATACTCACCAAAACCACTGTGGACCACATGGCCATAATCAAAAAGTACACATCAGGAAGGCAAGAGAAGAACCCTGCACTCAGGATGAAGTGGATGATG
    ['>', '2', 'PB1', 'AACCATTTGAATGGATGTCAATCCGACTTTACTTTTCTTGAAAGTTCCAGCGCAAAATGCCATAAGCACCACATTTCCCTATACTGGAGACCCTCC']
    - index: 2
    - key: PB1
    - sequence: AACCATTTGAATGGATGTCAATCCGACTTTACTTTTCTTGAAAGTTCCAGCGCAAAATGCCATAAGCACCACATTTCCCTATACTGGAGACCCTCC
    
    0 讨论(0)
  • 2020-11-28 14:22

    Without having a great understanding of what you are doing, I would have written the code like this:

    def readFastaEntry( fp ):
        name = ""
        while True:
            line = name or f.readline()
            if not line:
                break
            seq = []
            while True:
                name = f.readline()
                if not name or name.startswith(">"):
                    break
                else:
                    seq.append(name)
            yield (line, "".join(seq))
    

    This gathers up the data after a starting line up to the next starting line. Making seq an array means that you minimize the string joining until the last possible moment. Yielding a tuple makes more sense than a list.

    0 讨论(0)
  • 2020-11-28 14:26

    Have you considered using BioPython. They have a sequence reader that can read fasta files. And if you are interested in coding one yourself, you can take a look at BioPython's code.

    Edit: Code added

    def read_fasta(fp):
        name, seq = None, []
        for line in fp:
            line = line.rstrip()
            if line.startswith(">"):
                if name: yield (name, ''.join(seq))
                name, seq = line, []
            else:
                seq.append(line)
        if name: yield (name, ''.join(seq))
    
    with open('f.fasta') as fp:
        for name, seq in read_fasta(fp):
            print(name, seq)
    
    0 讨论(0)
  • 2020-11-28 14:34
    def read_fasta(filename):
        name = None
        with open(filename) as file:
            for line in file:
                if line[0] == ">":
                    if name:
                        yield (name, seq)
                    name = line[1:-1].split("|")[0]
                    seq = ""
                else:
                    seq += line[:-1]
            yield (name, seq)
    
    0 讨论(0)
提交回复
热议问题