Transform DNA alignment into NumPy array using Biopython

Question

I have several DNA sequences that have been aligned and I would like to keep only the bases that are variable at a specific position.

This could perhaps be done by first transforming the alignment into an array. I tried the code from the Biopython tutorial, but it gives an error.

import numpy as np
from Bio import AlignIO
alignment = AlignIO.parse("ma-all-mito.fa", "fasta")
align_array = np.array([list(rec) for rec in alignment], np.character)
print("Array shape %i by %i" % align_array.shape)

The error I get:

Traceback (most recent call last):

File "C:/select-snps.py", line 8, in <module>
    print("Array shape %i by %i" % align_array.shape)
TypeError: not all arguments converted during string formatting

Answer 1:

AlignIO doesn't seem to be the right tool for this job. Your file presumably holds many sequences rather than many multiple sequence alignments, so you probably want SeqIO, not AlignIO (source). With AlignIO.parse, each item you iterate over is a whole alignment, which is why your array has shape (1, 99, 16926): 1 alignment of 99 sequences of length 16926. The format string has only two %i placeholders for those three values, hence the TypeError.

If you just want an array of the sequences (which it appears you do from the np.character dtype supplied to np.array), then do the following:

import numpy as np
from Bio import SeqIO
records = SeqIO.parse("ma-all-mito.fa", "fasta")
align_array = np.array([record.seq for record in records], np.character)
print("Array shape %i by %i" % align_array.shape)
# expected shape: (99, 16926)

Note that each element of records is technically a Biopython SeqRecord, which holds the sequence along with its metadata. list(record) is a shortcut for getting the sequence's letters; the other way is record.seq. Either should work, but I used the attribute because it is more explicit.
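To see why a three-value shape trips the two-placeholder format string, here is a minimal sketch in plain Python. It uses literal tuples in place of align_array.shape, and the helper name format_shape is made up for this illustration.

```python
def format_shape(shape):
    """Format a shape tuple with exactly two %i placeholders."""
    return "Array shape %i by %i" % shape


# Two values for two placeholders: formatting succeeds.
print(format_shape((99, 16926)))

# Three values for two placeholders: Python raises the TypeError
# shown in the question's traceback.
try:
    format_shape((1, 99, 16926))
except TypeError as err:
    print("TypeError:", err)
```

Printing align_array.shape on its own, without a format string, is a quick way to check how many dimensions the array really has.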

Answer 2:

I'm answering your underlying problem rather than fixing your code. If you want to keep only certain positions, you want to use AlignIO:

FASTA sample al.fas:

>seq1
CATCGATCAGCATCGACATGCGGCA-ACG
>seq2
CATCGATCAG---CGACATGCGGCATACG
>seq3
CATC-ATCAGCATCGACATGCGGCATACG
>seq4
CATCGATCAGCATCGACAAACGGCATACG

Now suppose you want to keep only certain positions. MultipleSeqAlignment allows you to query the alignment like a numpy array:

from Bio import AlignIO


al = AlignIO.read("al.fas", "fasta")

# Print the 11th column
print(al[:, 10])

# Print columns 12 to 14 (the slice end index is exclusive)
print(al[:, 11:14])

If you want to know the shape of the alignment, use len and get_alignment_length:

>>> print(len(al), al.get_alignment_length())
4 29

When you use AlignIO.parse() to load an alignment, it assumes the file could contain more than one alignment (PHYLIP files can). The parser therefore returns an iterator over alignments, not over records as your code implies. Your FASTA file contains only one alignment, so parse() yields a single MultipleSeqAlignment. The fix to your code is to read that one alignment with AlignIO.read():

import numpy as np
from Bio import AlignIO
alignment = AlignIO.read("ma-all-mito.fa", "fasta")
align_array = np.array(alignment, np.character)
print("Array shape %i by %i" % align_array.shape)
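Once the alignment is a 2-D array, keeping only the variable positions can be done with a boolean column mask. The following is a rough sketch, not a definitive recipe: it types the four al.fas sample rows in directly so it runs without Biopython or a file on disk, and it assumes that a gap ("-") counts as a difference like any other character. The variable names are illustrative.

```python
import numpy as np

# Rows copied from the al.fas sample above (no Biopython needed here).
sequences = [
    "CATCGATCAGCATCGACATGCGGCA-ACG",
    "CATCGATCAG---CGACATGCGGCATACG",
    "CATC-ATCAGCATCGACATGCGGCATACG",
    "CATCGATCAGCATCGACAAACGGCATACG",
]

# One row per sequence, one column per alignment position.
align_array = np.array([list(seq) for seq in sequences])

# A column is variable when at least one row differs from the first row.
# Assumption: gaps are treated as ordinary characters.
is_variable = (align_array != align_array[0]).any(axis=0)

variable_positions = np.flatnonzero(is_variable)  # 0-based column indices
variable_columns = align_array[:, is_variable]    # same rows, fewer columns

print(variable_positions.tolist())
print(["".join(row) for row in variable_columns])
```

With a real file, align_array would come from the AlignIO.read() call above instead of the hard-coded list. If gap-only differences should not count as variation, the mask would need an extra condition.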

Source: https://stackoverflow.com/questions/39779488/transform-dna-alignment-into-numpy-array-using-biopython

Tags

python

numpy

biopython