Question
Input fasta-format text file:
http://www.jcvi.org/cgi-bin/tigrfams/DownloadFile.cgi?file=/opt/www/www_tmp/tigrfams/fa_alignment_PF00205.txt
#!/usr/bin/python
from Bio import AlignIO
seq_file = open('/path/to/fa_alignment_PF00205.txt')
alignment = AlignIO.read(seq_file, "fasta")
Error:
ValueError: Sequences must all be the same length
The input sequences shouldn't have to be the same length since on ClustalOmega you can align sequences of differing lengths.
This also doesn't work; it raises the same error:
alignment = AlignIO.parse(seq_file, "fasta")
for record in alignment:
    print(record.id)
Does anybody who is familiar with BioPython know how to get around this to align sequences from fasta files?
Answer 1:
Pad any sequence that is too short and write the records to a temporary FASTA file. Then AlignIO reads the alignment as expected:
from Bio import AlignIO
from Bio import SeqIO
from Bio import Seq
import os
input_file = '/path/to/fa_alignment_PF00205.txt'
records = SeqIO.parse(input_file, 'fasta')
records = list(records)  # materialize the generator, otherwise it
                         # is exhausted after calculating maxlen
maxlen = max(len(record.seq) for record in records)
# pad sequences so that they all have the same length
for record in records:
    if len(record.seq) != maxlen:
        sequence = str(record.seq).ljust(maxlen, '.')
        record.seq = Seq.Seq(sequence)
assert all(len(record.seq) == maxlen for record in records)
# write to temporary file and do alignment
output_file = '{}_padded.fasta'.format(os.path.splitext(input_file)[0])
with open(output_file, 'w') as f:
    SeqIO.write(records, f, 'fasta')
alignment = AlignIO.read(output_file, "fasta")
print(alignment)
This outputs:
SingleLetterAlphabet() alignment with 104 rows and 275 columns
TKAAIELIADHQ.......LTVLADLLVHRLQ..AVKELEALLA...QAL SP|A2VGF0.1/208-339
LQELASVINQHE...KV..MLFCGHGCR...Y..AVEEVMALAK...EDL SP|A3D4X6.1/190-319
IKKIAQAIEKAK...KP..VICAGGGVINS.N..ASEELLTLSR...KEL SP|A3DID9.1/192-327
IDEAAEAINKAE...RP..VILAGGGVSIA.G..ANKELFEFAT...QLL SP|A3DIY4.1/192-327
IEKAIELINSSQ...RP..FICSGGGVISS.E..ASEELIQFAE...KIL SP|A4XHS0.1/191-326
IKRAVEAIENSQ...RP..VICSGGGVIAS.R..ASDELKILVE...SEI SP|A4XIL5.1/194-328
VRQAARIIMESE...RP..VIYAGGGVRIS.G..AAPELLELSE...RAL SP|A5D4V9.1/192-327
LQALAQRILRAQ...RP..VIITGDEIVKS.D..ALQAAADFAS...LQL SP|A5ECG1.1/192-328
VEKAVELLWSAR...RV..LVISGRGAR...G..AGPELIGLLD...RAM SP|A5EDH4.1/198-324
IQKAARLIETAE...KP..VIIAGHGVNIS.G..ANEELKTLAE...KSL SP|A5FR34.1/193-328
LDALARDLDSAA...RV..TIYAGIGAR...G..AAARVVQLAG...EAL SP|A5FTR0.1/189-317
VADVAALLRAAR...RP..VIVAGGGVIHSG...AEERLATFAA...DAL SP|A5G0X6.1/217-351
IAEAVSALKGAK...RP..IIYTGGGLINS.GPESAELIVQLAK...RAL SP|A5G2E1.1/199-336
LKKAAEIINRAK...RP..LIYAGGGITLA.G..ASAELRALAA...ALL SP|A5GC69.1/192-327
CRDIVGKLLQSH...RP..VVLGGTGVRLS.R..TEQRLLALVE...DVF SP|A5W0I1.1/200-336
LDQAALKLAAAE...RP..MIIAGGGA..L.H..AAEQLAQLSA...AGL SP|A5W220.1/196-326
LQRAADILNTGH...KV..AILVGAGAL...Q..ATEQVIAIAE...RAL SP|A5W364.1/198-328
IRKAAEMLLAAK...RP..VVYSGGGVILG.G..GSEALTEIAK...SEM SP|A5W954.1/196-331
...
LTELQERLANAQ...RP..VVILGGSRWSD.A..AVQQFTRFAE...... SP|Q220C3.1/190-328
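To see the padding step in isolation, here is a minimal sketch that uses only the standard library, so it runs without Biopython. The function name `pad_to_max_length` and the sample rows are made up for illustration and are not taken from the PF00205 file. Keep in mind that right-padding only makes the rows equal in length so that the file parses; it does not realign anything.

```python
def pad_to_max_length(records, pad_char="."):
    """Right-pad every sequence to the length of the longest one.

    `records` is a list of (identifier, sequence) tuples. Returns the padded
    list together with the identifiers that had to be padded.
    """
    maxlen = max(len(seq) for _, seq in records)
    padded, changed = [], []
    for identifier, seq in records:
        if len(seq) < maxlen:
            changed.append(identifier)
        # str.ljust leaves strings that are already long enough untouched
        padded.append((identifier, seq.ljust(maxlen, pad_char)))
    return padded, changed


# Hypothetical rows, only to demonstrate the behaviour.
rows = [("seq1", "LQAAL..RP"), ("seq2", "LTELQ..RP"), ("seq3", "IKKIA")]
padded, changed = pad_to_max_length(rows)
print(changed)       # identifiers that were shorter than the longest row
print(padded[2][1])  # the padded version of the short row
```

Returning the list of padded identifiers makes it easy to check afterwards whether the padded rows really are truncated records or whether the input file needs to be downloaded again.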
Answer 2:
Your problem is the last record of the FASTA file:
tail -9 fa_alignment_PF00205.txt
>SP|Q21VK8.1/229-357
LQAALAALAKAE...RP..LLVIGSQALVLSK..QAEHLAEAVARL.GIPV.YLSGMA..RGLLG.R..........DH.
...............PLQ..................MRHQRRQALRE..ADCVLLAG.VP...CDFRLD......YGKHV
RR..............S.AT.........L..IAA.N......................RSA.........KDARLNR..
.......K...PD.IAAIGDAG.......LFLQAL
>SP|Q220C3.1/190-328
LTELQERLANAQ...RP..VVILGGSRWSD.A..AVQQFTRFAEAF.SLPV.FCSFRR..QMLFS.A..........NH.
...............ACY...AG.DLGLG.A.....NQRLLARI.RQ..SDLILLLG.GR...MSEVPS......QGYEL
LGIPAPQQ...........D
The sequence with id SP|Q220C3.1/190-328 has a different length than the other sequences.
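Instead of inspecting the file by eye, the odd record can also be found programmatically. The following is a rough sketch that uses only the standard library; `parse_fasta` and `length_outliers` are hypothetical helper names, the FASTA reader is deliberately minimal, and the example text is invented. It assumes that the most common sequence length is the intended alignment width.

```python
from collections import Counter


def parse_fasta(text):
    """Minimal FASTA reader: returns a list of (identifier, sequence) tuples."""
    records = []
    identifier, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # a new header closes the previous record, if there was one
            if identifier is not None:
                records.append((identifier, "".join(chunks)))
            identifier = line[1:].split()[0] if len(line) > 1 else ""
            chunks = []
        elif identifier is not None:
            chunks.append(line)
    if identifier is not None:
        records.append((identifier, "".join(chunks)))
    return records


def length_outliers(records):
    """Return (expected_length, [(identifier, length), ...]) for odd rows."""
    counts = Counter(len(seq) for _, seq in records)
    expected = counts.most_common(1)[0][0]  # assumption: majority length wins
    odd = [(name, len(seq)) for name, seq in records if len(seq) != expected]
    return expected, odd


# Invented example; record "b" is wrapped over two lines, record "c" is short.
example = """>a
LQAAL..RP
>b
LTELQ
..RP
>c
IKKIA
"""
expected, odd = length_outliers(parse_fasta(example))
print(expected, odd)
```

To apply this to the real input, the file contents could be read with `open(path).read()` and passed to `parse_fasta`; every identifier reported in `odd` is a candidate for a truncated or damaged record.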
Source: https://stackoverflow.com/questions/32833230/biopython-alignio-valueerror-says-strings-must-be-same-length