Is there a function that can calculate a score for aligned sequences given the alignment parameters?

Submitted by 我与影子孤独终老i on 2019-12-09 06:20:40

Question


I am trying to score sequences that are already aligned. Let's say

seq1 = 'PAVKDLGAEG-ASDKGT--SHVVY----------TI-QLASTFE'
seq2 = 'PAVEDLGATG-ANDKGT--LYNIYARNTEGHPRSTV-QLGSTFE'

with given parameters

substitution matrix : blosum62
gap open penalty : -5
gap extension penalty : -1

I looked through the Biopython cookbook, but all I could find was the substitution matrix blogsum62. I feel that someone must have already implemented this kind of library.

Can anyone suggest a library, or the shortest code, that solves my problem?

Thx in advance


Answer 1:


Jessada,

The Blosum62 matrix (note the spelling ;) is in Bio.SubsMat.MatrixInfo and is a dictionary mapping tuples to scores (so ('A', 'A') is worth 4 points). It doesn't include gaps, and it holds only one triangle of the matrix (so it might have ('T', 'A') but not ('A', 'T')). There are some helper functions in Biopython, including some in Bio.pairwise2, but this is what I came up with as an answer:

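# Bio.SubsMat is deprecated in newer Biopython releases,
# which provide Bio.Align.substitution_matrices instead.
from Bio.SubsMat import MatrixInfo

def score_match(pair, matrix):
    # Only one triangle is stored, so fall back to the reversed pair.
    if pair not in matrix:
        return matrix[(tuple(reversed(pair)))]
    else:
        return matrix[pair]

def score_pairwise(seq1, seq2, matrix, gap_s, gap_e):
    # gap_s: penalty for the first column of a gap run
    # gap_e: penalty for each further column of the same run
    score = 0
    gap = False  # True while the previous column contained a gap
    for i in range(len(seq1)):
        pair = (seq1[i], seq2[i])
        if not gap:
            if '-' in pair:
                # A new gap run starts here.
                gap = True
                score += gap_s
            else:
                score += score_match(pair, matrix)
        else:
            if '-' not in pair:
                # The gap run ended; score this column as a substitution.
                gap = False
                score += score_match(pair, matrix)
            else:
                score += gap_e
    return score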

seq1 = 'PAVKDLGAEG-ASDKGT--SHVVY----------TI-QLASTFE'
seq2 = 'PAVEDLGATG-ANDKGT--LYNIYARNTEGHPRSTV-QLGSTFE'

blosum = MatrixInfo.blosum62

score_pairwise(seq1, seq2, blosum, -5, -1)

This returns 82 for your alignment. There are almost certainly prettier ways to do all of this, but it should be a good start.
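As a sanity check on the gap arithmetic, here is a minimal sketch that computes only the gap part of the score, using the same convention as the function above (a column counts as a gap column if either sequence has '-'; the first column of a run costs the open penalty, each further column the extension penalty). The helper name gap_penalty is made up for this illustration and is not part of Biopython.

```python
from itertools import groupby

def gap_penalty(seq1, seq2, gap_open, gap_extend):
    """Sum the gap costs over runs of consecutive gap columns."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    # True for every column in which at least one sequence has a gap.
    is_gap = [a == '-' or b == '-' for a, b in zip(seq1, seq2)]
    total = 0
    for gap_column, run in groupby(is_gap):
        if gap_column:
            run_length = len(list(run))
            # One opening penalty, then one extension per additional column.
            total += gap_open + (run_length - 1) * gap_extend
    return total

seq1 = 'PAVKDLGAEG-ASDKGT--SHVVY----------TI-QLASTFE'
seq2 = 'PAVEDLGATG-ANDKGT--LYNIYARNTEGHPRSTV-QLGSTFE'

print(gap_penalty(seq1, seq2, -5, -1))  # -30
```

The alignment has four gap runs of lengths 1, 2, 10 and 1, so the gap part is 4 * (-5) + 10 * (-1) = -30. With a total of 82, that leaves 112 for the substitution terms.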




Answer 2:


blosum62 is a dictionary of 276 items.

I preferred to fill in the missing mirrored items up front, because that takes a single pass over only 276 entries, while the sequences to be analysed are likely to have more than 276 elements. If you instead look up each pair with the function score_match(), that function has to perform the test if pair not in matrix for every column of the alignment, which will usually be far more than 276 times.

Another thing that takes time: each score += something creates a new integer and rebinds the name score to this new object. That rebinding cost disappears when a generator produces a stream of integers that are immediately added to the running total.

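from Bio.SubsMat.MatrixInfo import blosum62 as blosum

# Add the mirrored pairs so that every (a, b) lookup succeeds directly.
# The dict comprehension is fully built before update() runs, so the
# dictionary is not modified while it is being iterated (Python 3).
blosum.update({(b, a): val for (a, b), val in blosum.items()})

def score_pairwise(seq1, seq2, matrix, gap_s, gap_e, gap=False):
    # gap starts as False so that a leading gap is charged gap_s,
    # as in the first answer.
    for A, B in zip(seq1, seq2):
        diag = ('-' == A) or ('-' == B)
        yield (gap_e if gap else gap_s) if diag else matrix[(A, B)]
        gap = diag

seq1 = 'PAVKDLGAEG-ASDKGT--SHVVY----------TI-QLASTFE'
seq2 = 'PAVEDLGATG-ANDKGT--LYNIYARNTEGHPRSTV-QLGSTFE'

print(sum(score_pairwise(seq1, seq2, blosum, -5, -1)))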

This score_pairwise() is a generator function because it uses yield instead of return; sum() consumes the scores it produces one by one.
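To see what such a generator yields column by column, here is a small self-contained sketch in Python 3. It uses a three-entry toy matrix with illustrative values instead of the real blosum62 dictionary, so it runs without Biopython; the names toy, full and column_scores are invented for this example.

```python
# One triangle of a toy substitution matrix (illustrative values only).
toy = {('A', 'A'): 4, ('T', 'A'): 0, ('T', 'T'): 5}

# Mirror the off-diagonal entries so both orders of a pair can be looked up.
full = dict(toy)
full.update({(b, a): val for (a, b), val in toy.items()})

def column_scores(seq1, seq2, matrix, gap_open, gap_extend):
    """Yield one score per alignment column."""
    in_gap = False
    for a, b in zip(seq1, seq2):
        is_gap = a == '-' or b == '-'
        if is_gap:
            # First column of a run pays gap_open, later ones gap_extend.
            yield gap_extend if in_gap else gap_open
        else:
            yield matrix[(a, b)]
        in_gap = is_gap

scores = list(column_scores('AT--A', 'TTAAA', full, -5, -1))
print(len(full))    # 4: the three original entries plus ('A', 'T')
print(scores)       # [0, 5, -5, -1, 4]
print(sum(scores))  # 3
```

Materialising the generator with list() is only for inspection; passing it straight to sum() gives the total without building the intermediate list.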



Source: https://stackoverflow.com/questions/5686211/is-there-a-function-that-can-calculate-a-score-for-aligned-sequences-given-the-a
