Finding Minimum hamming distance of a set of strings in python

丶灬走出姿态 提交于 2019-12-01 02:25:22

问题


I have a set of n (~1000000) strings (DNA sequences) stored in a list trans. I have to find the minimum hamming distance of all sequences in the list. I implemented a naive brute force algorithm, which has been running for more than a day and has not yet given a solution. My code is

dmin=len(trans[0])
for i in xrange(len(trans)):
    for j in xrange(i+1,len(trans)):
            dist=hamdist(trans[i][:-1], trans[j][:-1])
            if dist < dmin:
                    dmin = dist

Is there a more efficient method to do this? Here hamdist is a function I wrote to find hamming distances. It is

def hamdist(str1, str2):
    diffs = 0
    if len(str1) != len(str2):
        return max(len(str1),len(str2))
    for ch1, ch2 in zip(str1, str2):
        if ch1 != ch2:
          diffs += 1
    return diffs

回答1:


You could optimize your hamdist function by adding an optional parameter containing the minimum distance you have got so far, this way if diffs reaches that value you stop calculating the distance because this comparison will give you a greater distance than the minimum:

def hamdist(str1, str2,prevMin=None):
    diffs = 0
    if len(str1) != len(str2):
        return max(len(str1),len(str2))
    for ch1, ch2 in zip(str1, str2):
        if ch1 != ch2:
            diffs += 1
            if prevMin is not None and diffs>prevMin:
                return None
    return diffs 

You will need to adapt your main loop to work with None return value from hamdist:

dmin=len(trans[0])
for i in xrange(len(trans)):
    for j in xrange(i+1,len(trans)):
            dist=hamdist(trans[i][:-1], trans[j][:-1])
            if dist is not None and dist < dmin:
                    dmin = dist



回答2:


Some ideas:

1) sklearn.metrics.hamming_loss is probably much more efficient than your implementation, even if you have to convert your strings to arrays.

2) Are all your strings unique? If so remove the duplicates.

You can also try sklearn.metrics.pairwise.pairwise_distances, for example:

In [1]: from sklearn.metrics.pairwise import pairwise_distances

In [2]: from sklearn.metrics import hamming_loss

In [3]: a = np.array([[3,4,5], [3,4,4],[3,1,1]])

In [4]: import numpy as np

In [5]: a = np.array([[3,4,5], [3,4,4],[3,1,1]])

In [6]: pairwise_distances(metric=hamming_loss)

In [7]: pairwise_distances(a, metric=hamming_loss)
Out[7]: 
array([[ 0.        ,  0.33333333,  0.66666667],
       [ 0.33333333,  0.        ,  0.66666667],
       [ 0.66666667,  0.66666667,  0.        ]])

I am not seeing a flag that would only calculate upper-triangle, but this still should be faster than looping.




回答3:


As mentioned in this answer, there is no general way to get better than the quadratic running time. You need to exploit the structure of the data. For example, if the threshold t for maximum allowed Hamming distance is small compared to the length of the strings n (e.g. t=100, n=1000000), you can do the following: randomly select k columns (e.g. k=1000), restrict the strings to these columns, and hash them into buckets. You then need to do the pairwise comparison only within each bucket, under the assumption that the two strings with minimum Hamming distance mismatch only in nonselected columns. For the example, this is true with about 90% probability, and you can get the error probability arbitrarily low by repeating the process.




回答4:


find the hamming distances of all strings and store it in an array. some thing like

    distance=[]
    for i in trans:
      distance.append(hamdist(i))

then caluclate the min of them like

    minimum =min(distance)


来源:https://stackoverflow.com/questions/24624415/finding-minimum-hamming-distance-of-a-set-of-strings-in-python

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!