Finding minimum Hamming distance of a set of strings in Python

Submitted by 一个人想着一个人 on 2019-12-01 04:58:56
Pablo Francisco Pérez Hidalgo

You could optimize your hamdist function by adding an optional parameter that holds the minimum distance found so far. As soon as diffs exceeds that value, you can stop counting, because this comparison is bound to yield a distance greater than the current minimum:

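def hamdist(str1, str2, prevMin=None):
    diffs = 0
    # Strings of different lengths are treated as maximally distant.
    if len(str1) != len(str2):
        return max(len(str1), len(str2))
    for ch1, ch2 in zip(str1, str2):
        if ch1 != ch2:
            diffs += 1
            # Give up early: this pair cannot beat the current minimum.
            if prevMin is not None and diffs > prevMin:
                return None
    return diffs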

You will need to adapt your main loop to pass the current minimum to hamdist and to handle its None return value:

dmin = len(trans[0])
for i in range(len(trans)):
    for j in range(i + 1, len(trans)):
        # Pass the current minimum so hamdist can stop early.
        dist = hamdist(trans[i][:-1], trans[j][:-1], dmin)
        if dist is not None and dist < dmin:
            dmin = dist
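
The early-exit idea above can be sketched in self-contained Python 3 as follows. This is only a sketch: it assumes all strings have the same length, and the names hamdist_bounded and min_hamming are made up for illustration, not taken from the question. It stops as soon as the count reaches the current minimum, since an equal distance cannot improve the result either.

```python
from itertools import combinations


def hamdist_bounded(s1, s2, bound=None):
    """Return the Hamming distance of s1 and s2, or None once it reaches `bound`.

    Assumes len(s1) == len(s2).
    """
    diffs = 0
    for ch1, ch2 in zip(s1, s2):
        if ch1 != ch2:
            diffs += 1
            # A distance equal to the current minimum cannot improve it.
            if bound is not None and diffs >= bound:
                return None
    return diffs


def min_hamming(strings):
    """Return the smallest pairwise Hamming distance, or None for < 2 strings."""
    best = None
    for s1, s2 in combinations(strings, 2):
        dist = hamdist_bounded(s1, s2, best)
        if dist is not None:
            best = dist
            if best == 0:
                break  # identical strings: nothing can be smaller
    return best
```

The worst case is still quadratic in the number of strings; the bound only shortens the individual comparisons.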

Some ideas:

1) sklearn.metrics.hamming_loss is probably much more efficient than your implementation, even if you have to convert your strings to arrays.

2) Are all your strings unique? If not, remove the duplicates.
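
For the second idea, a minimal sketch of deduplicating before the pairwise loop, assuming trans is a list of strings (the sample data here is made up). Keep in mind that if duplicates count as distinct entries in your problem, the minimum distance is trivially 0.

```python
trans = ["0101", "1101", "0101", "0011"]  # hypothetical sample data

# dict.fromkeys keeps the first occurrence of each string and preserves order
# (guaranteed for dicts since Python 3.7).
unique = list(dict.fromkeys(trans))

# True when at least one string occurred more than once.
has_duplicates = len(unique) < len(trans)
```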

You can also try sklearn.metrics.pairwise.pairwise_distances, for example:

In [1]: import numpy as np

In [2]: from sklearn.metrics.pairwise import pairwise_distances

In [3]: from sklearn.metrics import hamming_loss

In [4]: a = np.array([[3,4,5], [3,4,4],[3,1,1]])

In [5]: pairwise_distances(a, metric=hamming_loss)
Out[5]:
array([[ 0.        ,  0.33333333,  0.66666667],
       [ 0.33333333,  0.        ,  0.66666667],
       [ 0.66666667,  0.66666667,  0.        ]])

I don't see a flag that would compute only the upper triangle, but this should still be faster than looping.

Falk Hüffner

As mentioned in this answer, there is no general way to do better than quadratic running time. You need to exploit the structure of the data. For example, if the threshold t for the maximum allowed Hamming distance is small compared to the length n of the strings (e.g. t=100, n=1000000), you can do the following: randomly select k columns (e.g. k=1000), restrict the strings to these columns, and hash them into buckets. You then need to do the pairwise comparison only within each bucket, under the assumption that the two strings with minimum Hamming distance mismatch only in non-selected columns. For the example, this is true with about 90% probability, and you can make the error probability arbitrarily low by repeating the process.
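
A rough sketch of this bucketing scheme in Python 3 might look like the following. The function names, the rounds parameter and the seed argument are assumptions made for illustration; the answer itself does not prescribe an implementation. The last helper estimates the success probability of one round by treating the k columns as independent draws, which is a close approximation when k is much smaller than n.

```python
import random
from collections import defaultdict
from itertools import combinations


def hamming(s1, s2):
    """Plain Hamming distance; assumes equal lengths."""
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))


def bucketed_min_distance(strings, k, rounds=10, seed=None):
    """Randomized search for the minimum pairwise Hamming distance.

    Each round samples k columns, groups the strings by their projection
    onto those columns, and compares only strings that share a bucket.
    Returns None if no two strings ever shared a bucket.
    """
    rng = random.Random(seed)
    n = len(strings[0])
    best = None
    for _ in range(rounds):
        cols = rng.sample(range(n), k)
        buckets = defaultdict(list)
        for s in strings:
            key = "".join(s[c] for c in cols)
            buckets[key].append(s)
        for group in buckets.values():
            for s1, s2 in combinations(group, 2):
                d = hamming(s1, s2)
                if best is None or d < best:
                    best = d
    return best


def miss_free_probability(n, t, k):
    """Approximate chance that k random columns avoid all t mismatches."""
    return (1 - t / n) ** k
```

The result is an upper bound on the true minimum: it is exact only if, in at least one round, the closest pair agreed on every sampled column. For the numbers in the example, miss_free_probability(1000000, 100, 1000) comes out at roughly 0.905, which matches the "about 90%" figure.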

Find the Hamming distances of all pairs of strings and store them in a list, something like

    from itertools import combinations

    distance = []
    for s1, s2 in combinations(trans, 2):
        distance.append(hamdist(s1, s2))

then calculate the minimum of them like

    minimum = min(distance)