Finding the minimum Hamming distance of a set of strings in Python

I have a set of n (~1000000) strings (DNA sequences) stored in a list trans. I need to find the minimum Hamming distance between any two sequences in the list. I implemented a naive brute-force algorithm, which has been running for more than a day and has not yet produced a result. My code is

dmin = len(trans[0])
for i in xrange(len(trans)):
    for j in xrange(i+1, len(trans)):
        dist = hamdist(trans[i][:-1], trans[j][:-1])
        if dist < dmin:
            dmin = dist

Is there a more efficient method to do this? Here hamdist is a function I wrote to find hamming distances. It is

def hamdist(str1, str2):
    diffs = 0
    if len(str1) != len(str2):
        return max(len(str1),len(str2))
    for ch1, ch2 in zip(str1, str2):
        if ch1 != ch2:
            diffs += 1
    return diffs

Pablo Francisco Pérez Hidalgo

You could optimize your hamdist function by adding an optional parameter that holds the minimum distance found so far. Once diffs reaches that value, you can stop computing the distance, because this pair cannot improve on the current minimum:

def hamdist(str1, str2, prevMin=None):
    diffs = 0
    if len(str1) != len(str2):
        return max(len(str1), len(str2))
    for ch1, ch2 in zip(str1, str2):
        if ch1 != ch2:
            diffs += 1
            # Stop early: this pair can no longer beat the current minimum.
            if prevMin is not None and diffs >= prevMin:
                return None
    return diffs

You will need to adapt your main loop to pass the current minimum to hamdist and to handle its None return value:

dmin = len(trans[0])
for i in xrange(len(trans)):
    for j in xrange(i+1, len(trans)):
        # Pass dmin so hamdist can stop early on hopeless pairs.
        dist = hamdist(trans[i][:-1], trans[j][:-1], dmin)
        if dist is not None and dist < dmin:
            dmin = dist

Some ideas:

1) sklearn.metrics.hamming_loss is probably much more efficient than your implementation, even if you have to convert your strings to arrays.

2) Are all your strings unique? If not, remove the duplicates first.
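The duplicate check in point 2 can be sketched as follows. This is a minimal Python 3 sketch; `has_duplicates` is a hypothetical helper name, the toy data is invented for illustration, and it assumes that two identical entries in the list count as a pair at distance 0.

```python
def has_duplicates(seqs):
    # A set keeps one copy of each distinct string, so it is shorter
    # than the list exactly when at least one string is repeated.
    return len(set(seqs)) < len(seqs)

trans = ["ACGT", "ACGA", "ACGT"]   # toy data, not from the question
if has_duplicates(trans):
    dmin = 0                       # identical strings differ in 0 positions
else:
    dmin = None                    # no shortcut; a pairwise search is needed
unique = sorted(set(trans))        # duplicates removed for any later step
```

If duplicates exist, the pairwise search can be skipped entirely; otherwise it runs over `unique`, which is never longer than the original list.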

You can also try sklearn.metrics.pairwise.pairwise_distances, for example:

In [1]: import numpy as np

In [2]: from sklearn.metrics.pairwise import pairwise_distances

In [3]: from sklearn.metrics import hamming_loss

In [4]: a = np.array([[3,4,5], [3,4,4],[3,1,1]])

In [5]: pairwise_distances(a, metric=hamming_loss)
Out[5]: 
array([[ 0.        ,  0.33333333,  0.66666667],
       [ 0.33333333,  0.        ,  0.66666667],
       [ 0.66666667,  0.66666667,  0.        ]])

I don't see a flag that would compute only the upper triangle, but this should still be faster than looping.
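Note that a full pairwise matrix for about 1,000,000 strings would have on the order of 10^12 entries, so it cannot be held in memory. One way around that, sketched below in Python 3 with plain NumPy rather than sklearn, is to compare one row against all later rows at a time and keep only the running minimum. The function name `min_hamming_numpy` is hypothetical, and the sketch assumes at least two ASCII strings of equal length. The running time is still quadratic; only the inner loop is vectorized.

```python
import numpy as np

def min_hamming_numpy(seqs):
    # Assumption: at least two strings, all ASCII and all the same length.
    arr = np.array([np.frombuffer(s.encode("ascii"), dtype=np.uint8)
                    for s in seqs])
    n, length = arr.shape
    best = length
    for i in range(n - 1):
        # Count mismatches between row i and every later row at once.
        dists = (arr[i + 1:] != arr[i]).sum(axis=1)
        best = min(best, int(dists.min()))
        if best == 0:
            break  # cannot get lower than 0
    return best
```

Each iteration allocates a temporary boolean array of roughly (n - i - 1) × length bytes, so for very long sequences the later rows may need to be processed in chunks.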

Falk Hüffner

As mentioned in this answer, there is no general way to get better than the quadratic running time. You need to exploit the structure of the data. For example, if the threshold t for maximum allowed Hamming distance is small compared to the length of the strings n (e.g. t=100, n=1000000), you can do the following: randomly select k columns (e.g. k=1000), restrict the strings to these columns, and hash them into buckets. You then need to do the pairwise comparison only within each bucket, under the assumption that the two strings with minimum Hamming distance mismatch only in nonselected columns. For the example, this is true with about 90% probability, and you can get the error probability arbitrarily low by repeating the process.
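The 90% figure follows from the chance that none of the k sampled columns is one of the t mismatching ones, roughly (1 - t/n)^k = (1 - 100/1000000)^1000 ≈ 0.905. The bucketing idea itself might look like the Python 3 sketch below. The names `bucketed_min_distance`, `rounds` and `seed` are hypothetical, the strings are assumed to have equal length, and the result is only the smallest distance among pairs that landed in a common bucket in some round (or None if no pair ever did), so it is an upper bound on the true minimum rather than a guaranteed answer.

```python
import random
from collections import defaultdict
from itertools import combinations

def hamdist(a, b):
    # Plain Hamming distance for equal-length strings.
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def bucketed_min_distance(seqs, k, rounds=10, seed=0):
    rng = random.Random(seed)
    length = len(seqs[0])
    best = None
    for _ in range(rounds):
        # Pick k random columns and bucket strings by their restriction.
        cols = sorted(rng.sample(range(length), k))
        buckets = defaultdict(list)
        for s in seqs:
            key = "".join(s[c] for c in cols)
            buckets[key].append(s)
        # Only strings that agree on all k columns are compared pairwise.
        for group in buckets.values():
            for a, b in combinations(group, 2):
                d = hamdist(a, b)
                if best is None or d < best:
                    best = d
    return best
```

Raising `rounds` lowers the chance of missing the closest pair, while raising `k` makes buckets smaller and each round cheaper but more likely to separate that pair.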

Compute the Hamming distance of every pair of strings and store the results in a list, something like

    from itertools import combinations

    distance = []
    for a, b in combinations(trans, 2):
        distance.append(hamdist(a, b))

then calculate the minimum:

    minimum = min(distance)

Source: https://stackoverflow.com/questions/24624415/finding-minimum-hamming-distance-of-a-set-of-strings-in-python

Tags

python

algorithm

bigdata

hamming-distance