I have a set of n (~1,000,000) strings (DNA sequences) stored in a list trans. I have to find the minimum Hamming distance over all pairs of sequences in the list. I implemented a naive brute-force algorithm, which has been running for more than a day and has not yet produced a solution. My code is:
dmin = len(trans[0])
for i in xrange(len(trans)):
    for j in xrange(i + 1, len(trans)):
        dist = hamdist(trans[i][:-1], trans[j][:-1])
        if dist < dmin:
            dmin = dist
Is there a more efficient way to do this? Here hamdist is a function I wrote to compute Hamming distances:
def hamdist(str1, str2):
    diffs = 0
    if len(str1) != len(str2):
        return max(len(str1), len(str2))
    for ch1, ch2 in zip(str1, str2):
        if ch1 != ch2:
            diffs += 1
    return diffs
You could optimize your hamdist function by adding an optional parameter containing the minimum distance found so far. That way, once diffs reaches that value you can stop calculating the distance, because this comparison can only yield a distance greater than or equal to the current minimum:
def hamdist(str1, str2, prevMin=None):
    diffs = 0
    if len(str1) != len(str2):
        return max(len(str1), len(str2))
    for ch1, ch2 in zip(str1, str2):
        if ch1 != ch2:
            diffs += 1
            if prevMin is not None and diffs > prevMin:
                return None
    return diffs
You will need to adapt your main loop to work with the None return value from hamdist, and to pass the current minimum as the third argument so the early exit can actually trigger:
dmin = len(trans[0])
for i in xrange(len(trans)):
    for j in xrange(i + 1, len(trans)):
        dist = hamdist(trans[i][:-1], trans[j][:-1], dmin)
        if dist is not None and dist < dmin:
            dmin = dist
Some ideas:
1) sklearn.metrics.hamming_loss is probably much more efficient than your implementation, even if you have to convert your strings to arrays.
2) Are all your strings unique? If not, remove the duplicates first (any duplicated sequence means the minimum Hamming distance is 0); a quick check is sketched below.
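For idea 2, a quick check might look like this (a minimal sketch of my own, assuming trans is the list of sequences from the question):

# Sketch: any duplicate immediately gives a minimum distance of 0;
# otherwise deduplicating shrinks the number of pairs to compare.
unique = set(trans)
if len(unique) < len(trans):
    dmin = 0
else:
    trans = list(unique)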
You can also try sklearn.metrics.pairwise.pairwise_distances, for example:

In [1]: import numpy as np
In [2]: from sklearn.metrics.pairwise import pairwise_distances
In [3]: from sklearn.metrics import hamming_loss
In [4]: a = np.array([[3, 4, 5], [3, 4, 4], [3, 1, 1]])
In [5]: pairwise_distances(a, metric=hamming_loss)
Out[5]:
array([[ 0.        ,  0.33333333,  0.66666667],
       [ 0.33333333,  0.        ,  0.66666667],
       [ 0.66666667,  0.66666667,  0.        ]])
I don't see a flag that would calculate only the upper triangle, but this should still be faster than looping in Python.
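To apply this to the DNA strings you would first have to convert them to numeric arrays, as noted in idea 1. Here is a rough sketch of one way to do that; it is my own addition, not from the original answer, and it uses scipy's built-in 'hamming' metric string instead of the hamming_loss callable to avoid a Python call per pair. It assumes all sequences have equal length, and for ~10^6 sequences the full distance matrix would not fit in memory, so this only illustrates the call on a smaller set:

import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

# Strip trailing newlines (as the question's [:-1] slices do) and
# encode each equal-length sequence as a row of byte values.
seqs = [s.rstrip('\n') for s in trans]
a = np.array([[ord(c) for c in s] for s in seqs], dtype=np.uint8)

# metric='hamming' returns the fraction of differing positions;
# multiply by the sequence length to get the usual Hamming distance.
d = pairwise_distances(a, metric='hamming') * a.shape[1]

# Ignore the zero diagonal when taking the minimum.
np.fill_diagonal(d, np.inf)
dmin = int(d.min())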
As mentioned in this answer, there is no general way to get better than the quadratic running time. You need to exploit the structure of the data. For example, if the threshold t for maximum allowed Hamming distance is small compared to the length of the strings n (e.g. t=100, n=1000000), you can do the following: randomly select k columns (e.g. k=1000), restrict the strings to these columns, and hash them into buckets. You then need to do the pairwise comparison only within each bucket, under the assumption that the two strings with minimum Hamming distance mismatch only in nonselected columns. For the example, this is true with about 90% probability, and you can get the error probability arbitrarily low by repeating the process.
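A rough sketch of that idea follows; the parameter values and helper structure are illustrative only, not from the original answer, and it assumes all sequences have the same length and reuses hamdist from the question:

import random
from collections import defaultdict

def approx_min_hamming(seqs, k=1000, repeats=5):
    # Approximate the minimum Hamming distance by hashing on k random columns
    # and brute-forcing only within buckets that agree on those columns.
    length = len(seqs[0])
    best = length
    for _ in xrange(repeats):
        cols = random.sample(range(length), min(k, length))
        buckets = defaultdict(list)
        for s in seqs:
            buckets[''.join(s[c] for c in cols)].append(s)
        # Pairwise comparison only within each bucket.
        for group in buckets.values():
            for i in xrange(len(group)):
                for j in xrange(i + 1, len(group)):
                    d = hamdist(group[i], group[j])
                    if d < best:
                        best = d
    return best

With t = 100 mismatching positions, n = 1,000,000 columns and k = 1,000 sampled columns, each repeat puts the closest pair into the same bucket with probability roughly (1 - k/n)^t ≈ 0.9, which is where the ~90% figure above comes from; increasing repeats drives the failure probability down.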
Find the Hamming distances of all pairs of strings and store them in a list, something like:

from itertools import combinations

distance = []
for s1, s2 in combinations(trans, 2):
    distance.append(hamdist(s1, s2))

then calculate the minimum of them like:

minimum = min(distance)
Source: https://stackoverflow.com/questions/24624415/finding-minimum-hamming-distance-of-a-set-of-strings-in-python