Optimal way to cluster set of strings with hamming distance [duplicate]

隐身守侯 提交于 2019-12-06 16:46:29

问题


I have a database with n strings (n > 1 million), each string has 100 chars, each char is either a, b, c or d.

I would like to find the closest strings for each one, closest defines as having the smallest hamming distance. I would like to find the k-nearest strings for each one (k < 5).

Example

N = 5
i1 = aacbdbbb
i2 = abcbdbbb
i3 = bbcadabd
i4 = bbcadabb

HammingDistance(i1,i2) = 1
HammingDistance(i1,i3) = 5
HammingDistance(i1,i4) = 4
HammingDistance(i2,i3) = 4
HammingDistance(i2,i4) = 3
HammingDistance(i3,i4) = 1

For k=1 it should return {(i1,i2),(i2,i1),(i3,i4),(i4,i3)}. For k=2 it should return { (i1,{i2,i4}),(i2,{i1,i4}),(i3,{i4,i2}),(i4,{i3,i2})}. And so on. For each string should find k-nearest string.

Naive solution has O(n^2) time complexity. I would like to find solution with better complexity. I found some other solutions, but none of them was better than naive.

How can I optimally distribute such strings into clusters? One string could be in both or more clusters. Solution may be deterministic or probabilistic.

来源:https://stackoverflow.com/questions/27739872/optimal-way-to-cluster-set-of-strings-with-hamming-distance

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!