How to find the closest pairs (Hamming distance) in a set of binary strings in Ruby without O(n^2) issues?

迷失自我 2021-02-06 01:13

I've got a MongoDB database with about 1 million documents in it. Each document has a string that represents a 256-bit sequence of 1s and 0s, like:

01101010101010101101010101

4 Answers
  •  孤街浪徒
    2021-02-06 01:52

    As far as I understand, you have an input string X and you want to query the database for documents whose string field b is close to it, that is, the Hamming distance between X and document.b is less than some small number d.

    You can do this in linear time, just by scanning all of your N=1M documents and calculating the distance (which takes a small, fixed amount of time per document). Since you only want documents with distance smaller than d, you can abandon a comparison after d mismatched characters; you only need to compare all 256 characters when most of them match.
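    A minimal sketch of that linear scan in Ruby, assuming the candidate strings have already been loaded into an array. The helper names hamming_below and close_matches are made up for illustration and do not come from the question.

```ruby
# Hamming distance between two equal-length bit strings, with early exit.
# Returns the distance if it is strictly below `limit`, otherwise nil.
# (Hypothetical helper name, chosen for this sketch.)
def hamming_below(a, b, limit)
  raise ArgumentError, "length mismatch" unless a.length == b.length
  return nil if limit <= 0

  mismatches = 0
  a.each_char.with_index do |ch, i|
    next if ch == b[i]

    mismatches += 1
    # Give up as soon as the distance can no longer be below the limit.
    return nil if mismatches >= limit
  end
  mismatches
end

# Linear scan: keep every candidate whose distance to x is below d.
# Note that 0 is truthy in Ruby, so exact matches are kept.
def close_matches(x, candidates, d)
  candidates.select { |b| hamming_below(x, b, d) }
end
```

    For example, close_matches("0110", ["0110", "0111", "1001"], 2) keeps the first two strings and drops the third after two mismatches. If the character loop turns out to be the bottleneck, one alternative is to parse each string once with to_i(2), XOR the two integers, and count the 1s in the result.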

    You can try to scan fewer than N documents, that is, to get better than linear time.

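    Let ones(s) be the number of 1s in string s. For each document, store ones(document.b) as a new indexed field ones_count. Two strings at Hamming distance k differ in their number of ones by at most k, so you only need to query documents whose number of ones is close enough to ones(X), specifically, ones(X) - d <= document.ones_count <= ones(X) + d. The Mongo index should kick in here. This range is only a necessary condition, so you still have to compute the actual distance for the documents it returns.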
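    The prefilter can be sketched as follows. The filter hash uses MongoDB's $gte and $lte operators; the in-memory array of hashes is only a stand-in for the collection, and the helper names (ones_count_filter, close_documents) are assumptions made for this example.

```ruby
# Number of 1s in a bit string; this is the value stored as ones_count.
def ones(s)
  s.count("1")
end

# Range condition on the indexed field, written as a MongoDB query filter.
# With the Ruby driver it would be passed to something like
# collection.find(filter); the collection itself is assumed here.
def ones_count_filter(x, d)
  n = ones(x)
  { "ones_count" => { "$gte" => n - d, "$lte" => n + d } }
end

# Plain Hamming distance between two equal-length bit strings.
def hamming(a, b)
  a.chars.zip(b.chars).count { |p, q| p != q }
end

# In-memory stand-in for the query: apply the range filter first,
# then verify the real distance on the surviving candidates only.
def close_documents(x, docs, d)
  range = ones_count_filter(x, d)["ones_count"]
  candidates = docs.select do |doc|
    doc["ones_count"].between?(range["$gte"], range["$lte"])
  end
  candidates.select { |doc| hamming(x, doc["b"]) < d }
end
```

    The second pass matters: "0011" and "1100" have the same number of ones but are at distance 4, so the range filter alone would let that pair through.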

    If you want to find all close enough pairs in the set, see @Philippe's answer.
