I've got a MongoDB with about 1 million documents in it. These documents all have a string that represents a 256 bit bin of 1s and 0s, like:
01101010101010101101010101
As far as I could understand, you have an input string X and you want to query the database for a document containing string field b such that Hamming distance between X and document.b is less than some small number d.
You can do this in linear time, just by scanning all of your N = 1M documents and calculating the distance (which takes a small, fixed time per document). Since you only want documents with distance smaller than d, you can give up the comparison after d unmatched characters; you only need to compare all 256 characters if most of them match.
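The early exit described above could look something like the following. This is only a sketch: the function name hamming_within is made up for illustration, and it assumes both strings have the same length and contain only the characters 0 and 1.

```python
def hamming_within(x: str, b: str, d: int) -> bool:
    """Return True if the Hamming distance between x and b is less than d.

    hamming_within is a hypothetical helper name, not part of any library.
    """
    if len(x) != len(b):
        raise ValueError("strings must have equal length")
    if d <= 0:
        # No distance can be smaller than zero.
        return False
    mismatches = 0
    for cx, cb in zip(x, b):
        if cx != cb:
            mismatches += 1
            # Give up as soon as d mismatches are seen: the final
            # distance can no longer be smaller than d.
            if mismatches >= d:
                return False
    return True
```

For two strings that differ almost everywhere, the loop stops after roughly d characters instead of reading all 256.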
You can try to scan fewer than N documents, that is, to get better than linear time.
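Let ones(s) be the number of 1s in string s. For each document, store ones(document.b) as a new indexed field ones_count. Each differing position changes the number of ones by at most one, so two strings within Hamming distance d of each other cannot differ in their ones count by more than d. That means you only need to query documents where the number of ones is close enough to ones(X), specifically, ones(X) - d <= document.ones_count <= ones(X) + d. The Mongo index should kick in here. The documents returned are only candidates, so you still have to compute the exact distance for each of them.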
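A rough sketch of how the range condition could be turned into a query filter. The helper names ones_count_filter and with_ones_count are invented here for illustration; only the $gte and $lte operators and the field names b and ones_count come from the description above.

```python
def ones(s: str) -> int:
    """Number of '1' characters in the bit string s."""
    return s.count("1")


def with_ones_count(doc: dict) -> dict:
    """Return a copy of doc with ones_count derived from its b field."""
    return {**doc, "ones_count": ones(doc["b"])}


def ones_count_filter(x: str, d: int) -> dict:
    """Build a range filter: ones(x) - d <= ones_count <= ones(x) + d."""
    n = ones(x)
    return {"ones_count": {"$gte": n - d, "$lte": n + d}}


# Assumed usage with pymongo, where `collection` is a hypothetical
# Collection object holding the documents:
#   collection.create_index("ones_count")
#   candidates = collection.find(ones_count_filter(x, d))
# Each candidate still needs an exact Hamming distance check.
```

How much this helps depends on the data: if most documents have a ones count near ones(X), the range still matches a large share of the collection.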
If you want to find all close enough pairs in the set, see @Philippe's answer.