I\'ve got a MongoDB with about 1 million documents in it. These documents all have a string that represents a 256 bit bin of 1s and 0s, like:
01101010101010101101010101
The Hamming distance defines a metric space, so you could use the O(n log n) algorithm to find the closest pair of points, which is of the typical divide-and-conquer nature.
You can then apply this repeatedly until you have "enough" pairs.
Edit: I see now that Wikipedia doesn't actually give the algorithm, so here is one description.
Edit 2: The algorithm can be modified to give up if there are no pairs at distance less than n
. For the case of the Hamming distance: simply count the level of recursion you are in. If you haven't found something at level n
in any branch, then give up (in other words, never enter n + 1
). If you are using a metric where splitting on one dimension doesn't always yield a distance of 1
, you need to adjust the level of recursion where you give up.