Let's say that you have a list of 10,000 email addresses, and you'd like to find some of the closest "neighbors" in this list - defined as email addresses that are only a few edits apart.
It's possible to do better, provided you reverse the problem.
I assume here that your 10,000 addresses are fairly 'fixed'; otherwise you will need to add an update mechanism.
The idea is to use the Levenshtein distance, but in 'reverse' mode: instead of computing distances at query time, you precompute every string within a small distance of your addresses. In Python:
class Addresses:
    def __init__(self, addresses, max_distance):
        self.rep = dict()
        # Level 0: a simple dictionary mapping each address to itself
        self.rep[0] = self.generate_base(addresses)
        # Level k: every string at Levenshtein distance k from an
        # address, up to max_distance (the N you are willing to precompute)
        for level in range(1, max_distance + 1):
            self.rep[level] = self.generate_level(level)
The generate_level method generates all possible one-edit variations of the previous level's entries, minus the variations that already exist at an earlier level. It preserves the 'origin' address as the value associated with each key.
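To make this concrete, here is a minimal runnable sketch of the whole class (repeating the constructor so the snippet stands alone). The ALPHABET and the restriction to lowercase addresses are my assumptions, and edits1 is adapted from Norvig's spell corrector linked below:

import string

# Assumed alphabet for generating variations; widen it if your
# addresses can contain other characters.
ALPHABET = string.ascii_lowercase + string.digits + "@._-"

def edits1(word):
    # All strings at Levenshtein distance exactly 1 from word:
    # deletions, substitutions and insertions.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + replaces + inserts)

class Addresses:
    def __init__(self, addresses, max_distance):
        self.rep = dict()
        self.rep[0] = self.generate_base(addresses)
        for level in range(1, max_distance + 1):
            self.rep[level] = self.generate_level(level)

    def generate_base(self, addresses):
        # Each original address maps to itself.
        return {address: address for address in addresses}

    def generate_level(self, level):
        # One-edit variations of the previous level's keys, keeping
        # the origin and skipping anything seen at an earlier level.
        seen = set()
        for k in range(level):
            seen.update(self.rep[k])
        result = {}
        for variant, origin in self.rep[level - 1].items():
            for new_variant in edits1(variant):
                if new_variant not in seen and new_variant not in result:
                    result[new_variant] = origin  # first origin wins on ties
        return result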
Then you just have to look up your address in the various sets:
def getAddress(self, address):
    # Probe the levels in increasing distance order; the first hit
    # is the closest precomputed match.
    for distance in sorted(self.rep.keys()):
        if address in self.rep[distance]:
            return (distance, self.rep[distance][address])  # (distance, origin)
    return None
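For example, with getAddress added as a method of the sketch above (the addresses are hypothetical; max_distance=1 keeps the precomputation small):

index = Addresses(["alice@example.com", "bob@example.com"], max_distance=1)
print(index.getAddress("alice@example.com"))  # (0, 'alice@example.com')
print(index.getAddress("alice@exmple.com"))   # (1, 'alice@example.com')
print(index.getAddress("carol@example.com"))  # None: more than 1 edit away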
This way, you compute the various sets only once (it takes some time, but then you can serialize them and keep them forever).
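Serialization could be as simple as pickling the instance (a sketch; the filename is illustrative):

import pickle

# Build once (slow), then persist the precomputed levels.
with open("address_index.pkl", "wb") as f:
    pickle.dump(index, f)

# Later runs reload the index instead of regenerating it.
with open("address_index.pkl", "rb") as f:
    index = pickle.load(f)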
Lookup is then far more efficient than an O(n^2) comparison of every pair: it costs one dictionary probe per distance level. Stating the precomputation cost exactly is difficult, since it depends on the size of the generated sets; with a roughly 40-symbol alphabet, a 20-character address already has on the order of 1,600 one-edit variants, and the second level multiplies that again.
For reference, have a look at: http://norvig.com/spell-correct.html