Efficient way of calculating likeness scores of strings when sample size is large?

轻奢々 2020-12-25 15:10

Let's say that you have a list of 10,000 email addresses, and you'd like to find what some of the closest "neighbors" in this list are - defined as email addresses that are only a small Levenshtein (edit) distance away from each other.

8 answers
  •  野趣味 (OP)
     2020-12-25 15:45

    It's possible to do better, on the condition that you reverse the problem.

    I assume here that your 10,000 addresses are fairly 'fixed'; otherwise you will have to add an update mechanism.

    The idea is to use the Levenshtein distance, but in 'reverse' mode. In Python:

    class Addresses:
        def __init__(self, addresses):
            self.rep = dict()
            # Level 0: a simple dictionary that associates each address with itself.
            self.rep[0] = self.generate_base(addresses)
            self.rep[1] = self.generate_level(1)
            self.rep[2] = self.generate_level(2)
            # ... and so on, up to level N.

    The generate_level method generates all possible variations from the previous level's set, minus the variations that already exist at a lower level. It preserves the 'origin' address as the value associated with each key.
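
    The answer leaves generate_base and generate_level unwritten; here is a minimal sketch of what they might look like, assuming Norvig-style single-character edits over a small alphabet (ALPHABET, edits1, and the character set are illustrative choices, not part of the original answer; the two generate_* functions are meant to live on the Addresses class):

    import string

    ALPHABET = string.ascii_lowercase + string.digits + '.@_-'  # assumed character set

    def edits1(word):
        # Norvig-style single edits: deletes, transposes, replaces, inserts.
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [L + R[1:] for L, R in splits if R]
        transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
        replaces = [L + c + R[1:] for L, R in splits if R for c in ALPHABET]
        inserts = [L + c + R for L, R in splits for c in ALPHABET]
        return set(deletes + transposes + replaces + inserts)

    def generate_base(self, addresses):
        # Level 0: each address maps to itself (distance 0, its own 'origin').
        return {a: a for a in addresses}

    def generate_level(self, level):
        # Single-edit variations of the previous level's keys, skipping
        # anything already present at a lower level; origins are preserved.
        seen = set()
        for k in range(level):
            seen.update(self.rep[k])
        variations = {}
        for word, origin in self.rep[level - 1].items():
            for variant in edits1(word):
                if variant not in seen and variant not in variations:
                    variations[variant] = origin
        return variations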

    Then you just have to look up your address in the successive sets:

        def getAddress(self, address):
            # Scan the levels in increasing distance order; the first hit is the closest.
            for index in sorted(self.rep):
                if address in self.rep[index]:
                    return (index, self.rep[index][address])  # tuple (distance, origin)
            return None
    

    Doing so, you compute the various sets once (it takes some time... but then you can serialize them and keep them forever).
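
    For instance, a rough persistence sketch with pickle (the file name and the load_addresses helper are hypothetical):

    import pickle

    # Build once (slow), then persist the precomputed levels.
    index = Addresses(load_addresses())  # load_addresses() is a hypothetical loader
    with open('addresses.pkl', 'wb') as f:
        pickle.dump(index, f)

    # Later: restore and answer queries with plain dictionary lookups.
    with open('addresses.pkl', 'rb') as f:
        index = pickle.load(f)
    print(index.getAddress('jhon.doe@example.com'))  # e.g. (1, 'john.doe@example.com')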

    Lookup is then much more efficient than the O(n^2) pairwise approach, though stating its exact cost is difficult since it depends on the size of the generated sets.

    For reference, have a look at: http://norvig.com/spell-correct.html
