Efficient way of calculating likeness scores of strings when sample size is large?

轻奢々 2020-12-25 15:10

Let's say that you have a list of 10,000 email addresses, and you'd like to find what some of the closest "neighbors" in this list are - defined as email addresses that are only a small Levenshtein (edit) distance away from each other.

8 answers
  •  野趣味 (OP)
     2020-12-25 15:45

    It's possible to do better, on the condition that you reverse the problem.

    I assume here that your 10,000 addresses are fairly 'fixed'; otherwise you will have to add an update mechanism.

    The idea is to use the Levenshtein distance, but in 'reverse' mode. In Python:

    class Addresses:
        def __init__(self, addresses):
            self.rep = dict()
            # Level 0: a simple dictionary that associates each address with itself.
            self.rep[0] = self.generate_base(addresses)
            self.rep[1] = self.generate_level(1)
            self.rep[2] = self.generate_level(2)
            # ... and so on, up to level N.

    The generate_level method generates all possible variations from the previous level's set, minus the variations that already exist at a lower level. It preserves the 'origin' address as the value associated with each key.
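
    The answer leaves generate_base and generate_level unwritten; here is a minimal sketch of what they might look like, assuming Norvig-style single-character edits over a small alphabet (ALPHABET, edits1, and the character set are illustrative choices, not part of the original answer; the two generate_* functions are meant to live on the Addresses class):

    import string

    ALPHABET = string.ascii_lowercase + string.digits + '.@_-'  # assumed character set

    def edits1(word):
        # Norvig-style single edits: deletes, transposes, replaces, inserts.
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [L + R[1:] for L, R in splits if R]
        transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
        replaces = [L + c + R[1:] for L, R in splits if R for c in ALPHABET]
        inserts = [L + c + R for L, R in splits for c in ALPHABET]
        return set(deletes + transposes + replaces + inserts)

    def generate_base(self, addresses):
        # Level 0: each address maps to itself (distance 0, its own 'origin').
        return {a: a for a in addresses}

    def generate_level(self, level):
        # Single-edit variations of the previous level's keys, skipping
        # anything already present at a lower level; origins are preserved.
        seen = set()
        for k in range(level):
            seen.update(self.rep[k])
        variations = {}
        for word, origin in self.rep[level - 1].items():
            for variant in edits1(word):
                if variant not in seen and variant not in variations:
                    variations[variant] = origin
        return variations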

    Then you just have to look up your address in the successive sets:

        def getAddress(self, address):
            # Scan the levels in increasing distance order; the first hit is the closest.
            for index in sorted(self.rep):
                if address in self.rep[index]:
                    return (index, self.rep[index][address])  # tuple (distance, origin)
            return None
    

    Doing so, you compute the various sets once (it takes some time... but then you can serialize them and keep them forever).
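
    For instance, a rough persistence sketch with pickle (the file name and the load_addresses helper are hypothetical):

    import pickle

    # Build once (slow), then persist the precomputed levels.
    index = Addresses(load_addresses())  # load_addresses() is a hypothetical loader
    with open('addresses.pkl', 'wb') as f:
        pickle.dump(index, f)

    # Later: restore and answer queries with plain dictionary lookups.
    with open('addresses.pkl', 'rb') as f:
        index = pickle.load(f)
    print(index.getAddress('jhon.doe@example.com'))  # e.g. (1, 'john.doe@example.com')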

    Lookup is then much more efficient than the O(n^2) pairwise approach, though stating its exact cost is difficult since it depends on the size of the generated sets.

    For reference, have a look at: http://norvig.com/spell-correct.html
