Efficient way of calculating likeness scores of strings when sample size is large?

前端未结

关注

 8  827

轻奢々 2020-12-25 15:10

Let\'s say that you have a list of 10,000 email addresses, and you\'d like to find what some of the closest \"neighbors\" in this list are - defined as email addresses that

8条回答

小蘑菇 (楼主)

2020-12-25 15:43

Let's say you have 3 strings:

1 - "abc" 2 - "bcd" 3 - "cde"

The L Distance between 1 & 2 is 2 (subtract 'a', add 'd'). The L Distance between 2 & 3 is 2 (subtract 'b', add 'e').

Your question is whether we can infer an L Distance between 1 & 3 by using the 2 comparisons above. The answer is no.

The L Distance between 1 & 3 is 3 (replace every character), there is no way that this can be inferred because of the scores of the first 2 calculations. The scores do not reveal whether deletions, insertions or substitution operations were performed.

So, I'd say that Levenshtein is a poor choice for a large list.

0 讨论(0)

查看其它8个回答
发布评论:

提交评论
- 加载中...