Efficient way of calculating likeness scores of strings when sample size is large?

轻奢々 2020-12-25 15:10

Let's say that you have a list of 10,000 email addresses, and you'd like to find what some of the closest "neighbors" in this list are - defined as email addresses that

8 answers
  •  小蘑菇 (original poster)
    2020-12-25 15:43

    Let's say you have 3 strings:

    1 - "abc" 2 - "bcd" 3 - "cde"

    The Levenshtein distance between 1 & 2 is 2 (delete 'a', insert 'd'). The Levenshtein distance between 2 & 3 is also 2 (delete 'b', insert 'e').

    Your question is whether we can infer the Levenshtein distance between 1 & 3 from the 2 comparisons above. The answer is no.

    The Levenshtein distance between 1 & 3 is 3 (substitute every character). That value cannot be inferred from the scores of the first 2 calculations, because the scores do not reveal whether deletions, insertions or substitutions were performed.
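
    To check the three scores above, here is a minimal sketch of the standard dynamic-programming Levenshtein distance. The function name `levenshtein` and the variable names are my own choices for illustration, not taken from any particular library.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions, each at cost 1."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or keep a match
            ))
        prev = curr
    return prev[-1]


d12 = levenshtein("abc", "bcd")
d23 = levenshtein("bcd", "cde")
d13 = levenshtein("abc", "cde")
print(d12, d23, d13)
```

    The two known scores only bound the third one: by the triangle inequality it has to lie between |2 - 2| = 0 and 2 + 2 = 4, so the exact value of 3 still has to be computed directly.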

    So, I'd say that Levenshtein is a poor choice for a large list.
