Alternative to Levenshtein and Trigram

春和景丽 2021-02-07 09:48

Say I have the following two strings in my database:

(1) 'Levi Watkins Learning Center - Alabama State University'
(2) 'ETH Library'

My sof

6 answers

广开言路 (OP)

2021-02-07 10:18

First, your distance score needs to be normalized by the length of the database entry and/or the input. A distance of 5 against an expression of 10 characters is much worse than a distance of 5 against an expression of 100 characters.
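To make that concrete, here is a minimal sketch of a length-normalized distance. The names `levenshtein` and `normalized_distance` are my own, and dividing by the longer string's length is just one common choice of normalization, not the only one.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insert, delete and substitute each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length: 0.0 = identical, 1.0 = nothing in common."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


# The same raw distance of 5 means very different things at different lengths.
short = normalized_distance("a" * 10, "b" * 5 + "a" * 5)
long_ = normalized_distance("a" * 100, "b" * 5 + "a" * 95)
print(short, long_)
```

With this, a threshold can be expressed as a fraction of the string length instead of an absolute number of edits.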

But the main problem with your approach is that plain Levenshtein is not a substring matching algorithm. It compares all of one string with all of another string. Your big distance in case (1) is due to the large number of words in the database expression that are not in the input expression.

To get around that, you are better off using an algorithm that can match substrings, such as fuzzy Bitap or Smith–Waterman.
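As an illustration of what substring matching buys you, here is a rough sketch. It is neither Bitap nor Smith–Waterman; it is the simpler Sellers variant of the Levenshtein table, where the match may start anywhere in the text for free. It returns the minimum edit distance between the pattern and any substring of the text. The function name `best_substring_distance` is hypothetical.

```python
def best_substring_distance(pattern: str, text: str) -> int:
    """Minimum edit distance between `pattern` and any substring of `text`."""
    # Row 0 is all zeros: an empty pattern matches at every position at no cost,
    # which is what lets the match start anywhere in the text.
    prev = [0] * (len(text) + 1)
    for i, pc in enumerate(pattern, start=1):
        curr = [i]  # i pattern characters against empty text: i deletions
        for j, tc in enumerate(text, start=1):
            cost = 0 if pc == tc else 1
            curr.append(min(prev[j] + 1,          # skip a pattern character
                            curr[j - 1] + 1,      # skip a text character
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    # The match may also end anywhere, so take the best value in the last row.
    return min(prev)


entry = "Levi Watkins Learning Center - Alabama State University"
print(best_substring_distance("Alabama State", entry))  # exact substring
print(best_substring_distance("Alabma State", entry))   # one typo
```

The extra words in the database entry no longer count against the input, which is exactly the problem with case (1). The comparison is case-sensitive as written; lowercasing both sides first is probably what you want in practice.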

If you have to use Levenshtein or something similar, you probably want to use it to compare words to words, and then generate a score based on the number of matching words and the quality of the matches.
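One possible shape for such a word-level score, as a sketch only: split both strings on whitespace, pair each input word with its most similar database word, and average the similarities of the pairs that clear a cutoff. The function names, the 0.8 cutoff and the averaging rule are all assumptions of mine that you would need to tune.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insert, delete and substitute each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def word_similarity(a: str, b: str) -> float:
    """1.0 for identical words, 0.0 for words with nothing in common."""
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def word_match_score(query: str, candidate: str, threshold: float = 0.8) -> float:
    """Average best-match similarity over the query words; weak matches count as 0."""
    q_words = query.lower().split()
    c_words = candidate.lower().split()
    if not q_words or not c_words:
        return 0.0
    total = 0.0
    for qw in q_words:
        best = max(word_similarity(qw, cw) for cw in c_words)
        if best >= threshold:  # only count words that matched well enough
            total += best
    return total / len(q_words)


entry = "Levi Watkins Learning Center - Alabama State University"
print(word_match_score("alabama state", entry))
print(word_match_score("alabma state", entry))
print(word_match_score("alabama state", "ETH Library"))
```

Because the score is averaged over the input words only, unmatched words in a long database entry do not drag it down.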
