Alternative to Levenshtein and Trigram

春和景丽 2021-02-07 09:48

Say I have the following two strings in my database:

(1) 'Levi Watkins Learning Center - Alabama State University'
(2) 'ETH Library'

My sof

6 Answers
  •  广开言路
    2021-02-07 10:18

    First, your distance score needs to be adjusted based on the length of the database entry and/or input. A distance of 5 against an expression of 10 characters is much worse than a distance of 5 against an expression of 100 characters.
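One way to make that adjustment is to divide the raw distance by the length of the longer string, so every score falls between 0 and 1. The sketch below is only an illustration: the function names are made up for this example, and dividing by the longer length is one of several reasonable normalizations.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain edit distance, keeping only two rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string's length (an assumed choice)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # two empty strings are identical
    return levenshtein(a, b) / longest
```

With this scaling, a distance of 5 on a 10-character string scores 0.5, while a distance of 5 on a 100-character string scores 0.05.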

    But the main problem with your approach is that plain Levenshtein is not a substring matching algorithm. It compares all of one string with all of another string. Your big distance in case (1) is due to the large number of words in the database expression that are not in the input expression.

    To get around that, you are better off using an algorithm that can match substrings, such as Fuzzy Bitap or Smith–Waterman.
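To show what substring matching buys you, here is a minimal sketch. It is neither Bitap nor Smith–Waterman: it uses a simpler relative, the Levenshtein table with a zero-cost first row (often attributed to Sellers), which lets the match start and end anywhere in the longer text. The function name is hypothetical.

```python
def best_substring_distance(pattern: str, text: str) -> int:
    """Smallest edit distance between pattern and any substring of text."""
    # First row is all zeros: a match may begin at any position in text for free.
    previous = [0] * (len(text) + 1)
    for i, cp in enumerate(pattern, start=1):
        current = [i]  # i pattern characters aligned against nothing
        for j, ct in enumerate(text, start=1):
            cost = 0 if cp == ct else 1
            current.append(min(previous[j] + 1,          # skip a pattern character
                               current[j - 1] + 1,       # skip a text character
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    # A match may also end anywhere, so take the minimum over the last row.
    return min(previous)
```

With this measure, an input such as "alabama state" is at distance 0 from entry (1), because the extra words around the match no longer count against it.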

    If you have to use Levenshtein or something similar, you probably want to compare words to words and then generate a score based on the number of matching words and the quality of those matches.
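A rough sketch of that word-to-word idea follows. It makes several assumptions that are not from the answer itself: words are split on whitespace, `difflib.SequenceMatcher.ratio()` from the standard library stands in for a normalized Levenshtein similarity, and the 0.8 threshold and the averaging scheme are arbitrary choices for illustration.

```python
from difflib import SequenceMatcher


def word_match_score(query: str, candidate: str, threshold: float = 0.8) -> float:
    """Score in [0, 1]: how well the query's words are covered by the candidate's words."""
    query_words = query.lower().split()
    candidate_words = candidate.lower().split()
    if not query_words or not candidate_words:
        return 0.0
    total = 0.0
    for qw in query_words:
        # Quality of the best candidate word for this query word.
        best = max(SequenceMatcher(None, qw, cw).ratio() for cw in candidate_words)
        if best >= threshold:  # only sufficiently close words count as matches
            total += best
    # Unmatched query words contribute 0, so both the number of matches
    # and their quality affect the final score.
    return total / len(query_words)
```

Here a query of "Alabama State" scores 1.0 against entry (1) regardless of how many other words the entry contains, and a typo such as "Alabma" lowers the score slightly instead of discarding the match.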
