Alternative to Levenshtein and Trigram

春和景丽 2021-02-07 09:48

Say I have the following two strings in my database:

(1) 'Levi Watkins Learning Center - Alabama State University'
(2) 'ETH Library'

My sof

6 Answers
  •  悲哀的现实
    2021-02-07 10:25

    You should change your approach:

    Levenshtein distance measures similarity in whatever units you choose; the units can be characters or words.

    Conceptually, you are treating Alabama and University as two units (words), and you want the distance between them. At the word level, that distance is the number of words between Alabama and University, which is 1.

    However, you are applying a Levenshtein implementation that works on the characters within a word. That implementation only works for matching single words, not sentences.
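
    A minimal Python sketch of the difference between the two units. The search term "Alabama University" is an assumption inferred from the words mentioned above (the question text is cut off), and `levenshtein` is a hand-written helper, not a library call:

```python
from typing import Sequence


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Edit distance between two sequences; the items are the 'units'."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute, or keep if equal
        prev = cur
    return prev[-1]


query = "Alabama University"  # assumed search term
title_1 = "Levi Watkins Learning Center - Alabama State University"
title_2 = "ETH Library"

# Units = characters: the long title is penalised for every extra character.
print(levenshtein(query, title_1))  # 37
print(levenshtein(query, title_2))  # at most 18, so (2) wrongly ranks first

# Units = words: only the extra word "State" counts.
print(levenshtein(query.split(), "Alabama State University".split()))  # 1
```

    Passing a string compares characters, while passing a list of words compares whole words, so the same function covers both cases.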

    It is better to implement your own Levenshtein algorithm (using a BK-tree) that works on words at the top level and then, within each word-level match, compares the individual words again with character-level Levenshtein.

    With that algorithm, your result for (1) should be a match with distance 1, and (2) should be no match.
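
    One possible reading of this two-level idea, as a rough Python sketch rather than a definitive implementation. Several things here are assumptions: the search term "Alabama University", the helper names, the `max_typos` tolerance, and the choice to let a match start at any word of the title (an approximate-substring variant of Levenshtein) so that leading words such as "Levi Watkins" cost nothing. The BK-tree index is left out; the sketch only scores one title at a time.

```python
def char_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance between two words."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def words_match(a: str, b: str, max_typos: int = 1) -> bool:
    """Inner level: two words match if they differ by at most max_typos edits."""
    return char_distance(a.lower(), b.lower()) <= max_typos


def word_distance(query: str, title: str, max_typos: int = 1) -> int:
    """Outer level: smallest word-level edit distance between the query and
    any contiguous run of words in the title."""
    q, t = query.split(), title.split()
    prev = [0] * (len(t) + 1)  # row of zeros: a match may start at any word
    for i, qw in enumerate(q, 1):
        cur = [i]
        for j, tw in enumerate(t, 1):
            cost = 0 if words_match(qw, tw, max_typos) else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return min(prev)  # a match may also end at any word


def is_match(query: str, title: str) -> bool:
    """Assumed rule: a distance equal to the query length means nothing matched."""
    return word_distance(query, title) < len(query.split())


query = "Alabama University"  # assumed search term
title_1 = "Levi Watkins Learning Center - Alabama State University"
title_2 = "ETH Library"

print(word_distance(query, title_1), is_match(query, title_1))  # 1 True
print(word_distance(query, title_2), is_match(query, title_2))  # 2 False
```

    Because the inner comparison tolerates one edit per word, a misspelled query such as "Alabma Univrsity" still scores 1 against (1). The fixed tolerance is too loose for very short words, so a real implementation would probably scale it with word length.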
