Compare similarity algorithms

前端 未结 2 1459
面向向阳花
面向向阳花 2021-01-30 01:55

I want to use string similarity functions to find corrupted data in my database.

I came upon several of them:

  • Jaro,
  • Jaro-Winkler,
  • Leve
2条回答
  •  星月不相逢
    2021-01-30 02:28

    String similarity helps in a lot of different ways. For example

    • google's did you mean results are calculated using string similarity.
    • string similarity is used to correct OCR errors.
    • string similarity is used to correct keyboard entering errors.
    • string similarity is used to find most matching sequence of two DNAs in bioinformatics.

    But as one size does not fit all. Every string similarity algorithm is designed for a specific usage though most of them are similar. For example Levenshtein_distance is about how many char you change to make two strings equal.

    kitten → sitten
    

    Here distance is 1 character change. You may give different weights to deletion, addition and substitution. For example OCR errors and keyboard errors give less weight for some changes. OCR ( some chars are very similar to others ), keyboard some chars are very near to each other. Bioinformatic string similarity allows a lot of insertion.

    Your second example of "Jaro–Winkler distance metric is designed and best suited for short strings such as person names"

    Therefore you should keep in your mind about your problem.

    I want to use string similarity functions to find corrupted data in my database.

    How your data is corrupted? Is it a user error , similar to keyboard input error? Or is it similar to OCR errors? Or something else entirely?

提交回复
热议问题