I want to use string similarity functions to find corrupted data in my database.
I came upon several of them:
String similarity helps in a lot of different ways. For example
But as one size does not fit all. Every string similarity algorithm is designed for a specific usage though most of them are similar. For example Levenshtein_distance is about how many char you change to make two strings equal.
kitten → sitten
Here distance is 1 character change. You may give different weights to deletion, addition and substitution. For example OCR errors and keyboard errors give less weight for some changes. OCR ( some chars are very similar to others ), keyboard some chars are very near to each other. Bioinformatic string similarity allows a lot of insertion.
Your second example of "Jaro–Winkler distance metric is designed and best suited for short strings such as person names"
Therefore you should keep in your mind about your problem.
I want to use string similarity functions to find corrupted data in my database.
How your data is corrupted? Is it a user error , similar to keyboard input error? Or is it similar to OCR errors? Or something else entirely?