Compare similarity algorithms


面向向阳花 2021-01-30 01:55

I want to use string similarity functions to find corrupted data in my database.

I came upon several of them:

Jaro,
Jaro-Winkler,
Leve

2 answers

星月不相逢 (original poster)

2021-01-30 02:28
String similarity is useful in many different ways. For example:
- Google's "Did you mean" suggestions are calculated using string similarity.
- String similarity is used to correct OCR errors.
- String similarity is used to correct keyboard entry errors.
- String similarity is used to find the best-matching sequences between two DNA strands in bioinformatics.
But one size does not fit all. Each string similarity algorithm is designed for a specific use, although most of them are similar. For example, Levenshtein distance is the minimum number of single-character edits (insertions, deletions, or substitutions) needed to make two strings equal.
```
kitten → sitten
```
Here the distance is 1: a single character substitution. You can give different weights to deletion, insertion, and substitution. For example, OCR and keyboard error correction give less weight to some changes: in OCR, some characters look very similar to others, and on a keyboard, some keys are very close to each other. String similarity in bioinformatics allows for many insertions.
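To make the weighting idea concrete, here is a minimal sketch of a weighted Levenshtein distance in Python. The function names, the cost parameters, and the tiny `ADJACENT_KEYS` table are all made up for illustration; a real keyboard or OCR cost table would have to come from your own data.

```python
def weighted_levenshtein(a, b, ins_cost=1.0, del_cost=1.0, sub_cost=None):
    """Edit distance with configurable insertion, deletion and substitution costs."""
    if sub_cost is None:
        # Default: every substitution costs 1, which gives the classic distance.
        sub_cost = lambda x, y: 1.0
    # prev[j] holds the cost of turning the processed prefix of a into b[:j].
    prev = [j * ins_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * del_cost]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] if ca == cb else prev[j - 1] + sub_cost(ca, cb)
            curr.append(min(prev[j] + del_cost,      # delete ca
                            curr[j - 1] + ins_cost,  # insert cb
                            diag))                   # match or substitute
        prev = curr
    return prev[-1]


# Hypothetical, incomplete table of keys that sit next to each other.
ADJACENT_KEYS = {frozenset("as"), frozenset("sd"), frozenset("io")}


def keyboard_sub_cost(x, y):
    """Assumed cost model: hitting a neighbouring key is a 'cheaper' mistake."""
    return 0.5 if frozenset((x, y)) in ADJACENT_KEYS else 1.0


print(weighted_levenshtein("kitten", "sitten"))                        # 1.0
print(weighted_levenshtein("cat", "cst", sub_cost=keyboard_sub_cost))  # 0.5
```

With the default costs this behaves like plain Levenshtein distance; passing a different `sub_cost` is one possible way to model keyboard or OCR confusions.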

As for your second example, the "Jaro–Winkler distance metric is designed and best suited for short strings such as person names".
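For comparison, here is a rough sketch of Jaro–Winkler similarity in Python. It assumes the commonly used defaults (prefix scale 0.1, common prefix capped at 4 characters); the function names are my own, and in practice a tested library is probably a better choice than this.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]; 1.0 means the strings are identical."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as matching if they are within this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order are transpositions.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(x != y for x, y in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted for strings that share a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for x, y in zip(s1, s2):
        if x != y or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```

The prefix boost is what makes it a good fit for names: typos near the end of a short string are penalized less than differences at the start.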

Therefore, you should keep your specific problem in mind when choosing an algorithm.

You wrote: "I want to use string similarity functions to find corrupted data in my database."

How is your data corrupted? Is it user error, similar to keyboard input errors? Is it similar to OCR errors? Or is it something else entirely?