Question
I am using Jaro-Winkler fuzzy matching to match names.
I am trying to determine a cut-off for the similarity score. If two names are too different, I want to exclude them for manual review.
Anything below .4 seemed to be entirely different names, and pairs in the .4 range seemed fairly similar.
But then I came across strange exceptions: some pairs in that range are entirely different names, while others are only one or two letters off (see the examples below).
Can someone explain why there is such wide variation among pairs within the same score range?
Estrella ANNELISE 0.42
Arienna IREANNA 0.43
Tayvia I TAYVIA 0.43
Amanda IZABEL 0.44
Hunter JOSHUA 0.44
Ryder CHARLES 0.45
Luis ELIZABETH 0.45
Sebastian JOSE 0.45
Christopher CHISTOPHE 0.46
Genayunique GENAY-UNI 0.46
Andreeaonn ADREEAONN 0.46
Chistopher CHRISTOPH 0.46
Dazharicon DAZHARION 0.46
Jennavecia JENNACVEC 0.46
Valentiria VALENTINA 0.46
Abel SAMMUEL 0.46
Dezarea MarieDEZAREA 0.47
Alexander ALEXZANDE 0.47
Answer 1:
The Jaro-Winkler distance formula is biased towards strings that share a common beginning, for example Valentina and Valentiria.
It also has some not-so-intuitive "rules" (see Wikipedia).
You should probably first determine what kind of dissimilarity you expect, and then look for a suitable distance formula. For example, in writing, "angleworm" for "angelworm" is a very likely error, so the distance between the two strings ought to be low, whereas confusing "there" with "three" is less likely, and "ether" even less so. With longer anagrams, the Jaro distance might be exactly the same, and even the Winkler correction might not kick in.
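To make the prefix bias concrete, here is a minimal pure-Python sketch of the formula as Wikipedia describes it. The scaling factor p = 0.1, the prefix cap of 4, and the absence of a boost threshold are assumptions; real libraries may differ. The sketch also compares characters case-sensitively, which is an assumption about how the scores in the question might have been produced, not something the question confirms.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; comparison is case-sensitive."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as matching if they are at most `window` apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted by the length of the common prefix."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))          # 0.9611
print(round(jaro_winkler("Valentiria", "VALENTINA"), 2))   # 0.46
print(round(jaro_winkler("valentiria", "valentina"), 2))   # 0.94
```

With a case-sensitive comparison like this one, "Valentiria" and "VALENTINA" share only the leading "V", so the score lands at 0.46, the same bucket as pairs that merely have one letter in common. Lower-casing both sides first moves the pair to about 0.94. If the scores in the question came from a case-sensitive function, normalizing case before matching may be worth trying.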
As you can read on this page (emphasis mine):
Beyond the optimization for empty strings and those which are exactly the same, you can see here that I weight the first character even more heavily. This is due to my data being very initial heavy.
To compensate for the frequent use of middle initials I count Jaro-Winkler distance as 80% of the score, while the remaining 20% is fully based on the first character matching. The value of p here was determined by the results of heavy experimentation and hair pulling. Before making this extension initials would frequently align incorrectly.
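The 80/20 weighting described in that quote can be sketched as follows. The function name `blended_score` and its parameters are hypothetical, and the Jaro-Winkler score is passed in rather than computed, so any implementation can supply it. Whether the quoted author compared first characters case-sensitively is not stated; this sketch compares them exactly as given.

```python
def blended_score(jw_score: float, a: str, b: str, weight: float = 0.8) -> float:
    """Blend a Jaro-Winkler score with a first-character check.

    `weight` is the share given to the Jaro-Winkler score (0.8 in the quote);
    the remainder is awarded only when both strings start with the same character.
    """
    first_match = 1.0 if a and b and a[0] == b[0] else 0.0
    return weight * jw_score + (1.0 - weight) * first_match


# Same base score, different first letters: the blend separates the two pairs.
print(blended_score(0.5, "John", "Jane"))
print(blended_score(0.5, "John", "Mary"))
```

The effect is that two pairs with an identical Jaro-Winkler score can still end up on different sides of a cut-off, depending on whether their initials agree.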
Answer 2:
I found that Levenshtein distance was more useful for the specific matching problems on names.
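For comparison, a minimal sketch of Levenshtein distance (the number of single-character insertions, deletions, and substitutions needed to turn one string into the other). Lower-casing the names before comparing is an assumption made for this example, not part of the algorithm.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance using a two-row dynamic programming table."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))                           # 3
print(levenshtein("Christopher".lower(), "CHISTOPHE".lower()))    # 2
```

A cut-off can then be expressed directly in edits (for instance, send anything more than two edits apart to manual review), which is often easier to reason about than a similarity score.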
Source: https://stackoverflow.com/questions/48406993/jaro-winkler-function-why-is-the-same-score-matching-very-similar-and-very-diff