Levenshtein distance-based methods vs Soundex

Submitted by 北城以北 on 2019-11-27 12:53:35

Soundex is rather primitive: it was originally designed to be calculated by hand. It reduces each name to a short key, and two names are treated as a match when their keys are equal.

Soundex works well with western names, as it was originally developed for US census data. It's intended for phonetic comparison.

Levenshtein distance compares two strings and returns the minimum number of single-character insertions, deletions, or substitutions needed to turn one into the other. It's looking for missing, extra, or substituted letters.

Basically Soundex is better for finding that "Schmidt" and "Smith" might be the same surname.

Levenshtein distance is better for spotting that the user has mistyped "Levnshtein" ;-)
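To make that concrete, here is a minimal sketch of the textbook dynamic-programming form of Levenshtein distance. The function name `levenshtein` is just my choice for illustration, and it compares characters case-sensitively with no normalization, so treat it as a starting point rather than a tuned implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    # previous[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        previous = current
    return previous[-1]


print(levenshtein("Levnshtein", "Levenshtein"))  # 1: a single missing "e"
```

A small distance relative to the word length is what marks the input as a likely typo of the known word.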

erickson

I would suggest using Metaphone, not Soundex. As noted, Soundex was developed in the early 20th century (it was patented in 1918) for American names. Metaphone will give you some results when checking the work of poor spellers who are "sounding it out" and spelling phonetically.

Edit distance is good at catching typos such as repeated letters, transposed letters, or hitting the wrong key.
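One caveat on transposed letters: plain Levenshtein distance counts a swap of two adjacent letters as two edits. A common variant, often called optimal string alignment (a restricted form of Damerau-Levenshtein), counts it as one. The sketch below assumes that variant; the name `osa_distance` is mine, not from any particular library.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance where swapping two adjacent characters costs 1
    (optimal string alignment variant)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition: "eh" in a lines up with "he" in b.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("teh", "the"))         # 1: transposed letters
print(osa_distance("lettter", "letter"))  # 1: repeated letter
print(osa_distance("cst", "cat"))         # 1: wrong key
```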

Consider the application to decide which will fit your users best, or use both together, with Metaphone complementing the suggestions produced by Levenshtein.

With regard to the original question, I've used n-grams successfully in information retrieval applications.
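For what it's worth, a rough sketch of the n-gram idea: split each string into overlapping character n-grams and measure how much the two sets overlap. The choice of bigrams, the "#" padding, and Jaccard overlap as the score are my assumptions here; real retrieval systems differ on all three.

```python
def char_ngrams(text: str, n: int = 2) -> set:
    """Set of overlapping character n-grams, padded with '#' so that the
    first and last letters also form n-grams of their own."""
    pad = "#" * (n - 1)
    padded = pad + text.lower() + pad
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 2) -> float:
    """Jaccard overlap of the two n-gram sets: 1.0 means identical sets,
    0.0 means no n-gram in common."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


print(sorted(char_ngrams("ab")))  # ['#a', 'ab', 'b#']
print(ngram_similarity("Levenshtein", "Levnshtein"))  # close to 1
print(ngram_similarity("Levenshtein", "Soundex"))     # 0.0, nothing shared
```

Because the n-gram sets can be indexed ahead of time, this kind of score is cheap to use for pulling candidates before running a more expensive comparison.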

I agree with you on Daitch-Mokotoff. Soundex is biased because the original US census takers wanted 'Americanized' names.

Maybe an example of the difference would help:

Soundex puts additional weight on the start of a word: it keeps the first letter as-is and then encodes only the next three consonant sounds, so the key is always four characters long. So while "Schmidt" and "Smith" will match, "Smith" and "Wmith" won't.
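Here is a small sketch of the standard American Soundex rules to show where that behaviour comes from. The names `SOUNDEX_CODES` and `soundex` are mine, and the sketch assumes plain ASCII names (any other letter is simply treated like a vowel).

```python
# Letter groups that share a Soundex digit; vowels, H, W and Y get no digit.
SOUNDEX_CODES = {}
for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                     ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in group:
        SOUNDEX_CODES[letter] = digit


def soundex(name: str) -> str:
    """Four-character Soundex key: first letter plus up to three digits."""
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    key = letters[0]  # the first letter is kept literally
    previous = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != previous:
            key += code
        previous = code  # a vowel resets this, so a repeated code counts again
    return (key + "000")[:4]  # pad with zeros or truncate to four characters


print(soundex("Schmidt"), soundex("Smith"))  # S530 S530
print(soundex("Wmith"))                      # W530
```

The two surnames collapse to the same key, while changing only the first letter produces a different key even though everything after it is identical.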

Levenshtein's algorithm would be better for finding typos: one or two missing or replaced letters still give a small distance (that is, a high similarity), while the phonetic impact of those letters matters less.

I don't think either is better, and I'd consider both a distance algorithm and a phonetic one for helping users correct typed input.

ColinYounger

@Keith:

As I posted on the other question, Daitch-Mokotoff is better for us Europeans (and, I'd argue, for the US too).

I've also read the Wiki on Levenshtein. But I don't see why (in real life) it's better for the user than Soundex.
