Phonetic search for Indian languages

青春惊慌失措 2021-02-18 16:18

I want to compare strings phonetically in my Android app. The special case here is that I want to compare Indian-language words written in English (Latin) letters. For example, I want to check

1 answer
  •  情歌与酒
    2021-02-18 17:00

    As I understand it, you want to take words written in English, decompose them phonetically, and then group together words that are spelled differently but have the same phonetic representation.

    For this, SoundEx is a 90% solution, provided that the people spelling the words in English actually use the correct consonants when they transliterate the words from Tamil to English.

    When the first letter of a word is a vowel, you should be able to just drop the first character (the letter) from the SoundEx representation and use the remaining digits as your encoding.
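
    A minimal sketch of that idea, in Python for brevity. The helper name `comparison_key`, the sample words, and the assumption that the SoundEx code has already been computed by whatever implementation you use are all illustrative, not part of the question:

```python
VOWELS = set("aeiou")


def comparison_key(word: str, code: str) -> str:
    """Return a key for phonetic comparison from a word and its SoundEx code.

    When the word starts with a vowel, the leading letter of the code is
    dropped so that only the consonant digits take part in the comparison.
    """
    if word and word[0].lower() in VOWELS:
        return code[1:]
    return code


# Hypothetical words with hypothetical, precomputed SoundEx codes.
print(comparison_key("Adi", "A300"))   # 300
print(comparison_key("Idi", "I300"))   # 300
print(comparison_key("Dada", "D300"))  # D300 (consonant-initial, kept whole)
```

    Words that differ only in their leading vowel then share the key 300, while words starting with a consonant keep their full code.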

    The reason is that SoundEx ( https://en.wikipedia.org/wiki/Soundex ) encodes only the consonants in the words it is given. It throws away all the vowels plus h, w, and y, unless the letter is the first one in the word, in which case it is kept as-is. That explains why your values are all slightly different, but only in the first letter's encoding.

    As for your zeros: SoundEx encodings are by definition 1 letter followed by 3 digits (1 through 6 for consonants). You have only 1 consonant in each word (d or t), and SoundEx maps both of them to the number 3. Since there are no more consonants, it pads the code with 2 zeros to reach the required length. Thus you get Letter300.
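
    The encoding and padding described above can be sketched as follows, in Python for brevity. This is a hand-written approximation of American Soundex, not the code of any particular library; it assumes ASCII input, treats h and w as transparent, and treats vowels and y as separators. The sample words at the end are hypothetical:

```python
# Consonant groups of American Soundex; vowels, h, w and y carry no digit.
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex(word: str) -> str:
    """Return the 4-character code (letter + 3 digits) of an ASCII word."""
    letters = [ch for ch in word.lower() if "a" <= ch <= "z"]
    if not letters:
        return ""
    code = letters[0].upper()             # the first letter is always kept
    previous = CODES.get(letters[0], "")  # digit of the last consonant seen
    for ch in letters[1:]:
        digit = CODES.get(ch, "")
        if digit:
            if digit != previous:         # adjacent same-digit letters collapse
                code += digit
            previous = digit
        elif ch not in "hw":
            previous = ""                 # a vowel or y lets a digit repeat
        if len(code) == 4:
            break
    return code.ljust(4, "0")             # pad with zeros up to 3 digits


print(soundex("Robert"))  # R163
print(soundex("Rupert"))  # R163 (different spelling, same code)
print(soundex("Adi"))     # A300
print(soundex("Uta"))     # U300
```

    The last two words contain a single consonant (d or t), so each code is the first letter, a 3, and two padding zeros.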

    If you are going to continue to use SoundEx for your app, you should bear in mind that it can only give you 26*6*6*6 = 5616 unique full-length encodings based on its Letter Number(1-6) Number(1-6) Number(1-6) scheme (zero-padded codes such as Letter300 add somewhat more, but the total stays in the low thousands). This means that the phonetic encodings will not be unique, and some words that are radically different will have SoundEx encodings that collide.
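
    For reference, the arithmetic can be checked in a few lines of Python. The second figure is an upper bound that assumes every digit combination is reachable; the variable names are illustrative only:

```python
LETTERS = 26  # possible leading letters
DIGITS = 6    # consonant groups 1 through 6

# Codes with three real digits: Letter Number Number Number.
full_codes = LETTERS * DIGITS ** 3

# Zero-padded codes carry 0, 1 or 2 real digits before the padding.
padded_codes = LETTERS * sum(DIGITS ** k for k in range(3))

print(full_codes)                 # 5616
print(full_codes + padded_codes)  # 6734
```

    Either way, the code space is tiny compared with the vocabulary of any language, so collisions are unavoidable.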
