How to detect the dominant language of a text word?

Submitted by 百般思念 on 2021-02-11 12:17:38

Question


This works well for a full string, but it does not work for me on a single word. I am building a search feature: as soon as the user has typed 3 characters, I want to check which language they are typing in. I understand that detection may fail for a word like "detec0t", but I expect it to work for the word "Islam".

import Foundation

let tagger = NSLinguisticTagger(tagSchemes: [.tokenType, .language, .lexicalClass, .nameType, .lemma], options: 0)

func determineLanguage(for text: String) {
    tagger.string = text
    // dominantLanguage is optional, so fall back to "und" (undetermined) instead of force unwrapping.
    let language = tagger.dominantLanguage ?? "und"
    print("The language is \(language)")
}


// Test cases
determineLanguage(for: "I love Islam") // en -pass
determineLanguage(for: "আমি ইসলাম ভালোবাসি") // bn -pass
determineLanguage(for: "أنا أحب الإسلام") // ar -pass
determineLanguage(for: "Islam") // und - failed

Result:

The language is en
The language is bn
The language is ar
The language is und

What am I missing that causes the "und" (unknown language) result for a single word?


Answer 1:


Simply because the word exists in too many languages, and it would be unrealistic to guess the language from a single word. Context always helps.

For example:

import NaturalLanguage

let recognizer = NLLanguageRecognizer()
recognizer.processString("Islam")
print(recognizer.dominantLanguage!.rawValue)  //Force unwrapping for brevity

This prints tr, which stands for Turkish. It's an educated guess.

If you want the other languages that were also possible, you could use languageHypotheses(withMaximum:):

let hypotheses = recognizer.languageHypotheses(withMaximum: 10)

for (lang, confidence) in hypotheses.sorted(by: { $0.value > $1.value }) {
    print(lang.rawValue, confidence)
}

This prints:

tr 0.2332388460636139   //Turkish
hr 0.1371040642261505   //Croatian
en 0.12280254065990448  //English
pt 0.08051242679357529
de 0.06824589520692825
nl 0.05405258387327194
nb 0.050924140959978104
it 0.037797268480062485
pl 0.03097432479262352
hu 0.0288708433508873

You can now define a minimum confidence threshold and accept the top result only when it meets that threshold.
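The threshold idea could be sketched roughly as follows. This is only an illustration, not part of the original answer: the helper names confidentLanguage(in:threshold:) and detectLanguage(of:threshold:) are hypothetical, and the default threshold of 0.5 is an arbitrary assumption that you would need to tune for your own data. The thresholding logic is kept separate from the framework call so it can be exercised without NaturalLanguage.

```swift
import Foundation
#if canImport(NaturalLanguage)
import NaturalLanguage
#endif

/// Hypothetical helper (not part of NaturalLanguage): returns the
/// top-scoring language code only if its confidence reaches `threshold`.
/// `hypotheses` maps language codes to confidences, e.g. ["tr": 0.23].
func confidentLanguage(in hypotheses: [String: Double], threshold: Double) -> String? {
    // Pick the entry with the highest confidence, if there is one.
    guard let best = hypotheses.max(by: { $0.value < $1.value }),
          best.value >= threshold else {
        return nil  // No hypotheses, or the best guess is too uncertain.
    }
    return best.key
}

#if canImport(NaturalLanguage)
/// Hypothetical wrapper: runs NLLanguageRecognizer on `text` and applies
/// the threshold above. The 0.5 default is an assumption, not a recommendation.
func detectLanguage(of text: String, threshold: Double = 0.5) -> String? {
    let recognizer = NLLanguageRecognizer()
    recognizer.processString(text)
    let raw = recognizer.languageHypotheses(withMaximum: 5)
    // Convert [NLLanguage: Double] to [String: Double] for the helper.
    let byCode = Dictionary(uniqueKeysWithValues: raw.map { ($0.key.rawValue, $0.value) })
    return confidentLanguage(in: byCode, threshold: threshold)
}
#endif
```

With the confidences listed above, the best guess for "Islam" is tr at about 0.23, so a threshold of 0.5 would reject it and return nil. In a search-as-you-type setting, the caller could treat nil as "wait for more input" rather than committing to a language.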


Language codes can be found here



Source: https://stackoverflow.com/questions/56300639/how-to-detect-the-dominant-language-of-a-text-word
