Language detection for very short text [closed]

為{幸葍}努か submitted on 2019-12-02 18:17:18
Fred Foo

Language detection for very short texts is a topic of ongoing research, so no conclusive answer can be given. An algorithm for Twitter data can be found in Carter, Tsagkias and Weerkamp (2011); see also the references there.

Yes, this is an active research topic, and some progress has been made.

For example, the author of "language-detection" at http://code.google.com/p/language-detection/ has created new profiles for short messages. Currently, it supports 17 languages.

I compared it with Bing Language Detector on a collection of about 500 tweets, mostly in English and Spanish. The accuracy was as follows:

   Bing = 71.97%
   Language-Detection Tool with new profiles = 89.75%
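
For anyone who wants to repeat this kind of comparison, accuracy here just means the share of tweets whose predicted language matches a hand-assigned label. A minimal sketch, assuming you already have two parallel lists (the function name and the sample labels below are made up for illustration, they are not the data behind the figures above):

```python
def accuracy(predicted, gold):
    """Return the percentage of predictions that match the gold labels."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("need at least one labelled example")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * correct / len(gold)


# Hypothetical labels for four tweets; three of the four predictions agree.
gold_labels = ["en", "es", "es", "en"]
detector_output = ["en", "es", "en", "en"]
print(f"{accuracy(detector_output, gold_labels):.2f}%")
```

Running each detector over the same labelled tweets and passing its output to this function gives directly comparable percentages.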

For more information, see the author's blog post: http://shuyo.wordpress.com/2011/11/28/language-detection-supported-17-language-profiles-for-short-messages/
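
To give a feel for how profile-based detectors of this kind tend to work, here is a toy sketch that scores character trigrams with add-one smoothing. It is not the code of the language-detection library and does not use its profiles; the training sentences, function names, and the choice of trigrams are all assumptions made for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Split text into overlapping character n-grams, padded with spaces."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train_profiles(samples, n=3):
    """Build one n-gram frequency profile (a Counter) per language."""
    profiles = {}
    for lang, texts in samples.items():
        counts = Counter()
        for text in texts:
            counts.update(char_ngrams(text, n))
        profiles[lang] = counts
    return profiles


def detect(text, profiles, n=3):
    """Return the language whose profile gives the highest log-likelihood."""
    vocab_size = len(set().union(*profiles.values()))
    best_lang, best_score = None, -math.inf
    for lang, counts in profiles.items():
        total = sum(counts.values())
        # Add-one smoothing so unseen n-grams do not zero out the score.
        score = sum(
            math.log((counts[gram] + 1) / (total + vocab_size))
            for gram in char_ngrams(text, n)
        )
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang


# Tiny made-up training set; a real profile needs far more text.
SAMPLES = {
    "en": [
        "the quick brown fox jumps over the lazy dog",
        "where is the train station",
        "thank you very much for your help",
    ],
    "es": [
        "el rápido zorro marrón salta sobre el perro perezoso",
        "dónde está la estación de tren",
        "muchas gracias por tu ayuda",
    ],
}

profiles = train_profiles(SAMPLES)
print(detect("thank you", profiles))
print(detect("muchas gracias", profiles))
```

With so little training text the result is only reliable for inputs that resemble the samples. Profiles trained on short messages, as described in the blog post, are meant to address exactly that mismatch for tweets.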

Also omit scientific names, names of medicines, and the like. Your approach seems fine to me. I think Wikipedia is the best source for building a dictionary, since it contains standard language. If time permits, you can also use newspapers.
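
A rough sketch of what such a dictionary lookup might look like, assuming the approach is "count how many words of the text appear in each language's word list". The word lists below are tiny placeholders; in practice they would be built from Wikipedia or newspaper text as suggested above. Words found in no list (drug names, scientific names) simply contribute nothing to the score.

```python
import re

# Placeholder word lists, not real dictionaries.
WORDLISTS = {
    "en": {"the", "is", "and", "where", "you", "thank", "of", "to"},
    "es": {"el", "la", "es", "y", "donde", "gracias", "de", "que"},
}


def tokenize(text):
    """Lowercase the text and keep only runs of letters."""
    return re.findall(r"[^\W\d_]+", text.lower())


def detect_by_dictionary(text, wordlists=WORDLISTS):
    """Return the language with the most dictionary hits, or None if no hits."""
    tokens = tokenize(text)
    hits = {
        lang: sum(token in words for token in tokens)
        for lang, words in wordlists.items()
    }
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else None


print(detect_by_dictionary("Where is the Aspirin?"))
print(detect_by_dictionary("gracias por la ayuda"))
print(detect_by_dictionary("Ibuprofen"))
```

Returning None when nothing matches keeps the function from guessing on texts made up entirely of names, which is the case the advice about omitting them is aimed at.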
