How to detect language of text?

Submitted by 孤人 on 2019-11-28 21:29:41

You can figure out whether the characters are from the Arabic, Chinese, or Japanese sections of the Unicode map.

If you look at the list on Wikipedia, you'll see that each of those languages has many sections of the map. But you're not doing translation, so you don't need to worry about every last glyph.

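For example, your Chinese text begins (in hex) 0x8FD9 0x662F 0x4E00, and those are all in the "CJK Unified Ideographs" section, which covers Chinese characters. (Japanese kanji live in the same block, so the presence of kana is the more reliable sign of Japanese.) Here are a few ranges to get you started: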

Arabic (0600–06FF)

Japanese

  • Hiragana (3040–309F)
  • Katakana (30A0–30FF)
  • Kanbun (3190–319F)

Chinese

  • CJK Unified Ideographs (4E00–9FFF)

(I got the hex for your Chinese by using a Chinese to Unicode Converter.)
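
As a rough sketch of the range check described above, in Python: the table uses only the ranges listed in this answer, the names `SCRIPT_RANGES`, `script_of`, and `guess_script` are made up for illustration, and the rule that kana turns a "Chinese" guess into "Japanese" is my own assumption rather than part of the answer.

```python
# Unicode ranges taken from the list above (inclusive code points).
SCRIPT_RANGES = {
    "Arabic": [(0x0600, 0x06FF)],
    "Japanese": [(0x3040, 0x309F), (0x30A0, 0x30FF), (0x3190, 0x319F)],
    "Chinese": [(0x4E00, 0x9FFF)],
}


def script_of(ch):
    """Return the label whose range contains ch, or None if no range matches."""
    cp = ord(ch)
    for label, ranges in SCRIPT_RANGES.items():
        for lo, hi in ranges:
            if lo <= cp <= hi:
                return label
    return None


def guess_script(text):
    """Count characters per label and return the most frequent label, or None."""
    counts = {}
    for ch in text:
        label = script_of(ch)
        if label is not None:
            counts[label] = counts.get(label, 0) + 1
    if not counts:
        return None
    # Assumption (not from the answer): Japanese text mixes kana with
    # ideographs from the CJK block, so if any kana is present the
    # ideographs are counted as Japanese too.
    if "Japanese" in counts and "Chinese" in counts:
        counts["Japanese"] += counts.pop("Chinese")
    return max(counts, key=counts.get)


if __name__ == "__main__":
    # The three code points quoted in the answer: 0x8FD9 0x662F 0x4E00.
    print(guess_script("\u8fd9\u662f\u4e00"))
```

This only identifies the script, not the language: it cannot tell Arabic from Persian, for instance, and text in Latin letters returns `None`.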

You could use the Google Ajax API for detecting the language of a snippet of text.

Presumably the point of guessing the user's language is to display responses in that language. What about examining the browser's preferred-language settings instead? You can read them from the HTTP Accept-Language request header. See section 14.4 here.
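
To make the Accept-Language idea concrete, here is a minimal Python sketch that turns a header value into a list of language tags, most preferred first. The function name `parse_accept_language` is made up for illustration, and the parsing is deliberately simplified: it only understands a single `q=` parameter per entry and silently drops entries whose weight is malformed or zero.

```python
def parse_accept_language(header):
    """Return the language tags in an Accept-Language value, most preferred first."""
    prefs = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        tag, _, params = part.partition(";")
        q = 1.0  # default quality when no q parameter is given
        params = params.strip()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0  # malformed weight: the entry is dropped below
        prefs.append((tag.strip(), q))
    # list.sort() is stable, so tags with equal q keep their header order.
    prefs.sort(key=lambda p: p[1], reverse=True)
    return [tag for tag, q in prefs if q > 0]


if __name__ == "__main__":
    print(parse_accept_language("da, en-gb;q=0.8, en;q=0.7"))
```

A server would typically walk this list and pick the first tag it has a translation for, falling back to a default language when the header is missing or nothing matches.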

I'm exploring the same thing on the server side. So far I have found https://code.google.com/p/language-detection/. Hope this helps someone.

You could use https://detectlanguage.com/, which is a web service built around CLD2.
