Does data mining support languages other than English?

瘦欲@ submitted on 2019-12-12 02:15:07

Question


I am new to data mining. I would like to mine data that is not in English but in Japanese or Chinese.

Does data mining support these languages? If so, how can it be done? Pointers to tools and blogs would be welcome.

I would appreciate any help.


Answer 1:


The answer is, as usual: yes and no.

While there are no theoretical problems, there are some practical problems with Asian languages. A typical data mining pipeline for text consists of

  • stemming (running -> run)
  • removal of stop words (a, the, ...) and other words which do not help
  • enrichment steps, e.g., phrase detection
  • tokenization
  • transformation into a bag of words, which counts the frequency of each word ("Hello World, Hello Japan" -> Hello:2, World:1, Japan:1)
  • application of your favourite text mining techniques, such as LDA or SVMs
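
As a rough illustration of the tokenization, stop word and bag-of-words steps, here is a minimal Python sketch using only the standard library. The stop word list and the function names are made up for this example, and the regex tokenizer assumes a space-delimited language such as English.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list (illustration only).
STOP_WORDS = {"a", "an", "the", "and", "of"}

def tokenize(text):
    """Return runs of word characters; fine for space-delimited languages."""
    return re.findall(r"\w+", text)

def bag_of_words(text):
    """Count how often each non-stop-word token occurs in the text."""
    tokens = tokenize(text)
    kept = [t for t in tokens if t.lower() not in STOP_WORDS]
    return Counter(kept)

counts = bag_of_words("Hello World, Hello Japan")
print(dict(counts))  # {'Hello': 2, 'World': 1, 'Japan': 1}
```

The resulting counts are what a technique such as LDA or an SVM would typically take as input, usually after mapping each word to a column index.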

The first and fourth steps do pose a problem in some Asian languages. In European languages, especially English, tokenization is easy: a word starts after a space and ends before the next one. In some Asian languages you cannot split a sequence of characters into words without understanding the meaning of the sentence, and in some languages this is extremely hard. (Cf. Wikipedia on tokenization: "Tokenization is particularly difficult for languages written in scriptio continua which exhibit no word boundaries such as Ancient Greek, Chinese, or Thai.")
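
To make the difference concrete, the sketch below first shows that splitting on whitespace does nothing useful for a Chinese sentence, and then applies forward maximum matching, a simple greedy dictionary lookup. The vocabulary is a hypothetical toy list and the function name is invented for this example; this is not how production segmenters work, only a demonstration of why a dictionary or model is needed at all.

```python
# Hypothetical toy vocabulary; a real segmenter needs a large dictionary
# or a statistical model.
VOCABULARY = {"我", "喜欢", "数据", "挖掘", "数据挖掘"}

def forward_max_match(text, vocabulary, max_len=4):
    """Greedy left-to-right segmentation: always take the longest known word."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character if nothing matches.
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens

sentence = "我喜欢数据挖掘"  # "I like data mining"

print("I like data mining".split())  # ['I', 'like', 'data', 'mining']
print(sentence.split())              # ['我喜欢数据挖掘'] (one single "word")
print(forward_max_match(sentence, VOCABULARY))  # ['我', '喜欢', '数据挖掘']
```

Greedy matching picks the wrong split whenever a sentence is ambiguous, which is exactly the "understanding the meaning" problem described above. In practice people usually rely on dedicated segmenters, for example jieba for Chinese or MeCab for Japanese.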

Stemming might also pose a problem. For English it is extremely well understood; for other languages it depends.
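
For intuition, here is a deliberately crude suffix-stripping stemmer for English. It is a toy written for this answer, not the Porter algorithm or any library's implementation, and the suffix list is an assumption. It also shows why the idea does not carry over directly: Chinese words have no inflectional suffixes to strip.

```python
def crude_stem(word):
    """Strip one common English suffix; a toy, not a real stemmer."""
    w = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Only strip if at least three characters remain.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            w = w[: -len(suffix)]
            # Undo consonant doubling ("runn" -> "run"), but leave
            # endings such as "ll" or "ss" alone.
            if len(w) >= 2 and w[-1] == w[-2] and w[-1] not in "aeioulsz":
                w = w[:-1]
            break
    return w

print(crude_stem("running"))   # run
print(crude_stem("words"))     # word
print(crude_stem("数据挖掘"))  # 数据挖掘 (unchanged, nothing to strip)
```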

If you can solve these two problems, you can apply the typical text mining techniques to Asian languages as well.



Source: https://stackoverflow.com/questions/28187036/does-data-mining-support-other-languages-other-than-english
