Question
I am new to data mining. I would like to mine some text data, but the data is not in English; it is in Japanese or Chinese.
Does data mining support these languages? If so, how can it be done? Pointers to tools and blogs are welcome.
I would appreciate any help.
Answer 1:
The answer is as usual: Yes and no.
In theory there are no problems, but in practice there are some with Asian languages. A typical data mining pipeline for text consists of:
- stemming (running -> run)
- removal of stop words (a, the,...) and other words which do not help
- enrichment steps, e.g., phrase detection
- tokenization
- transformation into a bag of words, which counts the frequency of each word (Hello World, Hello Japan -> (Hello:2, World:1, Japan:1))
- application of your favourite text mining techniques, such as LDA or SVMs
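As a rough illustration of the stop-word and bag-of-words steps listed above, here is a minimal sketch in Python using only the standard library. The function name `bag_of_words` and its `stop_words` parameter are made up for this example; a real pipeline would normally rely on a text mining library rather than a regular expression.

```python
import re
from collections import Counter


def bag_of_words(text, stop_words=frozenset()):
    """Count how often each word occurs, skipping the given stop words."""
    # \w+ grabs runs of word characters; this only finds real words
    # when the language separates words with spaces or punctuation.
    tokens = re.findall(r"\w+", text)
    return Counter(t for t in tokens if t.lower() not in stop_words)


print(dict(bag_of_words("Hello World, Hello Japan")))
# {'Hello': 2, 'World': 1, 'Japan': 1}

print(dict(bag_of_words("the cat and the dog", stop_words={"the", "and"})))
# {'cat': 1, 'dog': 1}
```

Note that the regular expression does the tokenization here, which works for English only because words are separated by spaces and punctuation.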
The first and fourth steps do pose a problem in some Asian languages. In European languages, and English in particular, a word starts at a space and ends at a space. In some Asian languages you cannot tokenize a sequence of characters into words without understanding the meaning of the sentence, and in some languages this is extremely hard. (Cf. Wikipedia on tokenization: "Tokenization is particularly difficult for languages written in scriptio continua which exhibit no word boundaries such as Ancient Greek, Chinese, or Thai.")
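To make the tokenization problem concrete, the sketch below splits a Chinese sentence on whitespace and then with a toy forward maximum matching segmenter. This is only an illustration under strong assumptions: `TOY_DICTIONARY` is a hand-made word list for this one sentence, and `forward_max_match` is a hypothetical helper, not the algorithm of any particular tool. Dedicated segmenters (for example jieba for Chinese or MeCab for Japanese) typically combine much larger dictionaries with statistical models.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy segmentation: at each position take the longest dictionary word."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hand-made toy dictionary (assumption for this example only).
TOY_DICTIONARY = {"我", "喜欢", "数据", "挖掘", "数据挖掘"}
sentence = "我喜欢数据挖掘"  # "I like data mining"

print(sentence.split())
# ['我喜欢数据挖掘']  (no spaces, so the whole sentence is one "word")

print(forward_max_match(sentence, TOY_DICTIONARY))
# ['我', '喜欢', '数据挖掘']
```

The greedy strategy fails whenever the longest match is not the intended word, which is exactly the case where the meaning of the sentence is needed to decide.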
Stemming may also pose a problem. For English it is extremely well understood; for other languages it depends on the language.
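As a hedged illustration of why stemming is language-specific, here is a deliberately naive English suffix stripper. `naive_stem` is a toy written for this example and gets many English words wrong; established stemmers such as the Porter stemmer use far more careful rules. The point is only that rules like these are tied to English spelling and do nothing for a Chinese word.

```python
def naive_stem(word):
    """Strip a few common English suffixes (toy rules, often wrong)."""
    for suffix in ("ing", "ed", "s"):
        # Require a few leading characters so short words stay intact.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            stem = word[:-len(suffix)]
            # Undo consonant doubling: "running" -> "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2]:
                stem = stem[:-1]
            return stem
    return word


for w in ["running", "walked", "cats", "挖掘"]:
    print(w, "->", naive_stem(w))
# running -> run
# walked -> walk
# cats -> cat
# 挖掘 -> 挖掘
```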
If you can solve these two problems, you can apply the typical text mining techniques to Asian languages as well.
Source: https://stackoverflow.com/questions/28187036/does-data-mining-support-other-languages-other-than-english